Reliability/DFX Engineer

OpenAI, San Francisco, CA
About The Position

OpenAI’s Hardware organization develops silicon and system-level solutions designed for the unique demands of advanced AI workloads. The team is responsible for building the next generation of AI-native silicon while working closely with software and research partners to co-design hardware tightly integrated with AI models. In addition to delivering production-grade silicon for OpenAI’s supercomputing infrastructure, the team also creates custom design tools and methodologies that accelerate innovation and enable hardware optimized specifically for AI.

We are seeking a highly skilled cross-stack engineer with deep expertise in making ML systems reliable at scale. This hands-on individual contributor will sit within our hardware team and work closely with chip design, platform design, hardware health, and the broader industry ecosystem to architect, implement, and deploy reliable next-generation AI accelerator systems. This engineer will evaluate system and chip architecture holistically, identify high-ROI opportunities to improve reliability and availability across the stack, and translate those opportunities into strategy and silicon features.

Requirements

  • BS with 15+ years, MS with 10+ years, or PhD with 3+ years of relevant industry experience focused on reliability across the chip/platform stack.
  • Hands-on experience with RTL design and DFT is required; physical implementation and/or silicon ATE experience is preferred.
  • Detailed understanding of ML chip and platform architecture and ML workload characteristics is required.
  • Strong fundamentals in reliability modeling, with hands-on skills in empirical data analysis.

Responsibilities

  • Oversee DFX architecture, implementation, and execution in silicon from concept to high-volume deployment, and propose high-ROI features to enhance reliability and fault tolerance.
  • Build system-level reliability models grounded in empirical data to guide organization-wide DFX and reliability strategy.
  • Collaborate with chip and platform architecture/design teams to explore and implement DFX features, including the specification and implementation of digital/mixed-signal IP, firmware/system software, and DFX methodology.
  • Partner with hardware health and platform design teams to continuously improve reliability and fault tolerance in NPI and HVM.
  • Serve as the DFX/reliability champion and evangelist to align the broader industry ecosystem with OpenAI’s requirements and roadmap.