Test Engineer, Hardware

OpenAI, San Francisco, CA
Hybrid

About The Position

OpenAI’s Hardware organization develops silicon and system-level solutions designed for the unique demands of advanced AI workloads. The team is responsible for building the next generation of AI-native silicon while working closely with software and research partners to co-design hardware tightly integrated with AI models. In addition to delivering production-grade silicon for OpenAI’s supercomputing infrastructure, the team also creates custom design tools and methodologies that accelerate innovation and enable hardware optimized specifically for AI.

In this role, as a Hardware Test Engineer, you will work on machine learning/AI hardware system projects to craft solutions for current and future data center deployments. You will bring a strong understanding of hardware system testing, excellent project management skills, and the ability to collaborate across multiple teams to ensure efficient lab operations.

You will be responsible for designing, implementing, and executing comprehensive test plans that ensure the reliability, performance, and scalability of our supercomputing hardware systems. You will develop detailed test plans and methodologies tailored to hardware components, including processors, memory modules, custom accelerators, and interconnects. You will collaborate with hardware design, manufacturing, and firmware teams and with vendors to identify, analyze, and resolve issues affecting hardware, power, thermal, and high-speed interconnects.

You will perform in-depth debugging on the hardware system. This calls for excellent analytical skills to diagnose hardware issues, troubleshoot problems, and propose solutions, along with the ability to interpret complex test data, identify trends, and draw meaningful conclusions. You will work on high-speed links, with a focus on SerDes (Serializer/Deserializer) technology, to assess signal integrity, error rates, and overall link performance.

You will collaborate with the lab manager to maintain the equipment and hardware systems, including oscilloscopes, thermal test chambers, liquid cooling systems, and other measurement devices. You will use advanced diagnostic and measurement tools to capture performance metrics and analyze system behavior. You will develop and maintain automation scripts to streamline testing processes; proficiency in Python, Bash, and test automation frameworks for developing automated test scripts is highly desired. Effective communication is essential for clearly documenting test results, reporting defects, and collaborating with cross-functional teams.

Requirements

  • At least 10 years of industry experience, including experience testing or supporting hardware design teams for data center applications.
  • Proven experience in hardware testing and validation, with hands-on experience with oscilloscopes, power supplies, analyzers, and other test instruments.
  • Strong understanding of electrical, power and thermal testing methodologies.
  • Expertise with the protocols of common interfaces such as SPI, I2C, USB, and DDRx, and experience characterizing and verifying compliance of data center Ethernet interfaces.
  • Hands-on experience with high-speed electrical and thermal test equipment such as oscilloscopes, VNA, TDR, thermal cameras, heat sinks, and test chambers.
  • Excellent analytical, problem-solving, and troubleshooting skills.

Nice To Haves

  • Experience using Cadence Allegro and automation/scripting.
  • Strong bias toward action; won’t take no for an answer.
  • Experience with and good knowledge of system testing, from xPUs and boards through rack level to data center level.
  • Strong intrinsic desire to learn and fill in missing skills; and an equally strong talent for sharing that information clearly and concisely with others.
  • Comfortable with ambiguity and rapidly changing conditions.
  • Proficiency in Python, Bash and test automation frameworks to develop automated test scripts is highly desired.

Responsibilities

  • Designing, implementing, and executing comprehensive test plans that ensure the reliability, performance, and scalability of our supercomputing hardware systems.
  • Developing detailed test plans and methodologies tailored to hardware components, including processors, memory modules, custom accelerators and interconnects.
  • Collaborating with hardware design, manufacturing, firmware teams and vendors to identify, analyze, and resolve issues affecting hardware, power, thermal and high-speed interconnects.
  • Performing in-depth debugging on the hardware system.
  • Interpreting complex test data, identifying trends, and drawing meaningful conclusions.
  • Assessing signal integrity, error rates, and overall link performance.
  • Maintaining the equipment and hardware systems, including oscilloscopes, thermal test chambers, liquid cooling systems, and other measurement devices.
  • Utilizing advanced diagnostic and measurement tools to capture performance metrics and analyze system behavior.
  • Developing and maintaining automation scripts to streamline testing processes.
  • Documenting test results, reporting defects, and collaborating with cross-functional teams.