Product Quality Engineer, AI/ML, Hardware, Google Cloud

Google LLC
Atlanta, GA
Posted 42d ago
$165,000 - $245,000

About The Position

Be part of a team that pushes boundaries, developing custom silicon solutions that power the future of Google's direct-to-consumer products. You'll contribute to the innovation behind products loved by millions worldwide. Your expertise will shape the next generation of hardware experiences, delivering unparalleled performance, efficiency, and integration.

Google Cloud is powered by advanced compute, network, storage, and Artificial Intelligence (AI) platforms, built on one of the world's largest and most sophisticated Technical Infrastructures (TI). The Cloud Supply Chain and Operations (CSCO) teams are responsible for the fast and efficient deployment of this infrastructure. The Global Hardware Quality and Reliability (GHQR) team ensures predictable quality and reliability across all hardware components and systems, including Tensor Processing Unit/Graphics Processing Unit (TPU/GPU) AI platforms and data center infrastructure. This hardware is the foundation of Google Cloud and its AI/ML capabilities, directly contributing to Google's engaged edge.

In this role, you will own the quality and reliability strategy for Google's TPU/GPU-based AI/ML platforms. You will be the quality expert, collaborating with cross-functional partners in Design, Manufacturing, and Operations to embed quality into every product. You will also analyze data, drive root cause analysis, and influence process improvements.

The AI and Infrastructure team is redefining what's possible. We empower Google customers with breakthrough capabilities and insights by delivering AI and Infrastructure at unparalleled scale, efficiency, reliability, and velocity. Our customers include Googlers, Google Cloud customers, and billions of Google users worldwide. We're the driving force behind Google's groundbreaking innovations, empowering the development of our cutting-edge AI models, delivering unparalleled computing power to global services, and providing the essential platforms that enable developers to build the future. From software to hardware, our teams are shaping the future of world-leading hyperscale computing, with key teams working on the development of our TPUs, Vertex AI for Google Cloud, Google Global Networking, Data Center operations, systems research, and much more.

Requirements

  • Bachelor's degree in Electrical Engineering, Computer Engineering, Materials Science, Industrial Engineering, a related technical field, or equivalent practical experience.
  • 10 years of experience in Hardware Quality, Reliability, Product Engineering, or a similar role focused on electronic systems (e.g., servers, accelerators, networking equipment).
  • 8 years of experience leading cross-functional teams to solve technical problems and drive quality improvements.
  • Experience with AI/ML system architectures, including TPU/GPU-based platforms, key components (e.g., high-speed interconnects, power delivery), and characteristic failure modes.

Nice To Haves

  • Master's or PhD degree in Electrical Engineering or a related field.
  • Certified Reliability Engineer (CRE) or Certified Quality Engineer (CQE) certification.
  • 12 years of experience in Quality/Reliability, with substantial direct experience in GPU/TPU or other AI/ML accelerator hardware.
  • Experience in a technical leadership role defining quality strategy and collaborating with executive stakeholders.
  • Experience in a customer-facing quality role managing executive communications and escalations for technical issues, with the ability to travel as required.
  • Excellent hardware and software debugging skills, with experience in analyzing system logs, manufacturing test data, and diagnostic outputs to pinpoint root causes.

Responsibilities

  • Define and own the quality and reliability strategy for TPU/GPU hardware across its entire life-cycle, from design through field support.
  • Lead the resolution of systemic quality issues in manufacturing and the field, driving Root Cause and Corrective Actions (RCCA) using structured methodologies.
  • Collaborate with engineering teams to influence design specifications, qualification plans, and test coverage to ensure product robustness and mitigate early risks.
  • Establish and monitor key quality KPIs (e.g., Average Severity Rate (ASR), Average Failure Rate (AFR)). Analyze manufacturing and field data to develop predictive models and drive improvements in design and processes.
  • Act as the primary point of contact for customer quality, managing escalations and integrating feedback. Oversee quality and corrective actions with suppliers, including Return Material Authorizations (RMA) and Process Change Notifications (PCN) qualification.

Benefits

  • bonus
  • equity
  • benefits

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Industry: Web Search Portals, Libraries, Archives, and Other Information Services
  • Number of Employees: 5,001-10,000 employees
