Senior HW Quality Engineer

Microsoft • Atlanta, GA

About The Position

Microsoft Silicon, Cloud Hardware, and Infrastructure Engineering (SCHIE) is the team behind Microsoft’s expanding cloud infrastructure and is responsible for powering Microsoft’s “Intelligent Cloud” mission. SCHIE delivers the core infrastructure and foundational technologies for more than 200 Microsoft online businesses, including Bing, MSN, Office 365, Xbox Live, Teams, OneDrive, and the Microsoft Azure platform, globally through our server and data center infrastructure, security and compliance, operations, globalization, and manageability solutions. Our focus is on smart growth, high efficiency, and delivering a trusted experience to customers and partners worldwide, and we are looking for a passionate Senior HW Quality Engineer to help achieve that mission. As Microsoft's cloud business continues to grow, the ability to deploy new offerings and hardware infrastructure on time, in high volume, with high quality, and at the lowest cost is of paramount importance. To achieve this goal, the Hardware, Infrastructure Management, and Fundamentals Engineering (HIFE) team is instrumental in defining and delivering operational measures of success for hardware manufacturing and in improving the planning process, quality, delivery, scale, and sustainability related to Microsoft cloud hardware.

Requirements

Doctorate Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 2+ years technical engineering experience; OR Master's Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 4+ years technical engineering experience; OR Bachelor's Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 5+ years technical engineering experience; OR 12+ years relevant technical engineering experience.
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role.
These requirements include but are not limited to the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Nice To Haves

Master’s degree in Electrical Engineering, Computer HW, or System Engineering.
Leadership skills and ability to collaborate with diverse teams and drive a call to action.
10+ years of experience working with modern server architectures and/or their subsystems (including GPU, CPU, AI hardware, and memory) and with methods for root cause analysis and debugging.
Project management skills.

Responsibilities

This role is responsible for end‑to‑end hardware quality engineering for large‑scale AI and GPU‑based data center fleets, with a focus on early failure detection, root cause analysis, and operational quality enablement.
Lead deep‑dive investigations into complex hardware failures across AI/GPU platforms, including no‑boot conditions, GPU throttling, power delivery issues, and BMC‑related issues.
Perform single‑node and rack‑level validation to confirm hardware remediation effectiveness before fleet re‑entry.
Analyze large volumes of hardware telemetry, logs, and diagnostics data to identify systemic failure patterns and repeat offenders.
Define and drive lightweight diagnostics and telemetry workflows that allow technicians and operations teams to validate repairs before ticket closure, reducing repeat failures and rework.
Partner with diagnostics and platform teams to enable out‑of‑band telemetry collection (e.g., BMC SEL, PCIe health indicators) and to integrate diagnostics into existing operational tools and workflows.
Own fleet‑level quality assessments for AI and GPU deployments, including early deployment phases and ongoing production monitoring.
Drive improvements to failure detection latency, root cause attribution accuracy, and preventative quality controls (left‑shifting detection into earlier lifecycle stages).
Coordinate with firmware and electrical engineering teams on corrective actions (e.g., CPLD firmware changes, power quality investigations).
Collaborate with supply chain and spares teams to ensure hardware availability aligns with failure remediation needs.
Work with data center operations leadership to ensure solutions scale globally and align with operational SLAs.
Produce and maintain technical documentation, failure mode and effects analyses (FMEA), and quality playbooks for sustained operational excellence.