Sr. Hardware Reliability Engineer, Infrastructure Reliability & Quality

Amazon•Herndon, VA

108d•Onsite

About The Position

AWS Infrastructure Services is responsible for the design, planning, delivery, and operation of all AWS global infrastructure, ensuring the continuous operation of data centers and the equipment within them. This role involves collaborating with a diverse team of engineers and specialists to maintain high standards of safety and security while optimizing capacity and cost for customers. The position focuses on proactively identifying, assessing, and mitigating reliability risks for datacenter infrastructure equipment, conducting root cause analysis for critical failures, and driving continuous improvements to enhance datacenter availability. The role requires close collaboration with internal and external partners, including suppliers, to define product specifications and manage risk assessment plans. Success in this role depends on being ownership-minded, independent, and results-oriented within a collaborative environment.

Requirements

Experience in industrial or commercial engineering in mission-critical facilities, including but not limited to data centers, power generation, or oil and gas facilities
3+ years of root cause analysis and troubleshooting or problem solving experience
Bachelor's degree in Electrical or Mechanical Engineering, Engineering Technology, or Reliability Engineering
8+ years of Reliability Engineering work experience in a high-reliability industry
3+ years of experience with accelerated life testing, stress analysis, and finite element analysis

Nice To Haves

Experience influencing internal and external stakeholders
Experience in Data Center Engineering Operations, with a deep understanding of electrical and mechanical data center infrastructure
10+ years of work experience in reliability risk identification and assessment from component to system level, applying analytical, experimental, and statistical approaches to evaluate product design and manufacturing quality/reliability levels
Experience applying proactive, cost-effective reliability approaches throughout the product design, manufacture, and deployment stages
Proven experience working with external design and manufacturing supply chain partners
Ph.D. in Mechanical Engineering, Materials Science, Physics, or equivalent

Responsibilities

Drive DFR (Design for Reliability) methodology to proactively design reliability into new product designs
Drive reliability/quality qualification of third-party critical infrastructure equipment for use in AWS data centers
Oversee factory and site testing of third-party equipment in all LLE categories (liquid cooling, generator, chiller, air handler, etc.)
Guide and support Root Cause Analysis of field failures performed by internal teams, the OEM, and external laboratories; validate conclusions and ensure the highest standards are used in testing and remediation
Make recommendations about AWS infrastructure maintenance and equipment replacement based on reliability data
Provide feedback to sourcing/procurement teams for evaluation of vendor performance
Analyze internal reliability data and create metrics to drive the highest reliability at the lowest cost
Support DFMEAs on an as-needed basis
Develop end-of-life strategy for critical infrastructure equipment

Benefits

health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance with an option for Supplemental Life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage)
401(k) matching
paid time off
parental leave
sign-on payments
restricted stock units (RSUs)
