Sr. Hardware Reliability Engineer, Infrastructure Reliability & Quality

Amazon•Herndon, VA

79d•Onsite

About The Position

AWS Infrastructure Services is responsible for the design, planning, delivery, and operation of all AWS global infrastructure, ensuring the continuous operation of data centers and the equipment within them. The team supports all AWS data centers, including servers, storage, networking, power, and cooling systems. They tackle complex challenges in a dynamic environment and are seeking talented individuals to join their diverse team of engineers, specialists, and managers. The role involves collaborating across AWS to maintain high standards of safety and security while delivering scalable and cost-effective infrastructure. The team fosters an inclusive culture that encourages bold ideas and empowers employees to see them through to completion.

Requirements

Experience in industrial or commercial engineering in mission-critical facilities, including but not limited to data centers, power generation, or oil and gas facilities.
Bachelor's or Master's degree in Reliability Engineering, Physics, Mechanical or Materials Engineering, or a related field.
8+ years of Reliability Engineering work experience in a high reliability industry.
3+ years of experience with accelerated life testing, stress analysis, and finite element analysis.

Nice To Haves

10+ years of work experience in reliability risk identification and assessment from component to system level applying analytical, experimental and statistical approaches to evaluate product design and manufacture quality/reliability levels.
Experience applying proactive, cost-effective reliability approaches throughout product design, manufacture, and deployment stages.
Proven experience in working with external design and manufacturing supply chain partners.
Excellent verbal and written communication skills.
Ability to travel within the US and internationally.

Responsibilities

Proactively driving the reliability risk identification, assessment, and mitigation for datacenter infrastructure equipment (e.g., Air Handling Units, LV Generators, MV Transformers, Chillers).
Performing root cause analysis of critical equipment failures and driving continuous improvements to enhance datacenter availability for AWS customers.
Working closely with internal and external partners, including suppliers, to drive product specification, risk identification plans, and execution.
Using a Physics-of-Failure based approach to develop and implement analytical and empirical methods for product quality/reliability risk identification and assessment during design, manufacturing, and deployment stages.
Driving AWS application-specific requirements for lifecycle environmental and operational stress-driven risk analysis (thermal, electrical, chemical, mechanical) to identify overstress and fatigue-related product weaknesses.
Evaluating product design quality/reliability risks and assessing electronics manufacturing process-related quality/reliability issues.
Utilizing knowledge of statistical techniques and models to analyze test and field data.
Driving critical component identification, vendor selection, and qualification requirements.
Establishing critical to quality and reliability metrics using knowledge of electronic component production process capability and system-level performance requirements.
Developing datacenter system-level reliability models and related reliability quantification and risk analysis for datacenter configuration optimization.
Monitoring product performance in the field during the sustaining stage, driving root cause analysis of critical failures, and implementing corrective and preventive actions.
Driving effective vendor auditing and quarterly review processes to foster continuous improvements in datacenter availability.
Driving DFR (Design for Reliability) methodology to proactively design-in reliability in New Product Designs.
Driving reliability/quality qualification of third-party critical infrastructure equipment for use in AWS data centers.
Overseeing factory and site testing of third-party equipment in all LLE categories (liquid cooling, generators, chillers, air handlers, etc.).
Guiding and supporting Root Cause Analysis of field failures performed by internal teams, the OEM, and external laboratories, validating conclusions and ensuring the highest standards in testing and remediation.
Making recommendations about AWS infrastructure maintenance and equipment replacement based on reliability data.
Providing feedback to sourcing/procurement teams for vendor performance evaluation.
Analyzing internal reliability data and creating metrics to drive the highest reliability at the lowest cost.
Supporting DFMEAs on an as-needed basis.
Developing end-of-life strategies for critical infrastructure equipment.

Benefits

Health insurance (medical, dental, vision, prescription), Basic Life & AD&D insurance with an option for supplemental life plans, EAP, mental health support, medical advice line, Flexible Spending Accounts, and adoption and surrogacy reimbursement coverage
401(k) matching
Paid time off
Parental leave
Sign-on payments
Restricted stock units (RSUs)
