Sr. Infrastructure Reliability Engineer, Infrastructure Reliability & Quality

Amazon
Herndon, VA
$136,600 - $184,800
Onsite

About The Position

AWS Infrastructure Services is responsible for the design, planning, delivery, and operation of all AWS global infrastructure, ensuring the continuous availability of cloud services for customers. This role involves working with a diverse team of engineers and specialists to maintain high standards of safety, security, and cost-efficiency. The Senior Infrastructure Reliability Engineer will proactively identify, assess, and mitigate reliability risks for datacenter infrastructure equipment, conduct root cause analysis for critical equipment failures, and drive continuous improvements to enhance datacenter availability. Collaboration with internal and external partners, including suppliers, is key to defining product specifications and risk management plans. The ideal candidate will be ownership-minded, independent, action-oriented, and results-driven, thriving in a collaborative environment.

Requirements

  • 6+ years of industrial or commercial engineering experience in mission-critical facilities, including but not limited to data centers, power generation, or oil and gas facilities
  • Knowledge of critical data center mechanical and electrical equipment
  • Experience in data center design, construction, operations, or facility maintenance
  • Bachelor's degree in Electrical Engineering, Mechanical Engineering, or a related field
  • Experience using a Physics-of-Failure based approach to develop and implement analytical and empirical methods for product quality/reliability risk identification and assessment during the product design, manufacture, and deployment stages.
  • Ability to drive AWS application-specific requirements when carrying out lifecycle environmental and operational stress-driven risk analysis (thermal, electrical, chemical, and mechanical stresses) to identify overstress and fatigue-related product weaknesses.
  • Ability to evaluate product design quality/reliability risks, along with the skills and experience to assess quality/reliability issues related to electronics manufacturing processes.
  • Knowledge of statistical techniques and models for analyzing test and field data.
  • Familiarity with system reliability engineering tools, such as reliability block diagrams, statistical modeling, and data analytics.
  • Strong problem analysis and problem-solving skills.
  • Strong communication and vendor management skills.

Nice To Haves

  • Experience carrying design concepts through exploration, development, and into deployment or mass production
  • Experience reading, interpreting, and creating construction drawings, specifications, and submittal documents
  • Master's degree in Reliability Engineering, Physics, Electrical Engineering, Mechanical Engineering, Materials Engineering, or a related field
  • 8+ years of work experience in reliability risk identification and assessment from component to system level, applying analytical, experimental, and statistical approaches to evaluate product design and manufacturing quality/reliability levels
  • Experience applying proactive, cost-effective reliability approaches throughout the product design, manufacture, and deployment stages.
  • Proven track record of success in product reliability leadership, business negotiations and program management.

Responsibilities

  • Proactively driving the reliability risk identification, assessment, and mitigation for datacenter infrastructure equipment (e.g., LV Generator, MV Transformers, LV SWGR, Breakers, UPS, HV Transformers, In-rack Power shelf).
  • Performing root cause analysis of critical equipment failures.
  • Driving continuous improvements to enhance datacenter availability for AWS customers.
  • Working closely with internal and external partners, including suppliers, to drive product specifications, risk identification plans, and their execution.
  • Developing and implementing analytical and empirical approaches for product quality/reliability risk identification and assessment using a Physics-of-Failure based approach during product design, manufacture, and deployment stages.
  • Driving AWS application-specific requirements in lifecycle environmental and operational stress-driven risk analysis (thermal, electrical, chemical, mechanical stresses) to identify overstress and fatigue-related product weaknesses.
  • Assessing electronics manufacture process-related quality/reliability issues.
  • Analyzing test and field data using statistical techniques and models.
  • Driving critical component identification, vendor selection, and qualification requirements.
  • Establishing critical-to-quality and reliability metrics using knowledge of process capability for electronic component production and system-level performance requirements.
  • Developing datacenter system-level reliability models and related reliability quantification and risk analysis for datacenter configuration optimization.
  • Monitoring product performance in the field during the sustaining stage.
  • Driving root cause analysis of critical failures and associated corrective and preventive actions.
  • Driving effective vendor auditing and quarterly review processes to improve datacenter availability.
  • Traveling within the US and internationally.

Benefits

  • Health insurance (medical, dental, vision, prescription), Basic Life & AD&D insurance with the option for Supplemental Life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, and Adoption and Surrogacy Reimbursement coverage
  • 401(k) matching
  • Paid time off
  • Parental leave