About The Position

AWS Infrastructure Services is responsible for the design, planning, delivery, and operation of all AWS global infrastructure, ensuring the continuous operation of data centers and the equipment within them. This role involves collaborating with a diverse team of engineers and specialists to maintain high standards of safety and security while optimizing capacity and cost for customers. The position focuses on proactively identifying, assessing, and mitigating reliability risks for datacenter infrastructure equipment, conducting root cause analysis for critical failures, and driving continuous improvements to enhance datacenter availability. The role requires close collaboration with internal and external partners, including suppliers, to define product specifications and manage risk assessment plans. Success in this role depends on being ownership-minded, independent, and results-oriented within a collaborative environment.

Requirements

  • Experience in industrial or commercial engineering in mission-critical facilities, including but not limited to data centers, power generation, or oil and gas facilities
  • 3+ years of root cause analysis and troubleshooting or problem solving experience
  • Bachelor's degree in Electrical or Mechanical Engineering, Engineering Technology, or Reliability Engineering
  • 8+ years of Reliability Engineering work experience in a high-reliability industry
  • 3+ years of experience with accelerated life testing, stress analysis, and finite element analysis

Nice To Haves

  • Experience influencing internal and external stakeholders
  • Experience in Data Center Engineering Operations, with a deep understanding of electrical and mechanical data center infrastructure
  • 10+ years of work experience in reliability risk identification and assessment from component to system level applying analytical, experimental and statistical approaches to evaluate product design and manufacture quality/reliability levels
  • Experience applying proactive, cost-effective reliability approaches throughout the product design, manufacture, and deployment stages
  • Proven experience working with external design and manufacturing supply chain partners
  • Ph.D. in Mechanical Engineering, Materials Science, Physics, or equivalent

Responsibilities

  • Drive DFR (Design for Reliability) methodology to proactively design reliability into new products
  • Drive reliability/quality qualification of third-party critical infrastructure equipment for use in AWS data centers
  • Oversee factory and site testing of third-party equipment in all LLE categories (liquid cooling, generator, chiller, air handler, etc.)
  • Guide and support Root Cause Analysis of field failures performed by internal teams, the OEM, and external laboratories. Validate conclusions and ensure the highest standards are used in testing and remediation.
  • Make recommendations about AWS infrastructure maintenance and equipment replacement based on reliability data
  • Provide feedback to sourcing/procurement teams for evaluation of vendor performance
  • Analyze internal reliability data and create metrics to drive the highest reliability at the lowest cost
  • Support DFMEAs on an as-needed basis
  • Develop an end-of-life strategy for critical infrastructure equipment

Benefits

  • health insurance (medical, dental, vision, prescription, basic life and AD&D insurance, optional supplemental life plans, EAP, mental health support, medical advice line, flexible spending accounts, adoption and surrogacy reimbursement coverage)
  • 401(k) matching
  • paid time off
  • parental leave
  • sign-on payments
  • restricted stock units (RSUs)