Sr. Hardware Reliability Engineer, Infrastructure Reliability & Quality

Amazon • Herndon, VA

About The Position

As an Infrastructure Reliability Engineer specializing in Power Generation, you will proactively drive reliability risk identification, assessment, and mitigation for datacenter LV and MV generator systems. You will be responsible for root cause analysis of critical generator failures and will drive continuous improvements to enhance datacenter availability for AWS customers. You will work closely with internal teams and external partners, including generator OEMs, fuel system suppliers, and service providers, to drive key aspects of product specification, risk identification, and execution. To succeed in an open, collaborative environment, you must be ownership-minded, independent, and action- and results-oriented.

The candidate should have experience applying Physics-of-Failure (PoF) based approaches to develop and implement both analytical and empirical methods for generator quality and reliability risk identification across the design, manufacture, and deployment stages. The candidate should be able to drive AWS application-specific requirements for lifecycle environmental and operational stress analysis of generator systems. The candidate should be capable of evaluating generator design quality and reliability risks, and should also have the skills and experience to assess manufacturing-process-related quality issues for generator components and assemblies. Knowledge of statistical techniques and models is required to analyze generator test data and field performance data, identify trends, and drive proactive risk mitigation.

At the component level, the candidate will lead critical component identification for generator systems and define the associated vendor selection and qualification requirements. The candidate will be expected to use knowledge of production process capability and system-level performance requirements to establish critical-to-quality and critical-to-reliability metrics for generator components and subsystems.

At the system level, the candidate will develop datacenter system-level reliability models incorporating generator configurations, including reliability block diagrams, statistical modeling, and data analytics, to support datacenter configuration optimization. The candidate will be expected to be familiar with system reliability engineering tools and methodologies such as FMEA, fault tree analysis, Weibull analysis, and MTBF/MTTR calculations as applied to generator and power generation systems.

During the sustaining stage, the candidate will be responsible for monitoring generator fleet performance in the field and will lead root cause analysis (RCA) for critical failures, driving the associated corrective and preventive actions (CAPA). The candidate will also lead vendor audits and quarterly business reviews with generator OEMs and service partners to drive continuous improvement in generator reliability and datacenter availability.

The successful candidate should be considered an expert in the reliability engineering field and have a proven track record of success not only in generator reliability leadership, but also in business negotiations and program management. Strong skills in problem analysis and solving, communication, and vendor management are necessary. Candidates should also be able to travel within the US and internationally.

AWS Infrastructure Services owns the design, planning, delivery, and operation of all AWS global infrastructure. In other words, we’re the people who keep the cloud running. We support all AWS data centers and all of the servers, storage, networking, power, and cooling equipment that ensure our customers have continual access to the innovation they rely on. We work on the most challenging problems, with thousands of variables impacting the supply chain, and we’re looking for talented people who want to help. You’ll join a diverse team of software, hardware, and network engineers, supply chain specialists, security experts, operations managers, and other vital roles. You’ll collaborate with people across AWS to help us deliver the highest standards for safety and security while providing seemingly infinite capacity at the lowest possible cost for our customers. And you’ll experience an inclusive culture that welcomes bold ideas and empowers you to own them to completion.

About The Team

AWS values diverse experiences. Even if you do not meet all of the qualifications and skills listed in the job description, we encourage you to apply. If your career is just starting, hasn’t followed a traditional path, or includes alternative experiences, don’t let it stop you from applying.

Amazon Web Services (AWS) is the world’s most comprehensive and broadly adopted cloud platform. We pioneered cloud computing and never stopped innovating, which is why customers from the most successful startups to Global 500 companies trust our robust suite of products and services to power their businesses.

Here at AWS, it’s in our nature to learn and be curious. Our employee-led affinity groups foster a culture of inclusion that empowers us to be proud of our differences. Ongoing events and learning experiences, including our Conversations on Race and Ethnicity (CORE) and AmazeCon (gender diversity) conferences, inspire us to never stop embracing our uniqueness.

We’re continuously raising our performance bar as we strive to become Earth’s Best Employer. That’s why you’ll find endless knowledge-sharing, mentorship, and other career-advancing resources here to help you develop into a better-rounded professional.

We value work-life harmony. Achieving success at work should never require sacrifices at home, which is why we strive for flexibility as part of our working culture. When we feel supported in the workplace and at home, there’s nothing we can’t achieve in the cloud.

Requirements

Experience in industrial or commercial engineering in mission-critical facilities, including but not limited to data centers, power generation, or oil and gas facilities
Bachelor’s or Master’s degree in Reliability Engineering, Physics, Electrical, Mechanical, or Materials Engineering, or a related field
8+ years of Reliability Engineering work experience in a high-reliability industry
3+ years of experience with accelerated life testing, stress analysis, and finite element analysis

Nice To Haves

10+ years of work experience in reliability risk identification and assessment from component to system level applying analytical, experimental and statistical approaches to evaluate product design and manufacture quality/reliability levels
Experience in reliability engineering with a focus on power generation equipment, diesel or gas generators, or rotating machinery
Experience applying proactive, cost-effective reliability approaches throughout the product design, manufacture, and deployment stages
Proven experience working with external design and manufacturing supply chain partners
Excellent verbal and written communication skills

Responsibilities

Drive Design for Reliability (DFR) methodology to proactively design reliability into new product designs
Drive reliability/quality qualification of third-party critical infrastructure equipment for use in AWS data centers
Oversee factory and site testing of third-party equipment in all LLE categories (liquid cooling, generator, chiller, air handler, etc.)
Guide and support Root Cause Analysis of field failures performed by internal teams, the OEM, and external laboratories. Validate conclusions and ensure the highest standards are used in testing and remediation.
Make recommendations about AWS infrastructure maintenance and equipment replacement based on reliability data
Provide feedback to sourcing/procurement teams for evaluation of vendor performance
Analyze internal reliability data and create metrics to drive the highest reliability at the lowest cost
Support DFMEAs on an as-needed basis
Develop an end-of-life strategy for critical infrastructure equipment

Benefits

health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage)
401(k) matching
paid time off
parental leave
