The Reliability Engineer is accountable for facility infrastructure reliability across mission-critical data center systems (power, cooling, controls). You will design, implement, and continuously improve asset strategies and work management processes to achieve uptime, safety, and cost objectives. Core work includes reliability analytics, PM optimization, MOP/SOP governance, change management, root cause analysis (RCA), and program execution for critical spares, condition monitoring, and lifecycle asset management.

Reliability Strategy & Asset Care
Develop and maintain equipment strategies (criticality, failure modes, maintenance prescriptions) for power and cooling systems.
Own PM quality and audit activities; eliminate ineffective tasks and deploy optimized prescriptions.

Work Management Excellence
Author, review, and govern SOPs/MOPs/EOPs and change packages; ensure adherence through training and approvals.
Partner with site teams to maintain CMMS schedules and O&M plans; lead reliability investigations and corrective actions.

Condition Monitoring & Analytics
Implement oil/coolant analysis, thermography, vibration, and battery monitoring; trend data to preempt failures.

Critical Spares & Lifecycle Management
Establish and maintain critical spares lists and stocking strategies; track gaps and remedial actions.
Support lifecycle asset management processes to guide replacements and capital planning.

RCA & Continuous Improvement
Lead post-incident RCAs and FMEA; publish learnings and update procedures.

People & Certification
Collaborate with CE leaders to uphold operator certification and training standards; mentor technicians on reliability methods.

Job Type
Full-time
Career Level
Mid Level
Number of Employees
251-500 employees