Senior Reliability Engineer

CyrusOne
Midland, MI
Posted 1d ago
$140,000 - $170,000

About The Position

The Senior Reliability Engineer serves as a subject-matter expert and strategic technical authority for infrastructure reliability across a portfolio of mission-critical data center sites. This role leads the design, governance, and continuous improvement of reliability strategies for power, cooling, and control systems, applying advanced engineering judgment, analytics, and risk-based decision-making. The Senior Reliability Engineer independently evaluates complex reliability risks, prioritizes initiatives under uncertainty, and influences operational, maintenance, and capital decisions that materially impact uptime, safety, and lifecycle cost. This role operates with minimal oversight and is expected to shape standards, mentor others, and elevate reliability capability across the organization.

Requirements

  • 10+ years of experience in reliability engineering, maintenance engineering, or facilities engineering within mission-critical environments.
  • Demonstrated leadership of complex, multi-system reliability programs with measurable business impact.
  • Expert-level knowledge of RCM, FMEA, RCA, and maintenance optimization methodologies.
  • Deep technical understanding of mission-critical infrastructure, including UPS, generators, switchgear, chillers, cooling towers, CRAH/CRAC, and BMS/EPMS.
  • Proven experience governing SOP/MOP/EOP programs and assessing operational change risk in live environments.
  • Advanced ability to analyze condition-monitoring, CMMS, and operational datasets and convert insights into strategic actions.
  • Proficiency in data analysis and visualization tools (Excel, Power BI, or similar).
  • Ability to apply statistical techniques or reliability modeling to support risk-informed decision-making under uncertainty.
  • Strong executive-level communication skills; able to influence senior leaders and defend technical positions.
  • Bachelor’s degree in Mechanical, Electrical, or Industrial Engineering (or equivalent experience).

Nice To Haves

  • Experience designing and scaling enterprise critical spares and lifecycle asset management programs.
  • Hands-on experience with predictive analytics, failure modeling, or reliability simulations.
  • Proficiency with Python, R, or similar tools for advanced reliability analytics.
  • Working knowledge of SQL or other data query languages.
  • Strong familiarity with NFPA, IEEE, ASHRAE, and other relevant codes and standards.
  • Experience presenting reliability risk, capital tradeoffs, and investment recommendations to executive audiences.
  • CMRP, CRE, or similar advanced reliability or maintenance certification.

Responsibilities

Enterprise Reliability Strategy & Asset Care

  • Architect and govern portfolio-level, risk-based asset strategies for mission-critical power and cooling infrastructure.
  • Apply advanced RCM principles to define maintenance and inspection strategies aligned to failure risk, system criticality, and redundancy posture.
  • Evaluate and balance tradeoffs between maintenance investment, operational risk, spares coverage, redundancy, and capital replacement.
  • Establish and maintain enterprise PM quality standards, including audits, task effectiveness reviews, and elimination of low-value maintenance.

Operational Governance & Change Risk Management

  • Serve as a final technical authority for high-risk SOPs, MOPs, EOPs, and operational change packages.
  • Perform system-level risk assessments for planned work, incidents, and abnormal operating conditions.
  • Guide site teams in CMMS data integrity, work management maturity, and adherence to approved operating procedures.
  • Lead or oversee complex reliability investigations involving multiple systems, teams, or contributing factors.

Advanced Analytics & Condition Monitoring

  • Design and mature predictive condition-monitoring programs across the portfolio (oil analysis, thermography, vibration, battery monitoring, controls analytics).
  • Develop and interpret leading reliability indicators and degradation trends to anticipate failures before impact.
  • Apply statistical analysis, reliability modeling, and engineering judgment to evaluate failure likelihood and consequence.
  • Translate analytical insights into strategic maintenance, operational mitigations, or capital recommendations.

Critical Spares & Lifecycle Strategy

  • Define and govern enterprise critical spares strategies, accounting for supplier risk, lead times, and system exposure.
  • Identify systemic spares gaps and drive remediation plans in partnership with Supply Chain and Operations.
  • Lead lifecycle asset assessments to guide long-range capital planning and replacement prioritization.
  • Provide data-driven input to business cases supporting capital investments and infrastructure upgrades.

Incident Leadership, RCA & Continuous Improvement

  • Lead high-impact post-incident RCAs and FMEAs, ensuring depth of analysis beyond proximate causes.
  • Identify and address latent design, procedural, and organizational contributors to reliability events.
  • Ensure lessons learned result in durable changes to standards, procedures, maintenance strategies, or training.
  • Champion continuous improvement initiatives that measurably reduce risk and failure recurrence across sites.

Technical Leadership & Capability Development

  • Act as a mentor and technical escalation point for Reliability Engineers, site engineers, and CE leaders.
  • Coach teams on reliability methods, risk-based decision-making, and interpretation of condition-monitoring data.
  • Influence and evolve enterprise reliability standards, playbooks, and operating philosophies.
  • Partner with leadership to strengthen operator certification, training rigor, and operational discipline.