Field Reliability Engineer

Cerebras Systems | Sunnyvale, CA
12d | $150,000 - $250,000

About The Position

Quality, reliability, and uptime are foundational to scaling Cerebras systems and their impact. We are looking for engineers passionate about diagnosing complex field failures, extracting insights from large-scale telemetry and service datasets, and partnering across hardware, software, operations, and supply chain teams to improve reliability at fleet scale. This role blends deep engineering domain knowledge with data analytics and reliability statistics to drive continuous improvement across Cerebras’ growing deployed base.

Requirements

  • Bachelor’s degree in Electrical Engineering, Materials Science, Mechanical Engineering, or a related field.
  • 5+ years of industry experience in reliability engineering, hardware quality, or field failure analysis.
  • Strong proficiency in applied statistics and reliability methods (e.g., Weibull/survival analysis modeling, accelerated aging models).
  • Experience applying Weibull analysis and fleet-scale failure modeling to drive reliability priorities and quantify risk.
  • Working knowledge of Python and SQL for data extraction, cleaning, time-series analysis, reliability modeling, and visualization.
  • Demonstrated ability to build structured problem-solving approaches and lead cross-functional teams through complex root-cause investigations.
  • Excellent communication skills, with the ability to distill complex data and engineering concepts into clear, concise insights for technical and executive audiences.

Nice To Haves

  • Physics-of-failure knowledge related to datacenter compute: thermal cycling, solder/interconnect fatigue, power electronics degradation, connector reliability, and cooling system failure modes.
  • Familiarity with the design and manufacturing process for IC packaging, server hardware, and PCBA.
  • Understanding of datacenter operating conditions: airflow, thermal management, power quality, workload variation, and system-level interactions.
  • Experience analyzing large-scale system telemetry, preferably from instrumented hardware fleets.

Responsibilities

  • Use reliability statistics (e.g., Weibull and other parametric/non-parametric survival models) to track the fleet-level performance of Cerebras’ datacenter compute hardware and to identify and address trends and risks.
  • Lead physics-of-failure–based root-cause investigations using telemetry, log data, stress/usage analysis, and engineering intuition.
  • Build and maintain statistical and large-scale data analyses (e.g., event logs, thermal/power telemetry, workload patterns).
  • Develop reliability forecasts to inform design decisions, manufacturing quality, capacity planning, service readiness, and supply chain strategy.
  • Build warranty cost and failure-forecast models by integrating failure rates, usage profiles, reliability statistics, and component risk factors.
  • Analyze real-world stress, workload, thermal, and environmental conditions to refine design requirements, qualification plans, and reliability tests.
  • Partner cross-functionally to prioritize issues, align mitigations, drive corrective actions, and turn learnings into design/process guidelines to prevent issue recurrence.