Failure Analysis Engineer

Hyve Solutions, Fremont, CA

About The Position

We are seeking a highly analytical Failure Analysis Engineer to support the investigation of hardware failures in rack systems, server platforms, and data center infrastructure products. This role is responsible for diagnosing complex electrical, mechanical, thermal, and system-level failures throughout the product lifecycle, including manufacturing, qualification, customer returns, and field reliability.

Requirements

  • Bachelor’s degree in Electrical Engineering, Computer Engineering, Mechanical Engineering, or a related engineering discipline.
  • 3+ years of experience in failure analysis, hardware validation, quality engineering, reliability engineering, or manufacturing engineering.
  • Experience supporting enterprise servers, rack systems, storage platforms, networking equipment, or data center infrastructure.
  • Strong understanding of server architecture, including but not limited to CPUs, GPUs, memory (DDR4/DDR5), PCIe architecture, NVMe storage, Ethernet networking, BMC/IPMI management, and power distribution systems.
  • Experience troubleshooting complex hardware failures at the system and board level.
  • Knowledge of schematic review and hardware debugging techniques.
  • Ability to interpret manufacturing and test logs to identify failure mechanisms.
  • Excellent analytical, communication, and technical documentation skills.

Nice To Haves

  • Strong analytical and troubleshooting skills
  • Cross-functional collaboration
  • Data-driven decision making
  • Technical writing and presentation
  • Continuous improvement mindset
  • Ability to manage multiple high-priority investigations in a fast-paced environment

Responsibilities

  • Perform failure isolation at the component, subsystem, and rack level, and root cause analysis of failures in server systems, rack-level assemblies, storage platforms, networking hardware, and associated components.
  • Investigate failures from manufacturing, system integration, reliability testing, customer returns (RMA), and field deployments.
  • Analyze electrical, mechanical, thermal, and firmware-related failures using structured troubleshooting methodologies.
  • Utilize laboratory equipment including oscilloscopes, digital multimeters, power analyzers, thermal cameras, logic analyzers, X-ray systems, optical microscopes, and environmental test equipment.
  • Conduct board-level debugging of server motherboards, backplanes, power distribution boards (PDBs), power supplies, GPU modules, CPUs, DIMMs, NICs, storage devices, and PCIe components.
  • Work closely with Design Engineering, Manufacturing Engineering, Quality, Reliability, Supplier Quality, Test Engineering, and Operations to identify corrective actions.
  • Lead Root Cause Analysis (RCA) activities using 8D, 5-Why, Fishbone Diagram, Fault Tree Analysis (FTA), and Failure Modes and Effects Analysis (FMEA).
  • Develop and publish detailed failure analysis reports, including technical findings, corrective actions, and preventive recommendations.
  • Support reliability qualification testing, HALT/HASS, thermal validation, vibration testing, and environmental stress testing.
  • Identify recurring failure trends through statistical analysis and recommend design or process improvements.
  • Drive corrective and preventive actions (CAPA) to improve product reliability and manufacturing yield.
  • Collaborate with suppliers to investigate component-level failures and improve incoming material quality.
  • Support customer escalations by providing technical expertise during failure investigations.

Benefits

  • Qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability or protected veteran status.