RAS Validation Lead

Cowboy Space Corp.•Seattle, WA

$150,000 - $225,000•Hybrid

About The Position

Deploying high-performance GPU compute in Low Earth Orbit introduces a fundamentally different fault landscape from that of ground-based datacenter operation. This role sits at the frontier of that problem. When a fault occurs 500 km above Earth, the system must detect it, classify it, contain it, and recover from it autonomously. You will own the end-to-end RAS validation strategy for GPU server systems, working directly with GPU and HBM silicon partners to analyze failures, characterize fault propagation paths, and ensure that detection and recovery mechanisms function correctly. The right candidate combines deep knowledge of processor and memory architecture with hands-on system-level validation experience and the ability to drive partner engagements to resolution. This role is located in San Carlos or Seattle.

Requirements

5+ years of experience in hardware validation, platform reliability engineering, or silicon validation on server-class compute systems.
Deep understanding of CPU and GPU architecture, including memory subsystems (DDR, HBM), cache hierarchies, and interconnect fabrics (PCIe, NVLink, xGMI).
Strong knowledge of RAS concepts: error detection and correction (ECC), fault containment, error propagation, machine check architecture (MCA/MCI), and recovery mechanisms.
Hands-on experience with fault injection methodologies at hardware, firmware, and software levels.
Familiarity with system management interfaces including BMC, IPMI, Redfish, and MCTP/PLDM.
Experience working directly with silicon vendors or ODM partners on hardware failure analysis and RAS gap closure.
Strong scripting skills in Python or equivalent for test automation and log analysis.

Responsibilities

Lead RAS validation strategy and execution for GPU server platforms, including fault injection, detection coverage, and recovery verification.
Partner directly with GPU system designers to analyze hardware failures, review silicon errata, and align on fault handling requirements for DDR, HBM, CPU, and GPU subsystems.
Characterize fault propagation paths from hardware detection through firmware and OS layers, and validate that error signals are correctly classified, logged, and acted upon.
Validate BMC and out-of-band management visibility into hardware health events via IPMI, Redfish, and MCTP/PLDM protocols.
Debug complex failure modes spanning GPU and CPU architecture, memory subsystems, PCIe/NVLink fabric, and system management firmware.
Drive root-cause analysis for RAS failures discovered during validation, and work with partners to inform platform design decisions that affect fault detection and serviceability.
Define RAS coverage metrics and maintain traceability from hardware fault models to test coverage.
Collaborate with firmware, software, and platform teams to validate OS-level error handling, ACPI error interfaces (EINJ, BERT, HEST), and runtime error recovery flows.