High Performance Compute Responsible Engineer

Relativity Space | Long Beach, CA

About The Position

Own the complete storage platform software stack for a space-based data center: custom Linux kernel drivers, OpenZFS pool design, NFS data serving, and automated fault recovery. You will ship a platform that preserves up to a petabyte of mission data through years of radiation exposure, carrying the design from rapid prototyping on commodity hardware through integration and launch.

Requirements

  • 3+ years writing Linux kernel code: hands-on driver development involving PCI/PCIe devices, block storage, or interrupt-driven hardware, with meaningful time spent in kernel space
  • You've debugged with ftrace and crash dumps, not just application-level tooling
  • OS internals are everyday working knowledge: virtual memory, interrupt context constraints, synchronization primitives, and the I/O stack. You can trace a write() from userspace through the block layer to hardware and explain each stage.
  • Storage stack depth beyond raw kernel development:
      • Experience with enterprise storage software (ZFS, copy-on-write filesystems, or RAID systems), high-throughput NFS, or NVMe protocol-level work.
      • Depth in at least one layer of the storage stack, whether that's filesystem internals, block device management, or storage protocol implementation.
  • Architecture fundamentals that driver work demands: DMA coherency, MMIO semantics, PCIe enumeration, and cache behavior. You know what goes wrong when DMA buffer management is off.
  • Applied data structures and systems reasoning. You understand why ZFS uses Merkle trees, how NVMe submission/completion queue ring buffers work, and when to reach for a hash table versus a radix tree in your own code.
  • Failure as a first-class design concern. You write error paths before happy paths. You reason about what breaks after 10,000 iterations over years of operation. You model component failure probability quantitatively rather than by intuition.
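As a rough illustration of the kind of quantitative failure modeling described above, the sketch below computes the survival probability of a parity-protected drive group over a mission window using a simple binomial model. The drive count, parity level, and per-drive failure probability are assumed example values, not mission data.

```python
import math

def p_vdev_survives(n_drives: int, p_drive_fail: float, parity: int) -> float:
    """Probability a parity-protected group of n_drives survives a window:
    it tolerates up to `parity` drive losses, so sum the binomial terms
    for 0..parity failures. Inputs here are illustrative assumptions."""
    return sum(
        math.comb(n_drives, k) * p_drive_fail**k * (1 - p_drive_fail)**(n_drives - k)
        for k in range(parity + 1)
    )

# Assumed example: 8-drive group with double parity, 5% per-drive
# failure probability over the mission window.
print(round(p_vdev_survives(8, 0.05, 2), 6))
```

A model like this makes the topology trade-off explicit: adding parity raises survival probability at the cost of usable capacity, and the numbers can then be checked against fault injection results.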

Nice To Haves

  • Experience in reliability engineering where you've modeled hardware failure rates and designed software recovery around them, whether that's storage firmware, autonomous vehicle data systems, large-scale distributed infrastructure, or embedded platforms.
  • Familiarity with embedded Linux build systems (Yocto or Buildroot) and cross-compilation.
  • Hardware lab comfort: serial consoles, logic analyzers, and willingness to debug PCIe enumeration failures on a prototype board alongside the electrical engineers.

Responsibilities

  • Own the complete storage platform software stack for a space-based data center: custom Linux kernel drivers, OpenZFS pool design, NFS data serving, and automated fault recovery, shipping a platform that preserves up to a petabyte of mission data through years of radiation exposure.
  • Design and implement custom Linux kernel drivers for NVMe fault recovery and GPIO overcurrent protection, working across PCI/PCIe, block layer, and interrupt subsystems to detect and recover from radiation-induced upsets without data loss.
  • Lead the ZFS pool topology architectural decisions by building quantitative reliability models that balance upset probability, resilver risk, and capacity over a 6+ year mission, then validate through fault injection testing.
  • Develop the integration layer between NVMe controller reset and ZFS, ensuring that a drive recovering from a transient fault re-enters the storage pool cleanly, bridging driver-level recovery with filesystem-level fault tolerance.
  • Execute rapid prototyping on commodity hardware, from first boot through sustained 10 Gbps writes with automated fault recovery, de-risking the architecture before committing to the target platform, then carry the design through integration and launch.
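The resilver-risk side of the topology decision can be sketched the same way: model the chance that another upset lands on the surviving drives while a resilver is in flight. The sketch below assumes Poisson-distributed upsets; the rate, drive count, and resilver duration are hypothetical example values.

```python
import math

def p_fault_during_resilver(upset_rate_per_drive_hour: float,
                            surviving_drives: int,
                            resilver_hours: float) -> float:
    """Probability of at least one additional upset across the surviving
    drives during the resilver window, assuming Poisson arrivals.
    All rates are illustrative assumptions, not measured mission data."""
    expected_upsets = upset_rate_per_drive_hour * surviving_drives * resilver_hours
    return 1 - math.exp(-expected_upsets)

# Assumed example: 1e-4 upsets per drive-hour, 7 surviving drives,
# 12-hour resilver window.
print(round(p_fault_during_resilver(1e-4, 7, 12), 6))
```

This is the term that penalizes wide, low-parity layouts: larger drives resilver longer, which widens the window during which a second upset can compromise the pool.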

Benefits

  • Relativity Space offers competitive salary and equity, a generous PTO and sick leave policy, parental leave, an annual learning and development stipend, and more!