Staff Embedded Software Engineer

Relativity Space • Long Beach, CA

About The Position

Relativity Space is building rockets to serve today’s needs and tomorrow’s breakthroughs. The Terran R vehicle will deliver customer payloads to orbit, meeting the growing demand for launch capacity.

This role is within the Interplanetary Sciences Program, established to expand access to scientific exploration across our solar system. The program's mission is to make planetary research faster, more affordable, and more capable than ever before by rethinking how science missions are designed, built, and operated. The storage platform is a foundational element for accelerating scientific discovery: science instruments write data to it, onboard AI reads from it, and communication subsystems downlink that data to Earth.

The engineer in this role will define and build the storage architecture, making foundational decisions regarding redundancy topology, replication strategy, failure domain boundaries, consistency guarantees, and write-lifecycle management. This includes building and testing prototypes on commodity hardware, developing low-level systems code such as storage drivers and filesystem integration, and implementing fault recovery systems that survive radiation upsets across the full mission lifetime. The role involves carrying the design from proof-of-concept on commodity hardware through integration on flight hardware, validating architectural assumptions with fault injection testing.

Responsibilities include owning the redundancy and replication architecture and defining the consistency model, the bound on acceptable data loss during radiation-induced crashes, and the storage platform's contractual guarantees in degraded states. The engineer will also select the filesystem and design the pool architecture, validate the design through quantitative reliability modeling, define the write-endurance budget, design interface contracts with other subsystems, and build fault detection and recovery paths at the hardware boundary.

Requirements

Demonstrated ability to make and defend architectural tradeoffs in writing: design documents, RFCs, or equivalents that other engineers built against.
7+ years of experience designing software systems for high reliability over long operational lifetimes: defining redundancy topology, failure domain boundaries, and degraded-mode behavior.
Track record of reasoning about failure modes before they occur: identifying what breaks, defining impact radius, and designing recovery paths at the system level, not just the component level.
Experience working at or near the storage system boundary in kernel, driver, firmware, storage infrastructure, or equivalent development work where you had to reason about hardware behavior and failure modes.
Experience with storage systems that maintain data integrity under faults (copy-on-write filesystems, log-structured storage, RAID architectures, or replication systems).

Nice To Haves

Familiarity with distributed storage replication models: synchronous vs. asynchronous, quorum systems, chain replication, and opinions about when each is appropriate.
Experience designing storage or data systems that must remain available and consistent across independent failure domains.
Experience defining interface contracts between storage platforms and upstream consumers — databases, data pipelines, application frameworks.
Depth in one or more of: filesystem internals, block layer and device management, storage protocol implementation, or fault-tolerant storage infrastructure.
Strong working knowledge of storage data structures and systems reasoning — Merkle trees, NVMe submission/completion queue ring buffers, hash tables, radix trees.
Hands-on experience at the driver/hardware boundary: DMA coherency, MMIO semantics, PCIe enumeration, and cache behavior.
Experience testing storage systems under fault injection: PCIe/NVMe resets, error storms, low-level tracing (ftrace/perf/bpftrace), and crash dump analysis (kdump/vmcore).

Responsibilities

Define the storage architecture and build it.
Make foundational architectural decisions (redundancy topology, replication strategy, failure domain boundaries, consistency guarantees, write-lifecycle management).
Build and test prototypes on commodity hardware.
Build the low-level systems code: storage drivers, filesystem integration, and fault recovery systems.
Carry the design from proof-of-concept on commodity hardware through integration on flight hardware.
Validate architectural assumptions with fault injection testing.
Own the redundancy and replication architecture across multiple NAS units on two independent hardware strings.
Define the consistency model for cross-string replication.
Define the precise bound on acceptable data loss during a radiation-induced crash.
Define the storage platform's contractual guarantees in every degraded state.
Codify the failure mode matrix that drives implementation decisions.
Select the filesystem and design the pool architecture, confirming or revising the current ZFS baseline and owning the final configuration.
Validate the design through quantitative reliability modeling that balances upset probability, rebuild risk, write endurance, and usable capacity over the full mission.
Define the write-endurance budget for a multi-year operational lifetime.
Design the interface contracts between the storage platform and the science instrument, compute, and communication subsystems.
Build the storage fault detection and recovery path at the hardware boundary (e.g., kernel driver, block layer, or firmware level).
Build automated fault recovery that handles every scenario in the failure mode matrix.
Validate the recovery behavior through sustained fault injection campaigns on the hardware-in-the-loop testbed.