Staff GPU Systems Engineer, Space Computing

Relativity Space•Long Beach, CA

$181,000 - $248,500•Onsite

About The Position

Relativity Space is building rockets to serve today’s needs and tomorrow’s breakthroughs. The Terran R vehicle will deliver customer payloads to orbit, meeting the growing demand for launch capacity.

The Interplanetary Sciences Program was established to expand access to scientific exploration across our solar system. Its mission is to make planetary research faster, more affordable, and more capable than ever before by rethinking how science missions are designed, built, and operated. The program aims to enable scientists to send instruments to distant worlds without decades of development or prohibitive costs. By creating a sustainable model for interplanetary exploration, we are transforming space science from an occasional event into a continuous process of discovery that accelerates knowledge, broadens participation, and inspires the next generation of explorers.

This role will own the GPU compute environment for a space-based data center, including setup, driver integration, container runtime, job scheduling, and performance optimization. The goal is to build the platform that enables onboard AI/ML inference and SAR reprocessing millions of miles from the nearest sysadmin. The role involves profiling and optimizing compute performance across the full stack, building power and thermal-aware compute scheduling, developing compute health monitoring and upset recovery mechanisms, and integrating GPU drivers with the payload Linux image.

Requirements

BS/MS in Computer Science or Electrical Engineering and 5+ years of relevant experience
Hands-on experience with GPU programming and compute frameworks — CUDA, ROCm, or OpenCL — with real performance profiling and optimization work, not just running tutorials
Strong Linux systems administration and performance tuning skills: you've diagnosed I/O bottlenecks, tuned memory management, and understood why a workload isn't hitting expected throughput
Experience with container technologies (Docker, Podman, or lightweight alternatives) and HPC job scheduling concepts
Working proficiency in Python for tooling, scripting, and ML framework integration, with C/C++ skills for performance-critical system components

Nice To Haves

Experience with HPC cluster administration, ML infrastructure, or cloud GPU compute platforms at scale
Deep familiarity with ML framework runtime requirements — PyTorch or TensorFlow deployment, model serving, and inference optimization
Knowledge of GPU compute architectures at the hardware level: CUDA cores, compute units, memory hierarchies, and how they affect real workload performance
Experience with high-throughput data movement and storage I/O optimization — NFS tuning, buffer management, and sustaining multi-gigabit throughput
Background in power-managed computing: duty cycling, thermal throttling, and workload scheduling under variable power constraints
Experience designing checkpoint/restart or fault-tolerant batch processing systems (space experience is not required; similar problems exist in large-scale distributed infrastructure and autonomous systems)

Responsibilities

Own the GPU compute environment for a space-based data center — setup, driver integration, container runtime, job scheduling, and performance optimization — building the platform that enables onboard AI/ML inference and SAR reprocessing millions of miles from the nearest sysadmin
Profile and optimize compute performance across the full stack: GPU utilization, memory bandwidth, I/O throughput, and storage interface performance, squeezing maximum science return from constrained power and thermal budgets that shift between sunlit burst processing and eclipse idle periods
Build power and thermal-aware compute scheduling that orchestrates batch workloads around orbital constraints, coordinating with the storage platform to sustain 10 Gbps data movement between NAS and compute nodes during processing windows
Develop compute health monitoring and upset recovery mechanisms — checkpoint/restart strategies, GPU fault detection, and automated recovery — so a radiation-induced upset means a restarted job, not a lost processing window
Integrate GPU drivers with the payload Linux image in coordination with the Platform RE, manage the container runtime for compute workloads, and ensure the platform reliably runs ML frameworks and SAR processing pipelines maintained by the broader operations team