Staff GPU Systems Engineer, Space Computing

Relativity Space, Long Beach, CA
$181,000 - $248,500, Onsite

About The Position

Relativity Space is building rockets to serve today’s needs and tomorrow’s breakthroughs. The Terran R vehicle will deliver customer payloads to orbit, meeting the growing demand for launch capacity.

The Interplanetary Sciences Program was established to expand access to scientific exploration across our solar system. Its mission is to make planetary research faster, more affordable, and more capable than ever before by rethinking how science missions are designed, built, and operated. The program aims to enable scientists to send instruments to distant worlds without decades of development or prohibitive costs. By creating a sustainable model for interplanetary exploration, we are transforming space science from an occasional event into a continuous process of discovery that accelerates knowledge, broadens participation, and inspires the next generation of explorers.

This role will own the GPU compute environment for a space-based data center, including setup, driver integration, container runtime, job scheduling, and performance optimization. The goal is to build the platform that enables onboard AI/ML inference and SAR reprocessing millions of miles from the nearest sysadmin. The role involves profiling and optimizing compute performance across the full stack, building power and thermal-aware compute scheduling, developing compute health monitoring and upset recovery mechanisms, and integrating GPU drivers with the payload Linux image.

Requirements

  • BS/MS in Computer Science or Electrical Engineering and 5+ years of relevant experience
  • Hands-on experience with GPU programming and compute frameworks — CUDA, ROCm, or OpenCL — with real performance profiling and optimization work, not just running tutorials
  • Strong Linux systems administration and performance tuning skills: you've diagnosed I/O bottlenecks, tuned memory management, and understood why a workload isn't hitting expected throughput
  • Experience with container technologies (Docker, Podman, or lightweight alternatives) and HPC job scheduling concepts
  • Working proficiency in Python for tooling, scripting, and ML framework integration, with C/C++ skills for performance-critical system components

Nice To Haves

  • Experience with HPC cluster administration, ML infrastructure, or cloud GPU compute platforms at scale
  • Deep familiarity with ML framework runtime requirements — PyTorch or TensorFlow deployment, model serving, and inference optimization
  • Knowledge of GPU compute architectures at the hardware level: CUDA cores, compute units, memory hierarchies, and how they affect real workload performance
  • Experience with high-throughput data movement and storage I/O optimization — NFS tuning, buffer management, and sustaining multi-gigabit throughput
  • Background in power-managed computing: duty cycling, thermal throttling, and workload scheduling under variable power constraints
  • Experience designing checkpoint/restart or fault-tolerant batch processing systems (space experience is not required; similar problems exist in large-scale distributed infrastructure and autonomous systems)

Responsibilities

  • Own the GPU compute environment for a space-based data center — setup, driver integration, container runtime, job scheduling, and performance optimization — building the platform that enables onboard AI/ML inference and SAR reprocessing millions of miles from the nearest sysadmin
  • Profile and optimize compute performance across the full stack: GPU utilization, memory bandwidth, I/O throughput, and storage interface performance, squeezing maximum science return from constrained power and thermal budgets that shift between sunlit burst processing and eclipse idle periods
  • Build power and thermal-aware compute scheduling that orchestrates batch workloads around orbital constraints, coordinating with the storage platform to sustain 10 Gbps data movement between NAS and compute nodes during processing windows
  • Develop compute health monitoring and upset recovery mechanisms — checkpoint/restart strategies, GPU fault detection, and automated recovery — so a radiation-induced upset means a restarted job, not a lost processing window
  • Integrate GPU drivers with the payload Linux image in coordination with the Platform RE, manage the container runtime for compute workloads, and ensure the platform reliably runs ML frameworks and SAR processing pipelines maintained by the broader operations team

Benefits

  • Competitive salary and equity
  • Generous PTO and sick leave policy
  • Parental leave
  • Annual learning and development stipend