Principal Architect, System Software - Orbital Data Center

NVIDIA • US, CA

Onsite

About The Position

NVIDIA is seeking a strong technical architect to own end-to-end system software architecture for Space-1 and successor orbital platforms. The role covers the full stack: applications and libraries, the data center stack, BMC and BIOS firmware, manageability and telemetry, the host OS, GPU and CPU drivers, and CUDA. The goal is to deliver a production-ready inference platform that operates reliably in the harsh environment of low-Earth orbit. The architect will collaborate with the orbital hardware system architecture team, drive customer use cases with constellation operators, align architecture with mission requirements, and bring orbital AI products to market.

Requirements

15+ years of relevant experience in server/platform system software, spanning compute libraries, BMC firmware, BIOS, host OS, drivers, and manageability.
BS, MS, or PhD in EE/CS or a related field (or equivalent experience).
Working experience in building AI infrastructure and systems in space.
Proven record of architecting and delivering platform software for large-scale data centers or mission-critical embedded systems.
Strong knowledge of server architecture, data center manageability, and full-stack integration of firmware with OS and accelerator software.
Hands-on experience with data center health management workflows, telemetry, and fault management at scale.
Solid understanding of hardware management interfaces (USB, SMBus/I2C, PCIe) and proficiency with modern management protocols including Redfish, MCTP, and PLDM.
Strong and demonstrable skill in C/C++ and Python.
Experience programming and debugging server platforms, including pre-silicon and platform bring-up environments.
Experience with SCM tools (e.g., Git, Perforce) and project management tools such as Jira.
Excellent written and oral communication skills, a strong work ethic, a strong sense of teamwork, pride in producing quality work, and a commitment to finishing your tasks every single day.
Self-starter who loves finding creative solutions to complicated problems and is hands-on with coding.

Nice To Haves

Experience architecting platform software for space, aerospace, defense, or other radiation, thermal, and vibration-constrained environments — including SEU/SEFI mitigation, ECC strategy, TID/SEE qualification, and rad-hard design partitioning.
Experience as part of a startup or initiative directly related to space data centers.
Hands-on experience with autonomous, remote, or unreachable data center operations — in-orbit or in-field firmware update, dual-module redundancy, and recovery without physical access.
Hands-on experience with x86 or ARM (Grace/Vera) system architecture and the NVIDIA AI software stack (CUDA, DCGM, DOCA/OFED, GPU drivers, DGX OS).
Familiarity with NSA PHIPs security, post-quantum networking, and aerospace standards (VPX, MIL-STD shock/vibe, NASA EEE-INST-002).
Proven technical leadership driving large, complex programs with 50+ engineers across firmware, OS, driver, and AI stack teams.
Skilled in reviewing hardware schematics and PCB layouts for debugging, design verification, and collaboration with hardware engineers.

Responsibilities

Own the system architecture for the inference stack and other applications running on this class of products, and make it resilient to faults encountered in space.
Co-architect with the orbital hardware system architecture team to define interfaces, partitioning, and trade-offs across silicon, board, firmware, OS, and AI workload layers for 5-year LEO missions.
Own end-to-end system software architecture for Space-1 and successor Orbital Data Center modules — covering data center stack, BMC firmware, BIOS, host OS, GPU/CPU drivers, CUDA, DCGM, and manageability telemetry as a single integrated stack.
Define the manageability architecture for an unreachable, autonomous data center: remote bring-up, in-orbit firmware update, dual-module redundancy, fault containment, recovery from SEU/SEFI events, and telemetry for fleets ranging from tens to millions of nodes.
Architect rad-tolerant system software behaviors — ECC handling, memory scrubbing, latch-up mitigation, deterministic recovery, and graceful degradation through 5 years and up to ~8,000 thermal cycles in dawn–dusk sun-synchronous orbit.
Drive Redfish, MCTP, PLDM, and constellation-level management protocols across BMC, BIOS, and host software so customers can operate orbital fleets with the same tools they use on the ground.
Define the minimum BMC feature set, pin budget, boot architecture (rugged M.2 / VPX-class options), and dual-module redundancy strategy in partnership with platform and mechanical engineering.
Partner with cloud and constellation customers (SpaceX, Blue Origin, Starcloud, Planet, Cowboy Space, and others) to translate mission requirements — orbit, duty cycle, NSA PHIPs security, post-quantum networking (CX9), inference SLAs — into actionable platform software architecture.
Drive reliability and optimization in the system software architecture from an orbital data center viewpoint, including correct operation through eclipse periods and idle-power retention strategies.
Work closely with the bring-up team to resolve issues at Speed of Light from first silicon through first launch.
Own quality, reliability, and telemetry performance of the system software delivered with each ODC module shipped to customers.