NVIDIA's DGX Cloud (DGXC) powers AI for strategic research and product workloads. We are seeking an expert Technical Program Manager (IC5) to lead strategic programs focused on resilience, reliability, and goodput. The role works across multiple teams to improve resilience, service stability, and operational scale, and it guides architectural decisions for the resilience reference architecture. The TPM leads programs spanning DGXC infrastructure, Resilience Tools, and core platform services to deliver fault-tolerant, high-availability training and inference environments at scale.

We are looking for a TPM who is analytical, technically skilled, and comfortable working across cloud infrastructure, software, operations, and data- and research-driven environments. You will work closely with engineering, SRE, operations, and researchers to develop scalable resilience strategies, improve operational performance, and help build open, modular software components and reference stacks for DGX Cloud at scale.
Job Type
Full-time
Career Level
Senior