Director of Capacity Engineering – DGX Cloud

NVIDIA, Santa Clara, CA
Posted 70d ago, $284,000 - $425,500

About The Position

Join the team building the backbone of the world’s most sophisticated AI cloud. NVIDIA DGX Cloud delivers multi-exascale, GPU-accelerated computing on demand. We are looking for a senior engineering leader to own capacity strategy, fleet reliability, and operational excellence as DGX Cloud scales globally. If you thrive with large-scale infrastructure challenges and want to invent the future of AI computing, we’d love to hear from you!

Requirements

  • 12+ years overall in large-scale infrastructure or site-reliability engineering, including 5+ years in senior leadership.
  • Bachelor's or Master's degree in an engineering field, or equivalent experience.
  • Deep understanding of GPU-accelerated compute, including DGX systems, NVLink/NVSwitch fabrics, InfiniBand/Ethernet networking, and high-performance storage.
  • Demonstrated success in capacity planning and fleet consistency across multi-region or multi-cloud environments.
  • Expertise in driver/firmware management (CUDA stack, NCCL, OS/kernel dependencies) and distributed training workloads.
  • Proven track record of delivering against strict availability and performance SLOs at hyperscale.

Nice To Haves

  • Experience with hybrid cloud deployments and hyperscale partnerships.
  • Familiarity with Kubernetes GPU scheduling and AI/ML workload patterns.
  • Track record of influencing hardware/system roadmaps (DGX, Grace Hopper, next-gen GPUs) based on capacity insights.
  • Strong interpersonal skills to align executives, engineers, and partners around ambitious capacity targets.

Responsibilities

  • Lead end-to-end capacity strategy and forecasting for DGX Cloud across regions and cloud partners (Azure, OCI, GCP, etc.).
  • Define and implement golden-image standards for DGX nodes: firmware, CUDA/NVIDIA drivers, NCCL/InfiniBand, NVLink/NVSwitch fabrics.
  • Design and operate automated maintenance and upgrade frameworks with near-zero customer impact, including guardrails, rollback plans, and buffer management.
  • Own service-level objectives (SLOs) for GPU availability, efficiency, and training/inference reliability; drive continuous improvement and root-cause analysis.
  • Guide development of orchestration tools and APIs that integrate with NVIDIA tools and DGX Cloud provisioning systems.
  • Partner with DGX Cloud software, data-center engineering, supply chain, and finance to align capacity, cost, and rollout priorities.
  • Recruit, mentor, and lead an elite team of capacity engineers, SREs, and tooling developers.

Benefits

  • Base salary range is 284,000 USD - 425,500 USD.
  • Eligible for equity and benefits.