About The Position

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology—and amazing people. We are looking for a Principal Software Engineer to join our DGX Cloud team and build the foundational systems that drive NVIDIA’s high-performance GPU infrastructure. You will play a meaningful role in crafting scalable automation solutions, integrating diverse systems, and enabling seamless workflows across global cloud operations. As a Principal Engineer in DGX Cloud, you will be at the pinnacle of technical leadership. You will directly craft the platform that fuels the future of AI and cloud computing.

Requirements

  • 16+ years of progressive industry experience
  • Master's or Bachelor's degree, or equivalent experience defining and shipping complex distributed systems.
  • Deep, hands-on expertise in establishing, operating, and scaling services in a fast-paced, high-reliability environment.
  • Ability to thrive in ambiguous, fast-paced environments by rapidly testing ideas, iterating toward working solutions, and then hardening the winners into reliable, scalable systems.
  • Outstanding proficiency in modern systems programming languages such as Go, Java, or Python.
  • Proven track record of defining, owning, and evolving the architecture of high-scale distributed systems, including advanced patterns for APIs, control planes, and data pipelines.
  • Deep understanding of global cloud infrastructure (AWS, GCP, Azure) and container ecosystems (Docker, Kubernetes).
  • Demonstrated ability to drive technical strategy and influence outcomes across organizational boundaries.
  • Outstanding ability to communicate complex technical concepts, drive organizational consensus, and mentor high-performing engineers.

Nice To Haves

  • A history of successfully leading the development and adoption of organization-wide workflow orchestration systems for petabyte-scale infrastructure.
  • Experience in a Principal/Staff+ capacity, delivering measurable improvements in operational efficiency, reliability, and security across a large engineering org.
  • Deep familiarity with the operational and deployment aspects of the NVIDIA AI/ML software stack (CUDA, cuDNN, containerization).
  • Patent contributions or a strong publication record in areas related to distributed systems, cloud computing, or infrastructure automation.

Responsibilities

  • Lead the design and development of next-generation APIs, state management, and workflow orchestration systems that automate fleet lifecycle operations at massive scale.
  • Drive technical alignment across dependent systems and partner teams to ensure cohesive integration, clear interfaces, and reliable end-to-end workflows, with a strong focus on delivery.
  • Act as a force multiplier by coaching, mentoring, and empowering senior engineers, elevating technical standards and guidelines across the organization.
  • Maintain a relentless focus on the customer experience and product requirements, translating deep technical insight into high-impact business solutions.
  • Partner with executive and engineering leadership to codify critical business processes into self-measuring, scalable, and operationally consistent platforms, drastically reducing manual toil.
  • Direct the integration strategy for key technologies, including common AI schedulers (e.g., Kubernetes, Slurm) and innovative observability systems (e.g., Prometheus, OpenTelemetry, Grafana).

Benefits

  • Highly competitive salaries
  • Comprehensive benefits package
  • Equity