Senior Systems Software Engineer, Kubernetes Scale - DGX Cloud

NVIDIA•Santa Clara, CA

Hybrid

About The Position

The DGX Cloud organization at NVIDIA brings together cutting-edge hardware and software innovation to deliver industry-leading accelerated computing for the world’s most ambitious AI workloads. We’re a team of innovative engineers dedicated to solving some of the world’s biggest challenges, constantly driving advancements, and impacting millions of lives worldwide! We are looking for an outstanding Senior Systems Software Engineer with deep experience in distributed systems, open-source technologies such as Kubernetes and containers, and a strong background in systems performance and scalability. The ideal candidate brings broad, end-to-end experience across the stack, from containers and orchestration to cloud platforms, along with the technical depth to dive deep and solve complex, real-world problems. In this pivotal role, you will take on the challenge of scaling AI infrastructure while optimizing total cost of ownership - driving down cost per token to unlock the next generation of AI innovation and AI factories!

Requirements

At least 8 years of experience with a background in Computer Architecture, Networking, Storage systems, Accelerators
Bachelor's/Master's degree in Engineering or equivalent experience (preferably Electrical Engineering, Computer Engineering, or Computer Science)
Expertise in Kubernetes and familiarity with related CNCF projects
Expertise in working with large-scale parallel and distributed accelerator-based systems
Expertise in optimizing performance and AI workloads on large-scale systems
Experience with performance modeling and benchmarking at scale
Proficiency in Golang/Python
Expertise with at least one public CSP infrastructure (for example, GCP, AWS, Azure, or OCI)

Nice To Haves

Strong operational experience with any one of the Kubernetes distributions
Prior experience scaling Kubernetes clusters to ultra-large node and object counts
Demonstrated history of working in the open-source community
Excellent communication and interpersonal abilities
PhD in relevant areas

Responsibilities

Drive deep, end-to-end performance and scale characterization across the DGX Cloud software stack, fearlessly chasing issues from high-level software all the way down to the metal.
Collaborate with AI researchers, developers and customers to develop innovative tests that simulate user workloads through comprehensive end-to-end automation, employing custom-built and innovative open-source tools and frameworks.
Deep dive into performance and scale issues with the intent of discovering their root causes in complex distributed systems.
Design and develop monitoring and reporting tools for performance and scale testing and analysis.
Actively engage with upstream communities to validate performance and scalability early, shaping design and development decisions from the outset.
Triage, debug, and root-cause issues related to operating Kubernetes clusters at ultra-large scale.
Build a high-velocity framework that enables continuous, always-on performance and scale testing through a modern CI/CD pipeline.
Present your work and findings at internal and external venues.

Benefits

NVIDIA offers highly competitive salaries and a comprehensive benefits package.
As you plan your future, see what we can offer to you and your family at www.nvidiabenefits.com/
You will also be eligible for equity and benefits.
