About The Position

The DGX Cloud organization at NVIDIA brings together cutting-edge hardware and software innovation to deliver industry-leading accelerated computing for the world’s most ambitious AI workloads. We’re a team of innovative engineers dedicated to solving some of the world’s biggest challenges, constantly driving advancements, and impacting millions of lives worldwide!

We are looking for an outstanding Senior Systems Software Engineer with deep experience in distributed systems, open-source technologies such as Kubernetes and containers, and a strong background in systems performance and scalability. The ideal candidate brings broad, end-to-end experience across the stack, from containers and orchestration to cloud platforms, along with the technical depth to dive deep and solve complex, real-world problems. In this pivotal role, you will take on the challenge of scaling AI infrastructure while optimizing total cost of ownership - driving down cost per token to unlock the next generation of AI innovation and AI factories!

Requirements

  • At least 8 years of experience with a background in Computer Architecture, Networking, Storage Systems, and Accelerators
  • Bachelor's/Master's degree in Engineering or equivalent experience (preferably Electrical Engineering, Computer Engineering, or Computer Science)
  • Expertise in Kubernetes and familiarity with related CNCF projects
  • Expertise in working with large-scale parallel and distributed accelerator-based systems
  • Expertise in optimizing performance and AI workloads on large-scale systems
  • Experience with performance modeling and benchmarking at scale
  • Proficiency in Golang/Python
  • Expertise with at least one public CSP infrastructure (for example, GCP, AWS, Azure, or OCI)

Nice To Haves

  • Strong operational experience with at least one Kubernetes distribution
  • Prior experience scaling Kubernetes clusters to ultra-large node and object counts
  • Demonstrated history of working in the open-source community
  • Excellent communication and interpersonal abilities
  • PhD in relevant areas

Responsibilities

  • Drive deep, end-to-end performance and scale characterization across the DGX Cloud software stack, fearlessly chasing issues from high-level software all the way down to the metal.
  • Collaborate with AI researchers, developers and customers to develop innovative tests that simulate user workloads through comprehensive end-to-end automation, employing custom-built and innovative open-source tools and frameworks.
  • Deep dive into performance and scale issues with the intent of discovering their root causes in complex distributed systems.
  • Design and develop monitoring and reporting tools for performance and scale testing and analysis.
  • Actively engage with upstream communities to validate performance and scalability early, shaping design and development decisions from the outset.
  • Triage, debug, and root-cause issues related to operating Kubernetes clusters at ultra-large scale.
  • Build a high-velocity framework that enables continuous, always-on performance and scale testing through a modern CI/CD pipeline.
  • Present your work and findings at internal and external venues.

Benefits

  • NVIDIA offers highly competitive salaries and a comprehensive benefits package.
  • As you plan your future, see what we can offer you and your family at www.nvidiabenefits.com/
  • You will also be eligible for equity and benefits.