About The Position

The NVIDIA DGX Cloud organization is seeking passionate software support engineers to partner closely with internal customers and provide support on internal platforms. This role requires a deep understanding of customer needs and application functionality, hands-on troubleshooting, and the ability to create documentation that empowers users to troubleshoot on their own in an ambiguous and fast-moving environment. The support provided will enhance user experience and help shape the platform. Candidates are expected to have knowledge of supporting cloud-based deployments across compute, storage, and networking environments.

NVIDIA is a leader in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization, with the GPU at the heart of its products and services. The company's work in AI and digital twins is transforming industries and profoundly impacting society. NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer.

Requirements

  • BS/MS degree in Computer Science or a related area (or equivalent experience)
  • 2+ years of experience supporting distributed software systems and end-user software platforms, plus experience with Linux
  • Experience with Kubernetes, AWS, Azure, OCI, and GCP
  • Background in infrastructure, networking, storage, and DevOps scripting/tooling
  • Understanding of data storage technologies (databases, file, block, blob)
  • Customer service/support experience
  • Willingness to work up and down the stack as well as across multiple teams
  • Strong troubleshooting and communication skills

Nice To Haves

  • Experience with MLOps workflows or ML infrastructure
  • Familiarity with GPU workloads or distributed training systems
  • Previous SLURM or HPC experience
  • Strong drive to work with internal customers and make them successful
  • A drive to improve processes, backed by strong organizational skills

Responsibilities

  • Partner with multiple internal teams to provide Tier 1 support for complex cloud platforms
  • Define and improve operational workflows (runbooks, escalation paths, support processes)
  • Triage customer issues, investigate their root causes, and escalate as needed
  • File bugs and report issues while working closely with the Site Reliability team
  • Build tooling to improve the customer support process and visibility
  • Deeply understand user workloads and use cases
  • Partner with multiple internal teams to give feedback to engineering teams and develop solutions to aid in their success
  • Be part of an on-call rotation to support production systems

Benefits

  • Equity
  • Benefits