Senior Engineer, Performance - Cloud Software

Nvidia | Santa Clara, CA
104d | $144,000 - $230,000 | Remote

About The Position

Widely considered one of the technology world's most desirable employers, NVIDIA leads the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing (HPC), and Visualization. DGX Cloud provides serverless generative AI infrastructure to the world, enabling anyone to use NVIDIA's AI supercomputer technologies. The DGX Cloud engineering team's mission is to ensure that our customers receive timely, quality-assured releases. We are seeking a Performance Engineer who is proficient in performance and scalability testing and in identifying limitations across the Kubernetes (K8s) and application stack using industry-standard tools and telemetry. If you excel at problem-solving, can think creatively on your feet, and enjoy working in a distributed team setting, we would love to have you join us!

Requirements

  • Bachelor's or Master's degree in Computer Science, Data Science, or a related field (or equivalent experience)
  • 5+ years in software engineering with a strong track record in performance or scalability of high-scale distributed systems
  • Deep comfort with performance profiling tools and tracing systems
  • Ability to identify performance issues, root-cause problems, and propose potential solutions
  • Experience optimizing performance across one or more layers of the stack (e.g., database, networking, storage, application runtime, GC tuning, Golang internals, GPU utilization)
  • Contributions to observability, benchmarking, or performance-focused infrastructure at scale
  • Strong understanding of OS internals, scheduling, memory management, and IO patterns
  • Demonstrated success navigating ambiguity and aligning stakeholders around performance goals
  • Proficiency with container-based infrastructure (Docker, Kubernetes, Helm)

Nice To Haves

  • Demonstrated ability to handle sophisticated technical environments while meeting or exceeding all security, reliability, scalability, and availability metrics
  • Strong, proven knowledge of modern architectures at scale

Responsibilities

  • Analyze and optimize performance across application, middleware, runtime, and infrastructure layers—networking, storage, GPU utilization, and beyond
  • Develop tooling and metrics that provide deep observability into system performance
  • Collaborate closely with infra, platform, runtime, and product teams to identify key performance goals and drive systemic improvements
  • Lead investigations into high-impact performance regressions or scalability issues in production
  • Influence architecture and design decisions to prioritize latency, throughput, and efficiency at scale
  • Drive performance testing strategies and help define SLAs/SLOs around latency and throughput for critical systems

Benefits

  • Competitive salaries
  • Generous benefits package
  • Equity eligibility

What This Job Offers

Job Type: Full-time
Career Level: Senior
Industry: Computer and Electronic Product Manufacturing
Education Level: Bachelor's degree
