About The Position

We're looking for a Principal Engineer to join our CSP Engagements team as the technical focal point for end-to-end performance, working directly with engineering teams of key CSP/hyperscale customers to ensure they achieve their performance targets on NVIDIA platforms. In this role, you will augment NVIDIA's performance and benchmark teams with a dedicated CSP-facing focus. You will drive work streams with CSP engineering teams to build a shared understanding of platform performance characteristics, gather their workload-specific feedback and incorporate it into NVIDIA's optimization priorities, and validate that performance targets are met in customer-representative configurations. Your cross-CSP visibility will enable you to identify patterns and drive systemic improvements in documentation, configuration guidance, and tooling.

Requirements

  • 15+ years of experience in systems performance engineering, ideally in GPU/HPC/ML infrastructure
  • BS or MS in Computer Science, Computer Engineering, or related field (or equivalent experience)
  • Proficiency in GPU workload profiling: Nsight Systems, Nsight Compute, DCGM metrics, or equivalent instrumentation
  • Understanding of distributed training performance dynamics: computation/communication overlap, pipeline bubbles, memory bandwidth utilization, collective efficiency
  • Statistical methods for performance analysis: regression detection, confidence intervals, A/B comparison at scale
  • Understanding of how the full software stack impacts performance: driver overhead, collective algorithm selection, memory allocation, scheduling, firmware power management
  • Strong data analysis and visualization skills (Python, pandas, dashboards)
  • Customer obsession — genuine passion for understanding why customers aren't achieving expected performance and driving solutions
  • Ability to communicate performance findings to both deep technical audiences and executive leadership
  • Demonstrated success influencing multiple engineering teams to prioritize performance improvements

Nice To Haves

  • Experience profiling and optimizing distributed training at 1000+ GPU scale (Megatron-LM, DeepSpeed, FSDP)
  • Background in ML infrastructure performance at a CSP/hyperscaler
  • Familiarity with NVIDIA platforms (DGX, HGX, NVLink topology) and profiling tools
  • Experience building automated performance regression detection systems for production environments
  • Understanding of inference workload performance dynamics (vLLM, TensorRT-LLM, SGLang, continuous batching)

Responsibilities

  • Drive performance characterization work streams with engineering teams of key CSP/hyperscale customers — ensuring they understand platform performance expectations, profiling methodology, and tuning options for their specific workloads
  • Gather and synthesize CSP performance feedback — identify gaps between expected and actual throughput, and champion optimization priorities back into NVIDIA's CUDA, NCCL, driver, and firmware teams
  • Ensure key open-source performance and stress tools (e.g., STREAM, GPU Burn, GPU BLAST) are updated and validated for the latest NVIDIA rack-scale systems, GPU architectures, and CPU platforms — so customers and internal teams have reliable baseline measurements from day one
  • Work closely with CSPs to ensure their own performance and validation tooling reflects the latest GPU capabilities, memory hierarchy changes, and platform-specific tuning parameters
  • Conduct cross-CSP performance comparison and pattern analysis — identify configuration, software, or workload differences that explain performance gaps between deployments
  • Collaborate with CSPs to ensure performance-related integration work (profiling infrastructure, benchmark harnesses, config validation) is ready ahead of deployment milestones
  • Define test strategies and tooling requirements for performance validation — both for NVIDIA internal certification and customer acceptance

Benefits

  • Equity
  • Benefits