About The Position

About Us: The AV Infrastructure org provides developer environments, cloud infrastructure, and ML/AI GPU platforms that help AV research and development teams at GM build, test, and run faster.

The Role: GM is looking for a Senior Performance Engineer to join the AV Capacity and Performance Engineering (AVCPE) team in the AV Infrastructure org. The AVCPE team's mission is to provide input into large-scale ML infrastructure strategy, advise on key decisions affecting our cloud budget, identify and execute optimization projects, and provide capacity planning and engineering expertise to support GM’s efforts in developing autonomous vehicles (AV).

Requirements

  • Experience: 5+ years of professional experience in high-scale infrastructure or ML systems.
  • Education: Bachelor’s Degree in Computer Science, a related technical field, or equivalent practical experience.
  • Software Proficiency: Expert-level coding skills in Python and the ability to architect/debug within the PyTorch ecosystem.
  • Systems Engineering: Proven track record of resolving performance issues within large-scale distributed production environments.
  • Architectural Knowledge: Deep understanding of distributed systems, specifically modern ML system design and high-performance computing (HPC).
  • Containerization: Hands-on experience with Kubernetes for orchestrating complex workloads.
  • GPU Monitoring: Technical proficiency with Nvidia DCGM, nvidia-smi, and Grafana for real-time telemetry and observability.
  • Cloud Platforms: Extensive experience working within major cloud ecosystems (AWS, GCP, or Azure).

Nice To Haves

  • Advanced Experience: 8+ years of relevant industry experience.
  • Hardware Expertise: Working knowledge of enterprise-grade Nvidia GPU architectures, including H100, B200, and GB200.
  • Model Deployment: Experience deploying and scaling open-source models via the Hugging Face ecosystem.
  • Data Analytics: Proficiency in BigQuery for large-scale data analysis and reporting.
  • Profiling Tools: Practical experience using Nvidia Nsight Systems and Nsight Compute for kernel-level performance tuning.
  • Soft Skills: Strong technical communication skills with the ability to translate complex infrastructure needs into actionable business insights.

Responsibilities

  • Strategic Infrastructure Development: Adopt and run AV models to support GM’s long-term GPU system strategy and "evergreen" infrastructure roadmap.
  • Performance Optimization: Conduct deep-dive analyses of production workloads to identify bottlenecks and propose high-impact optimization strategies.
  • Cross-Functional Collaboration: Partner with AI/ML Research, Infrastructure Engineering, and Cloud Vendors to spearhead projects that enhance engineering velocity and cost-efficiency.
  • Proactive System Scaling: Identify opportunities for architectural improvements to ensure the scalability and reliability of large-scale ML training and inference environments.

Benefits

  • GM offers a variety of health and wellbeing benefit programs.
  • Benefit options include medical, dental, vision, Health Savings Account, Flexible Spending Accounts, retirement savings plan, sickness and accident benefits, life insurance, paid vacation & holidays, tuition assistance programs, employee assistance program, GM vehicle discounts and more.