General Motors • Posted 4 months ago
$134,000 - $235,900/Yr
Full-time • Senior
Remote • Wichita, KS
Transportation Equipment Manufacturing

We are seeking an experienced engineer in ML Training Infrastructure with a strong ability to execute hands-on technical work. In this role, you will be responsible for designing and building scalable, reliable, and high-performance AI/ML platform infrastructure to support advanced AI research and model development initiatives. As a Senior ML System Engineer, you will collaborate closely with machine learning engineers, research scientists, and other partners to develop state-of-the-art AI solutions that enable the future of intelligent driving technologies across General Motors vehicles.

  • Participate in the design and development of a scalable, reliable, high-performance ML framework that supports model training at scale.
  • Participate in analyzing and optimizing model training performance to scale distributed training workflows, maximize resource utilization across heterogeneous hardware environments, and reduce cost.
  • Raise the bar on system observability, debuggability, operational excellence, and user experience.
  • Collaborate with cross-functional teams to integrate new features and technologies into the platform.
  • Bachelor's degree or higher in Computer Science or an equivalent major, or equivalent experience.
  • 5+ years of professional software engineering experience.
  • 2+ years of specialized experience in AI/ML infrastructure, e.g., enabling distributed training for scaling large ML models.
  • Strong programming skills in Python, with proficiency in frameworks such as PyTorch (preferred), TensorFlow, or similar.
  • Experience with distributed computing, GPU computing, and cloud environments (AWS, GCP, Azure).
  • Willingness to travel to Sunnyvale, CA as needed.
  • Comfortable working in highly ambiguous and dynamic environments.
  • Self-motivated, with strong execution and a focus on delivering impact.
  • Extensive knowledge of and experience with PyTorch 2.x or later and distributed training frameworks.
  • Experience designing and developing training frameworks that support FSDP, pipeline parallelism, and other scalable approaches to training large foundation models.
  • Experience profiling, analyzing, debugging, and optimizing training and data-loading performance.
  • Experience with Apache Parquet, Apache Arrow, Ray, Ray Data.
  • Strong programming skills in C++.
  • Excellent communication skills to resolve disagreements, build consensus, communicate risks, and give constructive feedback.
  • Medical, dental, vision insurance.
  • Health Savings Account, Flexible Spending Accounts.
  • Retirement savings plan.
  • Sickness and accident benefits.
  • Life insurance.
  • Paid vacation & holidays.
  • Tuition assistance programs.
  • Employee assistance program.
  • GM vehicle discounts.