Principal Machine Learning Engineer

General Motors, Sunnyvale, CA

About The Position

We are seeking a Principal AI Engineer to lead the design and advancement of our AI platform. You will play a key role in shaping the infrastructure that powers large-scale training and cloud inference. This includes accelerating training throughput, scaling multi-modal models, and enabling the next generation of AI-driven driving systems. We're tackling challenges across distributed training, training efficiency, DDP/FSDP, data processing pipelines, and PyTorch model optimization. This is a highly impactful position where your technical leadership will define how we scale AI to achieve autonomy.

Requirements

  • Bachelor’s degree or higher in Computer Science, related field, or equivalent experience.
  • 8+ years of professional software engineering experience.
  • 4+ years of specialized experience in the AI/ML domain (e.g., enabling distributed training for large-scale models).
  • Strong programming skills in Python, with proficiency in frameworks such as PyTorch (preferred) or TensorFlow.
  • Experience with distributed systems, GPU computing, and cloud environments (AWS, GCP, or Azure).
  • Comfortable operating in highly ambiguous and dynamic environments.
  • Willingness to travel to Sunnyvale, CA as needed.

Nice To Haves

  • Proven track record of self-motivation, execution, and delivering impact.
  • Deep expertise with PyTorch 2.x+ and distributed training frameworks.
  • Strong skills in profiling, analyzing, debugging, and optimizing training performance (e.g., avoiding memory fragmentation, applying operation fusion).
  • Proficiency in C++ for performance-critical components.
  • Experience leading cross-functional projects and aligning diverse stakeholders on priorities.

Responsibilities

  • Architect, build, and optimize core AI/ML platform infrastructure to support massive-scale model training.
  • Collaborate with data scientists, ML engineers, and software developers to enable seamless workflows from research to production.
  • Drive efficiency in large-scale distributed training and data processing pipelines.
  • Establish best practices for reliability, scalability, and performance across the AI/ML platform.
  • Provide technical leadership and mentorship, guiding teams on platform design, architecture decisions, and emerging technologies.
  • Partner with cross-functional stakeholders to align platform capabilities with business needs and strategic AI initiatives.