Director of Machine Learning Engineering -- Training and Performance

Advanced Micro Devices, Inc. - San Jose, CA

About The Position

At AMD, our mission is to build great products that accelerate next-generation computing experiences, from AI and data centers to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges, striving for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

AMD is seeking a Director of Machine Learning Engineering to join our Models and Applications organization. In this role, you will define and execute the technical vision for distributed training of large-scale generative AI and recommendation models on AMD GPUs. You'll guide a world-class engineering team focused on scaling AI training efficiency, optimizing model performance, and advancing AMD's leadership in AI systems.

This position blends deep technical expertise with strategic leadership. You will partner closely with research, hardware, and software teams to shape the roadmap for AMD's AI training stack, driving innovation at both the model and application levels and influencing how next-generation AI models are trained and deployed efficiently on AMD platforms.

The ideal candidate is a strategic technical leader with a strong foundation in distributed training and AI infrastructure, coupled with experience building or guiding high-impact ML applications such as recommendation systems and ranking models. You combine visionary thinking with execution excellence, thrive in cross-functional collaboration, and are passionate about scaling AI systems that fully leverage AMD GPU performance across both model and application layers.

Requirements

  • 10+ years in machine learning, distributed systems, or AI infrastructure; 5+ years in technical leadership or management roles.
  • Proven experience building and optimizing distributed training systems for large models.
  • Strong familiarity with ML frameworks (PyTorch, JAX, TensorFlow) and distributed frameworks (TorchTitan, Megatron-LM).
  • Hands-on expertise with LLMs, recommendation systems, or ranking models.
  • Proficiency in Python and C++, including performance profiling, debugging, and large-scale optimization.
  • Experience collaborating across hardware, compiler, and system software layers.
  • Excellent communication, leadership, and problem-solving skills with the ability to influence across organizations and external partners.
  • Master's or Ph.D. in Computer Science, Artificial Intelligence, Machine Learning, or a related field.

Nice To Haves

  • Experience in both model-level and application-level development and optimization.

Responsibilities

  • Strategic Leadership & Vision: Define and drive AMD's distributed training strategy for large-scale generative and recommendation models. Align technical initiatives with broader AI platform goals and business impact.
  • Technical Direction & Innovation: Architect and optimize distributed training pipelines (pre-training, SFT, RL, etc.) for large-scale models. Explore new approaches for efficient training and inference of LLMs and ranking systems.
  • Execution & Delivery: Lead development of high-performance, reliable training pipelines that scale across thousands of GPUs. Ensure world-class efficiency, stability, and model convergence.
  • Cross-Functional Collaboration: Partner with compiler, runtime, system software, and hardware architecture teams to co-design solutions that maximize end-to-end performance.
  • Team Leadership & Development: Build, mentor, and empower a team of expert engineers focused on innovation, collaboration, and technical excellence.
  • Open Source & External Engagement: Drive AMD's engagement in open-source communities through contributions to frameworks such as PyTorch, JAX, TorchTitan, and Megatron-LM. Represent AMD's leadership in AI system design across industry and research communities.
  • Research & Trends: Stay ahead of emerging advances in distributed training, LLMs, recommendation systems, and AI infrastructure - and translate them into scalable engineering practices.

Benefits

  • AMD benefits at a glance.

What This Job Offers

Job Type

Full-time

Career Level

Director

Industry

Computer and Electronic Product Manufacturing

Education Level

Ph.D. or professional degree

Number of Employees

5,001-10,000 employees
