PMTS Software Development Eng.

Advanced Micro Devices, Inc•San Jose, CA

Hybrid

About The Position

At AMD, our mission is to build great products that accelerate next-generation computing experiences, from AI and data centers to PCs, gaming, and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity, and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture. We push the limits of innovation to solve the world’s most important challenges, striving for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

The AI system optimization team at AMD is looking for a specialized Principal-level engineer who is passionate about enabling innovative and efficient Generative AI training and inference at scale. You will be part of a core team of incredibly talented specialists and work on scaling training and inference for the latest Generative AI models.

THE PERSON: The ideal candidate has a deep technical understanding of image/video generation systems, LLM parallelism, and distributed inference frameworks, along with hands-on experience with communication middleware (e.g., NCCL/RCCL, MPI, and RoCE v2). This candidate should have experience training models at scale and be passionate about innovating efficient approaches to enable distributed training and inference at scale on AMD devices.

Why Join Us?

Exciting Opportunities: As a senior member of the team, you will be at the forefront of innovation, working with the latest GenAI models and algorithms. You will have the opportunity to shape the future of AI model training and inference optimizations across a variety of applications.
Talented Team: Join a team of highly skilled industry specialists who are passionate about pushing the boundaries of AI. Collaborate with like-minded professionals and learn from the best in the field.
Cutting-edge Technology: Work with state-of-the-art GenAI algorithms and software, enabling you to stay ahead of the curve and drive advancements in AI model training at scale and in deployment.
Impactful Work: Your contributions will directly influence how cutting-edge GenAI models across the industry are efficiently trained at scale and how inference is deployed to serve millions of customers, making a significant difference in various industries and applications.

Requirements

Strong technical expertise in communication middleware (e.g., NCCL/RCCL and MPI), and familiarity with deep learning frameworks (e.g., PyTorch).
Strong technical expertise in benchmarking and performance optimization of distributed training and inference systems.
Several years of experience in AI, deep learning and related software development.

Nice To Haves

Expertise or publications in one of the following areas is preferred: efficient model architectures, optimized training, innovative parallelism strategies, or communication frameworks.
Experience with Slurm and Kubernetes for managing training and inference jobs on a cluster.
Excellent written, verbal, and presentation skills, and the ability to coordinate internally and externally.

Responsibilities

Propose and apply innovative techniques to support both training and inference, including new communication architectures and parallelism strategies for training on large clusters.
Implement novel, efficient architectures for Generative AI models for training and inference, and showcase their benefits on AMD devices.
Work with open-source frameworks and communities (e.g., PyTorch, SGLang, Hugging Face) to integrate AMD-optimized models and libraries and to publish training recipes.
Collaborate with software and hardware teams to co-optimize end-to-end performance on current and future AMD solutions.
Publish and promote your work within AMD and at external venues.