Annapurna Labs designs silicon and software that accelerates innovation. Customers choose us to create cloud solutions that solve challenges that were unimaginable a short time ago—even yesterday. Our custom chips, accelerators, and software stacks enable us to take on technical challenges that have never been seen before, and deliver results that help our customers change the world.

AWS Neuron is the complete software stack for AWS Trainium (Trn1/Trn2) and AWS Inferentia (Inf1/Inf2), our cloud-scale machine learning accelerators. This role is for a Senior Machine Learning Engineer on the Distributed Training team for AWS Neuron, responsible for the development, enablement, and performance tuning of a wide variety of ML model families, including massive-scale Large Language Models (LLMs) such as GPT-OSS, Qwen, and Llama, as well as Stable Diffusion, Vision Transformers (ViT), and many more. The ML Distributed Training team works side by side with chip architects, compiler engineers, and runtime engineers to create, build, and tune distributed training solutions on Trainium instances.

Experience training these large models with PyTorch is a must, along with familiarity with distributed training strategies such as FSDP (Fully Sharded Data Parallel), pipeline parallelism (PP), and context parallelism. Distributed training libraries such as torchtitan, torchtune, HF RL, DeepSeek, etc. are central to this work, and extending them to Neuron-based systems is key, with a focus on enabling large-scale training. Experience with post-training strategies such as DPO/PPO (e.g., with torchtune) would be an additional strength and aligns with the team's success.
Career Level: Senior
Education Level: Bachelor's degree