Software Engineer I - AI/ML, AWS Neuron Distributed Training

Amazon•Cupertino, CA

2d•$127,100 - $185,000•Onsite

About The Position

Annapurna Labs designs silicon and software that accelerate innovation. Our custom chips, accelerators, and software stacks enable us to tackle unprecedented technical challenges and deliver solutions that help customers change the world. AWS Neuron is the complete software stack powering AWS Trainium (Trn2/Trn3), our cloud-scale machine learning accelerators. We are seeking a Software Engineer to join our ML Distributed Training team. In this role, you will be responsible for the development, enablement, and performance optimization of large-scale ML model training across diverse model families. This includes massive-scale pre-training and post-training of LLMs with dense and Mixture-of-Experts architectures, transformer- and diffusion-based multimodal models, and reinforcement learning workloads. You will work at the intersection of ML research and high-performance systems, collaborating closely with chip architects, compiler engineers, runtime engineers, and AWS solution architects to deliver cost-effective, performant machine learning solutions on AWS Trainium-based systems.

Requirements

Bachelor's degree or above in computer science, computer engineering, or a related field
1+ years of programming experience with at least one software programming language (including academic projects, internships, or research)
Experience with software development practices including code reviews, source control, testing, and build processes
Experience with machine learning concepts and at least one ML framework (PyTorch, JAX, or TensorFlow)

Nice To Haves

Master's degree or above in computer science or equivalent
Experience with large-scale distributed training or LLM workloads
Experience with computer architecture or hardware-software co-optimization
Experience with distributed systems, libraries, or frameworks
Familiarity with end-to-end model training pipelines
Previous internship or research experience in ML infrastructure or systems software

Responsibilities

Contribute to the design and implementation of distributed training solutions for large-scale ML models running on Trainium instances.
Extend and optimize popular distributed training frameworks including FSDP, torchtitan, and Hugging Face libraries for the Neuron ecosystem.
Develop and optimize mixed-precision and low-precision training techniques, working with BF16, FP8, and emerging numerical formats to improve training throughput while maintaining model accuracy and convergence quality.
Implement precision-aware training strategies, loss scaling techniques, and careful gradient management to ensure training stability across reduced precision formats.
Profile, analyze, and tune end-to-end training pipelines to achieve optimal performance on Trainium hardware.
Partner with hardware, compiler, and runtime teams to understand system constraints and unlock new capabilities.
Collaborate with AWS solution architects and customers to support the deployment and optimization of training workloads at scale.

Benefits

health insurance (medical, dental, vision, prescription, basic life and AD&D insurance with optional supplemental life plans, EAP, mental health support, medical advice line, flexible spending accounts, adoption and surrogacy reimbursement coverage)
401(k) matching
paid time off
parental leave
