Software Development Manager, AWS Neuron SDK - Distributed Training

Amazon
Cupertino, CA
$212,700 - $287,700
Onsite

About The Position

AWS Neuron is a software stack for the Annapurna Inferentia and Trainium machine learning accelerators hosted inside AWS EC2 Trn1/2 and Inf1 servers. As the Software Development Manager for the Neuron Distributed Training team, you will work hands-on with a strong team of engineers to help design and optimize ML on Neuron devices. Specifically, you will focus on bringing up a coherent solution across the stack to increase training resiliency for ultra clusters with thousands of nodes.

You will scale and optimize the application stack for LLMs that use multi-modal input and output generation such as text, vision, video, and audio. You will be responsible for the full development life cycle of Distributed Training support for multi-modal transformer models such as MM-Llama3.2, DiT/Pixart, and CLIP. You will develop scalability features and performance optimizations in the Neuron ML Framework components to make Trainium devices first-class citizens for ML acceleration. You will lead the way in ensuring support for key ML functionality in a combined chip/software platform, and ensure the right thing is being built and delivered to customers.

A successful candidate will have an established background in scaling and stabilizing machine learning distributed training components, along with the strong technical ability to deliver on a vertically integrated system stack that consists of a combinatorial matrix of hardware, frameworks, and workflows. Deep expertise in scaling model training across thousands of nodes is a must, along with direct customer-facing experience and a strong motivation to achieve results.

Requirements

  • Knowledge of object-oriented design, data structures, and algorithms
  • Experience (non-internship) in professional software development
  • 3+ years of engineering team management experience
  • 7+ years of experience working directly within engineering teams
  • 3+ years of experience designing or architecting new and existing systems (design patterns, reliability, and scaling)
  • 8+ years of experience leading the definition and development of multi-tier web services
  • Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations
  • Experience partnering with product or program management teams
  • ML/DL work experience, with a focus on GenAI and LLMs
  • Deep expertise in scaling model training across thousands of nodes is a must, along with direct customer-facing experience and a strong motivation to achieve results

Nice To Haves

  • Experience designing and building large-scale systems in a multi-tiered, distributed environment (Service Oriented Architecture)
  • Experience in distributed training on thousands of nodes
  • Experience in communicating with users, other technical teams, and senior leadership to collect requirements, describe software product features, technical designs, and product strategy

Responsibilities

  • Lead efforts to build large-cluster stability support for distributed training into PyTorch and JAX, using XLA and the Neuron compiler and runtime stacks.
  • Tune models for the highest performance and maximum efficiency when running on customers' AWS Trainium Trn2+ servers.
  • Work hands-on with a strong team of engineers to help design and optimize ML on Neuron devices.
  • Focus on bringing up a coherent solution across the stack to increase the training resiliency for ultra clusters with thousands of nodes.
  • Scale and optimize the application stack for LLMs that use multi-modal input and output generation such as text, vision, video, and audio.
  • Be responsible for the full development life cycle of Distributed Training support for multi-modal transformer models such as MM-Llama3.2, DiT/Pixart, and CLIP.
  • Develop scalability features and performance optimizations in the Neuron ML Framework components to make Trainium devices first-class citizens for ML acceleration.
  • Lead the way to ensure support for key ML functionality in a combined chip / software platform.
  • Ensure the right thing is being built and delivered to customers.
  • Recruit, hire, mentor, coach, and manage teams of Software Engineers, helping them improve their skills and become more effective product software engineers.

Benefits

  • sign-on payments
  • restricted stock units (RSUs)
  • health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage)
  • 401(k) matching
  • paid time off
  • parental leave