Sr. Software Engineer - AI/ML, AWS Neuron Distributed Training

Amazon
Seattle, WA
85d
$151,300 - $261,500

About The Position

Annapurna Labs designs silicon and software that accelerates innovation. Customers choose us to create cloud solutions that solve challenges that were unimaginable a short time ago, even yesterday. Our custom chips, accelerators, and software stacks enable us to take on technical challenges that have never been seen before, and deliver results that help our customers change the world.

AWS Neuron is the complete software stack for AWS Trainium (Trn1/Trn2) and Inferentia (Inf1/Inf2), our cloud-scale machine learning accelerators. This role is for a Senior Machine Learning Engineer on the Distributed Training team for AWS Neuron, responsible for development, enablement, and performance tuning of a wide variety of ML model families, including massive-scale Large Language Models (LLMs) such as GPT-OSS, Qwen, and Llama, as well as Stable Diffusion, Vision Transformers (ViT), and many more. The ML Distributed Training team works side by side with chip architects, compiler engineers, and runtime engineers to create, build, and tune distributed training solutions on Trainium instances.

Experience training these large models with PyTorch is a must, along with awareness of distributed training strategies such as FSDP (Fully Sharded Data Parallel), pipeline parallelism (PP), and context parallelism. Distributed training libraries such as torchtitan, torchtune, HF RL, and DeepSeek are central to this work, and extending them to Neuron-based systems to enable large-scale training is a key focus. Experience in post-training strategies such as DPO, PPO, and HF torch-tune is an additional strength and aligns with the team's success.

Requirements

  • Bachelor's degree in computer science or equivalent.
  • 5+ years of non-internship professional software development experience.
  • 5+ years of programming experience with at least one programming language.
  • 5+ years of experience leading the design or architecture of new and existing systems.
  • 5+ years of experience across the full software development life cycle.
  • Experience as a mentor, tech lead or leading an engineering team.
  • Experience in machine learning and large-scale training with LLMs, and expertise in PyTorch.

Nice To Haves

  • Master's degree in computer science or equivalent.
  • Experience in computer architecture.
  • Previous software engineering expertise with PyTorch/JAX/TensorFlow.
  • Experience with distributed libraries and frameworks.
  • End-to-end model training experience.

Responsibilities

  • Lead efforts to build distributed training support into PyTorch, the Neuron compiler, and runtime stacks.
  • Enable distributed training strategies and optimize models to achieve peak performance on AWS custom silicon, including Trainium servers.
  • Work effectively within cross-functional teams.
  • Deep dive into software development challenges.

Benefits

  • Medical, financial, and/or other benefits.
  • Equity, sign-on payments, and other forms of compensation may be provided as part of a total compensation package.