Software Engineer - AI/ML, AWS Neuron Distributed Training - Multimodal

Amazon
Cupertino, CA
$165,200 - $223,600
Onsite

About The Position

This role is for a machine learning engineer on the Distributed Training team for AWS Neuron, responsible for the development, enablement and performance tuning of a wide variety of ML model families, including massive-scale Large Language Models (LLMs) such as GPT and Llama, as well as Stable Diffusion, Vision Transformers (ViT) and many more. The ML Distributed Training team works side by side with chip architects, compiler engineers and runtime engineers to create, build and tune distributed training solutions with Trainium instances. Experience training these large models using Python is a must. FSDP (Fully Sharded Data Parallel), DeepSpeed and other distributed training libraries are central to this work, and extending them to Neuron-based systems is key.

Requirements

  • Bachelor's degree in computer science or equivalent
  • 2+ years of experience with computer science fundamentals (object-oriented design, data structures, algorithm design, problem solving and complexity analysis)
  • 2+ years of experience contributing to the architecture and design of new and current systems (architecture, design patterns, reliability and scaling)
  • Experience programming with at least one software programming language
  • Experience in machine learning, data mining, information retrieval, statistics or natural language processing
  • Experience with training large models using Python

Nice To Haves

  • Master's degree in computer science or equivalent
  • 2+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Experience in computer architecture
  • Previous software engineering expertise with PyTorch/JAX/TensorFlow, distributed libraries and frameworks, and end-to-end model training
  • Previous experience training multimodal models for understanding and generating images, video and audio

Responsibilities

  • Help lead the effort to build distributed training support into PyTorch and TensorFlow using XLA and the Neuron compiler and runtime stacks.
  • Help tune these models to ensure the highest performance and maximize their efficiency on the custom AWS Trainium and Inferentia silicon and the Trn1 and Inf1/Inf2 servers.

Benefits

  • health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and an option for Supplemental Life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage)
  • 401(k) matching
  • paid time off
  • parental leave