Software Engineer - AI/ML, AWS Neuron Distributed Training - Multimodal

Amazon•Cupertino, CA

1d•$165,200 - $223,600•Onsite

About The Position

This role is for a machine learning engineer on the Distributed Training team for AWS Neuron, responsible for development, enablement and performance tuning of a wide variety of ML model families, including massive-scale Large Language Models (LLMs) such as GPT and Llama, as well as Stable Diffusion, Vision Transformers (ViT) and many more. The ML Distributed Training team works side by side with chip architects, compiler engineers and runtime engineers to create, build and tune distributed training solutions with Trainium instances. Experience with training these large models using Python is a must. FSDP (Fully Sharded Data Parallel), DeepSpeed and other distributed training libraries are central to this work, and extending them to Neuron-based systems is key.

Requirements

Bachelor's degree in computer science or equivalent
2+ years of experience with computer science fundamentals (object-oriented design, data structures, algorithm design, problem solving and complexity analysis)
2+ years of experience contributing to the architecture and design of new and current systems (architecture, design patterns, reliability and scaling)
Experience programming with at least one software programming language
Experience in machine learning, data mining, information retrieval, statistics or natural language processing
Experience with training large models using Python

Nice To Haves

Master's degree in computer science or equivalent
2+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
Experience in computer architecture
Previous software engineering expertise with PyTorch/JAX/TensorFlow, distributed libraries and frameworks, and end-to-end model training
Previous experience with training multi-modal models for understanding and generating images, video and audio

Responsibilities

Help lead efforts to build distributed training support into PyTorch and TensorFlow using XLA and the Neuron compiler and runtime stacks.
Help tune these models for the highest performance and maximum efficiency on the custom AWS Trainium and Inferentia silicon and the Trn1 and Inf1/Inf2 servers.

Benefits

Health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage)
401(k) matching
Paid time off
Parental leave
