Software Development Engineer, Neuron Collectives, Annapurna Labs

Amazon•Cupertino, CA

Onsite

About The Position

Annapurna Labs, part of AWS, develops critical hardware and software components for EC2 infrastructure and improves the AWS customer experience through the design of software, systems, and chips. The AWS Neuron Collectives team is seeking a Software Engineer to optimize collective operations for AWS Trainium, a key initiative powering frontier AI models. The role involves deep optimization of compute for the specific topologies used in modern LLM training, close collaboration with the hardware team, and C/C++ work that interfaces with DMA and firmware to reach maximum performance. You will analyze current collective algorithms using tools like Neuron Explorer and optimize them to fully utilize compute and bus bandwidth for data center scaling. The work affects AI training at AWS scale and gives you room to grow your technical expertise.

Requirements

Experience building complex software systems that have been successfully delivered to customers
Experience contributing to the architecture and design (architecture, design patterns, reliability and scaling) of new and current systems
Bachelor's degree in computer science or equivalent
Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations
Experience in development in the last 3 years, or experience in embedded development in C/C++

Nice To Haves

Master's degree in computer science or equivalent
Experience with hardware/software integration and real-time systems
Familiarity with collective communication algorithms (e.g., all-reduce, all-gather) or distributed training frameworks

Responsibilities

Enhance collective algorithms and topologies for optimal training performance
Use tools like Neuron Explorer to identify bottlenecks in compute and bus bandwidth utilization
Monitor and analyze processor, DMA, firmware, and workload metrics
Optimize collective operations to scale AI compute across the data center
Work closely with the hardware team to co-optimize software and Trainium silicon
Develop and optimize C/C++ implementations of collective communication patterns
Investigate and implement improvements for specific training topologies used by modern LLMs
Build and maintain analysis frameworks and automation solutions

Benefits

Health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental Life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage)
401(k) matching
Paid time off
Parental leave
Sign-on payments
Restricted stock units (RSUs)
