Join the EC2 Machine Learning Systems team at Amazon Web Services (AWS) as a System Development Engineer III and lead the development of operational visibility and tooling for EC2 supercomputer instance families. In this role, you'll leverage your specialized knowledge of distributed systems to improve system automation and operational tooling between infrastructure hosting EC2 instances and back-end control plane infrastructure.

This position offers a unique opportunity to work at the intersection of high-performance computing and machine learning infrastructure. You'll apply operations best practices at scale while developing tools and systems that enhance visibility, maintenance, and operations of customer-facing supercomputer instance types. Your work will directly impact how AWS customers leverage compute resources for their most demanding machine learning workloads.

About the team
The EC2 Nitro Machine Learning Systems team is responsible for development, operations, and maintenance of scale-out machine learning platforms used for training and inference workloads. We build and optimize the infrastructure that powers some of the most computationally intensive AI/ML workloads in the cloud. Our team is passionate about creating reliable, high-performance systems that enable customers to push the boundaries of what's possible with machine learning. Working with us means having the opportunity to influence the future of supercomputing in the cloud while solving complex technical challenges at massive scale. We collaborate closely with customers and internal teams to continuously improve our platforms and deliver innovations that accelerate machine learning workflows.
Job Type
Full-time
Career Level
Mid Level
Education Level
No Education Listed