About The Position

EC2 Nitro drives the planet’s largest, fastest-growing and most feature-rich compute cloud. Nitro is AWS’s ground-up design for virtualization at global scale, built on a fully custom stack of hardware, firmware and applications. Nitro has enabled EC2 to support Intel, AMD and Amazon’s custom silicon (the Graviton processor family) while raising the industry bar for security and performance across our product line.

This role involves integrating hardware, firmware, application software and services to deliver new virtualized and bare-metal compute platforms for customers ranging from startups to Fortune 500 companies. We are looking for an experienced leader to drive software development and scaling for new EC2 compute platforms, working with a broad and deep group of technical teams that develop hardware, firmware, systems and application software. The EC2 Machine Learning Supercompute team develops next-generation Ultraserver platforms to power high-performance training and inference workloads.

Requirements

  • 5+ years of non-internship professional software development experience
  • 5+ years of programming experience with at least one programming language
  • 5+ years of experience leading the design or architecture (design patterns, reliability and scaling) of new and existing systems
  • Experience as a mentor, tech lead or leading an engineering team
  • Solid understanding of computer science fundamentals
  • Expertise in C, C++ or Rust development in a Linux environment
  • Experience with Linux package management
  • Experience with version control systems
  • Experience with automated build processes
  • Experience with software unit testing

Nice To Haves

  • In-depth knowledge of ML frameworks and cluster management
  • 5+ years of experience across the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Bachelor's degree in computer science or equivalent
  • Experience in embedded development in C/C++

Responsibilities

  • Design and develop innovative technologies that power the infrastructure supporting machine learning workloads on Ultraservers.
  • Lead technical projects establishing EC2 as the pioneer in cloud computing for ML workloads across diverse applications including LLMs, multimodal systems, and emerging model architectures.
  • Develop and maintain comprehensive regression testing systems that validate performance across major component releases including frameworks, firmware, drivers, and networking infrastructure.
  • Collaborate with hardware engineering teams to influence future platform designs based on performance insights gathered from state-of-the-art research and customer workloads.
  • Build customer relationships by investigating complex performance challenges, developing solutions, and publishing actionable best practices through multiple channels.

Benefits

  • Comprehensive benefits including health insurance (medical, dental, vision, prescription), Basic Life & AD&D insurance with optional Supplemental Life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, and Adoption and Surrogacy Reimbursement coverage
  • 401(k) matching
  • Paid time off
  • Parental leave
  • Sign-on payments
  • Restricted stock units (RSUs)