About The Position

Lead the EC2 UltraServer Provisioning team in building and delivering NVIDIA-based UltraServers (GB200, GB300) through comprehensive ML provisioning workflows. Drive build quality improvements, reduce dwell times, and collaborate on unsellable reduction programs to ensure reliable, scalable delivery of ML infrastructure for Amazon's rapidly growing compute fleet.

Key job responsibilities

The role spans five areas: Team Leadership & People Management, Technical Strategy & Execution, Cross-Functional Collaboration, Operational Excellence & Innovation, and Business Impact & Communication. The individual duties are itemized under Responsibilities.

About the team

The EC2 UltraServer Provisioning team is a high-performing engineering organization responsible for delivering NVIDIA-based ML infrastructure at scale. We manage end-to-end provisioning workflows for GB200 and GB300 UltraServers, from host ingestion through testing, repair, and recovery. Our team drives operational excellence through continuous improvement of build quality metrics, reduction of dwell times, and collaboration on fleet-wide unsellable reduction initiatives. We work closely with hardware engineering, data center operations, and EC2 service teams to ensure reliable, efficient delivery of critical ML compute capacity. This is a high-impact role leading a two-pizza team of talented engineers solving complex technical challenges in one of Amazon's fastest-growing infrastructure domains.

Requirements

  • 3+ years of engineering team management experience
  • 7+ years of experience working directly within engineering teams
  • 3+ years of experience designing or architecting new and existing systems (design patterns, reliability, and scaling)
  • 8+ years of experience leading the definition and development of multi-tier web services
  • Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations
  • Experience partnering with product or program management teams

Nice To Haves

  • Experience in communicating with users, other technical teams, and senior leadership to collect requirements, describe software product features, technical designs, and product strategy
  • Experience in recruiting, hiring, mentoring/coaching, and managing teams of Software Engineers to improve their skills and make them more effective product software engineers

Responsibilities

  • Lead and mentor a two-pizza team of 8-12 Software Development Engineers focused on EC2 server provisioning and infrastructure delivery
  • Conduct regular 1:1s, provide continuous feedback, and drive career development for all team members
  • Recruit, hire, and onboard top engineering talent to scale the team in alignment with organizational goals
  • Foster an inclusive team culture that promotes innovation, collaboration, and operational excellence
  • Manage performance reviews, compensation planning, and promotion processes
  • Own the technical roadmap and delivery of EC2 provisioning systems for compute infrastructure for NVIDIA UltraServer Platforms (GB200, GB300)
  • Drive architectural decisions for scalable, reliable provisioning workflows that support Amazon's rapidly growing EC2 UltraServer fleet
  • Partner with hardware engineering, data center operations, and EC2 service teams to ensure seamless server-to-host transitions
  • Lead incident response and operational excellence initiatives, driving down provisioning failures and improving time-to-production metrics
  • Establish and monitor key performance indicators (KPIs) including provisioning yield, time-to-sellable, and fleet availability
  • Collaborate with senior leadership and peer SDMs to align team priorities with broader EC2 organizational objectives
  • Work closely with Principal Engineers and Senior SDEs to define technical standards and best practices
  • Partner with Program Management, Product, and Operations teams to deliver on committed timelines and capacity goals
  • Represent the team in organizational planning, resource allocation discussions, and technical reviews
  • Drive continuous improvement in software development practices, including CI/CD pipelines, testing frameworks, and deployment automation
  • Champion operational metrics and mechanisms to improve system reliability, reduce toil, and accelerate delivery
  • Balance short-term delivery commitments with long-term technical debt reduction and system modernization
  • Promote a culture of experimentation and learning, encouraging the team to adopt new technologies and methodologies
  • Translate business requirements into technical execution plans with clear milestones and success criteria
  • Communicate team progress, risks, and achievements to senior leadership through written narratives and business reviews
  • Contribute to organizational strategy discussions, providing insights on technical capabilities and constraints
  • Manage stakeholder expectations and build trust through consistent delivery and transparent communication

Benefits

  • health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage)
  • 401(k) matching
  • paid time off
  • parental leave