About The Position

The Central Operations team within Amazon Web Services (AWS) Infrastructure is seeking a Senior Technical Program Manager to drive the health, stability, and operational excellence of new hardware deployments across our global data center fleet. This role uniquely blends technical program management with strategic account management to ensure our GenAI and high-performance computing infrastructure delivers maximum value to customers.

As a Sr. TPM, you will be the technical advocate and strategic advisor for operational support of new AI/ML hardware platforms. You will serve as the central owner of operational health (failure rate, repair efficacy, repair dwell time, break/fix process improvement) while driving cross-functional initiatives to improve these key performance indicators. You will work at the intersection of hardware engineering, data center operations, and service teams like EC2, translating complex technical data into actionable insights and leading programs that accelerate capacity delivery while maintaining the highest standards of operational health.

This is not a sales role, but rather an opportunity to be the 'voice of the customer' and the 'voice of operations' for critical infrastructure that powers AWS's most demanding workloads. You will craft and execute strategies to optimize new hardware deployments, proactively identify and remediate stability issues, and establish best practices that scale across AWS's global infrastructure.

Requirements

  • 5+ years of technical product or program management experience
  • 7+ years of experience working directly with engineering teams
  • Experience in root cause analysis and error correction, identifying changes to procedures and systems to implement long-term fixes and avoid repeating issues
  • Experience leading process improvements
  • Strong written and verbal communication skills, with experience communicating with technical and non-technical audiences, including senior leadership

Nice To Haves

  • Experience in technical account management, business relationship management, or consulting
  • Knowledge of Six Sigma tools, Lean techniques, PMP, or similar standards
  • Experience with server technologies such as thermal, mechanical, power, and signal integrity
  • Experience managing UltraServer, high-performance computing, or AI/ML infrastructure deployments

Responsibilities

  • Own the end-to-end health and stability metrics for new AI/ML hardware platforms, establishing KPIs and routines that provide real-time visibility into operational performance
  • Lead deep-dive analyses on hardware failures to identify root causes and drive systematic improvements
  • Lead cross-functional investigations, experiments, and post-mortem processes, ensuring lessons learned translate into preventive measures and design improvements
  • Develop and maintain hardware health scorecards that inform leadership decisions on deployment readiness, capacity planning, and risk mitigation
  • Manage complex, multi-phase infrastructure projects involving hardware engineering, supply chain, data center operations, and software teams across multiple time zones
  • Establish and maintain program schedules, budgets, and resource plans, proactively identifying and mitigating risks to delivery timelines
  • Facilitate technical deep dive sessions to troubleshoot diagnostic and repair issues, remove blockers, and accelerate project delivery
  • Design and implement processes that eliminate non-value-add activities and optimize deployment velocity without compromising quality
  • Serve as the primary operational point of contact for new platforms across software and hardware teams, summarizing platform operational status and path-to-green
  • Build trusted advisor relationships with data center operations, hardware engineering, and service teams to understand their operational needs and technical challenges
  • Translate operational feedback and customer requirements into hardware and process improvement roadmaps and engineering priorities
  • Provide strategic technical guidance on AI/ML deployment strategies, best practices, and operational procedures
  • Advocate for operational excellence, ensuring that hardware health considerations are integrated into capacity planning and service delivery decisions
  • Partner with hardware engineering teams to influence design decisions based on operational data and field performance
  • Collaborate with new product introduction and hardware engineering teams to ensure quality gates are met before launch
  • Work with monitoring and automation teams to implement appropriate signals to ensure customer commitments are met
  • Drive alignment across diverse stakeholders including engineering, operations, finance, and executive leadership
  • Present technical assessments and recommendations to senior leadership, clearly articulating trade-offs, risks, and business impact

Benefits

  • health insurance (medical, dental, vision, prescription, basic life and AD&D insurance with optional supplemental life plans, EAP, mental health support, medical advice line, flexible spending accounts, adoption and surrogacy reimbursement coverage)
  • 401(k) matching
  • paid time off
  • parental leave