Sr. Technical Program Manager, Global Data Center Operations (Central Operations)

Amazon • Seattle, WA

About The Position

The Central Operations team within Amazon Web Services (AWS) Infrastructure is seeking a Senior Technical Program Manager to drive the health, stability, and operational excellence of new hardware deployments across our global data center fleet. This role uniquely blends technical program management with strategic account management to ensure our GenAI and high-performance computing infrastructure delivers maximum value to customers.

As a Sr. TPM, you will be the technical advocate and strategic advisor for operational support of new AI/ML hardware platforms. You will serve as the central owner of operational health (failure rate, repair efficacy, repair dwell time, break/fix process improvement) while driving cross-functional initiatives to improve these key performance indicators. You will work at the intersection of hardware engineering, data center operations, and service teams like EC2, translating complex technical data into actionable insights and leading programs that accelerate capacity delivery while maintaining the highest standards of operational health.

This is not a sales role, but rather an opportunity to be the 'voice of the customer' and the 'voice of operations' for critical infrastructure that powers AWS's most demanding workloads. You will craft and execute strategies to optimize new hardware deployments, proactively identify and remediate stability issues, and establish best practices that scale across AWS's global infrastructure.

Requirements

5+ years of technical product or program management experience
7+ years of experience working directly with engineering teams
Experience in root cause analysis and error correction, identifying changes to procedures and systems to implement long-term fixes and avoid repeating issues
Experience leading process improvements
Strong written and verbal communication skills, with experience communicating with technical and non-technical audiences, including senior leadership

Nice To Haves

Experience in technical account management, business relationship management, or consulting
Knowledge of Six Sigma tools, Lean techniques, PMP, or similar standards
Experience in server technologies such as thermal, mechanical, power, and signal integrity
Experience managing UltraServer, high-performance computing, or AI/ML infrastructure deployments

Responsibilities

Own the end-to-end health and stability metrics for new AI/ML hardware platforms, establishing KPIs and routines that provide real-time visibility into operational performance
Conduct deep-dive analyses on hardware failures to identify root causes and drive systematic improvements
Lead cross-functional investigations, experiments, and post-mortem processes, ensuring lessons learned translate into preventive measures and design improvements
Develop and maintain hardware health scorecards that inform leadership decisions on deployment readiness, capacity planning, and risk mitigation
Manage complex, multi-phase infrastructure projects involving hardware engineering, supply chain, data center operations, and software teams across multiple time zones
Establish and maintain program schedules, budgets, and resource plans, proactively identifying and mitigating risks to delivery timelines
Facilitate technical deep dive sessions to troubleshoot diagnostic and repair issues, remove blockers, and accelerate project delivery
Design and implement processes that eliminate non-value-add activities and optimize deployment velocity without compromising quality
Serve as the primary operational point of contact for new platforms across software and hardware teams, summarizing platform operational status and path-to-green
Build trusted advisor relationships with data center operations, hardware engineering, and service teams to understand their operational needs and technical challenges
Translate operational feedback and customer requirements into engineering priorities and roadmaps for hardware and process improvement
Provide strategic technical guidance on AI/ML deployment strategies, best practices, and operational procedures
Advocate for operational excellence, ensuring that hardware health considerations are integrated into capacity planning and service delivery decisions
Partner with hardware engineering teams to influence design decisions based on operational data and field performance
Collaborate with new product introduction and hardware engineering teams to ensure quality gates are met before launch
Work with monitoring and automation teams to implement appropriate signals to ensure customer commitments are met
Drive alignment across diverse stakeholders including engineering, operations, finance, and executive leadership
Present technical assessments and recommendations to senior leadership, clearly articulating trade-offs, risks, and business impact

Benefits

Health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage)
401(k) matching
Paid time off
Parental leave
