Technical Program Manager- AI Cluster Engineering

Advanced Micro Devices, Inc.
Austin, TX
Onsite

About The Position

At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture. We push the limits of innovation to solve the world’s most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

We are seeking an experienced Technical Program Manager to drive end-to-end execution of AI cluster engineering programs spanning GPU platforms, rack-scale solutions, high-speed networking, and datacenter AI infrastructure. You will work cross-functionally to translate customer and internal requirements into executable plans, manage risks and dependencies, and deliver scalable, production-ready solutions across GPU → rack → cluster deployments.

You are a hands-on TPM who thrives in complex, fast-moving ecosystems and can connect deep technical details to crisp program plans, executive reporting, and customer outcomes. You will partner with cross-functional teams to drive server integration through rack- and cluster-level validation. You bring strong ownership, structured execution, and the ability to lead through influence across engineering, operations, vendors, and customers.

Requirements

  • Proven program management experience delivering complex, cross-functional hardware/software infrastructure programs (server/rack/cluster environments).
  • Strong understanding of datacenter building blocks and lifecycle: servers, racks, clusters, HW/FW/SW integration, and readiness/validation flows.
  • Demonstrated ability to build and run schedules, manage risks, lead matrix teams, and communicate clearly to engineering and executive audiences.
  • Strong working knowledge of program tools (e.g., Jira/Confluence/Microsoft Office) and dashboard-based execution management.
  • Bachelor’s or Master’s degree in Computer or Electrical Engineering (or equivalent).

Nice To Haves

  • AI cluster networking domain experience (NICs, switching/optics/cabling, topologies).
  • Familiarity with rack/pod management and operations concerns (telemetry, health monitoring, power control, FW provisioning, management networks).
  • Experience leading server integration programs with OEMs, ODMs, or data centers.
  • Demonstrated horizontal leadership across large matrix organizations.
  • Formal PM education/certification (PMP / Scrum Master) preferred.

Responsibilities

  • Define, plan, and drive program plans for AI infrastructure systems validation and readiness, including server integration, rack bring-up, and cluster-scale deployment readiness.
  • Create and maintain core PM artifacts: schedules, dependency maps, resource forecasts, risk/issue logs, and program dashboards/status reports.
  • Identify and drive mitigation plans for issues/risks, including cross-team escalations and corrective actions across multiple engineering areas.
  • Own program execution for rack- and cluster-network enablement, including topology decisions, switching/optics/cabling readiness, and validation schedules for scale-out operation.
  • Drive alignment on advanced AI networking requirements, such as network architecture choices and reliability impacts that require mitigation.
  • Partner with internal/external stakeholders to track and close network blockers.
  • Lead cross-functional delivery for rack solutions that integrate CPUs + GPUs + NICs, ensuring end-to-end readiness across hardware, firmware, and management interfaces.
  • Drive requirements capture and execution planning for rack-scale deployments (rack density, rack form factor, power targets, power whips, liquid cooling, etc.) and ensure integration plans are validated with engineering and operations.
  • Own program coordination for pod/rack manageability solutions, aligning requirements and milestones for inventory, health monitoring, cluster provisioning, and observability across large-scale deployments.
  • Coordinate with platform/automation teams on cluster provisioning and orchestration.
  • Drive readiness for rack-level automation and regression workflows (scripts, log mapping, infrastructure automation planning), planning execution to de-risk hardware arrival timing.
  • Partner with CI/CD and FW automation stakeholders to align on deliverables and validation gates.

Benefits

  • AMD benefits at a glance