Platform PPA Debug Engineer

Advanced Micro Devices, Inc.
Austin, TX

About The Position

At AMD, our mission is to build great products that accelerate next-generation computing experiences, from AI and data centers to PCs, gaming, and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity, and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges, striving for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

About the Team

The Data Center GPU Power and Performance Attainment (PPA) Team is a hardware-focused lab organization responsible for optimizing power, performance, and performance-per-watt across AMD's Data Center GPU products. The team works at the intersection of silicon, systems, firmware, and workloads, driving post-silicon validation, power feature tuning, and product readiness for large-scale AI and HPC deployments.

The Opportunity

Serve as Platform PPA debug engineer for Instinct Datacenter GPUs, driving GPU/system/board-level triage and resolution of Power, Performance, Thermal, and VF issues using lab reproduction plus HW/FW/SW telemetry and logs, in partnership with cross-functional teams.

The Person

An engineer with deep expertise in datacenter platform power/performance/thermal debug and optimization. Hands-on in the lab and effective across hardware, firmware/BIOS, and software teams, they use Linux logs and telemetry to drive issues to root cause and closure.

Requirements

  • 8+ years of experience in silicon power, performance, and thermal characterization, debug, validation or customer engineering roles.
  • Solid understanding of semiconductors, CPU/GPU architecture, and power management features, including power, thermal, VF, and performance aspects of design and validation.
  • Experience with system/platform debug workflows and cross-functional issue triage across hardware, firmware/BIOS, and software.
  • Hands-on experience with lab debug tools (e.g., logic analyzers, oscilloscopes, power monitors) and with server platform bring-up/triage involving high-speed I/O (e.g., PCIe/CXL), power delivery, and board-level sequencing.
  • Proficiency in scripting (Python, Perl, shell) for automation, log parsing, and data analysis.
  • Familiarity with firmware and low-level software interactions with hardware (including BIOS and BMC interfaces).
  • Experience working with customer engineering and manufacturing teams.
  • Excellent communication and documentation skills, including executive reporting and leading cross-domain meetings.
  • Bachelor's degree in Computer Engineering, Electrical Engineering, or Computer Science.

Nice To Haves

  • Experience with HPC/AI workloads and GPU performance benchmarks in datacenter environments (boards, systems, racks, clusters); familiarity with AI-assisted analysis/debug tooling is a plus.
  • Master's degree preferred.

Responsibilities

  • Lead GPU/system/board-level debug of power, performance, VF, and thermal issues reported by internal teams and external customers.
  • Analyze platform telemetry, Linux logs, and FW/BIOS signals to isolate failures that span hardware, firmware, and software.
  • Coordinate across architecture, design, validation, software, and customer engineering to drive root-cause and closure.
  • Develop and maintain debug methodologies and automation to accelerate root-cause analysis.
  • Resolve systemic issues that impact power, performance-per-watt, or performance targets, and validate improvements.
  • Lead debug cadence (meetings, executive updates) to align stakeholders, communicate status/trends, and remove blockers.
  • Partner with the extended team in Malaysia to ensure global debug coverage and continuity.
  • Own customer escalations with the debug council/customer engineering, including test execution to confirm resolution and close issues.
  • Use manufacturing screens/data and failure analysis (lab, manufacturing, field returns) to identify root cause and drive corrective actions.
  • Mentor junior engineers on debug execution and best practices.
  • Document debug findings, resolutions, and learnings to improve internal reuse and next-generation test plans.

Benefits

  • Competitive compensation, benefits, and global career opportunities.