As a Principal GPU Validation Engineer, you will architect, validate, and debug large-scale AMD GPU systems spanning device, node, chassis, and rack-level deployments. You will define system-level test strategies for multi-GPU, multi-node accelerator platforms, ensuring correctness, performance, scalability, and reliability across hardware and software boundaries. This role is deeply technical and hands-on, involving GPU bring-up, firmware/driver interaction, networking validation (RDMA), and large-scale cluster enablement. You will directly influence product readiness and future AMD GPU platform designs by providing system-level feedback into architecture, silicon features, and validation infrastructure.

You are a system thinker with deep technical instincts, capable of root-causing failures that span GPU silicon, PCIe/Infinity Fabric, networking, drivers, firmware, and orchestration layers. You are comfortable debugging issues that only emerge at scale: during long-running workloads, high-throughput fabric stress, or multi-node synchronization scenarios.

You bring:
- Proven technical leadership in complex GPU/accelerator environments
- The ability to translate low-level failures (timeouts, hangs, data corruption) into actionable root causes
- Strong collaboration skills across Architecture, Design, Firmware, Software, and Validation teams
- A track record of building robust, repeatable test infrastructure, not one-off debug scripts
Job Type
Full-time
Career Level
Principal
Number of Employees
5,001-10,000 employees