The AI Customer Engineering organization is looking for a Principal AI Systems Engineer to help customers achieve best-in-class telemetry capabilities on AMD GPU platforms. This is a hands-on, customer-facing role leading full-stack debug of AI infrastructure, with a focus on Reliability, Availability, and Serviceability (RAS) features. The ideal candidate is a senior technologist with significant experience in server architecture, RAS, and debug, along with firmware and software skills. Expertise in the Data Center and AI domains is essential, including a good understanding of AI workload optimization and of resolving customer design issues. The role requires experience in debugging complex full-stack SW/FW/HW issues, understanding the flow of a GPU through different system layers, and validating components connected to the GPU SoC (PCIe, VRs, RMs, retimers, HBM, internal networking). Effective communication with the owners of the various functional code stacks and the ability to drive issues to resolution through multiple channels are crucial. Hands-on experience with hardware in a Data Center environment is also required.
Job Type: Full-time
Career Level: Principal
Number of Employees: 5,001-10,000 employees