Fellow, DCGPU Systems Debug Lead

Advanced Micro Devices, Inc
Onsite

About The Position

At AMD, our mission is to build great products that accelerate next-generation computing experiences, from AI and data centers to PCs, gaming, and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity, and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture. We push the limits of innovation to solve the world’s most important challenges, striving for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

THE TEAM: AMD's Data Center GPU organization is transforming the industry with our AI-based graphics processors. Our primary objective is to design exceptional products that drive the evolution of computing experiences, serving as the cornerstone for enterprise data centers, artificial intelligence (AI), HPC, and embedded systems. If this resonates with you, come and join our Data Center GPU organization, where we are building amazing AI-powered products with amazing people.

THE ROLE: We are seeking an exceptional Systems Debug lead with strong validation expertise to join our Data Center GPU Customer Engineering team. In this pivotal role, you will lead the debug of hardware failures reported by AI customers, ensuring robust system-level integration and validation on DCGPUs, particularly in complex technology domains, by working closely with our customers. Our cutting-edge Data Center GPU solutions, encompassing APUs and GPUs, demand a proactive approach to testing and debug, aiming not just to detect issues but also to identify and mitigate future failures.

THE PERSON: As a Systems Debug lead, your mission is to orchestrate end-to-end debug of customer issues by working closely with cross-functional engineering teams across the company. You will need to demonstrate strong validation skills, as you will interact with system validation, debug architecture, and design organizations. You need to ensure products are well tested internally, deliberately pushing them to their limits to uncover vulnerabilities. This is a hands-on technical position that requires expertise in systems design engineering, which will be crucial for comprehensive product development, innovative validation strategies, and efficient problem-solving.

Nice To Haves

  • Proficiency in programming/scripting languages (e.g., C/C++, Perl, Ruby, Python).
  • Expertise in state-of-the-art debugging techniques and methodologies.
  • Extensive experience with lab equipment such as protocol/logic analyzers and oscilloscopes.
  • Deep knowledge of board/platform-level debug, including delivery, sequencing, analysis, and optimization.
  • Comprehensive understanding of system architecture, with a focus on technical debug and validation strategy development.
  • Exceptional analytical and problem-solving skills, with meticulous attention to detail.
  • Self-driven with the ability to lead tasks independently to successful completion.

Responsibilities

  • Engaging proactively with debug architecture teams to define the right debug hooks in hardware and firmware, ensuring they are ready for silicon.
  • Performing hands-on debug of L2 customer issues from our application engineers and driving them to closure by working with cross-functional engineering teams.
  • Championing the process of debugging, root cause analysis, and resolution of issues discovered during the validation phases of AI and HPC systems.
  • Orchestrating the development and implementation of advanced validation strategies, specifically designed to stress and break the system, thereby identifying potential product weaknesses.
  • Pioneering technical validation initiatives focused on high-impact areas such as PCIe, HBM, and SMC/BMC firmware to identify vulnerabilities during system-level integration.
  • Creating and executing validation test plans that address both functional and stress scenarios, including emulation of end-customer systems.
  • Ensuring compliance with OCP standards and secure solution development, including Out of Band Management and Redfish features.
  • Collaborating with multiple teams to devise and execute exhaustive validation test plans that simulate real-world stress scenarios and customer workloads.
  • Working closely with development teams to ensure all identified issues are addressed and rectified before production.
  • Advancing end-to-end validation test content, utilizing creative debugging skills and innovative approaches.

Benefits

  • AMD benefits at a glance.