Senior Software Development Engineer - DCGPU Diagnostics Quality Team

Advanced Micro Devices, Inc.
Markham, ON
Hybrid

About The Position

At AMD, our mission is to build great products that accelerate next-generation computing experiences, from AI and data centers to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture. We push the limits of innovation to solve the world’s most important challenges, striving for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

In this role, you will support the development and deployment of diagnostic tests that validate AMD Data Center GPU products at all test stages, from silicon screening to server rack assembly.

Requirements

  • Strong proficiency in Python and C++
  • SQL and Snowflake for data analysis and reporting
  • Linux system administration and shell scripting
  • Git version control and code review practices
  • Experience with diagnostic tools and hardware debugging methodologies
  • Knowledge of at least one GPU programming framework (ROCm/CUDA/OpenCL/Vulkan/OpenGL), with ROCm strongly preferred
  • Excellent written and verbal communication skills are an absolute must
  • Ability to document technical designs, test plans, and procedures clearly
  • Proven ability to coordinate with cross-functional teams
  • BS in Computer Science, Computer Engineering, Electrical Engineering, or related field preferred
  • Equivalent experience considered

Nice To Haves

  • Proven experience in software development or test engineering
  • Proven experience with hardware/silicon validation or manufacturing test environments
  • Hands-on debugging and root cause analysis in low-level hardware/software systems
  • Experience with server or datacenter systems architecture
  • Understanding of silicon validation processes and test methodologies
  • Familiarity with manufacturing workflows and production test environments
  • Knowledge of server architectures (BMC, firmware, system integration)
  • Experience with GPU/accelerator performance metrics including computational throughput, memory bandwidth, power efficiency, thermal characteristics, and whole-system performance
  • Background in AMD GPU or CPU technologies is a plus

Responsibilities

  • Design and implement diagnostic tests for AMD silicon and server platforms
  • Develop test automation frameworks and infrastructure
  • Debug test failures and hardware issues across production stages
  • Optimize test coverage and execution time
  • Lead root cause analysis and debug efforts for failures on production systems, often in time-sensitive and urgent scenarios
  • Interface with silicon design, firmware, performance, systems integration, and manufacturing teams to investigate and resolve issues
  • Support manufacturing partners in test bring-up and issue resolution
  • Coordinate test deployment schedules and deliverables
  • Track and report on test coverage, quality metrics, and production readiness
  • Participate in code reviews and maintain test code quality
  • Document test specifications and deployment procedures
  • Occasional lab work and limited factory visits as needed

Benefits

  • AMD benefits at a glance