About The Position

We are seeking an experienced AI HW Systems Engineering and Debug Lead to drive system-level debug and bring-up activities for Graphcore’s next-generation AI data center platforms. The successful candidate will lead complex debug efforts across hardware, firmware, and software layers for blade and rack-level systems. This role focuses on developing scalable debug strategies, improving debug throughput, and ensuring timely resolution of system-level issues throughout the product lifecycle.

Requirements

  • Bachelor’s or Master’s degree in Electrical Engineering, Computer Engineering, or related discipline.
  • 15+ years of experience working on complex systems engineering challenges involving HW/FW/SW debug in server or data center environments.
  • Proven experience leading validation and debug for board, blade, and rack-level hardware platforms.
  • Strong experience debugging OS, firmware, silicon, and hardware issues.
  • Understanding of industry-standard system buses such as PCIe and CXL and their software stacks.
  • Strong knowledge of ARM or x86 CPU architectures, SoC design, memory systems, and power management.
  • Experience with system architecture, validation strategies, and complex system debug methodologies.
  • Strong collaboration, communication, and cross-team coordination skills.

Nice To Haves

  • Experience designing or deploying AI/ML rack-scale systems.
  • Experience developing at-scale debug methodologies for hyperscale data center systems.
  • Familiarity with data center infrastructure and emerging AI hardware technologies.
  • Experience with rack integration testing and hyperscale deployment readiness.
  • Knowledge of automated validation frameworks, test analytics, and continuous validation practices.

Responsibilities

  • Own and develop AI systems debug methodology and system bring-up strategies for next-generation AI data center platforms.
  • Lead system-level debug and root cause analysis for issues identified during server rack validation, post-silicon validation, and production phases.
  • Drive complex debug efforts across silicon, hardware platforms, firmware, operating systems, and software stacks.
  • Manage and track technical issues, risks, and priorities to ensure program milestones are achieved.
  • Publish debug program indicators and metrics to identify roadblocks and improve debug throughput.
  • Coordinate cross-functional teams including system architecture, silicon, firmware, and validation teams to resolve system-level issues.
  • Lead development and integration of debug tools, scripts, and methodologies to improve debug efficiency.
  • Communicate program status, risks, and technical findings to engineering leadership and stakeholders.