Senior Systems Engineer

Graphcore, Milpitas, CA

About The Position

Graphcore is a leading innovator in Artificial Intelligence compute, developing hardware, software, and systems infrastructure to drive AI breakthroughs. As part of the SoftBank Group, Graphcore is committed to enabling Artificial Super Intelligence and making its benefits accessible to everyone. The company fosters a culture of continuous learning and innovation, with diverse teams of AI research specialists, silicon designers, software engineers, and systems architects.

This role focuses on providing advanced operational, diagnostic, and engineering support for Graphcore’s Arm-based hardware platforms in lab and data center environments. The Senior Systems Engineer will support hardware bring-up, validation, and troubleshooting of complex AI compute platforms, including server blades, racks, and rack-scale infrastructure. Collaboration with engineering, platform, and data center teams is key to ensuring the reliability and performance of next-generation AI systems.

Requirements

  • Bachelor’s degree in Electrical Engineering, Computer Engineering, Computer Science, or related discipline.
  • Strong experience with server hardware architectures and board-level debugging.
  • Experience analyzing system logs, hardware telemetry, and power/thermal metrics to isolate hardware failures.
  • Hands-on experience with HPC systems, AI compute platforms, or rack-scale infrastructure.
  • Strong collaboration skills and ability to work effectively in fast-paced engineering environments.
  • Excellent written and verbal communication skills.

Nice To Haves

  • Experience supporting prototype or pre-production hardware bring-up.
  • Familiarity with data center facilities, including liquid cooling and power distribution systems.
  • Experience using Python, Bash, or automation tools for hardware validation or troubleshooting.
  • Exposure to structured failure analysis and reliability engineering methodologies.

Responsibilities

  • Lead advanced break-fix troubleshooting for server blades, motherboards, power systems, and rack-scale infrastructure.
  • Support engineering bring-up activities, including component validation and firmware interaction testing.
  • Diagnose system-level failures involving thermal behavior, power anomalies, network configuration, and BIOS/BMC issues.
  • Collaborate with server engineering teams to perform root cause analysis and propose corrective actions or design improvements.
  • Support deployment and rollout of next-generation hardware platforms through structured validation and qualification cycles.
  • Interface with facilities and infrastructure teams to understand environmental factors impacting system reliability.
  • Develop and maintain standard operating procedures (SOPs), troubleshooting guides, and validation documentation.
  • Provide guidance and mentorship to junior technicians and engineers on troubleshooting methodologies and hardware diagnostics.
  • Participate in on-call rotations or off-hours support during critical engineering milestones or hardware bring-up phases.