About The Position

We are seeking a senior system software engineer to work on next-generation Data Center GPU diagnostics for rack-scale AI supercomputer systems. Our charter is to build applications and compute workloads that test and heavily stress GPU compute engines, HBM memory, cache hierarchy, PCIe/NVLink interfaces, power delivery, and thermal behavior, to use those applications in silicon/system bring-up, and to package these tools for manufacturing and customer use. The best candidates will have strong experience writing low-level diagnostic, performance, or stress software for complex hardware systems, ideally including experience with GPUs, CUDA kernels, GEMM-style workloads, NCCL communication patterns, CPUs, NICs, or high-speed interconnects such as PCIe. Excellent interpersonal skills are essential, as this role will involve mentoring other engineers and collaborating with hardware architecture, silicon validation, manufacturing, and field teams. In addition, the engineer will extensively use their knowledge of operating systems, computer architecture, GPU memory, voltage/frequency behavior, thermal limits, high-speed buses, and modern AI development and analysis tools to efficiently validate and test next-generation processors and systems. Join an exciting, rewarding, and fast-paced environment!

Requirements

  • BS or MS degree in Electrical Engineering, Computer Engineering, Computer Science, or equivalent experience.
  • 12+ years of system software, GPU software, embedded software, or hardware validation experience.
  • Experience driving technical work across multiple engineers, mentoring others, or leading development of a complex software component.
  • Experience writing diagnostics and stress tests that interface to low-level hardware drivers and hardware registers.
  • Strong C/C++ and Python programming skills.
  • Experience with Linux device drivers, CUDA kernels, GPU compute workloads, or related accelerator programming is strongly preferred.
  • Understanding of memory systems, ECC behavior, cache hierarchy, bandwidth bottlenecks, and hardware failure signatures.
  • Understanding of GEMM-style workloads and how workload shape, precision, runtime, and verification affect compute stress, power, memory, and thermal behavior.
  • Experience with voltage/frequency characterization, thermal testing, power stress, or related silicon validation concepts such as Vmin/Fmax and P-state testing.
  • Background with PCIe, NVLink, or networking technologies such as InfiniBand and Ethernet.

Responsibilities

  • Working closely with hardware architecture, driver, manufacturing, and field teams throughout the product development lifecycle of rack-scale AI systems.
  • Crafting CUDA/C++ diagnostic workloads and the software infrastructure required for new chip development, validation, productization, and field triage.
  • Designing and implementing GPU compute tests that stress Tensor Cores, SMs, L2/cache hierarchy, HBM memory, and related power/thermal operating points.
  • Developing and tuning GEMM-style diagnostic workloads, including tests combined with additional load on NVLink, PCIe, or CPU subsystems.
  • Developing and integrating higher-level AI workload tests, including PyTorch-based large model workloads, to stress GPUs, memory, interconnects, thermals, and system software under realistic rack-scale AI use cases.
  • Assessing new hardware features and architecting manufacturing and field diagnostic tests using pre-beta GPU drivers, low-level diagnostic software, and system telemetry.
  • Debugging failures involving ECC, HBM behavior, thermal limits, voltage/frequency margining, and PCIe/NVLink errors.

Benefits

  • equity
  • benefits