Systems Engineer - AI Infrastructure

Clockwork.io
Palo Alto, CA
$140,000 - $210,000

About The Position

We're building infrastructure for fault-tolerant, high-performance distributed GPU training. You'll work at the intersection of GPU systems, high-speed networking, and distributed coordination—designing and implementing systems that run at scale. This is a systems building role. You'll dig into internals, understand why things break under pressure, and design solutions that handle the messy reality of distributed systems.

Requirements

  • Systems building experience: you've designed and built complex systems, not just deployed or operated them. Examples:
      • Kernel subsystems, device drivers, or OS-level components
      • Distributed storage, databases, or coordination systems
      • Runtimes, profilers, or performance tooling
      • Network stacks, protocols, or high-performance I/O systems
      • Large-scale infrastructure at the systems layer
  • Core technical skills:
      • Strong C/C++ in systems contexts (not just application code)
      • Deep understanding of concurrency, memory models, and failure modes
      • Experience reasoning about distributed system behavior: consistency, ordering, partial failures
      • Comfortable reading and modifying large, unfamiliar codebases
  • Lead design of significant system components
  • Navigate ambiguity and define technical direction
  • Mentor engineers and raise team capabilities
  • 5+ years building systems software

Nice To Haves

  • GPU programming (CUDA) or GPU systems experience
  • High-performance networking (RDMA, InfiniBand)
  • ML framework or runtime internals
  • Cluster scheduling or orchestration systems

Responsibilities

  • Design and implement low-level systems software for GPU clusters
  • Work with internals of frameworks like PyTorch, NCCL, CUDA runtime—not as a user, but modifying and extending them
  • Build components that make large-scale GPU training more reliable and efficient
  • Debug complex distributed/concurrent systems where failures are subtle and non-deterministic
  • Own systems end-to-end: from design through production

Benefits

  • Challenging projects.
  • A friendly and inclusive workplace culture.
  • Competitive compensation.
  • A great benefits package.
  • Catered lunch.

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Education Level: None listed
  • Number of Employees: 11-50
