About The Position

As a Senior/Staff Engineer on the Foundation Model Compute Infrastructure team, you will lead the design and development of scheduling and orchestration systems for large-scale TPU workloads across multi-region clusters. You will work on distributed systems that manage thousands of accelerators and enable reliable, efficient execution of large-scale training and inference jobs. This role spans scheduling algorithms, cluster lifecycle management, workload orchestration, reliability engineering, and performance optimization.

Requirements

  • 7+ years of industry experience building large-scale distributed systems or cloud infrastructure
  • Strong programming skills in languages such as Python, Go, or C++
  • Extensive experience with compute infrastructure and workload scheduling
  • Strong expertise in distributed systems, scalability, reliability, and performance engineering
  • Experience with Kubernetes, container orchestration, or large-scale cluster management systems
  • Experience designing backend services or infrastructure platforms operating at production scale
  • Strong communication and collaboration skills across engineering and research teams
  • Bachelor’s degree in Computer Science, Engineering, or related field

Nice To Haves

  • Experience building schedulers, resource managers, or orchestration systems for distributed workloads
  • Experience with accelerator infrastructure such as TPUs or GPUs
  • Experience with distributed ML training or inference systems
  • Familiarity with frameworks such as JAX, PyTorch, TensorFlow, Ray, or Pathways
  • Experience operating large-scale multi-tenant infrastructure in cloud or hybrid environments
  • Background in performance optimization, fault tolerance, or resource efficiency for large distributed systems
  • MS or PhD in Computer Science, Engineering, or related field

Responsibilities

  • Lead the design and development of scheduling and orchestration systems for large-scale TPU workloads across multi-region clusters.
  • Build and operate distributed systems that manage thousands of accelerators, enabling reliable, efficient execution of large-scale training and inference jobs.
  • Drive improvements in scheduling algorithms, cluster lifecycle management, workload orchestration, reliability engineering, and performance optimization.
© 2026 Teal Labs, Inc