Site Reliability Engineer

Anyscale, Palo Alto, CA

About The Position

Anyscale is seeking a Site Reliability Engineer to lead the technical vision for our Infrastructure team. As a Staff Engineer, you will be responsible for the architectural evolution of our control plane and data plane, ensuring that our "infinite laptop" vision scales to meet the most demanding distributed AI workloads in the world. You will act as a force multiplier, setting the standards for Kubernetes-based cloud-native infrastructure while mentoring engineers and driving cross-functional alignment across the Ray open-source community and our proprietary product teams.

Requirements

  • 5+ years of experience writing high-quality production code and leading complex distributed systems projects.
  • Proven track record of designing and maintaining highly available, scalable, and secure cloud-native platforms (AWS, Azure, or GCP).
  • Deep expertise in Kubernetes-based deployments and container orchestration at massive scale.
  • Advanced knowledge of the Linux kernel, networking, and low-level operating system foundations.
  • Mastery of Go and Python, with the ability to set coding standards and best practices for the team.
  • Demonstrated ability to mentor senior engineers, influence technical direction without direct authority, and navigate complex trade-offs in a fast-paced environment.

Responsibilities

  • Define and drive the multi-year technical roadmap for services that orchestrate Ray clusters across diverse cloud and on-premises environments.
  • Lead the design and optimization of high-performance control plane components specifically tailored for large-scale, heterogeneous AI/ML workloads.
  • Establish organization-wide standards for the reliability, scalability, and observability of Anyscale-managed infrastructure.
  • Direct the long-term strategy for accelerator integration (GPUs, TPUs) and container management to ensure seamless execution of distributed workloads.
  • Lead complex design and architecture discussions, resolving deep technical debt and ensuring engineering excellence across the organization.
  • Partner with ML experts and customer-facing teams to translate market needs into robust infrastructure foundations.