Principal Site Reliability Engineer

Gruve · Redwood City, CA
7d · Onsite

About The Position

This role defines organization-wide reliability strategy, architecture, and culture. You will guide large-scale automation programs, lead executive-level incident reviews, manage high-severity incidents, drive SLO governance, and mentor senior leaders and ICs. You will also work with other SREs to set up, maintain, and troubleshoot the stack from bare metal through the application layer.

Requirements

  • 10+ years in SRE, distributed systems, or large-scale infrastructure.
  • Deep mastery of Kubernetes, GPU compute, observability, and public cloud.
  • Proven leadership shipping mission-critical, high-availability systems.
  • Expertise with DGX/HGX, NIMs, Nemotron, GPU operators and exporters.

Nice To Haves

  • Multi-cloud/multi-region architecture leadership and cost/performance optimization.
  • Strong cross-org influence and executive communication skills.

Responsibilities

  • Own the reliability strategy across product and platform teams.
  • Architect global infrastructure spanning Kubernetes, GPU platforms, ML Ops, and observability.
  • Lead chaos/performance engineering and fleet-wide automation initiatives.
  • Partner with executives and engineering leadership; influence roadmaps and resourcing.
  • Establish SLO/error-budget governance and drive reliability best practices across the organization.
  • Engage with customers during key operational events.