About The Position

The Lead Site Reliability Engineer is a senior technical leader responsible for elevating the reliability, availability, and operational maturity of our SaaS platform. This role sets engineering standards for the entire SRE organization, drives platform-wide initiatives, and leads the execution of cross-team, high-impact reliability projects. You will partner closely with SRE managers, engineering teams, platform owners, and incident/problem management to shape how reliability is built, measured, and continuously improved across all services.

Requirements

  • Bachelor’s degree in Computer Science or Information Systems, or equivalent experience.
  • 6+ years in SRE, platform engineering, or cloud reliability roles.
  • Expert-level proficiency in public cloud ecosystems (AWS, GCP, Azure).
  • Advanced programming/scripting experience (Python, Go, Java, or similar).
  • Deep experience with monitoring, automation, CI/CD, and observability tools.
  • Proven success leading complex cross-functional engineering initiatives.
  • Outstanding communication skills for both technical and executive-level audiences.

Nice To Haves

  • Experience defining SRE organizational standards or building an SRE practice.
  • Hands-on experience with Kubernetes, microservices, Terraform, or Ansible.
  • Strong background in distributed systems and fault-tolerant architectures.

Responsibilities

  • Define, maintain, and evangelize SRE standards, frameworks, and best practices across the entire SRE organization.
  • Establish consistent patterns for SLI/SLO design, observability instrumentation, incident response, readiness reviews, and postmortem quality.
  • Partner with architecture and engineering leadership to ensure reliability is embedded in solution design.
  • Lead multi-team efforts to reduce toil, improve quality, and increase service resilience.
  • Lead large-scale, strategic SRE initiatives requiring alignment across multiple SRE and engineering teams.
  • Serve as the technical owner for cross-functional reliability projects—including scope, timelines, and technical decisions.
  • Provide deep technical guidance on cloud architecture, distributed systems reliability, and automation patterns.
  • Create advanced observability dashboards and distributed tracing solutions to provide visibility across product lines.
  • Automate manual operational processes to eliminate toil and increase efficiency across teams.
  • Lead and mentor engineers in performance analysis, capacity planning, and reliability-focused system design.
  • Drive consistency and maturity in monitoring and alerting implementations across services.
  • Oversee and elevate blameless incident response and ensure high-quality postmortems across SRE teams.
  • Partner with Incident & Problem Management to identify systemic weaknesses and lead long-term remediation.
  • Provide highest-tier on-call leadership for critical incidents, guiding teams to reduce MTTR and prevent outages.
  • Mentor senior and mid-level SREs, uplift team capability, and provide technical coaching and training.
  • Review complex engineering work and provide robust, actionable feedback.
  • Help teams develop and adopt operational playbooks, engineering processes, and shared troubleshooting libraries.