Senior Manager, Site Reliability Engineer

Eltropy Inc. · Santa Clara, CA
3d · $200,000 - $220,000 · Remote

About The Position

We are seeking a Senior Manager of Site Reliability Engineering to lead and scale our SRE function, ensuring the reliability, availability, performance, and efficiency of our critical systems. This role blends deep technical expertise with strategic leadership, partnering closely with Engineering, Product, Security, and Infrastructure teams to build resilient, scalable platforms that support business growth. As a Senior Manager of SRE, you will define reliability standards, establish operational excellence, and foster a culture of automation, observability, and continuous improvement.

Requirements

  • 8+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering
  • 3+ years in engineering leadership roles
  • Strong background in distributed systems, cloud platforms (AWS, GCP, Azure), and container orchestration (Kubernetes)
  • Hands-on experience with CI/CD, Infrastructure as Code (e.g., Terraform, CloudFormation), and automation
  • Proven experience defining and operating SLOs, SLIs, and error budgets
  • Excellent incident management and root cause analysis skills
  • Strong communication skills with the ability to influence technical and non-technical stakeholders

Nice To Haves

  • Experience supporting large-scale, high-traffic, or mission-critical systems
  • Background in software engineering or systems engineering
  • Experience scaling SRE practices in a fast-growing organization
  • Familiarity with security, compliance, and regulatory requirements
  • Bachelor’s or Master’s degree in Computer Science or a related field (or equivalent experience)

Responsibilities

  • Define and execute the SRE vision, strategy, and roadmap aligned with business objectives
  • Build, mentor, and lead a high-performing team of SRE managers and engineers
  • Establish best practices for reliability, incident management, change management, and capacity planning
  • Serve as a senior technical leader and trusted advisor across the organization
  • Own system reliability metrics, including SLIs, SLOs, and error budgets
  • Lead major incident response, post-incident reviews, and long-term remediation efforts
  • Drive improvements in uptime, latency, scalability, and fault tolerance
  • Influence system architecture to improve resilience, scalability, and operability
  • Champion automation, Infrastructure as Code, and self-service platforms
  • Oversee observability strategy (monitoring, logging, tracing, alerting)
  • Ensure systems are designed for high availability, disaster recovery, and business continuity
  • Partner with Product, Platform, Security, and Compliance teams to meet operational and regulatory requirements
  • Define operational standards, runbooks, and on-call practices
  • Communicate reliability risks, tradeoffs, and performance to executive leadership