About The Position

We are seeking a seasoned and strategic Senior Manager, Site Reliability Engineering (SRE) to lead a high-performing team responsible for the reliability, scalability, and performance of our critical systems and services. This role blends deep technical expertise with strong leadership and operational excellence, driving a culture of resilience, automation, and continuous improvement.

Requirements

  • 8+ years of experience in software engineering, infrastructure, or SRE roles, with 3+ years in a leadership capacity.
  • Proven experience managing distributed systems at scale in cloud-native environments (AWS, GCP, or Azure).
  • Strong understanding of observability tools (e.g., Prometheus, Grafana, Datadog), CI/CD pipelines, and infrastructure-as-code.
  • Excellent communication and stakeholder management skills.
  • Experience with agile methodologies and DevOps practices.
  • Experience with Python, PowerShell, or similar scripting languages.

Nice To Haves

  • Experience with Kubernetes, service meshes, and microservices architecture.
  • Familiarity with chaos engineering and resilience testing.
  • Background in performance engineering or capacity planning.

Responsibilities

  • Lead and grow a team of SREs, fostering a culture of ownership, innovation, and accountability.
  • Define and drive the SRE roadmap in alignment with business goals and engineering priorities.
  • Partner with engineering, product, and infrastructure teams to ensure reliability is built into every layer of the stack.
  • Own the availability, latency, performance, and capacity of services across production environments.
  • Implement and evolve SRE best practices including SLIs/SLOs, error budgets, incident response, and postmortems.
  • Drive automation of operational tasks and improve system observability.
  • Lead major incident response efforts, ensuring timely resolution and clear communication.
  • Establish and refine incident management processes, including root cause analysis and follow-up actions.
  • Monitor and report on system health, reliability metrics, and operational KPIs.
  • Champion continuous improvement through blameless postmortems and reliability reviews.
  • Ensure compliance with security, privacy, and regulatory standards.

Benefits

  • Competitive total rewards (base salary + bonus, if applicable)
  • Customizable benefits package (3 medical plans with Health Savings Account company match)
  • Generous paid time off for non-exempt team members, starting with 3 weeks + 13 paid holidays, including 2 personal floating holidays
  • Flexible time off for exempt team members + 13 paid holidays
  • Paid parental leave (including maternity + paternity leave)
  • Education assistance opportunities and free LinkedIn Learning access
  • Free mental health and family planning programs, including adoption assistance and fertility support
  • 401(k) program with company match
  • Pet insurance
  • Employee resource groups