About The Position

We are seeking a seasoned and strategic Sr. Manager, Site Reliability Engineering (SRE) to lead a high-performing team responsible for the reliability, scalability, and performance of our critical systems and services. This role blends deep technical expertise with strong leadership and operational excellence, driving a culture of resilience, automation, and continuous improvement.

Requirements

  • 8+ years of experience in software engineering, infrastructure, or SRE roles, with 3+ years in a leadership capacity.
  • Proven experience managing distributed systems at scale in cloud-native environments (AWS, GCP, or Azure).
  • Strong understanding of observability tools (e.g., Prometheus, Grafana, Datadog), CI/CD pipelines, and infrastructure-as-code.
  • Excellent communication and stakeholder management skills.
  • Experience with agile methodologies and DevOps practices.
  • Experience with Python, PowerShell, or similar scripting languages.
  • Active use of artificial intelligence (AI) tools and platforms to streamline workflows, enhance performance, improve decision-making, and drive innovation across business functions.
  • Curiosity and adaptability in exploring emerging AI technologies, with a mindset for continuous learning and experimentation.

Nice To Haves

  • Experience with Kubernetes, service meshes, and microservices architecture.
  • Familiarity with chaos engineering and resilience testing.
  • Background in performance engineering or capacity planning.

Responsibilities

  • Lead and grow a team of SREs, fostering a culture of ownership, innovation, and accountability.
  • Define and drive the SRE roadmap in alignment with business goals and engineering priorities.
  • Partner with engineering, product, and infrastructure teams to ensure reliability is built into every layer of the stack.
  • Own the availability, latency, performance, and capacity of services across production environments.
  • Implement and evolve SRE best practices including SLIs/SLOs, error budgets, incident response, and postmortems.
  • Drive automation of operational tasks and improve system observability.
  • Lead major incident response efforts, ensuring timely resolution and clear communication.
  • Establish and refine incident management processes, including root cause analysis and follow-up actions.
  • Monitor and report on system health, reliability metrics, and operational KPIs.
  • Champion continuous improvement through blameless postmortems and reliability reviews.
  • Ensure compliance with security, privacy, and regulatory standards.

Benefits

  • Competitive total rewards (base salary + bonus, if applicable)
  • Customizable benefits package (3 medical plans with Health Savings Account company match)
  • Generous paid time off: non-exempt team members start with 3 weeks + 13 paid holidays, including 2 personal floating holidays; exempt team members receive flexible time off + 13 paid holidays
  • Paid parental leave (including maternity + paternity leave)
  • Education assistance opportunities and free LinkedIn Learning access
  • Free mental health and family planning programs, including adoption assistance and fertility support
  • 401(k) program with company match
  • Pet insurance
  • Employee resource groups

What This Job Offers

Job Type

Full-time

Career Level

Mid Level

Industry

Professional, Scientific, and Technical Services

Education Level

No Education Listed

Number of Employees

1,001-5,000 employees
