Sr. Manager, Site Reliability Engineering (SRE)

Planet Fitness
Hampton, NH
Hybrid

About The Position

The Sr. Manager, Site Reliability Engineering (SRE) leads the strategy, execution, and continuous improvement of reliability, availability, and performance across Planet Fitness’s retail technology ecosystem. This role is responsible for ensuring that both digital platforms and in-club systems run reliably, efficiently, and at scale. This position will lead teams responsible for incident management, observability, platform reliability, and end-to-end technology support. This role operates at the intersection of software engineering and infrastructure, driving automation, reducing toil, and embedding reliability into the software development lifecycle.

Requirements

  • Bachelor's degree in a related field (Computer Science, Computer Engineering, Management Information Systems, etc.) or equivalent work experience
  • 7+ years of experience leading Site Reliability Engineering, DevOps, or Production Engineering teams
  • Strong experience with cloud platforms (AWS, Azure, or GCP) and distributed systems
  • Deep understanding of reliability principles, including SLOs, SLIs, and error budgets
  • Experience with CI/CD pipelines and modern deployment strategies
  • Hands-on experience with observability tools
  • Proven incident management and root cause analysis experience in high-availability environments
  • Experience in retail, eCommerce, or multi-location operations
  • Extremely detail-oriented, efficient, and organized with an exceptional ability to establish priorities and objectives
  • Excellent presentation and communication skills along with the ability to communicate effectively across all levels of the organization
  • Able to establish and maintain effective, collaborative work relationships with diverse individuals, internally and externally
  • Creative, progressive thought leader with the ability to influence at all levels of the organization
  • Excellent leadership skills including the ability to build teams, motivate, guide, and mentor
  • Dedicated learner with a natural curiosity for consistent growth
  • Exhibits comfort, ease, and flexibility working in an extremely fast-paced, ever-changing, deadline-driven environment
  • Cooperative team player with an upbeat, positive, “can-do” attitude!

Responsibilities

  • Define and own Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets across critical systems.
  • Drive reliability engineering practices across the software lifecycle, embedding resilience into system design.
  • Lead performance engineering, capacity planning, and scalability strategies aligned to retail demand cycles.
  • Implement and mature incident management processes, including post-incident root cause analysis (RCA) and continuous improvement loops.
  • Design and implement cross-functional support and escalation processes for IT platforms that align with business and technical goals, including stakeholder alignment across engineering, operations, and support teams.
  • Act as incident commander for high-severity issues, ensuring rapid resolution and clear stakeholder communication.
  • Establish production readiness and operational acceptance criteria for new platforms and services.
  • Champion infrastructure as code (IaC), automation, and self-healing systems to reduce manual intervention and operational toil.
  • Partner with platform engineering to build scalable, resilient cloud-native architectures.
  • Drive adoption of CI/CD pipelines, safe deployment strategies (e.g., canary, blue/green), and automated rollback mechanisms.
  • Implement comprehensive observability (metrics, logs, traces) to provide actionable insights into system health.
  • Standardize monitoring frameworks and alerting strategies aligned to business-critical services.
  • Enable real-time visibility into customer experience across digital and physical retail channels.
  • Partner with technology operations, engineering, product, security, and infrastructure teams to embed reliability into design and delivery.
  • Act as a key liaison during high-impact incidents, communicating clearly with technical and business stakeholders.
  • Align reliability initiatives with business priorities such as uptime during peak retail periods.
  • Partner with security and compliance teams to ensure operational controls meet enterprise standards.
  • Lead a high-performing SRE team with strong engineering and operational capabilities.
  • Mature SRE practices to proactively identify and remediate issues, reduce incidents, and lower MTTR.
  • Foster a culture of accountability, learning, and continuous improvement.
  • Establish and govern operational readiness standards, disaster recovery testing, and resilience validation processes for critical platforms.

Benefits

  • core medical, dental, vision, life and disability
  • supplemental accident, hospital and critical illness coverage options
  • generous time off program (including volunteer time)
  • childcare reimbursement
  • paid parental leave
  • pet care reimbursement
  • tuition reimbursement
  • free Black Card membership
  • learning and development programs
  • engagement activities
  • 401(k) Plan with safe harbor employer matching
  • employee stock purchase plan
  • annual corporate bonus incentive program