Sr. Manager, Site Reliability Engineer (SRE)

Planet Fitness • Hampton, NH

Hybrid

About The Position

The Sr. Manager, Site Reliability Engineering (SRE) leads the strategy, execution, and continuous improvement of reliability, availability, and performance across Planet Fitness’s retail technology ecosystem. This role is responsible for ensuring that both digital platforms and in-club systems run reliably, efficiently, and at scale. This position will lead teams responsible for incident management, observability, platform reliability, and end-to-end technology support. This role operates at the intersection of software engineering and infrastructure, driving automation, reducing toil, and embedding reliability into the software development lifecycle.

Requirements

Bachelor's degree in a related field (Computer Science, Computer Engineering, Management Information Systems, etc.) or equivalent work experience
7+ years of experience leading Site Reliability Engineering, DevOps, or Production Engineering
Strong experience with cloud platforms (AWS, Azure, or GCP) and distributed systems
Deep understanding of reliability principles, including SLOs, SLIs, and error budgets
Experience with CI/CD pipelines and modern deployment strategies
Hands-on experience with observability tools
Proven incident management and root cause analysis experience in high-availability environments
Experience in retail, eCommerce, or multi-location operations
Extremely detail-oriented, efficient, and organized with an exceptional ability to establish priorities and objectives
Excellent presentation and communication skills along with the ability to communicate effectively across all levels of the organization
Able to establish and maintain effective, collaborative work relationships with diverse individuals, internally and externally
Creative, progressive thought leader with the ability to influence at all levels of the organization
Excellent leadership skills including the ability to build teams, motivate, guide, and mentor
Dedicated learner with a natural curiosity for consistent growth
Exhibits comfort, ease, and flexibility working in an extremely fast-paced, ever-changing, deadline-driven environment
Cooperative team player with an upbeat, positive, “can-do” attitude!

Responsibilities

Define and own Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets across critical systems.
Drive reliability engineering practices across the software lifecycle, embedding resilience into system design.
Lead performance engineering, capacity planning, and scalability strategies aligned to retail demand cycles.
Implement and mature incident management processes, including post-incident root cause analysis (RCA) and continuous improvement loops.
Design and implement cross-functional support and escalation processes for IT platforms that align with business and technical goals, including stakeholder alignment across engineering, operations, and support teams.
Act as incident commander for high-severity issues, ensuring rapid resolution and clear stakeholder communication.
Establish production readiness and operational acceptance criteria for new platforms and services.
Champion infrastructure as code (IaC), automation, and self-healing systems to reduce manual intervention and operational toil.
Partner with platform engineering to build scalable, resilient cloud-native architectures.
Drive adoption of CI/CD pipelines, safe deployment strategies (e.g., canary, blue/green), and automated rollback mechanisms.
Implement comprehensive observability (metrics, logs, traces) to provide actionable insights into system health.
Standardize monitoring frameworks and alerting strategies aligned to business-critical services.
Enable real-time visibility into customer experience across digital and physical retail channels.
Partner with technology operations, engineering, product, security, and infrastructure teams to embed reliability into design and delivery.
Act as a key liaison during high-impact incidents, communicating clearly with technical and business stakeholders.
Align reliability initiatives with business priorities such as uptime during peak retail periods.
Partner with security and compliance teams to ensure operational controls meet enterprise standards.
Lead a high-performing SRE team with strong engineering and operational capabilities.
Mature SRE practices to proactively identify and remediate issues, reduce incidents, and shorten MTTR.
Foster a culture of accountability, learning, and continuous improvement.
Establish and govern operational readiness standards, disaster recovery testing, and resilience validation processes for critical platforms.