Sr. Manager, SRE/ITOps

Panera Bread

1d•Onsite

About The Position

Panera, LLC is seeking a talented Sr. Manager, SRE/ITOps to lead our Site Reliability Engineering / IT Operations function. This role is responsible for building and mentoring a team of engineers (direct reports), operating and continuously improving production platforms and services, and partnering with Engineering, Security, and Product to deliver reliable, scalable, and cost-effective systems. As a people leader, you will set direction and drive execution across reliability engineering and day-to-day operations. You will establish SLOs/SLIs, lead incident response and continuous improvement, and own the Major Incident Management (MIM) process to ensure clear command, communications, and rapid service restoration for high-severity events. You will also ensure that internal and external services meet reliability, performance, and security expectations while upholding strong engineering and operational excellence principles.

Requirements

7+ years of experience in SRE, production operations, DevOps, or infrastructure engineering, with demonstrated ownership of highly available services.
2+ years of people management experience (or team lead experience with direct coaching responsibility), including hiring and developing engineers.
Experience operating cloud and/or hybrid environments (IaaS/PaaS, microservices), including observability, incident response, capacity planning, and reliability engineering practices.
Hands-on technical depth across systems, networking, security, and databases; ability to dive deep when needed and guide design/operational decisions.
Proficiency with automation, orchestration, and infrastructure as code (e.g., Terraform/CloudFormation, Ansible/Chef/Puppet/Salt, containers/Kubernetes).
Experience with CI/CD practices and operational governance (change management, release management, environment hygiene), balancing delivery speed with reliability.
Strong analytical, troubleshooting, and communication skills, with the ability to align stakeholders during incidents and drive cross-team execution.

Nice To Haves

Experience designing, analyzing, and operating large-scale distributed systems, including disaster recovery and business continuity planning (RTO/RPO).
Strong observability background (monitoring, logging, tracing) and APM tooling such as Dynatrace, New Relic, AppDynamics, Datadog, Splunk, or similar.
Demonstrated ability to influence without authority, lead through ambiguity, and partner effectively with Engineering, Security, and business stakeholders.
Experience establishing operational processes (incident, problem, change) and service management practices (ITIL familiarity a plus).
Budgeting, vendor management, and tool lifecycle management experience (selection, procurement partnership, renewals, and value realization).
Experience building or operating systems in a secure, regulated, or compliant environment (e.g., SOX, PCI, SOC 2), including audit support and control remediation.
A passion for automation and operational excellence, and experience partnering with engineering teams in a DevOps/SRE culture.

Responsibilities

Lead, coach, and develop a team of SRE/ITOps engineers (direct reports), including hiring, onboarding, performance management, career development, and succession planning.
Own operational readiness for production services: capacity planning, change/release readiness, resiliency reviews, and launch approvals in partnership with Engineering and Product.
Define and manage service level objectives and indicators (SLOs/SLIs) and error budgets; monitor availability, latency, and overall system health; and drive improvements based on data and customer impact.
Drive automation to reduce toil and improve scalability, resiliency, and efficiency (infrastructure as code, configuration management, CI/CD enablement, and self-service operational tooling).
Lead incident management, on-call operations, and escalation processes, including ownership of the Major Incident Management (MIM) program: declare/triage severity, run major incident bridges/war rooms, drive cross-team coordination, provide timely stakeholder communications, and facilitate blameless postmortems to ensure corrective actions are prioritized, tracked, and completed.
Establish and continuously improve Major Incident Management standards and readiness (playbooks, roles/RACI, tooling, training and drills).
Track MIM KPIs (MTTA/MTTR, incident frequency/severity), and partner with Engineering and Service Management on problem management and recurring-incident elimination.
Manage operational backlogs and service improvement plans; partner with Security/Compliance to meet audit and control requirements; and manage vendors/tools as needed.