Sr. Program Manager, Incident Management

Zapier•San Francisco, CA

1d•Remote

About The Position

As Zapier expands into the enterprise market, operational rigor matters more than ever. The Sr. Program Manager will own the end-to-end incident management program for Zapier's Product and Engineering organization: response, post-incident learning and actions, and everything in between. You'll report to the Director of Engineering for Internal Platforms & Infrastructure and be the DRI for the program's design, execution, and outcomes. You build the program and leverage AI to scale its impact. We need someone with deep incident management expertise who's comfortable navigating ambiguity and stretching across engineering, support, security, and GTM. You have a thesis on where AI-enabled incident management is going and you'll lead us there. Zapier's product surface is expanding rapidly, and with it the complexity and stakes of incident management. This role grows with that complexity. A year from now, parts of this role may look very different, and you'll be the one driving that change.

Requirements

You have deep incident management experience and you've moved beyond just executing it.
You've built and led incident response programs, post-incident processes, SRE practices, or reliability-focused work.
You know incident management deeply enough to rethink it, not just replicate it.
You've ideally done 0-to-1 work in this space: stood up programs, defined standards, trained responders.
You re-engineer how work happens based on where AI is headed.
You've created repeatable systems (workflows, agents, copilots, or automation) that fundamentally changed how work gets done.
You use AI-native tools (Cursor, Claude Code, or similar) as your default, and orchestrate them into durable capabilities that compound over time.
You have a forward-looking thesis on how AI will reshape your domain and you've already acted on it: stopping legacy work, redesigning processes around what AI makes possible, and redefining what the role itself looks like.
You can quantify the impact on velocity, quality, or organizational capacity.
You iterate, refine, and critically evaluate AI outputs, embedding quality standards and accountability into the systems you build, not just the outputs.
You're a builder, not a specialist.
You have deep expertise in incident management, but you're not rigidly attached to how you've done it before.
You can stretch into adjacent areas (reliability strategy, enterprise readiness, operational tooling) as the role evolves.
You build durable systems that work without you: processes that continue when you're on PTO or move to other work.
You're energized by creating, not just maintaining.
You bring an upstream, systems mindset.
You instinctively look for root causes and design solutions that scale beyond your immediate program.
You understand how the full incident lifecycle (prevention, detection, response, learning) supports customer trust and enterprise readiness.
You influence without authority.
You shape outcomes by building trust.
You know how to build coalitions across engineering, support, security, GTM, and leadership.
You don't just implement change; you lead it and make it stick.
You anticipate resistance, adapt your approach, and help others adopt new ways of working.
You have technical empathy.
You can go toe-to-toe with engineers, support leads, and product leaders to clarify the "why" behind technical tradeoffs and incident decisions.
You understand the role of observability (logs, metrics, traces), SLOs, and thresholds in incident response and prevention, even if you're not the one implementing them.
You bias for velocity and clarity.
You act decisively even in high ambiguity.
When priorities collide, you clarify, decide, and help the org move forward.
You communicate with relentless clarity: you share context and intent early, often, and candidly, especially when it's uncomfortable.
You're analytical and hands-on with data.
You can work directly with data tools (e.g., Databricks, SQL) to build rich reporting and meaningful insights.
You understand incident tooling (incident.io or similar) and how it integrates with Slack, PagerDuty, and on-call workflows.
You work well remotely.
Zapier is 100% remote.
You communicate proactively, write clearly, and know when async works and when to jump on a call.