Escalation Manager

OneStream Software

9d • $104,000 - $130,000 • Remote

About The Position

The Escalation Manager is responsible for overseeing the resolution of critical customer-impacting issues across the OneStream Cloud platform. This role serves as the operational incident leader for high-severity events, ensuring incidents are managed with urgency, clear ownership, structured coordination, and transparent communication. The Escalation Manager acts as the central coordination point during major incidents, partnering closely with Cloud Operations, Support, Platform Engineering, Cloud Engineering & Development, and Customer Success to drive rapid resolution and maintain customer confidence during complex situations.

In addition to incident coordination, this role drives continuous improvement of escalation and incident management processes. The Escalation Manager helps mature operational practices by improving incident response frameworks, strengthening root cause analysis discipline, and identifying systemic reliability improvements that reduce recurring incidents.

The ideal candidate brings strong technical operations knowledge, excellent communication skills, and experience leading incident response in high-availability SaaS or cloud environments. A passion for operational excellence, customer experience, and data-driven improvement is essential for success.

Requirements

Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related technical field, or equivalent professional experience.
5+ years of experience in cloud operations, incident management, site reliability engineering, or technical escalation management.
Proven experience coordinating and managing high-severity incidents in production cloud or SaaS environments.
Strong understanding of cloud infrastructure, distributed systems, networking fundamentals, and enterprise SaaS operations.
Experience coordinating cross-functional technical teams during complex production incidents.
Demonstrated experience operating incident management platforms used to coordinate major incident response (e.g., PagerDuty, Opsgenie, ServiceNow, or similar).
Experience using observability and monitoring tools to support incident diagnosis and response.
Demonstrated ability to communicate effectively with both technical teams and executive stakeholders during high-impact situations.
Strong analytical and problem-solving skills with the ability to drive root cause analysis and systemic resolution of operational issues.

Nice To Haves

Experience working in enterprise SaaS, cloud-hosted application environments, or managed or cloud service providers (MSP/CSP).
Experience operating within Microsoft Azure environments.
Familiarity with incident management and problem management frameworks and practices, such as ITIL or SRE.
Experience working with observability platforms such as Datadog, New Relic, Prometheus, Grafana, or similar monitoring ecosystems.
Experience contributing to reliability engineering initiatives focused on improving service availability and operational maturity.
Relevant certifications such as Azure Fundamentals, Azure Administrator, ITIL Foundation, or reliability engineering certifications.

Responsibilities

Lead the operational management of high-severity incidents and customer escalations across the OneStream Cloud platform.
Serve as the central coordination point during critical incidents, ensuring appropriate teams are engaged and resolution efforts remain focused and efficient.
Act as the incident manager during major incidents, maintaining situational awareness, coordinating response activities, and ensuring accountability for resolution actions.
Facilitate incident response calls, coordinate technical teams, and maintain executive-level communication during major incidents.
Clearly identify and assign resolution ownership to reduce ambiguity during incidents.
Ensure customers receive timely updates, clear communication, and strong ownership throughout the escalation lifecycle.
Own the operational incident lifecycle, including incident declaration, coordination, escalation, communication, and post-incident review.
Drive root cause analysis (RCA) processes and ensure corrective and preventative actions are implemented and tracked to completion.
Track and manage escalated issues to resolution while identifying patterns, systemic risks, and recurring operational gaps.
Develop and improve incident management frameworks, escalation procedures, severity definitions, and operational runbooks.
Partner with cross-functional teams to reduce recurring incidents through automation, resiliency improvements, and architectural enhancements.
Monitor escalation metrics and operational KPIs, including MTTR, incident frequency, and customer impact.
Lead post-incident reviews and drive accountability for operational improvements.
Own and drive measurable incident outcomes, including reduction in MTTR and reduction of recurring incidents.
Collaborate with Customer Support, Cloud Operations, and Engineering teams to improve the customer experience during major incidents.
Maintain and evolve documentation for incident response procedures, escalation workflows, and communication templates.
Identify opportunities to improve monitoring, operational tooling, and incident coordination practices.
Contribute to reliability and operational maturity initiatives aligned with Site Reliability Engineering (SRE) practices.
Provide guidance and mentorship to engineers and support personnel on escalation and incident management best practices.