ITSM Incident & Problem Manager

Convera•Santa Ana, CA

20h•Hybrid

Requirements

3–6 years of experience in Incident Management, Major Incident / Command Center operations, and production operations or site reliability support
Proven experience managing high-severity incidents in 24×7 environments
Demonstrated ownership of service reliability and operational KPIs
Strong working knowledge of ITIL / ITSM frameworks
Deep hands-on experience with Incident Management, Major Incident workflows, and Problem Management
Experience enforcing ITSM discipline across distributed technology teams
Exceptional communication and facilitation skills
Strong analytical mindset with comfort using metrics and dashboards
Ability to operate decisively in high-pressure situations
Ability to influence outcomes without formal authority
Comfort interfacing with executive leadership

Nice To Haves

Experience in regulated or customer-critical environments (FinTech, Payments, SaaS)
Exposure to ITSM tools such as ServiceNow and PagerDuty
Exposure to monitoring tools such as Datadog, Grafana, and Dynatrace

Responsibilities

Serve as the Incident Manager / Major Incident Manager for high-severity and business-impacting incidents by organizing incident bridges and war rooms and driving rapid triage, clear ownership, and timely decision-making
Ensure incidents are properly classified, prioritized, and escalated based on impact and urgency
Enforce ITIL-aligned Incident and Problem Management practices
Ensure accurate and complete documentation within ServiceNow, including impact and affected services, incident timelines, and root cause summaries and follow-ups
Act as Problem Manager: identify recurring issues and systemic risks, and ensure RCAs are completed with actionable outcomes
Act as a process authority during incidents, ensuring teams adhere to defined ITSM standards
Own operational oversight of service availability and reliability: monitor and manage key service health indicators, including service availability and uptime, incident volumes and severity trends, MTTR and MTTD, and SLA and OLA adherence
Use observability data to proactively identify service degradation and emerging risks
Escalate systemic availability or reliability concerns to leadership with data-backed insights
Actively leverage observability platforms (e.g., Grafana, Datadog)
Partner with engineering and SRE teams to improve monitoring coverage, alert quality, and signal-to-noise ratio
Ensure alerting and escalation via PagerDuty aligns with service criticality
Serve as the primary communication lead during incidents: deliver concise, executive-level updates that articulate business impact, current status, mitigation steps, and next milestones
Translate complex technical details into clear business language
Maintain confidence and composure while engaging senior leaders during high-pressure events
Facilitate or support post-incident reviews: identify trends, gaps, and opportunities for process improvement, tooling enhancement, and better operational readiness
Contribute to the evolution of Command Center playbooks, runbooks, and response standards

Benefits

Market competitive salary.
Great career growth and development opportunities in a global organization.
Hybrid schedule with 2 days in the office per week.
Generous insurance (health, disability, life).
Paid holidays, time-off, and leave policies for life events (maternity, paternity, adoption).
Paid volunteering opportunities (5 days per year).
