ITSM Incident & Problem Manager

Convera
Santa Ana, CA
Hybrid

Requirements

  • 3–6 years of experience in Incident Management, Major Incident / Command Center operations, and production operations or site reliability support
  • Proven experience managing high-severity incidents in 24×7 environments
  • Demonstrated ownership of service reliability and operational KPIs
  • Strong working knowledge of ITIL / ITSM frameworks
  • Deep hands-on experience with Incident Management, Major Incident workflows, and Problem Management
  • Experience enforcing ITSM discipline across distributed technology teams
  • Exceptional communication and facilitation skills
  • Strong analytical mindset with comfort using metrics and dashboards
  • Ability to operate decisively in high-pressure situations
  • Ability to influence outcomes without formal authority
  • Comfortable interfacing with executive leadership

Nice To Haves

  • Experience in regulated or customer-critical environments (FinTech, Payments, SaaS)
  • Exposure to ITSM tools such as ServiceNow and PagerDuty
  • Exposure to monitoring tools such as Datadog, Grafana, and Dynatrace

Responsibilities

  • Serve as the Incident Manager / Major Incident Manager for high-severity and business-impacting incidents by organizing incident bridges and war rooms, driving rapid triage, clear ownership, and timely decision-making
  • Ensure incidents are properly classified, prioritized, and escalated based on impact and urgency
  • Enforce ITIL-aligned Incident and Problem Management practices
  • Ensure accurate and complete documentation within ServiceNow, including impact and affected services, incident timelines, and root cause summaries and follow-ups
  • Act as Problem Manager: identify recurring issues and systemic risks, and ensure RCAs are completed with actionable outcomes
  • Act as a process authority during incidents, ensuring teams adhere to defined ITSM standards
  • Own operational oversight of service availability and reliability: monitor and manage key service health indicators, including service availability and uptime, incident volumes and severity trends, MTTR and MTTD, and SLA and OLA adherence
  • Use observability data to proactively identify service degradation and emerging risks
  • Escalate systemic availability or reliability concerns to leadership with data-backed insights
  • Actively leverage observability platforms (e.g., Grafana, Datadog)
  • Partner with engineering and SRE teams to improve monitoring coverage, alert quality, and signal-to-noise ratio
  • Ensure alerting and escalation via PagerDuty aligns with service criticality
  • Serve as the primary communication lead during incidents: deliver concise, executive-level updates that articulate business impact, current status, mitigation steps, and next milestones
  • Translate complex technical details into clear business language
  • Maintain confidence and composure while engaging senior leaders during high-pressure events
  • Facilitate or support post-incident reviews: identify trends, gaps, and opportunities for process improvement, tooling enhancement, and better operational readiness
  • Contribute to the evolution of Command Center playbooks, runbooks, and response standards

Benefits

  • Market-competitive salary.
  • Great career growth and development opportunities in a global organization.
  • Hybrid schedule with 2 days in the office per week.
  • Generous insurance (health, disability, life).
  • Paid holidays, time off, and leave policies for life events (maternity, paternity, adoption).
  • Paid volunteering opportunities (5 days per year).