Director, IT Event and Problem Management

The Hartford
Hartford, CT
Hybrid

About The Position

The Director of IT Major Incident Management (MIM) and Problem Management leads a modern, autonomous service operations team that uses Agentic AI to transform incident response from reactive, manual effort into proactive, intelligent resolution. This role uses AI agents to automate complex ITIL workflows, including detection, diagnosis, communication, and remediation, to reduce mean time to resolution (MTTR) and improve service availability.

Requirements

  • Crisis leadership and composure
  • Executive presence and communication
  • Conflict resolution under pressure
  • Highly organized, structured, process‑driven mindset
  • Strong ITSM tool proficiency (e.g., ServiceNow)
  • Deep ITIL expertise (Incident, Event, Problem Management)
  • Expertise in monitoring, observability, and synthetic tools
  • AIOps‑based alert correlation and automation
  • Scripting/automation supporting incident response
  • Extensive cloud and infrastructure operations experience
  • Strong understanding of distributed systems and system design
  • Advanced troubleshooting and RCA
  • Data‑driven operational analysis
  • Strong technical documentation and communication
  • System Design & Architecture
  • Cloud Proficiency
  • Data Analysis
  • Infrastructure & Production Operations
  • Bachelor’s degree in Computer Science, Engineering, or related field
  • 10+ years in IT Operations / Incident / Problem Management
  • Leadership experience in large‑scale, 24x7 production environments
  • Candidates must be authorized to work in the US without company sponsorship.

Responsibilities

  • Autonomous Incident Response: Utilizing AI agents to analyze incidents, identify root causes, and suggest or execute remediation steps, shifting the team from manual troubleshooting to managing AI-driven resolution workflows.
  • Proactive Problem Management: Implementing Agentic AI to analyze incident data for trends, identify recurring issues before they cause major outages, and automate the creation of problem records.
  • Autonomous Communications & Reporting: Deploying AI agents to draft incident updates, notify stakeholders, and document post-incident reviews, ensuring speed and consistency in communication.
  • AI Governance & Trust: Establishing trust layers and human-in-the-loop controls for AI actions, balancing speed with governance, security, and risk.
  • Operational Excellence & Strategy: Defining AI-driven metrics (e.g., automated resolution rate) and aligning AI platform strategy with business outcomes to enhance IT resilience.
  • Leadership: Lead high‑severity incidents with calm, decisive crisis leadership.
  • Process Ownership: Own Event, Incident, and Problem Management frameworks aligned to ITIL.
  • Compliance: Enforce structured execution, roles, and accountability across operations.
  • Collaboration: Coordinate across applications, infrastructure, cloud, security, and vendors.

Benefits

  • short-term or annual bonuses
  • long-term incentives
  • on-the-spot recognition