Director, IT Event and Problem Management

The Hartford•Hartford, CT

Hybrid

About The Position

The Director of IT Major Incident Management (MIM) and Problem Management leads a modern, autonomous service operations team that uses agentic AI to transform incident response from reactive, manual effort into proactive, intelligent resolution. This role uses AI agents to automate complex ITIL workflows, including detection, diagnosis, communication, and remediation, to reduce mean time to resolution (MTTR) and improve service availability.

Requirements

Crisis leadership and composure
Executive presence and communication
Conflict resolution under pressure
Highly organized, structured, process‑driven mindset
Strong ITSM tool proficiency (e.g., ServiceNow)
Deep ITIL expertise (Incident, Event, Problem Management)
Expertise in monitoring, observability, and synthetic monitoring tools
AIOps‑based alert correlation and automation
Scripting/automation supporting incident response
Extensive cloud and infrastructure operations experience
Strong understanding of distributed systems and system design
Advanced troubleshooting and RCA
Data‑driven operational analysis
Strong technical documentation and communication
Bachelor’s degree in Computer Science, Engineering, or related field
10+ years in IT Operations / Incident / Problem Management
Leadership experience in large‑scale, 24x7 production environments
Candidates must be authorized to work in the US without company sponsorship.

Responsibilities

Autonomous Incident Response: Use AI agents to analyze incidents, identify root causes, and suggest or execute remediation steps, shifting the team from manual troubleshooting to managing AI-driven resolution workflows.
Proactive Problem Management: Implement agentic AI to analyze incident data for trends, identify recurring issues before they cause major outages, and automate the creation of problem records.
Autonomous Communications & Reporting: Deploy AI agents to draft incident updates, notify stakeholders, and document post-incident reviews, ensuring fast, consistent communication.
AI Governance & Trust: Establish trust layers and human-in-the-loop controls for AI actions, balancing speed with governance, security, and risk.
Operational Excellence & Strategy: Define AI-driven metrics (e.g., automated resolution rate) and align AI platform strategy with business outcomes to enhance IT resilience.
Leadership: Lead high‑severity incidents with calm, decisive crisis leadership.
Process Ownership: Own Event, Incident, and Problem Management frameworks aligned to ITIL.
Compliance: Enforce structured execution, roles, and accountability across operations.
Collaboration: Coordinate across applications, infrastructure, cloud, security, and vendors.