About The Position

The Data Center Incident Program Manager is responsible for designing, operating, and continuously improving the end-to-end incident management lifecycle across mission-critical data center environments. This role owns the “before, during, and after” mechanics of incidents: establishing standards and playbooks in steady state, serving as (or designating) Incident Commander during active events, and driving structured post-incident review and corrective action to closure. The ideal candidate brings operational credibility in hyperscale or mission-critical infrastructure, demonstrates calm leadership during high-pressure events, and has a strong bias toward structured documentation, process clarity, and measurable improvement.

Requirements

  • 7+ years in mission-critical infrastructure, data center operations, or reliability engineering
  • Direct experience leading major incidents (P1/P0 equivalent)
  • Strong familiarity with facilities systems, hardware operations, or network infrastructure
  • Demonstrated experience running war rooms and executive updates
  • Experience conducting root cause analysis and corrective action tracking
  • Ability to remain calm and decisive under high-pressure conditions

Nice To Haves

  • Experience in hyperscale or high-density AI compute environments
  • Background in facilities commissioning, facility operations, hardware operations, or network reliability
  • Familiarity with ISO-based quality systems or structured operational documentation frameworks
  • Experience implementing incident tooling (PagerDuty, ServiceNow, Jira, etc.)

Responsibilities

  • Define and maintain incident severity levels (SEV definitions), classification criteria, and escalation thresholds.
  • Establish end-to-end incident response standards: protocols, lifecycle stages (declare → stabilize → mitigate → recover → close), and operating cadence.
  • Build and maintain governance artifacts: runbooks, war room formats, reporting templates, and decision/communication standards.
  • Create and operationalize notification trees, stakeholder comms templates (initial, periodic updates, recovery/closure), and executive escalation criteria.
  • Define clear RACI across Facilities, Hardware Ops, Network, Security, and vendor/partner teams, including handoffs and accountability paths.
  • Set and manage SLAs/OLAs for acknowledgment, escalation, containment, mitigation, and reporting.
  • Implement and run incident management tooling (ticketing, paging, logging) and ensure integrations with monitoring and workflow systems.
  • Establish dashboards and program health metrics to track incident performance and readiness.
  • Lead readiness activities: tabletop exercises, cross-functional simulations, IC/Deputy training, and a rotating on-call IC bench with certification standards.
  • Serve as Incident Commander as needed: declare severity, stand up the war room, assign functional leads, and drive structured execution under pressure.
  • Maintain real-time documentation (decisions, timelines, impact scope) and ensure clear restoration objectives and scope control during active events.
  • Run post-incident reviews (PIRs), validate timelines, drive structured RCA (e.g., 5 Whys, Fault Tree), and distinguish root cause from contributing factors.
  • Define corrective/preventative actions (CAPAs), assign accountable owners, track to verified closure, and escalate overdue actions.
  • Publish trend reporting (incident taxonomy, counts by severity, MTTA/MTTR, repeat failure domains) and feed systemic gaps back into design and operations teams.