Monitoring & Incident Management Manager

i4DM • Millersville, MD

About The Position

We are seeking an experienced and highly motivated Monitoring & Incident Management Manager to lead enterprise monitoring operations, incident detection, response coordination, and operational situational awareness supporting a mission-critical platform within the Department of Veterans Affairs (VA) environment. In this role, you will serve as the Contractor’s lead responsible for ensuring monitoring and incident management processes effectively support system reliability, operational continuity, and rapid restoration of services across a large-scale, 24x7 enterprise healthcare platform. You will work closely with the Program Manager, Technical Directors, DevSecOps & SRE teams, and VA stakeholders to ensure incidents are proactively identified, escalated, communicated, and resolved in alignment with strict service-level expectations and operational standards.

Requirements

Bachelor’s degree in Information Technology, Computer Science, Engineering, Cybersecurity, or a related field.
5+ years of experience supporting enterprise monitoring, incident management, or operational environments for mission-critical systems.
Strong expertise in ITIL-based incident management processes, escalation procedures, and service restoration practices.
Experience with modern observability and monitoring tools (e.g., logging, metrics, and tracing platforms).
Experience supporting cloud-based or hybrid environments and enterprise-scale application platforms.
Strong communication and coordination skills, with the ability to manage high-pressure operational events across technical and business stakeholders.
Ability to operate in 24x7, SLA-driven environments with strict performance and response requirements.
Candidates must be eligible to obtain and maintain a Public Trust clearance.

Nice To Haves

Experience supporting VA or Federal Government environments, including familiarity with incident management frameworks and operational procedures.
Experience with AIOps concepts and automation tools to enhance monitoring and incident detection.
Familiarity with platforms such as AWS, Kubernetes, and enterprise monitoring tools (e.g., Splunk, Dynatrace, or similar).
Exposure to SAFe Agile, DevSecOps, and Site Reliability Engineering (SRE) practices.
ITIL, SAFe, or related certifications.

Responsibilities

Lead all monitoring operations supporting enterprise platform services and hosted healthcare applications.
Oversee system health, performance, availability, and reliability across cloud-based and platform environments.
Ensure proactive detection of issues through effective monitoring, alerting, and observability practices rather than relying on user-reported incidents.
Drive improvements in monitoring coverage, alert accuracy, and operational visibility across all platform services.
Lead incident management processes, ensuring timely identification, triage, escalation, tracking, and resolution of incidents impacting mission-critical services.
Coordinate and support major incident response activities, including outage management, stakeholder communication, and service restoration.
Ensure incidents are managed in accordance with defined severity levels, response timelines, and escalation procedures.
Oversee root cause analysis, post-incident reviews, and implementation of corrective and preventive actions.
Serve as the primary coordination lead during operational events, ensuring alignment across VA stakeholders, technical leadership, and delivery teams.
Communicate incident status, service impacts, and recovery progress clearly and consistently to stakeholders.
Coordinate rapid response actions during critical incidents to minimize disruption to healthcare services.
Maintain strong collaboration across Program Management, SRE, DevSecOps, and engineering teams.
Partner with DevSecOps, SRE, and engineering teams to enhance observability capabilities, including monitoring, logging, and alerting solutions.
Identify recurring issues, operational trends, and system weaknesses, driving continuous service improvement initiatives.
Support adoption of modern monitoring practices, including automation, event correlation, and AIOps capabilities where applicable.
Improve mean time to detect (MTTD) and mean time to resolve (MTTR) across platform services.
Maintain operational reporting, including incident metrics, system performance trends, and SLA adherence.
Provide regular updates and dashboards to VA stakeholders on operational health and incident trends.
Ensure readiness of incident response procedures, escalation paths, and communication protocols.
Support operational processes aligned with Agile and SAFe delivery environments.