Monitoring & Incident Management Manager (Remote)

Oxley Enterprises®, Inc.
Stafford, VA
$70,612 - $102,584
Remote

About The Position

Protect the operational continuity of a platform that Veterans depend on every day. As the Monitoring & Incident Management Manager, you will lead 24/7 monitoring operations, incident response governance, and observability strategy for a mission-critical cloud environment supporting the Department of Veterans Affairs (VA). The Monitoring & Incident Management Manager serves as the lead for all platform and application monitoring, incident detection, response coordination, and operational situational awareness while ensuring production issues are detected proactively, escalated, and resolved.

Requirements

  • 5 years of experience in cloud platform operations, monitoring engineering, and incident management and response operations
  • Excellent experience managing enterprise monitoring and incident response operations for complex, mission-critical systems
  • Excellent experience with modern monitoring, observability, and incident management practices
  • Excellent experience with enterprise monitoring platforms (e.g., Dynatrace, Splunk)
  • Excellent experience in dashboard design, alert configuration, and observability best practices
  • Excellent knowledge of the four Golden Signals (latency, error rate, saturation, and volume) and incident-free availability measurement across complex distributed systems
  • Excellent ability to manage incident response operations including Priority Troubleshooting Calls (PTC) participation, Office of Information & Technology (OI&T) Major Incident Management (MIM) coordination, and executive communication during critical events
  • Excellent experience establishing and maintaining actionable alert thresholds, on-call rotation schedules, and escalation procedures for 24/7 coverage
  • Above-average knowledge of AWS GovCloud monitoring capabilities, CloudWatch, and integration with third-party observability tools in a FedRAMP environment
  • Above-average ability to produce incident reports including executive summaries, root cause analysis, timeline of events, corrective actions, and lessons learned
  • Working knowledge of ServiceNow, Jira-based service request workflows, and Federal incident reporting requirements
  • Experience supporting Federal Government programs and enterprise-scale applications operating in cloud or hybrid environments
  • Excellent verbal and written communication skills
  • Active Federal Civilian Public Trust clearance
  • U.S. citizenship, or permanent residency with at least 3 years of residence in the United States

Responsibilities

  • Maintains regular communication with the Contracting Officer's Representative (COR) and Government technical leadership regarding operational health, incident status, and service restoration activities
  • Governs all platform and application monitoring, ensuring automated alerts detect production issues before users report them through tickets
  • Maintains 24/7 active alert monitoring coverage
  • Delivers the Capabilities and Services Monitoring Plan defining alert conditions, thresholds, escalation paths, and on-call coverage for all capabilities
  • Oversees delivery and maintenance of the Capabilities and Services Dashboard displaying real-time latency, error rate, saturation, volume, and incident-free availability for all services
  • Ensures immediate response to critical service requests
  • Coordinates and leads all PTC and OI&T Major Incident Management (MIM) events regardless of culpability
  • Delivers bi-weekly and ad hoc Incident Report Briefings
  • Presents all incident reports and responds to Government questions with qualified subject matter experts (SMEs)
  • Maintains a complete, auditable alert log including alerted system, alert description, timestamps, corrective actions, and responsible system
  • Coordinates with Site Reliability Engineers (SREs), DevSecOps, and Architecture teams to align monitoring requirements across all tenant environments

Benefits

  • Medical, dental, vision and prescription drug coverage for you and your family.
  • Life insurance, short-term disability, and long-term disability coverage paid for by the Company.
  • Supplemental coverages including Accident, Critical Illness, and Hospital.
  • Additional life insurance coverage for you and your dependents.
  • 401k plan with various options to select based on your retirement goals.