Technical Enterprise Incident Manager

Peraton

7d•$86,000 - $138,000

About The Position

We are seeking a highly motivated and technically skilled Technical Enterprise Incident Manager with strong Cloud Platform DevSecOps Engineering and application experience to lead enterprise incident response, service restoration efforts, and operational reliability initiatives. This individual will serve as the central point of coordination during major incidents, ensuring rapid resolution, clear communication, and continuous service improvement across enterprise infrastructure and applications. The ideal candidate possesses a strong operational background, excellent communication skills, and hands-on technical expertise in infrastructure, cloud technologies, monitoring, automation, and IT service management processes. This role requires the ability to drive incident response while also identifying systemic reliability improvements. This position may require participation in an after-hours and weekend on-call rotation supporting enterprise production incidents and critical outage management activities.

Requirements

Bachelor’s degree and 5 years of experience, or a high school diploma and 9 years of experience.
At least 5 years of experience in Cloud Incident Management, Operations Engineering, NOC, SRE, Application or Production Support environments.
Experience leading enterprise Major Incident response efforts in a 24x7 operational environment.
Strong understanding of ITIL Incident and Problem Management processes.
Hands-on experience with infrastructure technologies, including Windows/Linux servers, networking concepts, cloud platforms (AWS, Azure, or GCP), load balancers, proxies, DNS, and firewalls.
Experience with monitoring and observability platforms such as Datadog, Cloudcraft, and CloudWatch.
Experience using Cloudcraft to document and visualize cloud environments and application dependencies.
Experience using ServiceNow or similar ITSM platforms.
Strong analytical, troubleshooting, and organizational skills.
Excellent written and verbal communication skills, with the ability to facilitate meetings and to brief both technical teams and executive leadership.
Must be a US Citizen.
Must be able to obtain and maintain the required agency clearance.

Nice To Haves

Experience in a Site Reliability Engineering (SRE) or Cloud Platform DevOps environment.
Familiarity with CI/CD pipelines and Infrastructure as Code (IaC).
Experience supporting federal, healthcare, financial, or other highly regulated environments.
ITIL Foundation certification preferred.
SRE, cloud, or operational certifications are a plus.

Responsibilities

Lead and coordinate Incident bridge calls involving infrastructure, application, network, cloud, security, and vendor teams.
Drive rapid service restoration while maintaining accurate timelines, communications, and executive updates.
Ensure incidents are prioritized appropriately based on business impact and operational risk.
Manage escalation procedures and engage leadership when required.
Monitor SLA compliance and ensure incident response metrics are consistently achieved.
Improve platform reliability, availability, observability, and operational maturity.
Work with application teams to facilitate issue resolution and implement root cause remediations.
Develop and enhance monitoring, alerting, and dashboarding capabilities.
Analyze trends, KPIs, and operational metrics to proactively identify reliability risks.
Support implementation of resiliency strategies including redundancy, failover, capacity planning, and performance optimization.
Create and maintain cloud architecture and service dependency diagrams using Cloudcraft.
Utilize Datadog for monitoring, alert correlation, dashboards, incident investigation, and performance analysis.
Assist with production readiness reviews and operational acceptance activities.
Participate in after-hours on-call incident management rotation as required.
Develop and maintain incident management procedures, runbooks, and knowledge articles.
Ensure accurate ticket documentation within ServiceNow.
Drive continual service improvement initiatives aligned with ITIL and SRE best practices.
Collaborate with cross-functional teams to improve communication, escalation paths, and operational workflows.
Support audit, compliance, and operational reporting requirements.