Enterprise Operations Center Specialist - Senior

SAIC•Washington, DC

Onsite

About The Position

SAIC is seeking an Enterprise Operations Center (EOC) Specialist to fill a government support role. This position is located in Washington, DC, at the Department of Transportation (DOT) Headquarters building. The EOC operates 24 hours per day, 7 days per week, including all federal holidays, using appropriate monitoring tools and following standard incident management processes.

Event & Availability Monitoring: Lead and supervise proactive, real-time monitoring of enterprise infrastructure and services using automated monitoring/alerting platforms. Triage and validate events from automated tools and external providers (e.g., AT&T), perform directed checks of critical systems, and drive corrective actions per SOPs and runbooks.

Requirements

Early analysis and command-level validation
Advanced troubleshooting & diagnostics
Escalate & coordinate resolution
Incident Command & communications
Technical leadership & decision-making
RCA ownership & knowledge capture
Hands-on support & physical data center operations
Process & documentation stewardship
Reporting & metrics
Mentorship & continuous improvement
Experience with monitoring tools and incident management processes
Experience with automated monitoring/alerting platforms
Experience with AT&T or similar external providers
Experience with ServiceNow
Experience with ITTSM tickets
Experience with Root Cause Analysis (RCA)
Experience with knowledge management repositories and SOPs
Experience with data center operations
Experience with SOPs, playbooks, escalation matrices, contact lists, and IMC process documentation
Experience generating operational reports and KPI dashboards

Responsibilities

Perform day-to-day activities required to monitor systems for events or alerts.
Coordinate and manage the resolution of events and alerts.
Monitor and identify problem areas and coordinate their resolution.
Apply advanced technical concepts, processes, practices, and procedures to complex technical assignments, and lead others in these activities.
Lead and supervise proactive, real-time monitoring of enterprise infrastructure and services using automated monitoring/alerting platforms.
Triage and validate events from automated tools and external providers (e.g., AT&T), perform directed checks of critical systems, and drive corrective actions per SOPs and runbooks.
Perform initial technical triage, determine event severity, and coordinate with POCs to confirm impact and scope.
Execute network and system diagnostics (ping, traceroute, packet captures, router/switch log/interface analysis, host/service health checks); interpret telemetry and correlate multi-source logs to identify root causes or escalation requirements.
Own escalation path: contact and liaise with DOT Tier III teams, assign and manage ITTSM tickets in ServiceNow (create, route, and track), and open/manage tickets with outside vendors (e.g., AT&T). Ensure SLA-driven escalation and follow-through.
Initiate and anchor the Critical Incident Management process and Incident Response Bridge; act as Incident Commander or Operations Lead as required, coordinate cross-functional responders, take and distribute bridge notes, and update outage communications in real time.
Make authoritative operational decisions during incidents, delegate technical tasks, and direct remediation or containment actions while maintaining chain-of-command communications with senior stakeholders.
Lead or coordinate Root Cause Analysis (RCA) production: gather forensic data, assign sequential RCA IDs, document findings/actions, identify actionable remediation items, and migrate validated content into the knowledge management repository and SOPs.
Provide on-site technical support for ExecHelp and Tier III teams during off-hours; perform authorized hands-on interventions at the Data Center, escort un-badged personnel as required, and execute hardware/system-level recoveries.
Create, update, and enforce SOPs, playbooks, escalation matrices, contact lists, and IMC process documentation; maintain remote site POC and topology data.
Generate and distribute operational reports (daily/weekly), executive incident summaries, COE Morning summary report, and KPI dashboards tracking MTTR, MTTD, incident frequency, and SLA compliance.
Mentor junior EOC analysts, lead shift handoffs, drive post-incident reviews, and sponsor automation/prioritization efforts to reduce noise and improve mean-time-to-resolution.