Observability Engineer

Booz Allen Hamilton•McLean, VA

Remote

About The Position

The Opportunity: Something breaks at 2 AM. Today, a human gets paged. Tomorrow, an AI agent detects the anomaly, correlates the root cause, triggers the remediation, and closes the ticket, all before the first cup of coffee. You are the engineer who builds that tomorrow.

We are seeking a senior Observability Engineer with expertise in both AI technologies and enterprise performance monitoring. This role combines hands-on engineering with AIOps implementation to deliver full-stack visibility across 250+ services. You will lead efforts to implement predictive monitoring and self-healing capabilities that drive down operational costs while increasing system availability by leveraging AI to triage and resolve incidents. You will mentor and supervise engineers, own technical quality, and push the program toward AI-driven observability, with opportunities to build new observability platforms from the ground up as we expand into new environments.

Join us. The world can’t wait.

Requirements

5+ years of experience in enterprise observability, monitoring, and site reliability engineering
Experience architecting and operating Dynatrace for full-stack observability, including agent deployment, distributed tracing, log management, synthetic monitoring, and digital experience monitoring
Experience implementing AIOps workflows, including predictive alerting, anomaly detection, automated remediation, and incident automation
Experience building observability integrations, custom extensions, and infrastructure-as-code using Python, JavaScript, Node.js, and Terraform
Experience building operational and executive dashboards and implementing SLOs and SLAs
Experience working in Agile environments with sprint-based delivery
Knowledge of network monitoring protocols, including SNMP, SNMP traps, NetFlow, and Syslog
Ability to mentor engineers, conduct code reviews, and take accountability for technical delivery and quality
Secret clearance
Bachelor's degree in Computer Science or Information Technology

Nice To Haves

Experience with ServiceNow Event Management, including event rules, alert management rules, alert correlation, threshold tuning, noise reduction, CMDB integration with CI relationships and dependencies, ITOM alignment, automated incident creation, Flow Designer, IntegrationHub, JavaScript, Glide API, HTML or CSS, and ServiceNow platform architecture
Experience developing standardized onboarding processes for integrating new monitoring tools into Event Management with governance, segregation of duties, and compliance documentation
Experience with advanced Dynatrace platform capabilities, including Grail, Smartscape, Davis AI, OpenPipeline, DQL, Workflow Automation, Platform API, AppSec, Session Replay, Grail-powered RUM, AI Observability, and Grail log management
Experience with Dynatrace Intelligence, including Dynatrace Assist, Intelligence Agents, MCP Server integration, and Dynatrace Apps development using the App Toolkit
Experience deploying and building observability platforms from scratch in government cloud environments such as AWS GovCloud, Azure Government, IL4, or IL5, including air-gapped, restricted network, and STIG-hardened deployments
Experience building self-service onboarding portals for application team observability adoption
Experience with open-source observability tooling, including OpenTelemetry, Prometheus, Grafana, ELK, and EFK
Experience with FinOps practices, containerization, and cloud platforms such as AWS, Azure, or GCP
Experience operating Splunk and Splunk Enterprise Security (SIEM), Cribl, and SolarWinds at enterprise scale
Dynatrace Professional or Master Certification or ServiceNow Certified Implementation Specialist - Event Management (CIS-EM) Certification

Responsibilities

Implement predictive monitoring and self-healing capabilities.
Leverage AI to triage and resolve incidents.
Mentor and supervise engineers.
Own technical quality.
Push the program toward AI-driven observability.
Build new observability platforms from the ground up as we expand into new environments.
Architect and operate Dynatrace for full-stack observability, including agent deployment, distributed tracing, log management, synthetic monitoring, and digital experience monitoring.
Implement AIOps workflows, including predictive alerting, anomaly detection, automated remediation, and incident automation.
Build observability integrations, custom extensions, and infrastructure-as-code.
Build operational and executive dashboards.
Implement SLOs and SLAs.
Work in Agile environments with sprint-based delivery.
Conduct code reviews.
Take accountability for technical delivery and quality.
Develop standardized onboarding processes for integrating new monitoring tools into Event Management with governance, segregation of duties, and compliance documentation.
Build self-service onboarding portals for application team observability adoption.