About The Position

The Leidos Digital Modernization Sector is seeking an experienced Lead Observability Data Scientist to join the Enterprise Observability Team within the Chief Data & Analytics Office (CDAO), part of CIO Services. The position allows full-time telework from any U.S.-based location.

This is a remote, hands-on, individual contributor role focused on applying advanced analytics, machine learning, generative AI, and agentic AI to one of our largest and most business-critical data assets: enterprise observability telemetry. You will work across logs, metrics, traces, events, and platform signals to uncover patterns, predict performance and reliability risks, prescribe solutions, and accelerate our roadmap for agentic AI-enabled observability. You'll collaborate closely with observability platform engineers, operations teams, and product/business stakeholders to turn complex telemetry into clear, actionable insights.

In this role, you will design and implement advanced analytical approaches to improve detection, prediction, prescription, correlation, and decision-making across the enterprise. You'll work directly with observability tooling such as Splunk, Datadog, Cribl, SolarWinds, and Langfuse, and you'll build models and analytics that scale to high-volume, high-variety telemetry data. This position is ideal for someone who has delivered large-scale predictive analytics in observability/AIOps environments, enjoys deep technical execution, and can connect "machine signals" to operational and business impact.

Requirements

  • Bachelor’s degree in Data Science, Computer Science, Statistics, Engineering, or a related field with 12+ years of relevant experience (additional experience may be considered in lieu of a degree).
  • Demonstrated large-scale observability analytics/AIOps experience working with high-volume telemetry (logs, metrics, traces, events) in complex enterprise environments.
  • Strong programming skills in Python and experience with ML/data science libraries (e.g., pandas, NumPy, scikit-learn; deep learning frameworks a plus).
  • Proven delivery of predictive analytics solutions such as time-series forecasting, anomaly detection, clustering, classification, and statistical modeling (see the illustrative sketch after this list).
  • Ability to move from ambiguous problem statements to working analytics in production-like environments.
  • Excellent written and verbal communication skills; ability to translate analytical output into operational and business impact.
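
For a sense of the hands-on work these requirements describe, the following is a minimal, illustrative Python sketch of anomaly detection on a single telemetry metric using pandas and scikit-learn. The synthetic data, column names (e.g., p95_latency_ms), and model settings are assumptions for illustration only, not a description of Leidos systems or data.

    # Illustrative only: flag anomalous points in a synthetic latency series.
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(42)
    ts = pd.date_range("2024-01-01", periods=1440, freq="min")
    latency = rng.normal(loc=120, scale=10, size=len(ts))
    latency[900:915] += 80  # injected degradation window for illustration
    df = pd.DataFrame({"timestamp": ts, "p95_latency_ms": latency})

    # Simple features: raw value plus rolling statistics to capture drift.
    df["rolling_mean"] = df["p95_latency_ms"].rolling(30, min_periods=1).mean()
    df["rolling_std"] = df["p95_latency_ms"].rolling(30, min_periods=1).std().fillna(0.0)

    # Unsupervised detector; -1 marks points scored as anomalous.
    model = IsolationForest(contamination=0.01, random_state=0)
    df["anomaly"] = model.fit_predict(df[["p95_latency_ms", "rolling_mean", "rolling_std"]])
    print(df.loc[df["anomaly"] == -1, ["timestamp", "p95_latency_ms"]].head())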

Nice To Haves

  • Hands-on experience with Splunk, Datadog, Cribl, SolarWinds, and/or observability evaluation tools such as Langfuse.
  • Experience with distributed tracing and OpenTelemetry concepts (see the brief sketch after this list); familiarity with SRE/incident management practices (SLIs/SLOs, on-call operational models).
  • Experience applying Generative AI in operational contexts (e.g., RAG over incidents/runbooks, automated summarization, intelligent correlation/triage).
  • Experience with streaming and large-scale data platforms (e.g., Kafka, Spark) and cloud services (AWS/Azure/GCP).
  • Familiarity with MLOps patterns: packaging/deploying models, monitoring, retraining triggers, and reproducible experimentation.
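
As a brief illustration of the OpenTelemetry concepts mentioned above, this Python sketch creates nested spans with attributes using the opentelemetry-api and opentelemetry-sdk packages and prints them with a console exporter. The tracer, span, and attribute names are illustrative assumptions.

    # Illustrative only: emit a parent span and a nested child span to stdout.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("observability.demo")

    # Attributes on spans become queryable fields in tracing backends.
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("http.route", "/api/orders")
        with tracer.start_as_current_span("db.query") as child:
            child.set_attribute("db.system", "postgresql")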

Responsibilities

  • Predictive AIOps & Risk Sensing: Build and operationalize models for anomaly detection, forecasting, early incident warning, performance regression detection, saturation/capacity risk, and service health scoring.
  • Cross-Signal Correlation: Correlate logs/metrics/traces/events with topology, deployment, change, and business signals to identify drivers of degradation and reduce time-to-diagnosis.
  • Agentic AI Observability (Hands-On): Prototype and advance agentic workflows that assist with triage, signal enrichment, event clustering, summarization, and guided next-best-action recommendations.
  • Tool-Integrated Analytics: Use and extend enterprise observability platforms (Splunk, Datadog, Cribl, SolarWinds, Langfuse) to extract signals, engineer features, validate hypotheses, and operationalize outcomes.
  • Data Engineering for Analytics: Define data quality checks, feature pipelines, and scalable methods for working with high-volume telemetry (batch and/or streaming), partnering with platform teams as needed.
  • Model Evaluation & Operationalization: Establish model performance measures aligned to operational goals (noise reduction, precision/recall of detections, lead time to failure, MTTR improvements); monitor drift and iterate (see the scoring sketch after this list).
  • Outcome-Focused Communication: Communicate findings to technical and non-technical stakeholders with clear recommendations, tradeoffs, and measurable results.
  • Responsible AI & Governance Alignment: Ensure solutions align with Leidos standards for security, privacy, governance, and responsible AI practices.
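
To make the operational evaluation measures above concrete, here is a minimal Python sketch that scores alert quality against labeled incident windows, computing precision/recall of detections and a simple lead-time figure. The per-minute labeling scheme and column names are illustrative assumptions, not a prescribed Leidos method.

    # Illustrative only: compare fired alerts against labeled incident minutes.
    import pandas as pd
    from sklearn.metrics import precision_score, recall_score

    eval_df = pd.DataFrame({
        "timestamp": pd.date_range("2024-01-01", periods=10, freq="min"),
        "alert_fired":     [0, 0, 1, 1, 0, 0, 1, 0, 0, 0],
        "incident_active": [0, 0, 0, 1, 1, 0, 1, 1, 0, 0],
    })

    precision = precision_score(eval_df["incident_active"], eval_df["alert_fired"])
    recall = recall_score(eval_df["incident_active"], eval_df["alert_fired"])

    # Lead time: first alert relative to incident start (negative = early warning).
    first_alert = eval_df.loc[eval_df["alert_fired"] == 1, "timestamp"].min()
    incident_start = eval_df.loc[eval_df["incident_active"] == 1, "timestamp"].min()
    lead_time = first_alert - incident_start

    print(f"precision={precision:.2f} recall={recall:.2f} lead_time={lead_time}")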

Benefits

  • Employment benefits include competitive compensation, Health and Wellness programs, Income Protection, Paid Leave and Retirement.