Platform Operations Manager

Leidos•Bethesda, MD

$154,050 - $278,475•Hybrid

About The Position

Leidos is excited to present an opportunity for a TS/SCI‑cleared Platform Operations Manager to join a high‑impact team driving the design, development, and deployment of a modern technology stack supporting the DOMEX Data Discovery Platform (D3P) Modernization Program. This role directly supports our customer’s mission to centralize and standardize the Tasking, Collection, Processing, Exploitation, and Dissemination (TCPED) of Open Source Intelligence (OSINT) across the Defense Intelligence Enterprise. You’ll be part of a mission‑focused, solutions‑oriented team that values inclusion, innovation, collaboration, and continuous professional growth. While the majority of work is performed on‑site at our customer location in Bethesda, MD, we offer a flexible schedule, and some tasks may be completed remotely.

As a Platform Operations Manager, you will ensure the availability, reliability, and performance of a full‑stack, containerized microservices platform. You’ll help cultivate a strong DevSecOps culture and collaborate closely with systems engineering, architecture, development, security, operations, and integration teams in a fast‑paced environment.

You will partner with a multidisciplinary team of systems engineers, developers, integrators, and system administrators to lead efforts in the following areas:

System Reliability & Performance: Ensuring uptime, performance, and capacity planning for a large‑scale big data production platform with a microservice architecture running on Kubernetes, Elasticsearch, PostgreSQL, Kafka, and technologies such as Java, Python, React, and low‑code tools like Appian
Monitoring & Observability: Leveraging monitoring tools to proactively detect and resolve issues
Incident Response: Leading triage, troubleshooting, root‑cause analysis, and post‑incident reviews
SLIs & SLOs: Defining and tracking reliability metrics
Management Oversight: Leading a team of system administrators supporting a help desk during core hours; setting technical standards and mentoring staff
Technical Leadership: Partnering with systems engineers to design solutions, contribute to documentation, and support architectural alignment
SAFe Agile: Participating in release planning, scrums, design sessions, bug triage, and cross‑team coordination

You bring enthusiasm, strong collaboration skills, and the ability to work effectively with teammates across varying technical backgrounds.

Requirements

BS in Engineering, Computer Science, Systems Engineering, or related field (or equivalent experience) with 15+ years of relevant experience; 13+ years with a Master’s; additional experience may substitute for a degree
Active TS/SCI clearance with the ability to obtain and maintain a polygraph
At least one DoD 8570.01‑M IAT Level II+ certification (e.g., Security+ CE, CySA+, CCNA Security, SSCP, or CISSP, including Associate)
Ability to obtain Privileged User Account (PUA) certification
Experience with Kubernetes, GitLab pipelines, Linux, and containerized environments
Experience supporting enterprise‑scale production systems
Experience with cloud services (preferably AWS) and cloud infrastructure
Familiarity with Elasticsearch, PostgreSQL, Logstash, Kibana, and Keycloak
Demonstrated success in cross‑functional coordination and execution
Team leadership and line management experience
Strong communication skills and the ability to perform under pressure during incidents

Nice To Haves

Experience with Agile methodologies
Development experience (Bash, PowerShell, Salt, Python, Groovy, Java, etc.)
Experience with Appian or other low‑code platforms
Experience with technologies such as Kafka, AMQP/JMS, Prometheus/Grafana, GPU‑based Kubernetes, Salt automation, Nexus, or GraphQL
Knowledge of security best practices (authN/Z, secrets management, data protection)
Infrastructure‑as‑code experience (CloudFormation, Terraform, Pulumi)
AWS cloud certifications

Responsibilities

Ensuring uptime, performance, and capacity planning for a large‑scale big data production platform with a microservice architecture running on Kubernetes, Elasticsearch, PostgreSQL, Kafka, and technologies such as Java, Python, React, and low‑code tools like Appian
Leveraging monitoring tools to proactively detect and resolve issues
Leading triage, troubleshooting, root‑cause analysis, and post‑incident reviews
Defining and tracking reliability metrics
Leading a team of system administrators supporting a help desk during core hours; setting technical standards and mentoring staff
Partnering with systems engineers to design solutions, contribute to documentation, and support architectural alignment
Participating in release planning, scrums, design sessions, bug triage, and cross‑team coordination