Platform Operations Manager

Leidos•Bethesda, MD

Hybrid

About The Position

Leidos is seeking a TS/SCI-cleared Platform Operations Manager to join a high-impact team focused on the design, development, and deployment of a modern technology stack for the DOMEX Data Discovery Platform (D3P) Modernization Program. This role supports the customer's mission to centralize and standardize the Tasking, Collection, Processing, Exploitation, and Dissemination (TCPED) of Open Source Intelligence (OSINT) across the Defense Intelligence Enterprise. The position offers a flexible schedule: most work is performed on-site in Bethesda, MD, with some remote work possible. The Platform Operations Manager will be responsible for ensuring the availability, reliability, and performance of a full-stack, containerized microservices platform, fostering a strong DevSecOps culture, and collaborating with engineering, development, security, and operations teams.

Requirements

BS in Engineering, Computer Science, Systems Engineering, or a related field with 15+ years of relevant experience, or a Master’s with 13+ years; additional experience may substitute for a degree.
Active TS/SCI clearance with the ability to obtain and maintain a polygraph.
At least one DoD 8570.01‑M IAT Level II+ certification (e.g., Security+ CE, CySA+, CCNA Security, SSCP, CISSP (or Associate)).
Ability to obtain Privileged User Account (PUA) certification.
Experience with Kubernetes, GitLab pipelines, Linux, and containerized environments.
Experience supporting enterprise‑scale production systems.
Experience with cloud services (preferably AWS) and cloud infrastructure.
Familiarity with Elasticsearch, PostgreSQL, Logstash, Kibana, and Keycloak.
Demonstrated success in cross‑functional coordination and execution.
Team leadership and line management experience.
Strong communication skills and the ability to perform under pressure during incidents.

Nice To Haves

Experience with Agile methodologies.
Development experience (Bash, PowerShell, SALT, Python, Groovy, Java, etc.).
Experience with Appian or other low‑code platforms.
Experience with technologies such as Kafka, AMQP/JMS, Prometheus/Grafana, GPU‑based Kubernetes, SALT automation, Nexus, or GraphQL.
Knowledge of security best practices (authN/Z, secrets management, data protection).
Infrastructure‑as‑code experience (CloudFormation, Terraform, Pulumi).
AWS cloud certifications.

Responsibilities

Ensuring uptime and performance, and planning capacity, for a large-scale big data production platform with a microservice architecture running on Kubernetes, Elasticsearch, PostgreSQL, and Kafka, built with technologies such as Java, Python, and React and low-code tools like Appian.
Leveraging monitoring tools to proactively detect and resolve issues.
Leading triage, troubleshooting, root-cause analysis, and post-incident reviews.
Defining and tracking reliability metrics (SLIs & SLOs).
Leading a team of system administrators supporting a help desk during core hours; setting technical standards and mentoring staff.
Partnering with systems engineers to design solutions, contribute to documentation, and support architectural alignment.
Participating in release planning, scrums, design sessions, bug triage, and cross-team coordination within a SAFe Agile framework.