DevOps Software Engineer III

ClearEdge • Annapolis Junction, MD

About The Position

Join ClearEdge and be part of a mission-focused team solving some of the DoD's most complex technical challenges. Every day, ClearEdge supports government and industry customers by delivering innovative solutions that enable critical operations and mission success. ClearEdge offers an extremely competitive benefits package, including a $10k annual training and education benefit, a 10% 401(k) contribution fully vested on day one, annual health and technology allowances, and access to a state-of-the-art technology lab. Learn more at www.clearedgeit.com/careers/

Your Mission

ClearEdge is hiring a Software Engineer III to support DevOps-focused Operations and Maintenance (O&M) efforts for a large, multi-tenant, containerized Kubernetes High Performance Computing as a Service (HPCaaS) platform operating in a Linux environment. This role is entirely focused on DevOps, infrastructure sustainment, automation, and platform operations. You will be responsible for installing, configuring, integrating, monitoring, and sustaining Kubernetes-based infrastructure while ensuring platform reliability, scalability, and performance across mission environments.

Requirements

TS/SCI with Polygraph clearance
Master's degree in a related discipline and five (5) years of SWE experience OR
Bachelor's degree in a related discipline and seven (7) years of SWE experience OR
Nine (9) years of SWE experience in similar programs
DevOps-focused experience supporting Linux systems in production environments
Experience scripting and automating operational tasks using Bash and Python
Experience with containerization and orchestration technologies including Docker and Kubernetes
Experience administering Kubernetes clusters, including bare-metal deployments
Experience with Infrastructure as Code and automation tools such as Ansible and Terraform
Experience supporting CI/CD pipelines and using Git for version control
Experience installing, configuring, and sustaining COTS, GOTS, and FOSS software
General understanding of HPC system components including compute, networking, and storage

Nice To Haves

Familiarity with Site Reliability Engineering (SRE) principles and practices
Experience with the Atlassian Tool Suite including Jira and Confluence
Experience using system monitoring tools such as Grafana and Prometheus

Responsibilities

Supporting O&M activities for a multi-tenant Kubernetes HPCaaS platform
Installing, configuring, integrating, and sustaining Linux-based systems and services
Writing and maintaining Bash and Python scripts to automate operational tasks
Administering Kubernetes clusters running on bare metal
Supporting containerized services using Docker and containerd
Managing Infrastructure as Code using tools such as Ansible and Terraform
Supporting CI/CD pipelines using tools such as GitLab CI
Monitoring system health and performance using Grafana and Prometheus
Collaborating with engineers and stakeholders to maintain platform availability and scalability