DevOps Software Engineer III

ClearEdge, Annapolis Junction, MD

About The Position

Join ClearEdge and be part of a mission-focused team solving some of the DoD’s most complex technical challenges. Every day, ClearEdge supports government and industry customers by delivering innovative solutions that enable critical operations and mission success. ClearEdge offers an extremely competitive benefits package, including a $10k annual training and education benefit, a 10% 401(k) contribution fully vested on day one, annual health and technology allowances, and access to a state-of-the-art technology lab. Learn more at www.clearedgeit.com/careers/

Your Mission

ClearEdge is hiring a Software Engineer III to support DevOps-focused Operations and Maintenance (O&M) efforts for a large, multi-tenant, containerized Kubernetes High Performance Computing as a Service (HPCaaS) platform operating in a Linux environment. This role is entirely focused on DevOps, infrastructure sustainment, automation, and platform operations. You will be responsible for installing, configuring, integrating, monitoring, and sustaining Kubernetes-based infrastructure while ensuring platform reliability, scalability, and performance across mission environments.

Requirements

  • TS/SCI with Polygraph clearance
  • Master’s degree in a related discipline and five (5) years of software engineering (SWE) experience OR
  • Bachelor’s degree in a related discipline and seven (7) years of SWE experience OR
  • Nine (9) years of SWE experience in similar programs
  • DevOps-focused experience supporting Linux systems in production environments
  • Experience scripting and automating operational tasks using Bash and Python
  • Experience with containerization and orchestration technologies including Docker and Kubernetes
  • Experience administering Kubernetes clusters, including bare-metal deployments
  • Experience with Infrastructure as Code and automation tools such as Ansible and Terraform
  • Experience supporting CI/CD pipelines and using Git for version control
  • Experience installing, configuring, and sustaining COTS, GOTS, and FOSS software
  • General understanding of HPC system components including compute, networking, and storage

Nice To Haves

  • Familiarity with Site Reliability Engineering (SRE) principles and practices
  • Experience with the Atlassian Tool Suite including Jira and Confluence
  • Experience using system monitoring tools such as Grafana and Prometheus

Responsibilities

  • Supporting O&M activities for a multi-tenant Kubernetes HPCaaS platform
  • Installing, configuring, integrating, and sustaining Linux-based systems and services
  • Writing and maintaining Bash and Python scripts to automate operational tasks
  • Administering Kubernetes clusters running on bare metal
  • Supporting containerized services using Docker and containerd
  • Managing Infrastructure as Code using tools such as Ansible and Terraform
  • Supporting CI/CD pipelines using tools such as GitLab CI
  • Monitoring system health and performance using Grafana and Prometheus
  • Collaborating with engineers and stakeholders to maintain platform availability and scalability

Benefits

  • $10k annual training and education benefit
  • 10% 401(k) contribution, fully vested on day one
  • Annual health and technology allowances
  • Access to a state-of-the-art technology lab