Peraton Labs is seeking a poly-cleared Senior HPC DevOps Engineer to own the operations and automation lifecycle for an existing Linux-based HPC/AI compute cluster. You will work closely with Peraton team members, and directly with our Maryland-based customer, in a fast-paced environment. In this role you will codify repeatable operations in Ansible and drive execution through an enterprise automation controller to enforce desired state, detect drift, accelerate node onboarding, and streamline incident response via runbook automation integrated with monitoring and ITSM. This position requires full-time on-site work at a customer site near College Park, MD.

Key responsibilities may include:

Automation ownership: Own and manage automation workflows, including job templates, inventories, credentials, RBAC configurations, execution environments, and promotion across environments.

Desired-state and drift detection: Enforce desired state across cluster services via code-driven configuration; implement drift detection and alert on deviations; reconcile runtime state against configured state.

Compute node onboarding (bare-metal/VM): Build and maintain an automated node bootstrap workflow that installs and configures the OS, applies security and performance baselines, enrolls nodes into the scheduler and shared storage ecosystem, validates hardware and service readiness (CPU, network, accelerator, storage mounts), and reports pass/fail results.

Patch & vulnerability response: Implement rolling maintenance and patch automation to meet defined vulnerability response SLAs. Maintain version-controlled container build definitions and integrate image scanning into the build/release lifecycle.

Logging & observability: Ensure automation and operational workflows emit auditable logs to centralized analytics and integrate with metrics and alerting to enable reliable incident response, proactive detection, and safe auto-remediation.

Incident/problem management: Automate responses to common incidents (hung nodes, storage performance alarms, image vulnerabilities, hardware failures), leveraging out-of-band hardware management interfaces and standardized runbooks.

Docs-as-code: Keep runbooks and operational documentation versioned alongside automation and publish operator guidance to the organization's documentation platform.

This position may be eligible for an increased sign-on bonus. Eligibility, bonus amount, and applicable terms and conditions will be discussed during the recruiting process. #MDFSP
Job Type
Full-time
Career Level
Senior
Education Level
Ph.D. or professional degree