Site Reliability Engineer (Engineer Systems Architect 2) - 26999

HII's Mission Technologies division•Hanover, MD

4d•Hybrid

About The Position

Mission Technologies, a division of HII, is hiring a Site Reliability Engineer to support its customer Omni Federal. The position is hybrid, allowing work from home with on-site support at a government facility in Hanover, MD as needed.

As a Site Reliability Engineer (SRE), you aren't just "keeping the lights on." You are applying an engineering mindset to system operations, treating infrastructure as a software problem. Your goal is to build ultra-scalable, highly resilient systems that ensure our warfighters have access to critical data exactly when they need it, without fail.

Why Mission Technologies?

Competitive Salary
401K Match Program
Comprehensive Health Benefits
Paid Time Off
Floating Holidays
Tuition Assistance Program
Student Loan Repayment Program

Requirements

2 years of relevant experience with a Bachelor's degree in a related field; 0 years of experience with a Master's degree in a related field; or a High School Diploma or equivalent and 6 years of relevant experience.
Advanced knowledge of Kubernetes (operators, Helm charts, and cluster scaling).
Strong proficiency in Python, Go, or Ruby for automating operational tasks.
Proven experience with Terraform or Pulumi in an AWS or Azure environment.
Deep understanding of Linux internals (networking, storage, and kernel tuning).
Familiarity with hardening systems according to STIGs or CIS benchmarks.
Must possess an active U.S. Secret clearance with the ability to qualify for a TS/SCI clearance.

Nice To Haves

You have a "break it to fix it" mentality—you enjoy chaos engineering and proactive testing.
You thrive in an environment where "done" means "automated, documented, and resilient."
An active U.S. TS/SCI clearance.

Responsibilities

Systems Engineering: Design and implement self-healing infrastructure that minimizes manual intervention (eliminating "toil").
Performance Management: Define and monitor SLIs, SLOs, and SLAs to ensure mission-critical applications meet performance benchmarks.
Incident Response & Post-Mortems: Lead the charge in troubleshooting production issues, followed by blameless post-mortems to ensure the same bug never bites twice.
Scalability & Orchestration: Manage large-scale Kubernetes clusters, focusing on resource optimization, service meshes (Istio), and high-availability configurations.
Observability: Build comprehensive monitoring and alerting suites using the ELK stack, Prometheus, Grafana, or New Relic to visualize system health in real time.