Senior Site Reliability Engineer, Arlington

Onebrief•Chantilly, VA

32d•Onsite

About The Position

We are hiring a Site Reliability Engineer to join our Infrastructure & Security team. You'll work closely with fellow SREs, security, and customer success. You will be the first line of support for our mission-critical deployments and responsible for ensuring best-in-class service quality and issue resolution. You will work in both on-premise DoD environments and AWS cloud environments. Your lessons from the field will shape how our team works, from policy to implementation. In addition to working at the customer site, you will contribute directly to solutions that increase the stability, performance, and security of our deployments and improve the overall experience of deploying and managing Onebrief on premise.

You are a force multiplier who views reliability as the most critical feature of any application or platform and believes that "reliability beats novelty." You see infrastructure and operability as a product to be automated, documented, and continuously improved, always leaving systems easier to operate than you found them. You are equally comfortable leading a post-incident review, designing SLOs in a system design session, or diving into a kubectl shell to triage a complex production issue.

You don't just fix problems; you translate constraints and failure modes into clear, automated guardrails and scalable, resilient architecture. For you, robust monitoring, actionable alerting, and insightful runbooks are core parts of the engineering process, not afterthoughts. You mentor others, fostering a culture of blameless postmortems and proactive reliability. You collaborate naturally with application and platform teams, helping them move quickly but safely by building the tools, processes, and observability that make "fast recovery" a reality.

Requirements

3 years of experience in Site Reliability Engineering or a related field, with firsthand experience managing mission-critical systems in air-gapped DoD environments.
An active Top Secret security clearance. U.S. citizenship required.
Experience automating software delivery, deployment, and providing documentation and self-service tools for engineering teams and customers.
A strong understanding of Linux, containerization and orchestration, and virtual machines.
Experience with centralized logging, metrics, and observability using tools such as Prometheus, Loki, Grafana, ELK stack, or Datadog.
Networking fundamentals: core protocols and secure configurations.
A deep understanding of incident response processes, with experience conducting thorough root cause analyses and driving continuous improvement.
Clear, concise writing; strong documentation habits and async communication.
Core skills and technologies: VMware, Kubernetes, Docker, Helm, Ansible, Terraform, Linux, AWS, DoD compliance, and monitoring and observability tools.

Nice To Haves

Experience with compliance frameworks (RMF, STIGs/SRGs, ICD 503).
Security-minded design for air-gapped environments.
Active Security+ or another DoD 8570.01-approved security credential, or the ability to obtain a valid credential within 3 months of employment.

Responsibilities

Building a World-Class Observability Platform: Design, implement, and manage our monitoring, logging, and alerting stack (e.g., Prometheus, Loki, Alloy, and Grafana). You won't just track metrics; you'll create the actionable insights and automated alerting that allow teams to identify and resolve issues before they impact users.
Defining and Upholding Reliability: Define, measure, and own alerting that feeds into our Service Level Objectives (SLOs) and increases trust internally and externally. You will be the organization's expert on what it means for our systems to be reliable and how to measure it.
Leading Incident Response: Act as the incident responder and potentially incident commander during critical incidents. You will lead blameless post-mortems / After Action Reviews (AARs) that identify true root causes and drive automated, long-term solutions to prevent recurrence.
Automating for Scale and Security: Partner with platform engineers to design, build, and manage secure, resilient Kubernetes clusters and cloud/on-prem environments using Infrastructure-as-Code (Terraform, Ansible). You will embed security and compliance controls (RMF, STIGs) directly into this automation.
Eliminating Toil and Scaling the Team: Proactively identify and eliminate operational toil by building automation. You will act as a force multiplier by advising other teams on best practices in air-gapped environments and production readiness.
