Senior Site Reliability Engineer, Tacoma

Onebrief•Tacoma, WA

55d•Onsite

About The Position

We are hiring a Site Reliability Engineer to join our Infrastructure & Security team. You’ll work closely with fellow SREs, security, and customer success. You will be the first line of support for our mission-critical deployments, and you will be responsible for ensuring best-in-class service quality and issue resolution. You will work in both on-premises DoD environments and AWS cloud environments. Your lessons from the field will shape how our team works, from policy to implementation. In addition to working at the customer site, you will contribute directly to solutions that increase the stability, performance, and security of our deployments, and improve the overall experience of deploying and managing Onebrief on premises.

Requirements

An active Secret clearance.
5+ years in Platform, DevOps, or Site Reliability Engineering with an infrastructure and operations focus.
Proven partner to DevOps/Platform and application teams; collaborates well across functions and shares context openly.
A deep understanding of incident response processes, with experience conducting thorough root cause analyses and driving continuous improvement.
Technical expertise:
Infrastructure as Code: Terraform (or CloudFormation), Ansible.
Containers and orchestration: Kubernetes design, deployment, and operations.
CI/CD: experience building and maintaining pipelines (GitLab CI/CD, Jenkins, GitHub Actions).
Scripting: proficiency with at least one of Python, Go, or Bash.
Cloud: Familiarity with AWS or AWS GovCloud.
Observability: Grafana stack, ELK stack, or Datadog.
Networking fundamentals: core protocols and secure configurations.

Nice To Haves

Experience in DoD environments and compliance frameworks (RMF, STIGs, ICD 503).
GitOps practices and toolchains.
Security‑minded design for sensitive environments.
Experience designing and implementing meaningful SLIs/SLOs (including error budgets) for complex, distributed systems.
Familiarity with on‑prem virtualization (VMware, Proxmox, Nutanix, Hyper-V, etc.).
Service mesh exposure (Istio, Linkerd).
Relevant certifications (e.g., AWS DevOps Engineer, CKA/CKAD).
Active Security+ or another DoD 8570.01-approved security credential, or the ability to obtain a valid credential within 3 months of employment.

Responsibilities

Implementing a World-Class Observability Platform: Design, implement, and manage our monitoring, logging, and alerting stack (e.g., Prometheus, Loki, Alloy, and Grafana). You won't just track metrics; you'll create the actionable insights and automated alerting that allow teams to identify and resolve issues before they impact users.
Defining and Upholding Reliability: Define, measure, and own alerting that feeds into our Service Level Indicators (SLIs) and Service Level Objectives (SLOs), increasing trust internally and externally. You will be the organization's expert on what it means for our systems to be reliable and how to measure it.
Leading Incident Response: Act as an incident responder, and potentially incident commander, during critical incidents. You will lead blameless post-mortems / After Action Reviews (AARs) that identify true root causes and drive automated, long-term solutions to prevent recurrence.
Automating for Scale and Security: Partner with platform engineers to design, build, and manage secure, resilient Kubernetes clusters and cloud/on-prem environments using Infrastructure-as-Code (Terraform, Ansible). You will embed security and compliance controls (RMF, STIGs) directly into this automation.
Eliminating Toil and Scaling the Team: Proactively identify and eliminate operational toil by building automation. You will partner with other teams to share best practices for air-gapped environments and support their readiness for production.