We are hiring a Site Reliability Engineer to join our Infrastructure & Security team. You'll work closely with fellow SREs, security, and customer success. You will be the first line of support for our mission-critical deployments and responsible for ensuring best-in-class service quality and issue resolution. You will work in both on-premises DoD environments and AWS cloud environments. Your lessons from the field will shape how our team works, from policy to implementation. In addition to working on site with customers, you will contribute directly to solutions that increase the stability, performance, and security of our deployments and improve the overall experience of deploying and managing Onebrief on premises.

You are a force multiplier who views reliability as the most critical feature of any application or platform and believes that "reliability beats novelty." You see infrastructure and operability as a product to be automated, documented, and continuously improved, always leaving systems easier to operate than you found them. You are equally comfortable leading a post-incident review, designing SLOs in a system design session, or diving into a kubectl shell to triage a complex production issue.

You don't just fix problems; you translate constraints and failure modes into clear, automated guardrails and scalable, resilient architecture. For you, robust monitoring, actionable alerting, and insightful runbooks are core parts of the engineering process, not afterthoughts. You mentor others, fostering a culture of blameless postmortems and proactive reliability. You collaborate naturally with application and platform teams, helping them move quickly but safely by building the tools, processes, and observability that make "fast recovery" a reality.
Job Type: Full-time
Career Level: Mid Level
Industry: Publishing Industries
Education Level: No Education Listed
Number of Employees: 101-250 employees