Site Reliability Engineer, Senior

Booz Allen Hamilton • McLean, VA

About The Position

The Opportunity:

Engineering to make a system more resilient and efficient frees up time and money to build more capabilities. Whether you come from a background in network engineering, systems administration, or software development, if you have a passion for making systems better, we need you! As a Site Reliability Engineer (SRE) on our team, you’ll work with civil and defense agencies to develop more robust systems by building resilient infrastructure. You’ll build in redundancy, implement monitoring tools, and automate wherever possible. You’ll reduce toil by scripting routine tasks and automating self-repair. This is your chance to leverage your expertise in automating resiliency in applications and measuring latency and availability across a wide range of applications, while assisting junior engineers and broadening your knowledge base. Work with us as we help empower our country’s technical transformation. Join us. The world can’t wait.

Requirements

5+ years of experience measuring service SLIs from custom metrics, logs, and traces, using tools such as Prometheus, Grafana, or OpenTelemetry
5+ years of experience developing Infrastructure as Code (IaC) in Terraform
5+ years of experience automating operational tasks and identifying and reducing toil
5+ years of experience scripting or coding in Python, Go, or Bash
5+ years of experience designing SLIs, SLOs, and error budgets
5+ years of experience deploying code via CI/CD pipelines using GitLab or GitHub
Experience working in cloud platforms such as AWS, Azure, or GCP
Ability to coach, assist, and serve as the SRE Champion for product teams and support the operations and maintenance of applications and services
Ability to obtain a Secret clearance
HS diploma or GED

Nice To Haves

Experience implementing AIOps
Experience integrating with ServiceNow
Experience implementing self-healing solutions
Experience with application programming interfaces (APIs) and applying advanced SRE practices
Knowledge of chaos engineering and resilience testing
Ability to work in an Agile environment and produce operational runbooks and playbooks
Ability to pay strict attention to detail
Cloud Certification