Site Reliability Engineer, Senior

Booz Allen Hamilton • McLean, VA

About The Position

As a Site Reliability Engineer (SRE) on our team, you will work with civil and defense agencies to develop more robust systems by building resilient infrastructure. You will build in redundancy, implement monitoring tools, and automate wherever possible. You will reduce toil by scripting routine tasks and automating self-repair. This is your chance to apply your expertise in automating application resiliency and measuring latency and availability across a wide range of applications, while assisting junior engineers and broadening your knowledge base. Work with us as we help empower our country's technical transformation.

Requirements

5+ years of experience measuring service level indicators (SLIs) from custom metrics, logs, and traces with tools such as Prometheus, Grafana, or OpenTelemetry
5+ years of experience developing Infrastructure as Code (IaC) in Terraform
5+ years of experience automating operational tasks and identifying and reducing toil
5+ years of experience scripting or coding in Python, Go, or Bash
5+ years of experience designing SLIs, SLOs, and error budgets
5+ years of experience deploying code via CI/CD pipelines using GitLab or GitHub
Experience working in cloud platforms such as AWS, Azure, or GCP
Ability to coach, assist, and serve as the SRE Champion for product teams and support the operations and maintenance of applications and services
Ability to obtain a Secret clearance
HS diploma or GED

Nice To Haves

Experience implementing AIOps
Experience integrating with ServiceNow
Experience implementing self-healing solutions
Experience with application programming interfaces (APIs) and applying advanced SRE practices
Knowledge of chaos engineering and resilience testing
Ability to work in an Agile environment and produce operational runbooks and playbooks
Ability to pay strict attention to detail
Cloud Certification

Responsibilities

Work with civil and defense agencies to develop more robust systems by building resilient infrastructure
Build in redundancy
Implement monitoring tools
Automate wherever possible
Reduce toil by scripting routine tasks and automating self-repair
Leverage expertise in automating resiliency in applications
Measure latency and availability across a wide range of applications
Assist junior engineers
Coach, assist, and serve as the SRE Champion for product teams
Support the operations and maintenance of applications and services