Site Reliability Engineer, Senior

Booz Allen Hamilton • McLean, VA

64d

About The Position

Making a system more resilient and efficient frees up time and money to build more capabilities. Whether you come from a background in network engineering, systems administration, or software development, if you have a passion for making systems better, this role is for you. As a Site Reliability Engineer (SRE) on the team, you’ll work with civil and defense agencies to develop more robust systems by building resilient infrastructure. You’ll build in redundancy, implement monitoring tools, and automate wherever possible. You’ll reduce toil by scripting routine tasks and automating self-repair. This is an opportunity to apply your expertise in automating application resiliency and in measuring latency and availability across a wide range of applications, while assisting junior engineers and broadening your knowledge base. The role contributes to the country’s technical transformation.

Requirements

5+ years of experience measuring service level indicators (SLIs) using custom metrics, logs, and traces with tools such as Prometheus, Grafana, or OpenTelemetry
5+ years of experience developing Infrastructure as Code (IaC) in Terraform
5+ years of experience automating operational tasks and identifying and reducing toil
5+ years of experience scripting or coding in Python, Go, or Bash
5+ years of experience designing SLIs, SLOs, and error budgets
5+ years of experience deploying code via CI/CD pipelines using GitLab or GitHub
Experience working in cloud platforms such as AWS, Azure, or GCP
Ability to coach, assist, and serve as the SRE Champion for product teams and support the operations and maintenance of applications and services
Ability to obtain a Secret clearance
HS diploma or GED

Nice To Haves

Experience implementing AIOps
Experience integrating with ServiceNow
Experience implementing self-healing solutions
Experience working with application programming interfaces (APIs) and applying advanced SRE practices
Knowledge of chaos engineering and resilience testing
Ability to work in an Agile environment and produce operational runbooks and playbooks
Ability to pay strict attention to detail
Cloud certification

Responsibilities

Work with civil and defense agencies to develop more robust systems by building resilient infrastructure
Build in redundancy
Implement monitoring tools
Automate wherever possible
Reduce toil by scripting routine tasks and automating self-repair
Apply expertise in automating application resiliency
Measure latency and availability across a wide range of applications
Assist junior engineers
Coach, assist, and serve as the SRE Champion for product teams
Support the operations and maintenance of applications and services