Site Reliability Engineer, Senior

Booz Allen Hamilton, McLean, VA

About The Position

Making a system more resilient and efficient frees up time and money to build more capabilities. Whether you come from a background in network engineering, systems administration, or software development, if you have a passion for making systems better, this role is for you. As a Site Reliability Engineer (SRE) on the team, you’ll work with civil and defense agencies to develop more robust systems by building resilient infrastructure. You’ll build in redundancy, implement monitoring tools, and automate wherever possible. You’ll reduce toil by scripting routine tasks and automating self-repair. This is an opportunity to apply your expertise in automating application resiliency and in measuring latency and availability across a wide range of applications, while assisting junior engineers and broadening your knowledge base. The role contributes to the country’s technical transformation.

Requirements

  • 5+ years of experience measuring service SLIs using custom metrics, logs, and traces and tools such as Prometheus, Grafana, or OpenTelemetry
  • 5+ years of experience developing Infrastructure as Code (IaC) in Terraform
  • 5+ years of experience automating operational tasks and identifying and reducing toil
  • 5+ years of experience scripting or coding in Python, Go, or Bash
  • 5+ years of experience designing SLIs, SLOs, and error budgets
  • 5+ years of experience deploying code via CI/CD pipelines using GitLab or GitHub
  • Experience working in cloud platforms such as AWS, Azure, or GCP
  • Ability to coach, assist, and serve as the SRE Champion for product teams and support the operations and maintenance of applications and services
  • Ability to obtain a Secret clearance
  • HS diploma or GED

Nice To Haves

  • Experience implementing AIOps
  • Experience integrating with ServiceNow
  • Experience implementing self-healing solutions
  • Experience with application programming interfaces (APIs) and applying advanced SRE practices
  • Knowledge of chaos engineering and resilience testing
  • Ability to work in an Agile environment and produce operational runbooks and playbooks
  • Ability to pay strict attention to detail
  • Cloud Certification

Responsibilities

  • Work with civil and defense agencies to develop more robust systems by building resilient infrastructure
  • Build in redundancy
  • Implement monitoring tools
  • Automate wherever possible
  • Reduce toil by scripting routine tasks and automating self-repair
  • Leverage expertise in automating resiliency in applications
  • Measure latency and availability across a wide range of applications
  • Assist junior engineers
  • Coach, assist, and serve as the SRE Champion for product teams
  • Support the operations and maintenance of applications and services

Benefits

  • Health benefits
  • Life benefits
  • Disability benefits
  • Financial benefits
  • Retirement benefits
  • Paid leave
  • Professional development
  • Tuition assistance
  • Work-life programs
  • Dependent care
  • Recognition awards program