Site Reliability Engineer, Senior

Booz Allen Hamilton, McLean, VA

About The Position

As a Site Reliability Engineer (SRE) on our team, you will work with civil and defense agencies to develop more robust systems by building resilient infrastructure. You will build in redundancy, implement monitoring tools, and automate wherever possible. You will reduce toil by scripting routine tasks and automating self-repair. This is your chance to leverage your expertise in automating application resiliency and in measuring latency and availability across a wide range of applications, while assisting junior engineers and broadening your knowledge base. Work with us as we help empower our country's technical transformation.

Requirements

  • 5+ years of experience measuring service SLIs using custom metrics, logs, and traces and tools such as Prometheus, Grafana, or OpenTelemetry
  • 5+ years of experience developing Infrastructure as Code (IaC) in Terraform
  • 5+ years of experience automating operational tasks and identifying and reducing toil
  • 5+ years of experience scripting or coding in Python, Go, or Bash
  • 5+ years of experience designing SLIs, SLOs, and error budgets
  • 5+ years of experience deploying code via CI/CD pipelines using GitLab or GitHub
  • Experience working in cloud platforms such as AWS, Azure, or GCP
  • Ability to coach, assist, and serve as the SRE Champion for product teams and support the operations and maintenance of applications and services
  • Ability to obtain a Secret clearance
  • HS diploma or GED

Nice To Haves

  • Experience implementing AIOps
  • Experience integrating with ServiceNow
  • Experience implementing self-healing solutions
  • Experience with application programming interfaces (APIs) and applying advanced SRE practices
  • Knowledge of chaos engineering and resilience testing
  • Ability to work in an Agile environment and produce operational runbooks and playbooks
  • Ability to pay strict attention to detail
  • Cloud Certification

Responsibilities

  • Work with civil and defense agencies on the development of more robust systems by building a resilient infrastructure
  • Build in redundancy
  • Implement monitoring tools
  • Automate wherever possible
  • Reduce toil by scripting routine tasks and automating self-repair
  • Leverage expertise in automating resiliency in applications
  • Measure latency and availability across a wide range of applications
  • Assist junior engineers
  • Coach, assist, and serve as the SRE Champion for product teams
  • Support the operations and maintenance of applications and services

Benefits

  • Health benefits
  • Life benefits
  • Disability benefits
  • Financial benefits
  • Retirement benefits
  • Paid leave
  • Professional development
  • Tuition assistance
  • Work-life programs
  • Dependent care
  • Recognition awards program