Site Reliability Engineer, Senior

Booz Allen Hamilton, McLean, VA
Posted 1 day ago

About The Position

The Opportunity:

Engineering to make a system more resilient and efficient frees up time and money to build more capabilities. Whether you come from a background in network engineering, systems administration, or software development, if you have a passion for making systems better, we need you! As a Site Reliability Engineer (SRE) on our team, you’ll work with civil and defense agencies to develop more robust systems by building resilient infrastructure. You’ll build in redundancy, implement monitoring tools, and automate wherever possible. You’ll reduce toil by scripting routine tasks and automating self-repair. This is your chance to leverage your expertise in automating resiliency in applications and measuring latency and availability across a wide range of applications while assisting junior engineers and broadening your knowledge base.

Work with us as we help empower our country's technical transformation. Join us. The world can’t wait.

Requirements

  • 5+ years of experience measuring service SLIs using custom metrics, logs, and traces and tools such as Prometheus, Grafana, or OpenTelemetry
  • 5+ years of experience developing Infrastructure as Code (IaC) in Terraform
  • 5+ years of experience automating operational tasks and identifying and reducing toil
  • 5+ years of experience scripting or coding in Python, Go, or Bash
  • 5+ years of experience with designing SLIs, SLOs, and error budgets
  • 5+ years of experience deploying code via CI/CD pipelines using GitLab or GitHub
  • Experience working in a cloud platform such as AWS, Azure, or GCP
  • Ability to coach, assist, and serve as the SRE Champion for product teams and support the operations and maintenance of applications and services
  • Ability to obtain a Secret clearance
  • HS diploma or GED

Nice To Haves

  • Experience implementing AIOps
  • Experience integrating with ServiceNow
  • Experience implementing self-healing solutions
  • Experience with application programming interfaces (APIs) and applying advanced SRE practices
  • Knowledge of chaos engineering and resilience testing
  • Ability to work in an Agile environment and produce operational runbooks and playbooks
  • Ability to pay strict attention to detail
  • Cloud Certification

Responsibilities

  • Building resilient infrastructure
  • Building in redundancy
  • Implementing monitoring tools
  • Automating wherever possible
  • Scripting routine tasks and automating self-repair
  • Automating resiliency in applications
  • Measuring latency and availability across a wide range of applications
  • Assisting junior engineers and broadening your knowledge base

Benefits

  • health
  • life
  • disability
  • financial
  • retirement benefits
  • paid leave
  • professional development
  • tuition assistance
  • work-life programs
  • dependent care
  • recognition awards program