About The Position

As a Senior Site Reliability Engineer (SRE) at GitLab, you'll help keep all user-facing services and production systems reliable, scalable, and efficient. Our SREs combine a pragmatic operations mindset with strong software engineering practices to drive automation, reduce toil, and improve resilience across our platform. Within the US Public Sector Services team, you will specialize in operating a large fleet of GitLab environments built to meet FedRAMP compliance requirements. Your role centers on three core responsibilities: automating workflows across the full environment lifecycle, from provisioning new environments to daily operational tasks; deploying features and updates safely and consistently; and maintaining operational excellence across many production environments simultaneously. This specialization combines the demands of operating at scale with the rigor of public sector compliance: you will balance strategic automation development with hands-on operational execution while upholding the strict security and regulatory standards required for government customers.

Requirements

  • Proven ability to operate and troubleshoot production workloads across multiple tenants or environments.
  • Strong hands-on experience with Terraform, including workspace strategies, state management, and automation patterns that scale.
  • Skilled at diagnosing deployment failures, interpreting pod logs, and debugging scheduling issues and rollback scenarios in live environments.
  • Ability to read and debug code in Go and/or Ruby.
  • Experience supporting infrastructure for many customers or environments simultaneously.
  • Able to reason through complex systems and operational challenges.
  • Proven ability to work across teams and with internal or external customers to solve technical problems.

Nice To Haves

  • Experience with Ansible and templating tools like Jsonnet.
  • On-call experience, including leading technical discussions and incident resolution efforts under pressure.
  • Comfortable using GitLab as a daily tool for infrastructure automation, collaboration, and operational workflows.

Responsibilities

  • Design and implement automation that provisions and manages hundreds of isolated GitLab environments using Terraform, Ansible, and Kubernetes.
  • Troubleshoot issues across Kubernetes clusters, cloud services, and GitLab apps—identifying root causes of failed deployments, crash loops, and scheduling conflicts to ensure service continuity.
  • Replace manual workflows with infrastructure-as-code solutions, including automated version upgrades, configuration rollouts, and provisioning pipelines that operate reliably across all tenants.
  • Build observability systems that detect bottlenecks, predict usage trends, and optimize resource consumption using tools like Prometheus, ELK, and Grafana.
  • Lead incident response and postmortem efforts, applying technical depth to resolve issues and establish operational standards that reduce future risk.
  • Influence architectural decisions around automation, scalability, and operational excellence. Partner with engineering teams to improve automation, platform resilience, and production-readiness.

Benefits

  • All remote, asynchronous work environment
  • Flexible Paid Time Off
  • Team Member Resource Groups
  • Equity Compensation & Employee Stock Purchase Plan
  • Growth and development budget
  • Parental leave
  • Home office support