Senior Site Reliability Engineer, US Public Sector Services

GitLab

$124,300 - $266,400

About The Position

As a Senior Site Reliability Engineer (SRE) at GitLab, you'll help keep all user-facing services and production systems reliable, scalable, and efficient. Our SREs combine a pragmatic operations mindset with strong software engineering practices to drive automation, reduce toil, and improve resilience across our platform. Within the US Public Sector Services team, you will specialize in operating a large fleet of GitLab environments built to meet FedRAMP compliance requirements. Your role centers on three core responsibilities: (1) automating workflows across the full environment lifecycle, from provisioning new environments to daily operational tasks; (2) deploying features and updates safely and consistently; and (3) maintaining operational excellence across many production environments simultaneously. This specialization combines the demands of operating at scale with the rigor of public sector compliance, balancing strategic automation development with hands-on operational execution while upholding the strict security and regulatory standards required for government customers.

Requirements

Proven ability to operate and troubleshoot production workloads across multiple tenants or environments.
Strong hands-on experience with Terraform, including workspace strategies, state management, and automation patterns that scale.
Skilled at diagnosing deployment failures, interpreting pod logs, and debugging scheduling issues and rollback scenarios in live environments.
Ability to read and debug code in Go and/or Ruby.
Experience supporting infrastructure for many customers or environments simultaneously.
Able to reason through complex systems and operational challenges.
Proven ability to work across teams and with internal or external customers to solve technical problems.

Nice To Haves

Experience with Ansible and templating tools like Jsonnet.
On-call experience, including leading technical discussions and incident resolution efforts under pressure.
Comfortable using GitLab as a daily tool for infrastructure automation, collaboration, and operational workflows.

Responsibilities

Design and implement automation that provisions and manages hundreds of isolated GitLab environments using Terraform, Ansible, and Kubernetes.
Troubleshoot issues across Kubernetes clusters, cloud services, and GitLab apps—identifying root causes of failed deployments, crash loops, and scheduling conflicts to ensure service continuity.
Replace manual workflows with infrastructure-as-code solutions, including automated version upgrades, configuration rollouts, and provisioning pipelines that operate reliably across all tenants.
Build observability systems that detect bottlenecks, predict usage trends, and optimize resource consumption using tools like Prometheus, ELK, and Grafana.
Lead incident response and postmortem efforts, applying technical depth to resolve issues and establish operational standards that reduce future risk.
Influence architectural decisions around automation, scalability, and operational excellence. Partner with engineering teams to improve automation, platform resilience, and production-readiness.

Benefits

All remote, asynchronous work environment
Flexible Paid Time Off
Team Member Resource Groups
Equity Compensation & Employee Stock Purchase Plan
Growth and development budget
Parental leave
Home office support

What This Job Offers

Job Type

Full-time

Career Level

Senior
