Senior Site Reliability Engineer

2K•Austin, TX

Hybrid

About The Position

The Senior SRE at 2K is a hands-on technical leader responsible for shaping production infrastructure across multiple clouds and regions. This role involves partnering with network engineers, systems architects, and game studio developers. It is an ownership role that drives technical direction, influences reliability from architecture review through production operation, and bridges the gap between what engineering ships and what players experience. The 2K SRE team manages the infrastructure for all 2K game services, account platforms, CI/CD pipelines, and developer tooling across AWS, GCP, and on-premises data centers globally. The team takes pride in keeping millions of players connected, focuses post-mortems on systems rather than people, and treats automation as the default answer to repetitive tasks.

Requirements

5+ years in SRE, Platform Engineering, or equivalent infrastructure work at production scale.
Deep Kubernetes experience in cloud environments (EKS or GKE preferred), including networking, storage, and multi-cluster patterns.
Strong proficiency with Terraform and/or Pulumi; hands-on with Helm, Terragrunt, and GitOps tooling (ArgoCD or GitHub Actions).
Experience with modern and legacy tech, including AWS, GCP, VMware, and bare-metal servers.
Server configuration using Ansible, Puppet, and AWS Systems Manager.
Experience with Datadog, Prometheus + Grafana, and OpenTelemetry; fluency in operationalizing SLI/SLO/error budgets inside engineering teams.
Production-quality code in Go, Python, or TypeScript for tools, automation, and internal libraries.
Solid understanding of Linux internals, TCP/IP networking, DNS, and TLS, deep enough to debug at the system level.
Incident response and post-mortem leadership with a track record of systemic follow-through.

Nice To Haves

Live-service game or large-scale consumer internet experience dealing with millions of concurrent users.
Deep knowledge of service mesh (Istio, Cilium) and advanced Kubernetes networking.
Experience with FinOps and managing resources efficiently at cloud scale.
Experience with AI and agentic development.
Cloud certifications (AWS Solutions Architect, GCP Professional Cloud Architect, CKA/CKS, or equivalent).
Experience mentoring SREs or leading reliability working groups.

Responsibilities

Design, build, and operate scalable multi-cloud and hybrid infrastructure using Terraform, Pulumi, and GitOps workflows (ArgoCD, Flux).
Own Kubernetes platforms (EKS, GKE) end to end: cluster lifecycle, multi-tenancy, networking (Istio, Cilium), and autoscaling.
Push progressive delivery patterns (blue/green, canary) across game service deployments.
Build and run the full observability stack: Prometheus + Grafana + Datadog.
Define SLI/SLO/error budget policies and build alerting that cuts through the noise.
Lead chaos engineering exercises to surface failure modes before players encounter them.
Drive incident response and post-mortems with a focus on systemic fixes and real follow-through.
Eliminate toil through self-service provisioning, automated remediation, and intelligent scaling.
Harden CI/CD pipelines (GitHub Actions, Jenkins, ArgoCD).
Embed security at the platform layer through secrets management (PasswordState, 1Password, and AWS Secrets Manager) and policy-as-code (OPA/Gatekeeper).
Promote SRE practices across 2K studios through reliability reviews, runbooks, and embedded collaboration.
Shape architectural decisions and author engineering RFCs that move the platform forward.

Benefits

2K Games and its studios never use instant messaging apps or personal email accounts to contact prospective employees or conduct interviews; when emailing, they use only 2K.com accounts.
