Senior Site Reliability Engineer

SecurityScorecard • Austin, TX

$152,000 - $195,000

About The Position

As a Senior Site Reliability Engineer, you will be a key technical leader driving the design and optimization of our Kubernetes-based infrastructure and CI/CD systems. You will also own the infrastructure behind our AI tooling — building MCP servers and defining safe, auditable AI access patterns for production systems. You'll work hands-on with engineering teams to accelerate delivery, ensure production reliability, and embed best practices for automation, observability, and resilience.

Requirements

6+ years in SRE, DevOps, or Infrastructure roles, with significant production Kubernetes experience.
Hands-on experience integrating AI/LLM tooling into engineering or operational workflows (e.g., MCP servers, AI agents acting on infrastructure), and a clear grasp of the security and governance considerations of giving AI access to production.
Proven success building CI/CD pipelines (GitHub Actions, Jenkins, GitLab CI, or similar).
Strong knowledge of Kubernetes internals and managed services such as EKS, GKE, or AKS.
Expertise with Infrastructure as Code (Terraform, Helm, Pulumi) and GitOps.
Proficient in Python, Bash, or Go.
Knowledge of observability tooling (Prometheus, Grafana, Datadog, OpenTelemetry).
Production experience with Kafka, Flink, and ClickHouse.
Strong communication and cross-team collaboration skills.

Nice To Haves

Multi-region or multi-cluster Kubernetes experience.
Chaos engineering or resilience testing.
Security scanning, compliance automation, or policy-as-code.
LLM observability/tracing tooling (LangSmith, Langfuse) or MLOps workflows.
Contributions to open-source Kubernetes or CI/CD projects.

Responsibilities

Design, build, and scale Kubernetes infrastructure for secure, multi-tenant, high-availability applications.
Build and operate AI tooling infrastructure — stand up MCP servers and establish secure, governed AI access and guardrails for production systems.
Optimize and maintain CI/CD pipelines, improving reliability, speed, and rollback safety.
Implement progressive delivery strategies such as blue/green and canary deployments.
Advance Infrastructure as Code with Terraform, Helm, and Argo CD, defining reusable patterns for the org.
Operate and optimize streaming and analytics infrastructure: Kafka, Flink, and ClickHouse.
Build automated testing into the CI/CD lifecycle.
Improve system observability — define SLOs, alerts, and dashboards.
Lead incident response and postmortems, focusing on root cause and durable fixes.
Mentor engineers across teams on Kubernetes, CI/CD, and cloud infrastructure.