Senior Site Reliability Engineer

SecurityScorecard
Austin, TX
$152,000 - $195,000

About The Position

As a Senior Site Reliability Engineer, you will be a key technical leader driving the design and optimization of our Kubernetes-based infrastructure and CI/CD systems. You will also own the infrastructure behind our AI tooling — building MCP servers and defining safe, auditable AI access patterns for production systems. You'll work hands-on with engineering teams to accelerate delivery, ensure production reliability, and embed best practices for automation, observability, and resilience.

Requirements

  • 6+ years in SRE, DevOps, or Infrastructure roles, with significant production Kubernetes experience.
  • Hands-on experience integrating AI/LLM tooling into engineering or operational workflows (e.g., MCP servers, AI agents acting on infrastructure), and a clear grasp of the security and governance considerations of giving AI access to production.
  • Proven success building CI/CD pipelines (GitHub Actions, Jenkins, GitLab CI, or similar).
  • Strong knowledge of Kubernetes internals and managed services such as EKS, GKE, or AKS.
  • Expertise with Infrastructure as Code (Terraform, Helm, Pulumi) and GitOps.
  • Proficient in Python, Bash, or Go.
  • Knowledge of observability tooling (Prometheus, Grafana, Datadog, OpenTelemetry).
  • Production experience with Kafka, Flink, and ClickHouse.
  • Strong communication and cross-team collaboration skills.

Nice To Haves

  • Multi-region or multi-cluster Kubernetes experience.
  • Chaos engineering or resilience testing.
  • Security scanning, compliance automation, or policy-as-code.
  • LLM observability/tracing tooling (LangSmith, Langfuse) or MLOps workflows.
  • Contributions to open-source Kubernetes or CI/CD projects.

Responsibilities

  • Design, build, and scale Kubernetes infrastructure for secure, multi-tenant, high-availability applications.
  • Build and operate AI tooling infrastructure — stand up MCP servers and establish secure, governed AI access and guardrails for production systems.
  • Optimize and maintain CI/CD pipelines, improving reliability, speed, and rollback safety.
  • Implement progressive delivery strategies such as blue/green and canary deployments.
  • Advance Infrastructure as Code with Terraform, Helm, and Argo CD, defining reusable patterns for the org.
  • Operate and optimize streaming and analytics infrastructure: Kafka, Flink, and ClickHouse.
  • Build automated testing into the CI/CD lifecycle.
  • Improve system observability — define SLOs, alerts, and dashboards.
  • Lead incident response and postmortems, focusing on root cause and durable fixes.
  • Mentor engineers across teams on Kubernetes, CI/CD, and cloud infrastructure.

Benefits

  • Competitive salary
  • Stock options
  • Health benefits
  • Unlimited PTO
  • Parental leave
  • Tuition reimbursement