About The Position

We’re seeking a Senior Site Reliability Engineer with deep infrastructure expertise to design, build, and operate the foundational systems that power our products. This role focuses on reliability, scalability, automation, and operational excellence across cloud and on‑prem environments. You’ll work closely with engineering teams to evolve our infrastructure platform, reduce toil, and ensure our systems are robust, observable, and efficient.

Requirements

  • 8+ years of experience in SRE, DevOps, or infrastructure engineering with hands‑on ownership of production systems.
  • Deep knowledge of Linux systems, networking fundamentals, and distributed systems.
  • Proficiency with IaC tools such as Terraform, Ansible, or CloudFormation.
  • Experience with containerization and orchestration (Docker, Kubernetes).
  • Solid programming or scripting skills in Python, Go, Bash, or similar.
  • Hands‑on experience with CI/CD systems and automated deployment pipelines.
  • Strong observability background using Prometheus, Grafana, ELK, OpenTelemetry, or similar.
  • Proven incident management experience in high‑availability environments.

Nice To Haves

  • Experience with hybrid or multi‑cloud environments.
  • Knowledge of infrastructure security including secrets management and zero‑trust principles.
  • Background with large‑scale distributed systems or high‑throughput architectures.
  • Open‑source contributions in SRE, infrastructure, or cloud‑native ecosystems.

Responsibilities

  • Architect and maintain core infrastructure systems across compute, storage, networking, and cloud services.
  • Develop automation and tooling to eliminate manual operations and improve system consistency.
  • Implement and manage infrastructure‑as‑code using modern frameworks and best practices.
  • Drive reliability engineering practices including SLOs, SLIs, error budgets, and incident response.
  • Enhance observability through metrics, logging, tracing, and actionable alerting.
  • Optimize system performance and capacity to support growth and cost efficiency.
  • Lead complex troubleshooting efforts across distributed systems and production environments.
  • Collaborate with engineering teams to ensure infrastructure supports evolving product needs.
  • Strengthen security and compliance posture through hardened infrastructure and best practices.
  • Mentor engineers and contribute to a culture of operational excellence.