Senior Site Reliability Engineer

Branch Metrics
Vancouver, BC
CA$123,000 - CA$160,000
Remote

About The Position

We are seeking a highly experienced Senior Site Reliability Engineer to own the reliability, performance, and operational excellence of our large-scale, distributed infrastructure. You will lead the design and execution of systems that power mission-critical services, shaping engineering practices, influencing architectural decisions, and driving automation and resiliency across the organization.

Requirements

  • 6+ years in SRE, systems engineering, or software engineering roles, ideally within fast-paced, rapidly scaling environments.
  • Proven track record as a senior reliability or production engineer, with ownership of large, distributed, customer-facing systems.
  • Expert-level proficiency in Kubernetes, AWS, Linux internals, and distributed system fundamentals.
  • Strong programming skills in Go, Python, Java, Kotlin, Bash, or similar languages, with an emphasis on building reliable automation and tooling.
  • Hands-on experience with modern observability stacks (Prometheus, Grafana, Alertmanager, Loki, PagerDuty).
  • Familiarity with large-scale data and streaming ecosystems such as Kafka, Spark, Aerospike, FoundationDB, and the broader Hadoop ecosystem.
  • Deep experience with Terraform, CloudFormation, or related IaC tooling, and the ability to guide teams in IaC best practices.
  • Proven incident management leadership in production SaaS systems, including on-call excellence, postmortem execution, and long-term reliability improvements.
  • Exceptional problem-solving skills and the ability to lead complex investigations across multiple system layers.
  • Strong communication, cross-functional leadership, and ability to influence engineering best practices.
  • Hands-on expertise with Argo CD, GitOps workflows, and CI/CD architectures.

Responsibilities

  • Architect, design, and evolve complex distributed systems to improve reliability, operational efficiency, and performance at scale.
  • Partner closely with product, security, and data engineering teams to translate business needs into resilient and scalable system designs.
  • Drive reliability through automation and advanced observability, ensuring proactive detection, reduced mean time to recovery, and consistent system hygiene.
  • Lead and mentor in high-stakes situations, owning debugging efforts for critical issues and establishing durable prevention strategies.
  • Perform deep infrastructure cost audits, identifying areas of inefficiency and implementing solutions that reduce waste without compromising performance or security.
  • Own and maintain key distributed data platforms, including Aerospike and FoundationDB, ensuring durability, consistency, and performance.
  • Guide teams in defining SLIs/SLOs and operational best practices, elevating system reliability and engineering rigor across the org.
  • Continuously identify and eliminate bottlenecks, improving system throughput, latency, and overall efficiency.
  • Champion Infrastructure as Code (IaC) to automate provisioning, configuration, and lifecycle management using modern IaC tools and principles.
  • Lead our GitOps and deployment strategy using Argo CD to implement secure, repeatable, and scalable delivery workflows across Kubernetes environments.

Benefits

  • Health and wellness programs
  • Paid time off
  • Retirement planning options