Senior Site Reliability Engineer

Branch Metrics•Vancouver, BC

CA$123,000 - CA$160,000•Remote

About The Position

We are seeking a highly experienced Senior Site Reliability Engineer to own the reliability, performance, and operational excellence of our large-scale, distributed infrastructure. You will lead the design and execution of systems that power mission-critical services, shaping engineering practices, influencing architectural decisions, and driving automation and resiliency across the organization.

Requirements

6+ years in SRE, systems engineering, or software engineering roles, ideally within fast-paced, rapidly scaling environments.
Proven track record as a senior reliability or production engineer, with ownership of large, distributed, customer-facing systems.
Expert-level proficiency in Kubernetes, AWS, Linux internals, and distributed systems fundamentals.
Strong programming skills in Go, Python, Java, Kotlin, Bash, or similar languages, with an emphasis on building reliable automation and tooling.
Hands-on experience with modern observability stacks (Prometheus, Grafana, Alertmanager, Loki, PagerDuty).
Familiarity with large-scale data and streaming ecosystems such as Kafka, Spark, Aerospike, FoundationDB, and the broader Hadoop ecosystem.
Deep experience with Terraform, CloudFormation, or related IaC tooling, and the ability to guide teams in IaC best practices.
Proven incident management leadership in production SaaS systems, including on-call excellence, postmortem execution, and long-term reliability improvements.
Exceptional problem-solving skills and the ability to lead complex investigations across multiple system layers.
Strong communication, cross-functional leadership, and ability to influence engineering best practices.
Hands-on expertise with Argo CD, GitOps workflows, and CI/CD architectures.

Responsibilities

Architect, design, and evolve complex distributed systems to improve reliability, operational efficiency, and performance at scale.
Partner closely with product, security, and data engineering teams to translate business needs into resilient and scalable system designs.
Drive reliability through automation and advanced observability, ensuring proactive detection, reduced mean time to recovery, and consistent system hygiene.
Lead and mentor in high-stakes situations, owning debugging efforts for critical issues and establishing durable prevention strategies.
Perform deep infrastructure cost audits, identifying areas of inefficiency and implementing solutions that reduce waste without compromising performance or security.
Own and maintain key distributed data platforms, including Aerospike and FoundationDB, ensuring durability, consistency, and performance.
Guide teams in defining SLIs/SLOs and operational best practices, elevating system reliability and engineering rigor across the org.
Continuously identify and eliminate bottlenecks, improving system throughput, latency, and overall efficiency.
Champion Infrastructure as Code (IaC) to automate provisioning, configuration, and lifecycle management using modern IaC tools and principles.
Lead our GitOps and deployment strategy using Argo CD to implement secure, repeatable, and scalable delivery workflows across Kubernetes environments.