Senior Site Reliability Engineer

ScalePad • Vancouver, BC

2d • CA$130,000 - CA$165,000 • Hybrid

About The Position

We are looking for a Senior Site Reliability Engineer (SRE) to help strengthen and scale our multi-cloud platform and developer experience. This is a hands-on senior individual contributor role for an engineer who enjoys solving complex infrastructure challenges, improving reliability, and helping teams ship and operate software more effectively. You’ll work closely with engineering leadership and alongside SREs across product domains. Reliability, infrastructure as code, internal tooling, and developer productivity will all be part of your day-to-day focus. You’ll spend your time building, operating, and improving the systems that engineering teams rely on while contributing to best practices and operational excellence across the organization.

Requirements

5+ years of experience in software engineering, infrastructure, or related technical disciplines, with a focus on Site Reliability Engineering (SRE), DevOps, Platform Engineering, or similar roles
Strong expertise in cloud infrastructure, distributed systems, networking, and observability practices
Experience designing and operating highly available, scalable production systems
Deep understanding of scripting, automation, infrastructure as code, CI/CD, and operational best practices
Experience implementing SLO/SLI frameworks and reliability engineering methodologies
Incident management, troubleshooting, and on-call experience in complex production environments
Passion for mentoring engineers and improving engineering culture

Nice To Haves

Experience rolling out AI tooling in an engineering organization
Experience leading tooling and platform migrations such as Jira, Confluence, or observability stacks
Experience with chaos engineering practices and reliability testing
Experience optimizing large-scale cloud infrastructure costs

Responsibilities

Operate production infrastructure across AWS and Azure, including networking, IAM, and cost management
Build and operate Terraform modules and state at scale, keeping our infrastructure as code clean and reviewable
Run Kubernetes in production: upgrades, scaling, troubleshooting, and platform improvements
Operate and improve CI/CD pipelines that the entire engineering org depends on
Operationalize SLO/SLI frameworks and observability practices alongside the SRE team
Drive incident response practice, on-call tooling, and incident review follow-through
Reduce operational toil through automation across secret rotation, access management, and environment provisioning
Contribute to capacity planning, disaster recovery, and resilience work across critical systems
Build and maintain internal developer tooling that removes friction across engineering
Lead rollouts of AI-native tooling for code review, testing, and engineering productivity, e.g., CodeRabbit, Copilot-class assistants, and internal AI workflows
Own migrations and consolidation of internal platforms such as Jira, Confluence, ticketing, and documentation systems
Partner with engineering and product leadership to identify and remove the biggest developer experience (DX) bottlenecks, and to align infrastructure and reliability investments with business goals
Mentor engineers and technical leads, fostering growth and knowledge-sharing within the organization
Lead post-mortems and continuous improvement initiatives to strengthen reliability practices
Evaluate and introduce new technologies, tools, and approaches to improve scalability and efficiency
Drive standardization and modernization efforts across infrastructure and operational practices
Lead proof-of-concept and experimentation initiatives to validate new reliability solutions