About The Position

At ScalePad, we hire thoughtful builders who want their work to matter. Our roles are designed for people who thrive on driving impact, see ambiguity as an opportunity, and believe that raising the bar is a team sport. We don’t bring people in to run playbooks. We hire people who want to rewrite them. And in this role, you’ll get to do that, while shaping the future of managed services for our global partners.
At ScalePad, we’re building more than software; we’re building confidence and clarity for the people who manage the technology businesses rely on every day. Our mission: help MSPs evolve into MVPs (their clients’ most valuable partner). Our tools turn them from reactive service providers into strategic advisors through a consistent, scalable Customer Success motion. Our product suite unifies risk insights, client planning, and service delivery so MSPs can have smarter conversations, show clients their value, and grow their revenue.
But our purpose goes beyond our software. We’re creating a workplace where curious, growth-minded people can do their best work, where ideas are valued, progress is shared, and everyone belongs. Together, we’re creating a future where MSPs don’t just keep businesses running, they help them thrive. We believe that when our partners succeed, we all do.
With offices in Vancouver, Toronto, Montreal, and Phoenix and a global-first mindset, ScalePad has grown into a category leader trusted by 12,000+ partners across 60+ countries. We’ve been recognized for our products and corporate culture by MSP Today, G2, and Great Place to Work™, to name a few.
We're looking for a Staff Site Reliability Engineer (SRE) to be the senior technical anchor across our multi-cloud platform and developer experience. This is a hands-on senior individual contributor role for an engineer who wants to own real systems, unblock teams day to day, and raise the bar on how engineering ships and operates at ScalePad.
You'll work directly with engineering leadership and alongside SREs across product domains. Reliability, infrastructure as code, internal tooling, and developer productivity all sit inside your scope. You'll spend your time building, operating, and improving the systems the rest of engineering depends on.

Requirements

  • 8+ years of experience in software engineering, infrastructure, or related technical disciplines, with at least 5 years focused on Site Reliability Engineering (SRE), DevOps, Platform Engineering, or similar roles
  • Strong expertise in cloud infrastructure, distributed systems, networking, and observability practices
  • Experience designing and operating highly available, scalable production systems
  • Deep understanding of scripting, automation, infrastructure as code, CI/CD, and operational best practices
  • Experience implementing SLO/SLI frameworks and reliability engineering methodologies
  • Incident management, troubleshooting, and on-call experience in complex production environments
  • Proven ability to lead large-scale technical initiatives across multiple teams
  • Track record of cross-team technical influence without formal authority, backed by excellent communication and collaboration skills with both technical and non-technical stakeholders
  • Passion for mentoring engineers and improving engineering culture
  • Demonstrated ability to thoughtfully integrate AI-assisted tooling into engineering and operational workflows to improve efficiency, reliability, and developer experience

Nice To Haves

  • Experience rolling out AI tooling in an engineering organization
  • Experience leading tooling and platform migrations such as Jira, Confluence, or observability stacks
  • Experience with chaos engineering practices and reliability testing
  • Experience optimizing large-scale cloud infrastructure costs

Responsibilities

  • Own production infrastructure across AWS and Azure, including networking, IAM, and cost
  • Build and operate Terraform modules and state at scale, keeping our infrastructure as code clean and reviewable
  • Run Kubernetes in production: upgrades, scaling, troubleshooting, and platform improvements
  • Operate and improve CI/CD pipelines that the entire engineering org depends on
  • Operationalize SLO/SLI frameworks and observability practices alongside the SRE team
  • Own incident response practice, on-call tooling, and incident review follow-through
  • Reduce operational toil through automation across secret rotation, access management, and environment provisioning
  • Execute on capacity planning, disaster recovery, and resilience work across critical systems
  • Build and maintain internal developer tooling that removes friction across engineering
  • Lead rollouts of AI-native tooling for code review, testing, and engineering productivity, e.g., CodeRabbit, Copilot-class assistants, and internal AI workflows
  • Own migrations and consolidation of internal platforms such as Jira, Confluence, ticketing, and documentation systems
  • Partner with engineering and product leadership to identify and remove the biggest DX bottlenecks, and align infrastructure and reliability investments with business goals
  • Mentor engineers and technical leads, fostering growth and knowledge-sharing within the organization
  • Lead post-mortems and continuous improvement initiatives to strengthen reliability practices
  • Evaluate and introduce new technologies, tools, and approaches to improve scalability and efficiency
  • Drive standardization and modernization efforts across infrastructure and operational practices
  • Lead proof-of-concept and experimentation initiatives to validate new reliability solutions

Benefits

  • Employee Stock Ownership Plan (ESOP)
  • RRSP matching
  • Parental leave programs
  • Structured mentorship programs
  • Annual professional development budget
  • Brand new, top-of-the-line hardware and equipment
  • Monthly stipend for hybrid or remote work environment
  • 100% employer-paid benefits