Senior Site Reliability Engineer

iManage
Toronto, ON
Hybrid

About The Position

SRE is part of a global organization that leverages the latest technology to communicate with colleagues worldwide. We organize ourselves into distributed teams, and SRE teams are anchored to iManage offices across the globe. Tuesdays and Thursdays are dedicated to in-office collaboration, rapid innovation, and developing a sense of belonging at iManage. Mondays and Fridays are reserved for (remote-friendly) focus time to get things done. Have the best of both work styles in a workplace that is intentional about belonging, collaboration, and accomplishment.

Being a Senior Site Reliability Engineer at iManage Means…

You are an engineer, a builder, and a systems thinker. You’ll create middleware and platform guardrails that empower developers to innovate quickly and reliably. You combine deep technical judgment with empathy to eliminate customer pain, especially when working with enthusiastic teams stewarding the world’s most privileged data. You uplift those around you, act as a subject matter expert, mentor others, and drive change. You chase contributing factors over root causes, value code over documentation, and documentation over process. You’ll engage in and often lead architectural discussions, reduce toil, and deliver scalable, resilient platforms that support our customers and organization.

As a Senior SRE, you’ll help scale our cloud platform, collaborate across teams to promote standardization and resiliency, and participate in on-call rotations. You’ll be a key voice in observability, change management, and service scalability, providing guidance during complex technical decisions and high-impact events.

iManage is experiencing explosive growth in its flagship cloud product. We’re seeking senior software and systems engineers specializing in reliability and platform services to join our transformative cloud journey. This requires rethinking technical decisions with a beginner’s mindset and a focus on resilience and sustainability.

If you write code, think in systems, embrace complexity and automation, and are passionate about service resilience and scalability, we want to talk to you.

Requirements

  • Experience writing design documents and postmortems, and refactoring application code.
  • Experience building automation to reduce operational burden or developing internal SaaS tools.
  • Ability to advocate for SRE principles (e.g., SLOs vs SLAs) and introduce them effectively.
  • Experience in public cloud or hosted datacenter environments (Azure and AKS preferred).
  • A passion for collaborative teamwork and influencing reliability best practices across teams.

Nice To Haves

  • Hands-on experience with Linux server stacks (Ubuntu/Debian preferred).
  • Knowledge of cloud provisioning platforms (Terraform preferred).
  • Exposure to configuration management tools (Chef preferred).
  • Experience with containerization/clustering technologies (Docker preferred).
  • Familiarity with observability and alerting tools (Prometheus/Grafana or ELK/EFK).
  • Practical experience with CI/CD pipelines and rollout strategies.
  • A bachelor’s degree (or equivalent experience) in Computer Engineering or related field.
  • Proficiency in one or more programming languages (e.g., Java, Python, Golang).
  • Familiarity with scripting languages (e.g., PowerShell, Bash, Python, Ruby).

Responsibilities

  • Eliminating toil through automation and software development.
  • Partnering cross-functionally with application teams and internal stakeholders.
  • Creating a modern, cloud-native platform that is resilient, cost-effective, and secure by default.
  • Scaling cloud infrastructure to support our Kubernetes-based ecosystem.
  • Maintaining the freshness and utility of platform services.
  • Improving the security posture of our products.
  • Designing automation, orchestration, observability, and disaster readiness into our products.
  • Participating in production support and on-call rotations, providing senior-level guidance during critical events.
  • Leading incident management and post-incident retrospectives, and coaching teams in these practices.

Benefits

  • Flexible work hours
  • Unlimited access to LinkedIn Learning courses and interactive Microsoft courses & training
  • Comprehensive Health/Vision/Dental/Life Insurance
  • Registered Retirement Savings Plan with a company match up to 5%
  • Enhanced leave for expecting parents: 20 weeks at 100% pay for primary leave and 10 weeks at 100% pay for secondary leave
  • Flexible time off policy
  • Multiple company wellness days each year
  • Access to RethinkCare, a global behavioral health platform