Lead Site Reliability Developer - CSRE Consulting

Ticketmaster•Toronto, ON

$120,000 - $150,000•Remote

About The Position

This role is part of the Central SRE Consulting team, which partners with product and platform engineering teams throughout Ticketmaster to improve reliability, resilience, and sustainable engineering practices. The team's goal is to increase the adoption and maturity of SRE principles across Ticketmaster, ensuring services are appropriately scaled and reliable. The team supports global teams, with most teammates operating in UTC/UTC+1 time zones, and is expanding to other time zones.

As a Lead Site Reliability Developer in CSRE Consulting, you will lead reliability consulting work across multiple teams or a domain, aligning stakeholders on priorities and driving the delivery of sustained improvements. You will translate reliability goals into sequenced workstreams, align dependencies, and ensure teams can maintain the mechanisms after your engagement. You will mentor other consultants, codify reusable patterns, and influence shared platforms so reliability improvements propagate beyond any single team or engagement.

Requirements

Deep practical understanding of SRE principles, including SLO governance and error budget policy.
Proven ability to lead cross-team technical work and influence without authority.
Strong experience designing and troubleshooting distributed systems with cross-service failure modes.
Experience shaping observability and alerting strategy and improving operational signal quality.
Strong Kubernetes and AWS experience, including governance and cost trade-offs.
Ability to design reliability automation and tooling that is reusable and adopted by multiple teams.
Experience leading production readiness and resilience practices, including DR validation and controlled testing.
Strong software engineering fundamentals with the ability to deliver and review high-quality changes in enterprise codebases.
Advanced incident analysis skills focused on systemic risk reduction and organizational learning.
Excellent communication skills, including exec-ready summaries and clear technical diagrams.
Lead with service and humility, creating clarity and momentum without relying on authority.
Build relationships across teams and functions, and set clear expectations for how you partner and deliver.
Facilitate alignment by framing problems, surfacing trade-offs, and running working sessions that end in decisions.
Persuade with evidence and empathy, adapting your narrative for engineers, product, and senior stakeholders.
Coach and mentor deliberately, helping others grow in reliability thinking and consulting craft.
Maintain psychological safety while raising standards, giving direct feedback with respect.
Stay persistent and patient in complex organizations, keeping work moving despite slow dependencies.
Hold ambiguity comfortably and turn messy inputs into clear plans, options, and next steps.
Favor simple mechanisms that scale adoption, not bespoke one-offs that require you to maintain them.
Operate at a sustainable pace and discourage hero culture by designing systems that do not need it.
Take pride in quality, including documentation and decision records that help teams sustain the work.
Remain adaptable, switching between hands-on debugging, stakeholder management, and planning as needed.
High level of proficiency in English, both verbal and written.

Nice To Haves

Working knowledge of French.

Responsibilities

Lead consulting work from discovery through delivery by aligning stakeholders on priorities, sequencing work, and communicating measurable outcomes.
Establish working cadence and facilitate decision forums to surface risks, map dependencies, and drive clear ownership and timelines.
Align product, platform, and engineering stakeholders on reliability targets and trade-offs using SLOs and error budgets.
Partner regularly with engineering managers, product managers, staff and principal engineers, and platform leads to keep dependencies, decisions, and delivery aligned.
Identify systemic risks across shared dependencies and coordinate remediation across multiple teams to reduce recurring incidents.
Drive change adoption by embedding reliability mechanisms into partner team routines such as planning, production readiness reviews (PRRs), and on-call practices.
Design and implement reusable reliability mechanisms, templates, and tooling that can be adopted across teams.
Establish and evolve production readiness review practices with partner teams to improve launch quality and change safety.
Drive observability strategy for partner domains by improving signal quality, alerting philosophy, and operational dashboards.
Lead complex incident investigations and ensure learnings translate into durable fixes with clear owners and verification.
Lead reliability-focused design and code reviews and guide teams toward simpler, safer architectures.
Mentor senior engineers and other consultants through pairing, reviews, and structured coaching to multiply impact.
Partner with internal platform engineering to influence roadmaps and deliver shared capabilities that accelerate SRE adoption.
Improve CSRE Consulting playbooks and operating practices based on repeated patterns observed across teams.