Site Reliability Engineer

S&P Global Mobility • London, ON

About The Position

We are looking to hire a Site Reliability Engineer to help build and maintain the observability platform across multiple business lines and to establish observability best practices.

Requirements

Experience in Site Reliability Engineering, DevOps, Platform Engineering, or Software Engineering roles, with meaningful ownership of reliability-focused solutions.
Proven experience building, automating, and maintaining engineering solutions, not just operating existing systems.
Experience writing production-quality code, scripts, or automation. Go is preferred; experience in other languages such as JavaScript/TypeScript or Ruby is also valuable.
Experience managing cloud infrastructure with Infrastructure as Code. Terraform is preferred.
Experience working with AWS and Kubernetes environments, including EKS.
Experience with distributed systems and the trade-offs involved in designing for reliability, resiliency, and durability.
Experience with observability tooling such as Prometheus, Grafana, New Relic, CloudWatch, Google Observability, or similar platforms.
Familiarity with telemetry standards and instrumentation patterns, including OpenTelemetry, is strongly preferred.
Experience designing useful monitoring and alerting for applications and infrastructure, with an understanding of how to balance signal, noise, and actionable response.
Experience with logging and telemetry pipelines at scale.
Strong troubleshooting skills and the ability to work collaboratively during incidents to restore service and address root causes.
Strong communication skills, with the ability to document standards, guide engineering teams, and influence reliability best practices.
A strong bias toward automation, simplification, and reducing toil for yourself and your teammates.

Nice To Haves

Experience working with OpenSearch, Elasticsearch, ELK, or similar logging and search platforms.
Experience with telemetry pipeline design, routing, sampling, or retention decisions.
Experience supporting applications written in Go, JavaScript/TypeScript, or Ruby.
Bindplane experience is a plus.

Responsibilities

Build and improve observability and reliability solutions that help engineering teams operate and support their services with confidence.
Partner with engineering teams to design monitoring, alerting, dashboards, and service health standards early in the software delivery lifecycle.
Write and maintain code, infrastructure definitions, and automation that reduce manual work and improve reliability.
Help engineers instrument services and systems so teams can quickly detect, diagnose, and resolve issues.
Support the adoption and standardization of telemetry patterns across metrics, logs, and traces, including OpenTelemetry-based instrumentation where appropriate.
Improve the reliability of our AWS and Kubernetes environments, including EKS, through durable engineering solutions rather than repetitive operational work.
Participate in incident response and follow-up activities, including troubleshooting, root cause analysis, and the implementation of lasting fixes.
Identify opportunities to reduce toil and improve the developer experience through automation, reusable patterns, and better engineering practices.
Continuously evaluate our tooling, reliability practices, and engineering processes for opportunities to improve.

Benefits

Equal employment opportunity (EEO) to all persons regardless of age, color, national origin, citizenship status, physical or mental disability, race, religion, creed, gender, sex, sexual orientation, gender identity and/or expression, genetic information, marital status, status with regard to public assistance, veteran status, or any other characteristic protected by federal, state or local law.
Reasonable accommodations for qualified individuals with disabilities.
