Senior SRE/Platform Engineer

Equifax•Toronto, ON

3d•Onsite

About The Position

Site Reliability Engineering (SRE)/Platform Engineering at Equifax is a discipline that combines software and systems engineering to build and run large-scale, distributed, fault-tolerant systems. SRE ensures that internal and external services meet or exceed reliability and performance expectations while adhering to Equifax engineering principles. SRE is also an engineering approach to building and running production systems: we engineer solutions to operational problems. Our SREs are responsible for overall system operation, and we use a breadth of tools and approaches to solve a broad set of problems. Our practices include limiting time spent on operational work, holding blameless postmortems, and proactively identifying and preventing potential outages.

Our SRE culture of diversity, intellectual curiosity, problem solving and openness is key to its success. Equifax brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big, and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while also striving to build an environment that provides the support and mentorship needed to learn, grow and take pride in our work.

Requirements

7–10+ years of enterprise-scale experience in Platform Engineering, Site Reliability Engineering (SRE), or DevOps.
Proven mastery in managing production-grade environments across AWS and Google Cloud (GCP), plus Azure experience specifically for cost governance.
4+ years of hands-on experience provisioning and managing EKS and GKE clusters, including production upgrades, hardening, and namespace isolation.
Advanced proficiency with Terraform for multi-cloud resource provisioning, utilizing modular, reusable code and state management.
Experience building declarative workflows using ArgoCD or Flux, alongside automated pipelines that integrate security scanning, testing, and validation.
A proven track record of executing Canary deployments for high-traffic online services and Blue-Green deployments for large-scale batch/offline workloads.
Expertise in hybrid architectures (Transit Gateways, Shared VPCs, Direct Connect/Cloud Interconnect) combined with Kubernetes Network Policies and cloud IAM management.
Hands-on experience with Datadog APM for distributed tracing, dashboard creation, defining SLIs/SLOs, and configuring alerting logic to reduce MTTR.
Capability to lead cloud financial initiatives through workload rightsizing, strategic use of Spot/Preemptible instances, and building automated policy enforcement for cloud spend.
Experience collaborating with Enterprise Architects to design systems across the "5 Pillars" (Well-Architected Framework).
CKA (Required)
AWS Solutions Architect Professional
Google Professional Cloud Architect
FinOps Certified Practitioner (FCP)

Nice To Haves

Ability to treat infrastructure as a product to champion the developer experience, leveraging internal portals like Backstage.
Experience building custom CLI tools to streamline and simplify the development "inner loop."
Possession of a Certified Kubernetes Security Specialist (CKS) credential or deep experience managing production runtime security.
Hands-on experience implementing cloud-native security and compliance using OPA (Open Policy Agent), Kyverno, or Falco.
Advanced proficiency with Istio, Linkerd, or Consul to govern complex service-to-service communication, mTLS, and traffic shifting.
Strong engineering skills in Go or Rust to build custom Kubernetes Operators and CRDs for tailored automation.
Experience executing proactive resilience testing and "game days" using Gremlin, AWS Fault Injection Simulator, or Chaos Mesh.
Capability to calculate the exact unit cost of a transaction or service to align cloud architecture with business ROI.
Experience managing GPU-accelerated workloads on Kubernetes (NVIDIA device plugins) and model pipelines via Vertex AI or SageMaker.
Active engagement with the CNCF community and a history of contributing directly to core ecosystem tools like Terraform providers or ArgoCD plugins.

Responsibilities

Design, provision, and manage hardened, secure, cost-optimized GKE and AWS EKS production clusters.
Standardize automated, cross-cloud infrastructure delivery utilizing Terraform.
Maintain a GitOps model via ArgoCD to match environment state directly to code repositories.
Execute Canary deployments (online, live-traffic validation) and Blue-Green deployments (offline/batch, zero-downtime, instant rollback).
Architect complex topologies including VPCs, Shared VPCs, Peering, Transit Gateways, and Cloud Interconnect/Direct Connect.
Manage cross-cloud connectivity and enforce zero-trust network policies within Kubernetes.
Implement end-to-end distributed tracing and infrastructure monitoring using Datadog.
Build custom dashboards, monitors, and SLO/SLI alerts for deep visibility into app and infra health.
Translate Enterprise Architects' high-level blueprints into automated, scalable, and secure technical implementations.
Drive cost savings across AWS, GCP, and Azure (rightsizing, Spot/Preemptible instances, storage tiers) and implement automated governance (tagging, lifecycle policies, budget alerts).