Director of SRE

LSEG • Charlotte, NC

Hybrid

About The Position

We are seeking a highly technical and strategic Director of Site Reliability Engineering (SRE) to lead the design, operation and continuous improvement of highly available, scalable and resilient platforms across FTSE Russell Engineering. Reporting to the COO, FTSE Russell Engineering, this role will drive operational and engineering excellence in observability, incident management, automation and resilience, ensuring mission-critical financial systems meet stringent reliability, performance and regulatory requirements.

Requirements

Proven experience leading SRE, DevOps, and/or Platform Engineering teams in large-scale, regulated environments.
Deep technical expertise in both on-premises and AWS cloud-native architecture and systems design, including:
AWS services: EKS/ECS, Lambda, API Gateway, DynamoDB, Aurora, and S3
Databases and data platforms: SQL Server, Sybase, PostgreSQL (including Aurora)
Application support and development: Unix/Linux, Java, C#/.NET, Python
Strong experience with observability tooling and frameworks for metrics, logs, and traces (Prometheus, Grafana, OpenTelemetry, ELK/EFK stacks, CloudWatch, Datadog, or similar)
Demonstrated ownership of incident management and production operations at scale.
Hands-on experience with CI/CD pipelines, Git, automation, scripting (Python, Go), and IaC tools (Ansible, Terraform, etc.).
Demonstrated application of agentic AI engineering to observability, incident management, and automated recovery.
Strong understanding of networking, security, and reliability engineering principles.
Experience defining and implementing SLOs, SLIs, and error budgets.

Nice To Haves

Experience in financial services or other data-intensive, regulated industries.
Familiarity with multi-region AWS architecture and high-availability, mission-critical applications.
Knowledge of chaos engineering practices and resilience testing.
Exposure to AIOps and intelligent automation frameworks.

Responsibilities

Lead, mentor, and scale a high-performing global SRE organization.
Partner with product, platform, operations and security teams to embed reliability into the software development lifecycle (SDLC).
Define and track KPIs for reliability, performance, and operational efficiency.
Foster a culture of continuous improvement, accountability, and engineering excellence.
Promote automation-first principles and self-service observability tooling for engineering teams.
Own and evolve production incident management frameworks, including detection, triage, escalation, and resolution, working closely with existing development and support teams.
Lead major incident response (MIR) for critical outages, ensuring rapid mitigation and clear stakeholder communication.
Implement and enforce blameless postmortems, ensuring actionable follow-ups and systemic improvements.
Establish runbooks, playbooks, and operational readiness standards across all services.
Drive continuous improvement in incident response processes, tooling, and team readiness.
Drive automation of operational processes including incident response, failover, scaling, and recovery.
Implement self-healing mechanisms using auto-remediation, event-driven workflows, and AI/ML-assisted operations where applicable.
Promote Infrastructure as Code (IaC) using tools such as Terraform or CloudFormation.
Reduce manual toil through CI/CD pipelines, automated testing, and deployment strategies (blue/green, canary releases).
Define and implement best-in-class observability frameworks across metrics, logs, and traces.
Standardize tooling (e.g., Prometheus, Grafana, OpenTelemetry, ELK/EFK stacks, CloudWatch, Datadog, or similar) across engineering teams.
Champion distributed tracing and real-time telemetry to enable deep system visibility and rapid root cause analysis.
Drive a data-driven reliability culture, using observability insights to proactively identify and eliminate system risks.
Establish and enforce SRE principles including SLIs, SLOs, SLAs, and error budgets across all critical services.
Drive adoption of resilient design patterns (multi-region failover, active-active architectures, circuit breakers, bulkheads).
Ensure platforms are designed for high availability (HA), fault tolerance, and disaster recovery (DR) aligned with financial market uptime requirements.
Lead deep-dive investigations into platform failures, performance bottlenecks, and systemic issues across complex, multi-tiered architectures.