IB CTO Team - Lead Site Reliability Engineer (SRE) - Vice President

Deutsche Bank•Cary, NC

Hybrid

About The Position

Investment Banking is a technology‑centric business driven by real‑time processing, sophisticated integrated systems, and vast data access, making technology critical to business success. We are seeking a visionary and experienced Vice President, Lead Site Reliability Engineer (SRE) to join the Investment Banking Chief Technology Office (IB CTO) team in Cary, US. This role will be instrumental in shaping the strategic direction and execution of SRE across critical applications and platforms. As a senior leader, you will elevate the overall reliability posture, drive architectural resilience, and champion the adoption of cloud‑native patterns across a diverse application portfolio. You will translate complex technical challenges into actionable SRE roadmaps and execute them across multiple global teams and technology stacks. The role also focuses on proactively mitigating systemic risk, optimizing cost efficiency through SRE principles, and providing technical thought leadership for the highly complex, distributed systems that underpin core Investment Banking functions.

Requirements

Deep mastery of SRE practices (SLOs/SLIs, error budgets, incident management) with a proven ability to drive SRE adoption and cultural change across teams
Expert in designing and optimizing large‑scale GCP platforms (GKE, IAM, networking, security, data services), with multi‑cloud or hybrid experience a plus
Hands‑on leadership experience operating large production Kubernetes environments, including service mesh and shared platform capabilities
Extensive experience leading Terraform‑based IaC, GitOps deployments (ArgoCD / FluxCD), and modern CI/CD‑driven SDLC transformations
Advanced observability and AIOps expertise in monitoring, alerting and logging strategies, backed by strong programming skills (e.g. Python, Go, Java) to deliver scalable automation and shared tooling
Deep expertise in diagnosing and resolving complex, production‑critical issues through rigorous root‑cause analysis across diverse application domains
Proven leader with the ability to influence, align cross‑functional teams, and clearly communicate complex technical topics to both technical and senior business stakeholders

Nice To Haves

Experience in highly regulated environments, ideally financial services, with strong understanding of compliance and security requirements for critical infrastructure
Excellent communicator who bridges technical risk and business impact for non‑technical stakeholders, and actively mentors engineers to promote scalable knowledge‑sharing

Responsibilities

Lead the platform reliability, performance, and scalability strategy across highly complex, distributed systems on GCP and on‑prem, providing architectural guidance to ensure resilience and fault tolerance across IB CTO applications
Define and institutionalize SRE operational excellence, including advanced incident management, blameless post‑mortems, and proactive problem‑prevention practices across engineering teams
Drive automation and tooling innovation to reduce toil across multiple applications, leveraging advanced automation, self‑healing capabilities, and operational intelligence, while mentoring engineers on sustainable solutions
Establish and drive enterprise‑wide adoption of SLIs/SLOs for mission‑critical services, aligning reliability metrics with business objectives and communicating outcomes to senior leadership and stakeholders
Act as a trusted technical advisor across application teams, leading cross‑functional initiatives to improve system stability, reliability culture, and complex troubleshooting
Provide architectural stewardship across Infrastructure as Code (IaC), capacity planning, and operational documentation, ensuring scalability, cost efficiency, security, disaster recovery, and knowledge sharing across the portfolio
Partner closely with application development leads and platform engineering teams to embed SRE principles into system design and delivery
Act as a trusted technical advisor to senior technology and business stakeholders on reliability, risk, and operational resilience topics
Foster a shared culture of reliability, learning, and continuous improvement across geographically distributed teams

Benefits

A hybrid working model, allowing for in-office / work from home flexibility
Generous vacation, personal and volunteer days
Employee Resource Groups support an inclusive workplace for everyone and promote community engagement
Competitive compensation packages
Health and wellbeing benefits
Retirement savings plans
Parental leave
Family building benefits
Educational resources
Matching gift and volunteer programs