IB CTO Team - Site Reliability Engineer (SRE) - Assistant Vice President

Deutsche Bank•Cary, NC

1d•Hybrid

About The Position

We are looking for a Site Reliability Engineer (SRE) to join our global team. This role focuses on ensuring the operational health, reliability, performance, and scalability of the CARE platform and its multi-tenant applications, encompassing Google Cloud Platform (GCP) and on-prem infrastructure, application deployment, and the underlying CARE services. You will be instrumental in defining and implementing SRE best practices to maintain a highly available and resilient platform. As a senior IB SRE, you will play a key part in the continuous operation and improvement of the platform.

Requirements

Strong understanding of SRE principles and practices, including SLOs/SLIs, incident management, post-mortems, and toil reduction
Deep understanding of GCP services such as GKE, Identity and Access Management (IAM), identity services, Cloud SQL, Cloud Monitoring, and Cloud Logging, and of their related operational aspects
Extensive experience with Kubernetes and container orchestration, including configuration, troubleshooting, and performance tuning
Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, Splunk, Google Cloud Monitoring) and defining effective alerts and dashboards
Solid experience with Git and GitHub, including Git workflows for managing code, and with deployment tooling such as ArgoCD for deploying applications and managing their lifecycles
Programming/scripting (e.g., Python, Go, Java, Bash) and Infrastructure as Code (e.g., Terraform) experience for automation, tooling development, data analysis, and managing infrastructure

Nice To Haves

Experience with Service Mesh (e.g., Istio) is highly desirable
Strong understanding of Software Development Life Cycle (SDLC) and DevOps best practices, with a focus on continuous integration, continuous delivery, and automated testing from an operational perspective
Excellent problem-solving skills and the ability to diagnose and resolve complex technical issues in distributed systems
Experience with production support and on-call rotations in a critical environment

Responsibilities

Proactively monitor, troubleshoot, and resolve issues related to platform availability, performance, and capacity on both GCP and on-prem infrastructure
Develop, implement, and maintain SRE best practices, including incident response, post-mortems, root cause analysis, and proactive problem prevention
Drive automation efforts to reduce manual toil across operational tasks, deployment, scaling, and recovery. This includes developing and improving monitoring, alerting, and self-healing systems
Define, monitor, and report on Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for key platform services, working to continuously improve them
Liaise with application teams (tenants) to understand their operational needs, provide guidance on platform best practices for reliability and capacity planning, and assist with complex troubleshooting
Collaborate with security teams to ensure the platform adheres to security policies and compliance requirements, focusing on operational security aspects

Benefits

A diverse and inclusive environment that embraces change, innovation, and collaboration
A hybrid working model that allows for in-office and work-from-home flexibility, plus generous vacation, personal, and volunteer days
Employee Resource Groups support an inclusive workplace for everyone and promote community engagement
Competitive compensation packages including health and wellbeing benefits, retirement savings plans, parental leave, and family building benefits
Educational resources, matching gift and volunteer programs
Physical, emotional, and financial wellness benefits
