About The Position

We are looking for a Site Reliability Engineer (SRE) to join our global team. This role focuses on ensuring the operational health, reliability, performance, and scalability of the CARE platform and its multi-tenant applications, encompassing Google Cloud Platform (GCP) and on-prem infrastructure, application deployment, and the underlying CARE services. You will be instrumental in defining and implementing SRE best practices to maintain a highly available and resilient platform. As a senior IB SRE, you will be crucial in ensuring the continuous operation and improvement of the platform.

Requirements

  • Strong understanding of SRE principles and practices, including SLOs/SLIs, incident management, post-mortems, and toil reduction
  • Deep understanding of GCP services such as GKE, Identity and Access Management (IAM), identity services, Cloud SQL, Cloud Monitoring, and Cloud Logging, and their related operational aspects
  • Extensive experience with Kubernetes and container orchestration, including configuration, troubleshooting, and performance tuning
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, Splunk, Google Cloud Monitoring) and defining effective alerts and dashboards
  • Solid experience with Git and GitHub, including Git workflows for managing code, and with deployment tooling such as ArgoCD for managing deployments and application lifecycles
  • Programming/scripting (e.g., Python, Go, Java, Bash) and Infrastructure as Code (e.g., Terraform) experience for automation, tooling development, data analysis, and managing infrastructure

Nice To Haves

  • Experience with Service Mesh (e.g., Istio) is highly desirable
  • Strong understanding of Software Development Life Cycle (SDLC) and DevOps best practices, with a focus on continuous integration, continuous delivery, and automated testing from an operational perspective
  • Excellent problem-solving skills and the ability to diagnose and resolve complex technical issues in distributed systems
  • Experience with production support and on-call rotations in a critical environment

Responsibilities

  • Proactively monitor, troubleshoot, and resolve issues related to platform availability, performance, and capacity on both GCP and on-prem infrastructure
  • Develop, implement, and maintain SRE best practices, including incident response, post-mortems, root cause analysis, and proactive problem prevention
  • Drive automation efforts to reduce manual toil across operational tasks, deployment, scaling, and recovery. This includes developing and improving monitoring, alerting, and self-healing systems
  • Define, monitor, and report on Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for key platform services, working to continuously improve them
  • Liaise with application teams (tenants) to understand their operational needs, provide guidance on platform best practices for reliability and capacity planning, and assist with complex troubleshooting
  • Collaborate with security teams to ensure the platform adheres to security policies and compliance requirements, focusing on operational security aspects

Benefits

  • A diverse and inclusive environment that embraces change, innovation, and collaboration
  • A hybrid working model that allows for in-office and work-from-home flexibility, plus generous vacation, personal, and volunteer days
  • Employee Resource Groups support an inclusive workplace for everyone and promote community engagement
  • Competitive compensation packages including health and wellbeing benefits, retirement savings plans, parental leave, and family building benefits
  • Educational resources, matching gift and volunteer programs
  • Physical, emotional, and financial wellness benefits