Senior Vice President, Site Reliability Engineering (SRE)

Oaktree Capital Management
Los Angeles, CA
$225,000 - $250,000

About The Position

The Senior Vice President, Site Reliability Engineering (SRE) is a hands-on engineering leader responsible for defining, driving, and scaling reliability practices across Oaktree’s global technology ecosystem. This leader will work in close partnership with software engineering teams, architects, security experts, infrastructure and cloud engineers, and key business stakeholders to ensure that applications, platforms, and architectures meet the highest standards of resilience, reliability, performance, and operational excellence. The SVP will spearhead Oaktree’s enterprise-wide SRE strategy, including SLO/SLA frameworks, RTO/RPO definitions, error-budget practices, observability maturity, incident processes, and related automation initiatives. As Oaktree accelerates its migration to Azure, this leader will bring deep experience in cloud-native SRE practices and will drive innovation by using Agentic AI to augment SRE functions. The SVP will also own Oaktree’s observability platform, including technology selection, budgeting, vendor management, and governance.

Requirements

  • 10-15 years of SRE experience, with 5+ years in leadership.
  • Hands-on engineering expertise across cloud and hybrid systems.
  • Deep Microsoft Azure experience.
  • Strong knowledge of SLO/SLA frameworks and operational governance.
  • Proven ownership of incident and problem management.
  • Expertise with observability/APM and related tools (preferably Datadog, Dynatrace, New Relic, PagerDuty, Cribl, Prometheus/Grafana, or Azure Monitor).
  • Background in prompt engineering, automation, IaC, and CI/CD.
  • Strong software development and infrastructure background, with an understanding of architectural requirements.

Nice To Haves

  • AZ-400 Certification.
  • SRE Foundation Certification.
  • Familiarity with Google SRE principles.
  • Experience with Agentic AI.
  • Experience in chaos engineering.
  • Knowledge of ITIL, Agile, and DevOps best practices.
  • Master’s degree in Technology Management, Information Technology, or a related field.

Responsibilities

  • Define and execute the enterprise SRE vision.
  • Act as an enabling team, fostering SRE best practices in stream-aligned teams.
  • Establish reliability frameworks including SLAs, SLOs, RTOs, RPOs, and error budgets.
  • Partner with engineering, architecture, security, and operations teams to drive changes that achieve the appropriate level of reliability.
  • Lead reliability engineering for applications and infrastructure in Microsoft Azure.
  • Develop Agentic AI capabilities for SRE workflows.
  • Own enterprise observability strategies and platforms (experience with Datadog and Cribl preferred).
  • Build unified dashboards for system health and reliability insights.
  • Own the practices for major incident management, blameless postmortems, and problem management, acting as an enabling team that fosters best practices.
  • Automate incident response processes and foster AIOps.
  • Champion chaos engineering and resilience testing, and build the roadmap for both.
  • Track and report on SLO adherence, DORA metrics, and reliability trends.
  • Manage budgets, vendor contracts, and platform procurement.

Benefits

  • In addition to a competitive base salary, you will be eligible to receive discretionary bonus incentives, a comprehensive benefits package and a flexible work arrangement.