Senior Vice President, Site Reliability Engineering (SRE)

Oaktree Capital Management
Los Angeles, CA
Posted 5d ago
$225,000 - $250,000

About The Position

Our Company

Oaktree is a leader among global investment managers specializing in alternative investments, with over $200 billion in assets under management. The firm emphasizes an opportunistic, value-oriented and risk-controlled approach to investments in credit, private equity, real assets and listed equities. The firm has over 1400 employees and offices in 25 cities worldwide. We are committed to cultivating an environment that is collaborative, curious, inclusive and honors diversity of thought. Providing training and career development opportunities and emphasizing strong support for our local communities through philanthropic initiatives are essential to our culture.

The Technology department at Oaktree Capital Management delivers secure, scalable, and innovative solutions that power the firm’s global investment and business operations. Through strong partnerships across the company, we drive digital transformation, advance operational efficiency, and provide a trusted data foundation to create measurable impact for Oaktree’s teams, clients, and partners.

For additional information please visit our website at www.oaktreecapital.com

Role Summary

The Senior Vice President, Site Reliability Engineering (SRE) is a hands-on engineering leader responsible for defining, driving, and scaling reliability practices across Oaktree’s global technology ecosystem. This executive and engineer will work in close partnership with software engineering teams, architects, security experts, infrastructure and cloud engineers, as well as key business stakeholders to ensure applications, platforms, and architectures meet the highest standards of resilience, reliability, performance, and operational excellence. The SVP will spearhead Oaktree’s enterprise-wide SRE strategy, including SLO/SLA frameworks, RTO/RPO definitions, error-budget practices, observability maturity, incident processes, and related automation initiatives.
As Oaktree accelerates its migration to Azure, this leader will bring deep experience in cloud-native SRE practices. This leader will drive innovation by leveraging Agentic AI to augment SRE functions. The SVP will own Oaktree’s observability platform, including technology selection, budgeting, vendor management, and governance.

Requirements

  • 10-15 years of SRE experience, with 5+ years in leadership.
  • Hands-on engineering expertise across cloud and hybrid systems.
  • Deep Microsoft Azure experience.
  • Strong knowledge of SLO/SLA frameworks and operational governance.
  • Proven ownership of incident and problem management.
  • Expertise with observability/APM and related tools (preferably Datadog, Dynatrace, New Relic, PagerDuty, Cribl, Prometheus/Grafana, Azure Monitor, etc.).
  • Background in Prompt Engineering, automation, IaC, and CI/CD.
  • Strong software development and infrastructure background, with an understanding of architectural requirements.

Nice To Haves

  • AZ-400 Certification.
  • SRE Foundation Certification.
  • Familiarity with Google SRE principles.
  • Experience with Agentic AI.
  • Experience in chaos engineering.
  • Knowledge of ITIL, Agile, and DevOps best practices.

Responsibilities

  • Define and execute the enterprise SRE vision.
  • Act as an enabling team.
  • Foster SRE best practices in stream-aligned teams.
  • Establish reliability frameworks including SLAs, SLOs, RTOs, RPOs, and error budgets.
  • Partner with engineering, architecture, security, and operations teams to drive changes that achieve appropriate levels of reliability.
  • Lead reliability engineering for applications and infrastructure in Microsoft Azure.
  • Develop Agentic AI capabilities for SRE workflows.
  • Own enterprise observability strategies and platforms (experience with Datadog and Cribl preferred).
  • Build unified dashboards for system health and reliability insights.
  • Own the practices for major incident management, blameless postmortems, and problem management.
  • Automate incident response processes.
  • Foster AIOps.
  • Champion chaos engineering and resilience testing, and own the roadmap for both.
  • Track and report on SLO adherence, DORA metrics, and reliability trends.
  • Manage budgets, vendor contracts, and platform procurement.

Benefits

  • In addition to a competitive base salary, you will be eligible to receive discretionary bonus incentives, a comprehensive benefits package and a flexible work arrangement.