VP of Engineering, Reliability

Jobgether

$260,000 - $360,000

About The Position

This leadership role is responsible for defining and executing the strategy, operating model, and organizational development for a high-impact reliability engineering function. You will lead teams responsible for infrastructure, site reliability, database engineering, observability, and incident management, ensuring the platform scales securely and efficiently. The role requires balancing feature velocity with system stability through SLOs, error budgets, and operational excellence frameworks. You will partner closely with executives to guide risk trade-offs, optimize cloud infrastructure costs, and enable AI-native workloads. This is a strategic, hands-on position where you will build and mentor high-performing teams while shaping the future reliability posture of a rapidly evolving platform. Success in this role demands deep technical expertise, executive presence, and the ability to transform reliability into a business accelerator.

Requirements

15+ years of engineering experience, including 7+ years leading infrastructure, reliability, or platform teams at scale.
Proven experience managing organizations of 40+ engineers across multiple disciplines, with multi-layer management and career development expertise.
Deep expertise in SRE principles, including production-hardened SLOs, error budgets, incident management, and toil reduction.
Strong technical command of cloud-native systems (AWS), container orchestration, Terraform (IaC), CI/CD pipelines, and observability tooling.
Experience leading reliability organizations through scaling inflection points and evolving operating models.
Familiarity with AI/ML infrastructure requirements, including model serving, vector search, and data pipeline reliability.
Ability to operate in high-trust, regulated environments (Legal Tech, FinTech, Healthcare, Government) with compliance and data sensitivity requirements.
Executive presence and credibility to influence C-suite decision-making on risk trade-offs, investment priorities, and operational transformation.

Responsibilities

Define and execute the reliability engineering roadmap, aligning infrastructure, cloud, and AI-native architecture with enterprise growth objectives.
Evolve the reliability operating model to balance centralized capabilities with distributed ownership across engineering teams.
Establish SLOs, SLIs, and error budget frameworks to provide shared metrics for system stability and delivery velocity.
Lead infrastructure cost management, capacity planning, and disaster recovery to meet enterprise commitments.
Build, scale, and mentor a multi-disciplinary organization, fostering ownership, craftsmanship, and clear career growth for DevOps, SRE, DBRE, and Tooling teams.
Drive operational excellence using DORA metrics, incident analysis, and systematic toil reduction to improve availability and deployment health.
Enable feature teams through self-service tooling, guardrails, and documentation, empowering them to operate independently while maintaining reliability.
Serve as primary engineering liaison for security and compliance initiatives, translating requirements into actionable engineering strategies.

Benefits

Competitive salary: $260,000–$360,000 annually.
Paid time off policy and flexible work arrangements.
Medical, dental, and vision insurance for full-time employees.
Maternity and paternity leave.
Short- and long-term disability coverage.
Opportunity to lead and learn from a highly experienced leadership team.
Dynamic, rapidly growing environment focused on innovation and operational excellence.
Company swag and culture that celebrates collaboration and achievement.
