Director, Site Reliability Engineering

Vertafore, Denver, CO
$175,000 - $220,000

About The Position

Vertafore is a leading technology company whose innovative software solutions are advancing the insurance industry. Our suite of products helps our customers better manage their business, boost their productivity and efficiency, and lower costs while strengthening relationships. Our mission is to move InsurTech forward by putting people at the heart of the industry. We are leading the way with product innovation, technology partnerships, and a focus on customer success. Our fast-paced and collaborative environment inspires us to create, think, and challenge each other in ways that make our solutions and our teams better. We are headquartered in Denver, Colorado, with offices across the U.S., Canada, and India.

The Director, Site Reliability Engineering (SRE) will lead reliability, performance, and observability initiatives for a portfolio of Vertafore products. This role owns SLIs/SLOs, incident response, automation, and CI/CD practices for assigned product families. Directors will manage multiple teams and collaborate with Product Development, Architecture, Cloud Operations, Information Security, and other SRE leaders to ensure operational excellence. This role is responsible for bridging the gap between development and operations by applying a software engineering mindset to system administration. You will own the lifecycle of services, from inception and design through deployment, operation, and refinement.

Responsibilities

  • Product Reliability Leadership
      • Define and enforce SLIs/SLOs for a subset of Vertafore flagship products.
      • Drive observability strategy across application and infrastructure layers.
  • Release Engineering & Toil Reduction
      • Oversee CI/CD pipelines for product deployments using tools such as GitLab, Jenkins, Ansible, and LaunchDarkly.
      • Monitor "toil" (manual, repetitive operational work) and cap it at 50% of the team's time using automation and AI tools, ensuring the team spends the remaining time on project work that scales the system.
  • Error Budget Management
      • Manage error budgets to balance the velocity of feature releases with the stability of the platform, ensuring clear consequences when budgets are exhausted.
  • Incident Management
      • Define and participate in 24x7 on-call rotations for assigned products; ensure rapid resolution and blameless postmortems.
  • Cross-Functional Collaboration
      • Partner with Cloud Ops on capacity planning, OS patching (app tier), and load balancing (ALB, F5).
      • Align reliability goals with product roadmaps and customer SLAs.
  • Team Leadership
      • Manage a group of managers and engineers, and mentor teams on automation, observability, and reliability best practices.