Principal Site Reliability Engineer

Vertafore•Denver, CO

Hybrid

About The Position

Vertafore is a leading technology company whose innovative software solutions are advancing the insurance industry. Our suite of products helps our customers better manage their business, boost productivity and efficiency, and lower costs while strengthening relationships. Our mission is to move InsurTech forward by putting people at the heart of the industry. We are leading the way with product innovation, technology partnerships, and a focus on customer success. Our fast-paced and collaborative environment inspires us to create, think, and challenge each other in ways that make our solutions and our teams better. We are headquartered in Denver, Colorado, with offices across the U.S., Canada, and India.

We are seeking a Principal Site Reliability Engineer to define the strategic vision and own the enterprise-wide reliability, scalability, and performance of our critical production services. As a foundational pillar of our engineering organization, this role drives architectural standards for the full service lifecycle, from initial design and deployment readiness to proactive production operations. At Vertafore, we view reliability as a core engineering responsibility. You will operate autonomously across AWS, hybrid data centers, and customer-hosted environments, setting the technical direction for how we treat operations as a software engineering challenge. This role is pivotal in transitioning cross-departmental teams toward a highly proactive, engineering-first culture.

Requirements

12 to 15+ years of hands-on Cloud Operations, SRE, or reliability-focused engineering experience, with a proven track record of end-to-end enterprise service ownership.
Demonstrated ability to operate at a Principal/Architect scope, driving large-scale reliability outcomes and operational excellence across global organizations.
Expert-level software engineering skills in C#, .NET, Java, Python, or React.
Deep expertise in scaling core SRE principles (SLIs, SLOs, error budgets) across complex, distributed systems.
Mastery of AWS, Kubernetes, CI/CD pipelines, Infrastructure-as-Code, and extensive knowledge of Linux and Windows environments and relational databases.
Bachelor’s or Master’s degree in Computer Science or a related technical field.
Participation in an executive on-call rotation, with flexible hours as required.

Responsibilities

Define the standards for end-to-end service ownership, holding the organization accountable for availability, performance, and overall operational health.
Lead cross-departmental initiatives to influence system design at the architectural level, driving fault tolerance, strict compliance, and operational sustainability across public and private clouds.
Dictate the enterprise strategy for observability frameworks, ensuring the Four Golden Signals (Latency, Traffic, Errors, and Saturation) provide actionable, predictive insights across all platforms.
Establish the governance models for defining and managing SLIs and SLOs across multiple product lines.
Champion Error Budgets as the ultimate technical arbiter at the executive level, balancing feature velocity with the absolute requirement for platform stability.
Lead incident response for the most critical, high-severity events.
Foster a "Win Together" environment by championing a Blameless Postmortem culture globally, ensuring root cause analyses focus strictly on systemic and process improvements rather than individual error.
