Director, Site Reliability Engineering (Hybrid/Flexible)

Insulet Corporation • San Diego, CA

About The Position

The Director of Site Reliability Engineering (SRE) will provide strategic leadership and technical direction for the reliability, scalability, and performance of our mission‑critical systems and services. This role combines deep SRE expertise with strong engineering leadership, driving organizational transformation toward reliability-first principles. The ideal candidate brings a strong software engineering foundation, a passion for automation, and a proven ability to lead and develop high‑performing teams. The Director will partner with engineering, product, operations, and business stakeholders to design, deliver, and operate resilient, high‑availability systems that support our customers and business objectives at scale.

Requirements

Expertise with observability and monitoring platforms such as Datadog, Prometheus, Dynatrace, Grafana, ELK, or similar.
Strong proficiency in programming languages such as Python, Go, or Java.
Deep understanding of cloud platforms (AWS, Azure, GCP) and of containerization and container orchestration technologies (Docker, Kubernetes).
Advanced knowledge of AWS services including VPC, Lambda, IAM, ELB, EC2, ECS, CloudWatch, API Gateway, S3, SQS, SNS, WAF, and Route 53.
Hands-on experience with infrastructure‑as‑code tools such as Terraform, Ansible, or equivalents.
Expert troubleshooting and problem-solving skills across distributed systems.
Strong leadership and communication skills with a proven ability to work cross-functionally.
Demonstrated success leading and mentoring engineering teams.
Strong understanding of security best practices, compliance frameworks, and implementation of security controls.
Experience with chaos engineering, resilience testing, and failure-injection methodologies.
Familiarity with applying AI/ML approaches to reliability, operations, and incident management.
Bachelor’s degree in Computer Science, Engineering, or a related field.
16 years of experience in the field, including 6+ years in Site Reliability Engineering, DevOps, or a similar role.
Proven experience architecting and managing highly available, scalable, and fault-tolerant systems.
Ability to define a clear reliability vision and inspire teams and stakeholders toward long‑term reliability goals.
Demonstrated sound judgment and calm decision‑making under pressure, particularly during high-severity incidents.
Strong people leadership skills, with experience coaching, mentoring, and developing engineering talent.
Strategic planning skills with a track record of aligning technical direction with organizational objectives.
Excellent communication skills; able to translate complex technical issues into clear, actionable insights for executive and non‑technical audiences.
Highly collaborative, with the ability to work effectively across engineering, product, operations, and business functions.
Skilled at navigating conflict and fostering healthy team dynamics.
Proactive problem solver who identifies risks and drives innovative solutions.
Strong sense of accountability for team outcomes, reliability standards, and operational excellence.

Responsibilities

Provide strategic direction for the organization-wide adoption, evolution, and maturity of SRE principles, cultivating a culture centered on reliability, efficiency, and continuous improvement.
Develop and oversee automation strategies, tools, and frameworks that improve system reliability, reduce operational toil, and enhance team productivity.
Architect and evolve robust observability, monitoring, and alerting systems to ensure availability, performance, and real‑time operational insight.
Lead and govern high‑severity incident response practices, ensuring rapid triage, thorough root cause analysis, and follow‑through on corrective and preventive actions.
Analyze reliability, performance, and capacity metrics to drive proactive optimization and long‑term system resilience.
Partner with engineering, product, and operations teams to embed SRE practices throughout the development lifecycle and influence architectural decisions for reliability.
Build, mentor, and develop a high‑performing SRE organization, fostering technical excellence, career growth, and a strong culture of knowledge sharing.
Oversee capacity planning, scalability assessments, and future‑state demand forecasting across critical systems.
Establish and maintain comprehensive documentation of SRE processes, standards, frameworks, and best practices.