Site Reliability Engineering Manager

Shippo

119d • $192,000 - $261,000

About The Position

Here at Shippo, we are the shipping layer of the internet, and we consider ourselves one of the core building blocks of e-commerce. Our mission is to make merchants successful through world-class shipping. With our products, we level the playing field by providing our customers with best-in-class solutions that otherwise wouldn’t be available to them. Through Shippo, e-commerce businesses, marketplaces, platforms, and a variety of logistics infrastructure providers can connect to shipping carriers around the world from one API and dashboard. We provide our customers with the most competitive shipping rates, label printing, automated international documents, shipment tracking, support for the returns process, and more.

As the SRE Manager at Shippo, you will lead a team of engineers responsible for building platforms, tooling, and infrastructure that enable product teams to operate reliable, performant, and scalable services. You will establish frameworks for observability, deployment automation, and infrastructure management that allow product teams to own their service reliability. You will maintain a strong, support-oriented team while building automation and enabling engineering productivity and operational excellence across the organization.

Requirements

3+ years of hands-on engineering management experience
9+ years as a software or systems engineer with deep experience building platforms, tooling, or infrastructure
BS or MS degree in Computer Science or equivalent experience
Expert-level experience designing and operating platforms that enable other engineering teams (internal platform-as-a-product experience)
Strong operational experience with Kubernetes in production environments, including experience building Kubernetes platforms for application teams
Deep expertise with at least one public cloud provider (AWS, GCP) including networking, compute, storage, and managed services
Experience building or maintaining CI/CD systems and deployment automation (GitHub Actions, GitLab CI, ArgoCD, Flux, etc.)
Strong background in infrastructure-as-code tools and patterns (Terraform, Pulumi, CloudFormation, etc.)
Experience designing and implementing observability platforms (Prometheus, Grafana, ELK stack, Datadog, New Relic, etc.)
Proficiency in at least one programming language for tooling and automation (Python, Go, or similar)
Experience establishing reliability frameworks (SLO/SLI/error budgets) that other teams can adopt
Understanding of developer experience and ability to build self-service tooling that reduces friction
Track record of designing disaster recovery solutions and implementing security and compliance best practices for infrastructure
Exceptional verbal, written, and interpersonal communication skills, with the ability to influence product teams and engineering leadership
Deep understanding of enabling product team success through platform capabilities

Responsibilities

Lead and develop a team of platform-focused SRE engineers, providing technical mentorship, career development, and performance management while fostering a culture of automation, self-service, and continuous improvement
Build and maintain internal platforms and tooling that enable product teams to deploy, monitor, and operate their services reliably
Manage observability platforms (metrics, logs, traces, dashboards) that provide product teams visibility into their services
Own the infrastructure and Kubernetes platform that all Shippo services run on, ensuring it scales ahead of business needs through capacity planning and performance optimization
Establish frameworks and tooling for SLO/SLI definition, error budget tracking, and reliability measurement that product teams can adopt
Design and maintain CI/CD pipelines, deployment automation, and release tooling that enable safe, frequent deployments
Build infrastructure-as-code foundations and self-service capabilities that allow product teams to provision and manage their infrastructure
Create automation to eliminate toil and prevent infrastructure problems before they impact product teams
Drive infrastructure cost optimization initiatives through analysis, rightsizing recommendations, reserved capacity planning, and waste elimination across the cloud platform
Participate in leadership rotation for Sev1 incidents affecting services or the platform itself
Manage the SRE team’s on-call rotation
Design, implement, and test disaster recovery capabilities and ensure infrastructure security and compliance
Partner with Engineering Managers and TPMs to understand product team needs, prioritize platform investments, and communicate platform roadmap and capabilities
Establish platform SLOs for infrastructure reliability, deployment success rates, build times, and other developer experience metrics

Benefits

Healthcare coverage for medical, dental, and vision
Take-as-much-as-you-need vacation policy & flexible working
One week-long, company-wide winter shutdown
3 Volunteer Days Off (VTOs)
WFH stipend to set up your home office
Charity donation match up to $100
Dedicated programs, coaching, tools, and resources for your professional and career growth as well as an individual learning stipend for your personal and focused growth
Fun in-person team time through our Shippos Everywhere program, which includes regular team and company off-sites throughout the year as well as local Shippos gatherings

What This Job Offers

Job Type

Full-time

Career Level

Manager

Education Level

Bachelor's degree

Number of Employees

101-250 employees
