Site Reliability Engineering Manager

Shippo
70d · $192,000 - $261,000

About The Position

Here at Shippo, we are the shipping layer of the internet, and we consider ourselves one of the core building blocks of e-commerce. Our mission is to make merchants successful through world-class shipping. With our products, we level the playing field by providing our customers with best-in-class solutions that otherwise wouldn’t be available to them. Through Shippo, e-commerce businesses, marketplaces, platforms, and a variety of logistics infrastructure providers are able to connect to shipping carriers around the world from one API and dashboard. We provide our customers with the most competitive shipping rates, label printing, automated international documents, shipment tracking, returns facilitation, and more.

As the SRE Manager at Shippo, you will lead a team of engineers responsible for building platforms, tooling, and infrastructure that enable product teams to operate reliable, performant, and scalable services. You will establish frameworks for observability, deployment automation, and infrastructure management that allow product teams to own their service reliability. You will maintain a strong, support-oriented team while building automation and enabling engineering productivity and operational excellence across the organization.

Requirements

  • 3+ years of hands-on engineering management experience
  • 9+ years as a software or systems engineer with deep experience building platforms, tooling, or infrastructure
  • BS or MS degree in Computer Science or equivalent experience
  • Expert-level experience designing and operating platforms that enable other engineering teams (internal platform-as-a-product experience)
  • Strong operational experience with Kubernetes in production environments, including experience building Kubernetes platforms for application teams
  • Deep expertise with at least one public cloud provider (AWS, GCP) including networking, compute, storage, and managed services
  • Experience building or maintaining CI/CD systems and deployment automation (GitHub Actions, GitLab CI, ArgoCD, Flux, etc.)
  • Strong background in infrastructure-as-code tools and patterns (Terraform, Pulumi, CloudFormation, etc.)
  • Experience designing and implementing observability platforms (Prometheus, Grafana, ELK stack, Datadog, New Relic, etc.)
  • Proficiency in at least one programming language for tooling and automation (Python, Go, or similar)
  • Experience establishing reliability frameworks (SLO/SLI/error budgets) that other teams can adopt
  • Understanding of developer experience and ability to build self-service tooling that reduces friction
  • Track record of designing disaster recovery solutions and implementing security and compliance best practices for infrastructure
  • Exceptional verbal, written, and interpersonal communication skills, with the ability to influence product teams and engineering leadership
  • Deep understanding of enabling product team success through platform capabilities

Responsibilities

  • Lead and develop a team of platform-focused site reliability engineers, providing technical mentorship, career development, and performance management while fostering a culture of automation, self-service, and continuous improvement
  • Build and maintain internal platforms and tooling that enable product teams to deploy, monitor, and operate their services reliably
  • Manage observability platforms (metrics, logs, traces, dashboards) that give product teams visibility into their services
  • Own the infrastructure and Kubernetes platform that all Shippo services run on, ensuring it scales ahead of business needs through capacity planning and performance optimization
  • Establish frameworks and tooling for SLO/SLI definition, error budget tracking, and reliability measurement that product teams can adopt
  • Design and maintain CI/CD pipelines, deployment automation, and release tooling that enable safe, frequent deployments
  • Build infrastructure-as-code foundations and self-service capabilities that allow product teams to provision and manage their infrastructure
  • Create automation to eliminate toil and prevent infrastructure problems before they impact product teams
  • Drive infrastructure cost optimization initiatives through analysis, rightsizing recommendations, reserved capacity planning, and waste elimination across the cloud platform
  • Participate in leadership rotation for Sev1 incidents affecting services or the platform itself
  • Manage the SRE team’s on-call rotation
  • Design, implement, and test disaster recovery capabilities and ensure infrastructure security and compliance
  • Partner with Engineering Managers and TPMs to understand product team needs, prioritize platform investments, and communicate platform roadmap and capabilities
  • Establish platform SLOs for infrastructure reliability, deployment success rates, build times, and other developer experience metrics

Benefits

  • Healthcare coverage for medical, dental, and vision
  • Take-as-much-as-you-need vacation policy & flexible working
  • One week-long, company-wide winter shutdown
  • 3 Volunteer Days Off (VTOs)
  • WFH stipend to set up your home office
  • Charity donation match up to $100
  • Dedicated programs, coaching, tools, and resources for your professional and career growth, as well as an individual learning stipend for your personal and focused growth
  • Fun in-person team time through our Shippos Everywhere program, which includes regular team and company off-sites throughout the year as well as local Shippos gatherings