Site Reliability Engineer

Peloton•New York, NY

58d•Hybrid

About The Position

At Peloton, we view Platform as a Product. A phenomenal platform unlocks speed of development and learning. It allows us to scale easily, enabling our engineers to focus on new features and capabilities. A key to crafting a phenomenal platform is using data-driven insights to understand where we should focus our attention to create the best outcomes for our members. Platform at Peloton is a force multiplier that enables Peloton to move faster and scale safely with minimal effort. Core to this mission is creating the best developer experience in the tech industry for the entire spectrum of Peloton's technology.

We work across an incredible range of technology domains: hardware, firmware, web, mobile, backend, data, messaging, content, streaming, and machine learning. We get to apply these to create a platform of products loved by millions of customers all over the world.

Peloton is looking for a Site Reliability Engineer with an operations focus to work with teams across the organization to help build and maintain a monitorable, performant, reliable, and highly scalable deployment platform. We are a growing team of engineers tackling exciting problems to handle thousands of nodes and pods spread across many deployments.

Requirements

Experience maintaining scalable and stable Kubernetes clusters
Knowledge of best practices for the observability and monitoring required to run Kubernetes at scale
Knowledge of best practices for securing a Kubernetes cluster and its deployments at scale
A passion for helping development teams make the transition to a container-native world
Experience with CI/CD systems such as Jenkins, ArgoCD, Harness, or Tekton
Experience deploying infrastructure using Infrastructure as Code tools such as Terraform or Pulumi
Ability to judge when to triage and when to dig into a root-cause analysis
Passion for reliable, scalable, observable software with a strong sense of ownership
Experience with a programming language such as Python, Golang, Java, or C

Responsibilities

Provide fast, automatic scaling for live rides and large special events
Host critical infrastructure, running on tens of thousands of pods across multiple clusters, that ensures our members have the best possible experience
Provide a platform for machine learning (and other awesome workloads)
Allow developers to move quickly and experiment, without getting in the way
Promote best practices for building and operating highly reliable systems
Serve as a domain expert in observability and monitoring
Consult on system design to meet reliability and capacity requirements
Automate everything, from infrastructure down to day-to-day tasks
Conduct timely post-mortems of infrastructure incidents
Assist with all aspects of operational security and compliance
Seek out potential threats to security and reliability and advocate for solutions