Site Reliability Engineer

Peloton
New York, NY
Hybrid

About The Position

At Peloton, we view Platform as a Product. A phenomenal platform unlocks speed of development and learning. It allows us to scale easily, enabling our engineers to focus their attention on new features and capabilities. A key to crafting a phenomenal platform is data-driven insight: understanding where we should focus our attention to create the best outcomes for our members. Platform at Peloton is a force multiplier that enables Peloton to move faster and scale safely with minimal effort. Core to this mission is the creation of the best developer experience in the tech industry for the entire spectrum of Peloton's technology.

We work across an incredible range of technology domains: hardware, firmware, web, mobile, backend, data, messaging, content, streaming, and machine learning. We get to apply these to create a platform of products loved by millions of customers all over the world.

Peloton is looking for a Site Reliability Engineer with an operations focus to work with teams across the organization to help build and maintain a monitorable, performant, reliable, and highly scalable deployment platform. We are a growing team of engineers tackling exciting problems to handle thousands of nodes and pods spread across many deployments.

Requirements

  • Experience maintaining scalable and stable Kubernetes clusters
  • Knowledge of best practices for the observability and monitoring required to run Kubernetes at scale
  • Knowledge of best practices for securing a Kubernetes cluster and its deployments at scale
  • A passion for helping development teams make the transition to a container-native world
  • Experience with CI/CD systems such as Jenkins, ArgoCD, Harness, or Tekton
  • Experience deploying infrastructure using Infrastructure as Code tools such as Terraform or Pulumi
  • Ability to judge when to triage and when to dive into a root-cause analysis
  • Passion for reliable, scalable, observable software with a strong sense of ownership
  • Experience with a programming language such as Python, Go, Java, or C

Responsibilities

  • Provide fast, automatic scaling for live rides and special large events
  • Host critical infrastructure, running on tens of thousands of pods across multiple clusters, that ensures our members have the best experience possible
  • Provide a platform for machine learning (and other awesome workloads)
  • Allow developers to move quickly and experiment, without getting in the way
  • Promote best practices for building and operating highly reliable systems
  • Serve as domain expert in observability and monitoring
  • Consult in system design to meet reliability and capacity requirements
  • Automate everything, from infrastructure down to day-to-day tasks
  • Conduct timely post-mortems of infrastructure incidents
  • Assist with all aspects of operational security and compliance
  • Seek out potential threats to security and reliability and advocate solutions

Benefits

  • Medical, dental and vision insurance
  • Generous paid time off policy
  • Short-term and long-term disability
  • Access to mental health services
  • 401(k), tuition reimbursement, and student loan paydown plans
  • Employee Stock Purchase Plan
  • Fertility and adoption support and up to 18 weeks of paid parental leave
  • Child care and family care discounts
  • Free access to Peloton Digital App and apparel and product discounts
  • Commuter benefits and Citi Bike discount
  • Pet insurance and so much more!