About The Position

We’re looking for a strategic and hands-on Sr. Manager of Site Reliability Engineering to lead reliability at scale for one of retail’s most complex engineering platforms. You’ll lead a team of talented engineers serving ~2,500 internal developer customers, championing automation and operational excellence to ensure our platform infrastructure enables engineering velocity and business innovation.

Requirements

  • Experience - 5+ years in SRE, DevOps, or infrastructure engineering, with 4+ years in a leadership role, ideally managing multi-team or platform engineering organizations.
  • Technical Depth - Strong expertise in cloud platforms (AWS and GCP), container orchestration (Kubernetes, EKS), and CI/CD pipelines including supply chain security (container signing, SBOM, OPA policy validation).
  • Programming Skills - Proficiency in Python, Go, or Java.
  • Tool Mastery - Hands-on experience with OpenTelemetry, CloudWatch, Amazon Timestream, Splunk, Kubernetes, and Kafka.
  • Problem Solver - Strong analytical skills and a genuine passion for root cause analysis and continuous improvement.
  • Communicator - A clear, concise, and collaborative communicator who can translate technical complexity for executive audiences and work hands-on with engineers.
  • Education - Bachelor’s degree in Computer Science or Engineering, or equivalent experience.

Nice To Haves

  • Experience with large-scale distributed systems in a multi-cloud environment (AWS and GCP).
  • Experience with AI-assisted SRE operations: incident triage, anomaly detection, or AI-augmented on-call tooling.
  • Familiarity with developer platform SRE: internal developer platforms (IDPs), platform reliability metrics, and developer experience measurement.
  • Cloud certifications (e.g., AWS Certified Solutions Architect, Google Cloud Professional Cloud Architect).

Responsibilities

  • Lead & Inspire - Build and mentor a high-performing SRE team that takes pride in platform ownership. Foster a culture of growth, initiative, and continuous improvement.
  • Drive Reliability - Own the availability and performance of critical services through proactive monitoring, disciplined incident response, and thorough root cause analysis — catching problems before developers ever feel them.
  • Automate Everything - Drive meaningful reduction of manual toil through automation across deployment, recovery, and scaling processes — freeing your team to focus on higher-impact work.
  • Champion AI-Augmented Operations - Lead adoption of AI tooling across SRE workflows including automated incident triage, anomaly detection, and AI-assisted on-call response. Partner with the AI & ML Enablement team to build intelligent operational capabilities that give us a meaningful edge.
  • Monitor & Observe - Define and execute observability strategies across our stack using OpenTelemetry, CloudWatch, Amazon Timestream, Splunk, and other tools — building the telemetry foundation to detect and resolve issues before they impact developers.
  • Collaborate & Align - Build strong partnerships across engineering, product, and operations — translating reliability goals into business priorities and vice versa.
  • Plan for Scale - Lead capacity planning and performance tuning for services running on our multi-cloud Kubernetes platform spanning AWS EKS and GCP GKE, with HPA/VPA/KEDA autoscaling across clusters.
  • Measure & Improve - Establish and track SLOs, SLAs, and error budgets. Use them to drive continuous improvement in system reliability and team efficiency, and report progress regularly to executive leadership.

Benefits

  • Medical/Vision
  • Dental
  • Retirement and Paid Time Away
  • Life Insurance and Disability
  • Merchandise Discount and EAP Resources