Senior Site Reliability Engineer - Managed Kubernetes

Lambda, San Francisco, CA
Onsite

About The Position

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU. If you'd like to build the world's best AI cloud, join us.

Note: This position requires presence in our San Francisco/San Jose or Bellevue office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.

Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems, as well as internal tooling for system deployment, management, and maintenance.

Requirements

  • 6+ years of experience in an SRE, operations engineer, or similar role, with deep knowledge of running Linux clusters and systems
  • Strong programming skills in Go and Python; experience with GitOps (e.g., ArgoCD), Helm, and Kubernetes operators
  • Proven experience operating Kubernetes clusters in production environments (on-prem, EKS, GKE, or similar)
  • Can work either independently with limited direction or as part of a team
  • Can work with customers during incidents via tickets, live messaging, or as part of a larger call
  • Familiarity with observability tools like Prometheus, Grafana, and Fluent Bit, as well as CI/CD pipelines
  • Proven experience provisioning Kubernetes using tools such as kubeadm, Cluster API, or similar

Nice To Haves

  • Deep Kubernetes expertise: CRDs, CSI, CNI, and experience writing Kubernetes operators
  • Exposure to HPC clusters, AI/ML workloads, or large-scale GPU clusters
  • Hybrid or multi-cloud Kubernetes environment experience
  • Contributions to CNCF projects or Kubernetes SIGs

Responsibilities

  • Operate and maintain bare-metal Kubernetes clusters, scaling up to thousands of nodes
  • Handle cluster degradation, recovery, resizing, and incident response using fleet management tools
  • Participate in a well-managed on-call rotation for critical incidents
  • Assist customers with Kubernetes questions, workload integration, storage, and authentication
  • Work closely with our HPC Ops and Datacenter Ops teams for low-level or cross-functional issues
  • Use Python and Go to create tooling and automate the validation of platform quality
  • Design, build, and maintain scalable control plane services, operators, and custom controllers for Kubernetes
  • Develop automation for cluster lifecycle management: provisioning, upgrades, patching, and deletion
  • Define and implement SLOs and SLIs for Kubernetes services, workloads, and platform reliability

Benefits

  • Work on cutting-edge Managed Kubernetes platforms for AI/ML workloads
  • Influence the platform roadmap and help shape operations and reliability best practices
  • Collaborate with highly skilled engineers
  • Opportunity to mentor and grow within a fast-growing, technology-driven environment
  • Generous cash and equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401(k) plan with 2% company match (USA employees)
  • Flexible paid time off plan that we all actually use