Etsy's Services Infrastructure group is looking for a Site Reliability Engineer II to join us in our mission of building and supporting reliable, large-scale Kubernetes infrastructure. The SRE team owns several aspects of business-critical services (search retrieval and ranking) and machine learning model infrastructure (Kubernetes hosted on Google Cloud) that enable engineers to build and release efficiently, and it supports the uptime of the critical systems behind etsy.com. You will play an instrumental role in crafting the future architecture of how we run our systems in the cloud while being part of a dynamic international team. You'll get exposure to a variety of technologies, including Kubernetes, Golang, LLMs, Model Serving, Search Retrieval & Ranking, and more, as you build systems to support the services that serve our 86M active buyers and 5.5M sellers!

As the Software Engineer II, SRE, you will drive the adoption of containers and Kubernetes, improve reliability, automate operations, provide a self-service runtime platform to accelerate Etsy's product and ML engineering, and contribute to the design and implementation of Observability and CI/CD on top of Kubernetes. Do you find joy in improving developer velocity and have the itch to work on complex, large-scale distributed systems? If so, this could be the perfect match.

This is a full-time position reporting to the Senior Engineering Manager. In addition to salary, you will also be eligible for an equity package, an annual performance bonus, and our competitive benefits that support you and your family as part of your total rewards package at Etsy.

For this role, we are considering candidates based in the United States. Candidates living within commutable distance of Etsy's Brooklyn Office Hub or in the San Francisco Bay Area may be the first to be considered.
For candidates within commutable distance, Etsy requires in-office attendance once or twice per week, depending on your proximity to the office. Etsy offers different work modes to meet the variety of needs and preferences of our team. Learn more details about our work modes and workplace safety policies here.

What's this team like at Etsy?
This team improves the developer experience around building, deploying, releasing, and observing services and ML models transparently on Google Kubernetes Engine (GKE).
This team handles 20+ Kubernetes clusters with hundreds of nodes running services with low-latency requirements.
This team standardizes cluster and application security with common admission policies and container vulnerability scanning, and establishes standard SLIs/SLOs for all services running on Kubernetes.
This team works closely with many product and enablement teams across Etsy.
Build and support the CI/CD platform (Buildkite) used by more than a few hundred engineers to deploy their workloads to GKE.
Maintain and upgrade GKE add-ons (CertManager, Gatekeeper), ingress controllers (Contour, Envoy), various telemetry components (kube-prometheus, AlertManager, Karma), and Container Security.

Here's a sneak peek into our roadmap for the next year:
Support multiple Search, ML & Gen AI teams to efficiently utilise GPUs across different zones and regions.
Evaluate Build vs Buy decisions within the LLM space.
Enable service mesh across GKE and enable a native way of accessing services across the stack.
Standardize cluster and application security and container vulnerability scanning (both during build and at run time).
Job Type
Full-time
Career Level
Mid Level