Site Reliability Engineer

Priceline•Norwalk, CT

Hybrid

About The Position

Our Technology team is the backbone of our company: constantly creating, testing, learning and iterating to better meet the needs of our customers. If you thrive in a fast-paced, ideas-led environment, you’re in the right place.

Why this job’s a big deal: As Priceline caters to the global market, enhancing the performance of highly scalable, high-performance web-based products continues to be a main focus. We strive to build infrastructure and eliminate manual work through automation. Practices such as limiting time spent on operational work and proactively identifying potential outages feed the iterative improvement that is key to both product quality and interesting, dynamic day-to-day work.

Requirements

Bachelor’s Degree in Computer Science or equivalent hands‑on experience
2+ years working with Linux in production or containerized environments (e.g., RHEL/Rocky, Ubuntu, Alpine/Distroless)
Minimum 2 years of working experience in cloud environments, preferably GCP, running and supporting real workloads
Minimum 2 years of working experience in Kubernetes (GKE preferred), including deploying, operating, and troubleshooting applications at scale.
Demonstrated experience in one or more scripting languages with strong proficiency in Python and Bash (familiarity with Go is a plus)
Solid understanding of infrastructure scalability issues and distributed systems design.
Comfort with large-scale production systems and technologies (load balancing, monitoring, distributed systems, and configuration management; experience with service mesh technologies such as Istio is a plus)
General knowledge of Infrastructure as Code and configuration management tools, primarily Terraform and Ansible
Experience with modern observability tools (e.g., Splunk, New Relic, Prometheus, Grafana, OpenTelemetry) and in designing effective alerts and SLIs.
Understanding of SRE principles including SLOs, SLIs, error budgets, and toil reduction
Familiarity with AI‑assisted development tools (e.g., Claude Code, GitHub Copilot, Cursor) and a willingness to integrate them into daily workflows
Illustrated history of living the values necessary to Priceline: Customer, Innovation, Team, Accountability and Trust.
The Right Results, the Right Way is not just a motto at Priceline; it’s a way of life.
Unquestionable integrity and ethics are essential.

Nice To Haves

Experience building or contributing to internal platforms, self‑service workflows, or operational APIs used by other engineers is a plus

Responsibilities

Actively participate in deploying and supporting applications in our Cloud and Kubernetes environments.
Collaborate with development teams to support and evolve our cloud‑native architecture, providing platforms and self‑service capabilities to developers.
Reduce manual intervention and turn‑around time to solve repetitive problems while automating operational workflows and improving observability (logging, metrics, traces, alerting) for Priceline’s commercial website and services.
Develop software and provide hands-on technical knowledge to design, deploy, and optimize large‑scale, massively distributed, fault‑tolerant systems running on Cloud SaaS and Kubernetes.
Take ownership of automation, scripting, and tooling, both new and existing, to help the team achieve maximum practical automation of daily tasks and to build reusable internal platforms and self‑service workflows.
Support services before they go live through activities such as system design review, capacity management and launch reviews, with a focus on reliability, observability, and safe deployment practices.
Be part of an on‑call rotation to support production systems, coordinate incident response, and participate in blameless postmortems.
Improve and tune operational efficiency within the Linux‑based infrastructure and production environment, with a focus on containerized workloads and cloud‑native services.
Operate and scale hundreds of applications and services in multiple geo locations using clusters, load balancers, and service mesh (Istio) rather than manual server management.
Design, implement, and iterate on SLOs, SLIs, and error budgets with product teams to guide reliability decisions and balance system stability with delivery speed.
Use AI‑assisted development tools (e.g., Claude Code, GitHub Copilot, Cursor) to accelerate automation, diagnostics, and documentation while maintaining strong code review and testing practices.
Contribute to and troubleshoot GitOps‑based deployment workflows (GitHub Actions, Helm/Kustomize, ArgoCD) and policy‑as‑code (e.g., Kyverno) for safer, auditable changes.
Participate in an on‑call rotation, lead incident response when needed, and contribute to blameless postmortems that drive concrete reliability improvements.
Build and evolve internal developer platforms and self‑service tooling (APIs, CLIs, web UIs) that reduce SRE toil and increase developer autonomy.