Staff Observability Platform Engineer (SRE)

CVS Health • Scottsdale, AZ

About The Position

We’re building a world of health around every individual, shaping a more connected, convenient and compassionate health experience. At CVS Health®, you’ll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold ourselves accountable and prioritize safety and quality in everything we do. Join us and be part of something bigger: helping to simplify health care one person, one family and one community at a time.

Position Summary

CVS Health PBM is looking for hands-on, passionate people who want to join a high-energy, growing team at the forefront of digital innovation that aims to reinvent what a pharmacy and a health care company can be in the digital world.

As a Lead Platform Reliability Engineer, you will design and implement metrics and observability frameworks with a strong focus on service level objectives (SLOs), service level indicators (SLIs), error budgets, and cloud infrastructure scaling and capacity estimation. This individual contributor role is critical to enhancing our monitoring and observability capabilities, while also driving automation initiatives related to quality gates within the release engineering process. You will work closely with cross‑functional teams to ensure the reliability, performance, and scalable growth of our cloud‑based systems.

Requirements

10+ years of experience in Software Engineering, Platform Engineering, or SRE.
7+ years of experience with observability practices, including SLIs/SLOs/SLAs, alerting, and incident management.
7+ years building production-grade backend services in Java/Python.
7+ years implementing and operating OpenTelemetry, including OTLP, semantic conventions, and instrumentation patterns.
7+ years with cloud-native and containerized platforms (Docker, Kubernetes, Argo CD).
7+ years working with public cloud platforms (AWS, GCP, or Azure).
5+ years designing and scaling distributed, high‑volume data pipelines.
5+ years working with Grafana OSS or comparable observability backends (e.g., Grafana, Loki, Tempo, Prometheus).
5+ years with relational databases (PostgreSQL, MySQL).
Bachelor’s degree or equivalent experience (HS diploma + 4 years relevant experience)

Nice To Haves

Excellent analytical skills and the ability to communicate complex technical concepts to non-technical stakeholders
Experience with service meshes and networking technologies such as Envoy and Istio
Experience integrating or operating commercial observability platforms (Splunk, AppDynamics, etc.)
Experience with streaming and data platforms such as Kafka, Pulsar, or similar technologies
Familiarity with time-series, NoSQL, or analytical databases (ClickHouse, Bigtable, Cassandra, etc.)
Experience with Infrastructure as Code tools such as Terraform or CloudFormation
Experience with cost optimization and capacity planning for large-scale cloud infrastructure
Experience with chaos engineering, resiliency testing, or fault injection
Background in security‑aware platform design, including secure service‑to‑service communication
Experience mentoring senior engineers and influencing platform standards across organizations
Strong operational experience supporting 24x7 production systems, including on‑call responsibilities
Knowledge of security best practices in cloud environments

Responsibilities

Define, implement, and maintain key performance metrics, SLOs, and SLIs to measure system reliability and performance. Ensure alignment with business objectives and operational goals.
Manage error budgets effectively, collaborating with development teams to balance reliability and feature delivery. Analyze incidents and outages to inform adjustments to error budgets.
Design and implement comprehensive monitoring solutions to provide real-time visibility into system health. Utilize tools such as Prometheus, Grafana, Loki, Tempo, and other observability platforms to create dashboards and alerts.
Architect, design, and implement scalable cloud infrastructure capable of supporting multiple business applications, ensuring reliability, performance, and future growth.
Develop and implement automated quality gates that ensure all releases meet defined reliability and performance standards. Lead the release DevOps team to integrate these gates into the CI/CD pipeline.
Assist in incident response efforts by providing insights from metrics and monitoring tools. Conduct post-mortem analyses to identify root causes and recommend preventive measures.