Staff Observability Platform Engineer (SRE)

CVS Health, Scottsdale, AZ

About The Position

We’re building a world of health around every individual, shaping a more connected, convenient and compassionate health experience. At CVS Health®, you’ll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold ourselves accountable and prioritize safety and quality in everything we do. Join us and be part of something bigger: helping to simplify health care one person, one family and one community at a time.

Position Summary

CVS Health PBM is looking for hands-on, passionate people who want to join a high-energy, growing team at the forefront of digital innovation, one that aims to reinvent what a pharmacy and a health care company can be in the digital world.

As a Lead Platform Reliability Engineer, you will design and implement metrics and observability frameworks with a strong focus on service level objectives (SLOs), service level indicators (SLIs), error budgets, and cloud infrastructure scaling and capacity estimation. This individual contributor role is critical to enhancing our monitoring and observability capabilities, while also driving automation initiatives related to quality gates within the release engineering process. You will work closely with cross‑functional teams to ensure the reliability, performance, and scalable growth of our cloud‑based systems.

Requirements

  • 10+ years of experience in Software Engineering, Platform Engineering, or SRE.
  • 7+ years of experience with observability practices, including SLIs/SLOs/SLAs, alerting, and incident management.
  • 7+ years building production-grade backend services in Java or Python.
  • 7+ years implementing and operating OpenTelemetry, including OTLP, semantic conventions, and instrumentation patterns.
  • 7+ years with cloud-native and containerized platforms (Docker, Kubernetes, Argo CD).
  • 7+ years working with public cloud platforms (AWS, GCP, or Azure).
  • 5+ years designing and scaling distributed, high‑volume data pipelines.
  • 5+ years working with Grafana OSS or comparable observability backends (e.g., Grafana, Loki, Tempo, Prometheus).
  • 5+ years with relational databases (PostgreSQL, MySQL).
  • Bachelor’s degree or equivalent experience (HS diploma + 4 years relevant experience)

Nice To Haves

  • Excellent analytical skills and the ability to communicate complex technical concepts to non-technical stakeholders
  • Experience with service meshes and networking technologies such as Envoy and Istio
  • Experience integrating or operating commercial observability platforms (Splunk, AppDynamics, etc.)
  • Experience with streaming and data platforms such as Kafka, Pulsar, or similar technologies
  • Familiarity with time-series, NoSQL, or analytical databases (ClickHouse, Bigtable, Cassandra, etc.)
  • Experience with Infrastructure as Code tools such as Terraform or CloudFormation
  • Experience with cost optimization and capacity planning for large-scale cloud infrastructure
  • Experience with chaos engineering, resiliency testing, or fault injection
  • Background in security‑aware platform design, including secure service‑to‑service communication
  • Experience mentoring senior engineers and influencing platform standards across organizations
  • Strong operational experience supporting 24x7 production systems, including on‑call responsibilities
  • Knowledge of security best practices in cloud environments

Responsibilities

  • Define, implement, and maintain key performance metrics, SLOs, and SLIs to measure system reliability and performance. Ensure alignment with business objectives and operational goals.
  • Manage error budgets effectively, collaborating with development teams to balance reliability and feature delivery. Analyze incidents and outages to inform adjustments to error budgets.
  • Design and implement comprehensive monitoring solutions to provide real-time visibility into system health. Utilize tools such as Prometheus, Grafana, Loki, Tempo, and other observability platforms to create dashboards and alerts.
  • Architect, design, and implement scalable cloud infrastructure capable of supporting multiple business applications, ensuring reliability, performance, and future growth.
  • Develop and implement automated quality gates that ensure all releases meet defined reliability and performance standards. Lead the release DevOps team in integrating these gates into the CI/CD pipeline.
  • Assist in incident response efforts by providing insights from metrics and monitoring tools. Conduct post-mortem analyses to identify root causes and recommend preventive measures.

Benefits

  • medical
  • dental
  • vision coverage
  • paid time off
  • retirement savings options
  • wellness programs

What This Job Offers

Job Type

Full-time

Career Level

Senior

Education Level

Associate degree

Number of Employees

5,001-10,000 employees
