Software Development Engineer - SRE

CVS Health

About The Position

We’re building a world of health around every individual, shaping a more connected, convenient and compassionate health experience. At CVS Health®, you’ll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold themselves accountable and prioritize safety and quality in everything they do. Join us and be part of something bigger: helping to simplify health care one person, one family and one community at a time.

We are seeking a Site Reliability Engineer with a strong focus on observability to design, implement, and operate monitoring and alerting solutions for mission-critical enterprise applications. This role is responsible for building proactive, actionable observability across services, batch workloads, infrastructure, databases, and logs using tools such as Grafana, Prometheus, Loki, and Tempo. The ideal candidate is passionate about reliability engineering, signal-to-noise optimization, and enabling teams to detect and resolve issues before they impact customers.

Requirements

5+ years of experience in Site Reliability Engineering, DevOps, or Production Operations.
Hands-on expertise with Prometheus, Grafana, Loki, and Tempo in large-scale, production environments.
Strong understanding of monitoring distributed systems spanning both on-premises and cloud environments (GCP, Azure).
Experience defining SLOs/SLIs and building alerting strategies based on reliability engineering best practices.
Exceptional attention to detail with the ability to think through complex systems end-to-end, anticipate edge cases, failure modes, and cascading impacts, and proactively design monitoring and alerting to cover both common and rare operational scenarios.
Bachelor’s degree or equivalent experience (HS diploma + 4 years of relevant experience).

Responsibilities

Design and maintain a comprehensive observability platform using Grafana, Prometheus, Loki, and Tempo.
Implement proactive monitoring and alerting for:
  Microservices and APIs (latency, error rates, availability)
  Batch jobs, scheduled workloads, and ETL/data pipelines (success/failure, duration, SLA adherence)
  Server and container health (CPU, memory, disk, network, capacity trends)
  Database health and performance (availability, replication, query latency, resource utilization)
  Application and infrastructure logging, including centralized log ingestion, indexing, and search
Build actionable alerts with clear runbooks, ownership, and escalation paths to minimize mean time to detect (MTTD) and mean time to resolve (MTTR).
Partner with application, platform, and DevOps teams to instrument services with metrics, traces, and structured logs.
Continuously improve signal quality by reducing alert noise, eliminating false positives, and optimizing thresholds based on historical trends.
Create and maintain dashboards for real-time operational visibility and executive-level health reporting.
Support incident response and post-incident reviews by providing high-fidelity telemetry and contributing to root cause analysis.