Senior Software Development Engineer (Site Reliability)

CVS Health•Richardson, TX

2d•$92,700 - $203,940•Hybrid

About The Position

We’re building a world of health around every individual, shaping a more connected, convenient and compassionate health experience. At CVS Health®, you’ll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold ourselves accountable and prioritize safety and quality in everything we do. Join us and be part of something bigger, helping to simplify health care one person, one family and one community at a time.

Position Summary

The Site Reliability Engineer (SRE) is responsible for ensuring the reliability, availability, performance, and operational scalability of the myPBM platform. This role applies software engineering practices to operations, with a focus on automation, observability, incident management, and continuous improvement to support the stable, scalable delivery of client-facing services. The SRE partners closely with DevOps, Engineering, Infrastructure, and Security teams to balance system reliability with delivery velocity while maintaining compliance with enterprise standards.

We prefer that this person work a hybrid schedule in Richardson, TX; Northbrook, IL; or Scottsdale, AZ.

Requirements

5+ years of experience in site reliability engineering, DevOps, or platform engineering.
Experience with monitoring and observability tools such as Splunk and AppDynamics.
Experience with cloud platforms, preferably Azure, including AKS and Kubernetes.
Experience with CI/CD pipelines using GitHub Actions, Jenkins, or similar tools.
Strong understanding of incident management and root cause analysis; monitoring, alerting, and logging practices; and infrastructure and networking fundamentals.
Scripting experience with Python, Bash, or PowerShell.

Nice To Haves

Experience in healthcare or other regulated environments.
Knowledge of site reliability engineering principles, including SLIs, SLOs, and error budgets.
Familiarity with DevSecOps practices and compliance requirements.
Experience supporting large-scale distributed systems.

Responsibilities

Ensure high availability, resiliency, and performance of myPBM applications and infrastructure.
Define and manage SLIs, SLOs, and SLAs for critical services.
Monitor production systems and proactively identify issues before customer impact.
Lead incident response, triage, and root cause analysis (RCA).
Drive continuous improvement to reduce repeat incidents and operational toil.
Implement and maintain end-to-end observability across UI, APIs, and infrastructure layers.
Build and manage monitoring solutions using AppDynamics (APM, RUM, synthetic monitoring) and Splunk (logs, dashboards, and error tracking).
Design actionable alerts and escalation workflows using tools such as xMatters and MIR3.
Standardize dashboards and ensure data accuracy and visibility.
Continuously optimize alerting to reduce noise and improve signal quality.
Support and enhance CI/CD pipelines, including GitHub Actions and enterprise pipeline solutions.
Enforce deployment guardrails, release governance, and production readiness checks.
Support build and deployment failure triage and rollback strategies.
Partner with development teams to improve deployment reliability and automation.
Ensure adherence to change management (CAB/SNOW) and release policies.
Manage and support cloud infrastructure, including AKS, compute, storage, and networking.
Ensure platform health, capacity monitoring, and performance optimization.
Support infrastructure provisioning and environment setup.
Drive disaster recovery (DR) readiness and failover validation, including RTO and RPO objectives.
Enable application onboarding onto standardized enterprise platforms.
Implement continuous security monitoring and vulnerability remediation.
Manage secrets, certificates, and identity integration, including IAM onboarding.
Ensure compliance with CVS security standards, audit requirements, and production readiness controls.
Enforce shift-left security practices in CI/CD pipelines.
Participate in a 24x7 on-call rotation and incident response.
Partner with Production Support to resolve incidents.
Ensure monitoring and alerting gaps are identified and closed.
Maintain incident documentation and improve standard operating procedures.
Support the full issue detection, triage, resolution, and prevention lifecycle.
Automate repetitive operational tasks to reduce toil.
Implement infrastructure as code (IaC) practices.
Continuously improve deployment pipelines, monitoring, and observability.
Enable predictive insights and proactive issue prevention.
Work closely with engineering, DevOps, infrastructure, and security teams.
Enable a shared ownership model for reliability and operations.
Provide guidance on production readiness and operational best practices.