Principal Software Engineer – AI Platform (Production Engineering / Reliability)

CVS Health
Work At Home-Texas, TX
$144,200 - $288,400

About The Position

We are seeking a Principal Individual Contributor (IC) to lead production engineering, observability, and operational excellence for our AI Platform. This role sits at the intersection of ML systems, distributed infrastructure, and production reliability, ensuring that our AI services are scalable, observable, and resilient in real-world environments. As a senior technical leader, you will define and drive best-in-class production practices, build robust monitoring and alerting ecosystems, and partner across engineering, ML, and platform teams to ensure mission-critical AI systems meet high availability, performance, and reliability standards.

Requirements

  • 10+ years in software engineering, production engineering, or SRE roles
  • Deep experience operating large-scale distributed systems in production
  • Proven track record building monitoring, observability, and alerting systems
  • Strong expertise in incident management and production support models
  • Experience working with cloud platforms (Azure, AWS, GCP)

Nice To Haves

  • Experience supporting AI/ML platforms or data-intensive systems
  • Familiarity with model lifecycle management and MLOps practices
  • Knowledge of observability tooling (OpenTelemetry, Prometheus, Grafana, Datadog)
  • Knowledge of Kubernetes and containerized workloads
  • Knowledge of streaming systems (e.g., Kafka, Event Hub)
  • Experience defining and implementing SLO-driven engineering
  • Background in high-availability, low-latency systems
  • Systems thinking and ability to reason about complex, interdependent systems
  • Strong bias for automation, scalability, and long-term solutions
  • Exceptional debugging and incident management skills
  • Ability to influence without authority across multiple teams
  • Passion for operational excellence and reliability

Responsibilities

  • Own and evolve production operations strategy for AI/ML platforms and services
  • Define SLOs, SLIs, and error budgets for AI systems (online and batch inference pipelines)
  • Lead root cause analysis (RCA) and drive systemic improvements post-incident
  • Establish operational readiness standards for launching new AI capabilities
  • Build frameworks for on-call excellence, incident response, and escalation
  • Design and implement end-to-end observability systems across AI workloads: model performance monitoring, data pipeline health, and infrastructure metrics
  • Build and scale monitoring and alerting frameworks using modern tooling (e.g., Prometheus, Grafana, OpenTelemetry, Datadog, Azure Monitor)
  • Define actionable, low-noise alerts tied to business and system impact
  • Develop dashboards and telemetry standards for real-time visibility across services
  • Drive adoption of golden signals (latency, errors, throughput, saturation) in AI systems
  • Ensure reliable deployment and operation of real-time inference services, model pipelines (training, validation, deployment), and data ingestion and feature pipelines
  • Implement model observability (drift detection, data skew, performance degradation)
  • Partner with ML engineers to improve production readiness of models
  • Establish lifecycle standards for models in production environments
  • Build internal platforms and tooling for automated incident detection and response, self-healing systems, and deployment validation and canarying
  • Drive Infrastructure as Code (IaC) and policy automation
  • Improve system resilience through chaos testing and fault injection
  • Act as a trusted technical advisor across platform, ML, and product teams
  • Set direction for operational excellence in AI systems at org scale
  • Mentor senior engineers and influence cross-team architectural decisions
  • Lead adoption of industry best practices in reliability engineering and observability

Benefits

  • medical
  • dental
  • vision coverage
  • paid time off
  • retirement savings options
  • wellness programs
  • CVS Health bonus
  • commission or short-term incentive program
  • equity award program