Executive Director, AI Infrastructure & Platform Engineering

CVS Health | York, PA
$175,100 - $334,750 | Onsite

About The Position

The Executive Director, AI Infrastructure & Platform Engineering is a senior engineering leadership role responsible for standing up, operating, and continuously improving CVS Health's on-premises AI compute platform. This position owns the physical and platform layers of CVS Health's Enterprise AI Factory: a frontier-class GPU compute environment running NVIDIA Blackwell systems across a high-throughput RoCE v2 fabric, hosted in co-located data center facilities, with multi-site expansion underway.

Reporting to the Global Head of Infrastructure/AI Operations and Service Delivery, this leader will establish operational baselines across the full infrastructure stack (hardware, network fabric, GPU clusters, storage, and the operating systems and orchestration layers above) and build the Site Reliability Engineering practice that delivers the availability, reliability, and performance that frontier AI workloads demand.

This is a greenfield organizational build. The Executive Director will define the operating model, set the engineering standards, hire and develop the team, and establish the long-term operations capability that will govern CVS Health's AI infrastructure for years ahead.

Requirements

  • 10+ years of engineering leadership experience, with substantial time directly owning physical infrastructure at data center scale — including hardware lifecycle, capacity planning, and facility coordination (power, cooling, rack-and-stack execution).
  • Hands-on production ownership of bare-metal Kubernetes or OpenShift. Managed cloud services (EKS, GKE, AKS) alone do not substitute for the practitioner expertise this role requires.
  • Fluency with high-speed cluster fabrics — RoCE v2, InfiniBand, EVPN-VXLAN, or carrier-grade equivalent — and the operational discipline these fabrics require (PFC, ECN, lossless tuning, congestion management).
  • 5+ years leading multiple technical teams simultaneously, including 24/7 operations organizations, with measurable team health, retention, and performance outcomes.
  • Proven success establishing and enforcing operational baselines, SLO / SLI / error-budget frameworks, and observability-driven continuous improvement in physical-infrastructure-anchored environments.
  • Hardware lifecycle, vendor accountability, and facility coordination experience — including capacity planning, RMA management, and multi-vendor escalation.
  • Experience leading operational transitions or organizational build-outs at scale, with business continuity and minimal disruption as non-negotiables.
  • Executive-level stakeholder communication, vendor negotiation, and budget ownership.
  • Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or related technical field.

Nice To Haves

  • Hands-on experience with Cisco UCS, NVIDIA HGX / DGX / Blackwell systems, and VAST or comparable distributed NVMe storage.
  • Direct experience operating GPU clusters of 32 or more GPUs in production environments — including HPC, AI training, research computing, or comparable workloads.
  • NVIDIA AI Enterprise, NVIDIA Run:AI, NVIDIA Base Command Manager, or comparable GPU orchestration platform experience.
  • Healthcare or other regulated-industry background (HIPAA, NIST AI RMF, SOX, FedRAMP, ITAR).
  • Chaos engineering and AI-driven operations experience — predictive alerting and automated remediation patterns.
  • Background in innovation programs, POD structures, or centers of excellence.

Responsibilities

  • Define and execute the long-range vision and strategy for AI infrastructure and platform engineering, with availability (>99.99%), reliability, and platform performance as the primary measures of success.
  • Recruit, hire, develop, and retain a high-performing engineering organization spanning infrastructure, network, platform reliability, observability, security, 24/7 operations, change and release management, and FinOps.
  • Establish clear ownership, accountability, and performance expectations across all functional teams; foster a culture of operational excellence, engineering rigor, and continuous improvement.
  • Provide executive-level communication to senior leadership on platform status, milestones, risk posture, and strategic initiatives.
  • Own the physical layer of the AI compute environment — GPU compute, storage, network fabric, capacity planning, and hardware lifecycle accountability.
  • Direct bare-metal Kubernetes and OpenShift operations, including cluster administration, GPU quota governance, infrastructure-as-code adoption, and availability baseline enforcement.
  • Govern high-performance network fabric operations — RoCE v2, spine-leaf topology, lossless Ethernet tuning, congestion management, and segmentation.
  • Establish and enforce operational baselines across every layer of the stack — hardware, fabric, platform, and workload — with deviations detected, escalated, and resolved within defined SLAs.
  • Direct Innovation POD strategy to develop self-healing and autonomous capabilities that proactively prevent service degradation before it impacts availability.
  • Build and sustain a high-performing 24/7 operations model — designed for sustainable, predictable coverage with no mandatory overtime and measurable team health and retention.
  • Drive end-to-end observability across the physical and platform layers, with continuous feedback loops connecting monitoring data to incident response, change decisions, and improvement cycles.
  • Oversee change management so every modification is risk-assessed, monitored during rollout, and baseline-validated post-deployment.
  • Ensure configuration consistency and drift detection across all platform components to prevent baseline degradation over time.
  • Lead GPU FinOps governance — utilization optimization, tenant quota enforcement, and cost reduction — in partnership with the Finance organization.
  • Empower the Security SRE Lead to maintain a world-class security posture across the infrastructure and platform layers, with robust compliance with frameworks including HIPAA and NIST AI RMF.
  • Govern access controls, audit logging, vulnerability management, and network segmentation across the AI compute environment.
  • Lead the operational transition from program-launch staffing to permanent CVS-owned operations — governing phased handoffs, competency validation, and milestone sign-offs to ensure minimal disruption to platform availability and business operations.
  • Establish and lead the long-term operating model by institutionalizing key technical, architectural, and delivery leadership capabilities into permanent CVS roles, ensuring the organization is fully self-sustaining at program close.
  • Own vendor relationships, contract performance, and accountability across the hardware, networking, platform, and managed-services stack.
  • Manage budget ownership for the AI infrastructure and platform engineering organization, including capital planning and operational expense governance.

Benefits

  • medical coverage
  • dental coverage
  • vision coverage
  • paid time off
  • retirement savings options
  • wellness programs