Principal Network Engineer - AI Infrastructure

CVS Health•New York, NY

2d•$144,200 - $288,400

About The Position

The Principal Network Engineer – AI Infrastructure plays a key role in building the high‑performance network infrastructure that powers the organization’s AI and GPU‑driven workloads. This position is responsible for designing and delivering scalable data center solutions that support large‑scale training and inference platforms. By leveraging modern architectures such as leaf‑spine fabrics, and aligning with leading vendor and industry reference designs, the role helps enable reliable, high‑throughput environments that directly support critical business initiatives. Working closely with engineering, platform, and security partners, this role helps connect network, compute, and security capabilities into a cohesive, high‑performing ecosystem. In addition to hands‑on technical contribution, the position provides guidance on best practices, supports the development of other engineers, and helps shape the future direction of the organization’s AI infrastructure. Through continuous improvement, thoughtful design, and a focus on performance and resilience, this role contributes to a secure and scalable foundation that supports long‑term growth and innovation.

Requirements

10+ years of experience in network engineering, with at least 5 years in a leadership, architectural, or lead engineering role delivering enterprise or cloud network initiatives end-to-end.
5+ years of experience designing and operating large-scale data center networks, including Layer 2/3 architectures (leaf-spine/Clos), EVPN/VXLAN overlays, and high-speed networking (100/200/400Gb+).
5+ years of experience with enterprise routing, switching, and network platforms, including Cisco-centric data center fabrics, protocols (BGP, OSPF, MPLS, STP), and hybrid connectivity (SD-WAN, VPN, remote access).
5+ years of experience implementing network security technologies, including Palo Alto Networks firewalls (required), NGFW, IDS/IPS, ZTNA, DLP, and micro-segmentation, with an understanding of application-aware and zero trust architectures.
3+ years of experience supporting AI/ML or GPU-based environments, including NVIDIA reference architectures and performance-optimized networking for distributed training workloads (e.g., traffic flow optimization, congestion management).
3+ years of experience with application delivery and observability technologies, including F5 load balancing, network performance monitoring tools (e.g., NetFlow, Wireshark, SolarWinds), and traffic analysis for performance tuning.

Nice To Haves

Experience designing and supporting AI factory / GPU cluster environments at scale (training and inference platforms).
Familiarity with high-performance compute networking enhancements such as RDMA over Converged Ethernet (RoCE), PFC, and ECN.
Experience with Cisco Nexus, ACI, or equivalent data center switching platforms supporting AI workloads.
Strong technical expertise with Networking and Software-Defined Networking (SDN) principles.
Strong technical expertise with developing and interpreting Network, Sequence, and Dataflow diagrams.
Understanding of at least one compliance framework (HIPAA, HITRUST, PCI, NIST, CSA).
Strong technical expertise in defining and implementing cyber resilience standards, policies, and programs for distributed cloud and network infrastructure, ensuring robust redundancy and system reliability.
Experience in influencing industry standards and contributing to open-source projects or security communities, highlighting a broader impact beyond the immediate organization.
Experience with network automation and Infrastructure as Code.
Background in high-availability and disaster recovery design.
Certifications: CCIE/CCNP, JNCIE, AWS/Azure/GCP Networking, PCNSE/PAN or Security Specialty, CISSP

Responsibilities

Partner with compute, storage, platform, and security teams to design integrated AI infrastructure solutions.
Serve as a senior technical authority aligning network designs with NVIDIA, Cisco, and industry reference architectures.
Influence enterprise network and security strategy through collaboration with engineering leadership and stakeholders.
Design and implement high-performance data center networks optimized for AI/GPU workloads, including leaf‑spine and EVPN/VXLAN fabrics.
Integrate networking with GPU clusters and high-performance storage systems supporting training and inference workloads.
Optimize network performance (latency, throughput, congestion) for large-scale distributed environments.
Evaluate and deploy advanced networking technologies to improve scalability, reliability, and security.
Support 24/7 infrastructure operations, including on-call responsibilities across cloud, on-prem, and colocation environments.
Lead incident response and resolution for network-related issues, driving root cause analysis and resilience improvements.
Mentor and develop engineers, promoting best practices in networking and security.
Support knowledge sharing through training sessions and technical enablement.
Evaluate and adopt emerging AI infrastructure and networking technologies (e.g., high-speed interconnects, next-gen switching).
Contribute to research, innovation, and continuous improvement of network and security capabilities.
Define and drive the data center network strategy supporting AI/ML platforms and business initiatives.
Establish standards and reference architectures aligned with industry best practices.
Guide long-term roadmap decisions, balancing performance, scalability, security, and risk.