Senior Manager, GPU Cloud Infrastructure - GeForce NOW

NVIDIA • Santa Clara, CA

About The Position

GeForce NOW is the global leader in cloud gaming, dedicated to making high-end play accessible on any device, from smartphones to VR headsets. We leverage NVIDIA’s premier data centers to stream over 2,000 games at up to 5K resolution and 120 FPS, ensuring a local-feel experience with ultra-low latency. Joining the GFN team means playing a vital role in advancing interactive entertainment at scale.

We are looking for a Senior Manager to lead the design, scaling, and operations of high-performance networking for GPU-based cloud infrastructure. This role is critical to enabling cloud gaming workloads, AI/ML training, and inference platforms by delivering ultra-low-latency, high-throughput, and highly reliable interconnects across data centers and cloud environments.

Requirements

12+ years of overall experience in networking, cloud infrastructure, or distributed systems, with 5+ years directly managing technical teams.
Mastery of data center networking, including Clos/spine-leaf architectures and high-performance fabrics like RDMA, RoCE, or InfiniBand.
Hands-on experience with BGP, EVPN/VXLAN, and kernel-level development for routing and switching.
Skilled in using Ansible or Terraform for infrastructure automation, paired with monitoring tools like Prometheus and Grafana.
Practical experience designing for large-scale configurations using SR-IOV, Xen virtualization, or Open vSwitch.
Bachelor’s or Master’s degree in Computer Science or a related engineering field (or equivalent experience).
Ability to ensure all infrastructure meets rigorous internal policies and regulatory standards like GDPR.

Nice To Haves

Proven success managing networking for large-scale GPU clusters or hyperscale cloud environments.
Familiarity with optical networking and high-speed interconnects reaching 400G or 800G.
Experience debugging and improving code for Mellanox/Cumulus Linux, or managing Palo Alto and NetScaler appliances.
A strong grasp of streaming telemetry and operational signals (SNMP, Syslog) to proactively resolve complex architectural bottlenecks.
Relevant top-tier certifications, such as CCIE or specialized cloud networking designations.

Responsibilities

Build and mentor a specialized team of network architects focused on high-performance GPU infrastructure.
Oversee the design of intra-cluster and inter-cluster connectivity, utilizing RoCE, Ethernet-based AI fabrics, and high-bandwidth data center interconnects.
Drive technical tuning to reduce latency and jitter and to increase throughput, while implementing congestion control and packet-loss mitigation strategies.
Define the roadmap for networking strategies that support gaming, AI/ML training, and real-time inference at scale.
Engage with ISPs to optimize low-latency edge networks and ensure a seamless connection from our data centers to end clients.
Implement Infrastructure as Code (IaC) and observability frameworks to automate provisioning, scaling, and real-time cluster health monitoring.
Work directly with AI platform teams, hardware vendors, and SRE groups to influence technology direction and vendor selection.
Establish protocols for fault tolerance and lead incident response and root cause analysis for complex network issues.