Machine Learning Systems Engineer, Networking

NVIDIA • Santa Clara, CA

10d • $152,000 - $287,500

About The Position

Join our team of innovative engineers who are building an AI Data Center AIOps platform that turns raw, high-volume telemetry into reliable, job-centric insights and automation for GPU fleets. As an ML Engineer on this team, you'll design and implement ML algorithms that run in real-time streaming pipelines, detecting anomalies and surfacing insights across massive-scale infrastructure before they impact AI training and inference. The core challenge of this role is building ML algorithms that are simultaneously accurate and efficient, processing millions of telemetry streams in real time within tight CPU and memory budgets. You'll need both the data science depth to design and validate algorithms and the engineering discipline to implement them in production at scale.

Requirements

BS (or equivalent experience) with 5+ years of experience, MS with 3+ years, or PhD with 1+ years, in Computer Science, Statistics, or a related field
Strong mathematical foundation: statistics, probability, linear algebra, and algorithm analysis
Proven experience implementing and optimizing ML algorithms in production — this is a coding-first role; strong implementation skills are required
Strong programming skills in one or more of Go, C/C++, Rust, or Scala; working knowledge of Python is a plus
Familiarity with time-series databases and streaming data architectures
Ability to work independently and navigate ambiguity in a fast-paced engineering environment

Nice To Haves

Data Science background with hands-on experience building and validating ML models — bridging research and production implementation
Experience implementing ML algorithms directly in systems languages for latency-sensitive or resource-constrained environments
Research experience: staying current with the latest ML literature and translating advances into practical improvements
Experience with Kafka-based streaming pipelines and real-time feature engineering at scale

Responsibilities

Implement production ML algorithms in Go — optimized for real-time streaming pipelines operating at massive scale under strict resource constraints
Design and develop new ML algorithms where needed: anomaly detection, health scoring, and predictive analytics on high-volume time-series telemetry from GPU and network infrastructure
Improve and extend existing algorithms and experiment with new approaches suited to real-time streaming constraints
Build and maintain end-to-end ML pipelines — from data ingestion and schema design through model inference — optimized for on-premises, latency-sensitive deployments
Partner with the Data Science team on algorithm design, prototype evaluation, and translating research findings into platform requirements