Machine Learning Systems Engineer, Networking

NVIDIA, Santa Clara, CA
$152,000 - $287,500

About The Position

Join our team of innovative engineers who are building an AI Data Center AIOps platform that turns raw, high-volume telemetry into reliable, job-centric insights and automation for GPU fleets. As an ML Engineer on this team, you'll design and implement ML algorithms that run in real-time streaming pipelines, detecting anomalies and surfacing insights across massive-scale infrastructure before they impact AI training and inference. The core challenge of this role is building ML algorithms that are simultaneously accurate and efficient: they must process millions of telemetry streams in real time within tight CPU and memory budgets. You'll need both the data science depth to design and validate algorithms and the engineering discipline to implement them in production at scale.

Requirements

  • BS (or equivalent experience) with 5+ years of experience, MS with 3+ years, or PhD with 1+ years, in Computer Science, Statistics, or a related field
  • Strong mathematical foundation: statistics, probability, linear algebra, and algorithm analysis
  • Proven experience implementing and optimizing ML algorithms in production — this is a coding-first role; strong implementation skills are required
  • Strong programming skills in one or more of Go, C/C++, Rust, or Scala; Python working knowledge is a plus
  • Familiarity with time-series databases and streaming data architectures
  • Ability to work independently and navigate ambiguity in a fast-paced engineering environment

Nice To Haves

  • Data Science background with hands-on experience building and validating ML models — bridging research and production implementation
  • Experience implementing ML algorithms directly in systems languages for latency-sensitive or resource-constrained environments
  • Research experience: knowing the latest ML literature and translating advances into practical improvements
  • Experience with Kafka-based streaming pipelines and real-time feature engineering at scale

Responsibilities

  • Implement production ML algorithms in Go — optimized for real-time streaming pipelines operating at massive scale under strict resource constraints
  • Design and develop new ML algorithms where needed: anomaly detection, health scoring, and predictive analytics on high-volume time-series telemetry from GPU and network infrastructure
  • Improve and extend existing algorithms and experiment with new approaches suited to real-time streaming constraints
  • Build and maintain end-to-end ML pipelines — from data ingestion and schema design through model inference — optimized for on-premises, latency-sensitive deployments
  • Partner with the Data Science team on algorithm design, prototype evaluation, and translating research findings into platform requirements

Benefits

  • Competitive salaries
  • Generous benefits package
  • Equity