Senior Machine Learning Engineer, DevOps/SRE

Roku
San Jose, CA
$148,750 - $361,000
Hybrid

About The Position

Roku is seeking a talented and experienced Senior Software Engineer, MLOps/DevOps, to join the Advertising Performance team. This role is critical in supporting and scaling the Machine Learning infrastructure. The ideal candidate will have a strong background in DevOps/SRE practices, cloud infrastructure management, and MLOps tooling, with a passion for building platforms that accelerate ML experimentation and deployment at internet scale. The role involves partnering closely with ML Scientists and Engineers to streamline the end-to-end ML lifecycle across training, evaluation, deployment, and monitoring on a modern, cloud-native stack running on GCP and AWS using technologies like Kubernetes, Apache Airflow, Spark, Ray, MLflow, and Chronon.

Requirements

  • BS or MS in Computer Science, Engineering, or a related quantitative field
  • 8+ years of experience in DevOps, SRE, or ML infrastructure, including 4+ years supporting large-scale ML or AI systems
  • Strong programming skills in Python, Scala, or Java for platform automation and tooling
  • Deep experience with Kubernetes and container orchestration on GCP (GKE) and/or AWS (EKS)
  • Expertise with NoSQL or low-latency data stores such as Aerospike or similar technologies
  • Hands-on experience with data and orchestration technologies such as Apache Spark, Apache Flink, Apache Airflow, and Kafka
  • Experience building and maintaining CI/CD systems using tools such as Jenkins or GitLab Runner
  • Familiarity with feature engineering platforms such as Chronon and model lifecycle tools such as MLflow
  • Strong infrastructure-as-code experience with Terraform or similar tooling
  • Experience with observability platforms such as Prometheus, Grafana, and Datadog
  • Excellent communication and cross-functional collaboration skills

Nice To Haves

  • Experience in the advertising domain

Responsibilities

  • Lead the design and operation of scalable, production-grade cloud infrastructure for ML workloads across AWS and GCP, including GPU/TPU-based training and inference environments
  • Architect and improve CI/CD systems for ML models and platform services to enable fast, reliable, and safe production releases
  • Own and evolve low-latency infrastructure for real-time model inference, including key-value stores and vector databases
  • Define and enforce observability standards for ML systems, including model performance monitoring, drift detection, capacity planning, and pipeline health metrics
  • Participate in the on-call rotation, leading incident response and root-cause analysis for critical ML training and serving infrastructure
  • Partner with data scientists and ML engineers to improve platform usability, accelerate model iteration, and implement strong MLOps and SRE best practices
  • Champion operational excellence across ML infrastructure through automation, resilience engineering, disaster recovery planning, and continuous improvement

Benefits

  • health insurance
  • equity awards
  • life insurance
  • disability benefits
  • parental leave
  • wellness benefits
  • paid time off
  • global access to mental health and financial wellness support and resources
  • healthcare (medical, dental, and vision)
  • accident
  • commuter
  • retirement options (401(k)/pension)