Senior Machine Learning Engineer, DevOps/SRE

Roku • San Jose, CA

$148,750 - $361,000 • Hybrid

About The Position

Roku is seeking a talented and experienced Senior Software Engineer, MLOps/DevOps, to join the Advertising Performance team. This role is critical to supporting and scaling the team's machine learning infrastructure. The ideal candidate will have a strong background in DevOps/SRE practices, cloud infrastructure management, and MLOps tooling, with a passion for building platforms that accelerate ML experimentation and deployment at internet scale. The role involves partnering closely with ML scientists and engineers to streamline the end-to-end ML lifecycle (training, evaluation, deployment, and monitoring) on a modern, cloud-native stack running on GCP and AWS, using technologies such as Kubernetes, Apache Airflow, Spark, Ray, MLflow, and Chronon.

Requirements

BS or MS in Computer Science, Engineering, or a related quantitative field
8+ years of experience in DevOps, SRE, or ML infrastructure, including 4+ years supporting large-scale ML or AI systems
Strong programming skills in Python, Scala, and/or Java for platform automation and tooling
Deep experience with Kubernetes and container orchestration on GCP (GKE) and/or AWS (EKS)
Expertise with NoSQL or low-latency data stores such as Aerospike or similar technologies
Hands-on experience with data and orchestration technologies such as Apache Spark, Apache Flink, Apache Airflow, and Kafka
Experience building and maintaining CI/CD systems using tools such as Jenkins or GitLab Runner
Familiarity with feature engineering platforms such as Chronon and model lifecycle tools such as MLflow
Strong infrastructure-as-code experience with Terraform or similar tooling
Experience with observability platforms such as Prometheus, Grafana, and Datadog
Excellent communication and cross-functional collaboration skills

Nice To Haves

Experience in the advertising domain

Responsibilities

Lead the design and operation of scalable, production-grade cloud infrastructure for ML workloads across AWS and GCP, including GPU/TPU-based training and inference environments
Architect and improve CI/CD systems for ML models and platform services to enable fast, reliable, and safe production releases
Own and evolve low-latency infrastructure for real-time model inference, including key-value stores and vector databases
Define and enforce observability standards for ML systems, including model performance monitoring, drift detection, capacity planning, and pipeline health metrics
Participate in the on-call rotation, leading incident response and root-cause analysis for critical ML training and serving infrastructure
Partner with data scientists and ML engineers to improve platform usability, accelerate model iteration, and implement strong MLOps and SRE best practices
Champion operational excellence across ML infrastructure through automation, resilience engineering, disaster recovery planning, and continuous improvement