Senior Machine Learning Engineer

Roku
Austin, TX
Hybrid

About The Position

Roku is seeking a talented and experienced Senior Software Engineer, MLOps/DevOps to join the Advertising Performance team. This role is critical in supporting and scaling our Machine Learning infrastructure. The ideal candidate will have a strong background in DevOps/SRE practices, cloud infrastructure management, and MLOps tooling, with a passion for building platforms that accelerate ML experimentation and deployment at internet scale. You will collaborate closely with ML Scientists and Engineers to streamline the end-to-end ML lifecycle, including training, evaluation, deployment, and monitoring, on a modern, cloud-native stack running on GCP and AWS using technologies like Kubernetes, Apache Airflow, Spark, Ray, MLflow, and Chronon.

Requirements

  • BS or MS in Computer Science, Engineering, or a related quantitative field.
  • 8+ years of experience in DevOps, SRE, or ML infrastructure, including 4+ years supporting large-scale ML or AI systems.
  • Strong programming skills in Python and/or Scala or Java for platform automation and tooling.
  • Deep experience with Kubernetes and container orchestration on GCP (GKE) and/or AWS (EKS).
  • Expertise with NoSQL or low-latency data stores such as Aerospike.
  • Hands-on experience with data and orchestration technologies such as Apache Spark, Apache Flink, Apache Airflow, and Kafka.
  • Experience building and maintaining CI/CD systems using tools such as Jenkins or GitLab Runner.
  • Familiarity with feature engineering platforms such as Chronon and model lifecycle tools such as MLflow.
  • Strong infrastructure-as-code experience with Terraform or similar tooling.
  • Experience with observability platforms such as Prometheus, Grafana, and Datadog.
  • Excellent communication and cross-functional collaboration skills.

Nice To Haves

  • Experience in the advertising domain.

Responsibilities

  • Lead the design and operation of scalable, production-grade cloud infrastructure for ML workloads across AWS and GCP, including GPU/TPU-based training and inference environments.
  • Architect and improve CI/CD systems for ML models and platform services to enable fast, reliable, and safe production releases.
  • Own and evolve low-latency infrastructure for real-time model inference, including key-value stores and vector databases.
  • Define and enforce observability standards for ML systems, including model performance monitoring, drift detection, capacity planning, and pipeline health metrics.
  • Participate in the on-call rotation, leading incident response and root-cause analysis for critical ML training and serving infrastructure.
  • Partner with data scientists and ML engineers to improve platform usability, accelerate model iteration, and implement strong MLOps and SRE best practices.
  • Champion operational excellence across ML infrastructure through automation, resilience engineering, disaster recovery planning, and continuous improvement.

Benefits

  • Global access to mental health and financial wellness support and resources.
  • Healthcare (medical, dental, and vision).
  • Life, accident, and disability insurance.
  • Commuter benefits.
  • Retirement options (401(k)/pension).
  • Time off in accordance with local leave policies and other personal needs.