Core & MLOps Team Lead

Jobgether
Remote

About The Position

This role is ideal for an experienced technical leader in MLOps and distributed systems, responsible for building and maintaining the scalable infrastructure that supports mission-critical services. You will lead a cross-functional team in designing platforms for model training, orchestration, deployment, and monitoring while ensuring high performance, reliability, and security. The position combines hands-on engineering with strategic team leadership, driving adoption of best practices, automation, and observability across the organization. You will collaborate with product, operations, and security teams to implement robust platforms that empower engineers to build and deploy services confidently. Mentorship, knowledge sharing, and establishing production-ready standards are central to your impact. This role allows you to shape platform strategy while staying deeply engaged in cutting-edge technologies and ML operations at scale.

Requirements

  • 5+ years building distributed systems; 3+ years in MLOps or ML platform engineering
  • Strong knowledge of Linux/OS internals, networking, concurrency, and performance profiling
  • Deep expertise in Kubernetes (bonus: Mesos) and GPU infrastructure management
  • Proficiency in at least one high-performance language (Java, Rust, Go, or C++), plus strong Python skills
  • Experience designing and operating production model platforms (registry, training, serving, monitoring)
  • Proven experience leading technical teams and implementing organization-wide platform solutions
  • Familiarity with CI/CD, SRE practices, observability, and reliability enablement
  • Strong collaboration, mentoring, and communication skills

Nice To Haves

  • Experience with streaming/workflow tools (Kafka, Argo, Temporal, Airflow)
  • Hands-on work with eBPF observability, perf tooling, or io_uring
  • Expertise in ML/AI cost optimization, multi-tenant quotas, and fairness
  • Experience authoring Golden Paths (service templates, CI/CD blueprints, scaffolds)

Responsibilities

  • Lead the Core & MLOps team, overseeing roadmap, prioritization, delivery, and mentoring
  • Design, develop, and maintain scalable infrastructure for model training, serving, and monitoring
  • Build and maintain the Golden Path: reference repositories, scaffold CLIs, CI/CD pipelines, runtime contracts, and production-ready defaults
  • Operate secure, multi-tenant model registries and orchestration platforms with standardized experiment and evaluation frameworks
  • Integrate AI/ML capabilities as managed platform services with cost and governance controls
  • Collaborate with product engineering, operations, and security teams on adoption and rollout plans
  • Promote best practices in observability, reliability, cost governance, and platform standardization

Benefits

  • Flexible remote work environment, fully distributed globally
  • Exposure to cutting-edge open-source technologies and ML infrastructure
  • Collaborative, multi-cultural team fostering innovation and knowledge sharing
  • Freedom to shape platform architecture and engineering practices
  • Opportunities for career growth and technical leadership impact