ML Platform Engineer

Avride · Austin, TX

About The Position

As an ML Platform Engineer at Avride, you'll own critical pieces of the ML stack: workflow orchestration, distributed execution, resource governance, and performance. You will shape how ML teams across the company run experiments and train models at scale, building the abstractions and services that make training workloads on Kubernetes reliable, cost-efficient, and fast, with an excellent developer experience.

Requirements

  • Strong proficiency in Python or Go; C++ is a plus
  • Track record of designing and building scalable, maintainable systems and services
  • Experience operating production services end-to-end: APIs, reliability practices, observability
  • Deep knowledge of Kubernetes: how scheduling, resource management, controllers, and pod lifecycle actually behave under pressure
  • Solid Linux and systems debugging skills: performance investigation, networking, storage/IO
  • Ability to troubleshoot complex production issues across logs, metrics, and traces and drive them to resolution

Nice To Haves

  • Experience with Argo Workflows, Ray, MLflow, or comparable distributed ML tooling
  • Hands-on experience building or operating large-scale ML training systems: GPU scheduling, distributed training, training data pipelines
  • Track record of optimizing resource usage and performance in distributed environments

Responsibilities

  • Build and scale our ML compute platform on Kubernetes, using Argo Workflows for training, evaluation, and data processing orchestration
  • Design and implement core platform capabilities, including a Ray-based internal SDK for distributed execution, and multi-tenant resource governance — scheduling, priorities, quotas, and policy enforcement across GPU, CPU, memory, and IO
  • Improve end-to-end training throughput and platform efficiency by optimizing data access patterns, caching, and removing bottlenecks in storage, network, and resource contention
  • Work directly with ML teams to debug complex workload issues, drive root-cause analysis, and turn recurring problems into platform-level fixes
  • Evaluate, integrate, and extend open-source tooling (Argo Workflows, Ray, Kubernetes ecosystem) to meet evolving platform needs

What This Job Offers

Job Type

Full-time

Career Level

Mid Level

Education Level

No Education Listed

Number of Employees

1-10 employees