ML Platform Engineer

Avride · Austin, TX

About The Position

As an ML Platform Engineer at Avride, you'll own critical pieces of the ML stack: workflow orchestration, distributed execution, resource governance, and performance. You will shape how ML teams across the company run experiments and train models at scale, building the abstractions and services that make training workloads on Kubernetes reliable, cost-efficient, and fast, with an excellent developer experience.

Requirements

  • Strong proficiency in Python or Go; C++ is a plus
  • Track record of designing and building scalable, maintainable systems and services
  • Experience operating production services end-to-end: APIs, reliability practices, observability
  • Deep knowledge of Kubernetes: how scheduling, resource management, controllers, and pod lifecycle actually behave under pressure
  • Solid Linux and systems debugging skills: performance investigation, networking, storage/IO
  • Ability to troubleshoot complex production issues across logs, metrics, and traces and drive them to resolution

Nice To Haves

  • Experience with Argo Workflows, Ray, MLflow, or comparable distributed ML tooling
  • Hands-on experience building or operating large-scale ML training systems: GPU scheduling, distributed training, training data pipelines
  • Track record of optimizing resource usage and performance in distributed environments

Responsibilities

  • Build and scale our ML compute platform on Kubernetes, using Argo Workflows for training, evaluation, and data processing orchestration
  • Design and implement core platform capabilities, including a Ray-based internal SDK for distributed execution, and multi-tenant resource governance — scheduling, priorities, quotas, and policy enforcement across GPU, CPU, memory, and IO
  • Improve end-to-end training throughput and platform efficiency by optimizing data access patterns, caching, and removing bottlenecks in storage, network, and resource contention
  • Work directly with ML teams to debug complex workload issues, drive root-cause analysis, and turn recurring problems into platform-level fixes
  • Evaluate, integrate, and extend open-source tooling (Argo Workflows, Ray, Kubernetes ecosystem) to meet evolving platform needs

What This Job Offers

Job Type

Full-time

Career Level

Mid Level

Education Level

No Education Listed

Number of Employees

1-10 employees