Applied Research Engineer – Training Infra

Snorkel AI
Redwood City, CA (Remote)
$150,000 - $180,000

About The Position

As an Applied Research Engineer at Snorkel AI, you will own the infrastructure that powers our model training and evaluation work. This is a hands-on role: you will build and operate GPU cluster infrastructure, training pipelines, and the tooling that lets our research and engineering teams run experiments reliably and at scale. You will work closely with research scientists and engineers, translating training requirements into robust, reproducible systems and proactively removing infrastructure blockers before they slow down the work that matters most.

Snorkel AI operates in a fast-paced, high-impact environment. We are looking for someone who takes pride in operational excellence, enjoys solving complex distributed systems problems, and thrives when given real ownership. This role is a great fit for engineers who love building reliable systems close to the frontier of AI research. We welcome applicants from a wide range of backgrounds, whether your experience comes from industry, research labs, or direct hands-on work with distributed infrastructure at scale.

Requirements

  • Hands-on experience managing GPU clusters on major cloud providers, including provisioning, network configuration, and cost management.
  • Experience with distributed compute orchestration tools such as Kubernetes, Slurm, or equivalent cluster management systems.
  • Working knowledge of distributed training concepts: parallelism strategies, memory optimization techniques, and inter-node communication (a minimal data-parallel sketch follows this list).
  • Experience setting up, managing, and integrating ML experiment tracking and data/model versioning tools.
  • Strong Python proficiency and solid software engineering fundamentals such as version control, modular design, and automation.
  • Ability to work in a fast-moving, iterative environment and take end-to-end ownership of ambiguous infrastructure problems.
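To ground the distributed-training requirement above, here is a minimal sketch of a multi-node data-parallel training script in PyTorch, representative of the workloads this role launches and keeps healthy. The toy model, hyperparameters, and launch flags are illustrative assumptions, not a description of Snorkel's actual stack.

```python
# Minimal multi-node data-parallel sketch (illustrative only).
# Assumes launch via `torchrun --nnodes=<N> --nproc_per_node=<GPUs> train.py`,
# which sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # NCCL is the usual backend for GPU-to-GPU and inter-node communication.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).to(f"cuda:{local_rank}")  # toy stand-in
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(x).pow(2).mean()  # dummy objective for the sketch
        optimizer.zero_grad()
        loss.backward()  # DDP all-reduces gradients across ranks here
        optimizer.step()
        if dist.get_rank() == 0 and step % 10 == 0:
            print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Under Slurm, a script like this is typically launched with one torchrun process per node; under Kubernetes, each pod plays the same per-node role.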

Nice To Haves

  • Hands-on experience with post-training workflows such as supervised fine-tuning (SFT) or reinforcement learning (RLHF, GRPO, or similar) is a strong plus, but not required (see the SFT sketch after this list).
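For orientation only, below is a minimal supervised fine-tuning (SFT) sketch using PyTorch and Hugging Face Transformers. The model name, data, and hyperparameters are placeholders; a real post-training pipeline would add prompt masking, data streaming, and distributed execution.

```python
# Minimal SFT sketch; model, data, and hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative stand-in for a real base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Placeholder instruction/response pairs; a real pipeline would stream a
# curated dataset and mask prompt tokens out of the loss.
texts = [
    "Instruction: Summarize distributed training.\nResponse: ...",
]

model.train()
for epoch in range(3):
    batch = tokenizer(texts, return_tensors="pt", padding=True).to("cuda")
    # With labels == input_ids, the model applies the causal LM shift
    # internally and returns the cross-entropy loss.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss {outputs.loss.item():.4f}")
```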

Responsibilities

  • Set up and manage GPU cluster infrastructure on major cloud providers (e.g., AWS HyperPod) for distributed model training, including networking, provisioning, and cost tracking.
  • Build and operate job orchestration and scheduling systems (e.g., Kubernetes, Slurm, or cloud-native equivalents) to reliably launch and manage training, rollout, and evaluation jobs across multi-node clusters.
  • Integrate and maintain ML training frameworks and post-training pipelines, ensuring they run stably and reproducibly at scale.
  • Set up and maintain experiment tracking, dataset versioning, and model artifact management to support fast iteration.
  • Monitor and optimize cluster health, inter-node communication, and resource utilization; implement fault tolerance and auto-recovery so long-running jobs survive node failures (a checkpoint/resume sketch follows this list).
  • Work closely with research scientists and ML engineers to understand requirements, unblock experiments, and evolve infrastructure as our training workloads change.
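As one concrete illustration of the fault-tolerance responsibility above, here is a hedged checkpoint-and-resume sketch: on restart after a node failure, training resumes from the latest checkpoint instead of step zero. The paths, save interval, and toy model are assumptions for the example.

```python
# Checkpoint/resume sketch for long-running jobs (paths and interval assumed).
import os

import torch

CKPT_PATH = "/shared/ckpts/latest.pt"  # storage every node can reach
SAVE_EVERY = 500


def save_checkpoint(step, model, optimizer):
    tmp = CKPT_PATH + ".tmp"
    torch.save(
        {"step": step, "model": model.state_dict(), "optim": optimizer.state_dict()},
        tmp,
    )
    os.replace(tmp, CKPT_PATH)  # atomic rename: never leaves a half-written file


def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT_PATH):
        return 0  # fresh run
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optim"])
    return ckpt["step"] + 1  # resume after the last saved step


model = torch.nn.Linear(64, 64)  # toy stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
start = load_checkpoint(model, optimizer)

for step in range(start, 10_000):
    loss = model(torch.randn(8, 64)).pow(2).mean()  # dummy objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % SAVE_EVERY == 0:
        save_checkpoint(step, model, optimizer)
```

The atomic-rename pattern matters in practice: if a node dies mid-save, the previous checkpoint remains intact, so the restarted job always finds a consistent state.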