Member of Technical Staff - Training Platform

Prime Intellect · San Francisco, CA
$150,000 - $300,000 · Hybrid

About The Position

Prime Intellect is building the open superintelligence stack - from frontier agentic models to the infrastructure that lets anyone create, train, and deploy them. We aggregate and orchestrate global compute into a single control plane and pair it with the full RL post-training stack: environments, secure sandboxes, verifiable evals, and our async RL trainer. We enable researchers, startups, and enterprises to run end-to-end reinforcement learning at frontier scale, adapting models to real tools, workflows, and deployment contexts.

We recently raised $15M in funding (taking total funding to $20M), led by Founders Fund with participation from Menlo Ventures and prominent angels including Andrej Karpathy (Eureka Labs, Tesla, OpenAI), Tri Dao (Chief Scientist, Together AI), Dylan Patel (SemiAnalysis), Clem Delangue (Hugging Face), Emad Mostaque (Stability AI), and many others.

You'll help build our hosted training platform - the product that lets users launch LoRA and full fine-tuning runs on managed GPU clusters with a single API call or a few clicks. The role spans the developer-facing platform and the underlying Kubernetes-based training infrastructure that runs the jobs.

Requirements

  • Strong working knowledge of the modern AI stack - open model families, fine-tuning techniques (LoRA, QLoRA, full FT, RLHF/RLAIF), inference engines (vLLM, SGLang, TensorRT-LLM)
  • Familiarity with GPU hardware tradeoffs (H100 / H200 / B200, NVLink, interconnects, memory hierarchy) and what they mean for training and inference workloads
  • Understanding of distributed training fundamentals (data/tensor/pipeline/expert parallelism, NCCL, multi-node scheduling)
  • Awareness of what's happening at the frontier - new models, training methods, infra patterns - and the ability to translate that into product decisions
  • Strong Kubernetes operations experience - Helm, CRDs, operators, KEDA, gang scheduling, GPU operator
  • Comfortable debugging real production clusters (kubectl, pod lifecycle, node issues, networking)
  • Cloud platform experience (GCP preferred - GCS, GKE, Cloud Run, Cloud Tasks)
  • Infrastructure automation (Helm, Terraform, Ansible) and a GitOps mindset
  • Observability: Prometheus, Grafana, Loki, OpenTelemetry, DCGM
  • Linux fundamentals: networking, namespaces, performance tuning
  • Strong Python backend development (FastAPI, async, SQLAlchemy)
  • Comfortable building Python control-plane agents that talk to Kubernetes APIs
  • Modern frontend development (TypeScript, React/Next.js, Tailwind, shadcn) - enough to ship product surfaces end-to-end
  • REST and tRPC API design
  • Experience building developer tools, dashboards, and live-monitoring UIs

Nice To Haves

  • We value potential over perfection - if you're passionate about democratizing AI development and have experience in platform or infrastructure development (ideally both), we want to talk to you.

Responsibilities

  • Design and operate Kubernetes-based training and inference orchestration across multi-cluster, multi-cloud GPU fleets
  • Build and maintain Helm charts that compose trainers, inference servers, environment servers, and supporting services into reproducible "training stacks"
  • Develop the Python control-plane agents that watch pods, report run state to the platform, and keep clusters in sync
  • Implement scheduling and autoscaling for heterogeneous hardware (H100/H200/B200) using KEDA, LeaderWorkerSet, taints/tolerations, and gang scheduling
  • Run a tight GitOps workflow - every change ships through PRs, Helm values, and CI
  • Build node-local model caches, checkpoint pipelines, and shared storage for fast cold starts
  • Operate the observability stack (Prometheus, Grafana, Loki, DCGM) and make GPU cluster debugging fast
  • Build the developer-facing surfaces for hosted training: job submission, live run monitoring, logs, metrics, model/adapter management, comparisons
  • Develop FastAPI backend services and REST APIs that bridge the platform to running clusters
  • Build real-time monitoring and debugging tools (streaming logs, step-level metrics, failure analysis)
  • Ship product UI in Next.js / React / TypeScript with shadcn, Tailwind, tRPC, and TanStack Query
  • Interface with the RL trainer, inference servers, and environment servers running inside our clusters
  • Productize new training capabilities (new model architectures, RL algorithms, modes)

Benefits

  • Cash compensation $150K–$300K with significant equity
  • Flexible work arrangement (remote or San Francisco office)
  • Full visa sponsorship and relocation support
  • Professional development budget for courses and conferences
  • Regular team off-sites and conference attendance
  • Opportunity to shape the future of decentralized AI development