At HUD, we’re building the future of how companies and individuals train and evaluate AI. We believe that in the near future, most post-training data used to align and improve LLMs will flow through HUD. We build a platform and developer tools that let teams create post-training data through RL environments and run reinforcement fine-tuning (RFT) reliably, reproducibly, and at scale. We’re trusted by foundation labs, Fortune 500s, and fast-growing startups.

We’re also a high-caliber team: former founders, published ML researchers, Olympiad medalists, and engineers who have built products with real adoption. We run lean, move fast, and hold an extremely high bar.

The Role

We run a platform + SDK/dev tools for creating RL environments/post-training data and running reinforcement fine-tuning at scale. A key part of that experience is our infra and developer sandboxes: fast, reliable, observable, Dockerized compute environments with massive parallelization.

We’re looking for an infrastructure owner who is obsessed with performance and reliability, someone who treats shaving seconds off sandbox lifecycle and runtime performance as a sport. You’ll own DevOps, infrastructure and architecture decisions as we hit our next order of scale.

Who you are

You are an infrastructure owner, not a dashboard watcher
You don’t wait for tickets. You proactively find bottlenecks, measure them, fix them, and prove the gains. You ship improvements that compound.

You care about tail latencies and failure modes
You think in SLOs, load patterns, saturation curves, and blast radius. You design for the real world: retries, backpressure, partial failures, and noisy neighbors.

You love performance
You enjoy turning “slow and expensive” into “fast and efficient.” You benchmark, profile, tune, and iterate.

You can operate autonomously
You are comfortable making high-stakes engineering decisions with good judgment, and communicating tradeoffs clearly to the team.

You’ll own and evolve HUD’s infrastructure so it is:

Extremely performant (fast sandbox provisioning, fast cold starts, low tail latency, high throughput)
Extremely reliable (predictable behavior, graceful failure, robust scaling, low operational risk)
Operationally excellent (systems scale, clear SLOs, deep observability, incident readiness, cost discipline)
Secure and compliant (SOC 2-aligned practices, strong security posture by default)

Job Type
Full-time
Career Level
Mid Level
Education Level
No Education Listed
Number of Employees
1-10 employees