Senior DevOps & Infrastructure Engineer

HUD
San Francisco, CA · Onsite
$160,000 - $280,000

About The Position

At HUD, we’re building the future of how companies and individuals train and evaluate AI. We believe that in the near future, most post-training data used to align and improve LLMs will flow through HUD. We build a platform and developer tools that let teams create post-training data through RL environments and run reinforcement fine-tuning (RFT) reliably, reproducibly, and at scale. We’re trusted by foundation labs, Fortune 500s, and fast-growing startups. We’re also a high-caliber team: former founders, published ML researchers, Olympiad medalists, and engineers who have built products with real adoption. We run lean, move fast, and hold an extremely high bar.

The Role

We run a platform and SDK/dev tools for creating RL environments and post-training data and running reinforcement fine-tuning at scale. A key part of that experience is our infrastructure and developer sandboxes: fast, reliable, observable, Dockerized compute environments with massive parallelization. We’re looking for an infrastructure owner who is obsessed with performance and reliability: someone who treats shaving seconds off sandbox lifecycle and runtime performance as a sport. You’ll own DevOps, infrastructure, and architecture decisions as we hit our next order of scale.

Who You Are

  • You are an infrastructure owner, not a dashboard watcher. You don’t wait for tickets: you proactively find bottlenecks, measure them, fix them, and prove the gains. You ship improvements that compound.
  • You care about tail latencies and failure modes. You think in SLOs, load patterns, saturation curves, and blast radius. You design for the real world: retries, backpressure, partial failures, and noisy neighbors.
  • You love performance. You enjoy turning “slow and expensive” into “fast and efficient.” You benchmark, profile, tune, and iterate.
  • You can operate autonomously. You are comfortable making high-stakes engineering decisions with good judgment and communicating tradeoffs clearly to the team.

You’ll own and evolve HUD’s infrastructure so it is:
  • Extremely performant (fast sandbox provisioning, fast cold starts, low tail latency, high throughput)
  • Extremely reliable (predictable behavior, graceful failure, robust scaling, low operational risk)
  • Operationally excellent (systems that scale, clear SLOs, deep observability, incident readiness, cost discipline)
  • Secure and compliant (SOC 2-aligned practices, strong security posture by default)

Requirements

  • Deep AWS experience, including operating production systems at scale (networking, IAM, compute, storage, observability, cost).
  • Strong Kubernetes/EKS experience: cluster design, workload isolation, autoscaling (cluster + pod), upgrades, reliability practices.
  • Excellent Docker + container runtime knowledge: image optimization, build pipelines, caching strategies, and runtime security considerations.
  • Systems-level competence: Linux fundamentals, networking, performance debugging, resource contention, concurrency basics.
  • Infrastructure automation: strong ability to implement infrastructure as code (Terraform/CDK/CloudFormation) and repeatable environments.
  • Observability expertise: metrics/logging/tracing design, SLOs/SLIs, alerting that avoids noise and catches real issues (see the burn-rate sketch after this list).
  • Security + compliance mindset: experience working in SOC 2-aligned environments; ability to implement least privilege, auditability, and operational controls.
  • Strong engineering communication: can write clear docs, propose designs, and upskill the team.
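
To make the observability expectation concrete, here is a minimal sketch of a multi-window error-budget burn-rate check in Python, one common way to page only on issues that matter. It assumes a 99.9% availability SLO and a hypothetical `get_error_ratio(window)` helper standing in for a real metrics query; neither is part of HUD's actual stack.

```python
"""Minimal burn-rate alerting sketch.

Assumes a 99.9% availability SLO and a hypothetical get_error_ratio(window)
helper that would normally query a metrics backend (Prometheus, CloudWatch, ...).
Paging only when a fast AND a slow window both burn hot keeps alerts low-noise
while still catching real, sustained problems.
"""
from datetime import timedelta

SLO_TARGET = 0.999
ERROR_BUDGET = 1.0 - SLO_TARGET  # 0.1% of requests may fail over the SLO period


def get_error_ratio(window: timedelta) -> float:
    """Hypothetical stand-in: fraction of failed requests over the lookback window."""
    return 0.002  # placeholder value


def burn_rate(window: timedelta) -> float:
    """How many times faster than 'budget exactly exhausted at period end' we are burning."""
    return get_error_ratio(window) / ERROR_BUDGET


def should_page() -> bool:
    # 14.4x burn over 1h, confirmed by a 5m window, is the classic threshold
    # for "2% of a 30-day error budget gone in a single hour".
    return burn_rate(timedelta(hours=1)) > 14.4 and burn_rate(timedelta(minutes=5)) > 14.4


if __name__ == "__main__":
    print("page on-call:", should_page())
```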

Nice To Haves

  • Experience building ephemeral compute / sandbox / job execution platforms (multi-tenant, Dockerized workloads, queueing, isolation).
  • Proven wins reducing cold start / startup time and improving p95/p99 latency for infra-critical paths.
  • Deep familiarity with:
      • Karpenter / Cluster Autoscaler, HPA/VPA, pod scheduling strategies, priority classes, taints/tolerations, topology spread constraints
      • Container performance: image layering, registry optimization, pull-through caches, snapshotters, prewarming strategies (see the prewarming sketch after this list)
      • Service mesh / networking (where appropriate), network policies, ingress design, egress controls
  • Experience migrating from mixed hosting providers into a more cohesive platform architecture.
  • Experience with CI/CD at high velocity (safe deploys, progressive delivery, canaries, rollbacks).
  • Experience with GPU infrastructure and orchestration (if applicable to workloads).
  • Security depth beyond basics: threat modeling, hardening, secure supply chain for containers, audit-readiness workflows.
  • Ability to contribute across the stack: Python (our SDK and backend systems) and Next.js/TypeScript, enough to collaborate effectively with other engineers.
  • Strong fluency with AI coding tools (using them to accelerate debugging, automation, and implementation without sacrificing correctness).
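
As an illustration of the prewarming strategies mentioned above, here is a minimal sketch that pre-pulls a set of sandbox base images in parallel using the docker-py SDK, so sandbox cold starts skip the registry round trip. The image list and worker count are hypothetical placeholders, not HUD's actual configuration.

```python
"""Minimal image-prewarming sketch using the docker-py SDK (pip install docker).

Pre-pulling base images on a node before sandboxes are scheduled onto it turns
the image pull into a warm cache hit instead of a cold-start cost.
"""
import time
from concurrent.futures import ThreadPoolExecutor

import docker

# Hypothetical image list; in practice this would come from the sandbox catalog.
SANDBOX_IMAGES = [
    "python:3.12-slim",
    "node:20-alpine",
]

client = docker.from_env()


def prewarm(image: str) -> float:
    """Pull one image and return the pull latency in seconds."""
    start = time.monotonic()
    client.images.pull(image)
    return time.monotonic() - start


with ThreadPoolExecutor(max_workers=4) as pool:
    for image, seconds in zip(SANDBOX_IMAGES, pool.map(prewarm, SANDBOX_IMAGES)):
        print(f"{image}: pulled in {seconds:.1f}s")
```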

Responsibilities

Developer sandbox infrastructure
  • Own our AWS + EKS-based sandbox platform that runs Dockerized workloads for customers and internal teams.
  • Optimize the sandbox lifecycle end-to-end: provisioning, scheduling, image pulls, startup, execution, teardown, and caching.
  • Design for massive parallelism while maintaining reliability, fairness, and predictable performance.

Kubernetes + AWS excellence
  • Evolve our cluster architecture: node groups, autoscaling strategies, spot/on-demand mixes, scheduling policies, and workload isolation.
  • Build safe-by-default patterns: quotas, resource limits, network policies, pod security, secrets management, and guardrails.
  • Improve cluster resiliency and operational ergonomics (upgrades, rollouts, disaster recovery, fail-safes).

Cross-stack DevOps ownership
  • Address infrastructure bottlenecks as we scale.
  • Improve developer experience for internal teams: safer deploys, better CI/CD, smoother local/dev workflows, faster iteration.
  • Provide architectural input and raise the infra maturity of the team via docs, patterns, and coaching.
  • Interface with our backend/workers (Railway), frontend (Vercel/Next.js), and data layer (Supabase/Postgres) to keep the whole system cohesive.

Performance engineering and ruthless measurement
  • Establish “infra product metrics” and instrument everything: P50/P95/P99 sandbox startup times, queue times, job success rates, noisy-neighbor rates, image pull latencies, cluster saturation, and cost-per-run.
  • Build benchmarking harnesses for sandboxes and workloads to track regressions and validate improvements (see the benchmarking sketch after this list).
  • Treat efficiency as a first-class metric: optimize utilization without sacrificing latency or reliability.

Observability + incident readiness
  • Implement gold-standard observability across logs/metrics/traces, with actionable dashboards and alerting tied to SLOs.
  • Create runbooks, incident processes, and a postmortem culture that meaningfully improves the system each time.
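
A minimal sketch of the kind of benchmarking harness described above, assuming a hypothetical `launch_sandbox()` callable that blocks until a sandbox is ready (the real provisioning call would go there). It reports P50/P95/P99 startup time so regressions show up in the tail, not just the mean.

```python
"""Minimal sandbox-startup benchmark sketch.

launch_sandbox() is a hypothetical stand-in for the real provisioning call; the
harness times N launches and reports P50/P95/P99 so tail regressions are visible.
"""
import statistics
import time
from typing import Callable


def launch_sandbox() -> None:
    """Hypothetical placeholder: block until a sandbox is ready."""
    time.sleep(0.05)


def benchmark(launch: Callable[[], None], runs: int = 100) -> dict[str, float]:
    samples: list[float] = []
    for _ in range(runs):
        start = time.monotonic()
        launch()
        samples.append(time.monotonic() - start)
    cuts = statistics.quantiles(samples, n=100)  # 99 percentile cut points
    return {"p50_s": cuts[49], "p95_s": cuts[94], "p99_s": cuts[98]}


if __name__ == "__main__":
    for name, value in benchmark(launch_sandbox).items():
        print(f"{name}: {value:.3f}")
```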

Benefits

  • Full healthcare
  • Daily team meals
  • Meaningful equity