INFERENCE ENGINEER

MakerMaker • San Francisco, CA

5d • Onsite

About The Position

We're building autonomous research agents for recursive self-improvement: multi-agent systems that propose, run, and analyze machine learning experiments. We're a small team working on-site in San Francisco. You'll build and operate the inference systems that serve our models in production. The work spans serving infrastructure, runtime optimization, and the long tail of production issues that comes with running real workloads. This is an engineering role, not a research role. You'll measure, profile, debug, and ship. You'll work alongside researchers, but your job is to make their work fast and reliable in production. Real ownership, real autonomy.

Requirements

Senior ML systems engineer with 3+ years building production-grade, large-scale serving infrastructure
Strong distributed systems experience; you've been on-call for systems that matter
Performance profiling and optimization fluency: you read flame graphs, and you analyze and measure before you change anything
Experience with GPU-accelerated inference at scale (multi-GPU, multi-node, batched and streaming workloads), preferably including AMD GPUs
Fluent Python; comfortable reading and writing systems-level code in at least one of the following: C++, CUDA, ROCm, or Triton
Track record of shipping production infrastructure, preferably systems serving millions of requests across diverse workloads
Good written communication; you can write a runbook that someone else can follow at 3am

Nice To Haves

Open-source contributions to inference / serving frameworks
Experience with mixed cloud and on-premises deployments
Familiarity with hardware-aware optimization (memory hierarchy, NCCL/RDMA, NUMA)
Background in compilers, runtimes, or accelerator software stacks

Responsibilities

Build, operate, and harden production inference systems serving large models at high throughput
Own the performance characteristics of those systems end-to-end: throughput, latency, cost-per-token, reliability under load
Profile real workloads to identify bottlenecks; ship fixes that move the metric you set out to improve
Implement and integrate inference optimizations from the research team (quantization, custom kernels, scheduling improvements, memory management) into production
Design observability into the inference layer: metrics, tracing, alerting that surface regressions before users notice them
Run capacity planning, autoscaling, and load testing for varied workload shapes (batch, online, mixed, agentic)
Diagnose and resolve production incidents; write postmortems that turn bugs into systemic fixes
