Senior Software Engineer, Machine Learning Infrastructure - Generative AI

DoorDash USA•Sunnyvale, CA

$137,100 - $299,300

About The Position

You will join a small, high-leverage team building production infrastructure for Generative AI at DoorDash. You will lead the design and architecture of our open-weights model platform, which spans real-time GPU serving, high-throughput batch inference, and model fine-tuning. You’ll set technical direction across model serving and inference engines, fine-tuning and training pipelines, GPU autoscaling and utilization, batch pipelines, backend services, and observability, and mentor engineers as you go. This role is ideal for a senior engineer who enjoys owning ambiguous, high-impact systems and pushing the cost/performance frontier of GPU inference and fine-tuning. The area moves fast: product needs, model capabilities, vendor ecosystems, and cost/performance tradeoffs are all evolving quickly.

Requirements

B.S., M.S., or Ph.D. in Computer Science or equivalent
6+ years of industry experience in software engineering
Deep backend engineering fundamentals, especially in Python and distributed systems.
Track record of designing and owning production services, APIs, data pipelines, or ML infrastructure at scale.
Experience operating systems in production, including observability, debugging, reliability, incident response, and performance/cost optimization.
Deep hands-on experience with LLM inference and/or fine-tuning of open-weight models in production — serving (latency, throughput, batching, autoscaling, GPU utilization) and/or fine-tuning (SFT/DPO/LoRA).
Demonstrated technical leadership: leading design across ambiguous, fast-moving technical areas, mentoring engineers, and turning customer use cases into reusable platform capabilities.
Proficiency with AI coding tools (e.g., Claude Code, Codex, Cursor) across the full software development lifecycle, including design, code generation, testing, monitoring, and release.

Nice To Haves

Experience with LLM inference engines and serving frameworks (e.g., vLLM, SGLang, TensorRT-LLM) in production
Experience with distributed/multi-node fine-tuning and training pipelines (SFT, DPO/RLHF, LoRA), including data preparation and evaluation
GPU performance work — multi-node/distributed inference, KV-cache/memory optimization, quantization (FP8/INT8/AWQ/GPTQ), or cold-start/throughput tuning
Experience with Kubernetes, cloud infrastructure (AWS/GCP), GPUs, serverless/elastic GPU platforms (e.g., Modal), or high-throughput batch systems
Experience with LLM gateways, model routing, vendor abstraction, or cost attribution
Experience building developer platforms, internal platforms, or self-serve infrastructure
Experience building and deploying AI agents or MCP servers in production
Experience with eval systems, LLM observability, tracing, RAG, search, or vector databases

Responsibilities

Lead the design of infrastructure that helps DoorDash teams move GenAI ideas from prototype to production, increasing the velocity of business impact from AI across the company.
Own and evolve our open-weights serving stack — real-time GPU endpoints, high-throughput batch inference, and fine-tuning (SFT/DPO/LoRA) — alongside the LLM Gateway, Agent Gateway, evals infrastructure, guardrails, and cost attribution.
Architect scalable, high-performance systems for model serving, batch inference, GPU autoscaling, and fine-tuning that power real customer and internal automation use cases.
Push the cost and latency frontier of GPU inference — turning batch jobs that took days into hours and cutting inference cost by multiples — while giving product teams a clean choice across open-weight and closed-source models with reliability, fallback, observability, and cost controls built in.
Build platforms that support rapid experimentation while meeting production standards for latency, scale, monitoring, SLOs, playbooks, and operational excellence.
Partner closely with — and raise the technical bar for — ML engineers, product engineers, data scientists, and platform teams across DoorDash, Wolt, and Deliveroo to turn emerging GenAI capabilities into durable platform primitives.
Set technical direction for the future of DoorDash’s centralized GenAI platform — including emerging directions such as reinforcement learning (RLHF/RLVR), agent optimization, and other post-training and agentic techniques — enabling the next generation of AI-powered products, agents, automation, and personalization.