Staff Software Engineer, ML Infrastructure

Decagon · San Francisco, CA
$300,000 - $430,000 · Onsite

About The Position

We're hiring a Staff ML Infrastructure Engineer to own the platforms powering Decagon's model training and inference. You'll build distributed training systems, design inference architecture across multiple providers, and create the frameworks that let our Research and Product teams ship faster. This role is for someone who thrives on technical depth, can lead multi-quarter initiatives, and wants to shape the long-term architecture of our ML stack.

Requirements

  • 8+ years building ML infrastructure or production systems at scale
  • Deep experience with distributed training: multi-node GPU clusters, fault tolerance, and optimization
  • Strong understanding of LLM inference: latency optimization, provider tradeoffs, and serving architecture
  • Proficiency in Python and modern ML frameworks (PyTorch, JAX, or TensorFlow)
  • Proven track record leading complex, multi-quarter technical projects

Responsibilities

  • Design and build distributed training platforms for LLM and multimodal fine-tuning and post-training at scale
  • Implement and integrate state-of-the-art training algorithms into production pipelines
  • Own inference architecture and multi-provider routing, including failover and optimization
  • Research and implement inference optimizations including quantization, speculative decoding, and batching strategies
  • Lead initiatives to improve latency and cost efficiency across the training and serving stack
  • Build evaluation and experimentation infrastructure that enables rapid, reliable iteration
  • Drive technical direction, mentor engineers, and establish best practices for ML infrastructure

Benefits

  • Medical, dental, and vision benefits
  • Take-what-you-need vacation policy
  • Daily lunches, dinners, and snacks in the office to keep you at your best