Engineer, Inference & Model Serving

Techire AI, San Francisco, CA
Hybrid

About The Position

Want to build the layer that actually makes AI usable in real time? You'll join a team focused on inference, where performance is the product. The work is about delivering low-latency, high-throughput systems for LLM, speech, and vision models running in production, not offline experiments.

The team is building real-time AI systems that need to respond instantly, reliably, and at scale. That means solving hard problems around batching, GPU efficiency, memory constraints, and system-level bottlenecks that most teams never fully crack. You'll sit at the core of the platform, working across model serving, infrastructure, and performance optimisation. A big part of the role is pushing current tooling beyond its limits: extending frameworks, profiling bottlenecks, and designing systems that hold up under real-world load.

This is not about training models. It's about making them fast, efficient, and production-ready.

Requirements

  • Strong experience with ML inference or model serving systems
  • Deep understanding of latency and throughput optimisation in production
  • Solid Python and PyTorch skills, plus a systems or performance engineering mindset
  • Familiarity with distributed systems and production infrastructure

Nice To Haves

  • Exposure to CUDA, GPU profiling tools, or systems like Kubernetes and Ray is useful; the key is knowing how to make models run efficiently at scale.

Responsibilities

  • Building high-performance serving systems for LLM, speech, and vision models
  • Scaling inference to production workloads with strict latency requirements
  • Optimising GPU utilisation and execution efficiency
  • Implementing techniques like continuous batching, KV cache optimisation, speculative decoding, and prefill/decode separation
  • Improving frameworks such as vLLM, TensorRT-LLM, Triton, and SGLang
  • Profiling and debugging performance across GPU, memory, and system layers

Benefits

  • Equity