Member of Technical Staff - ML Systems & Inference

Gimlet Labs
San Francisco, CA

About The Position

Gimlet Labs is building the first heterogeneous neocloud for AI workloads. As AI systems scale, the industry is hitting fundamental limits in power, capacity, and cost with today’s homogeneous, vertically integrated infrastructure.

Gimlet addresses this by decoupling AI workloads from the underlying hardware. Our platform partitions workloads into components and routes each component to the hardware that best fits its performance and efficiency needs. This enables heterogeneous systems that span multi-vendor and multi-generation hardware, including the latest emerging accelerators, and unlocks step-function improvements in performance and cost efficiency at scale.

On top of this foundation, Gimlet is building a production-grade neocloud for agentic workloads. Customers deploy and manage their workloads through stable, production-ready APIs without having to reason about hardware selection, placement, or low-level performance optimization. Gimlet works with foundation labs, hyperscalers, and AI-native companies to power real production workloads built to scale to gigawatt-class AI datacenters.

Gimlet Labs is seeking a Member of Technical Staff focused on ML systems and inference. In this role, you will design and build the inference systems that execute full models end to end under real production constraints. You will work at the intersection of model architecture, runtime behavior, and system performance to ensure inference is fast, predictable, and scalable. This role is ideal for engineers who deeply understand how modern models execute in practice and who care about latency, throughput, and memory behavior across the full inference lifecycle.
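To make the placement idea above concrete, the toy Python sketch below assigns workload components to a heterogeneous accelerator pool by cost per unit of compute. Everything here, the classes, numbers, and device names, is invented for illustration and is not Gimlet's actual partitioning or orchestration logic.

    # Hypothetical sketch: cost-based placement of workload components
    # onto a heterogeneous accelerator pool. All names and numbers are
    # invented; this is not Gimlet's actual system.
    from dataclasses import dataclass

    @dataclass
    class Component:
        name: str
        memory_gb: float    # working-set size of this component

    @dataclass
    class Accelerator:
        name: str
        tflops: float       # sustained throughput
        memory_gb: float
        dollars_per_hour: float

    def place(components: list[Component], pool: list[Accelerator]) -> dict[str, str]:
        """Toy policy: give each component the accelerator with the best
        cost per unit of compute among those that fit its working set."""
        placement = {}
        for comp in components:
            fits = [a for a in pool if a.memory_gb >= comp.memory_gb]
            best = min(fits, key=lambda a: a.dollars_per_hour / a.tflops)
            placement[comp.name] = best.name
        return placement

    pool = [
        Accelerator("gpu-gen-n", tflops=900, memory_gb=80, dollars_per_hour=4.0),
        Accelerator("gpu-gen-n-1", tflops=300, memory_gb=40, dollars_per_hour=1.2),
    ]
    parts = [Component("prefill", memory_gb=60),
             Component("decode", memory_gb=30)]
    print(place(parts, pool))   # {'prefill': 'gpu-gen-n', 'decode': 'gpu-gen-n-1'}

A real orchestrator would also weigh interconnect bandwidth, queueing, and per-device kernel availability; this only illustrates the shape of the decision.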

Requirements

  • Strong software engineering fundamentals
  • Experience building or operating ML inference or model serving systems
  • Comfort reasoning about performance, memory usage, and system behavior under load

Nice To Haves

  • Experience with inference runtimes such as TensorRT-LLM, vLLM, or custom serving systems
  • Deep understanding of modern model architectures and attention mechanisms
  • Experience with batching, scheduling, and concurrency control in inference systems
  • Familiarity with KV cache management and memory placement strategies (a toy paging sketch follows this list)
  • Experience profiling and tuning latency- and throughput-critical systems
  • Software development experience in Python and C++
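For context on the KV cache bullet above, here is a minimal, hypothetical sketch of paged KV cache allocation, the idea popularized by vLLM's PagedAttention: KV memory is carved into fixed-size blocks, and each request holds a block table instead of one contiguous region. The class, block size, and method names are invented for illustration; production allocators also handle prefix sharing, copy-on-write, and swapping to host memory.

    # Illustrative sketch of paged KV cache allocation; names and block
    # size are invented, not any particular runtime's API.
    class PagedKVCache:
        def __init__(self, num_blocks: int, block_size: int = 16):
            self.block_size = block_size             # tokens per block
            self.free = list(range(num_blocks))      # free physical block ids
            self.tables: dict[str, list[int]] = {}   # request -> block table
            self.lengths: dict[str, int] = {}        # tokens stored per request

        def append(self, req: str) -> tuple[int, int]:
            """Reserve a KV slot for req's next token; returns
            (physical_block, offset), allocating a block on overflow."""
            table = self.tables.setdefault(req, [])
            n = self.lengths.get(req, 0)
            if n % self.block_size == 0:             # current block is full
                if not self.free:
                    raise MemoryError("KV cache exhausted: evict or preempt")
                table.append(self.free.pop())
            self.lengths[req] = n + 1
            return table[n // self.block_size], n % self.block_size

        def release(self, req: str) -> None:
            """Return all blocks to the free list when a request finishes."""
            self.free.extend(self.tables.pop(req, []))
            self.lengths.pop(req, None)

    kv = PagedKVCache(num_blocks=64)
    print(kv.append("req-1"))   # (63, 0): a freshly allocated block, offset 0

Paging like this is what makes fine-grained reuse and eviction tractable: freeing a finished request is a constant-time free-list operation rather than a memory compaction.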

Responsibilities

  • Design and optimize end-to-end inference pipelines from request ingestion through execution and response
  • Build and evolve inference runtimes that balance latency, throughput, and concurrency under real-world load
  • Reason about batching, queuing, and scheduling tradeoffs, including their impact on tail latency and fairness (see the simplified scheduling sketch after this list)
  • Manage KV cache allocation, placement, reuse, and eviction across models and requests
  • Optimize prefill and decode paths, including attention mechanisms and memory usage
  • Profile and debug inference performance issues across model, runtime, and system boundaries
  • Work closely with the compiler, kernel, networking, and distributed systems teams to deliver end-to-end performance improvements
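The scheduling responsibility above hides a lot of nuance. The sketch below shows one heavily simplified continuous-batching step, in the spirit of systems like Orca and vLLM: a per-iteration token budget decides how many waiting prefills to admit alongside in-flight decodes. The names and the budget policy are invented for illustration.

    # Hypothetical continuous-batching admission step; a token budget
    # bounds work per iteration, trading some throughput for
    # predictable tail latency on in-flight decodes.
    from collections import deque
    from dataclasses import dataclass

    @dataclass
    class Request:
        id: str
        prompt_tokens: int
        generated: int = 0
        done: bool = False

    def schedule_step(running: list[Request], waiting: deque,
                      token_budget: int = 8192) -> list[tuple[Request, int]]:
        """Build one iteration's batch. A decode costs 1 token; a newly
        admitted request costs its full prefill."""
        batch, budget = [], token_budget
        for req in running:                  # serve in-flight decodes first
            if not req.done and budget >= 1:
                batch.append((req, 1))
                budget -= 1
        while waiting and waiting[0].prompt_tokens <= budget:
            req = waiting.popleft()          # FCFS admission
            batch.append((req, req.prompt_tokens))
            budget -= req.prompt_tokens
            running.append(req)              # joins the decode phase next step
        return batch

    waiting = deque([Request("a", prompt_tokens=1000),
                     Request("b", prompt_tokens=9000)])
    running: list = []
    print([(r.id, t) for r, t in schedule_step(running, waiting)])
    # [('a', 1000)] -- "b" exceeds the remaining budget and waits; real
    # schedulers chunk oversized prefills rather than stalling them

Even this toy version exposes the core tradeoff in the bullet above: a larger budget raises throughput but lets long prefills inflate the latency of every decode sharing the step.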