About The Position

In this role, you will be responsible for building low-latency inference pipelines for on-device deployment, enabling real-time next-token and diffusion-based control loops in robotics. You will design and optimize distributed inference systems on GPU clusters, focusing on pushing throughput with large-batch serving and efficient resource utilization. Your work will involve implementing efficient low-level code using CUDA, Triton, and custom kernels, and integrating it seamlessly into high-level frameworks. Additionally, you will optimize workloads for both throughput and latency, and develop monitoring and debugging tools that ensure reliability, determinism, and rapid diagnosis of regressions across both the cluster and on-device stacks.

Requirements

  • 8+ years of deep experience in distributed systems, ML infrastructure, or high-performance serving.
  • Production-grade expertise in Python.
  • Strong background in systems languages (C++/Rust/Go).
  • Low-level performance mastery: CUDA, Triton, kernel optimization, quantization, memory and compute scheduling.
  • Proven track record scaling inference workloads in both throughput-oriented cluster environments and latency-critical on-device deployments.
  • System-level mindset with a history of tuning hardware–software interactions for maximum efficiency, throughput, and responsiveness.

Responsibilities

  • Build low-latency inference pipelines for on-device deployment.
  • Design and optimize distributed inference systems on GPU clusters.
  • Implement efficient low-level code (CUDA, Triton, custom kernels).
  • Optimize workloads for throughput and latency.
  • Develop monitoring and debugging tools for reliability and rapid diagnosis.