In this role, you will be responsible for building low-latency inference pipelines for on-device deployment, enabling real-time next-token and diffusion-based control loops in robotics. You will design and optimize distributed inference systems on GPU clusters, focusing on pushing throughput with large-batch serving and efficient resource utilization. Your work will involve implementing efficient low-level code using CUDA, Triton, and custom kernels, and integrating it seamlessly into high-level frameworks. Additionally, you will optimize workloads for both throughput and latency, and develop monitoring and debugging tools that ensure reliability, determinism, and rapid diagnosis of regressions across both the on-device and cluster stacks.
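To give a flavor of the kernel-level work mentioned above, here is a minimal, hedged sketch of a Triton kernel that fuses an elementwise add with a ReLU. It is illustrative only: the function names, block size, and the particular fusion are assumptions for this example, not part of the role or any specific codebase.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds access on the last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Fused add + ReLU in a single pass over memory.
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)


def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Illustrative launcher; BLOCK_SIZE=1024 is an arbitrary example choice.
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    fused_add_relu_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out
```

Fusing elementwise operations like this avoids extra round-trips to global memory between kernel launches, which is one of the ways low-level kernel work can reduce latency in the serving paths described above.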