This role focuses on building low-latency inference pipelines for on-device deployment, enabling real-time next-token and diffusion-based control loops in robotics. Responsibilities include:

- Designing and optimizing distributed inference systems on GPU clusters, pushing throughput with large-batch serving and efficient resource utilization.
- Implementing efficient low-level code (CUDA, Triton, custom kernels) and integrating it cleanly into high-level frameworks, optimizing workloads for both throughput and latency (see the sketch after this list).
- Developing monitoring and debugging tools to ensure reliability, determinism, and rapid diagnosis of regressions across both the on-device and cluster stacks.
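To give a flavor of the custom-kernel work described above, here is a minimal sketch of a Triton kernel wrapped in a PyTorch-callable function, in the spirit of "low-level code integrated into high-level frameworks." It is purely illustrative and not drawn from any actual codebase for this role; the kernel, the wrapper name, and the block size are all assumptions.

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    # Mask guards the tail block when n_elements is not a multiple of BLOCK_SIZE.
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # High-level entry point: callable like any other PyTorch op.
    out = torch.empty_like(x)
    n = x.numel()
    # Launch grid: one program per BLOCK_SIZE-sized chunk.
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)  # BLOCK_SIZE=1024 is an assumed tuning choice
    return out

Real kernels in this kind of role (attention, sampling, fused decode steps) are far more involved, but the integration pattern is the same: a JIT-compiled kernel behind a plain Python function that slots into the serving stack.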
Job Type: Full-time
Career Level: Senior
Education Level: Not specified