In this role, you will be responsible for building low-latency inference pipelines for on-device deployment, enabling real-time next-token and diffusion-based control loops in robotics. You will design and optimize distributed inference systems on GPU clusters, focusing on pushing throughput with large-batch serving and efficient resource utilization. Your work will involve implementing efficient low-level code using CUDA, Triton, and custom kernels, and integrating it seamlessly into high-level frameworks. Additionally, you will optimize workloads for both throughput and latency, and develop monitoring and debugging tools that ensure reliability, determinism, and rapid diagnosis of regressions across both the on-device and cluster stacks.
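To give a flavor of the kernel-level work mentioned above, here is a minimal, hedged sketch of a Triton kernel that fuses an elementwise add with a ReLU. It is illustrative only: the function names, block size, and the particular fusion are assumptions for this example, not part of the role or any specific codebase.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds access on the last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Fused add + ReLU in a single pass over memory.
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)


def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Illustrative launcher; BLOCK_SIZE=1024 is an arbitrary example choice.
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    fused_add_relu_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out
```

Fusing elementwise operations like this avoids extra round-trips to global memory between kernel launches, which is one of the ways low-level kernel work can reduce latency in the serving paths described above.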