Research Engineer - AI/RL Infrastructure

Applied Intuition•Sunnyvale, CA

Onsite

About The Position

We are looking for a passionate Research Engineer (AI/RL Infrastructure) to join the Research Group at Applied Intuition. This role is ideal for engineers who design, build, and operate state-of-the-art, large-scale ML systems and who enjoy working closely with researchers to develop and accelerate the core platform powering next-generation physical AI systems.

The mission of the Research Group is to create cutting-edge technology that enables next-generation physical AI, with emphasis on the two most challenging applications reshaping everyday life: end-to-end autonomous driving and generalist robots. The group is composed of leading experts from top institutions and companies, recognized for their exceptional academic and industry contributions, including eight Best Paper awards at premier conferences and journals such as CVPR and ICRA. Learn more at appliedintuition.com/research.

Supported by industry-leading tools and infrastructure, researchers can access millions of miles of data from large fleets and deploy the methods they develop into a range of autonomous and robotic systems, including self-driving cars and trucks, autonomous mining and construction machines, humanoid robots, and dexterous hands. In addition to your research contributions, you will contribute to and learn from best practices in the autonomy and robotics industries within our fast-paced, customer-focused culture. Improvements deployed to our system immediately help our customers with their programs and deliver value to our business.

We are open to candidates at all experience levels who meet the requirements below, including those with the potential to serve as Tech Lead or Manager. Senior/Staff-level experience is strongly preferred for this role.

Requirements

Experience building and operating production-grade software systems across the full machine learning lifecycle, including training, evaluation, data, and deployment
Well-formed opinions on how to build a company-wide platform for ML training, evaluation, and deployment
Experience with performance engineering and compute acceleration for large-scale ML training, including profiling, bottleneck analysis, and optimization
Strong systems-level debugging skills to diagnose and resolve issues in large-scale distributed training, spanning model code, data pipelines, runtimes, and cluster infrastructure
Deep familiarity with the open-source ML and systems ecosystem, with judgment on when to adopt open source versus build in-house
Technical experience with PyTorch, CUDA, Ray, Flyte, and Kubernetes (K8s)

Nice To Haves

Industry experience in relevant areas (self-driving applications preferred)

Responsibilities

Design and build training and evaluation infrastructure to support our current AI research directions, orchestrating massive GPU clusters to process petabytes of multimodal sensor data
Build robust benchmarking, continuous evaluation, and regression tracking systems to measure model performance across diverse, long-tail real-world driving distributions
Develop large-scale data sampling, dataset generation, and advanced data curation pipelines, leveraging state-of-the-art AI models to power a closed-loop data flywheel
Enable high-throughput distributed training across heterogeneous cloud environments, focusing on reliability, efficiency, and cost-aware scaling
Collaborate closely with AI research, autonomy, and platform teams to translate cutting-edge research into production-ready systems