Member of Technical Staff (MTS) - Multimodal Foundation Models

Deeproute.ai • Fremont, CA

About The Position

We are looking for strong technical builders and researchers whose understanding of foundation models and representation learning goes beyond applying existing frameworks. Ideal candidates have strong experimental rigor, solid systems and modeling intuition, hands-on engineering ability, and an interest in scalable multimodal AI systems for real-world autonomy. We value people who can bridge research and production, and who care about robustness, scalability, efficiency, and practical deployment in large-scale autonomous driving systems.

Requirements

MS or PhD in Computer Vision, Machine Learning, Robotics, Computer Science, or related fields.
Strong understanding of foundation models, self-supervised learning, representation learning, multimodal learning, and large-scale pretraining.
Hands-on experience with methods such as CLIP, DINO/DINOv2, MAE, contrastive learning, masked modeling, MoE, or scalable transformer architectures.

Nice To Haves

Experience with video foundation models.
Experience with long-context modeling.
Experience with retrieval systems.
Experience with efficient inference.
Experience with distributed training.
Experience with model compression and deployment optimization.
Strong publication record in top-tier venues such as CVPR, ICCV, ECCV, NeurIPS, ICLR, and ICML.

Responsibilities

Develop scalable pretraining pipelines for large-scale multimodal driving data.
Design and optimize training strategies for vision-language-action models, video foundation models, long-context temporal modeling, and multimodal representation alignment.
Improve training stability, data efficiency, scaling efficiency, and representation robustness.
Work on distributed training systems and large-scale model optimization using frameworks such as PyTorch Distributed, DeepSpeed, and Megatron-LM.
Design and improve self-supervised and multimodal learning methods for real-world autonomous driving systems.
Conduct architecture-level research on Vision Transformers (ViT), video/temporal architectures, multimodal fusion and alignment, embedding and retrieval systems, and long-context and memory-efficient architectures.
Explore and improve pretraining objectives, loss functions, training paradigms, generalization, and robustness.
Analyze model behavior through rigorous ablation studies, failure case analysis, and representation probing and evaluation.
Improve the efficiency, scalability, and deployability of large multimodal foundation models for real-world autonomous driving systems.
Work on areas such as model quantization, knowledge distillation, efficient attention mechanisms, sparse architectures and Mixture-of-Experts (MoE), long-context and memory-efficient modeling, inference acceleration and serving optimization, and training and inference system efficiency.
Optimize model throughput, latency, memory usage, and deployment performance for large-scale production environments.