About The Position

We are looking for strong technical builders and researchers who deeply understand foundation models and representation learning, beyond simply applying existing frameworks. Ideal candidates should have strong experimental rigor, solid systems and modeling intuition, hands-on engineering ability, and an interest in scalable multimodal AI systems for real-world autonomy. We value people who can bridge research and production, and who care about robustness, scalability, efficiency, and practical deployment in large-scale autonomous driving systems.

Requirements

  • MS or PhD in Computer Vision, Machine Learning, Robotics, Computer Science, or a related field
  • Strong understanding of foundation models, self-supervised learning, representation learning, multimodal learning, and large-scale pretraining
  • Hands-on experience with methods such as CLIP, DINO / DINOv2, MAE, contrastive learning, masked modeling, and MoE or scalable transformer architectures
  • Experience with one or more of the following is highly valued: video foundation models, long-context modeling, retrieval systems, efficient inference, distributed training, model compression and deployment optimization
  • A strong publication record in top-tier venues (CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML) is preferred

Responsibilities

  • Large-Scale Foundation Model Pretraining: Develop scalable pretraining pipelines for large-scale multimodal driving data. Design and optimize training strategies for vision-language-action models, video foundation models, long-context temporal modeling, and multimodal representation alignment. Improve training stability, data efficiency, scaling efficiency, and representation robustness. Work on distributed training systems and large-scale model optimization using frameworks such as PyTorch Distributed, DeepSpeed, and Megatron-LM.
  • Representation Learning & Method Innovation: Design and improve self-supervised and multimodal learning methods for real-world autonomous driving systems. Conduct architecture-level research on Vision Transformers (ViT), video and temporal architectures, multimodal fusion and alignment, embedding and retrieval systems, and long-context and memory-efficient architectures. Explore and improve pretraining objectives, loss functions, training paradigms, and generalization and robustness. Analyze model behavior through rigorous ablation studies, failure case analysis, and representation probing and evaluation.
  • Efficient Foundation Models & Scalable Deployment: Improve the efficiency, scalability, and deployability of large multimodal foundation models for real-world autonomous driving systems. Work on areas such as model quantization, knowledge distillation, efficient attention mechanisms, sparse architectures and Mixture-of-Experts (MoE), long-context and memory-efficient modeling, inference acceleration and serving optimization, and training and inference system efficiency. Optimize model throughput, latency, memory usage, and deployment performance for large-scale production environments.
© 2026 Teal Labs, Inc