About The Position

The Perception team is pioneering the development of a multi-modality foundation model to drive the next generation of autonomous system intelligence. As a Multi-modality Foundation Model Engineer, you will focus on building highly efficient, production-ready multi-modality models. We are looking for experts who have hands-on experience building multi-modality foundation models—whether that involves AV-centric modalities (Vision, LiDAR, Radar) or broader domains (Vision, Language, Text, Audio). You will design, train, and deploy these models using Knowledge Distillation (KD) to transfer capabilities from large-scale proprietary teacher models to efficient student models capable of real-time, on-vehicle inference.
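The distillation setup described above, transferring knowledge from a large teacher to a compact student, is commonly built around a softened-logit matching loss. Below is a minimal sketch of a classic Hinton-style distillation objective for a classification head; the function name, `temperature`, and `alpha` defaults are illustrative assumptions, not the team's actual pipeline.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=4.0, alpha=0.5):
    """Blend a KL term on temperature-softened logits with a hard-label loss."""
    # Soften both distributions with the temperature, then match them via KL.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    # Standard supervised loss against ground-truth labels.
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1 - alpha) * ce
```

The `temperature ** 2` factor rescales gradients so the soft and hard terms stay comparable in magnitude as the temperature changes.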

Requirements

  • MS or PhD in Computer Science, Machine Learning, or a related technical field with demonstrated professional experience.
  • Deep, proven expertise in building and training large-scale multi-modality foundation models (e.g., Vision-Language Models (VLMs), Vision-Audio-Text, or Vision-LiDAR-Radar architectures).
  • Strong understanding of cross-modal alignment, multi-modal attention mechanisms, and large-scale pre-training techniques.
  • Proven experience in Knowledge Distillation (KD), model compression, and training highly efficient student models for production environments.
  • Proficiency in ML frameworks (e.g., PyTorch) and experience building large-scale ML training and evaluation pipelines.
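The cross-modal alignment expertise listed above is often exercised through a CLIP-style contrastive objective that pulls matched pairs from two modalities together in a shared embedding space. A minimal sketch follows, assuming image/text embeddings for concreteness; the same pattern applies to other modality pairs, and the `temperature` value is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def clip_alignment_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched embedding pairs."""
    # L2-normalize each modality so similarities are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature
    # Matched pairs sit on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions (image->text and text->image).
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```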

Nice To Haves

  • Experience in the Autonomous Driving or robotics industry.
  • Experience with model deployment, optimization, and hardware constraints (e.g., C++ for inference, TensorRT, quantization, pruning).
  • Publications in top-tier conferences (CVPR, ICCV, NeurIPS, ICLR, ACL) related to multi-modality foundation models, cross-modal learning, or model compression.

Responsibilities

  • Build, pre-train, and evaluate large-scale multi-modality foundation models from the ground up, successfully aligning diverse data streams (e.g., Vision, LiDAR, Radar, Language, Audio).
  • Define and execute the ML roadmap for deploying these multi-modality representations to the vehicle.
  • Architect and implement Knowledge Distillation pipelines to compress large-capacity multi-modal teacher models into highly efficient, production-ready student models.
  • Build high-quality training and evaluation datasets, applying advanced data-centric techniques to maximize cross-modal representation learning and student model convergence.
  • Collaborate with downstream perception teams to integrate and validate the performance, robustness, and latency of your models in on-board production systems.

Benefits

  • Paid time off (e.g., sick leave, vacation, bereavement)
  • Unpaid time off
  • Zoox Stock Appreciation Rights
  • Amazon RSUs
  • Health insurance
  • Long-term care insurance
  • Long-term and short-term disability insurance
  • Life insurance