We are building large-scale, native multimodal model systems that jointly support vision, audio, and text to enable comprehensive perception and understanding of the physical world. You will join the core research team focused on speech and audio, contributing to the following key research areas:
Develop general-purpose, end-to-end large speech models covering multilingual automatic speech recognition (ASR), speech translation, speech synthesis, paralinguistic understanding, and general audio understanding.
Advance research on speech representation learning and encoder/decoder architectures to build unified acoustic representations for multi-task and multimodal applications.
Explore representation alignment and fusion mechanisms between audio/speech and other modalities in large multimodal models, enabling joint modeling with image and text.
Build and maintain high-quality multimodal speech datasets, including automatic annotation and data synthesis technologies.
Job Type: Full-time
Career Level: Entry Level
Industry: Broadcasting and Content Providers
Education Level: Ph.D. or professional degree
Number of Employees: 5,001-10,000 employees