Research Scientist Intern, Audio Quality with AI (PhD)

Meta•Redmond, WA

1d•$7,650 - $12,134

About The Position

The Meta Reality Labs Research Team is seeking an intern passionate about speech perception and audio quality to investigate why processed speech sometimes sounds degraded or robotic. The project focuses on identifying systematic phonemic errors as causal factors in perceived quality degradation and linking these errors to human quality and intelligibility judgments. A core method is to explore and compare the capabilities of audio and video LLMs. This is fundamentally a speech-perception research role; multimodal/LLM methods are a supporting tool rather than the central focus. Our internships are twelve (12) to twenty-four (24) weeks long, and we have various start dates throughout the year.

Requirements

Currently has, or is in the process of obtaining, a PhD degree in the field of Speech and Hearing Science, Auditory Neuroscience, Computational Neuroscience, Computer Science, Artificial Intelligence, Generative AI, Transformer Models, Machine Learning, Signal Processing, or Computer Vision
3+ years of experience with Python, MATLAB, or similar
3+ years of experience with machine learning software platforms such as PyTorch or TensorFlow
Background in speech perception, psychoacoustics, or acoustic phonetics
Experience deploying novel audio computational models and LLMs
Must obtain work authorization in the country of employment at the time of hire, and maintain ongoing work authorization during employment

Nice To Haves

Experience building novel audio computational models and LLMs
Demonstrated software engineering experience via an internship, work experience, coding competitions, or widely used contributions to open-source repositories (e.g. GitHub)
Experience in advancing AI techniques, including core contributions to open-source libraries and frameworks in computer vision or audio processing
Experience with audio and speech quality assessment
Experience with multichannel audio processing
Experience in visual and acoustic scene analysis
Experience manipulating and analyzing complex, large-scale, high-dimensional data from varying sources
Proven track record of achieving significant results, as demonstrated by grants, fellowships, patents, or first-authored publications at leading workshops or top computer vision and machine learning conferences such as ARO, ASA, NeurIPS, ICML, ICLR, ACL, EMNLP, CVPR, ICCV, ECCV, ICASSP, Interspeech, or similar
Experience in utilizing theoretical and empirical research to solve problems
Experience working and communicating cross-functionally in a team environment
Intent to return to a degree program after the completion of the internship/co-op

Responsibilities

Investigate systematic phonemic errors as causal factors in perceived speech quality degradation, and link them to human perceptual judgments
Build and curate datasets and benchmarks of speech for phoneme-level analysis
Explore and compare the capabilities of audio and video (multimodal) LLMs as tools to support this analysis
Relate findings to human perceptual data (quality preference and intelligibility) and translate them into actionable insights for research and engineering teams
Where appropriate, adapt multimodal models to the task in a supporting capacity
Collaborate with researchers, engineers, and cross-functional partners to define goals, communicate findings, and drive improvements in speech quality
Develop tools and infrastructure to streamline and scale the analysis
Stay current with research in speech perception and audio quality and intelligibility assessment, and incorporate best practices into Meta's workflows
Disseminate results through internal reports and presentations, and, when appropriate, external publications