Research Engineer, Computer Vision

Meta • Pittsburgh, PA

1d • $121,992 - $181,000

About The Position

As a Research Engineer focused on Multi-Modal Understanding, you will develop advanced algorithms that integrate computer vision with other modalities such as language, audio, and sensor data. You will also drive the curation of multi-modal datasets and ground truth annotation pipelines to support model training and evaluation. You will work closely with our research team to bring innovative multi-modal solutions to production, bridging the gap between visual perception and holistic contextual understanding for immersive applications.

Requirements

Currently has, or is in the process of obtaining, a Bachelor's degree in Computer Science, Computer Engineering, or a relevant technical field, or equivalent practical experience. Degree must be completed prior to joining Meta
Proven experience with C++ and/or Python, including modern language features
Experience working with deep learning frameworks such as PyTorch or TensorFlow
Demonstrated experience working collaboratively in cross-functional teams

Nice To Haves

Master's degree in Computer Science, Computer Vision, Machine Learning, or related field
Experience with vision-language models or multi-modal transformers
Publications or contributions to multi-modal understanding research
Familiarity with large language models and their integration with visual understanding systems
Experience with data curation, annotation tools, or ground truth labeling pipelines

Responsibilities

Design and implement multi-modal understanding systems that combine vision, language, and other sensory inputs to enable richer contextual awareness
Develop algorithms for cross-modal learning, fusion, and reasoning to improve human-AI interaction
Lead the curation and management of multi-modal datasets, ensuring data quality and diversity across vision, language, and sensor modalities
Design and oversee ground truth annotation workflows and quality assurance processes for multi-modal data
Independently complete medium-to-large features that span multiple tasks, with minimal to no guidance
Collaborate with researchers and engineers across computer vision and machine learning teams to drive multi-modal innovation
Develop well-organized code with proper testing and documentation, building production-ready multi-modal systems