Vision Language Model Engineer

EchoTwin AI · San Francisco, CA

About The Position

As a Vision Language Model Engineer, you will design, develop, and optimize advanced vision-language models that integrate visual and textual data to enable intelligent systems. You will work closely with cross-functional teams to build models that power applications such as image captioning, visual question answering, and multimodal AI at the edge.

Requirements

  • Bachelor’s, Master’s, or Ph.D. in Computer Science, Machine Learning, Artificial Intelligence, or a related field (or equivalent experience).
  • 3+ years of experience in machine learning, with a focus on vision-language models or multimodal AI.
  • Hands-on experience with deep learning frameworks such as PyTorch or TensorFlow.
  • Proven track record of building and deploying computer vision and/or NLP models.
  • Proficiency in Python and relevant ML libraries (e.g., Hugging Face Transformers, OpenCV).
  • Experience with large-scale model training and optimization (e.g., distributed training, quantization).
  • Strong understanding of neural network architectures (e.g., CNNs, Transformers, CLIP, or similar); see the CLIP sketch after this list.
  • Experience with multimodal datasets and preprocessing techniques for images and text.
  • Familiarity with cloud platforms (e.g., AWS, GCP, Azure) and model deployment workflows.
  • Strong problem-solving skills and ability to work in a fast-paced, collaborative environment.
  • Excellent communication skills to explain complex technical concepts to diverse audiences.
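As a concrete reference point for the skills above, here is a minimal, illustrative sketch of zero-shot image-text matching with a public CLIP checkpoint via Hugging Face Transformers. The checkpoint name, image path, and candidate captions are assumptions for illustration, not a description of EchoTwin AI's stack.

```python
# Minimal sketch: zero-shot image-text matching with CLIP.
# The checkpoint and image path are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"  # assumed public checkpoint
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

image = Image.open("example.jpg")  # hypothetical local image
captions = ["a photo of a cat", "a photo of a dog", "a city skyline"]

# Preprocess both modalities in one call, then score every caption
# against the image in a single forward pass.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (num_images, num_captions); softmax turns
# the similarity scores into a distribution over candidate captions.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```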

Responsibilities

  • Design and implement state-of-the-art vision-language models using deep learning frameworks.
  • Develop and fine-tune models that combine computer vision and natural language processing for tasks like image captioning, visual question answering, and text-to-image generation (see the captioning sketch after this list).
  • Collaborate with data scientists and software engineers to integrate models into production systems.
  • Optimize model performance for accuracy, latency, and scalability in real-world applications.
  • Conduct experiments to evaluate model performance and iterate on architectures and training pipelines.
  • Stay up to date with the latest research in vision-language models and incorporate advancements into projects.
  • Contribute to data preprocessing, augmentation, and annotation pipelines for multimodal datasets.
  • Document model development processes and present findings to technical and non-technical stakeholders.
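To make the captioning responsibility concrete, the sketch below generates a caption with a pretrained BLIP model through Hugging Face Transformers. It is an illustrative example under assumed names (the Salesforce/blip-image-captioning-base checkpoint and a local image path), not production code.

```python
# Minimal sketch: image captioning inference with a pretrained BLIP model.
# The checkpoint and image path are illustrative assumptions.
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

checkpoint = "Salesforce/blip-image-captioning-base"  # assumed public checkpoint
processor = BlipProcessor.from_pretrained(checkpoint)
model = BlipForConditionalGeneration.from_pretrained(checkpoint)

image = Image.open("example.jpg").convert("RGB")  # hypothetical local image
inputs = processor(images=image, return_tensors="pt")

# Autoregressively decode a caption conditioned on the image features.
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(generated_ids[0], skip_special_tokens=True))
```

Fine-tuning follows the same shape: the processor builds paired image-text batches, and the model is trained with the standard language-modeling loss over the caption tokens.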

Benefits

  • Endless learning and development opportunities from a highly diverse and talented peer group.
  • Options for medical, dental, and vision coverage for employees and dependents (for US employees).
  • Flexible Spending Account (FSA) and Dependent Care Flexible Spending Account (DCFSA).
  • 401(k) with 3% company matching.
  • Unlimited PTO.
  • Profit sharing.