Applied Scientist / Research Engineer - Multimodal (Come to Singapore)

Mistral AI

About The Position

Mistral AI is seeking world-class Applied Scientists and Research Engineers who wish to relocate to Singapore. You will focus on multimodal learning (text, image, audio, video) to drive innovative research and collaborate with clients on complex projects. You will design, train, and deploy SOTA multimodal models (e.g., Omni-models, VLMs, audio, image generation, robotics and much more) and apply them to diverse use cases: enterprise search, agents grounded in images and documents, video understanding, and speech interfaces. You'll work cross-functionally with internal and external science, engineering, and product teams to deliver high-impact AI solutions.

Requirements

You are fluent in English and have excellent communication skills. You are at ease explaining complex technical concepts to both technical and non-technical audiences.
You're an expert with PyTorch or JAX.
You're not afraid of contributing to a big codebase and can find your way around it independently with little guidance.
You have experience in one of the following: VLMs, diffusion for image/video, audio processing (ASR/TTS), image processing, robotics.
You write clean, readable, high-performance, fault-tolerant Python code.
You don't need roadmaps: you just do. You don't need a manager: you just ship.
Low-ego, collaborative and eager to learn.
You have a track record of success through personal projects, professional projects or in academia.

Nice To Haves

Hold a PhD or master's degree in a relevant field (e.g., Mathematics, Physics, Machine Learning), but if you're an exceptional candidate from a different background, you should still apply.
Can bring a variety of research experience (agents, multi-modality, robotics, diffusion, time-series).
Have contributed to a large codebase used by many (open source or in the industry).
Have a track record of publications in top academic journals or conferences.
Love improving existing code by fixing typing issues, adding tests and improving CI pipelines.

Responsibilities

Run pre-training and post-training, and deploy state-of-the-art models on clusters with thousands of GPUs. You don't panic when you see OOM errors or when NCCL doesn't feel like talking.
Generate and curate multimodal datasets (web-scale image-text, document-image, audio-text, video-text), and build robust evaluators/benchmarks for perception, grounding, OCR, and captioning.
Develop the necessary tools and frameworks to facilitate data generation, model training, evaluation and deployment.
Collaborate with cross-functional teams to tackle complex use cases using agents and RAG pipelines.
Manage research projects and communications with client research teams.

Benefits

Competitive cash salary and equity
Health Insurance
Sport: $90 gym membership allowance
Food: $200 monthly allowance for meals (the solution might evolve as we grow bigger)
Transportation: $120/month for public transport, or parking charges reimbursed
