Machine Learning Researcher, Multimodal LLMs

Bland
San Francisco, CA
Hybrid

About The Position

At Bland.com, our mission is to empower enterprises to build AI phone agents at scale. Voice is quickly becoming the primary interface between businesses and their customers, and we are building the models and infrastructure that make those interactions feel natural, reliable, and genuinely human. We've raised $65M from leading investors, including Emergence Capital, Scale Venture Partners, Y Combinator, and the founders of Twilio, Affirm, and ElevenLabs.

We are looking for someone to help develop our next-generation multimodal LLM stack, which combines speech, text, tools, and real-time reasoning into a single unified system. At Bland, we're not just thinking about text modeling: you will define how our agents listen, think, and act in real time, and you will take ideas from research all the way to production systems serving millions of calls per day.

Requirements

  • Experience with LLMs, multimodal models, or speech-language systems
  • Deep understanding of prompting, fine-tuning, and alignment techniques
  • Familiarity with neural audio codecs and modern multimodal LLM techniques
  • Ability to go from idea → dataset → experiment → conclusion in days
  • Know how to design experiments that actually answer the question
  • Strong sense for what makes an interaction feel natural versus robotic
  • Ability to translate abstract modeling ideas into user-facing improvements
  • Thrive in ambiguous, fast-moving environments
  • Care about impact, not just elegance
  • Think in systems, not just models
  • Obsess over latency, correctness, and real-world behavior
  • Comfortable discarding ideas quickly when data disagrees

Nice To Haves

  • Experience with real-time voice systems or conversational AI
  • Background in tool-using agents or agent frameworks
  • Experience with multimodal datasets (audio + text + actions)
  • Contributions to LLM or speech-related research or open source

Responsibilities

  • Contribute to the development of our next-generation multimodal LLM stack, combining speech, text, tools, and real-time reasoning into a single unified system
  • Build industry-leading conversational AI models that power Bland's agent, and take them all the way from idea to production
  • Define how agents listen, think, and act in real time, integrating streaming audio, tool execution, and dynamic context into a single coherent system
  • Own your work end to end, taking ideas from research through deployment in production systems serving millions of calls per day
  • Push toward simple abstractions for complex problems

Benefits

  • Competitive salary: $180,000 – $260,000
  • Meaningful equity
  • Full healthcare, dental, and vision coverage
  • Office in Jackson Square, SF
  • High autonomy, high impact