AI Researcher / ML Engineer (ASR & Speech Specialist)

LILT • Washington, D.C.

About The Position

LILT is transforming how the world communicates by making information accessible to everyone, regardless of language. We leverage cutting-edge AI, machine translation, and human expertise to deliver fast, accurate, and cost-effective translations while maintaining brand voice and quality. At LILT, we foster a collaborative environment with opportunities for growth. Our company virtues (Work together, win together; Find a way or make one; Quicker than they expect; Quality is Job 1) guide our actions. We are trusted by major enterprises like Intel Corporation, Canva, and the U.S. Department of Defense, and backed by investors such as Sequoia and Intel Capital. We are building a category-defining company in the AI-driven global translation market, a market valued at over $50 billion.

We are seeking a highly skilled and visionary Senior AI Researcher / Machine Learning Engineer specializing in Automatic Speech Recognition (ASR) to lead our core speech intelligence and benchmarking initiatives. You will be our principal subject matter expert in AI speech data processing; architect, train, and scale high-performance, multilingual ASR models; and develop rigorous quality benchmarks for agentic conversational AI. A key aspect of this position is creating robust domain-adaptation frameworks that enable our models to dynamically incorporate proprietary customer terminology, specialized industry jargon, and multilingual nuances.

You will collaborate with Engineering, Product, and AI Research teams to translate state-of-the-art speech research into production-ready systems for on-device real-time streaming translation and novel frontier model benchmarks. The key challenges for this role include scaling ASR models capable of dynamic vocabulary insertion for enterprise-grade, ultra-low-latency, real-time environments, and developing end-to-end agentic AI benchmarking that goes beyond surface-level metrics.

Requirements

Master’s or Ph.D. degree in Computer Science, Electrical Engineering, Computational Linguistics, Data Science, or a related quantitative field with an emphasis on speech processing or deep learning (or equivalent proven industry track record).
Minimum of 3–5 years of dedicated professional experience developing ASR systems, speech-to-text translation pipelines, or advanced audio processing models.
Advanced proficiency with PyTorch or equivalent frameworks, along with extensive experience using speech models and toolkits such as Whisper, NVIDIA NeMo, Hugging Face Transformers, Kaldi, ESPnet, or SpeechBrain.
Hands-on experience converting and running PyTorch models on at least one mobile inference runtime: ExecuTorch, LiteRT (formerly TensorFlow Lite), or ONNX Runtime Mobile. You have personally taken a non-trivial model through conversion, including resolving unsupported operations and dynamic-shape or decoder-loop issues.
Strong software engineering skills in Python, with a clear understanding of data structures and algorithm optimization, and experience handling complex multilingual text/audio tokenization schemes.
Proven experience working with large-scale audio datasets, audio augmentation techniques (e.g., SpecAugment, noise injection), and text normalization/inverse text normalization (ITN) pipelines.

Nice To Haves

Experience optimizing models for constrained on-device and production environments using quantization (INT4/INT8/FP16), distillation, ONNX Runtime, TensorRT, or Triton Inference Server.
Peer-reviewed publications at premier speech and machine learning conferences (e.g., ICASSP, INTERSPEECH, NeurIPS, ICLR, ACL), or an active record of contributions to open-source speech communities, are a strong plus.
Working knowledge of mobile NPU/DSP acceleration on the Android SoC landscape (Qualcomm QNN / Hexagon, GPU, and NNAPI delegates) and the trade-offs across Snapdragon, MediaTek, and Google Tensor.
Deep technical familiarity with streaming neural architectures (e.g., block-processing, streaming transformers, or transducer models) and real-time network transport constraints (WebSockets, gRPC).
Professional exposure to building zero-shot multilingual speech systems or managing cross-lingual acoustic phonology data.

Responsibilities

Architect, train, fine-tune, and evaluate state-of-the-art speech representations and ASR models (e.g., End-to-End Conformer, Whisper, RNN-T, and hybrid CTC/Attention architectures) across multiple global languages.
Design and deploy highly scalable algorithms for dynamic vocabulary insertion, contextual biasing, and language model (LM) personalization to precisely capture customer-specific terminology, acronyms, and product names.
Implement automated evaluation frameworks to benchmark model performance, rigorously tracking Word Error Rate (WER), Character Error Rate (CER), embedding-based metrics, latency budgets such as real-time factor (RTF), and compute efficiency under varying acoustic conditions.
Develop pioneering multilingual benchmarks for end-to-end conversational AI agents, covering speech-to-text and text-to-speech components and targeting the weaknesses of state-of-the-art frontier models.
Partner with core engineering teams to build, optimize, and maintain high-throughput pipelines for both ultra-low-latency real-time streaming inference and high-efficiency asynchronous (batch) multi-channel speech analysis.
Develop and refine standard auxiliary components of the speech processing chain, including Voice Activity Detection (VAD), speaker diarization, punctuation restoration, noise/acoustic normalization, and audio pre-processing filters.
Translate product requirements into technical AI roadmaps, working hand-in-hand with Product Managers to ship speech-to-text, simultaneous translation, and semantic speech analytics features.