About The Position

Centific is a frontier AI data foundry that curates diverse, high-quality data, using purpose-built technology platforms to empower leading AI organizations and enterprise clients with safe, scalable AI deployment. The company's team includes over 150 PhDs and data scientists and more than 4,000 AI practitioners and engineers, leveraging an integrated solution ecosystem and 1.8 million vertical domain experts across 230+ markets. Centific creates contextual, multilingual, pre-trained datasets; fine-tuned, industry-specific LLMs; and RAG pipelines backed by vector databases. Its zero-distance innovation solutions for GenAI aim to reduce costs by up to 80% and cut time to market by 50%. The mission is to bridge the gap between AI creators and industry leaders, bringing GenAI best practices to innovators and enterprise customers to unlock significant business value by deploying GenAI at scale.

As a Research Scientist, LLM Evaluation & Post-Training, you will be at the forefront of designing evaluation methodologies, measurement strategy, and feedback signals that drive model improvement across Centific's AI platform products. This is a high-impact role blending individual-contributor and collaborative research: applied ML research, enterprise AI product development, and customer-facing scientific consulting. You will lead research programs that define next-generation evaluation-driven post-training workflows, develop rigorous benchmark frameworks, and collaborate with leading AI organizations to deliver credible, actionable model-improvement insights. The role also offers the opportunity to shape Centific's internal research agenda, build reusable scientific assets, and contribute to top-tier publications.

Requirements

  • Expert-level benchmark dataset and test suite design for language and multimodal models
  • Deep understanding of metric design, scoring reliability, and measurement validity
  • Experience with human evaluation methods and quality assurance (rubric design, inter-rater reliability, adjudication frameworks)
  • Strong understanding of post-training techniques (SFT, RLHF, RLAIF, DPO, PPO, GRPO) and how training objectives interact with evaluation outcomes
  • Ability to reason about model behavior, failure modes, and performance tradeoffs across tasks and domains
  • Familiarity with alignment, safety, and robustness considerations in model evaluation
  • Strong statistical analysis skills: sampling, uncertainty quantification, significance testing, error analysis, metric interpretation
  • Ability to synthesize complex experimental findings into concise, actionable recommendations for engineering and business stakeholders
  • MS or PhD in Computer Science, Machine Learning, Statistics, Applied Mathematics, AI, or a related quantitative field (PhD strongly preferred).
  • 5+ years of relevant experience in applied ML research or research science, with substantial work in LLMs or foundation models (graduate research counts).
  • Demonstrated experience with LLM evaluation, benchmarking, alignment, post-training, or model quality research.
  • Strong foundation in experimental design, statistical analysis, and scientific reasoning for ML systems.
  • Strong Python coding skills for research experimentation, data processing, evaluation pipelines, statistical analysis, and visualization.
  • Hands-on experience with modern ML frameworks (PyTorch, Hugging Face, JAX/TensorFlow).
  • Ability to evaluate and compare human and automated evaluation methods, including tradeoffs in cost, reliability, validity, and scalability.
  • Experience designing reproducible evaluation studies across datasets and model versions.
  • Strong written and verbal communication skills; able to present nuanced technical conclusions, assumptions, and limitations clearly to both research and non-technical audiences.

Nice To Haves

  • Hands-on experience running fine-tuning or post-training experiments (SFT, preference optimization, RLHF/RLAIF-style workflows).
  • Experience with multimodal evaluation (text-image, audio, video) and long-context benchmarking in real-world settings.
  • Experience designing multi-turn, interactive, or agentic evaluation protocols.
  • Publications and/or open-source benchmark contributions in LLM evaluation, post-training, alignment, or related areas at top venues (NeurIPS, ICML, ICLR, ACL, EMNLP, etc.).
  • Experience in customer-facing applied research, technical consulting, or cross-functional product/research collaboration.
  • Familiarity with safety, trustworthiness, and governance considerations in GenAI evaluation.

Responsibilities

  • Define and execute a rigorous research agenda focused on LLM evaluation and post-training, with emphasis on evaluation-driven model improvement.
  • Design experiments to study how evaluation methodologies impact fine-tuning and post-training outcomes.
  • Develop and validate comprehensive evaluation frameworks for LLM and multimodal systems, covering benchmark and task design, scoring methods, judge/model-assisted evaluation, human evaluation protocols, and robustness/stress testing.
  • Lead research on frontier evaluation domains including long-context, cross-modal, and dynamic multi-turn evaluations.
  • Study effectiveness and limitations of existing techniques and propose improved methodologies with clear validity and scalability tradeoffs.
  • Analyze model behavior and failure patterns; generate actionable recommendations for model improvement and evaluation redesign.
  • Translate findings into practical improvements for customer solutions and Centific’s internal platforms.
  • Partner with Language Data Scientists to integrate human-in-the-loop and synthetic data/evaluation strategies.
  • Partner with AI/ML Research Engineers to translate research methods into scalable evaluation and post-training pipelines.
  • Engage with customer technical stakeholders at leading AI organizations to understand evaluation goals, review methodologies, and provide expert scientific recommendations.
  • Serve as a credible technical peer to research and engineering leaders.
  • Contribute to internal benchmark datasets, reusable evaluation frameworks, and research assets.
  • Produce high-quality technical documentation, internal research reports, and client-facing materials explaining methods, results, assumptions, and limitations.
  • Contribute to Centific’s position as a leader in LLM evaluation and post-training through publications, conference presentations, and open-source contributions.
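Several of these responsibilities — comparing model versions on shared benchmarks and reporting differences with honest statistical caveats — reduce to paired comparisons. As one hedged sketch (all names and data below are hypothetical), a paired permutation test estimates whether an observed score difference between two model versions on the same items could plausibly arise by chance:

```python
import random

def paired_permutation_test(scores_a, scores_b, n_permutations=10000, seed=0):
    """Two-sided paired permutation test on per-item score differences.

    Under the null hypothesis that the two model versions are equivalent,
    the sign of each item's difference is exchangeable: we randomly flip
    signs and count how often the permuted mean difference is at least as
    extreme as the observed one. The returned value estimates the p-value.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(n_permutations):
        permuted_sum = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted_sum / len(diffs)) >= observed:
            extreme += 1
    return extreme / n_permutations

# Hypothetical per-item correctness for two model versions on one benchmark.
v1 = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
v2 = [1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
print(paired_permutation_test(v1, v2))
```

Because the test pairs scores item by item, it is far more sensitive than comparing unpaired aggregate accuracies — with only three items differing here, the estimated p-value lands near 0.25, a concrete reminder that small benchmark deltas often fail to reach significance.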


What This Job Offers

  • Job Type: Full-time
  • Career Level: Senior
  • Education Level: Ph.D. or professional degree
  • Number of Employees: 101-250 employees
