Senior Research Engineer, LLM Evaluation and Behavioral Analysis

Together AI, San Francisco, CA
$220,000 - $270,000

About The Position

Together AI is building the fastest, most capable open-source-aligned LLMs and inference stack in the world. As part of the Turbo organization, you will be a critical bridge between cutting-edge model research and real-world behavioral reliability. This role focuses on deeply understanding model behavior — probing reasoning, tool use, function calling, multi-step interactions, and subtle failure modes — and building the evaluation systems that ensure models behave intelligently and consistently in production. You will develop robust evaluation pipelines, design high-quality behavioral test suites, and work closely with training, post-training, inference, and product teams to identify regressions, shape datasets, and influence model improvements. Your work will directly define how Together measures model quality and reliability across releases.

Requirements

  • Strong engineering skills with Python, evaluation tooling, and distributed workflows.
  • Experience working with LLMs or transformer-based models, particularly in model evaluation, testing, or red-teaming.
  • Ability to reason clearly about qualitative behavior, edge cases, and model failure patterns.
  • Experience designing experiments, building datasets, and interpreting noisy behavioral signals.
  • Understanding of function calling and structured output formats (see the validation sketch after this list).
  • Familiarity with GPU or distributed compute environments.
  • Hands-on experience evaluating function-calling models, agentic systems, or tool-augmented LLM pipelines.
  • Experience with multi-turn or multi-step reasoning tasks.
  • Familiarity with inference systems, distributed infrastructure, or post-training workflows.
  • Passion for discovering subtle behaviors, surprising model gaps, or edge-case failures.
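
To ground the function-calling and structured-output items above, here is a minimal sketch (in Python, using the jsonschema library) of the kind of schema-adherence check such an evaluation suite might run. The tool schema, sample model output, and check_tool_call helper are illustrative assumptions, not Together AI's internal tooling.

```python
import json
from jsonschema import Draft7Validator  # pip install jsonschema

# Hypothetical tool schema, written in the common JSON-Schema style
# used by function-calling APIs. Purely for illustration.
GET_WEATHER_SCHEMA = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location"],
    "additionalProperties": False,
}

def check_tool_call(raw_arguments: str, schema: dict) -> list[str]:
    """Return human-readable failures for one model tool call.

    Checks the two most basic behavioral properties:
      1. the arguments string is valid JSON;
      2. the parsed arguments satisfy the tool's JSON schema.
    """
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return [f"arguments are not valid JSON: {exc}"]

    validator = Draft7Validator(schema)
    return [err.message for err in validator.iter_errors(args)]

# Example: a call with a misspelled enum value and a stray field.
failures = check_tool_call(
    '{"location": "SF", "unit": "celcius", "verbose": true}',
    GET_WEATHER_SCHEMA,
)
for failure in failures:
    print("FAIL:", failure)
```

A production suite would typically layer tool-selection and argument-semantics scoring on top of this, but schema adherence is the natural first gate.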

Responsibilities

  • Build and iterate on evaluation frameworks that measure model performance across instruction following, function calling, long-context reasoning, multi-turn dialog, safety, and agentic behaviors.
  • Develop specialized evaluation suites for:
      • Function calling: argument correctness, schema adherence, tool selection, multi-function planning, and error recovery.
      • Agentic workflows: task decomposition, multi-step planning, self-correction, and autonomous tool-use sequences.
      • Tool-augmented interactions: search, retrieval, code execution, and API-driven actions.
  • Create automated CI/CD pipelines for A/B comparisons, regression detection, behavioral drift monitoring, and adversarial probing (a paired-comparison sketch follows this list).
  • Design and curate high-quality evaluation datasets, especially nuanced or challenging cases across domains.
  • Collaborate with researchers and engineers to diagnose failures, triage regressions, and guide data selection, shaping strategies, objective design, and system improvements.
  • Work with engineering teams to build dashboards, reports, and internal tools that help visualize behavior changes across releases.
  • Operate in a fast-paced, high-impact environment with deep technical ownership and close partnership with world-class model researchers and infra engineers.
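
As a concrete illustration of the A/B comparison and regression-detection work above, here is a minimal paired-comparison sketch over per-example pass/fail results. The exact McNemar-style test via scipy is one reasonable choice among many; the input format and the detect_regression helper are assumptions for illustration, not a description of Together AI's pipeline.

```python
from scipy.stats import binomtest  # pip install scipy

def detect_regression(
    baseline: dict[str, bool],
    candidate: dict[str, bool],
    alpha: float = 0.05,
) -> dict:
    """Paired regression check over per-example pass/fail results.

    Exact McNemar test: among examples where the two models disagree,
    a no-change hypothesis predicts each side wins ~50% of the time.
    """
    shared = baseline.keys() & candidate.keys()
    fixed = sum(1 for ex in shared if not baseline[ex] and candidate[ex])
    broken = sum(1 for ex in shared if baseline[ex] and not candidate[ex])
    discordant = fixed + broken

    # With no disagreements there is nothing to test.
    p_value = binomtest(broken, discordant, 0.5).pvalue if discordant else 1.0
    return {
        "examples": len(shared),
        "fixed": fixed,
        "broken": broken,
        "p_value": p_value,
        "regression": broken > fixed and p_value < alpha,
    }

# Toy example: the candidate breaks 12 of 200 previously passing cases.
baseline = {f"ex{i}": True for i in range(200)}
candidate = {**baseline, **{f"ex{i}": False for i in range(12)}}
print(detect_regression(baseline, candidate))
```

A real pipeline would run a check like this per capability (function calling, long-context, multi-turn) and wire the verdicts into CI gates and drift dashboards.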

Benefits

  • Competitive compensation
  • Startup equity
  • Health insurance
  • Other benefits

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Education Level: Not specified
  • Number of Employees: 101-250
