About The Position

As an Applied Scientist focused on Evaluation & Model Behavior, you will design and implement the systems used to measure and improve the performance of Computer Use Agents. This is not a support role. You will be responsible for the technical definition of model quality, including the design of evaluation metrics, the curation of training datasets, and the engineering of system prompts. You'll work directly with the engineering team to translate product requirements into technical specifications and quantifiable benchmarks. You'll focus on rigor, clarity, and impact, ensuring every metric, dataset, and prompt moves us toward more reliable, trustworthy agents.

Requirements

  • Master's degree or PhD in Computer Science, Data Science, Statistics, or a related technical field, or equivalent practical experience
  • 3+ years of experience in Data Science, Machine Learning, or Applied Science
  • Proficiency in Python, with experience writing production-quality code for data pipelines or evaluation harnesses
  • Experience with experimental design, A/B testing, or statistical analysis

Nice To Haves

  • Experience with Large Language Models (LLMs), including prompt engineering, fine-tuning, or RLHF workflows
  • Experience building automated evaluation systems or implementing model-based evaluation frameworks
  • Ability to translate product requirements into measurable technical metrics
  • Experience managing human-in-the-loop data pipelines or annotation quality control

Responsibilities

  • Model Behavior Design: Translate product requirements into technical specifications for model behavior. Engineer system prompts and few-shot examples to address specific capability gaps and behavioral failures.
  • Evaluation Design: Define metrics for reasoning, tool usage, and safety, and validate these metrics against human judgment to ensure statistical rigor.
  • Data Strategy: Design algorithms to filter, score, and select training data. Write Python scripts to sanitize inputs and manage the training data lifecycle from raw logs to high-quality datasets.
  • Failure Analysis: Investigate regressions in model benchmarks. Diagnose root causes, distinguishing among data quality issues, prompt instruction failures, and underlying model capability gaps, and implement fixes.
  • Ground Truth Management: Define rubrics and guidelines for human annotation. Maintain reference datasets ("Golden Sets") to establish a consistent baseline for model performance evaluation.

Benefits

  • Competitive company-sponsored medical, dental, and vision insurance
  • Top-tier relocation and immigration support