Machine Learning Researcher - RL and Agentic Systems

Protege

About The Position

Data is the foundation of AI performance, and we believe model quality starts with data quality. As AI systems become more agentic, a critical challenge is understanding which real-world datasets, tasks, and environments actually lead to better model behavior. We’re seeking a Machine Learning Researcher focused on RL and agentic systems to help define, design, and evaluate the datasets, tasks, environments, and benchmarks used to assess advanced AI systems.

In this role, you’ll work closely with research and engineering teams to translate real-world workflows into high-value datasets and evaluation assets: structured tasks, interactive environments, benchmark suites, and quality scorecards that help us understand how models perform in realistic settings. You’ll help define what “high-quality agentic data” means in practice, using statistical, computational, and ML-driven methods to evaluate dataset quality, task design, environment fidelity, and downstream model performance. You’ll work on the core problems of benchmarking real-world data, measuring how well models perform on that data, and designing RL-style or agentic environments that capture the structure of meaningful work.

This is an ideal role for someone with a strong machine learning background who is excited by reinforcement learning, agentic systems, evaluation, and the role of data in shaping model behavior, and who wants to build the datasets and benchmarks that help define what high-quality real-world data looks like for frontier AI systems.

Requirements

PhD, or a Master’s degree plus 4+ years of industry experience, in machine learning, computer science, statistics, engineering, mathematics, economics, or a related quantitative field.
Strong understanding of AI model training pipelines, evaluation methodology, and the role of data in shaping model performance.
Experience working with large, unstructured, or semi-structured datasets used to train or evaluate ML systems.
Experience with reinforcement learning, sequential decision-making, agentic systems, tool-using models, or multi-step model evaluation.
Experience designing tasks, benchmarks, environments, simulations, or evaluation frameworks for real-world model behavior.
Strong intuition for realism, coverage, difficulty, fidelity, and meaningful outcome structure in datasets.
Strong experimental design, evaluation, benchmarking, and data-validation skills.
High ownership and ability to independently identify and solve high-impact problems.

Nice To Haves

Experience developing evaluation frameworks or performance metrics for datasets, agentic systems, or training data.
Experience translating real-world workflows into structured tasks or environments for model evaluation.
Experience with RLHF, RLAIF, imitation learning, reward modeling, online or offline RL, or related methods.
Experience with Harbor or other agent evaluation frameworks.
Publications or open-source contributions in reinforcement learning, agents, evaluation, or data-centric AI.
Experience collaborating cross-functionally with product, infrastructure, or partnership teams.
Experience with synthetic data generation, trajectory generation, or simulation-based environments.

Responsibilities

Design and build datasets, tasks, environments, and evaluation assets for benchmarking agentic systems and multi-step model behavior.
Translate real-world workflows into structured tasks, interaction traces, trajectories, stateful environments, and verifiable outcomes that can be used to evaluate advanced AI systems.
Develop frameworks that assess diversity, realism, coverage, fidelity, informativeness, and downstream usefulness of datasets for agentic systems.
Build quality scorecards and evaluation methods that make dataset strengths, weaknesses, and failure modes legible across teams.
Evaluate planning, tool use, robustness, recovery from failure, task completion, and generalization behavior in RL-style or agentic environments.
Connect model failures back to concrete dataset, environment, or task-design gaps and recommend improvements grounded in empirical evidence.
Contribute to tools and systems that automate dataset validation, environment generation, rollout analysis, benchmark construction, and evaluation workflows.
Improve internal infrastructure for reproducible experimentation, benchmark management, and evaluation quality.
Collaborate closely with research and engineering teams to identify data bottlenecks, improve evaluation methodology, and shape internal best practices around task-grounded AI training data.
Represent DataLab’s perspective in cross-functional discussions around dataset quality, benchmark design, and frontier agentic-system evaluation.