ML Lead, AI Data Labeling

NewtonX

Posted 3d ago • $180,000 - $260,000 • Remote

About The Position

This is an ML Lead role at NewtonX, a B2B insights company. The role centers on the evaluation of AI systems: delivering structured, expert-grounded evaluation data and domain-specific benchmarks to clients. The ML Lead serves as the technical counterpart to ML and product teams at client organizations, translates their needs into operational specifications, and partners with recruiting and operations to build expert pipelines. The role also involves designing and developing NewtonX's own domain benchmarks across key verticals such as finance, legal, and healthcare. There is a light sales component: joining client calls, identifying opportunities, and shaping pitches. The ML Lead works directly with the VP of Commercial.

Requirements

5 to 8 years of applied ML experience with substantive evaluation, benchmark, or human data work.
Working fluency with modern LLM evaluation, including benchmark design, contamination handling, statistical significance, eval harness construction, agentic and tool-use evaluation, RLHF and preference data quality, and red-team probe design.
Strong programming foundation (Python, working with model APIs, prototyping scoring pipelines).
Statistical fluency (understanding significance, defending sample size choices).
Demonstrated client-facing presence (presenting technical work, defending design choices, adjusting scope).
Light commercial instinct (identifying client problems and potential solutions).
Strong written communication skills (methodology sections, technical reports, proposals).

Nice To Haves

Direct experience designing or contributing to an LLM benchmark or evaluation system.
Domain depth in finance, legal, healthcare, scientific reasoning, and/or software engineering.
Exposure to expert-driven data work (RLHF pipelines, preference data collection, expert annotation programs, red-team operations, evaluation contractor management).
Graduate degree in computer science, machine learning, statistics, or a related quantitative field (or strong applied track record).
Publications or open-source contributions in evaluation, benchmarking, or applied ML methodology.

Responsibilities

Serve as the primary technical point of contact for ML, applied science, and product teams at AI-focused clients.
Engage in technical conversations regarding eval design, dataset construction, contamination risk, statistical power, inter-annotator agreement, RLHF data quality, agentic evaluation, and red-teaming methodology.
Translate ambiguous technical requirements into concrete operational specs, including target expert profiles, screener trees, task design, annotation rubrics, quality control protocols, and statistical sampling plans.
Design and build NewtonX-owned domain benchmarks in high-value verticals (finance, legal, healthcare, etc.).
Architect benchmark structure, including task taxonomy, difficulty distribution, expert involvement model, evaluation rubrics, and scoring protocols.
Recruit and calibrate domain experts for benchmark tasks.
Publish methodology papers, technical reports, and leaderboards.
Work with recruiting and operations to convert client and benchmark requirements into operational specs.
Calibrate the recruiting team on quality standards for each engagement.
Own the technical feedback loop for expert output quality.
Define and monitor quality control metrics such as inter-annotator agreement targets, gold-standard task injection rates, and statistical power thresholds.
Participate in client calls to identify technical gaps and unsolved problems.
Translate identified gaps into concrete proposal narratives.
Contribute to NewtonX's positioning with AI buyers through case studies, technical blog posts, and conference presentations.
Help shape the hiring strategy for additional ML and research roles.