About The Position

We're looking for an Applied AI Evaluation Scientist — someone who sits at the intersection of data science, information retrieval, machine learning, and product thinking. This person will own the quality and trustworthiness of our AI/ML systems by designing, building, and running rigorous evaluation frameworks. The primary focus will be on our Agentic Retrieval-Augmented Generation (RAG) pipelines — optimizing how we chunk, embed, retrieve, rank, and generate — but the role extends to evaluating other AI/ML systems across the company.

The ideal candidate has the judgment to know what's worth evaluating and what isn't, and the statistical grounding to ensure the evaluations they do run are sound, realistic, and actionable. Balancing resource capacity against velocity is key: knowing what to measure, and how to measure it, to drive improvements for our customers is paramount. You will work closely with Product and Engineering. Your code doesn't need to be production-hardened, but it must achieve its intended outcomes — think research-quality Python, clear notebooks, and reproducible experiments, not bulletproof microservices.

Requirements

  • Strong product sense. You can look at an AI system's output and tell whether it's good enough for the user — not just whether it passes a benchmark. You understand the difference between a specification failure and a generalization failure, and you act accordingly.
  • Statistical rigor. You're comfortable with experimental design, confidence intervals, hypothesis testing, inter-annotator agreement metrics, and bias correction for imperfect classifiers. You can explain why a metric is or isn't trustworthy.
  • Information retrieval fundamentals. You understand embedding models, vector search, BM25, re-ranking, chunking strategies, and retrieval evaluation metrics (Recall@k, MRR, NDCG; see the sketch after this list). You've thought about why retrieval fails and how to diagnose it.
  • Proficiency with Python and SQL. You can write clean data analysis code, call LLM APIs, parse structured outputs, and build evaluation scripts. You don't need to write production services, but your code should be clear and reproducible.
  • Hands-on experience with LLMs. You've worked with modern LLM APIs (OpenAI, Anthropic, Google Gemini, etc.), written prompts, and understand the strengths and limitations of these models.
  • Data labeling and annotation experience. You've designed labeling tasks, written rubrics, managed annotation quality, and understand the pitfalls of human judgment at scale.
  • Strong written and verbal communication. You'll be presenting findings to Product and Engineering teams and need to make complex evaluation results legible and actionable.
  • BS in Computer Science, Statistics, Data Science, Information Science, Mathematics, Engineering, or a related quantitative field.
  • 5+ years of experience in data science, applied ML, information retrieval, or AI evaluation roles.
  • Demonstrated ability to work cross-functionally with Product and Engineering teams.
  • Experience with Elixir is a plus.
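
To make the retrieval evaluation metrics referenced above (Recall@k, MRR, NDCG) concrete, here is a minimal Python sketch. The document ids and gold set are hypothetical, and a real evaluation would more likely lean on an established IR evaluation library than on hand-rolled functions:

    import math

    def recall_at_k(retrieved, relevant, k):
        """Fraction of the relevant documents that appear in the top-k results."""
        if not relevant:
            return 0.0
        return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

    def mrr(retrieved, relevant):
        """Reciprocal rank of the first relevant result (0.0 if none is retrieved)."""
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                return 1.0 / rank
        return 0.0

    def ndcg_at_k(retrieved, relevant, k):
        """Binary-relevance NDCG@k: discounted gain normalized by the ideal ranking."""
        dcg = sum(1.0 / math.log2(rank + 1)
                  for rank, doc_id in enumerate(retrieved[:k], start=1)
                  if doc_id in relevant)
        ideal = sum(1.0 / math.log2(rank + 1)
                    for rank in range(1, min(len(relevant), k) + 1))
        return dcg / ideal if ideal > 0 else 0.0

    # Hypothetical example: two relevant documents, retrieved at ranks 2 and 4.
    retrieved = ["d3", "d1", "d7", "d4", "d9"]
    relevant = {"d1", "d4"}
    print(recall_at_k(retrieved, relevant, 5))  # 1.0
    print(mrr(retrieved, relevant))             # 0.5
    print(ndcg_at_k(retrieved, relevant, 5))    # ~0.65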

Nice To Haves

  • Experience building or evaluating RAG systems end-to-end in a production setting.
  • Familiarity with LLM-as-Judge patterns — prompt design, calibration, few-shot example selection, and alignment measurement.
  • Experience with observability/tracing tools for LLM pipelines (Arize, Braintrust, LangSmith, or similar).
  • Background in qualitative research methods (grounded theory, open/axial coding) applied to error analysis.
  • Experience building lightweight data review UIs (Streamlit, Gradio, or custom web apps).
  • Familiarity with model cascades, cost optimization, or other strategies for efficient AI evaluation at scale.
  • Experience as a Data Engineer/Architect who understands the tradeoffs among database systems, indexing strategies, and data models beyond vector/hybrid databases — tabular/relational databases, graphs (and graph databases), inverted-index systems (Elasticsearch, Solr, etc.), in-memory key-value caches (Redis, etc.), and more.

Responsibilities

  • Design and curate evaluation datasets for retrieval quality — including synthetically generated query-answer-context pairs, adversarial test cases, and gold sets drawn from real user queries.
  • Measure retrieval quality using metrics like Recall@k, Precision@k, MRR, and NDCG@k. Know when each metric matters and when it doesn't for a given use case.
  • Recommend data cleaning/normalization strategies — real-world data is full of noise that reduces the discriminative power of retrieval algorithms, clutters LLM context windows, and can be a source of irrelevant downstream responses. Identifying the major noise patterns that better cleaning pipelines and/or heuristics could address will help drive improvements.
  • Evaluate and optimize chunking strategies — run grid searches over chunk size, overlap, and segmentation methods. Understand how chunking decisions cascade into retrieval and generation quality.
  • Assess embedding and re-ranking strategies — benchmark embedding models, evaluate re-rankers, and measure the downstream impact on generation quality.
  • Evaluate generation quality in context — measure faithfulness, relevance, hallucination rates, and omissions using a combination of code-based checks, LLM-as-judge, and targeted human review.
  • Attribute failures across the pipeline — determine whether a bad answer is caused by poor data cleanliness/normalization, a retrieval miss, bad chunking, a generation error, or an interaction between components. Build diagnostic tooling to isolate root causes.
  • Conduct systematic error analysis on AI/ML system outputs — read traces, identify failure modes through open and axial coding, and build structured failure taxonomies.
  • Design and validate LLM-as-Judge evaluators where appropriate — write judge prompts, split data into train/dev/test sets, iteratively refine, and measure TPR/TNR against human-labeled ground truth.
  • Estimate true success rates using imperfect judges — apply bias-correction techniques (e.g., Rogan-Gladen) and bootstrap confidence intervals to provide statistically grounded performance estimates (see the Rogan-Gladen sketch after this list).
  • Build and maintain golden datasets for CI regression testing of AI pipelines.
  • Prioritize ruthlessly — assess which failure modes are worth investing evaluation effort into versus which can be fixed by clarifying a prompt or adjusting a tool description.
  • Partner with Product to understand what "good" looks like for specific use cases and translate qualitative product requirements into measurable evaluation criteria.
  • Partner with Engineering to instrument pipelines for observability, design trace logging, and integrate evaluation checks into CI/CD workflows.
  • Design and build lightweight review interfaces (or work with engineers to build them) that make it fast and easy for domain experts to review traces, label data, and provide structured feedback.
  • Lead or facilitate annotation workflows — define rubrics, measure inter-annotator agreement (Cohen's Kappa; see the kappa sketch after this list), run alignment sessions, and produce consensus-labeled datasets.
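
The bias-correction step above can be illustrated with a minimal sketch: the Rogan-Gladen estimator adjusts a judge-observed pass rate using the judge's sensitivity (TPR) and specificity (TNR), and a percentile bootstrap gives a rough confidence interval. The pass rates, TPR/TNR values, and trace counts below are hypothetical, and this version does not propagate uncertainty in the TPR/TNR estimates themselves:

    import random

    def rogan_gladen(observed_rate, tpr, tnr):
        """Correct a judge-observed pass rate for judge error, given the
        judge's TPR and TNR as measured on human-labeled data."""
        denom = tpr + tnr - 1.0
        if denom <= 0:
            raise ValueError("Judge must be better than chance (TPR + TNR > 1).")
        corrected = (observed_rate + tnr - 1.0) / denom
        return min(max(corrected, 0.0), 1.0)

    def bootstrap_ci(judge_verdicts, tpr, tnr, n_boot=2000, alpha=0.05, seed=0):
        """Percentile bootstrap CI for the corrected success rate, resampling
        the judge's 0/1 verdicts on unlabeled production traces."""
        rng = random.Random(seed)
        n = len(judge_verdicts)
        estimates = sorted(
            rogan_gladen(sum(rng.choice(judge_verdicts) for _ in range(n)) / n, tpr, tnr)
            for _ in range(n_boot)
        )
        return estimates[int(alpha / 2 * n_boot)], estimates[int((1 - alpha / 2) * n_boot) - 1]

    # Hypothetical numbers: the judge passes 70% of 500 traces, and on a
    # held-out human-labeled set it shows TPR = 0.90 and TNR = 0.85.
    verdicts = [1] * 350 + [0] * 150
    print(rogan_gladen(0.70, 0.90, 0.85))      # corrected point estimate, ~0.73
    print(bootstrap_ci(verdicts, 0.90, 0.85))  # roughly (0.68, 0.79) with these inputs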
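
For the inter-annotator agreement measurement, a second minimal sketch: Cohen's kappa compares observed agreement between two annotators against the agreement expected by chance. The pass/fail labels below are hypothetical:

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        """Cohen's kappa for two annotators labeling the same items:
        1.0 is perfect agreement, 0.0 is chance-level agreement."""
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        counts_a, counts_b = Counter(labels_a), Counter(labels_b)
        expected = sum((counts_a[c] / n) * (counts_b[c] / n)
                       for c in set(counts_a) | set(counts_b))
        return 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)

    # Hypothetical pass/fail labels from two annotators on ten traces.
    labels_a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
    labels_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "pass"]
    print(cohens_kappa(labels_a, labels_b))  # ~0.47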

Benefits

  • Competitive salary, equity, and benefits including health, dental, and vision insurance.