Applied Data Scientist, LLM Evaluation

Driver
Austin, TX · Remote

About The Position

At Driver, we're building systems that turn source code into human language. The tech stack includes a core compiler-like engine, a heavily asynchronous, distributed backend server, and a frontend web application that provides a rich user experience.

Driver is an early-stage, fast-growing startup backed by Y Combinator and Google Ventures that combines first-principles technical approaches with applied LLM expertise to tackle context engineering at scale: we build the context layer that employees and AI agents alike use to develop software. We take advantage of what startups excel at: delivery speed, flexibility, and the enjoyment of working with a small, close-knit team. Our organizational and engineering values include first-principles thinking, correctness by construction, writing things down, experimentation and iteration, pragmatism, a commitment to effective communication and transparency, autonomy, and ambition.

Requirements

  • Bachelor's, Master's, or PhD in Statistics, Machine Learning, Data Science, Computational Linguistics, or a related quantitative field.
  • 3-5 years in applied science, ML engineering, or data science roles with a focus on evaluation, NLP, or generative AI.
  • Strong statistical foundations: experimental design, hypothesis testing, confidence intervals, effect sizes, power analysis.
  • Experience designing and running evaluations for LLM or NLP systems — you've thought carefully about what "better" means when outputs are open-ended text.
  • Proficient in Python and the scientific/data stack (pandas, NumPy, SciPy, scikit-learn).
  • Comfortable working in Jupyter notebooks for exploration and prototyping, and turning that work into automated pipelines.
  • Experience with LLM-as-judge approaches, inter-annotator agreement, and rubric design for subjective quality assessment (see the sketch after this list).
  • Familiarity with the practical challenges of non-deterministic systems: variance decomposition, multi-run methodology, distinguishing signal from noise at scale.
  • Strong data storytelling — you can turn experiment results into clear recommendations that drive engineering and product decisions.
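
To make the inter-annotator-agreement and rubric bullets above concrete, here is a minimal sketch of validating an LLM judge against a human annotator with quadratic-weighted Cohen's kappa and a bootstrap confidence interval. The data is synthetic and the 1-5 rubric scale is an assumption for illustration, not Driver's actual rubric.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Synthetic example: rubric scores (1-5) for the same 200 documents,
# one set from a human annotator and one from an LLM judge.
rng = np.random.default_rng(0)
human = rng.integers(1, 6, size=200)
judge = np.clip(human + rng.integers(-1, 2, size=200), 1, 5)

# Quadratic weighting treats the 1-5 rubric as ordinal: a 2-vs-5
# disagreement costs more than a 4-vs-5 disagreement.
kappa = cohen_kappa_score(human, judge, weights="quadratic")

# Bootstrap a 95% confidence interval so agreement is reported
# with uncertainty attached, not as a bare point estimate.
idx = np.arange(len(human))
boot = [
    cohen_kappa_score(human[s], judge[s], weights="quadratic")
    for s in (rng.choice(idx, size=len(idx), replace=True) for _ in range(2000))
]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"weighted kappa = {kappa:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

A kappa well below human-human agreement on the same items would be a signal to revise the rubric or the judge prompt before trusting automated scores.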

Nice To Haves

  • 7+ years of experience.
  • Experience with LLM APIs and prompt engineering across multiple providers.
  • Familiarity with evaluation frameworks (e.g., RAGAS, DeepEval, custom harnesses).
  • Experience building data pipelines or ETL workflows (Airflow, Dagster, or similar).
  • Comfort with SQL and working directly against production data stores.
  • Experience with visualization tools (Matplotlib, Plotly, Streamlit) for building internal dashboards and reports.
  • Background in code understanding, developer tools, or technical documentation.
  • Experience building or managing annotation pipelines and human evaluation workflows.

Responsibilities

  • Own the LLM evaluation strategy at Driver, from first principles to production infrastructure.
  • Define quality metrics and build evaluation datasets:
      ◦ Establish what "good" looks like for each content type across the pipeline.
      ◦ Build and curate gold-standard evaluation datasets across languages and repo archetypes (monorepos, microservices, libraries, applications).
      ◦ Design rubrics that capture accuracy, completeness, usefulness, and readability.
  • Build benchmarking and experimentation infrastructure:
      ◦ Create automated evaluation pipelines that score output against reference datasets.
      ◦ Instrument the content generation pipeline to support A/B comparisons: run the same codebase through two strategies and compare results (see the sketch after this list).
      ◦ Build tooling for LLM-as-judge evaluation and regression detection.
      ◦ Integrate evaluation into CI so pipeline changes come with quality evidence.
  • Develop automated quality signals at scale:
      ◦ Build quality checks that flag degraded output without requiring human review of every document.
      ◦ Monitor content quality trends over time.
      ◦ Design sampling strategies for human review that maximize signal with minimal annotation effort.
  • Quantify tradeoffs and inform decisions:
      ◦ Run experiments on model selection, context strategies, and pipeline architecture changes.
      ◦ Quantify cost/quality/latency tradeoffs.
      ◦ Partner with the engineering team to turn evaluation insights into shipped improvements.
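
As one illustration of the A/B-comparison and multi-run bullets above, here is a minimal sketch of a paired comparison between two strategies that averages over repeated runs before testing. The scores are synthetic, and the document counts, run counts, and score scale are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical setup: the same 50 repos run through two context
# strategies, 5 times each, with every run scored in [0, 1].
rng = np.random.default_rng(1)
n_docs, n_runs = 50, 5
scores_a = rng.normal(0.70, 0.08, size=(n_docs, n_runs))
scores_b = rng.normal(0.73, 0.08, size=(n_docs, n_runs))

# Average over runs first so run-to-run sampling noise does not
# swamp the strategy effect.
per_doc_a = scores_a.mean(axis=1)
per_doc_b = scores_b.mean(axis=1)
diff = per_doc_b - per_doc_a  # paired by document

# Paired test: each document is its own control, removing
# document-difficulty variance from the comparison.
t, p = stats.ttest_rel(per_doc_b, per_doc_a)

# Report an effect size alongside the p-value, so "significant"
# is not mistaken for "large".
cohens_d = diff.mean() / diff.std(ddof=1)
print(f"mean lift = {diff.mean():+.3f}, p = {p:.3f}, d = {cohens_d:.2f}")
```

Collapsing runs before a paired test is one reasonable way to handle non-determinism; a hierarchical model or a permutation test over the per-run scores would be a defensible alternative.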

Benefits

  • Competitive Compensation Packages - Cash & Equity
  • Flexible Work Culture
  • Unlimited Time Off + 12 Paid Company Holidays
  • Insurance - Health, Dental, & Vision
  • Life Insurance & FSA Accounts
  • 401(k) Retirement Accounts - Traditional, Roth, or Both
  • Quarterly Team Offsites