Senior Applied AI Scientist

Ro • New York, NY

$182,300 - $220,000

About The Position

Ro is building a team focused on shipping LLM-powered products across the patient experience, clinical operations, and internal tooling. We're hiring a Senior Applied AI Scientist to own the evaluation, measurement, and optimization of our AI systems. This role sits at the intersection of data science, applied machine learning, and product engineering. You'll design the frameworks that tell us whether our AI systems are actually working and use those insights to continuously improve them. This is not a research role. You'll work closely with engineers and product teams to evaluate production systems, run experiments, identify failure modes, and ensure our AI products become more accurate, reliable, and cost-effective over time.

Requirements

5+ years of experience in data science, applied machine learning, experimentation, or a closely related field, with at least the last year focused on applied LLMs or AI evaluation.
Strong Python and SQL skills with experience working on production data pipelines and experimentation.
Experience designing reproducible evaluation frameworks rather than relying on manual spot checks or qualitative assessments.
Strong statistical intuition: you think in terms of distributions, confidence intervals, variance, and sample sizes rather than anecdotes.
Comfortable working closely with engineers and product teams to translate experimental findings into production improvements.

Nice To Haves

Experience with evaluation platforms (e.g., Braintrust, LangSmith, OpenAI Evals), experimentation platforms, causal inference, healthcare, or operations-heavy environments.

Responsibilities

Design and own evaluation frameworks for production LLM features, including LLM-as-a-judge evaluations, regression suites, synthetic datasets, golden datasets, and human review workflows.
Analyze production behavior to identify quality issues, hallucinations, latency bottlenecks, cost regressions, and emerging failure modes.
Design and run experiments, including prompt variations, workflow changes, retrieval improvements, and model comparisons, and quantify their impact on quality, operational metrics, and user outcomes.
Define the metrics that matter and build dashboards that make AI performance visible across the organization.
Partner with engineering to determine which optimizations should be productionized and how to measure ongoing success.
Mentor teammates on experimental design, statistical rigor, evaluation methodology, and measurement best practices.