Tech Lead Data Scientist, AI Evaluation & Monitoring

Geisinger

Remote

About The Position

The Tech Lead Data Scientist, AI Evaluation & Monitoring is the principal technical expert for how Geisinger evaluates, monitors, and optimizes AI systems in production. This is a hands-on technical leadership role. The Tech Lead sets the technical direction for AI evaluation across a large and growing portfolio, provides technical leadership to a team of data analysts who execute evaluation work, and partners directly with AI program teams to raise the quality of how AI is validated, monitored, and improved over time. The role exists because AI at Geisinger has scaled past the point where oversight can be a document-review exercise. We need a technical leader who can guide program teams toward better-designed evaluations up front, instrument meaningful production monitoring, and continually advance the methods we use, from LLM-as-Judge frameworks to simulation-based testing to pragmatic experiment design that actually scales in healthcare.

Requirements

6+ years in data science, statistics, ML engineering, or applied quantitative research, with demonstrated experience as the senior technical voice on cross-functional projects
Strong foundation in experimental design and causal inference — and judgment about which method fits which situation
Hands-on experience designing and running model evaluation studies in real production settings
Experience evaluating LLM or generative AI systems, or comparable experience evaluating complex ML systems where ground truth is messy
Proven ability to translate ambiguous failure modes into concrete, defensible evaluation designs and monitoring metrics
Strong fluency in Python and SQL; working comfort with modern ML tooling and cloud-native data environments
Experience with fairness and equity evaluation for ML systems
Track record of providing technical leadership and mentorship without formal people-management authority
Clear written communication — the role produces evaluation memos and specifications that non-technical decision-makers rely on
Bachelor's degree in a related field of study (required)
Minimum of 6 years of relevant experience (required)

Nice To Haves

Healthcare, clinical, or regulated-industry experience strongly preferred
MS or PhD in a quantitative field preferred; equivalent experience accepted

Responsibilities

The technical evaluation methodology applied to AI programs across the enterprise: pre-production validation, production monitoring, and ongoing optimization
Hands-on guidance to program teams as they design validation studies, equity audits, monitoring plans, and escalation playbooks for their AI systems
Instrumentation of production monitoring: translating program-specific failure modes into concrete, measurable metrics
The evaluation toolkit: LLM-as-Judge frameworks, golden sets, simulation harnesses, experimental study designs, drift detection, subgroup fairness analysis
Reusable evaluation playbooks and templates that let each new program move faster than the last
Technical direction, design review, and mentorship for a team of data analysts supporting the evaluation function