About The Position

We are hiring a Solutions Applied Data Scientist to help design, construct, and validate complex healthcare data cohorts used for AI model training. This role sits within the delivery organization, working closely with Solutions Leads and delivery engineers to solve complex data challenges that arise during customer projects. Solutions Leads own the customer relationship and overall delivery of projects. The Solutions Applied Data Scientist serves as their technical partner for more complex data problems, including cohort construction, multi-source dataset assembly, feasibility analysis, and data validation.

You will help translate customer requirements and research generated by Protege’s Data Lab into practical dataset definitions, determine whether those requirements can be met with available data, and build the SQL and analysis needed to construct the resulting datasets. You will also collaborate with delivery engineers when solutions require changes to data pipelines, infrastructure, or large-scale data movement.

This is a highly applied role focused on solving real-world dataset challenges, not research or model development. The ideal candidate enjoys solving messy real-world data problems, working directly with large healthcare datasets, writing complex SQL, and collaborating closely with cross-functional teams. Our environment has a lot going on as we grow, so we’re looking for someone energized by the fast pace of the industry and our company!

Requirements

  • Experience working with large structured healthcare datasets
  • Strong SQL skills and experience writing complex queries
  • Experience using AI coding tools such as Claude Code or Codex
  • Experience joining and transforming large datasets
  • Experience performing data validation and exploratory analysis
  • Strong Python skills for data analysis and scripting
  • Experience working with structured file formats (CSV, Parquet, etc.)
  • Ability to translate ambiguous requirements into concrete data logic
  • Strong communication skills and ability to collaborate with technical and non-technical stakeholders

Responsibilities

  • Act as a technical partner to Solutions Leads, helping solve complex data challenges such as cohort definitions requiring multi-source joins, linking datasets across different data partners, investigating data gaps or anomalies, and evaluating the existence of requested variables or labels.
  • Determine if a dataset can realistically satisfy model requirements.
  • Collaborate with Solutions Leads to unblock delivery challenges and ensure successful project completion.
  • Partner with Solutions Engineers and internal platform engineering teams to implement required workflows when solutions necessitate infrastructure or pipeline changes.
  • Translate customer requirements into concrete dataset logic with Solutions Leads, ensuring datasets accurately represent the intended population and meet specifications.
  • Write complex SQL queries to construct cohorts, implement inclusion and exclusion logic, and join datasets across multiple sources.
  • Validate linkage between datasets, and identify and resolve inconsistencies or missing fields.
  • Partner with Solutions Leads to resolve complex data questions during project delivery.
  • Escalate or collaborate with delivery engineers when dataset construction requires pipeline changes or large-scale data processing.
  • Validate that complex datasets meet required standards before delivery to customers, working closely with Solutions Leads to ensure acceptance criteria are met.
  • Perform data completeness analysis, investigate missing or anomalous data, and verify cohort logic results.
  • Create summary statistics and validation outputs.
  • Work with customer AI researchers and model development teams to translate research goals into practical dataset specifications.
  • Review dataset requests, clarify and refine requirements, and evaluate the availability of requested variables or labels in data sources.
  • Identify proxy variables or alternative dataset structures when ideal variables are unavailable.
  • Assess the feasibility of requested cohort definitions given real-world data constraints.
  • Explain data limitations, tradeoffs, and potential biases to technical stakeholders.
  • Iterate with researchers to converge on datasets that are scientifically meaningful and operationally feasible.
  • Analyze external healthcare data partner datasets to understand schema, field availability, data quality, and completeness, and to identify required transformations.
  • Develop tools and reusable workflows to improve delivery efficiency, such as reusable SQL templates, automated validation checks, and scripts for dataset preparation.