Sr Staff Data Scientist, Virtual Biology Initiative, AI Research

Biohub•Redwood City, CA

Hybrid

About The Position

Biohub is launching the Virtual Biology Initiative, a significant five-year commitment to advance AI models for biology. This initiative aims to build predictive models of the human cell by generating massive, multimodal biological data. The data science team plays a crucial role in transforming raw biological measurements into AI-ready datasets, focusing on designing data formats, building efficient processing pipelines, developing QC frameworks, creating agent-augmented curation tools, and establishing cross-modal entity resolution. This role involves active research into novel data representations, tokenization strategies, and combining diverse biological data modalities to enable new AI training architectures. The position offers broad scope and high autonomy, with opportunities to influence roadmaps and mentor senior individual contributors. Success means creating adaptive, interpretable, and scientifically grounded data systems that accelerate progress in biological AI and human health.

Requirements

12+ years of experience (or PhD + 7 years) working with large-scale biological datasets, including ownership of end-to-end data products
Deep expertise in at least one of: (a) imaging data—microscopy, cell phenotyping, spatial biology, and the data characteristics of image-based biological measurement; or (b) genomics data—bulk and single-cell sequencing, functional genomics, epigenomics, transcriptomics, spatial biology, and/or multi-omics
Understanding of how to transform raw biological data into AI-ready datasets, including familiarity with scientific best practices, noise characteristics, batch effects, and quality assessment specific to your domain
Experience with tokenization strategies for non-text data (images, sequences, graphs, time series) or with creating data representations and feature engineering for machine learning in scientific or biological contexts
Strong expertise in data science and statistical modeling; familiarity with modern ML architectures (transformers, diffusion models, or similar) and how data representation choices affect learning
Strong computational skills; demonstrated ability to design robust, extensible data architectures
Excellent communication and leadership skills, with the ability to translate between biology, ML, and engineering audiences and align teams to deliver complex projects
Creative, first-principles thinking about how to structure data for learning

Responsibilities

Set technical vision and strategy for the design of data representations and tokenization strategies across biological data types—including imaging, sequencing, and multimodal data—that enable novel model architectures
Develop, deploy, and validate approaches for combining heterogeneous data modalities into unified training frameworks, designing for robustness to noise, bias, and batch effects
Evaluate model performance, identifying which biological signals are captured or lost and iterating to improve
Partner deeply with ML engineers and AI researchers to co-design datasets and optimize model training, evaluation, and generalization
Lead cross-functional initiatives spanning data engineering, infrastructure, science, and product, aligning technical execution with long-term scientific goals
Identify and drive new data acquisition and generation opportunities, from consortium partnerships to internal experimental pipelines
Serve as a technical mentor and leader, raising the bar for data science and ML rigor across the organization