Staff Data Scientist

Recursion Pharmaceuticals•Salt Lake City, UT

52d•Hybrid

About The Position

As a member of Recursion's AI-driven drug discovery initiatives, you will be at the forefront of reimagining how biological knowledge is generated, stored, accessed, and reasoned upon by LLMs. You will play a key role in developing the biological reasoning infrastructure, connecting large-scale data and codebases with dynamic, agent-driven AI systems.You will be responsible for defining the architecture that grounds our agents in biological truth. This involves integrating biomedical resources to enable AI systems to reason effectively and selecting the most appropriate data retrieval strategies to support those insights. This is a highly collaborative role: you will partner with machine learning engineers, biologists, chemists, and platform teams to build the connective tissue that allows our AI agents to reason like a scientist. The ideal candidate possesses deep expertise in both core bioinformatics/cheminformatics libraries and modern GenAI frameworks (including RAG and MCP), a strong architectural vision, and the ability to translate high-potential prototypes into scalable production workflows.

Requirements

PhD in a relevant field (Bioinformatics, Cheminformatics, Computational Biology, Computer Science, Systems Biology) with 5+ years of industry experience, or MS in a relevant field with 7+ years of experience, focusing on biological data representation and retrieval.
Proficiency in utilizing major public biological databases (NCBI, Ensembl, STRING, GO) and using standard bioinformatics/cheminformatics toolkits (e.g., RDKit, samtools, Biopython).
Strong skills in designing and maintaining automated data pipelines that support continuous ingestion, transformation, and refresh of biological data without manual intervention.
Ability to work with knowledge graph data models and query languages (e.g., RDF, SPARQL, OWL) and translate graph-structured data into relational or other non-graph representations, with a strong judgment in evaluating trade-offs between different approaches.
Competence in building and operating GenAI stacks, including RAG systems, vector databases, and optimization of context windows for large-scale LLM deployments.
Hands-on expertise with agentic AI frameworks (e.g., MCP, Google ADK, LangChain, AutoGPT) and familiarity with leading LLMs (e.g., Google Gemini/Gemma) in agentic workflows, including benchmarking and evaluating agent performance on bioinformatics/cheminformatics tasks such as structure prediction, target identification, and pathway mapping.
Strong Python skills and adherence to software engineering best practices, including CI/CD, Git-based version control, and modular design.
Excellent cross-functional communication skills, ability to clearly explain complex architectural decisions to both scientific domain experts and technical stakeholders.

Nice To Haves

Strong background in machine learning and deep learning, including hands-on experience with foundation models and modern neural architectures.
Fine-tuning LLMs on scientific corpora for domain-specific reasoning.
Integrating LLMs with experimental or proprietary assay data in live scientific workflows.
Background in drug discovery and target identification.
Meaningful contributions to open-source libraries, research codebases, or community-driven tools.

Responsibilities

Architect and maintain robust infrastructure to keep critical internal and external biological resources (e.g., ChEMBL, Ensembl, Reactome, proprietary assays) up-to-date and accessible to reasoning agents.
Design sophisticated context retrieval strategies, choosing the most effective approach for each biological use case, whether working with structured, entity-focused data, unstructured RAG, or graph-based representations.
Integrate established bioinformatics/cheminformatics libraries into a GenAI ecosystem, creating interfaces (such as via MCP) that allow agents to autonomously query and manipulate biological data.
Pilot methods for tool use by LLMs, enabling the system to perform complex tasks like pathway analysis on the fly rather than relying solely on memorized weights.
Develop scalable, production-grade systems that serve as the backbone for Recursion's automated scientific reasoning capabilities.
Collaborate cross-functionally with Recursion's core biology, chemistry, data science and engineering teams to ensure our biological data and the reasoning engines are accurately reflecting the complexity of disease biology and drug discovery.
Present technical trade-offs (e.g., graph vs. vector) to leadership and stakeholders in a clear, compelling way that aligns technical reality with product vision.