Data Science / AI Intern – Literature Mining & Graph Modeling

AstraZeneca • Waltham, MA

2d • $41 - $48 • Onsite

About The Position

AstraZeneca is seeking Master’s and PhD students studying Biology, Computer Science, Chemistry, Physics, Engineering, Biomedical Science, Pharmacology, Data Science, Bioinformatics, or a related discipline for a 10-week internship at our site in Waltham, MA, from June 1, 2026 to August 7, 2026. This internship sits at the intersection of data engineering, biomedical NLP, and translational science, enabling faster insight generation for R&D teams.

Requirements

Master’s and PhD students studying Biology, Computer Science, Chemistry, Physics, Engineering, Biomedical Science, Pharmacology, Data Science, Bioinformatics, or a related discipline.
Candidates must have an expected graduation date after August 2026.
U.S. work authorization is required at the time of application. This role does not provide OPT support.
NLP and ML: NER, relation extraction, transformers; Python-based workflows.
Graph/data modeling: experience with Neo4j, NetworkX, or RDF/SPARQL.
Reproducibility: version control, environment management, documentation.
Soft skills: problem-solving, communication, collaboration.
Ability to work onsite at the Waltham, MA site 3-5 days per week.

Nice To Haves

Domain knowledge: genes, pathways, biomarkers, and therapeutic modalities (including ADCs).

Responsibilities

Build an end-to-end pipeline turning literature (papers, abstracts, patents) into a standardized knowledge graph with contextualized evidence.
Handle source selection, inclusion/exclusion criteria, updates, and data snapshots.
Develop NLP for entity recognition, relation extraction, assertion detection, and context tagging (drug, indication, resistance, biomarker, outcome).
Encode domain relations (e.g., Drug–mechanism→Gene/Pathway; Biomarker–modulates→Outcome; ADC–targets→Antigen).
Map entities to controlled vocabularies; manage synonyms, disambiguation, and canonical IDs.
Implement edge-level confidence scoring (source quality, claim type, co-occurrence, citations, model certainty) with full evidence provenance.
Build graph storage (property graph or RDF) and queryable APIs.
Deliver interactive visualization (UI or notebook) with filters, context toggles, and evidence drill-down.
Define metrics, run error analyses, and validate with scientific stakeholders.
Ensure reproducibility and documentation: version models/data; record architecture, assumptions, benchmarks; provide user guides.
Present outcomes to data science, oncology, and translational medicine teams.