Data Science / AI Intern – Literature Mining & Graph Modeling

AstraZeneca, Waltham, MA
2d · $41 - $48 · Onsite

About The Position

AstraZeneca is seeking Master’s and PhD students studying Biology, Computer Science, Chemistry, Physics, Engineering, Biomedical Science, Pharmacology, Data Science, Bioinformatics, or a related discipline for a 10-week internship at our site in Waltham, MA, from June 1, 2026 to August 7, 2026. This internship sits at the intersection of data engineering, biomedical NLP, and translational science, enabling faster insight generation for R&D teams.

Requirements

  • Master’s and PhD students studying Biology, Computer Science, Chemistry, Physics, Engineering, Biomedical Science, Pharmacology, Data Science, Bioinformatics, or a related discipline.
  • Candidates must have an expected graduation date after August 2026.
  • US work authorization is required at the time of application. This role does not provide OPT support.
  • NLP and ML: NER, relation extraction, transformers; Python-based workflows.
  • Graph/data modeling: experience with Neo4j, NetworkX, or RDF/SPARQL.
  • Reproducibility: version control, environment management, documentation.
  • Soft skills: problem-solving, communication, collaboration.
  • Ability to report onsite to the Waltham, MA site 3-5 days per week.

Nice To Haves

  • Domain knowledge: genes, pathways, biomarkers, therapeutic modalities (incl. ADCs) preferred.

Responsibilities

  • Build an end-to-end pipeline turning literature (papers, abstracts, patents) into a standardized knowledge graph with contextualized evidence.
  • Handle source selection, inclusion/exclusion criteria, updates, and data snapshots.
  • Develop NLP for entity recognition, relation extraction, assertion detection, and context tagging (drug, indication, resistance, biomarker, outcome).
  • Encode domain relations (e.g., Drug–mechanism→Gene/Pathway; Biomarker–modulates→Outcome; ADC–targets→Antigen).
  • Map entities to controlled vocabularies; manage synonyms, disambiguation, and canonical IDs.
  • Implement edge-level confidence scoring (source quality, claim type, co-occurrence, citations, model certainty) with full evidence provenance.
  • Build graph storage (property graph or RDF) and queryable APIs.
  • Deliver interactive visualization (UI or notebook) with filters, context toggles, and evidence drill-down.
  • Define metrics, run error analyses, and validate with scientific stakeholders.
  • Ensure reproducibility and documentation: version models/data; record architecture, assumptions, benchmarks; provide user guides.
  • Present outcomes to data science, oncology, and translational medicine teams.