Roche • Posted 2 days ago
$50 - $50/Yr
Full-time • Intern
Onsite • New York, NY
5,001-10,000 employees

At Roche's AI for Drug Discovery (AIDD) group (formerly Prescient Design), we are revolutionizing drug discovery with cutting-edge machine learning techniques. We are seeking talented researchers and engineers with a passion for building machine learning systems that transform how scientific data is represented, modeled, and evaluated.

AIDD’s Foundation Model team is seeking a Machine Learning Research Intern to work on data interfaces between structured biochemical measurements and large language models, supporting next-generation foundation models for drug discovery as part of our broader Lab-in-the-Loop approach. The intern will collaborate closely with researchers and engineers to design, implement, and evaluate data transformation and modeling pipelines, gaining hands-on experience with real-world scientific datasets and foundation-model workflows. This role is well suited for candidates who enjoy careful technical reasoning, experimentation, and building reusable components that sit at the intersection of machine learning and scientific data.

The group provides a dynamic and challenging environment for multidisciplinary research, including access to heterogeneous data sources, close links to top academic institutions around the world, and collaborations with internal Genentech and Roche teams. This internship is on-site in New York City, NY.

  • Design and implement textification pipelines that translate structured biochemical assay data (e.g., affinity, expression) into precise, uncertainty-aware natural language for LLM training.
  • Develop parsing logic and round-trip validation to recover structured values (numbers, units, QC indicators) from model-generated text, enabling consistent evaluation.
  • Integrate and evaluate optional sequence-neighborhood enrichment using embedding-based retrieval, and study its effect on model calibration and robustness.
  • Run controlled experiments and ablations to analyze how rendering and enrichment choices affect downstream property prediction, ranking, and calibration.
  • Contribute production-quality code to internal frameworks, including clear documentation, README-style usage examples, and comprehensive unit tests.
  • Must be pursuing a Master's degree or a PhD (enrolled student).
  • Field of study: Computer Science, Machine Learning, Data Science, Bioinformatics or Computational Biology, Statistics, Applied Mathematics, Physics, or a related quantitative field.
  • Strong programming skills, particularly in Python, with experience writing clean and maintainable code.
  • Solid understanding of machine learning or NLP fundamentals, including model training and evaluation concepts.
  • Experience working with structured scientific or technical data (e.g., tables, fields, or schemas) in the context of data analysis or modeling.
  • Ability to reason carefully about experimental results and communicate technical ideas clearly.
  • Excellent communication, collaboration, and interpersonal skills.
  • Complements our culture and the standards that guide our daily behavior & decisions: Integrity, Courage, and Passion.
  • Familiarity with biological or biochemical data (e.g., proteins, antibodies, or assays).
  • Intensive 12-week, full-time (40 hours per week) paid internship.
  • Program start dates are in May/June.
  • A stipend, based on location, will be provided to help alleviate costs associated with the internship.
  • Ownership of challenging and impactful business-critical projects.
  • Work with some of the most talented people in the biotechnology industry.
  • Paid holiday time-off benefits.