Data Architect, Data Foundry

Lilly | Indianapolis, IN
$132,000 - $193,600 | Onsite

About The Position

Lilly Small Molecule Discovery is dedicated to creating molecules that improve lives. Discovery Technology and Platforms (DTP) accelerates this process by building optimized foundational platforms, streamlining lab operations with advanced technologies and data connectivity, and investing in novel capabilities. The Data Foundry, a multidisciplinary team within DTP, enables AI-native drug discovery through four pillars: Architecture4Insight (data infrastructure and scientific software), Methods4Insight (analytical and computational methods), Automation & Scale4Insight (lab automation and agentic workflows), and Preparedness4Insight (data governance and readiness). Together, these pillars provide seamless access to data, insights, and AI-driven capabilities for both human scientists and autonomous AI agents, empowering optimal decision-making.

We are seeking Data Architects at multiple levels to design and build the essential data infrastructure for AI-native drug discovery. This role involves creating schemas, ontologies, data models, knowledge graphs, and platform architectures that transform raw scientific data into machine-actionable, FAIR-compliant, and insight-ready assets for discovery scientists and AI agents. As a foundational role within Architecture4Insight, this team produces the designs that underpin all software engineering efforts, including pipelines, APIs, and prototypes.

The successful candidate will leverage deep knowledge of scientific data (chemical, biological, HTE, and automation-generated) to develop custom solutions and will collaborate with Tech@Lilly on scaling and maintenance. Depending on expertise, the position focuses on one of three areas: data modeling and ontologies, data platform and lakehouse architecture, or knowledge graph and specialized data systems.

Requirements

  • B.S. or M.S. in Computer Science, Data Science, Bioinformatics, Computational Biology, Information Science, or related STEM field; Ph.D. valued for ontology and knowledge graph roles.
  • B.S. with 7+ years, or M.S. with 5+ years, of data architecture, data engineering, or scientific informatics experience.
  • SQL skills and experience in multiple database paradigms (relational, graph, document, columnar, key-value).
  • Qualified applicants must be authorized to work in the United States on a full-time basis. Lilly will not provide support for or sponsor work authorization or visas for this role, including but not limited to F-1 CPT, F-1 OPT, F-1 STEM OPT, J-1, H-1B, TN, O-1, E-3, H-1B1, or L-1.

Nice To Haves

  • Expertise in at least one of: data modeling/ontologies, data platform engineering (Databricks, Snowflake, Spark), or graph/specialized databases (Neo4j, Neptune, MongoDB).
  • Familiarity with cloud platforms (AWS, Azure, or GCP) and modern data integration patterns.
  • Understanding of scientific data types and experimental workflows in life sciences or pharma (chemical, biological, HTE data).
  • Strong communication skills with ability to translate data architecture concepts for both technical and scientific audiences.
  • Pharmaceutical or biotech research industry experience, particularly in discovery data management or research informatics.
  • Experience with semantic web technologies: RDF, OWL, SPARQL, Protégé, or equivalent ontology engineering tools.
  • Hands-on experience with graph databases (Neo4j, Neptune, TigerGraph) and knowledge graph design patterns for scientific data.
  • Data lakehouse architecture experience: Databricks (Delta Lake, Unity Catalog), Snowflake, or equivalent; ETL/ELT with Spark, dbt.
  • Experience with streaming/real-time data platforms (Kafka, Kinesis, Flink) and event-driven architectures.
  • Familiarity with LIMS, ELN systems (e.g., Benchling), and laboratory instrument data integration.
  • Experience with vector databases (Pinecone, Weaviate, pgvector) and embedding-based retrieval for ML/RAG applications.
  • Array database experience (TileDB, Zarr) for genomics, imaging, or high-dimensional scientific data.
  • Experience with bioinformatics data formats (FASTA, BAM/CRAM, VCF) and biological sequence databases; familiarity with NGS data pipelines and proteomics data management.
  • FAIR data principles implementation experience and Data Readiness Level frameworks.
  • Scientific data standards and controlled vocabularies in chemistry (InChI, SMILES) or biology (Gene Ontology, UniProt, pathway databases such as Reactome or KEGG).

Responsibilities

  • Design and implement data models, schemas, and ontologies for chemical, biological, and automation-generated data that serve discovery workflows across the portfolio.
  • Define and maintain controlled vocabularies, metadata standards, and FAIR-compliant data frameworks in partnership with Preparedness4Insight.
  • Implement semantic data standards (RDF, OWL, SPARQL) and ontology engineering practices to create interoperable, machine-readable scientific data.
  • Design and implement data lakehouse architecture using modern platforms (Databricks, Snowflake, or equivalent), including data storage patterns, partitioning strategies, and query optimization.
  • Build and optimize ETL/ELT pipelines using Spark, dbt, or similar tools to transform raw scientific data into analytical and ML-ready formats.
  • Implement real-time and streaming data integration (Kafka, Kinesis, event-driven patterns) connecting LIMS, instruments, and lab automation systems to the data infrastructure.
  • Design and implement knowledge graphs (Neo4j, Amazon Neptune, TigerGraph) that capture molecular, target, pathway, and experimental relationships across the discovery landscape.
  • Architect specialized data solutions: array databases (TileDB) for genomics/imaging, document stores (MongoDB) for experimental records, and vector databases for embedding-based retrieval supporting ML and RAG workflows.
  • Build query and traversal patterns that enable scientists and AI agents to ask relational questions across the entire data landscape.
  • Partner with scientific software engineers to ensure data architectures are implementable, performant, and well-documented.
  • Collaborate with Methods4Insight to design data structures that support analytical model training, deployment, and evaluation.
  • Work with Tech@Lilly to define scaling strategies, ensure enterprise compliance, and transition data architectures to production-grade management.
  • Contribute to build-versus-buy-versus-adopt decisions by evaluating commercial and open-source data platforms against Data Foundry requirements.

Benefits

  • Company bonus (depending, in part, on company and individual performance)
  • Company-sponsored 401(k)
  • Pension
  • Vacation benefits
  • Eligibility for medical, dental, vision, and prescription drug benefits
  • Flexible benefits (e.g., healthcare and/or dependent day care flexible spending accounts)
  • Life insurance and death benefits
  • Certain time off and leave of absence benefits
  • Well-being benefits (e.g., employee assistance program, fitness benefits, and employee clubs and activities)