Senior Platform Data Engineer

Geisinger
Remote

About The Position

The Senior Platform Data Engineer owns the roadmap, priorities, platform standards, and architecture reviews, and provides formal input on performance reviews. This position makes clinical data ready for AI at scale by owning the shared data products, retrieval infrastructure, and platform administration that the entire AI portfolio depends on. Areas of ownership include real-time data feeds, reusable clinical data models and feature pipelines, RAG retrieval infrastructure (ingestion, chunking, embeddings, vector DB, retrieval pipelines), and Databricks platform administration.

Requirements

  • 5+ years in data engineering, with strong experience building both batch and streaming data pipelines
  • Expert-level Databricks skills: Delta Live Tables, PySpark, Unity Catalog, Feature Store
  • Hands-on experience with real-time data ingestion (Kafka, Spark Structured Streaming, or comparable frameworks)
  • Strong SQL and Python (pandas, PySpark) skills for data transformation and feature engineering
  • Experience administering Databricks workspaces: cluster policies, compute management, access controls, cost monitoring
  • Familiarity with clinical data models and healthcare data sources (EHR extracts, ADT feeds, lab results, claims data)
  • Bachelor's degree in a related field of study (Required)
  • Minimum of 5 years of relevant experience (Required)

Nice To Haves

  • Experience with Epic data extraction methods (SDE, FHIR, epic-ws) is a significant plus
  • Understanding of data governance principles: lineage, quality monitoring, access controls
  • Master's degree in a related field of study (Preferred)

Responsibilities

  • Streams data from Epic SDE, ADT feeds, lab results, and other clinical sources into Databricks for downstream model consumption.
  • Curates shared clinical feature tables (patient demographics, labs, vitals, diagnoses, utilization history, imaging metadata) in Databricks/Unity Catalog that multiple AI programs consume for model training, validation, and monitoring.
  • Owns RAG Infrastructure, the shared retrieval-augmented generation platform that agentic and generative AI programs use to ground LLM outputs in organizational knowledge.
  • Designs and operates document ingestion pipelines: normalizing clinical documents, policies, guidelines, and unstructured data sources into formats ready for embedding and retrieval.
  • Implements and optimizes chunking strategies tailored to healthcare content (e.g., preserving clinical note structure, section-aware chunking for guidelines and protocols).
  • Manages the embedding pipeline: selecting, tuning, and versioning embedding models (using domain-specific clinical models where they outperform general-purpose ones).
  • Administers the vector database: schema design, indexing, metadata management, access controls, and performance tuning.
  • Builds and maintains retrieval pipelines: hybrid search (vector + keyword/BM25), reranking, and relevance filtering to maximize retrieval precision for downstream agents and LLM applications.
  • Establishes data quality gates for RAG: automated profiling, completeness checks, and accuracy scoring before content enters the vector store.
  • Monitors retrieval quality metrics (Precision@K, Recall@K, MRR) and continuously optimizes retrieval performance.
  • Configures Databricks workspaces and administers Unity Catalog governance.
  • Manages cluster policies, compute, and cost monitoring.
  • Manages users, groups, and access control.
  • Serves as administrator for the Feature Store.
  • Accountable for satisfying all job-specific obligations and complying with all organization policies and procedures.
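
For readers less familiar with the retrieval quality metrics named above (Precision@K, Recall@K, MRR), the following is a minimal illustrative sketch of how they are commonly computed. It is not a description of this team's actual tooling; the function names and document IDs are hypothetical, and Precision@K is assumed to use the common convention of dividing by K.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k retrieved IDs that are relevant (divides by k)."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of all relevant IDs that appear in the top-k results."""
    if not relevant or k <= 0:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average of 1/rank of the first relevant hit per query (0 if no hit)."""
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


# Hypothetical example: one query's ranked results and its relevant set.
retrieved_ids = ["d3", "d1", "d7", "d2"]
relevant_ids = {"d1", "d2"}
p_at_2 = precision_at_k(retrieved_ids, relevant_ids, 2)
r_at_2 = recall_at_k(retrieved_ids, relevant_ids, 2)
mrr = mean_reciprocal_rank([(retrieved_ids, relevant_ids), (["d9", "d8"], {"d9"})])
```

In practice these would be computed over a labeled evaluation set of queries and tracked over time as chunking, embedding, or reranking choices change.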

Benefits

  • We offer healthcare benefits for full-time and part-time positions from day one, including vision and dental coverage, with coverage extended to domestic partners.
  • We encourage an atmosphere of collaboration, cooperation and collegiality.
  • We know that a diverse workforce with unique experiences and backgrounds makes our team stronger.