Senior Platform Data Engineer

Geisinger

Remote

About The Position

The Senior Platform Data Engineer owns the roadmap, priorities, platform standards, and architecture reviews, and provides formal input on performance reviews. This position makes clinical data ready for AI at scale by owning the shared data products, retrieval infrastructure, and platform administration that the entire AI portfolio depends on. Areas of ownership include real-time data feeds; reusable clinical data models and feature pipelines; RAG retrieval infrastructure (ingestion, chunking, embeddings, vector DB, retrieval pipelines); and Databricks platform administration.

Requirements

5+ years in data engineering, with strong experience building both batch and streaming data pipelines
Expert-level Databricks skills: Delta Live Tables, PySpark, Unity Catalog, Feature Store
Hands-on experience with real-time data ingestion (Kafka, Spark Structured Streaming, or comparable frameworks)
Strong SQL and Python (pandas, PySpark) skills for data transformation and feature engineering
Experience administering Databricks workspaces: cluster policies, compute management, access controls, cost monitoring
Familiarity with clinical data models and healthcare data sources (EHR extracts, ADT feeds, lab results, claims data)
Bachelor's degree in a related field of study (Required)
Minimum of 5 years of relevant experience (Required)

Nice To Haves

Experience with Epic data extraction methods (SDE, FHIR, epic-ws) is a significant plus
Understanding of data governance principles: lineage, quality monitoring, access controls
Master's degree in a related field of study (Preferred)

Responsibilities

Streams data from Epic SDE, ADT feeds, lab results, and other clinical sources into Databricks for downstream model consumption.
Curates shared clinical feature tables (patient demographics, labs, vitals, diagnoses, utilization history, imaging metadata) in Databricks/Unity Catalog that multiple AI programs consume for model training, validation, and monitoring.
Owns the RAG infrastructure: the shared retrieval-augmented generation platform that agentic and generative AI programs use to ground LLM outputs in organizational knowledge.
Designs and operates document ingestion pipelines: normalizing clinical documents, policies, guidelines, and unstructured data sources into formats ready for embedding and retrieval.
Implements and optimizes chunking strategies tailored to healthcare content (e.g., preserving clinical note structure, section-aware chunking for guidelines and protocols).
Manages the embedding pipeline: selecting, tuning, and versioning embedding models (domain-specific clinical models where they outperform general-purpose).
Administers the vector database: schema design, indexing, metadata management, access controls, and performance tuning.
Builds and maintains retrieval pipelines: hybrid search (vector + keyword/BM25), reranking, and relevance filtering to maximize retrieval precision for downstream agents and LLM applications.
Establishes data quality gates for RAG: automated profiling, completeness checks, and accuracy scoring before content enters the vector store.
Monitors retrieval quality metrics (Precision@K, Recall@K, MRR) and continuously optimizes retrieval performance.
Configures Databricks workspaces and governs Unity Catalog.
Administers cluster policies, compute management, and cost monitoring.
Manages users, groups, and access control.
Serves as administrator for the Feature Store.
Accountable for satisfying all job-specific obligations and complying with all organization policies and procedures.

Benefits

We offer healthcare benefits for full-time and part-time positions from day one, including vision and dental coverage and coverage for domestic partners.
We encourage an atmosphere of collaboration, cooperation and collegiality.
We know that a diverse workforce with unique experiences and backgrounds makes our team stronger.
