Data Engineer - Hybrid

Surgery Partners, Inc. | Brentwood, TN (Hybrid)

About The Position

This is a hybrid position based at our corporate office in Brentwood, TN, with on-site work required Monday through Wednesday. We are seeking a highly technical Data Engineer with expert-level proficiency in Azure Databricks, distributed data pipelines, and large-scale healthcare data processing. The role focuses on designing and implementing high-throughput ingestion pipelines, transactional lakehouse layers, and secure PHI data flows using Azure-native services and Databricks runtime optimizations. You will build and operate production-grade data pipelines that meet rigorous requirements for security, lineage, HIPAA compliance, observability, and operational SLAs, supporting analytics, AI, and clinical insights across the organization.

Requirements

  • 5+ years of experience in modern data engineering roles
  • Expert-level proficiency in PySpark and Spark SQL
  • Expert-level proficiency in Databricks (Jobs, Workflows, Repos, Delta Live Tables)
  • Expert-level proficiency in Delta Lake architecture and transactional design patterns
  • Expert-level proficiency in Azure Data Factory or Azure Synapse Pipelines
  • Expert-level proficiency in cloud-native data security (RBAC, ABAC, privilege-boundary enforcement)
  • Strong experience working with healthcare data formats and standards: FHIR (JSON), HL7 v2/v3, X12 EDI claims data
  • Deep understanding of distributed systems, data partitioning strategies, concurrency, and cluster resource tuning

Nice To Haves

  • Experience implementing Unity Catalog at enterprise scale
  • Familiarity with MLOps workflows and Databricks MLflow
  • Experience using dbt with Databricks SQL
  • Databricks Data Engineer Professional certification
  • Microsoft Azure DP-203 certification
  • HL7 or FHIR certification

Responsibilities

  • Architect and implement scalable data processing pipelines using Databricks Runtime (Apache Spark, Spark SQL, MLflow), Delta Lake features (ACID transactions, Z-Ordering, OPTIMIZE, Change Data Feed), and Unity Catalog for governance, lineage, RBAC, and audit controls.
  • Design and enforce a medallion (Bronze/Silver/Gold) architecture with schema evolution, Delta Live Tables (DLT), and robust error-handling patterns.
  • Build high-performance ingestion frameworks for FHIR and HL7 message streams, X12 837/835 healthcare claims data, EHR/EMR source systems, and batch, real-time, and event-driven data sources.
  • Develop and operate data pipelines leveraging:
      • Azure Data Lake Storage Gen2 (hierarchical namespace, ACLs, POSIX-style permissions)
      • Azure Data Factory or Synapse Pipelines (parameterization, dynamic pipelines, triggers)
      • Azure Event Hubs and/or Service Bus for streaming ingestion
      • Azure SQL Database and Azure Synapse (Dedicated and Serverless pools)
      • Azure Functions for lightweight orchestration and automation
      • Azure Monitor, Log Analytics, and Application Insights for observability
  • Implement enterprise-grade security including VNet integration and private endpoints, secrets and key management using Azure Key Vault, and managed identities and least-privilege access controls.
  • Develop optimized PySpark and/or Scala pipelines using advanced Spark techniques: Catalyst optimizer tuning, cluster sizing and autoscaling strategies, Adaptive Query Execution (AQE), and efficient join strategies (broadcast vs. shuffle).
  • Build and maintain high-volume batch ETL pipelines (100M+ records) and low-latency streaming pipelines using Spark Structured Streaming.
  • Implement CI/CD for Databricks environments, including Git-integrated DEV/QA/PROD workspaces, automated job and workflow deployments, and unit testing using pytest and Databricks testing frameworks.
  • Design and implement secure PHI pipelines compliant with HIPAA Privacy and Security Rules, and SOC 2 and HITRUST-aligned controls.
  • Build pipelines supporting healthcare data standards including FHIR R4 resources (Patient, Encounter, Observation, Claim, etc.), HL7 v2.x messages (ADT, ORU, ORM), and X12 EDI transactions (837, 835, 270/271).
  • Ensure end-to-end lineage tracking, auditability, and data retention across all lakehouse layers.
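To make the healthcare standards above concrete, here is a minimal sketch (plain Python, standard library only; the sample resource and field names are illustrative, not Surgery Partners' actual schemas) of flattening a FHIR R4 Patient resource into a single analytics-ready row, the kind of Bronze-to-Silver normalization such ingestion pipelines perform:

```python
import json

def flatten_patient(resource: dict) -> dict:
    """Flatten a FHIR R4 Patient resource into one flat row.

    Uses the first entry of the repeating `name` element and
    tolerates missing fields, as real-world FHIR payloads often do.
    """
    name = (resource.get("name") or [{}])[0]
    return {
        "patient_id": resource.get("id"),
        "family_name": name.get("family"),
        "given_name": " ".join(name.get("given", [])),
        "gender": resource.get("gender"),
        "birth_date": resource.get("birthDate"),
    }

# Hypothetical sample Patient resource for illustration only.
sample = json.loads("""{
  "resourceType": "Patient", "id": "p001",
  "name": [{"family": "Doe", "given": ["Jane"]}],
  "gender": "female", "birthDate": "1980-01-01"
}""")
row = flatten_patient(sample)
```

In a production Databricks pipeline this logic would typically be expressed as a `from_json`-based PySpark transform writing to a Delta table rather than plain Python, so it can run at the scale the role describes.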

Benefits

  • Comprehensive health, dental, and vision insurance
  • Health Savings Account with an employer contribution
  • Life Insurance
  • PTO
  • 401(k) retirement plan with a company match