Senior Machine Learning Engineer

Zip Co Limited
Remote

About The Position

We're looking for deep expertise in building and operating production-grade ML and data platforms using Spark (PySpark + SQL), Databricks (Azure), and Delta Lake, with strong hands-on experience in MLOps practices including MLflow model lifecycle management, feature store architecture (offline + online), CI/CD for ML workflows, and scalable model deployment. You'll bring a proven ability to design reliable, cost-efficient distributed data systems, optimize Spark workloads, and implement robust governance, observability, and access controls across ML data pipelines, along with strong cloud engineering fundamentals in Azure, including orchestration, infrastructure reliability, and integration with services such as CosmosDB and downstream analytics systems.

The Data Engineering and Machine Learning teams at Zip exist to make data and ML production-ready, trusted, and scalable across the business. Our mission is to elevate the quality, reliability, and accessibility of data assets while enabling innovative AI-driven applications that create measurable customer and commercial impact. We operate with an ownership mindset: engineers here don't just build pipelines, they own platforms end-to-end. Great talent on this team thrives in ambiguity, designs with scale and reliability in mind, and proactively improves standards rather than maintaining the status quo. We work collaboratively across Data Science, Analytics, and Engineering, balancing speed with engineering discipline. We value pragmatic problem-solvers who think in systems, prioritize observability and maintainability, and are motivated by building infrastructure that empowers others to move faster and smarter.

Start your adventure with Zip

We're hiring a Senior Machine Learning Platform Engineer to build and operate the infrastructure that powers production-grade machine learning at Zip. In this role, you'll own the ML lifecycle end-to-end, from feature pipelines and model registry standards to CI/CD and scalable model serving on Databricks (Azure).
You’ll ensure our ML systems are reliable, observable, and built to scale as we expand AI-driven capabilities across the business. Our goals include enhancing the discipline within our data engineering practices, strengthening our collaboration with the Data Analytics and Data Science teams, and elevating the quality of our data assets. These changes are designed to better position us to leverage the full potential of our data, allowing us to explore new and innovative applications, including the use of AI.

Requirements

  • 8+ years of experience in Machine Learning with a strong focus on production-grade ML and distributed data systems
  • Demonstrated experience owning and operating ML systems end-to-end in production environments

Strong Spark Capability (Core Requirement)

  • Advanced experience with PySpark and Spark SQL
  • Strong understanding of Spark execution (joins, shuffles, partitioning)
  • Experience building and optimizing reliable, scalable data pipelines
  • Strong data engineering fundamentals including medallion architecture design, incremental/idempotent ETL patterns, and Delta Lake optimization (partitioning)

MLOps & ML Systems

  • Experience operating ML systems in production
  • Hands-on experience with MLflow (tracking + model registry)
  • Experience managing feature stores (offline + online)
  • Experience deploying and monitoring model serving endpoints
  • Experience implementing CI/CD for ML workflows

Cloud & Platform Experience

  • Experience working in Azure
  • Production experience with Databricks and Delta Lake
  • Experience integrating with CosmosDB or similar NoSQL key-value stores
  • Experience designing orchestrated, production-grade data workflows (Databricks Workflows, Airflow, or ADF) with dependency management, backfills, and failure recovery
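
The incremental/idempotent ETL pattern named above has a simple core idea: re-running the same batch must leave the target unchanged, so a failed-and-retried pipeline never duplicates rows. A minimal plain-Python sketch (hypothetical record shape, no Spark; at production scale this is what a Delta Lake `MERGE INTO` keyed on a business key achieves):

```python
def upsert(target: dict, batch: list, key: str = "id") -> dict:
    """Merge a batch of records into `target`, keyed by a business key.

    Idempotent: applying the same batch twice yields the same result,
    because each record overwrites (rather than appends to) its key slot.
    """
    merged = dict(target)
    for record in batch:
        merged[record[key]] = record  # last write wins per key
    return merged

# Re-running a batch is a no-op: the retried run produces identical state.
state = {}
batch = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
once = upsert(state, batch)
twice = upsert(once, batch)
assert once == twice
```

An append-only `INSERT` pattern would fail this re-run test, which is why merge/upsert semantics are the usual basis for safe backfills and failure recovery.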

Nice To Haves

  • Delta Live Tables and streaming pipelines
  • Iceberg or Lakehouse Federation experience
  • Snowflake experience
  • Vector databases or LLM infrastructure
  • Infrastructure-as-code experience

Responsibilities

Own the ML Lifecycle (MLOps)

  • Build and maintain feature pipelines (batch + streaming)
  • Manage offline and online feature store patterns (CosmosDB-backed online lookup)
  • Administer and enforce standards around MLflow model registry, versioning, and promotion workflows
  • Deploy and operate model serving endpoints
  • Implement CI/CD for ML pipelines and model deployment
  • Participate in on-call rotation for platform-owned systems

Build Production-Grade Spark Systems

  • Develop pipelines using PySpark and Spark SQL
  • Optimize joins, partitioning, and shuffle-heavy workloads
  • Improve reliability and cost-efficiency of Spark jobs
  • Ensure pipelines are modular, testable, and production-ready
  • Support streaming workloads using Delta Live Tables

Operate and Improve the Platform

  • Administer Databricks clusters, jobs, policies, and permissions
  • Improve observability, alerting, and operational standards
  • Contribute to Lakehouse Federation initiatives (Databricks ↔ Snowflake via Iceberg)
  • Integrate ML services into downstream architecture
  • Implement governance, access controls (RBAC), and data quality/observability standards across ML data pipelines
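
The model promotion workflows above usually reduce to a metric-gated check in CI: a candidate moves from Staging to Production only if it beats the current production model. A plain-Python sketch (hypothetical metric name and threshold; in practice this check would sit in front of an MLflow registry stage transition):

```python
from typing import Optional

def should_promote(candidate: dict, production: Optional[dict],
                   metric: str = "auc", min_gain: float = 0.0) -> bool:
    """Gate a Staging -> Production transition on an evaluation metric.

    Promote when there is no production model yet, or when the candidate
    beats production by at least `min_gain` on the chosen metric.
    """
    if production is None:
        return True
    return candidate[metric] >= production[metric] + min_gain

# First deployment: nothing to compare against, so promote.
assert should_promote({"auc": 0.91}, None)
# Meaningful improvement over production: promote.
assert should_promote({"auc": 0.91}, {"auc": 0.88}, min_gain=0.01)
# No improvement beyond the required margin: block the promotion.
assert not should_promote({"auc": 0.88}, {"auc": 0.88}, min_gain=0.01)
```

Encoding the gate as code (rather than a manual review step) is what makes CI/CD for ML repeatable: the same rule runs on every candidate version.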

Benefits

  • Flexible working culture
  • Incentive programs
  • Unlimited PTO
  • Generous paid parental leave
  • Leading family support policies
  • Company-sponsored 401k match
  • Learning and wellness subscription stipend
  • Beautiful Union Square office with a casual dress code
  • Industry-leading, employer-sponsored insurance for you and your dependents, with several 100% Zip-covered choices available