Senior Machine Learning Engineer

Zip Co Limited
Remote

About The Position

We're looking for deep expertise in building and operating production-grade ML and data platforms using Spark (PySpark + SQL), Databricks (Azure), and Delta Lake, with strong hands-on experience in MLOps practices including MLflow model lifecycle management, feature store architecture (offline + online), CI/CD for ML workflows, and scalable model deployment. You'll bring a proven ability to design reliable, cost-efficient distributed data systems, optimize Spark workloads, and implement robust governance, observability, and access controls across ML data pipelines, along with strong cloud engineering fundamentals in Azure, including orchestration, infrastructure reliability, and integration with services such as CosmosDB and downstream analytics systems.

The Data Engineering and Machine Learning teams at Zip exist to make data and ML production-ready, trusted, and scalable across the business. Our mission is to elevate the quality, reliability, and accessibility of data assets while enabling innovative AI-driven applications that create measurable customer and commercial impact. We operate with an ownership mindset: engineers here don't just build pipelines, they own platforms end-to-end. Great talent on this team thrives in ambiguity, designs with scale and reliability in mind, and proactively improves standards rather than maintaining the status quo. We work collaboratively across Data Science, Analytics, and Engineering, balancing speed with engineering discipline. We value pragmatic problem-solvers who think in systems, prioritize observability and maintainability, and are motivated by building infrastructure that empowers others to move faster and smarter.

Start your adventure with Zip

We're hiring a Senior Machine Learning Platform Engineer to build and operate the infrastructure that powers production-grade machine learning at Zip. In this role, you'll own the ML lifecycle end-to-end, from feature pipelines and model registry standards to CI/CD and scalable model serving on Databricks (Azure).
You’ll ensure our ML systems are reliable, observable, and built to scale as we expand AI-driven capabilities across the business. Our goals include enhancing the discipline within our data engineering practices, strengthening our collaboration with the Data Analytics and Data Science teams, and elevating the quality of our data assets. These changes are designed to better position us to leverage the full potential of our data, allowing us to explore new and innovative applications, including the use of AI.

Requirements

  • 8+ years of experience in Machine Learning with a strong focus on production-grade ML and distributed data systems
  • Demonstrated experience owning and operating ML systems end-to-end in production environments

Strong Spark Capability (Core Requirement)

  • Advanced experience with PySpark and Spark SQL
  • Strong understanding of Spark execution (joins, shuffles, partitioning)
  • Experience building and optimizing reliable, scalable data pipelines
  • Strong data engineering fundamentals including medallion architecture design, incremental/idempotent ETL patterns, and Delta Lake optimization (partitioning)

MLOps & ML Systems

  • Experience operating ML systems in production
  • Hands-on experience with MLflow (tracking + model registry)
  • Experience managing feature stores (offline + online)
  • Experience deploying and monitoring model serving endpoints
  • Experience implementing CI/CD for ML workflows

Cloud & Platform Experience

  • Experience working in Azure
  • Production experience with Databricks and Delta Lake
  • Experience integrating with CosmosDB or similar NoSQL key-value stores
  • Experience designing orchestrated, production-grade data workflows (Databricks Workflows, Airflow, or ADF) with dependency management, backfills, and failure recovery
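
The incremental/idempotent ETL pattern named above has a simple core idea: re-running the same batch must leave the target unchanged, so a failed-and-retried pipeline never duplicates rows. A minimal plain-Python sketch (hypothetical record shape, no Spark; at production scale this is what a Delta Lake `MERGE INTO` keyed on a business key achieves):

```python
def upsert(target: dict, batch: list, key: str = "id") -> dict:
    """Merge a batch of records into `target`, keyed by a business key.

    Idempotent: applying the same batch twice yields the same result,
    because each record overwrites (rather than appends to) its key slot.
    """
    merged = dict(target)
    for record in batch:
        merged[record[key]] = record  # last write wins per key
    return merged

# Re-running a batch is a no-op: the retried run produces identical state.
state = {}
batch = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
once = upsert(state, batch)
twice = upsert(once, batch)
assert once == twice
```

An append-only `INSERT` pattern would fail this re-run test, which is why merge/upsert semantics are the usual basis for safe backfills and failure recovery.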

Nice To Haves

  • Delta Live Tables and streaming pipelines
  • Iceberg or Lakehouse Federation experience
  • Snowflake experience
  • Vector databases or LLM infrastructure
  • Infrastructure-as-code experience

Responsibilities

Own the ML Lifecycle (MLOps)

  • Build and maintain feature pipelines (batch + streaming)
  • Manage offline and online feature store patterns (CosmosDB-backed online lookup)
  • Administer and enforce standards around MLflow model registry, versioning, and promotion workflows
  • Deploy and operate model serving endpoints
  • Implement CI/CD for ML pipelines and model deployment
  • Participate in on-call rotation for platform-owned systems

Build Production-Grade Spark Systems

  • Develop pipelines using PySpark and Spark SQL
  • Optimize joins, partitioning, and shuffle-heavy workloads
  • Improve reliability and cost-efficiency of Spark jobs
  • Ensure pipelines are modular, testable, and production-ready
  • Support streaming workloads using Delta Live Tables

Operate and Improve the Platform

  • Administer Databricks clusters, jobs, policies, and permissions
  • Improve observability, alerting, and operational standards
  • Contribute to Lakehouse Federation initiatives (Databricks ↔ Snowflake via Iceberg)
  • Integrate ML services into downstream architecture
  • Implement governance, access controls (RBAC), and data quality/observability standards across ML data pipelines
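
The model promotion workflows above usually reduce to a metric-gated check in CI: a candidate moves from Staging to Production only if it beats the current production model. A plain-Python sketch (hypothetical metric name and threshold; in practice this check would sit in front of an MLflow registry stage transition):

```python
from typing import Optional

def should_promote(candidate: dict, production: Optional[dict],
                   metric: str = "auc", min_gain: float = 0.0) -> bool:
    """Gate a Staging -> Production transition on an evaluation metric.

    Promote when there is no production model yet, or when the candidate
    beats production by at least `min_gain` on the chosen metric.
    """
    if production is None:
        return True
    return candidate[metric] >= production[metric] + min_gain

# First deployment: nothing to compare against, so promote.
assert should_promote({"auc": 0.91}, None)
# Meaningful improvement over production: promote.
assert should_promote({"auc": 0.91}, {"auc": 0.88}, min_gain=0.01)
# No improvement beyond the required margin: block the promotion.
assert not should_promote({"auc": 0.88}, {"auc": 0.88}, min_gain=0.01)
```

Encoding the gate as code (rather than a manual review step) is what makes CI/CD for ML repeatable: the same rule runs on every candidate version.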

Benefits

  • Flexible working culture
  • Incentive programs
  • Unlimited PTO
  • Generous paid parental leave
  • Leading family support policies
  • Company-sponsored 401k match
  • Learning and wellness subscription stipend
  • Beautiful Union Square office with a casual dress code
  • Industry-leading, employer-sponsored insurance for you and your dependents, with several 100% Zip-covered choices available