DAT is an award-winning employer of choice and a next-generation SaaS technology company that has been at the leading edge of innovation in transportation supply chain logistics for 45 years. We continue to transform the industry year over year by deploying a suite of software solutions to millions of customers every day: customers who depend on DAT for the most relevant data and most accurate insights to help them make smarter business decisions and run their companies more profitably. We operate the largest marketplace of its kind in North America, with 400 million freight posts in 2022 and a database of $150 billion in annual global shipment market transaction data. Our headquarters are in Denver, CO, and Beaverton, OR, with additional offices in Seattle, WA; Springfield, MO; and Bangalore, India. For additional information, see www.DAT.com/company.

Job Application Deadline: 01/30/2026

The Opportunity

DAT’s Convoy Platform Science team is seeking a Principal ML Platform Engineer to scale and evolve Convoy’s most critical Data and ML Platform capabilities. As the platform enters a new phase of growth, we must dramatically increase our ability to experiment, learn, and adapt in real time across our marketplace, fraud detection, and pricing systems. This role is both deeply hands-on and highly architectural, responsible for building the foundational infrastructure that enables our ML and AI systems to move faster, learn faster, and operate safely at scale.

You will lead the development of the core capabilities that let us:

- Deliver lower-latency data to models, unlocking online learning, adaptive policies, and improved real-time decision-making for Convoy’s auction mechanism, fraud detection apparatus, and carrier engagement campaigns.
- Evolve our ML platform to support generative AI, including the orchestration, retrieval, standardized service patterns, and scalable model serving needed for foundation-model applications in document digitization and voice-based features.
- Experiment faster and safer, through robust causal inference tooling, richer randomized experimentation, and reliable evaluation infrastructure that helps us learn more about the unique spatio-temporal dynamics of a trucking marketplace.

You will define and implement durable service architectures, build the real-time systems that power ML in production, and partner closely with scientists to accelerate iteration and innovation. This is a pivotal moment in the integration of Convoy’s technology into the broader DAT ecosystem, and your work will form the backbone of the next generation of ML and AI capabilities across the freight network.

You are someone who will

As a Principal ML Platform Engineer, you will set technical direction, mentor other engineers and scientists, and deliver solutions whose impact scales across teams and the broader Convoy Platform, not just within individual projects. Your work will influence three major areas:

Experimentation, Evaluation & Adaptive Learning Infrastructure

Drive the evolution of Convoy’s experimentation and model-evaluation foundations. Enable rigorous causal measurement, reliable online experimentation, scalable model iteration, and adaptive learning systems that continuously improve marketplace and policy decisions.

- Evolve Convoy’s experimentation stack (TestDrive): add richer randomized experiments, causal inference tooling, exposure/assignment logging, and metric pipelines; evaluate third-party solutions where beneficial.
- Enable adaptive learning approaches (RL, contextual bandits, online learning) for dynamic marketplace and policy decisions (e.g., inferring the best timing, cohort, or communication channel to maximize carrier engagement).
- Harden our evaluation infrastructure, including offline/online pipelines, drift detection mechanisms, and structured feedback loops that ensure reliable model behavior over time.
- Implement orchestration layers that combine inference, retrieval, business logic, guardrails, and human-in-the-loop flows into reliable, auditable multi-step AI agents.

Feature Stores and Streaming Infrastructure

- Iterate on and expand Convoy Platform’s low-latency Feature Store and real-time streaming platform (on RisingWave) to deliver signals such as app analytics, carrier behavior, and digital fingerprints in support of marketplace optimization, fraud detection, and other decision systems.
- Ensure unified online/offline semantics to improve online decision-making, support real-time optimization, and enable future reinforcement-learning and online-learning workflows.
- Build high-throughput streaming pipelines for carrier engagement, risk indicators, and fraud signals that power sub-minute marketplace and policy decisions.
- Develop platform-level trucking knowledge systems, including RAG indexes, domain adapters, structured benchmarks, and retrieval strategies that ground AI systems in operational realities.

End-to-End Data & ML Platform + Core DevOps/MLOps Foundations

This role entails end-to-end ownership and evolution of Convoy’s Data and ML Platform, spanning data capture, transmission, structuring, storage, and consumption by ML models and analytics. You will architect and lead initiatives that reduce latency, increase reliability, and improve developer efficiency, directly enabling our builders to perform complex analysis and ship high-quality ML products.

- Design and scale the platform ecosystem, leveraging Kafka, Snowflake, Kubernetes, and modern data formats (Avro, JSON, Iceberg), and use Python/Go to build the “connective tissue” that ensures platform reliability and scale.
- Build low-latency, production-grade Python services and contribute to TypeScript/Node where needed (e.g., emitting high-quality data signals, wiring model calls into product flows, enabling experimentation and feature-flag pathways).
- Partner with scientists to define durable service patterns (API design, serving workflows, monitoring) and uplift the platform that enables fast, safe iteration on ML-backed services.
- Mature platform infrastructure, including Terraform/IaC, CI/CD, observability, logging/tracing, incident readiness, and cost/performance optimization.
- Improve SQL/dbt workflows and batch/streaming pipelines to increase reliability, correctness, and scalability.
- Extend model-serving infrastructure to support more advanced ML workloads (managed inference → self-hosted GPU), with standardized versioning, canary/A/B rollouts, and granular monitoring.
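To make the adaptive-learning responsibilities above concrete, here is a minimal sketch of the kind of contextual-bandit-style decision loop the role describes, applied to choosing a carrier communication channel. This is an illustrative toy, not Convoy's actual system: the arm names, engagement rates, and `ThompsonBandit` class are all hypothetical.

```python
import random

class ThompsonBandit:
    """Toy Bernoulli Thompson-sampling bandit (hypothetical example).

    Each arm (e.g., a communication channel: "sms", "email", "push")
    keeps a Beta(successes + 1, failures + 1) posterior over its
    engagement rate; we sample from each posterior and act greedily
    on the sampled values, which balances exploration and exploitation.
    """

    def __init__(self, arms):
        self.stats = {arm: [1, 1] for arm in arms}  # [alpha, beta] priors

    def choose(self):
        # Draw one engagement-rate sample per arm; pick the largest.
        draws = {arm: random.betavariate(a, b)
                 for arm, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, arm, engaged):
        # Bayesian update: engagement bumps alpha, no response bumps beta.
        self.stats[arm][0 if engaged else 1] += 1

bandit = ThompsonBandit(["sms", "email", "push"])
random.seed(0)

# Simulated feedback where "push" truly engages carriers most often.
true_rates = {"sms": 0.10, "email": 0.15, "push": 0.35}
for _ in range(2000):
    arm = bandit.choose()
    bandit.update(arm, random.random() < true_rates[arm])

# After enough feedback the bandit concentrates pulls on the best arm.
pulls = {arm: a + b - 2 for arm, (a, b) in bandit.stats.items()}
print(max(pulls, key=pulls.get))
```

In production such a loop would sit behind the feature store and experimentation stack described above, with exposure logging feeding the `update` step; the sketch only shows the core sample-act-update cycle.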
Job Type: Full-time
Career Level: Principal
Education Level: No Education Listed
Number of Employees: 501-1,000 employees