DAT is an award-winning employer of choice and a next-generation SaaS technology company that has been at the leading edge of innovation in transportation supply chain logistics for 45 years. We continue to transform the industry year over year by deploying a suite of software solutions to millions of customers every day: customers who depend on DAT for the most relevant data and most accurate insights to help them make smarter business decisions and run their companies more profitably. We operate the largest marketplace of its kind in North America, with 400 million freight posts in 2022 and a database of $150 billion in annual global shipment market transaction data. Our headquarters are in Denver, CO, and Beaverton, OR, with additional offices in Seattle, WA; Springfield, MO; and Bangalore, India. For additional information, see www.DAT.com/company.

Job Application Deadline: 01/30/2026

The Opportunity

DAT’s Convoy Platform Science team is seeking a Principal ML Platform Engineer to scale and evolve Convoy’s most critical Data and ML Platform capabilities. As the platform enters a new phase of growth, we must dramatically increase our ability to experiment, learn, and adapt in real time across our marketplace, fraud detection, and pricing systems. This role is both deeply hands-on and highly architectural, responsible for building the foundational infrastructure that enables our ML and AI systems to move faster, learn faster, and operate safely at scale.

You will lead the development of the core capabilities that let us:

- Deliver lower-latency data to models, unlocking online learning, adaptive policies, and improved real-time decision-making for Convoy’s auction mechanism, fraud detection apparatus, and carrier engagement campaigns.
- Evolve our ML platform to support generative AI, including the orchestration, retrieval, standardized service patterns, and scalable model serving needed for foundation-model applications in document digitization and voice-based features.
- Experiment faster and safer, through robust causal inference tooling, richer randomized experimentation, and reliable evaluation infrastructure that helps us learn more about the unique spatio-temporal dynamics of a trucking marketplace.

You will define and implement durable service architectures, build the real-time systems that power ML in production, and partner closely with scientists to accelerate iteration and innovation. This is a pivotal moment in the integration of Convoy’s technology into the broader DAT ecosystem, and your work will form the backbone of the next generation of ML and AI capabilities across the freight network.

You are someone who will

As a Principal ML Platform Engineer, you will set technical direction, mentor other engineers and scientists, and deliver solutions whose impact scales across teams and the broader Convoy Platform, not just within individual projects. Your work will influence three major areas:

Experimentation, Evaluation & Adaptive Learning Infrastructure

Drive the evolution of Convoy’s experimentation and model-evaluation foundations. Enable rigorous causal measurement, reliable online experimentation, scalable model iteration, and adaptive learning systems that continuously improve marketplace and policy decisions.

- Evolve Convoy’s experimentation stack (TestDrive): add richer randomized experiments, causal inference tooling, exposure/assignment logging, and metric pipelines; evaluate third-party solutions where beneficial.
- Enable adaptive learning approaches (RL, contextual bandits, online learning) for dynamic marketplace and policy decisions (e.g., inferring the best timing, cohort, or communication channel to maximize carrier engagement).
- Harden our evaluation infrastructure, including offline/online pipelines, drift detection mechanisms, and structured feedback loops that ensure reliable model behavior over time.
- Implement orchestration layers that combine inference, retrieval, business logic, guardrails, and human-in-the-loop flows into reliable, auditable multi-step AI agents.

Feature Stores and Streaming Infrastructure

- Iterate on and expand Convoy Platform’s low-latency Feature Store and real-time streaming platform (on RisingWave) to deliver signals such as app analytics, carrier behavior, and digital fingerprints in support of marketplace optimization, fraud detection, and other decision systems.
- Ensure unified online/offline semantics to improve online decision-making, support real-time optimization, and enable future reinforcement-learning and online-learning workflows.
- Build high-throughput streaming pipelines for carrier engagement, risk indicators, and fraud signals that power sub-minute marketplace and policy decisions.
- Develop platform-level trucking knowledge systems, including RAG indexes, domain adapters, structured benchmarks, and retrieval strategies that ground AI systems in operational realities.

End-to-End Data & ML Platform + Core DevOps/MLOps Foundations

This role entails end-to-end ownership and evolution of Convoy’s Data and ML Platform, spanning data capture, transmission, structuring, storage, and consumption by ML models and analytics. You will architect and lead initiatives that reduce latency, increase reliability, and improve developer efficiency, directly enabling our builders to perform complex analysis and ship high-quality ML products.

- Design and scale the platform ecosystem, leveraging Kafka, Snowflake, Kubernetes, and modern data formats (Avro, JSON, Iceberg), and use Python/Go to build the “connective tissue” that ensures platform reliability and scale.
- Build low-latency, production-grade Python services and contribute to TypeScript/Node where needed (e.g., emitting high-quality data signals, wiring model calls into product flows, enabling experimentation and feature-flag pathways).
- Partner with scientists to define durable service patterns (API design, serving workflows, monitoring) and uplift the platform that enables fast, safe iteration on ML-backed services.
- Mature platform infrastructure, including Terraform/IaC, CI/CD, observability, logging/tracing, incident readiness, and cost/performance optimization.
- Improve SQL/dbt workflows and batch/streaming pipelines to increase reliability, correctness, and scalability.
- Extend model-serving infrastructure to support more advanced ML workloads (managed inference → self-hosted GPU), with standardized versioning, canary/A/B rollouts, and granular monitoring.
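To make the adaptive-learning responsibilities above concrete, here is a minimal sketch of the kind of contextual-bandit-style decision loop the role describes, applied to choosing a carrier communication channel. This is an illustrative toy, not Convoy's actual system: the arm names, engagement rates, and `ThompsonBandit` class are all hypothetical.

```python
import random

class ThompsonBandit:
    """Toy Bernoulli Thompson-sampling bandit (hypothetical example).

    Each arm (e.g., a communication channel: "sms", "email", "push")
    keeps a Beta(successes + 1, failures + 1) posterior over its
    engagement rate; we sample from each posterior and act greedily
    on the sampled values, which balances exploration and exploitation.
    """

    def __init__(self, arms):
        self.stats = {arm: [1, 1] for arm in arms}  # [alpha, beta] priors

    def choose(self):
        # Draw one engagement-rate sample per arm; pick the largest.
        draws = {arm: random.betavariate(a, b)
                 for arm, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, arm, engaged):
        # Bayesian update: engagement bumps alpha, no response bumps beta.
        self.stats[arm][0 if engaged else 1] += 1

bandit = ThompsonBandit(["sms", "email", "push"])
random.seed(0)

# Simulated feedback where "push" truly engages carriers most often.
true_rates = {"sms": 0.10, "email": 0.15, "push": 0.35}
for _ in range(2000):
    arm = bandit.choose()
    bandit.update(arm, random.random() < true_rates[arm])

# After enough feedback the bandit concentrates pulls on the best arm.
pulls = {arm: a + b - 2 for arm, (a, b) in bandit.stats.items()}
print(max(pulls, key=pulls.get))
```

In production such a loop would sit behind the feature store and experimentation stack described above, with exposure logging feeding the `update` step; the sketch only shows the core sample-act-update cycle.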
Job Type: Full-time
Career Level: Principal
Education Level: No Education Listed
Number of Employees: 501-1,000 employees