Staff Software Engineer, Infrastructure - Machine Learning

Ryder Supply Chain Solutions•San Francisco, CA

1d•Hybrid

Requirements

Production Python & Distributed Systems Expertise: Advanced, Staff-level proficiency in Python, gained in a production environment where the code directly impacts operations.
Experience in distributed computing, scalable ML infrastructure, & high-performance engineering.
Machine Learning (MLOps): Scales ML infrastructure for multiple teams and use cases.
Experience implementing and serving ML algorithms.
Ensures reproducibility, lineage, and experiment rigor.
Owns end-to-end ML systems: training, deployment, features, monitoring, rollback.
Hands-on experience with data engineering, distributed training, model monitoring, and experiment tracking.
Breadth of knowledge and applied experience across multiple ML applications, with proven ability to leverage a wide range of tools, frameworks, and systems.
Technical Leadership & Cross-Functional Influence: Leads design and delivery of large-scale ML or distributed systems.
Defines reusable patterns, standards, and architectures.
Drives decisions that improve reliability, latency, and developer velocity.
Sets technical direction and elevates ML engineering standards.
Communicates vision and trade-offs across disciplines.
Can mentor other ML engineers on the team.

Nice To Haves

5 to 8 years of backend or ML infrastructure experience.
Proven track record building production ML workflows at scale.
Experience in the logistics, transportation, or freight industry is a bonus.

Responsibilities

Own Core ML Infrastructure: Build and scale distributed systems for ML training, serving, and inference.
Design and implement real-time ML workflows that power core product features.
Implementation of Distributed Systems: Build robust distributed systems tailored for efficient ML training and seamless operational deployment.
Feature Engineering Enhancement: Streamline and manage both online and offline feature stores, optimizing feature engineering processes for greater efficiency.
Real-Time ML Workflow Enhancement: Improve real-time machine learning workflows to support dynamic decision-making and automate core operational processes.
Platform-Level Ownership: Lead the development of MLOps systems, including model deployment, monitoring, and experiment tracking.
Architect and manage scalable feature stores for online and offline usage.
AI-Driven Optimization: Contribute to agentic AI systems for freight matching, ETA prediction, and load scheduling.
Support systems that improve Stop Estimation Accuracy and Cross-Mode Optimization.
Production-Ready Engineering: Write production-grade Python that operates at scale, with reliability and performance top of mind.
Collaborate across engineering and data science to turn models into resilient software systems.