The Reliability Engineering group at Walmart Global Tech builds intelligent, data-driven platforms that ensure the availability, performance, and efficiency of Walmart's enterprise and e-commerce systems at massive scale. The team leverages large-scale telemetry, automation, and machine learning to enable proactive optimization, faster incident detection, and resilient system behavior across thousands of services.

About the Team:
Building the right technology foundation for Infrastructure & Platforms is critical to operating at Walmart's scale. Our team designs and maintains the core technologies that power the broader tech organization, including data platforms, observability systems, DevOps tooling, cloud infrastructure, and runtime automation frameworks. These systems support secure, reliable, and scalable operations across stores, digital platforms, and distribution centers worldwide.

What you'll do...
As a Principal ML Engineer, you will architect, build, and operate production-grade ML systems that directly influence runtime behavior across large-scale distributed systems. This is a hands-on engineering role with strong system design and ownership responsibilities.

You will:
Architect and implement end-to-end ML systems (data pipelines, feature engineering, model training, deployment, and monitoring).
Design scalable, low-latency model serving infrastructure integrated with Kubernetes and cloud-native systems.
Build intelligent automation solutions, including predictive autoscaling, anomaly detection, seasonality-aware forecasting, and capacity optimization.
Engineer safe and reliable ML-driven automation that operates in high-availability environments.
Own model lifecycle management, including validation, experiment tracking, model registry, monitoring, drift detection, and rollback strategies.
Collaborate closely with platform, SRE, and infrastructure teams to embed ML capabilities into production systems.
Drive engineering best practices around ML system reliability, observability, testing, and performance.
Contribute to architectural decisions and mentor engineers on ML systems design.

Your solutions will operate at enterprise scale and directly impact system reliability, performance, and infrastructure cost efficiency.
Job Type
Full-time
Career Level
Principal