Staff Software Engineer, Machine Learning Operations

W.W. Grainger, Chicago, IL
$121,500 - $202,500 (posted 121d ago)

About The Position

The Machine Learning Platform & Operations team enables machine learning scientists and engineers at Grainger to continuously develop, deploy, monitor, and refine machine learning models, and works to improve the ML software development process. Our mission is to empower Grainger teams to effortlessly build, ship, and scale reliable machine learning, data science, and analytical solutions by proactively listening to our users, anticipating Grainger's evolving needs, and delivering self-service, quality-first platforms that accelerate business outcomes. You will work with machine learning, data engineering, network, security, and platform engineering teams to build core components of a scalable, self-service machine learning platform that powers customer-facing applications. You will play an important part in developing the tools and services that form the backbone of Grainger's AI-driven features, which leverage methods in Deep Learning, Natural Language Processing / Generative AI, Computer Vision, and beyond. This is an exciting opportunity to join a team fueling the next phase in Grainger Technology Group's data- and AI-driven modernization.

Requirements

  • Bachelor's degree and 7+ years of relevant work experience, or equivalent staff-level impact in platform / infrastructure roles.
  • Strong software engineering fundamentals and experience developing production-grade software; experience with Python, Golang, or a similar language preferred.
  • Experience leading org-wide platform initiatives and mentoring senior engineers.
  • Strong working knowledge of cloud-based services; AWS preferred.
  • Expertise with IaC tools and patterns to provision, manage, and deploy applications to multiple environments.
  • Deep expertise with GitOps practices and tools as well as policy‑as‑code for safe rollouts.
  • Familiarity with application monitoring and observability tools and integration patterns.
  • Deep, hands‑on experience with containers and Kubernetes.

Nice To Haves

  • Expertise in designing, analyzing, and troubleshooting large-scale distributed systems.
  • Experience driving machine learning system reliability and awareness of associated requirements.
  • Experience building pragmatic Kubernetes extensions and leading safe, multi-cluster Kubernetes upgrades.

Responsibilities

  • Build self-service and automated components of the machine learning platform to enable the development, deployment, and monitoring of machine learning models.
  • Design, monitor, and improve cloud infrastructure solutions that support applications executing at scale.
  • Optimize infrastructure spend by conducting utilization reviews, forecasting capacity, and driving cost/performance trade‑offs for training and inference.
  • Architect multi‑cluster/region topologies for ML workloads and lead progressive delivery patterns in CI/CD.
  • Ensure a rigorous deployment process using DevOps (GitOps) standards and mentor users in software development best practices.
  • Evolve CI/CD from repo‑local workflows to reusable pipeline templates with quality/performance gates.
  • Define org‑wide observability standards for ML system and model reliability; drive adoption across teams.
  • Collaborate with the SRE team to define and drive SRE standards for ML systems.
  • Institute compatibility and deprecation/versioning policies for clusters and runtimes.
  • Own multi‑component roadmap initiatives that measurably move platform & reliability OKRs.
  • Partner with teams across the business to enable reliable adoption of ML.

Benefits

  • Medical, dental, vision, and life insurance plans with coverage starting on day one.
  • 18 paid time off (PTO) days annually for full-time employees and 6 company holidays per year.
  • 6% company contribution to a 401(k) Retirement Savings Plan each pay period, no employee contribution required.
  • Employee discounts, tuition reimbursement, student loan refinancing, and free access to financial counseling.
  • Maternity support programs, nursing benefits, and up to 14 weeks paid leave for birth parents.