Machine Learning Engineer III / Senior Machine Learning Engineer - AI Platform

Workday•Vancouver, BC

24d•Hybrid

About The Position

Do you want to build impactful AI features and solutions that will be used by millions of end users? We are in the AI Platform organization at Workday, and we solve meaningful problems that lie at the intersection of machine learning and enterprise-scale software! We build advanced AI solutions that power the core Workday software by modeling user behavior and providing intelligent automation. Come join us and help make work easier and more balanced for millions of Workday users!

We are looking for a Machine Learning Engineer to help design and build our Agent Platform, the core infrastructure that enables teams to develop, deploy, orchestrate, and operate AI agents in production. This role is focused on building the systems and tooling required to host and scale agent-based applications powered by LLMs. You will work across the platform stack to create reusable capabilities for agent execution, workflow orchestration, observability, evaluation, reliability, and developer experience. You’ll partner closely with applied AI, product, and infrastructure teams to define how agents are built and operated across the organization. This is an ideal role for someone who enjoys solving hard engineering problems in a fast-evolving technical space and wants to shape the foundation for the next generation of AI applications.

Requirements

3+ years experience as part of a data science, machine learning software development team, or relevant work in a PhD or equivalent program.
5+ years experience in Python and experience building reliable, maintainable production services.
3+ years experience with distributed systems, APIs, asynchronous workflows, and service-oriented architecture.
3+ years experience designing systems with a focus on scalability, reliability, observability, and maintainability.
6+ years of software engineering experience, including experience building and operating production-grade backend, ML, or platform systems.
8+ years experience in Python and experience building reliable, maintainable production services.
5+ years experience with distributed systems, APIs, asynchronous workflows, and service-oriented architecture.
5+ years experience designing systems with a focus on scalability, reliability, observability, and maintainability.

Nice To Haves

Experience building or supporting agent platforms, AI infrastructure, or internal developer platforms.
Experience building and deploying machine learning or LLM-powered applications in production.
Familiarity with LLM application patterns, including tool calling, retrieval-augmented generation (RAG), memory and context management, multi-step workflows and orchestration, and human-in-the-loop systems.
Experience designing and implementing evaluation frameworks for LLM or agent quality.
Familiarity with vector databases, model serving, prompt/version management, and experimentation tooling.
Solid knowledge of data science principles and their application in NLP.
Experience running services in Kubernetes-based environments.
Ability to work across ambiguity, make strong technical tradeoffs, and drive projects from concept to production.
Strong communication and collaboration skills, with the ability to partner effectively across engineering, product, and AI teams.

Responsibilities

Design and build the core platform capabilities required to develop, host, and operate AI agents at scale.
Develop infrastructure and services for agent execution, orchestration, state management, and runtime reliability.
Build reusable abstractions, frameworks, and workflows in Python to support agent development patterns across teams.
Design and implement systems for tool use, memory, retrieval, workflow coordination, and human-in-the-loop interactions.
Build and maintain services deployed on Kubernetes, with a focus on scalability, resiliency, and operational excellence.
Develop capabilities for evaluation, tracing, observability, debugging, and performance monitoring of agent behavior in production.
Improve platform performance across latency, throughput, fault tolerance, and cost efficiency.
Create internal APIs, SDKs, and developer tooling that make it easier for engineering teams to build on the platform.
Partner with cross-functional teams to productionize new agent use cases and establish common platform patterns and best practices.
Contribute to technical architecture and help define the roadmap for agent infrastructure and platform evolution.