Engineering Manager - Machine Learning

Recursion•Salt Lake City, UT

$151,130 - $203,490•Hybrid

About The Position

You will lead a team that builds, scales, and optimizes the machine learning infrastructure powering Recursion's drug discovery platform. From model training pipelines and production deployment systems to agent infrastructure and large language models, you will ensure our ML models can operate at massive scale across our supercomputing infrastructure, both on-prem and in the cloud. You will work cross-functionally with ML engineering, data science, and research teams to translate requirements into robust, scalable ML infrastructure solutions.

Requirements

Experience in a hands-on technical role as a tech lead or manager, with a focus on infrastructure, MLOps, and distributed systems.
Excitement for engaging deeply with your team on the technical details of machine learning, orchestration, and agentic systems.
A people-first mindset. We deliver in a way that prioritizes supporting our coworkers' growth and experience, and we understand how Conway's Law shapes our ML system outcomes.
A demonstrated track record of learning from and teaching peers in ML infrastructure, model deployment, distributed compute, GPU optimization, and MLOps system architecture.
Excitement to learn parts of our ML tech stack that you might not already know. Our current ML infrastructure includes Python, PyTorch, Docker, Kubernetes, Ray, Weights & Biases, Prefect, BigQuery, Postgres, GCP, CUDA, and various model serving frameworks.

Nice To Haves

Fluency in life sciences or drug discovery is a plus but not required.

Responsibilities

Enable AI/ML, LLM, and Agentic Systems teams for scale - The ML infrastructure team is responsible for building and operating platforms that allow data scientists and ML engineers to train, deploy, and monitor models across Recursion's massive datasets. With billions of compounds, 30+ petabytes of experimental data, and complex deep learning workloads, your team enables everything from automated compound screening models to clinical trial prediction systems. You will work closely with researchers and ML engineers to understand their infrastructure needs and build scalable solutions for model development, training, and deployment.
Act as a mentor, coach, and sponsor - You will share your technical, leadership, and managerial skills in MLOps, distributed computing, and infrastructure engineering, delivering impact, learning, and growth across teams at Recursion. We believe the best work comes from working across organizational boundaries, and you will have opportunities to partner with ML research, platform engineering, and business teams.
Enable a model-driven culture - Machine learning is at the core of everything we do. You will work with stakeholders across the business to ensure our ML infrastructure supports rapid experimentation, reliable model deployment, and continuous improvement. Problems you will work on could range from optimizing GPU cluster utilization to implementing agentic orchestration and establishing company-wide MLOps standards.