Sr Machine Learning Engineer - ML Infrastructure & Data Platforms

Adobe • San Jose, CA

About The Position

We’re looking for a Senior Machine Learning Engineer to join our Applied Science Data Frameworks team. In this role, you’ll build the infrastructure that powers large-scale, multimodal AI training and inference. You’ll work across machine learning, distributed systems, and data engineering to develop tools and platforms that help teams train and deploy models at scale. Your work will support systems that process billions of data points across large GPU environments. If you’re motivated by solving complex problems and building systems that enable others to do their best work, we’d love to connect.

Requirements

8+ years of experience building and operating distributed systems or ML infrastructure in production
Experience working with large-scale data pipelines or inference systems
Strong programming skills in Python and a foundation in software engineering principles
Experience with ML frameworks such as PyTorch or TensorFlow
Familiarity with distributed computing tools (e.g., Ray, Spark, Dask, or similar)
Experience working with cloud platforms such as AWS or Azure
Understanding of MLOps practices, including CI/CD and deployment workflows
Ability to communicate clearly and collaborate with cross-functional teams

Nice To Haves

Experience working with multimodal data (images, video, text)
Familiarity with vector databases or semantic search systems

Responsibilities

Build distributed data loaders to support large-scale training workflows
Develop data pipelines for ingesting, transforming, and preparing multimodal datasets
Design batch inference systems for high-volume data processing across GPU environments
Improve system performance, scalability, and reliability using distributed computing and data processing tools (e.g., Ray, Spark, DuckDB)
Implement search and retrieval systems using vector databases and embedding-based approaches
Develop and maintain CI/CD workflows, including testing, deployment, and containerization
Partner with researchers and engineers to turn model requirements into scalable systems
Create reusable tools, libraries, and documentation to support teams across the organization
Monitor and improve system health, including throughput, latency, and resource utilization
Support a collaborative team environment through code reviews and knowledge sharing