Staff Software Engineer - AI/ML Systems and Reliability

Adobe • San Jose, CA

About The Position

Adobe is looking for a Staff Software Engineer – AI/ML Systems, MLOps & Reliability to help build and scale the platform powering Adobe Experience Platform’s Personalization ML solutions and Generative AI capabilities. This role sits at the intersection of software engineering, MLOps, infrastructure, and reliability engineering. You will help design and operate the foundational platform that enables scalable model training, reliable inference, automated ML workflows, and production-grade AI systems for enterprise-scale personalization use cases. Partnering closely with engineering, product, and data science teams, you will build systems that support intelligent audience creation, journey optimization, and personalization at scale.

You will join a collaborative and highly technical team of engineers and scientists with deep expertise in distributed systems and machine learning. The ideal candidate enjoys both building platform capabilities for ML systems and operating highly reliable cloud-native infrastructure. This is a hands-on role where you will contribute across MLOps platform development, distributed systems engineering, DevOps automation, and production reliability.

Requirements

Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
8+ years of software engineering experience building distributed systems.
Curiosity and a bias to action.
Strong programming skills in Python or Java.
Experience with microservices, REST APIs, and cloud-native architectures.
Experience with AWS or Azure, Kubernetes, and Docker.
Experience with CI/CD, infrastructure automation, and production operations.
Strong understanding of reliability, scalability, and observability for distributed systems.
Strong troubleshooting, communication, and collaboration skills.

Nice To Haves

Experience with MLOps platforms or ML infrastructure.
Hands-on experience with Generative AI applications.
Familiarity with Ray, Kafka, Spark, Airflow, or similar distributed systems technologies.
Experience with relational and NoSQL databases such as MySQL, PostgreSQL, Redis, Elasticsearch, or Snowflake.
Experience supporting high-throughput, low-latency production systems.

Responsibilities

AI/ML Platform & MLOps: Architect and build infrastructure for AI/ML systems, including Personalization and Generative AI platforms. Design and build MLOps capabilities such as model deployment pipelines, feature stores, model registries, and inference infrastructure. Partner with ML engineers and data scientists to productionize ML models and workflows. Build scalable platform services and APIs supporting multiple teams and products.
Reliability Engineering & DevOps: Improve reliability, scalability, observability, and operational efficiency of distributed AI systems. Build monitoring, alerting, logging, and tracing solutions for production services. Develop CI/CD pipelines, deployment automation, and infrastructure-as-code tooling. Troubleshoot production issues and drive operational excellence for cloud-native services. Design highly available systems that scale horizontally.
Software Engineering & Leadership: Lead technical design and architecture discussions across teams. Participate in design, development, testing, code reviews, deployment, and production support. Evaluate and adopt emerging technologies in AI/ML infrastructure and distributed systems.