ML Ops Engineer

Adobe | San Jose, CA
Posted 13 days ago

About The Position

Changing the world through digital experiences is what Adobe’s all about. We give everyone—from emerging artists to global brands—everything they need to design and deliver exceptional digital experiences! We’re passionate about empowering people to create beautiful and powerful images, videos, and apps, and transform how companies interact with customers across every screen. We’re on a mission to hire the very best and are committed to creating exceptional employee experiences where everyone is respected and has access to equal opportunity. We realize that new ideas can come from everywhere in the organization, and we know the next big idea could be yours!

The Opportunity

Join Adobe as a skilled and proactive Machine Learning Ops Engineer to drive the operational reliability, scalability, and performance of our AI systems! This role is foundational in ensuring our AI systems operate seamlessly across environments while meeting the needs of both developers and end users. You will lead efforts to automate and optimize the full machine learning lifecycle—from data pipelines and model deployment to monitoring, governance, and incident response.

Requirements

  • 3–5+ years in MLOps, DevOps, or ML platform engineering.
  • Strong experience with cloud infrastructure (AWS/GCP/Azure), container orchestration (Kubernetes), and IaC tools (Terraform, Helm).
  • Familiarity with ML model serving tools (e.g., MLflow, Seldon, TorchServe, BentoML).
  • Proficiency in Python and CI/CD automation (e.g., GitHub Actions, Jenkins, Argo Workflows).
  • Experience with monitoring tools (Prometheus, Grafana, Datadog, ELK, Arize AI, etc.).
  • Bachelor's or equivalent experience in Computer Science, Engineering, or a related technical field.

Nice To Haves

  • Experience supporting LLM applications, RAG pipelines, or AI agent orchestration.
  • Understanding of vector databases, embedding workflows, and model retraining triggers.
  • Exposure to privacy, safety, and responsible AI principles in operational contexts.

Responsibilities

  • Model Lifecycle Management: Manage model versioning, deployment strategies, rollback mechanisms, and A/B testing frameworks for LLM agents and RAG systems. Coordinate model registries, artifacts, and promotion workflows in collaboration with ML Engineers.
  • Monitoring & Observability: Implement real-time monitoring of model performance (accuracy, latency, drift, degradation). Track conversation quality metrics and user feedback loops for production agents.
  • CI/CD for AI: Develop automated pipelines for prompt and agent testing, validation, and deployment. Integrate unit and integration tests into model and workflow updates for safe rollouts.
  • Infrastructure Automation: Provision and manage scalable infrastructure (Kubernetes, Terraform, serverless stacks). Enable auto-scaling, resource optimization, and load balancing for AI workloads.
  • Data Pipeline Management: Craft and maintain data ingestion pipelines for both structured and unstructured sources. Ensure reliable feature extraction, transformation, and data validation workflows.
  • Performance Optimization: Monitor and optimize AI stack performance (model latency, API efficiency, GPU/compute utilization). Drive cost-aware engineering across inference, retrieval, and orchestration layers.
  • Incident Response & Reliability: Build alerting and triage systems to identify and resolve production issues. Maintain SLAs and develop rollback/recovery strategies for AI services.
  • Compliance & Governance: Enforce model governance, audit trails, and explainability standards. Support documentation and regulatory frameworks (e.g., GDPR, SOC 2, internal policy alignment).