Expert MLOps Platform Engineer

SAP•Vancouver, BC

63d•$144,600 - $322,500•Hybrid

About The Position

We help the world run better. At SAP, we keep it simple: you bring your best to us, and we'll bring out the best in you. We're builders touching over 20 industries and 80% of global commerce, and we need your unique talents to help shape what's next. The work is challenging – but it matters. You'll find a place where you can be yourself, prioritize your wellbeing, and truly belong. What's in it for you? Constant learning, skill growth, great benefits, and a team that wants you to grow and succeed.

Our DevOps group is a dynamic and innovative team dedicated to working on cutting-edge data and AI technologies. We focus on building robust and scalable infrastructure to support various AI-driven projects and machine learning workflows. With highly motivated and skilled engineers across multiple global locations including Israel, Germany, Hungary, India, and Canada, collaboration and continuous improvement are at the core of our work culture. We are excited to welcome a new member who can contribute to our mission of delivering world-class ML infrastructure and help us achieve greater heights.

Requirements

5+ years of MLOps experience in production cloud-native environments.
Proficient verbal and written communication in English.
Proven expertise in cloud-native ML platforms (e.g., Azure ML, AWS SageMaker) to architect and manage automated continuous training and deployment pipelines.
Experience with model versioning and experiment tracking tools (MLflow, etc.).
Experience optimizing GPU utilization, including long-term capacity management.
Hands-on experience with Databricks, including Unity Catalog for data governance, access control, and metadata management.
Knowledge of data lake and data warehouse principles.
Proven experience with Kubernetes for deploying, managing, and scaling containerized ML applications.
Strong understanding of containerization technologies such as Docker.
Experience building Infrastructure as Code using Terraform or similar tools.
Working knowledge of monitoring, logging, and metrics collection tools in high-scale production environments (Prometheus, Grafana, CloudWatch, Langfuse, OpenTelemetry, etc.).
Familiarity with CI/CD pipelines and tools like GitHub Actions, Jenkins, and Argo CD.
Programming experience with Python and Bash scripting.
Experience with event streaming technologies like Apache Kafka, Azure Event Hubs, or a similar distributed streaming platform.
Strong knowledge of cloud networks and Kubernetes networks (VPC, load balancers, ingress controllers, service meshes).
Hands-on experience with managing and configuring MySQL and PostgreSQL databases in high availability environments.

Nice To Haves

Experience with multiple cloud providers (AWS, Azure, GCP) and their ML services.
Experience with AWS cloud services, particularly AWS SageMaker for ML model training, deployment, and management.
Understanding of data privacy, security, and compliance requirements in ML systems.
Experience with service mesh technologies like Istio.
Knowledge of policy engines like Kyverno or OPA for Kubernetes governance.
Intelligent, curious, and passionate about learning and exploring new technologies.
Willing and able to meet challenges head-on, solve problems independently, and drive initiatives forward.
Open-minded, flexible, and able to thrive in a highly dynamic, fast-paced, ever-changing environment.

Responsibilities

Cooperate with developers and infrastructure teams to ensure highly available and scalable production systems.
Build and maintain highly available and scalable production systems with a focus on optimizing ML platforms and services.
Manage core production systems, including frequent changes and updates.
Monitor, troubleshoot, and rapidly resolve issues in production ML systems.
Implement and manage continuous training pipelines for automated model retraining and deployment.
Establish best practices for MLOps, including model versioning, experiment tracking, and deployment strategies.
Provide support during off hours (nights and weekends) when necessary.