Sr. Engineering Manager, MLOps

Quince, Palo Alto, CA
Posted 3 days ago

About The Position

We are seeking a Senior Engineering Manager, MLOps to join our growing team. The ideal candidate is a technical visionary with a proven track record of building and scaling the underlying infrastructure that powers production-grade Machine Learning. You have a deep understanding of the ML lifecycle, from model development and distributed training to automated deployment and real-time monitoring, and you are passionate about treating infrastructure as a product for your "customers": Quince’s Data Scientists and AI Researchers.

You are a self-starter who excels at identifying architectural bottlenecks and transforming them into seamless, automated "paved roads" that increase team velocity without sacrificing stability. You thrive in an environment of rapid growth and ambiguity, making high-judgment "build vs. buy" decisions and prioritizing technical roadmaps that align directly with e-commerce business outcomes. Above all, you are energized by a culture of distributed decision-making and extreme candor, and you will lead a high-performing team to set new standards for how AI is industrialized at scale to serve Quince customers.

Requirements

  • 10+ years of industry experience, including 3-5 years in a leadership or management role focused specifically on ML Infrastructure, MLOps, or large-scale Data Platform engineering.
  • Proven track record of building and scaling MLOps platforms that support the full model lifecycle—from data ingestion and distributed training to real-time inference and monitoring.
  • Deep technical expertise in cloud-native infrastructure (preferably AWS) and orchestration tools like Kubernetes (EKS), Docker, and Infrastructure as Code (Terraform/Pulumi).
  • Hands-on experience with ML frameworks and tooling, such as PyTorch, TensorFlow, Kubeflow, or SageMaker, and a strong opinion on how to integrate them into a cohesive developer experience.
  • Expertise in building and managing Feature Stores and high-throughput data pipelines (using tools like Spark, Flink, or Kafka) to ensure data consistency across training and serving.
  • Experience partnering with AI Research and Data Science teams to understand their unique workflows and translate research needs into robust, scalable engineering solutions.
  • Strong understanding of CI/CD for ML, including automated testing for models, model versioning, and "blue-green" or "canary" deployment strategies.
  • Demonstrated ability to manage high-cost compute resources, with experience optimizing GPU utilization and cloud spend in a hyper-growth environment.
  • Excellence in operational leadership, with a history of driving service availability, performance, and stability through rigorous on-call rotations and root-cause analysis.
  • A product-oriented mindset, with the ability to treat infrastructure as a platform and prioritize the roadmap based on researcher velocity and business ROI.
  • Exceptional communication and influence skills, capable of navigating ambiguity and building consensus across engineering, product, and data science leadership.
  • Kindness and high standards: You move fast and push for excellence, but you do so as a supportive team player who fosters a culture of psychological safety and extreme candor.

Responsibilities

  • Define the MLOps Vision & Strategy: Architect a long-term roadmap that transitions ML workflows from manual scripts to a fully automated, self-service platform for all Quince Data Scientists and AI Researchers.
  • Own the "Paved Road" for Production: Build and maintain the end-to-end infrastructure for model training, deployment, and serving, ensuring researchers can move from "idea to production" with zero friction.
  • Drive Strategic Prioritization: Partner with business leaders to align infrastructure investments with core e-commerce drivers like real-time personalization, dynamic pricing, and inventory forecasting.
  • Lead "Build vs. Buy" Evaluations: Make high-judgment decisions on when to leverage cloud-native services (e.g., SageMaker, Vertex AI) versus building custom internal tools to optimize for cost, speed, and flexibility.
  • Guarantee System Scalability & Reliability: Oversee the uptime and performance of production ML services, ensuring the stack can handle massive traffic surges and seasonal spikes without degradation.
  • Manage Compute Governance & Costs: Direct the optimization of high-cost computational resources, such as GPU clusters and cloud instances, balancing high-performance training needs with fiscal responsibility.
  • Recruit and Mentor Top Talent: Build and lead a high-performing team of ML Infra and DevOps engineers, providing technical coaching, career pathing, and performance management.
  • Establish MLOps Standards: Drive the adoption of best practices in CI/CD for ML, Infrastructure as Code (IaC), and automated testing to ensure a modular and maintainable system.
  • Bridge the Research-Engineering Gap: Act as the primary cross-functional lead, translating the complex needs of AI Researchers into actionable engineering requirements for the infrastructure team.
  • Define and Track Velocity Metrics: Establish KPIs for the infrastructure team, such as model deployment frequency, mean time to recovery (MTTR), and infrastructure cost per inference.
  • Champion Operational Excellence: Lead root-cause analyses (RCAs) for production failures and foster a culture of accountability where systemic fixes are prioritized over "quick patches."
  • Stay Ahead of the AI Curve: Monitor emerging trends in LLM-ops, vector databases, and real-time feature engineering to ensure Quince’s infrastructure remains competitive and future-proof.


What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Education Level: No Education Listed
  • Number of Employees: 501-1,000 employees
