Principal MLOps Engineer

SOLVENTUM • Pittsburgh, PA

1d • $142,800 - $196,350 • Remote

About The Position

Solventum is a new healthcare company with a long legacy of creating breakthrough solutions for our customers’ toughest challenges. We pioneer game-changing innovations at the intersection of health, material and data science that change patients' lives for the better while enabling healthcare professionals to perform at their best.

As a Principal MLOps Engineer, you will lead the operational architecture, deployment strategy, and reliability engineering for integrating AI into high-stakes Healthcare Information Systems (HIS). You will define the enterprise operational standards, govern the release processes, and build the resilient infrastructure required to maintain models in mission-critical clinical environments. You are the definitive authority on production discipline, compliance support, and incident resolution for the AI organization.

Requirements

Bachelor's Degree or Higher in Computer Science, Software Engineering, or related technical field.
10+ years of experience in software engineering, with at least 6 years dedicated to deploying and maintaining large-scale ML systems in production (not just research or POCs).
Expert-level experience with Cloud Providers (AWS/GCP/Azure) and orchestration tools (Kubernetes, Kubeflow, or Airflow).
Expert-level proficiency in Python and in Java, Go, or a similar language.
Deep proficiency in backend frameworks, microservices, and system design patterns.
Expert knowledge of monitoring stacks (Prometheus, Grafana, Datadog) and establishing enterprise SLAs/SLOs for AI services.
Proven track record of designing automated deployment pipelines, managing complex rollback procedures, and enforcing model registry governance at scale.
Must be legally authorized to work in a country of employment without sponsorship for employment visa status (e.g., H1B status).

Nice To Haves

Master’s or PhD in Computer Science, Software Engineering, or related technical field is preferred.
Deep understanding of cybersecurity best practices and ATO (Authority to Operate) processes within regulated industries (Healthcare, Finance, or Defense).
Proven ability to design systems that handle massive concurrency and distributed data processing.

Responsibilities

Architect and govern the comprehensive release process, defining enterprise checklists, automated approval gates, release notes, and deployment readiness standards.
Establish the deployment execution standards for promoting AI models across all environments, and ensure customer deployments adhere to strict internal production discipline.
Architect and oversee the enterprise model registry, ensuring seamless integration with CI/CD pipelines and full version control traceability.
Define and enforce monitoring standards, establishing critical SLAs/SLOs, service health metrics, and comprehensive dashboards across the AI ecosystem.
Architect automated checks for input/output data quality and model drift, ensuring proactive detection of system degradation.
Establish and lead the production incident process, including rigorous triage workflows, severity escalation paths, postmortems, rollback mechanisms, and recovery infrastructure.
Partner with Platform teams to provide essential ATO (Authority to Operate) and compliance support, ensuring complete deployment traceability and strict operational controls.
Oversee comprehensive operational reporting, providing leadership with status updates across production systems, pre-prod testing, customer rollouts, and incident metrics.
Foster a culture of production discipline, guiding junior engineers in maintaining operational runbooks and reliable deployment pipelines.