ML Ops Engineer

Circadia Health • El Segundo, CA

About The Position

As an ML Ops Engineer at Circadia Health, you will own the infrastructure and operational lifecycle of the machine learning systems that power our clinical monitoring platform. You will build and maintain the production ML pipelines, deployment infrastructure, and monitoring systems that enable Circadia's predictive models to identify early signs of clinical deterioration. Reporting to the Principal ML Engineer, you will work across ML, backend, data, and clinical teams to ensure models are reliably trained, versioned, deployed, and monitored in both cloud and edge environments. You will be a key driver in elevating Circadia's ML practice – from reproducibility and experiment tracking to CI/CD for models and operational observability. This is a high-ownership role at a lean company where production reliability, rapid iteration, and pragmatic engineering are essential. Your work will directly impact patient outcomes by ensuring our predictive models are always running, always accurate, and always improving.

Requirements

4+ years of experience in MLOps, ML Engineering, DevOps, or a closely related infrastructure role.
Strong proficiency in Python for ML pipeline development, tooling, and automation.
Hands-on experience with ML pipeline orchestration tools, particularly Apache Airflow.
Experience with model registries and experiment tracking platforms (MLflow preferred).
Experience deploying and operating ML workloads on AWS (Batch, EC2, S3, IAM, CloudWatch).
Solid understanding of the ML lifecycle: training, evaluation, deployment, monitoring, and retraining.
Experience with containerisation (Docker) and infrastructure-as-code.
Proficiency with Git and version control workflows.
Familiarity with SQL and data warehousing platforms (Snowflake preferred).
Experience implementing monitoring, logging, and alerting for production systems.
Strong debugging and incident response skills for complex distributed systems.

Nice To Haves

Experience deploying models to edge or embedded devices.
Background in healthcare, medical devices, or clinical data systems.
Familiarity with model serving frameworks (e.g., TorchServe, TF Serving, Triton, or custom solutions).
Experience with CI/CD systems for ML (e.g., GitHub Actions, Jenkins, or similar).
Experience with data versioning tools (e.g., DVC, LakeFS, or similar).
Experience supporting data science or ML research teams in a production context.
Exposure to HIPAA compliance and healthcare security best practices.
Experience with distributed compute frameworks (e.g., Apache Spark, Dask) for large-scale data processing.
Experience with streaming or real-time inference architectures.

Responsibilities

Own and extend Circadia’s ML pipeline orchestration using Apache Airflow, including training, evaluation, and deployment workflows.
Build and maintain automated pipelines for model retraining, validation, and promotion across development, staging, and production environments.
Implement pipeline monitoring, alerting, and failure recovery to eliminate silent failures and ensure operational reliability.
Design pipeline architectures that support rapid experimentation while enforcing production-grade reproducibility.
Deploy and manage ML models on AWS infrastructure (e.g., AWS Batch for batch inference workloads).
Support deployment of models to edge devices, including Circadia’s clinical monitoring hardware, working with firmware and embedded engineering teams as needed.
Manage model versioning, promotion, and rollback workflows through the MLflow model registry.
Evaluate and implement strategies for safe model rollouts (e.g., shadow deployments, canary releases) as the platform matures.
Maintain and improve the MLflow-based experiment tracking and model registry infrastructure.
Establish conventions for experiment logging, artifact storage, model metadata, and lineage tracking.
Enable ML engineers to move seamlessly from experimentation to production deployment with minimal friction.
Implement and maintain training data versioning and dataset management practices to ensure reproducibility of model training runs.
Track dataset lineage, labeling provenance, and feature dependencies alongside model versions.
Collaborate with ML engineers and data engineers to formalise dataset release and validation workflows.
Build monitoring systems for model performance in production, including data drift detection, prediction quality tracking, and alerting on degradation.
Implement operational dashboards for pipeline health, compute utilisation, and deployment status.
Collaborate with data engineering to ensure upstream data quality and pipeline reliability for ML feature inputs.
Develop incident response procedures and runbooks for ML system failures.
Manage and optimise AWS compute resources (Batch, EC2, or similar) used for model training and inference.
Design infrastructure-as-code solutions for reproducible ML environments.
Drive cost optimisation across ML compute, storage, and data transfer.
Support Snowflake integrations for feature generation and training data pipelines.
Introduce and champion ML engineering best practices including CI/CD for models, automated testing for ML pipelines, and reproducible training workflows.
Build internal tooling and templates that accelerate the ML development-to-production cycle.
Document operational processes, architecture decisions, and onboarding materials for the ML platform.
Participate in architecture discussions and technical planning to ensure ML systems scale with Circadia’s growth.
Ensure all ML pipelines and infrastructure meet healthcare security and privacy requirements, including HIPAA and SOC 2.
Apply best practices for handling Protected Health Information (PHI) in training data, model artifacts, and inference outputs.
Maintain audit trails for model decisions, data access, and deployment history.