Senior Machine Learning Ops Engineer

National Debt Relief, LLC

7d • $150,500 - $173,000 • Remote

About The Position

National Debt Relief (NDR) is seeking a Senior ML Ops Engineer to help evolve and scale our enterprise machine learning platform. This role sits within the Data Engineering organization on our existing ML Ops team and partners closely with Data Science, Analytics Engineering, and Applied AI teams to productionize machine learning workloads across the company.

Today, many of our models are deployed within Snowflake using containerized FastAPI services and Snowflake-native capabilities. As we continue to mature our ML platform strategy, this role will help design and lead the evolution toward a more flexible cloud-native architecture leveraging AWS and modern ML infrastructure patterns.

You will help own the infrastructure, orchestration, deployment, observability, and reliability of production ML systems. This includes enabling scalable model training and inference workflows, improving developer experience for Data Science teams, and establishing engineering standards for testing, CI/CD, governance, and monitoring.

The ideal candidate combines strong software engineering fundamentals with hands-on ML platform experience across cloud infrastructure, orchestration, containerization, and data systems.

Requirements

Bachelor’s degree in Computer Science, Data Engineering, or a related field (advanced degree preferred).
6+ years of experience in ML Ops, platform engineering, DevOps, or data platform engineering.
Strong experience deploying and operating production machine learning systems.
Hands-on experience with cloud infrastructure, preferably AWS.
Strong experience with Docker and containerized application deployment.
Demonstrated experience building backend services using frameworks such as FastAPI.
Strong SQL expertise and experience building production-grade dbt models and data pipelines.
Hands-on experience with Snowflake in enterprise production environments.
Experience implementing CI/CD workflows and modern software engineering best practices.
Experience with orchestration frameworks such as Dagster, Airflow, or Prefect.
Experience with pytest testing frameworks and patterns, including unit, integration, and end-to-end testing.
Experience with Bash and Unix-based environments.
Familiarity with Infrastructure-as-Code tooling such as Terraform.
Strong Python engineering skills, including API development and automation tooling.
Strong communication and collaboration skills across Data Science, Data Engineering, and Product teams.
Ability to operate independently and help define ML platform standards and architecture direction.
Exceptional written and verbal communication skills.
Punctuality and consistent, reliable attendance.

Nice To Haves

Experience deploying ML systems on Kubernetes, ECS, EKS, or other container orchestration platforms.
Experience with ML observability and experiment tracking tools such as MLflow, Arize, Evidently, WhyLabs, or Monte Carlo.
Experience designing feature stores or reusable ML data products.
Experience supporting both batch and low-latency inference workloads.
Experience in financial services, fintech, or other regulated industries.
Experience supporting Generative AI or LLM deployment workflows.
Strong software engineering fundamentals, including design patterns and maintainable architecture practices.

Responsibilities

Design, deploy, and maintain scalable ML infrastructure supporting model training, batch inference, and real-time inference workloads.
Lead the evolution of model hosting architecture from Snowflake-native services toward cloud-native infrastructure in AWS.
Build and maintain containerized model serving solutions using Docker, FastAPI, and modern deployment patterns.
Design and manage orchestration workflows for training, retraining, scoring, and inference pipelines using tools such as Dagster, Airflow, Prefect, or similar.
Partner closely with Data Science and Analytics Engineering teams to productionize ML models and improve deployment velocity.
Build and maintain scalable training and inference datasets using SQL, dbt, and Snowflake.
Implement CI/CD, Infrastructure-as-Code, testing, and deployment automation best practices across ML systems and platform infrastructure.
Establish observability and monitoring frameworks for deployed ML systems, including model performance monitoring, drift detection, data quality validation, and automated alerting.
Optimize platform reliability, scalability, governance, and operational efficiency across ML workflows and supporting infrastructure.
Document architecture, deployment standards, and operational processes to support maintainability and reproducibility.
Work proficiently with computers.
Prioritize multiple tasks and projects simultaneously.
Meet and sustain high performance expectations each month.
Work in a fast-paced, high-volume setting.
Use and navigate multiple computer systems with exceptional multi-tasking skills.
Remain calm and professional during difficult discussions.
Accept and act on constructive feedback.