Machine Learning Operations Engineer

K1X

Remote

About The Position

We are K1X. Our platform powers a modern, all-digital K-1 experience, replacing legacy workflows with scalable software and AI-driven automation. As we expand our machine learning capabilities, we are investing in a robust ML platform that supports production-grade model development, deployment, and monitoring across our products.

We’re seeking an experienced Machine Learning Operations (MLOps) Engineer to join our team and build the infrastructure that powers AI and machine learning at K1X. This is a hands-on role focused on designing scalable systems, pipelines, and tooling that let our Machine Learning Engineers efficiently train, deploy, and operate models in production. You’ll work at the intersection of software engineering, DevOps, and machine learning, owning the reliability, scalability, and performance of our ML platform.

Requirements

Bachelor’s or Master’s degree in Computer Science, Engineering, or equivalent experience.
5+ years of experience in software engineering, DevOps, or MLOps roles.
Strong proficiency in Python and experience building production-grade systems.
Hands-on experience with Docker, Kubernetes, and distributed systems.
Experience building and maintaining CI/CD pipelines.
Familiarity with ML lifecycle tools such as MLflow.
Experience working with cloud-based data platforms such as Snowflake.
Strong understanding of system design, APIs, and microservices architectures.
Proven debugging and troubleshooting ability across distributed systems.

Nice To Haves

Experience managing inference infrastructure such as NVIDIA Triton Inference Server.
Experience building large-scale training infrastructure including GPU workloads and distributed training.
Familiarity with feature stores, data versioning, and experiment tracking systems.
Experience supporting NLP or document processing pipelines.
Exposure to observability tools such as Prometheus and Grafana.
Experience working in SaaS environments with high availability, productivity, and performance requirements.
A strong bias toward automation, scalability, and continuous improvement.
A collaborative mindset and ability to work cross-functionally with engineering and data teams.

Responsibilities

Design and build scalable ML infrastructure to support model training, evaluation, and deployment.
Develop and maintain containerized environments using Docker and Kubernetes.
Build and manage distributed training pipelines and orchestration workflows.
Implement and maintain ML lifecycle tooling such as MLflow for experiment tracking and reproducibility.
Own production inference systems, including NVIDIA Triton Inference Server.
Design and operate low-latency, high-availability model serving architectures.
Implement CI/CD pipelines for ML deployment, including model versioning and rollback strategies.
Build and maintain data pipelines integrated with Snowflake and related data systems.
Implement monitoring, logging, and alerting that cover model performance, data and model drift, and system health.
Partner with ML Engineers to improve developer experience and accelerate delivery.