Machine Learning Operations Engineer

Augment Professional Services · Houston, TX
Posted 1d ago · $65 - $82 · Hybrid

About The Position

The MLOps Engineer is responsible for designing, deploying, and maintaining scalable machine learning solutions in production across multi-cloud and data platform environments. This role plays a critical part in operationalizing machine learning models by building robust pipelines, enabling automation, and ensuring reliability, performance, and governance across AWS, Microsoft Azure, and Snowflake ecosystems.

Working closely with data scientists, data engineers, and cloud platform teams, the MLOps Engineer bridges the gap between model development and production deployment. This position focuses on creating secure, scalable, and cost-efficient ML platforms that support end-to-end lifecycle management, including model training, deployment, monitoring, and continuous improvement. The ideal candidate brings strong experience in cloud-native architectures, CI/CD automation, and production-grade ML systems, with hands-on expertise in AWS, Azure, and Snowflake environments.

Requirements

  • Minimum of 5 years of experience in MLOps, machine learning engineering, platform engineering, or DevOps
  • Hands-on experience with AWS, Microsoft Azure, and Snowflake in building or supporting production ML/data platforms
  • Strong programming skills in Python and SQL
  • Experience deploying and managing machine learning models in production environments
  • Experience with cloud ML services such as AWS SageMaker and Azure Machine Learning
  • Experience building and integrating data pipelines with Snowflake
  • Proficiency with CI/CD pipelines, infrastructure automation, and model versioning
  • Experience with containerization and orchestration tools such as Docker and Kubernetes
  • Experience with workflow orchestration tools such as Apache Airflow, Azure Data Factory, or similar
  • Familiarity with monitoring, logging, alerting, and observability frameworks
  • Strong understanding of data engineering concepts, APIs, and distributed systems
  • Proven troubleshooting, communication, and cross-functional collaboration skills

Nice To Haves

  • Master’s or PhD in Computer Science, Computer Engineering, or a related technical field
  • Experience with Snowflake Cortex AI, Snowpark, or machine learning workloads within Snowflake
  • Experience with generative AI platforms such as AWS Bedrock or Azure OpenAI
  • Experience building real-time inference systems, event-driven architectures, or serverless pipelines
  • Familiarity with feature stores, vector databases, and retrieval-augmented generation (RAG) systems
  • Experience with infrastructure-as-code tools such as Terraform, AWS CloudFormation, or Azure Resource Manager
  • Understanding of security, compliance, and governance frameworks in regulated environments
  • Experience implementing A/B testing, shadow deployments, and advanced model release strategies

Responsibilities

  • Design and implement end-to-end machine learning pipelines including data ingestion, feature engineering, model training, validation, deployment, and monitoring
  • Deploy and manage machine learning models in production across AWS, Azure, and Snowflake platforms
  • Build and maintain batch and real-time inference pipelines using cloud-native and platform-native services
  • Develop and automate CI/CD pipelines for model packaging, testing, deployment, and rollback
  • Integrate ML workflows with services such as AWS SageMaker, AWS Lambda, Azure Machine Learning, Azure Data Factory, and Snowflake
  • Build and manage orchestration workflows using tools such as Apache Airflow, Azure Data Factory, or similar platforms
  • Implement model lifecycle management practices including experiment tracking, model registry, and governance frameworks
  • Monitor model performance, including accuracy, drift, latency, throughput, and pipeline reliability
  • Establish and manage deployment strategies such as canary releases, blue-green deployments, shadow testing, and rollback mechanisms
  • Collaborate cross-functionally to transition machine learning models from research to production environments
  • Ensure security, compliance, traceability, and access controls across data and ML systems
  • Optimize performance, scalability, and cost efficiency across cloud and data platforms
  • Document architecture designs, deployment standards, and operational procedures