About The Position

  • ML Pipeline Development & Automation: Design, build, and maintain robust, scalable end-to-end ML pipelines for data ingestion, preprocessing, model training, validation, and deployment.
  • CI/CD for ML: Implement and manage Continuous Integration/Continuous Delivery (CI/CD) pipelines tailored for machine learning workflows, ensuring automated testing, versioning, and deployment of ML artifacts.
  • Experiment Tracking & Model Management: Use MLflow extensively for experiment tracking, reproducible runs, model versioning, and a centralized model registry.
  • Hyperparameter Optimization: Leverage Ray Tune for efficient, distributed hyperparameter optimization to improve model performance and accelerate experimentation.
  • Containerization & Orchestration: Package ML models and their dependencies with Docker, and deploy and manage them on Kubernetes clusters.
  • Data Platform Integration: Integrate with and optimize existing data platforms, including Apache Iceberg, Apache Spark, and Apache Flink, for efficient data processing and feature engineering.
  • Data Storage & Streaming: Work with PostgreSQL, Oracle, and MongoDB for diverse data storage needs, and use Kafka for real-time data streaming across ML applications.
  • Monitoring & Observability: Implement comprehensive monitoring, logging, and alerting (e.g., Prometheus, Grafana) for ML models in production, tracking model performance, data drift, and infrastructure health to ensure reliability and enable automated retraining or rollback.
  • Scripting & Automation: Develop automation scripts and tools in Python and Bash/Go to streamline MLOps processes and integrate systems.
  • Collaboration: Act as the link between data scientists, ML engineers, and infrastructure teams, facilitating clear communication and ensuring ML solutions are production-ready.
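The monitoring duties above include tracking data drift. As a dependency-free illustration of one common drift score, here is a sketch of the Population Stability Index (PSI); the function and thresholds are generic conventions for the example, not part of this team's stack:

```python
import math
from collections import Counter

def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples.

    A common convention: < 0.1 is treated as stable, 0.1-0.25 as
    moderate drift, > 0.25 as significant drift.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def hist(sample):
        counts = Counter(min(int((x - lo) / width), bins - 1) for x in sample)
        n = len(sample)
        # A small floor keeps empty bins from producing log(0).
        return [max(counts.get(b, 0) / n, 1e-6) for b in range(bins)]

    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Identical distributions score ~0; a shifted one scores much higher.
train = [i / 100 for i in range(1000)]
print(round(psi(train, train), 6))                 # ~0.0
print(psi(train, [x + 5 for x in train]) > 0.25)   # strong drift
```

In production this kind of score would feed the Prometheus/Grafana alerting mentioned above, triggering retraining or rollback when it crosses a threshold.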

Requirements

  • 3-5 years of hands-on experience in an MLOps, DevOps, or Machine Learning Engineering role, with a proven track record of deploying and managing ML models in production environments.
  • Expert-level proficiency in Python for ML development, scripting, and automation.
  • Demonstrated hands-on experience with Ray Tune for hyperparameter optimization and MLflow for experiment tracking and model management.
  • Strong experience with Docker and Kubernetes (including Helm).
  • Experience implementing CI/CD practices for software and/or ML pipelines.
  • Familiarity or hands-on experience with Apache Spark, Apache Iceberg, Apache Flink, and Kafka.
  • Experience with PostgreSQL, Oracle, and MongoDB.
  • Experience with Apache Airflow.
  • Experience with HashiCorp Terraform.
  • Proficiency in Linux/Unix environments.
  • Experience with cloud platforms (AWS, Azure, GCP) and managing cloud-native ML infrastructure.
  • Knowledge of deep learning frameworks such as TensorFlow or PyTorch.
  • Experience with generative AI technologies (e.g., LLMs, prompt engineering, RAG pipelines).
  • Understanding of distributed computing and big data processing techniques.
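
For candidates less familiar with Ray Tune: it schedules and distributes exactly the kind of search loop sketched below. This is a dependency-free illustration of random search over a hyperparameter space; the objective and search space are made up for the example, and in practice Ray Tune would define the search once and fan trials out across a cluster:

```python
import random

def train_model(config):
    """Stand-in objective: a fake validation loss as a function of two
    hyperparameters (a real pipeline would train and evaluate a model)."""
    lr, layers = config["lr"], config["layers"]
    return (lr - 0.01) ** 2 + 0.001 * abs(layers - 4)

search_space = {
    "lr": lambda: 10 ** random.uniform(-4, -1),   # log-uniform learning rate
    "layers": lambda: random.randint(1, 8),
}

def random_search(objective, space, num_samples=50, seed=0):
    random.seed(seed)
    best_cfg, best_loss = None, float("inf")
    for _ in range(num_samples):
        cfg = {name: sample() for name, sample in space.items()}
        loss = objective(cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss

best_cfg, best_loss = random_search(train_model, search_space)
print(best_cfg, best_loss)
```

The value of a tool like Ray Tune over this loop is parallel trial execution, early stopping of unpromising trials, and smarter search algorithms than uniform sampling.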

© 2024 Teal Labs, Inc