Quantiphi · Posted 3 months ago
Full-time • Mid Level
1,001-5,000 employees

We are seeking experienced Platform Engineers with expertise in MLOps and distributed systems, particularly Kubernetes, along with a strong background in scheduling multi-GPU, multi-node deep learning training and inference jobs. The role requires proficiency in Linux (Ubuntu) systems, the ability to write non-trivial shell scripts, solid experience with configuration management tools, and a working understanding of deep learning workflows.

Responsibilities:

  • Design, implement, and scale the underlying platform that supports GenAI workloads, whether real-time or batch.
  • Build and manage operational pipelines for training, fine-tuning, and deploying LLMs such as Llama, Mistral, GPT-3/4, BERT, or similar (a minimal multi-GPU training sketch follows this list).
  • Optimize GPU utilization and resource management for AI workloads, ensuring efficient scaling, low latency, and high throughput in model training and inference.
  • Design, deploy, and automate scalable, secure, and cost-effective infrastructure for training and running AI models.
  • Implement robust monitoring systems to track the performance, health, and efficiency of deployed AI models and workflows (see the monitoring sketch after this list).
  • Work closely with data scientists, machine learning engineers, and product teams to understand and support their platform requirements.
  • Ensure that platform infrastructure is secure, compliant with organizational policies, and follows best practices for managing sensitive data and AI model deployment.
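
To give a concrete flavor of the multi-GPU training work above, here is a minimal sketch using PyTorch DistributedDataParallel. The model and loss are hypothetical placeholders standing in for an actual LLM training loop:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Hypothetical model; in practice this would be an LLM.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).square().mean()  # placeholder loss
        optimizer.zero_grad()
        loss.backward()                  # gradients are all-reduced across ranks
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # e.g. launched with: torchrun --nnodes=2 --nproc-per-node=8 train.py
```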
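And for the monitoring responsibility, a sketch that exposes per-GPU utilization and memory as Prometheus metrics via pynvml and prometheus_client; the metric names and port are illustrative, not an existing standard:

```python
import time
import pynvml
from prometheus_client import Gauge, start_http_server

# Illustrative metric names; adjust to your own naming conventions.
GPU_UTIL = Gauge("gpu_utilization_percent", "GPU utilization", ["gpu"])
GPU_MEM = Gauge("gpu_memory_used_bytes", "GPU memory in use", ["gpu"])

def main():
    pynvml.nvmlInit()
    start_http_server(9400)  # Prometheus scrapes this port
    count = pynvml.nvmlDeviceGetCount()
    while True:
        for i in range(count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            GPU_UTIL.labels(gpu=str(i)).set(util.gpu)
            GPU_MEM.labels(gpu=str(i)).set(mem.used)
        time.sleep(5)

if __name__ == "__main__":
    main()
```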
Required qualifications:

  • 3+ years of experience in platform engineering, DevOps, or systems engineering, with a strong focus on machine learning and AI workloads.
  • Proven experience working with LLM workflows and GPU-based machine learning infrastructure.
  • Hands-on experience in managing distributed computing systems, training large-scale models, and deploying AI systems in cloud environments.
  • Strong knowledge of GPU architectures (e.g., NVIDIA A100, V100), multi-GPU systems, and optimization techniques for AI workloads.
  • Proficiency in Linux systems and command-line tools.
  • Strong scripting skills (Python, Bash, or similar).
  • Expertise in containerization and orchestration technologies (e.g., Docker, Kubernetes, Helm); a GPU pod scheduling sketch follows this list.
  • Experience with cloud platforms (AWS, GCP, Azure) and infrastructure-as-code tools such as Terraform/Terragrunt or similar.
  • Familiarity with machine learning frameworks (TensorFlow, PyTorch, etc.) and deep learning model deployment pipelines.
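
To make the Kubernetes expectation concrete, a minimal sketch using the official kubernetes Python client to schedule a pod that requests one GPU; the pod name, image, and namespace are hypothetical placeholders:

```python
from kubernetes import client, config

def launch_gpu_pod():
    config.load_kube_config()  # or config.load_incluster_config() inside a cluster
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="inference-worker"),  # hypothetical name
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="worker",
                    image="registry.example.com/llm-serving:latest",  # placeholder image
                    resources=client.V1ResourceRequirements(
                        # The NVIDIA device plugin exposes GPUs as a schedulable resource.
                        limits={"nvidia.com/gpu": "1"},
                    ),
                )
            ],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)

if __name__ == "__main__":
    launch_gpu_pod()
```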
Preferred qualifications:

  • Experience in building or managing machine learning platforms, specifically for generative AI models or large-scale NLP tasks.
  • Familiarity with distributed computing frameworks (e.g., Dask, MPI, PyTorch DDP) and data pipeline orchestration tools (e.g., AWS Glue, Apache Airflow); an example Airflow DAG follows this list.
  • Knowledge of AI model deployment frameworks such as TensorFlow Serving, TorchServe, vLLM, and Triton Inference Server (see the vLLM sketch after this list).
  • Good understanding of LLM inference and how to optimize it on self-managed infrastructure.
  • Understanding of AI model explainability, fairness, and ethical AI considerations.
  • Experience in automating and scaling the deployment of AI models on a global infrastructure.
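
As an example of the orchestration side, a minimal Airflow 2.x-style DAG sketching a fine-tune-then-evaluate pipeline; the DAG id and task bodies are placeholders:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def fine_tune():
    print("launch distributed fine-tuning job")  # placeholder

def evaluate():
    print("run evaluation suite")  # placeholder

with DAG(
    dag_id="llm_fine_tune_pipeline",  # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule=None,  # triggered manually or by an upstream event
    catchup=False,
) as dag:
    fine_tune_task = PythonOperator(task_id="fine_tune", python_callable=fine_tune)
    evaluate_task = PythonOperator(task_id="evaluate", python_callable=evaluate)
    fine_tune_task >> evaluate_task
```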
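And for the serving-framework familiarity, a short vLLM sketch for offline batched inference; the model name is just an example of a supported Hugging Face checkpoint:

```python
from vllm import LLM, SamplingParams

# Example model; any causal LM that vLLM supports works here.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Explain GPU memory fragmentation in one paragraph.",
    "What does tensor parallelism buy you during inference?",
]
for output in llm.generate(prompts, params):
    # Each result carries the prompt plus one or more completions.
    print(output.prompt, "->", output.outputs[0].text)
```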
What we offer:

  • Be part of a team and company that has won NVIDIA's AI Services Partner of the Year award three times in a row.
  • A strong peer-learning culture that will accelerate your growth across Applied AI and GPU computing, as well as softer skills such as technical communication.
  • Exposure to working with highly experienced AI leaders at Fortune 500 companies and innovative market disruptors.
  • Access to state-of-the-art GPU infrastructure on the cloud and on-premise.
  • Be part of the fastest-growing AI-first digital transformation and engineering company in the world.