Quantiphi · Posted 3 months ago
Full-time • Mid Level
1,001-5,000 employees

We are seeking experienced Platform Engineers with expertise in MLOps and distributed systems, particularly Kubernetes, along with a strong background in scheduling multi-GPU, multi-node deep learning training and inference jobs. The role requires proficiency in Linux (Ubuntu) systems, the ability to write non-trivial shell scripts, solid experience with configuration management tools, and a working understanding of deep learning workflows.

Responsibilities:

  • Design, implement, and scale the underlying platform that supports GenAI workloads, whether real-time or batch.
  • Build and manage operational pipelines for training, fine-tuning, and deploying LLMs such as Llama, Mistral, GPT-3/4, BERT, or similar (a minimal multi-GPU training sketch follows this list).
  • Optimize GPU utilization and resource management for AI workloads, ensuring efficient scaling, low latency, and high throughput in model training and inference.
  • Design, deploy, and automate scalable, secure, and cost-effective infrastructure for training and running AI models.
  • Implement robust monitoring systems to track the performance, health, and efficiency of deployed AI models and workflows (see the monitoring sketch after this list).
  • Work closely with data scientists, machine learning engineers, and product teams to understand and support their platform requirements.
  • Ensure that platform infrastructure is secure, compliant with organizational policies, and follows best practices for managing sensitive data and AI model deployment.
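
To give a concrete flavor of the multi-GPU training work above, here is a minimal sketch using PyTorch DistributedDataParallel. The model and loss are hypothetical placeholders standing in for an actual LLM training loop:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Hypothetical model; in practice this would be an LLM.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).square().mean()  # placeholder loss
        optimizer.zero_grad()
        loss.backward()                  # gradients are all-reduced across ranks
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # e.g. launched with: torchrun --nnodes=2 --nproc-per-node=8 train.py
```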
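And for the monitoring responsibility, a sketch that exposes per-GPU utilization and memory as Prometheus metrics via pynvml and prometheus_client; the metric names and port are illustrative, not an existing standard:

```python
import time
import pynvml
from prometheus_client import Gauge, start_http_server

# Illustrative metric names; adjust to your own naming conventions.
GPU_UTIL = Gauge("gpu_utilization_percent", "GPU utilization", ["gpu"])
GPU_MEM = Gauge("gpu_memory_used_bytes", "GPU memory in use", ["gpu"])

def main():
    pynvml.nvmlInit()
    start_http_server(9400)  # Prometheus scrapes this port
    count = pynvml.nvmlDeviceGetCount()
    while True:
        for i in range(count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            GPU_UTIL.labels(gpu=str(i)).set(util.gpu)
            GPU_MEM.labels(gpu=str(i)).set(mem.used)
        time.sleep(5)

if __name__ == "__main__":
    main()
```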
Required qualifications:

  • 3+ years of experience in platform engineering, DevOps, or systems engineering, with a strong focus on machine learning and AI workloads.
  • Proven experience working with LLM workflows and GPU-based machine learning infrastructure.
  • Hands-on experience in managing distributed computing systems, training large-scale models, and deploying AI systems in cloud environments.
  • Strong knowledge of GPU architectures (e.g., NVIDIA A100, V100), multi-GPU systems, and optimization techniques for AI workloads.
  • Proficiency in Linux systems and command-line tools.
  • Strong scripting skills (Python, Bash, or similar).
  • Expertise in containerization and orchestration technologies (e.g., Docker, Kubernetes, Helm); a GPU pod scheduling sketch follows this list.
  • Experience with cloud platforms (AWS, GCP, Azure) and infrastructure-as-code tools such as Terraform/Terragrunt or similar.
  • Familiarity with machine learning frameworks (TensorFlow, PyTorch, etc.) and deep learning model deployment pipelines.
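
To make the Kubernetes expectation concrete, a minimal sketch using the official kubernetes Python client to schedule a pod that requests one GPU; the pod name, image, and namespace are hypothetical placeholders:

```python
from kubernetes import client, config

def launch_gpu_pod():
    config.load_kube_config()  # or config.load_incluster_config() inside a cluster
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="inference-worker"),  # hypothetical name
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="worker",
                    image="registry.example.com/llm-serving:latest",  # placeholder image
                    resources=client.V1ResourceRequirements(
                        # The NVIDIA device plugin exposes GPUs as a schedulable resource.
                        limits={"nvidia.com/gpu": "1"},
                    ),
                )
            ],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)

if __name__ == "__main__":
    launch_gpu_pod()
```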
Preferred qualifications:

  • Experience in building or managing machine learning platforms, specifically for generative AI models or large-scale NLP tasks.
  • Familiarity with distributed computing frameworks (e.g., Dask, MPI, PyTorch DDP) and data pipeline orchestration tools (e.g., AWS Glue, Apache Airflow); an example Airflow DAG follows this list.
  • Knowledge of AI model deployment frameworks such as TensorFlow Serving, TorchServe, vLLM, and Triton Inference Server (see the vLLM sketch after this list).
  • Good understanding of LLM inference and how to optimize it on self-managed infrastructure.
  • Understanding of AI model explainability, fairness, and ethical AI considerations.
  • Experience in automating and scaling the deployment of AI models on a global infrastructure.
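
As an example of the orchestration side, a minimal Airflow 2.x-style DAG sketching a fine-tune-then-evaluate pipeline; the DAG id and task bodies are placeholders:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def fine_tune():
    print("launch distributed fine-tuning job")  # placeholder

def evaluate():
    print("run evaluation suite")  # placeholder

with DAG(
    dag_id="llm_fine_tune_pipeline",  # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule=None,  # triggered manually or by an upstream event
    catchup=False,
) as dag:
    fine_tune_task = PythonOperator(task_id="fine_tune", python_callable=fine_tune)
    evaluate_task = PythonOperator(task_id="evaluate", python_callable=evaluate)
    fine_tune_task >> evaluate_task
```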
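And for the serving-framework familiarity, a short vLLM sketch for offline batched inference; the model name is just an example of a supported Hugging Face checkpoint:

```python
from vllm import LLM, SamplingParams

# Example model; any causal LM that vLLM supports works here.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Explain GPU memory fragmentation in one paragraph.",
    "What does tensor parallelism buy you during inference?",
]
for output in llm.generate(prompts, params):
    # Each result carries the prompt plus one or more completions.
    print(output.prompt, "->", output.outputs[0].text)
```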
What we offer:

  • Be part of a team and company that has won NVIDIA's AI Services Partner of the Year award three times in a row.
  • A strong peer-learning culture that will accelerate your growth across Applied AI and GPU computing, as well as softer skills such as technical communication.
  • Exposure to working with highly experienced AI leaders at Fortune 500 companies and innovative market disruptors.
  • Access to state-of-the-art GPU infrastructure on the cloud and on-premise.
  • Be part of the fastest-growing AI-first digital transformation and engineering company in the world.