Platform Engineer (Full-time)

Strong Compute | San Francisco, CA

About The Position

We’re building the operating system for AI compute: seamless, workstation-style access as a single entry point into global compute, with ultra-fast data transit connecting everything. If you love high-performance computing, distributed systems, and AI infrastructure, and you have experience managing large-scale GPU clusters and storage systems, you’ll fit right in.

Requirements

  • Strong systems engineering skills with experience in distributed computing and storage for AI workloads
  • Proficiency in GPU cluster management, including NVIDIA GPUs, Slurm, and Kubernetes
  • Deep understanding of distributed training frameworks and multi-cloud architectures (AWS, GCP, Azure, and emerging GPU clouds)
  • Experience managing large-scale clusters, including team leadership, hiring, and scaling operations
  • Expertise in high-performance storage (Ceph, S3, ZFS, Lustre, and others) for massive AI datasets; a minimal storage sketch follows this list
  • Ability to optimize cluster utilization, uptime, and scheduling for cost-effective operations
  • Understanding of colocation strategies, managing AI data centers, and running HPC workloads in mixed environments
  • DevOps/MLOps experience, automating training pipelines for large-scale AI models
  • Experience working with AI/ML researchers, optimizing infrastructure for deep learning training
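
To give a flavor of the storage side, here is a minimal sketch of streaming training shards from an S3-compatible object store (Ceph RGW exposes an S3 API) using boto3. The endpoint, bucket, and prefix are hypothetical placeholders, not Strong Compute infrastructure:

    # Minimal sketch: stream training shards from an S3-compatible store
    # (e.g. Ceph RGW) without staging the full dataset on local disk.
    # Endpoint, bucket, and prefix are hypothetical placeholders.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://ceph-rgw.example.internal",  # assumed internal endpoint
    )

    def iter_shards(bucket: str, prefix: str):
        # Paginate past the 1000-key limit of a single ListObjectsV2 call.
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                yield obj["Key"]

    def stream_shard(bucket: str, key: str, chunk_size: int = 8 << 20):
        # Stream one shard in 8 MiB chunks; callers feed chunks to a decoder.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"]
        for chunk in iter(lambda: body.read(chunk_size), b""):
            yield chunk

    for key in iter_shards("training-data", "datasets/shard-"):
        for chunk in stream_shard("training-data", key):
            pass  # hand off to a tokenizer / DataLoader here

Streaming rather than staging is what keeps petabyte-scale datasets workable on nodes with limited local disk.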

Responsibilities

  • Build scalable, distributed AI infrastructure across cloud, on-prem, and colocation environments
  • Orchestrate GPUs with fault-tolerant scheduling (Slurm, Kubernetes, Ray, and other orchestration frameworks); see the scheduling sketch after this list
  • Operate supercomputing clusters and high-performance storage solutions for AI workloads
  • Build ultra-fast data pipelines for petabyte-scale AI training workloads
  • Run multi-cloud orchestration and on-prem AI data centers, making compute feel like a single, unified system
  • Automate DevOps and MLOps workflows for streamlined model training and deployment
  • Harden security and reliability for distributed computing across the public internet
  • Scale compute clusters 10-20x, from 128 to 1024+ GPUs, ensuring high uptime, reliability, and utilization
  • Optimize HPC clusters for AI training, including upgrade pathways and cost-efficiency strategies
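
As a rough illustration of the fault-tolerant scheduling above (Ray-flavored here; Slurm and Kubernetes have their own mechanisms), a task can declare its GPU requirement and a retry budget so the scheduler reschedules it after a node failure. The training step and shard count are placeholder assumptions:

    # Rough sketch: fault-tolerant GPU scheduling with Ray.
    # The training step is a placeholder; the point is the pattern of
    # declaring resources (num_gpus) plus automatic retries (max_retries).
    import ray

    ray.init()  # connects to an existing Ray cluster, or starts one locally

    @ray.remote(num_gpus=1, max_retries=3)  # re-run up to 3 times if a worker dies
    def train_shard(shard_id: int) -> int:
        # A real implementation would run a training step on the granted GPU.
        return shard_id

    # Fan out one task per shard; Ray packs tasks onto available GPUs
    # and transparently re-runs any task whose node fails mid-flight.
    results = ray.get([train_shard.remote(i) for i in range(128)])
    print(f"completed {len(results)} shards")

The same declarative pattern, resources requested up front and retries handled by the scheduler, is what makes scaling from 128 to 1024+ GPUs tractable.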

Benefits

  • Top-spec MacBook, plus a separate GPU-cluster dev environment for each engineer
  • Weekly cash bonus when you work out 3+ times a week
  • Comprehensive health benefits, including a choice of Kaiser, Aetna OAMC, and HDHP (HSA-eligible) plans for our SF-based team members
  • 20-year exercise window for options (the longest in the world)