Platform Engineer (Full-time)

Strong Compute | San Francisco, CA

About The Position

We’re building the operating system for AI compute: seamless, workstation-style access as a single entry point into global compute, with ultra-fast data transit connecting everything. If you love high-performance computing, distributed systems, and AI infrastructure, and you have experience managing large-scale GPU clusters and storage systems, you’ll fit right in.

Requirements

  • Strong systems engineering skills with experience in distributed computing and storage for AI workloads
  • Proficiency in GPU cluster management, including NVIDIA GPUs, Slurm, and Kubernetes
  • Deep understanding of distributed training frameworks and multi-cloud architectures (AWS, GCP, Azure, and emerging GPU clouds)
  • Experience managing large-scale clusters, including team leadership, hiring, and scaling operations
  • Expertise in high-performance storage (Ceph, S3, ZFS, Lustre, and others) for massive AI datasets; a minimal storage sketch follows this list
  • Ability to optimize cluster utilization, uptime, and scheduling for cost-effective operations
  • Understanding of colocation strategies, managing AI data centers, and running HPC workloads in mixed environments
  • DevOps/MLOps experience, automating training pipelines for large-scale AI models
  • Experience working with AI/ML researchers, optimizing infrastructure for deep learning training
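
To give a flavor of the storage side, here is a minimal sketch of streaming training shards from an S3-compatible object store (Ceph RGW exposes an S3 API) using boto3. The endpoint, bucket, and prefix are hypothetical placeholders, not Strong Compute infrastructure:

    # Minimal sketch: stream training shards from an S3-compatible store
    # (e.g. Ceph RGW) without staging the full dataset on local disk.
    # Endpoint, bucket, and prefix are hypothetical placeholders.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://ceph-rgw.example.internal",  # assumed internal endpoint
    )

    def iter_shards(bucket: str, prefix: str):
        # Paginate past the 1000-key limit of a single ListObjectsV2 call.
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                yield obj["Key"]

    def stream_shard(bucket: str, key: str, chunk_size: int = 8 << 20):
        # Stream one shard in 8 MiB chunks; callers feed chunks to a decoder.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"]
        for chunk in iter(lambda: body.read(chunk_size), b""):
            yield chunk

    for key in iter_shards("training-data", "datasets/shard-"):
        for chunk in stream_shard("training-data", key):
            pass  # hand off to a tokenizer / DataLoader here

Streaming rather than staging is what keeps petabyte-scale datasets workable on nodes with limited local disk.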

Responsibilities

  • Build scalable, distributed AI infrastructure across cloud, on-prem, and colocation environments
  • Orchestrate GPUs with fault-tolerant scheduling (Slurm, Kubernetes, Ray, and other orchestration frameworks); see the scheduling sketch after this list
  • Operate supercomputing clusters and high-performance storage solutions for AI workloads
  • Build ultra-fast data pipelines for petabyte-scale AI training workloads
  • Run multi-cloud orchestration and on-prem AI data centers, making compute feel like a single, unified system
  • Automate DevOps and MLOps workflows for streamlined model training and deployment
  • Harden security and reliability for distributed computing across the public internet
  • Scale compute clusters 10-20x, from 128 to 1024+ GPUs, ensuring high uptime, reliability, and utilization
  • Optimize HPC clusters for AI training, including upgrade pathways and cost-efficiency strategies
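
As a rough illustration of the fault-tolerant scheduling above (Ray-flavored here; Slurm and Kubernetes have their own mechanisms), a task can declare its GPU requirement and a retry budget so the scheduler reschedules it after a node failure. The training step and shard count are placeholder assumptions:

    # Rough sketch: fault-tolerant GPU scheduling with Ray.
    # The training step is a placeholder; the point is the pattern of
    # declaring resources (num_gpus) plus automatic retries (max_retries).
    import ray

    ray.init()  # connects to an existing Ray cluster, or starts one locally

    @ray.remote(num_gpus=1, max_retries=3)  # re-run up to 3 times if a worker dies
    def train_shard(shard_id: int) -> int:
        # A real implementation would run a training step on the granted GPU.
        return shard_id

    # Fan out one task per shard; Ray packs tasks onto available GPUs
    # and transparently re-runs any task whose node fails mid-flight.
    results = ray.get([train_shard.remote(i) for i in range(128)])
    print(f"completed {len(results)} shards")

The same declarative pattern, resources requested up front and retries handled by the scheduler, is what makes scaling from 128 to 1024+ GPUs tractable.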

Benefits

  • Top-spec MacBook, plus a separate GPU-cluster dev environment for each engineer
  • Weekly cash bonus when you work out 3+ times a week
  • Comprehensive health benefits, including a choice of Kaiser, Aetna OAMC, and HDHP (HSA-eligible) plans for our SF-based team members
  • 20-year exercise window for options (the longest in the world)