Director of Engineering, Training Platform

CoreWeave | Sunnyvale, CA
77d | $206,000 - $303,000 | Hybrid

About The Position

CoreWeave is looking for a Director of Engineering to own and scale our next-generation Large Scale Training Platform. In this highly technical, strategic role, you will lead a world-class engineering organization to design, build, and operate the fastest, most cost-efficient, and most reliable GPU training services in the industry. Your charter spans everything from distributed training frameworks (e.g., Megatron-LM, DeepSpeed, PyTorch FSDP, Horovod) and large-scale data pipelines to checkpointing, fault tolerance, and developer-friendly APIs - all delivered on CoreWeave's unique accelerated-compute infrastructure.

Requirements

  • 10+ years building large-scale distributed systems or HPC/cloud services, with 5+ years leading engineering teams.
  • Proven success delivering mission-critical distributed training platforms or large-scale ML pipelines.
  • Deep understanding of GPU/CPU resource allocation, NUMA-aware scheduling, interconnect topologies (NVLink, InfiniBand), and large-scale data handling.
  • Experience with data, model, tensor, and pipeline parallelism, and advanced optimizer techniques for training massive models.
  • Expertise in Kubernetes, service meshes, and CI/CD pipelines for ML workloads; familiarity with Slurm, Ray, or similar orchestration systems is a plus.
  • Hands-on experience with PyTorch, DeepSpeed, Megatron-LM, or other large-scale training frameworks.
  • Background in scaling pretraining of LLMs or multimodal models to thousands of GPUs.
  • Excellent communicator capable of translating complex engineering trade-offs into clear business outcomes.
  • Bachelor's or Master's degree in CS, EE, or related field (or equivalent practical experience).

Nice To Haves

  • Experience operating multi-region training clusters for hyperscalers or large AI labs.
  • Familiarity with open-source ML training frameworks (e.g., DeepSpeed, FSDP, Alpa, MosaicML).
  • Familiarity with observability stacks (Prometheus, Grafana, OpenTelemetry) and training-specific telemetry.

Responsibilities

  • Define and continuously refine the end-to-end Training Platform roadmap, prioritizing scalability, throughput, and cost optimization for training the largest AI models.
  • Set technical standards for distributed training frameworks, model/data parallelism strategies, mixed-precision techniques (FP8, BF16), and advanced checkpointing/restart mechanisms.
  • Design and implement a Kubernetes-native training control plane capable of managing multi-thousand GPU jobs with high reliability and efficiency.
  • Build solutions for elastic distributed training, including job-aware autoscaling, dynamic GPU allocation, and multi-node communication optimizations using NCCL, SHARP, and RDMA.
  • Integrate data pipeline optimizations, such as caching layers, streaming datasets, and sharded data loading to eliminate I/O bottlenecks.
  • Implement state-of-the-art distributed training optimizations including tensor/sequence parallelism, pipeline parallelism, optimizer sharding, activation checkpointing, and gradient compression.
  • Establish SLO/SLA dashboards, real-time observability, and self-healing mechanisms for thousands of concurrent training jobs across multiple regions.
  • Develop cost-performance trade-off tooling that enables customers to seamlessly select hardware configurations that minimize time-to-train while optimizing for cost.
  • Build robust fault tolerance and automatic recovery workflows to handle large-scale preemption, checkpoint failures, or data pipeline interruptions.
  • Hire, mentor, and grow a diverse team of engineers and managers passionate about building the world's leading AI training platform.
  • Foster a customer-obsessed, metrics-driven engineering culture with crisp design reviews, deep technical rigor, and blameless post-mortems.
  • Partner closely with Product, Orchestration, Networking, and Storage teams to deliver a unified CoreWeave experience.
  • Work directly with flagship customers training frontier models to gather feedback, optimize workflows, and shape the platform roadmap.

Benefits

  • Medical, dental, and vision insurance - 100% paid for by CoreWeave
  • Company-paid Life Insurance
  • Voluntary supplemental life insurance
  • Short and long-term disability insurance
  • Flexible Spending Account
  • Health Savings Account
  • Tuition Reimbursement
  • Ability to Participate in Employee Stock Purchase Program (ESPP)
  • Mental Wellness Benefits through Spring Health
  • Family-Forming support provided by Carrot
  • Paid Parental Leave
  • Flexible, full-service childcare support with Kinside
  • 401(k) with a generous employer match
  • Flexible PTO
  • Catered lunch each day in our office and data center locations
  • A casual work environment
  • A work culture focused on innovative disruption

What This Job Offers

Job Type

Full-time

Career Level

Senior

Industry

Professional, Scientific, and Technical Services

Education Level

Bachelor's degree
