About The Position

Our team's mission is to make PyTorch models high-performing, deterministic, and stable via a robust foundational framework that supports the latest hardware, without sacrificing the flexibility and ease of use of PyTorch. We are seeking a PhD Research Intern to work on next-generation Mixture-of-Experts (MoE) systems for PyTorch, focused on substantially improving end-to-end training and inference throughput on modern accelerators (e.g., NVIDIA Hopper and beyond). This internship will explore novel combinations of communication-aware distributed training and kernel- and IO-aware execution optimizations (inspired by SonicMoE and related work) to unlock new performance regimes for large-scale sparse models; a toy MoE sketch appears at the end of this section for context. The project spans systems research, GPU kernel optimization, and framework optimization, with opportunities for open-source contributions and publication.

Team scope:

  • Improve PyTorch out-of-the-box performance on GPUs, CPUs, and accelerators
  • Vertical performance optimization of models for training and inference
  • Model optimization techniques, such as quantization, for improved efficiency
  • Improve the stability and extensibility of the PyTorch framework

Our internships are twelve (12) to twenty-four (24) weeks long, and we have various start dates throughout the year.
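For context on the MoE terminology above, the following is a minimal, illustrative sketch of a token-choice top-k MoE feed-forward block in plain PyTorch. The expert count, shapes, and naive per-expert dispatch loop are assumptions for illustration only; production systems replace that loop with grouped GEMMs and fused dispatch kernels of the kind this internship targets.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyMoE(nn.Module):
        """Toy token-choice top-k MoE block (illustrative, not production code)."""
        def __init__(self, d_model=256, d_ff=512, num_experts=8, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, num_experts, bias=False)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(num_experts)
            ])

        def forward(self, x):  # x: [tokens, d_model]
            probs = F.softmax(self.router(x), dim=-1)          # [tokens, num_experts]
            weights, indices = torch.topk(probs, self.top_k, dim=-1)
            out = torch.zeros_like(x)
            # Naive dispatch: loop over experts and gather the tokens routed to each.
            for e, expert in enumerate(self.experts):
                token_idx, slot = (indices == e).nonzero(as_tuple=True)
                if token_idx.numel() == 0:
                    continue
                out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
            return out

    x = torch.randn(64, 256)
    print(TinyMoE()(x).shape)  # torch.Size([64, 256])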

Requirements

  • Currently has, or is in the process of obtaining, a PhD degree in Computer Science or a related STEM field
  • Deep knowledge of transformer architectures, including attention, feed-forward layers, and Mixture-of-Experts (MoE) models
  • Strong background in ML systems research, with domain knowledge in MoE efficiency, such as routing, expert parallelism, communication overheads, and kernel-level optimizations
  • Hands-on experience writing GPU kernels using CUDA and/or CuTe DSL
  • Working knowledge of quantization techniques and their impact on performance and accuracy (a small numeric sketch follows this list)
  • Must obtain work authorization in the country of employment at the time of hire, and maintain ongoing work authorization during employment
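The small numeric sketch referenced in the quantization bullet above: it round-trips a weight tensor through symmetric per-tensor int8 and through PyTorch's float8_e4m3fn dtype (available in recent PyTorch releases) and reports the relative reconstruction error. The tensor shape and quantization scheme are illustrative assumptions, not a prescribed recipe.

    import torch

    w = torch.randn(4096, 4096)

    # Symmetric per-tensor int8: one scale maps the max magnitude to 127.
    scale = w.abs().max() / 127.0
    w_int8 = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    w_deq = w_int8.float() * scale

    # FP8 (e4m3) round-trip for comparison.
    w_fp8 = w.to(torch.float8_e4m3fn).float()

    for name, approx in [("int8", w_deq), ("fp8 e4m3", w_fp8)]:
        rel_err = (w - approx).norm() / w.norm()
        print(f"{name:9s} relative error: {rel_err:.4e}")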

Nice To Haves

  • Experience working on ML compiler stacks, especially the PyTorch 2 (PT2) stack
  • Familiarity with distributed training and inference, such as data parallelism and collective communication
  • Ability to independently design experiments, analyze complex performance tradeoffs, and clearly communicate technical findings in writing and presentations
  • Intent to return to degree program after the completion of the internship/co-op
  • Proven track record of achieving significant results, as demonstrated by grants, fellowships, or patents, as well as first-authored publications at leading workshops or conferences such as NeurIPS, MLSys, ASPLOS, PLDI, CGO, PACT, ICML, or similar venues
  • Experience working and communicating cross-functionally in a team environment

Responsibilities

  • Design and evaluate communication-aware, kernel-aware, and quantization-aware MoE execution strategies, combining ideas such as expert placement, routing, batching, scheduling, and precision selection.
  • Develop and optimize GPU kernels and runtime components for MoE workloads, including fused kernels, grouped GEMMs, and memory-efficient forward and backward passes (see the benchmark sketch after this list).
  • Explore quantization techniques (e.g., MXFP8, FP8) in the context of MoE, balancing accuracy, performance, and hardware efficiency.
  • Build performance models and benchmarks to analyze compute, memory, communication, and quantization overheads across different sparsity regimes.
  • Run experiments on single-node and multi-node GPU systems.
  • Collaborate with the open-source community to gather feedback and iterate on the project.
  • Contribute to PyTorch (Core, Compile, Distributed) within the scope of the project.
  • Improve PyTorch performance in general.
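The benchmark sketch referenced in the kernel bullet above: it contrasts a Python loop of per-expert GEMMs with a single batched GEMM (torch.bmm) over the same stacked weights. The expert count, shapes, and uniform per-expert token counts are assumptions; real MoE batches are ragged, which is exactly what dedicated grouped-GEMM kernels address.

    import time
    import torch

    def bench(fn, iters=20):
        # Simple wall-clock timer; on GPU, synchronize around the timed region.
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            fn()
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        return (time.perf_counter() - t0) / iters

    device = "cuda" if torch.cuda.is_available() else "cpu"
    E, T, D, F = 8, 1024, 256, 512               # experts, tokens per expert, model/FFN dims
    x = torch.randn(E, T, D, device=device)      # tokens already grouped by expert
    w = torch.randn(E, D, F, device=device)      # one weight matrix per expert

    looped = lambda: [x[e] @ w[e] for e in range(E)]  # E separate GEMM launches
    grouped = lambda: torch.bmm(x, w)                 # one batched GEMM launch

    print(f"looped : {bench(looped) * 1e3:.3f} ms")
    print(f"grouped: {bench(grouped) * 1e3:.3f} ms")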