Performance Engineering

Advanced Micro Devices, Inc · Gothenburg, NE
3d · Hybrid

About The Position

As a Performance Engineer, you will spearhead the next generation of AI infrastructure by defining GPU architecture specifications that enable massive model training at scale. Your expertise will drive 2-3x performance gains in both training and inference pipelines through innovative system design and optimization. You will champion the adoption of cutting-edge techniques across the engineering organization, from efficient attention mechanisms to advanced parallelization strategies. By establishing comprehensive best practices for distributed ML systems, you will create a framework that enables seamless scaling from single-GPU to thousand-GPU deployments.

Requirements

  • Extensive senior-level experience optimizing large-scale ML systems and GPU architectures
  • Deep expertise in CUDA programming, GPU memory hierarchies, and hardware-specific optimizations
  • Proven track record architecting large-scale distributed training systems
  • Expert knowledge of transformer architectures, attention mechanisms, and model parallelism techniques
  • PyTorch, CUDA, TensorRT, OpenAI Triton
  • Distributed systems: Ray, Megatron-LM
  • Performance analysis tools: Nsight Compute, nvprof, PyTorch Profiler
  • KV cache optimization, Flash Attention, Mixture of Experts
  • High-speed networking: InfiniBand, RDMA, NVLink
  • Bachelor's, MS, or PhD in Computer Science/Engineering, or equivalent industry experience

Responsibilities

  • Lead performance modeling and optimization for multi-trillion-parameter LLM training and inference, including dense and Mixture of Experts (MoE) models with multiple modalities (text, vision, speech)
  • Model and optimize novel parallelization strategies across tensor, pipeline, context, expert, and data parallel dimensions
  • Architect memory-efficient training systems using techniques such as structured pruning, quantization (MX formats), continuous batching/chunked prefill, and speculative decoding
  • Incorporate and extend SOTA models such as GPT-4, reasoning models (DeepSeek-R1), and multi-modal architectures
  • Collaborate with internal and external stakeholders and ML researchers to disseminate results and iterate at a rapid pace

Benefits

  • AMD benefits at a glance.