Anthropic · Posted 3 months ago
$315,000 - $560,000/Yr
Full-time • Mid Level
San Francisco, CA

Pioneering the next generation of AI requires breakthrough innovations in GPU performance and systems engineering. As a GPU Performance Engineer, you'll architect and implement the foundational systems that power Claude and push the frontiers of what's possible with large language models. You'll be responsible for maximizing GPU utilization and performance at unprecedented scale, developing cutting-edge optimizations that directly enable new model capabilities and dramatically improve inference efficiency. Working at the intersection of hardware and software, you'll implement state-of-the-art techniques from custom kernel development to distributed system architectures. Your work will span the entire stack, from low-level tensor core optimizations to orchestrating thousands of GPUs in perfect synchronization. Strong candidates will have a track record of delivering transformative GPU performance improvements in production ML systems and will be excited to shape the future of AI infrastructure alongside world-class researchers and engineers.

Responsibilities:
  • Architect and implement foundational systems for AI models.
  • Maximize GPU utilization and performance at scale.
  • Develop optimizations for new model capabilities and inference efficiency.
  • Implement techniques from custom kernel development to distributed system architectures.
  • Work on low-level tensor core optimizations and GPU orchestration (a minimal orchestration sketch follows this list).
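
For illustration only (this is not Anthropic code; the script name, launch command, and tensor sizes are assumptions), the sketch below shows the basic shape of the multi-GPU coordination referenced above using PyTorch's torch.distributed over the NCCL backend: each process drives one GPU and participates in a summing all-reduce.

```python
# Hypothetical multi-GPU all-reduce sketch (illustrative; not Anthropic code).
# Launch one process per GPU, e.g.:
#   torchrun --nproc_per_node=8 allreduce_sketch.py
import os

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each GPU contributes a tensor filled with its own rank; NCCL sums them in place.
    x = torch.full((1024,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    if dist.get_rank() == 0:
        print(f"all-reduce result (first element): {x[0].item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

With a world size of N, every element of x ends up equal to 0 + 1 + ... + (N - 1), which makes it easy to sanity-check that the collective actually ran across all ranks.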

You may be a good fit if you:
  • Have deep experience with GPU programming and optimization at scale.
  • Are impact-driven and passionate about delivering measurable performance breakthroughs.
  • Can navigate complex systems spanning hardware interfaces to high-level ML frameworks.
  • Enjoy collaborative problem-solving and pair programming.
  • Want to work on state-of-the-art language models with real-world impact.
  • Care about the societal impacts of your work.
  • Thrive in ambiguous environments.

Strong candidates may have:
  • Experience with GPU kernel development: CUDA, Triton, CUTLASS, Flash Attention.
  • Familiarity with ML compilers and frameworks: PyTorch/JAX internals, torch.compile, XLA.
  • Knowledge of performance engineering: kernel fusion, memory bandwidth optimization (see the fusion sketch after this list).
  • Experience with distributed systems: NCCL, NVLink, collective communication.
  • Understanding of low-precision techniques: INT8/FP8 quantization.
  • Experience with production systems: large-scale training infrastructure.
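
To make the kernel fusion item above concrete (again an illustrative sketch, not Anthropic code; every function and parameter name here is hypothetical), the Triton kernel below fuses a bias-add with a ReLU so the intermediate tensor never makes a round trip through GPU memory:

```python
# Hypothetical fused bias-add + ReLU Triton kernel (illustrative only).
import torch
import triton
import triton.language as tl


@triton.jit
def fused_bias_relu_kernel(x_ptr, bias_ptr, out_ptr, n_elements, n_cols,
                           BLOCK_SIZE: tl.constexpr):
    # Each program instance processes one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements

    x = tl.load(x_ptr + offsets, mask=mask)
    # Row-major layout: the column index is the offset modulo the row width.
    b = tl.load(bias_ptr + (offsets % n_cols), mask=mask)

    # Fusing the bias-add and ReLU avoids materializing the intermediate
    # tensor in HBM and saves a second kernel launch.
    y = tl.maximum(x + b, 0.0)
    tl.store(out_ptr + offsets, y, mask=mask)


def fused_bias_relu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # Assumes x is a contiguous 2-D CUDA tensor and bias has shape (x.shape[1],).
    out = torch.empty_like(x)
    n_elements = x.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    fused_bias_relu_kernel[grid](x, bias, out, n_elements, x.shape[1],
                                 BLOCK_SIZE=1024)
    return out
```

Relative to launching separate bias-add and ReLU kernels, the fused version roughly halves the HBM traffic for this step, which is the kind of memory-bandwidth reasoning the bullet refers to.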

Benefits:
  • Competitive compensation and benefits.
  • Optional equity donation matching.
  • Generous vacation and parental leave.
  • Flexible working hours.
  • Lovely office space for collaboration.