Senior Staff Software Development Engineer - GPU/AI/ML

Advanced Micro Devices, Inc. • Santa Clara, CA

2d • Hybrid

About The Position

At AMD, our mission is to build great products that accelerate next-generation computing experiences, from AI and data centers to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges, striving for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

This role sits at the center of that mission: making AMD the platform of choice for the most demanding AI workloads by improving how models train, align, and run on our GPUs. We're looking for a senior software engineer who combines deep systems performance work with modern AI, someone who can shape software from GPU kernels through distributed training and inference. You'll join a core team of specialists working on the latest AMD hardware and software. Your work will directly influence the ROCm ecosystem and how foundation models and agentic systems perform on AMD GPUs.

The challenge: Help train and run AI systems that make AI itself more efficient on GPUs by tuning stacks, kernels, and workflows in ways that can materially shift what's possible on our hardware.

This is a high-impact, hands-on role. You'll own hard technical problems, influence direction across teams, and mentor others as we scale AMD's AI software strategy.

Requirements

Expert-level modern C++ and design of large, performance-critical systems.
Strong grasp of GPU architecture, memory hierarchy, and kernel optimization (HIP/CUDA).
Hands-on delivery on large-scale C++/HIP/CUDA codebases, such as ROCm (rocBLAS, hipDNN, Composable Kernel, AITemplate), the CUDA ecosystem (cuBLAS, cuDNN, CUTLASS, Thrust, CUB, NCCL), and ML framework cores such as PyTorch, TensorFlow, or JAX (C++/HIP/CUDA paths).
Comfort diagnosing bottlenecks with profilers (for example, ROCm Profiler and Nsight) in multi-GPU, distributed settings.
Deep understanding of transformers, attention, and the full model lifecycle.
Hands-on work in alignment and post-training—for example, SFT, RLHF, and GRPO.
Awareness of current LLM trends, including MoE, quantization, speculative decoding, and agentic systems.
Experience optimizing post-training and inference pipelines at scale.
Substantial professional experience in software development within performance-critical environments.
Strong technical ownership and a track record of shipping complex systems.
Clear communication and influence across teams.
Bachelor's in Computer Science, Computer Engineering, Electrical Engineering, or equivalent experience.

Nice To Haves

Extensive HIP/CUDA experience optimizing deep learning and OSS LLM inference/training kernels and operators.
Master's preferred; PhD a plus.
Publications in AI/ML, GPU computing, or systems optimization are valued.
Deep familiarity with the AMD ROCm/HIP ecosystem.
Working knowledge of RTL design and Verilog/SystemVerilog for hardware–software co-design.

Responsibilities

Own the AI software stack: Establish best practices and drive performance from low-level GPU kernels to large-scale distributed systems. Use modern LLMs and agent-based tooling where it accelerates development and tuning of the ROCm ecosystem.
Accelerate foundation models and agents: Improve training, post-training, and inference for LLMs and autonomous AI workloads so AMD is the default platform for the most demanding use cases.
Co-design hardware and software: Partner on the full lifecycle—from GPU architecture input to software for new accelerators—and engage with the broader AI community to keep AMD at the forefront.