Principal SoC Performance Architect - Microbenchmarks

Advanced Micro Devices, Inc. • Austin, TX

About The Position

At AMD, our mission is to build great products that accelerate next-generation computing experiences, from AI and data centers to PCs, gaming, and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity, and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture. We push the limits of innovation to solve the world’s most important challenges, striving for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

AMD is looking for an outstanding technical contributor to drive performance analysis, characterization, and optimization of next-generation Data Center GPU (DCGPU) platforms. This role focuses on extracting maximum performance across the full system stack (hardware, firmware, drivers, runtime, libraries, and workloads) through deep architectural understanding and data-driven methodologies. The engineer will develop and maintain microbenchmarks and system-level workloads spanning pre-silicon and post-silicon environments to enable performance validation, debug, and optimization.

Requirements

Proven experience working on highly parallel compute systems or SoCs (GPUs preferred)
Experience developing and maintaining microbenchmarks tied to architectural features
Strong exposure to performance analysis across pre-silicon and post-silicon environments
Solid understanding of GPU compute, memory systems, and interconnect architectures
Experience with profiling, tracing, and performance counter analysis
Ability to debug complex system-level performance issues across multiple layers
MS/PhD in Computer Engineering, Computer Science, or a related field
Excellent communication skills and ability to present complex performance insights clearly

Nice To Haves

10–15+ years of experience in performance engineering for GPUs, HPC systems, or highly parallel SoCs
Strong understanding of GPU architecture, parallel computing, and memory hierarchies
Experience with microbenchmark development and system-level workload analysis
Hands-on experience with performance profiling tools (rocprof, Nsight, perf, etc.)
Experience analyzing AI/HPC workloads (LLMs, training, inference, communication libraries like RCCL/NCCL)
Strong background in hardware/software co-design and performance optimization
Familiarity with pre-silicon (simulation/emulation/models) and post-silicon performance workflows
Programming expertise in C/C++ and Python; experience with GPU programming models (HIP, CUDA, OpenCL)
Strong analytical and debugging skills with a data-driven mindset
Experience working across the full software stack (compiler → runtime → kernels → system)
Exposure to performance modeling, scaling analysis, or competitive benchmarking is a plus
Bachelor’s or Master’s degree in a related discipline preferred

Responsibilities

Analyze and optimize performance of DCGPU systems across AI training, inference, and HPC workloads
Identify bottlenecks across hardware, firmware, drivers, runtime, libraries, and applications
Perform deep kernel-level and system-level profiling to understand performance behavior
Provide actionable insights to architecture, software, and design teams to improve performance
Design and develop targeted microbenchmarks to characterize GPU subsystems (compute, memory, interconnect, collectives)
Build representative system-level workloads reflecting real-world AI/HPC use cases
Ensure microbenchmarks correlate to application-level performance and architectural intent
Maintain and evolve benchmark suites across multiple GPU generations
Enable performance validation in pre-silicon environments (simulation/emulation/models)
Correlate performance data across pre-silicon models and post-silicon measurements
Develop methodologies to reuse workloads and microbenchmarks across the full lifecycle
Support bring-up and early silicon performance characterization
Work across the entire software stack: compiler, runtime, libraries, drivers, and firmware
Collaborate with ROCm / AI frameworks / kernel teams to improve performance
Analyze interactions between workload characteristics and hardware execution
Optimize key kernels (e.g., GEMMs, collectives, attention) and system-level behavior
Develop and enhance performance measurement, profiling, and analysis tools
Enable scalable, repeatable workflows for benchmarking and analysis
Build automation for performance regression tracking and reporting
Contribute to unified infrastructure spanning pre-silicon and post-silicon environments
Partner with SoC architecture, GPU IP, software, and system teams
Influence design decisions using data-driven performance insights
Collaborate with competitive analysis teams to understand gaps vs. industry platforms
Develop strong intuition and/or models for performance scaling and limits
Translate performance data into architectural feedback for future GPU designs
Support competitive benchmarking and performance projections