AI Systems Performance Engineer

Broadcom • San Jose, CA

About The Position

We are seeking a highly talented and experienced Senior AI Fabric Performance Engineer to take on a critical role within our Performance Lab. In this high-impact position, you will drive performance benchmarking of AI inference, training, and storage workloads, with a focus on our network infrastructure. You will produce reports that help customers with deployment and help the marketing team position the product. While the AI workloads (inference and training) run on our servers, your primary focus will be optimizing the Ethernet fabric that connects them. You will execute rigorous performance benchmarks, isolate complex system bottlenecks, and tune parameters to achieve maximum throughput and minimum latency. If you have a deep understanding of Ethernet fabrics, the demands of machine learning systems, and Linux environments, and you thrive on solving complex performance puzzles, we want you on our team.

Requirements

Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or a related technical field with 12+ years of related industry experience, or a Master's degree in one of these fields with 10+ years of related industry experience.
Deep familiarity and hands-on experience with Linux operating systems, including system-level performance tuning and troubleshooting.
Strong proficiency in programming and scripting languages, specifically Python and C++.
Familiarity with modern machine learning frameworks, particularly PyTorch, and a solid understanding of how AI models consume compute and network resources.
Proven experience in performance testing and validation of Ethernet switch systems.
Extensive experience with performance metrics, profiling, and benchmarking tools.
Strong problem-solving skills with a proven ability to diagnose root causes in complex, distributed systems.

Nice To Haves

Experience with RDMA (Remote Direct Memory Access) and RoCEv2 (RDMA over Converged Ethernet).
Prior experience building CI/CD pipelines for automated hardware or software performance regression testing.
Familiarity with containerization and orchestration tools (Docker, Kubernetes) used in AI deployments.

Responsibilities

Install, configure, and run industry-standard AI performance benchmarks, with a strong emphasis on MLPerf (Training and Inference) and NCCL tests.
Tune and optimize network parameters, focusing heavily on Ethernet fabric performance, to ensure seamless data flow for distributed AI workloads running on server clusters.
Identify, isolate, and troubleshoot complex system performance bottlenecks spanning the Linux OS, server hardware, and Ethernet switches.
Design, develop, and implement robust performance testing frameworks and automation tools to streamline continuous benchmarking.
Document test methodologies, communicate performance findings, and provide actionable improvement recommendations to hardware, software, and networking stakeholders.