Performance Analysis Engineer (NCG 2026)

Astera Labs (Early Career) · San Jose, CA

About The Position

We are seeking a Performance Analysis Engineer to drive system-level performance optimization across large-scale AI training and inference environments. In this role, you will analyze, profile, and optimize distributed workloads running on high-density accelerator clusters, working across the full stack, from ML frameworks and communication libraries to network fabrics and hardware architecture. You will play a critical role in ensuring that next-generation AI workloads achieve near-peak hardware efficiency, while directly influencing software architecture, infrastructure design, and future silicon and networking roadmaps.

Requirements

  • Education: Bachelor’s, Master’s, or PhD in Computer Engineering, Electrical Engineering, or a related field.
  • Hands-on experience optimizing distributed ML workloads across multi-node accelerator clusters.
  • Strong understanding of data parallelism, model parallelism, and pipeline parallelism.
  • Deep knowledge of GPU or accelerator architectures, including compute units, memory hierarchies, and interconnects (PCIe, NVLink, or equivalents).
  • Experience working with NCCL, RCCL, MPI, or similar collective communication frameworks.
  • Strong understanding of high-performance networking (Ethernet, InfiniBand, RoCE) and its impact on distributed workloads.
  • PyTorch & ML Systems Proficiency: Advanced experience with PyTorch, including distributed training internals and execution tracing; ability to diagnose and optimize framework-level and runtime bottlenecks.
  • Comfortable debugging issues across software, firmware, and hardware boundaries.
  • Strong proficiency in Python and C/C++.
  • Experience building performance analysis tools, automation, and benchmarking frameworks.
  • Ability to clearly communicate complex performance findings to cross-functional teams.
  • Comfortable working in fast-moving, ambiguous environments.

Responsibilities

  • Cluster-Scale Performance Profiling: Execute and profile state-of-the-art training and inference workloads (e.g., LLMs, diffusion models) across large-scale accelerator clusters. Identify and resolve bottlenecks across compute, memory bandwidth, and interconnect latency that impact end-to-end Job Completion Time (JCT).
  • Collective Library Optimization: Tune and optimize distributed communication backends such as NCCL, RCCL, and MPI. Improve the efficiency of collective operations including All-Reduce, All-to-All, Reduce-Scatter, and Broadcast to minimize synchronization overhead.
  • Network Fabric Analysis: Conduct deep-dive analysis of network performance, diagnosing issues such as packet loss, congestion, head-of-line blocking, and tail latency. Partner with infrastructure teams to improve network behavior under real-world AI workloads.
  • Advanced Load Balancing & Traffic Optimization: Design and implement intelligent load-balancing strategies and traffic-shaping algorithms. Prevent network and compute “hot spots” in high-density AI clusters and improve workload fairness and throughput.
  • PyTorch Stack Optimization: Leverage advanced PyTorch capabilities including DistributedDataParallel (DDP), Fully Sharded Data Parallel (FSDP), and torch.compile. Optimize execution graphs, runtime traces, and memory usage for maximum hardware efficiency.
  • GPU & Accelerator Utilization: Apply best practices in kernel fusion, mixed-precision execution (FP16/FP8/INT8), and memory management. Reduce idle “bubble” time and drive sustained peak FLOPS utilization during training and inference.
  • Performance Modeling & Benchmarking: Build automated benchmarking suites and performance regression tests. Develop quantitative models to predict how architectural changes (e.g., attention mechanisms, batch sizes, parallelism strategies) scale across different cluster topologies.
  • Hardware–Software Co-Design: Collaborate closely with systems, infrastructure, and silicon teams to translate performance findings into actionable requirements. Influence the design of next-generation AI accelerators, NICs, and interconnects.
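As a flavor of the benchmarking and regression-testing work described above, here is a minimal, stdlib-only Python sketch of a timing harness with a regression gate. The helper names (`measure`, `check_regression`) and the 5% tolerance are illustrative assumptions, not part of this posting or any specific tool used by the team.

```python
import time
import statistics


def measure(fn, warmup=2, iters=5):
    """Time a workload callable, discarding warmup runs.

    Returns the median wall-clock time in seconds over `iters` runs,
    which is more robust to outliers than the mean.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def check_regression(current_s, baseline_s, tolerance=0.05):
    """Return True if the current runtime is within `tolerance`
    (a fraction, e.g. 0.05 = 5%) of the recorded baseline."""
    return current_s <= baseline_s * (1 + tolerance)
```

In practice the callable would wrap a real training or inference step and the baseline would come from a stored history per cluster topology, but the pass/fail logic stays this simple: `check_regression(1.04, 1.0)` passes, `check_regression(1.10, 1.0)` flags a regression.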