Performance Analysis Engineer (NCG 2026)

Astera Labs (Early Career) · San Jose, CA

About The Position

We are seeking a Performance Analysis Engineer to drive system-level performance optimization across large-scale AI training and inference environments. In this role, you will analyze, profile, and optimize distributed workloads running on high-density accelerator clusters, working across the full stack, from ML frameworks and communication libraries to network fabrics and hardware architecture. You will play a critical role in ensuring that next-generation AI workloads achieve near-peak hardware efficiency, while directly influencing software architecture, infrastructure design, and future silicon and networking roadmaps.

Requirements

  • Education: Bachelor’s, Master’s, or PhD in Computer Engineering, Electrical Engineering, or a related field.
  • Hands-on experience optimizing distributed ML workloads across multi-node accelerator clusters.
  • Strong understanding of data parallelism, model parallelism, and pipeline parallelism.
  • Deep knowledge of GPU or accelerator architectures, including compute units, memory hierarchies, and interconnects (PCIe, NVLink, or equivalents).
  • Experience working with NCCL, RCCL, MPI, or similar collective communication frameworks.
  • Strong understanding of high-performance networking (Ethernet, InfiniBand, RoCE) and its impact on distributed workloads.
  • PyTorch & ML Systems Proficiency: Advanced experience with PyTorch, including distributed training internals and execution tracing; ability to diagnose and optimize framework-level and runtime bottlenecks.
  • Comfortable debugging issues across software, firmware, and hardware boundaries.
  • Strong proficiency in Python and C/C++.
  • Experience building performance analysis tools, automation, and benchmarking frameworks.
  • Ability to clearly communicate complex performance findings to cross-functional teams.
  • Comfortable working in fast-moving, ambiguous environments.

Responsibilities

  • Cluster-Scale Performance Profiling: Execute and profile state-of-the-art training and inference workloads (e.g., LLMs, diffusion models) across large-scale accelerator clusters. Identify and resolve bottlenecks across compute, memory bandwidth, and interconnect latency that impact end-to-end Job Completion Time (JCT).
  • Collective Library Optimization: Tune and optimize distributed communication backends such as NCCL, RCCL, and MPI. Improve the efficiency of collective operations including All-Reduce, All-to-All, Reduce-Scatter, and Broadcast to minimize synchronization overhead.
  • Network Fabric Analysis: Conduct deep-dive analysis of network performance, diagnosing issues such as packet loss, congestion, head-of-line blocking, and tail latency. Partner with infrastructure teams to improve network behavior under real-world AI workloads.
  • Advanced Load Balancing & Traffic Optimization: Design and implement intelligent load-balancing strategies and traffic-shaping algorithms. Prevent network and compute “hot spots” in high-density AI clusters and improve workload fairness and throughput.
  • PyTorch Stack Optimization: Leverage advanced PyTorch capabilities including DistributedDataParallel (DDP), Fully Sharded Data Parallel (FSDP), and torch.compile. Optimize execution graphs, runtime traces, and memory usage for maximum hardware efficiency.
  • GPU & Accelerator Utilization: Apply best practices in kernel fusion, mixed-precision execution (FP16/FP8/INT8), and memory management. Reduce idle “bubble” time and drive sustained peak FLOPS utilization during training and inference.
  • Performance Modeling & Benchmarking: Build automated benchmarking suites and performance regression tests. Develop quantitative models to predict how architectural changes (e.g., attention mechanisms, batch sizes, parallelism strategies) scale across different cluster topologies.
  • Hardware–Software Co-Design: Collaborate closely with systems, infrastructure, and silicon teams to translate performance findings into actionable requirements. Influence the design of next-generation AI accelerators, NICs, and interconnects.
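As a flavor of the benchmarking and regression-testing work described above, here is a minimal, stdlib-only Python sketch of a timing harness with a regression gate. The helper names (`measure`, `check_regression`) and the 5% tolerance are illustrative assumptions, not part of this posting or any specific tool used by the team.

```python
import time
import statistics


def measure(fn, warmup=2, iters=5):
    """Time a workload callable, discarding warmup runs.

    Returns the median wall-clock time in seconds over `iters` runs,
    which is more robust to outliers than the mean.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def check_regression(current_s, baseline_s, tolerance=0.05):
    """Return True if the current runtime is within `tolerance`
    (a fraction, e.g. 0.05 = 5%) of the recorded baseline."""
    return current_s <= baseline_s * (1 + tolerance)
```

In practice the callable would wrap a real training or inference step and the baseline would come from a stored history per cluster topology, but the pass/fail logic stays this simple: `check_regression(1.04, 1.0)` passes, `check_regression(1.10, 1.0)` flags a regression.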