Performance Analysis Engineer Intern (Summer 2026)

Astera Labs Early Career
San Jose, CA

About The Position

We are seeking a Performance Analysis Engineer Intern to drive system-level performance optimization across large-scale AI training and inference environments. In this role, you will analyze, profile, and optimize distributed workloads running on high-density accelerator clusters, working across the full stack: from ML frameworks and communication libraries to network fabrics and hardware architecture. You will play a critical role in ensuring that next-generation AI workloads achieve near-peak hardware efficiency, while directly influencing software architecture, infrastructure design, and future silicon and networking roadmaps.

Requirements

  • Education: Bachelor’s, Master’s, or PhD in Computer Engineering, Electrical Engineering, or a related field.
  • Hands-on experience optimizing distributed ML workloads across multi-node accelerator clusters.
  • Strong understanding of data parallelism, model parallelism, and pipeline parallelism.
  • Deep knowledge of GPU or accelerator architectures, including compute units, memory hierarchies, and interconnects (PCIe, NVLink, or equivalents).
  • Experience working with NCCL, RCCL, MPI, or similar collective communication frameworks.
  • Strong understanding of high-performance networking technologies (Ethernet, InfiniBand, RoCE) and their impact on distributed workloads.
  • PyTorch & ML Systems Proficiency
  • Advanced experience with PyTorch, including distributed training internals and execution tracing.
  • Ability to diagnose and optimize framework-level and runtime bottlenecks.
  • Comfortable debugging issues across software, firmware, and hardware boundaries.
  • Strong proficiency in Python and C/C++.
  • Experience building performance analysis tools, automation, and benchmarking frameworks.
  • Ability to clearly communicate complex performance findings to cross-functional teams.
  • Comfortable working in fast-moving, ambiguous environments.
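For candidates unfamiliar with the collective communication frameworks named above: libraries like NCCL and MPI center on operations such as all-reduce, where every rank ends up holding the elementwise sum of all ranks' gradient buffers. The following is an illustrative pure-Python sketch of the semantics only, not the ring or tree algorithms those libraries actually run over the interconnect:

```python
from typing import List

def all_reduce_sum(buffers: List[List[float]]) -> List[List[float]]:
    """Toy all-reduce: every 'rank' receives the elementwise sum of
    all ranks' buffers. Real libraries (NCCL, RCCL, MPI) implement
    this with bandwidth-optimal ring/tree algorithms over PCIe,
    NVLink, or the network fabric; this only shows the semantics."""
    # Elementwise sum across ranks...
    reduced = [sum(vals) for vals in zip(*buffers)]
    # ...then every rank gets an identical copy of the result.
    return [list(reduced) for _ in buffers]

# Example: gradient shards from 3 data-parallel ranks
ranks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(all_reduce_sum(ranks))  # each rank gets [9.0, 12.0]
```

In data-parallel training, this is the step that synchronizes gradients after each backward pass, and its overlap with compute is a common profiling target.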

Responsibilities

  • Cluster-Scale Performance Profiling
  • Collective Library Optimization
  • Network Fabric Analysis
  • Advanced Load Balancing & Traffic Optimization
  • PyTorch Stack Optimization
  • GPU & Accelerator Utilization
  • Performance Modeling & Benchmarking
  • Hardware–Software Co-Design
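As a flavor of the benchmarking work listed above, performance modeling usually starts from a repeatable micro-benchmark harness. A minimal sketch using only the standard library (the warmup and trial counts here are arbitrary illustrative choices):

```python
import time
import statistics
from typing import Callable, Dict

def benchmark(fn: Callable[[], None],
              warmup: int = 3, trials: int = 20) -> Dict[str, float]:
    """Time fn over several trials, discarding warmup runs so cache
    and JIT effects don't skew results. Reports median and tail
    latency, which are more robust than the mean for latency data."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "median_s": statistics.median(samples),
        "p95_s": samples[int(0.95 * (len(samples) - 1))],
        "max_s": samples[-1],
    }

stats = benchmark(lambda: sum(range(100_000)))
print({k: f"{v:.6f}" for k, v in stats.items()})
```

Cluster-scale profiling layers the same idea over distributed traces (e.g., PyTorch execution traces), but the core discipline of warmup, repeated trials, and percentile reporting is identical.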