About The Position

Graphcore is a globally recognised leader in Artificial Intelligence computing systems. The company designs advanced semiconductors and data centre hardware that provide the specialised processing power needed to drive AI innovation, while delivering the efficiency required to support its broader adoption. As part of the SoftBank Group, Graphcore is a member of an elite family of companies responsible for some of the world's most transformative technologies. We are opening a new AI Engineering Campus in Austin, which will play a central role in Graphcore's work building the future of AI computing.

Job Overview

As a Performance Engineer, you will lead benchmarking, performance analysis, and system optimization across AI and HPC workloads on Arm-based architectures. You will collaborate with hardware architects, software developers, and customer engineering teams to enhance system efficiency and scalability, ensuring Arm technology delivers industry-leading datacenter solutions.

Requirements

  • Demonstrated expertise in HPC and AI performance engineering, with hands-on experience in distributed systems.
  • Solid understanding of CPU/GPU/accelerator performance analysis, workload profiling, and scalability optimization.
  • Proven experience with Arm64, x86, and GPU architectures in large-scale datacenter environments.
  • Proficiency with performance tools such as VTune, Nsight, rocprof, PyTorch Profiler, MPI/OpenMP profilers, and Cray/Allinea tools.
  • Strong programming skills in Python, C/C++, Fortran, CUDA, and parallel frameworks (MPI, OpenMP, SYCL).
  • Experience with large AI frameworks (PyTorch, TensorRT, Megatron-LM, vLLM, SGLang, TorchTitan).
  • Familiarity with distributed training at scale (multi-node, multi-GPU clusters).
  • Excellent communication skills and experience working with cross-functional engineering teams.

Nice To Haves

  • Experience with datacenter-scale benchmarking and system acceptance testing.
  • Knowledge of interconnect fabrics (InfiniBand, Slingshot, Omni-Path, RoCE, EFA) and distributed storage systems (Lustre, GPFS, Weka).
  • Hands-on background with cloud HPC/AI deployments (AWS, Azure, GCP).
  • Familiarity with containerization and orchestration (Docker, Kubernetes, Slurm, PBS).
  • Background in exascale or pre-exascale performance co-design projects.
  • Strong publication record in HPC/AI performance analysis.
  • Experience leading small teams or cross-company performance projects.

Responsibilities

  • Design, implement, and analyze performance experiments for AI training, inference, and HPC applications across distributed clusters.
  • Develop tools and workflows to monitor, measure, and validate system and workload scalability.
  • Partner with system architects and software teams to identify bottlenecks and propose optimizations across the hardware/software stack.
  • Lead performance bring-up and validation of new hardware platforms, interconnects, and accelerators.
  • Collaborate with customers and Tier-1 partners to provide guidance on performance tuning and cluster-level deployment strategies.
  • Drive innovation in performance methodology, including predictive modeling, profiling frameworks, and benchmark development.
  • Present findings to engineering leadership, customers, and partners to influence architectural and design decisions.

Benefits

  • Be part of a groundbreaking team influencing the next generation of data center systems.
  • Collaborate with premier engineers and vendors to develop industry-leading AI hardware.
  • Drive innovation in performance methodology with global impact.
  • Grow professionally through involvement in sophisticated projects and multidisciplinary teamwork.
  • Join a company committed to diversity and inclusion, where your work matters and drives global progress.