AI/HPC System Performance Engineer

MetaMenlo Park, CA
$154,000 - $217,000

About The Position

Meta is building some of the world's largest AI and high-performance computing infrastructure to power next-generation AI research and products. As an AI/HPC System Performance Engineer on the Network Infrastructure Engineering team, you will drive end-to-end performance characterization, bottleneck analysis, and optimization of large-scale AI training and inference clusters. In this role, you will work at the intersection of network fabric design, distributed computing, and AI workload behavior to ensure Meta's HPC systems deliver maximum throughput and efficiency for frontier model development.

Requirements

  • Experience profiling and optimizing distributed AI or HPC workloads, including familiarity with GPU interconnects, RDMA networking, and collective communication frameworks such as NCCL or MPI
  • Experience debugging complex, non-reproducible performance issues across multi-layer systems including network fabric, operating system, and application layers
  • Experience designing and implementing performance monitoring systems, including instrumentation, telemetry pipelines, and alerting for large-scale infrastructure
  • Experience driving cross-functional technical projects from requirements definition through production deployment, including communicating performance findings and trade-offs to diverse stakeholders
  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • 6+ years of experience in system performance engineering, network infrastructure engineering, or a related field within large-scale distributed computing or HPC environments

Nice To Haves

  • Experience in developing systems software in languages like C++
  • Experience with machine learning frameworks such as PyTorch and TensorFlow
  • Understanding of RDMA congestion control mechanisms on IB and RoCE Networks
  • Understanding of the latest artificial intelligence (AI) technologies
  • Understanding of AI training workloads and demands they exert on networks
  • Demonstrated ability to integrate AI tools to optimize/redesign workflows and drive measurable impact (e.g., efficiency gains, quality improvements)
  • Experience adhering to and implementing responsible, ethical AI practices (e.g., risk assessment, bias mitigation, quality and accuracy reviews)
  • Demonstrated ongoing AI skill development (e.g., prompt/context engineering, agent orchestration) and staying current with emerging AI technologies

Responsibilities

  • Profile and benchmark AI training and inference workloads across large-scale HPC clusters to identify network, compute, and memory bottlenecks
  • Develop and maintain performance analysis frameworks and dashboards to track system-level metrics including GPU utilization, network bandwidth, latency, and collective communication efficiency
  • Investigate and resolve performance regressions in distributed AI training environments, including issues related to RDMA fabrics, collective communication libraries, and job scheduling
  • Collaborate with network infrastructure, hardware, and AI research teams to define performance requirements and validate new HPC cluster configurations
  • Design and execute capacity and scalability experiments to inform network topology decisions for AI supercomputing infrastructure
  • Build tooling and automation to continuously monitor HPC system health, detect anomalies, and reduce mean time to mitigation during performance incidents
  • Establish service level objectives for AI cluster network performance and drive cross-functional alignment on reliability and efficiency targets
  • Lead technical design reviews for network and system architecture changes affecting AI workload performance, communicating trade-offs clearly to engineering and product stakeholders
  • Mentor other engineers on HPC performance methodologies, debugging techniques, and instrumentation best practices
  • Leverage AI-assisted workflows to accelerate root cause analysis, automate routine performance reporting, and expand coverage across the HPC stack

Benefits

  • bonus
  • equity
  • benefits
© 2026 Teal Labs, Inc
Privacy PolicyTerms of Service