HPC Engineer

Periodic Labs
Menlo Park, CA
Hybrid

About The Position

As an HPC Engineer at Periodic Labs, you will design, build, and operate the high-performance computing infrastructure that powers our AI and scientific research. Our models demand extreme compute at scale — large GPU and CPU clusters, high-speed interconnects, low-latency parallel storage, and workload schedulers that make every cycle count. You will work directly with researchers and infrastructure engineers to ensure our compute environment is fast, reliable, and optimized for scientific discovery at the frontier.

This is a deeply hands-on role. You will architect and tune systems, automate provisioning, diagnose performance bottlenecks, and design for resilience at scale. You’ll partner with research and ML teams to understand their workloads and shape an HPC environment that removes friction and accelerates science.

Requirements

  • Experience designing and operating large-scale HPC or GPU clusters in research, cloud, or enterprise environments
  • Deep knowledge of high-speed interconnects such as InfiniBand (HDR/NDR) or RoCE, including fabric management, tuning, and troubleshooting
  • Hands-on experience with parallel and distributed storage systems (Lustre, GPFS, WEKA, BeeGFS, or similar) — configuration, performance tuning, and capacity management
  • Experience with workload managers and schedulers such as Slurm, PBS Pro, LSF, or Kubernetes-based HPC orchestration
  • Linux systems administration at scale, including kernel tuning, NUMA optimization, CPU and memory affinity, and GPU driver management
  • Infrastructure automation using Ansible, Terraform, or equivalent — you treat infrastructure as code
  • Experience with GPU computing environments including CUDA, NCCL, MPI, and multi-node distributed training or simulation setups
  • Performance profiling, benchmarking, and tuning of computational workloads across CPU, GPU, memory, network, and storage
  • Experience with monitoring and observability tooling (Prometheus, Grafana, or equivalent) in large, heterogeneous compute environments
  • Ability to collaborate with researchers or data scientists to understand workload requirements and translate them into infrastructure decisions
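To give a flavor of the CPU and memory affinity work named above, here is a minimal Python sketch that pins the current process to the cores of one NUMA node. The node-to-core mapping is a hard-coded assumption for illustration; on a real host it would come from `lscpu` or `/sys/devices/system/node`, and `os.sched_setaffinity` is Linux-only:

```python
import os

# Hypothetical NUMA node -> core mapping; a real cluster's topology
# would be discovered from lscpu or /sys/devices/system/node.
NUMA_NODE_CORES = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}

def pin_to_cores(cores):
    """Restrict the current process to `cores` (Linux only) and
    return the affinity set the kernel actually applied."""
    os.sched_setaffinity(0, cores)  # pid 0 = the calling process
    return os.sched_getaffinity(0)

# Pin only to node-0 cores the scheduler already allows us, so the
# example also behaves inside a constrained cgroup or container.
allowed = os.sched_getaffinity(0)
target = NUMA_NODE_CORES[0] & allowed
if target:
    print(sorted(pin_to_cores(target)))
os.sched_setaffinity(0, allowed)  # restore the original affinity
```

In practice the same pinning is usually delegated to the scheduler (e.g. Slurm's `--cpu-bind`) or to `numactl` rather than done in application code.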

Nice To Haves

  • Experience operating GPU clusters for large-scale AI or ML training workloads such as multi-node transformer training
  • Familiarity with AI accelerators beyond GPUs, such as TPUs, Trainium, or custom ASIC environments
  • Experience in mixed on-prem and cloud HPC environments, including burst-to-cloud or hybrid scheduling patterns
  • Background in scientific computing domains such as computational chemistry, physics simulation, or bioinformatics
  • Experience with containerized HPC environments (Singularity/Apptainer, Docker, or container-aware schedulers)
  • Knowledge of network security, access control, and compliance requirements for regulated research data
  • Contributions to open-source HPC tooling or published work on HPC system design or performance

Responsibilities

  • Design, deploy, and operate large-scale GPU and CPU clusters for AI training, scientific simulation, and research workloads
  • Manage and optimize high-speed interconnect fabrics (InfiniBand, RoCE) and parallel filesystems (Lustre, GPFS, WEKA, or equivalent) for maximum throughput and minimum latency
  • Own workload scheduling and resource management using Slurm, Kubernetes, or similar systems — tuning for throughput, fairness, and researcher productivity
  • Implement and maintain automated cluster provisioning, configuration management, and lifecycle tooling using Ansible, Terraform, or custom orchestration
  • Monitor cluster health, performance, and utilization; build dashboards and alerting to proactively identify and resolve bottlenecks
  • Partner with research and ML engineering teams to profile workloads, diagnose performance issues, and tune hardware and software stacks for specific computational demands
  • Design and implement backup, disaster recovery, and fault-tolerance strategies for research data and compute infrastructure
  • Evaluate and integrate new hardware (GPUs, accelerators, networking) and software technologies as the field evolves
  • Establish standards and runbooks for HPC operations, capacity planning, and incident response
  • Collaborate with security and infrastructure teams to implement access controls, network segmentation, and compliance controls appropriate for a research environment
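As one concrete flavor of the dashboarding and alerting responsibilities above, a utilization check might flag nodes whose average GPU utilization falls below a threshold. The node names, sample values, and threshold here are illustrative assumptions, not actual tooling or data from Periodic Labs:

```python
def underutilized_nodes(gpu_util_by_node, threshold=0.3):
    """Return node names whose average GPU utilization (0.0-1.0)
    is below `threshold`, sorted for stable alert output."""
    return sorted(
        node for node, samples in gpu_util_by_node.items()
        if samples and sum(samples) / len(samples) < threshold
    )

# Illustrative samples, e.g. scraped from a GPU metrics exporter.
samples = {
    "gpu-node-01": [0.92, 0.88, 0.95],
    "gpu-node-02": [0.05, 0.10, 0.02],
    "gpu-node-03": [0.45, 0.20, 0.33],
}
print(underutilized_nodes(samples))  # → ['gpu-node-02']
```

In a production setup this logic would typically live in a Prometheus alerting rule rather than ad-hoc Python, but the threshold-over-average shape is the same.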

Benefits

  • Visa sponsorship