Senior HPC & GPU Infrastructure Engineer

Sciforium · San Francisco, CA

About The Position

We are seeking a Senior HPC & GPU Infrastructure Engineer to take full ownership of the health, reliability, and performance of our GPU compute cluster. You will be the primary custodian of our high-density accelerator environment and the linchpin between hardware operations, distributed systems, and machine learning workflows. This role spans everything from hands-on Linux systems engineering and GPU driver bring-up to maintaining the ML software stack (CUDA/ROCm, PyTorch, JAX, vLLM). If you love squeezing every bit of performance out of hardware, enjoy debugging GPUs at scale, and want to build world-class AI infrastructure, this role is for you.

Requirements

  • 5+ years of experience in HPC, GPU cluster operations, Linux systems engineering, or similar roles.
  • Bachelor’s or Master’s degree in Computer Science, Computer Engineering, Electrical Engineering, or a related technical field.
  • Strong expertise with NVIDIA (H100/B200) or AMD (MI325X/MI355X) GPUs, including driver- and kernel-level debugging.
  • Deep understanding of Linux internals, kernel modules, hardware bring-up, and systems performance tuning.
  • Experience with network security, including VPNs, iptables/firewalld, SSH, and identity management (LDAP/FreeIPA/AD).
  • Proficiency in Bash and Python for scripting, automation, and workflow tooling (see the scripting sketch after this list).
  • Familiarity with ML software stacks: CUDA toolkit, cuDNN, NCCL, ROCm, JAX/PyTorch runtime behavior.
  • Deep debugging experience with NVLink/NVSwitch fabrics and RDMA networking.
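
To give a feel for the day-to-day scripting involved, the sketch below wraps nvidia-smi in Python to flag overheating GPUs and uncorrected ECC errors, in the spirit of the Bash/Python tooling described above. The query fields and the temperature threshold are illustrative assumptions, not a description of our actual monitoring stack.

```python
#!/usr/bin/env python3
"""Minimal GPU health probe sketch. Assumes NVIDIA GPUs with nvidia-smi on
the PATH; the query fields and the 85 C threshold are illustrative only."""

import subprocess
import sys

# Fields accepted by `nvidia-smi --query-gpu`; the ECC count reads "[N/A]"
# when ECC reporting is unavailable on the device.
FIELDS = "index,name,temperature.gpu,ecc.errors.uncorrected.volatile.total"


def probe_gpus(temp_limit_c: int = 85) -> int:
    """Return non-zero if any GPU is too hot or reports uncorrected ECC errors."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    status = 0
    for line in out.strip().splitlines():
        idx, name, temp, ecc = [field.strip() for field in line.split(",")]
        if int(temp) >= temp_limit_c:
            print(f"GPU {idx} ({name}): {temp} C exceeds {temp_limit_c} C", file=sys.stderr)
            status = 1
        if ecc not in ("0", "[N/A]"):
            print(f"GPU {idx} ({name}): {ecc} uncorrected ECC errors", file=sys.stderr)
            status = 1
    return status


if __name__ == "__main__":
    sys.exit(probe_gpus())
```

In practice a probe like this would run under cron or a metrics exporter and feed cluster-wide alerting rather than being invoked by hand.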

Nice To Haves

  • Experience with cluster schedulers and orchestrators such as Slurm, Kubernetes, or Run:AI.
  • Exposure to vLLM, model serving optimizations, or inference systems.
  • Hands-on experience with configuration management tools (Ansible, SaltStack, Terraform).
  • Previous experience supporting ML research teams in a startup or research-heavy environment.

Responsibilities

  • On-Call Response: Act as the primary responder for system outages, GPU failures, node crashes, and cluster-wide incidents. Minimize downtime by resolving issues rapidly.
  • Cluster Monitoring: Implement and maintain monitoring for GPU health, thermal behavior, PCIe/NVLink topology issues, memory errors, and overall system load.
  • Vendor Liaison: Coordinate with data center staff, hardware vendors, and on-site technicians for repairs, RMA processing, and physical maintenance of the cluster.
  • OS Management: Install, patch, and maintain Linux distributions (Ubuntu / CentOS / RHEL). Ensure consistent configuration, kernel tuning, and automation for large node fleets.
  • Security & Access Controls: Configure VPNs, iptables/firewalls, SSH hardening, and network routing to secure our compute infrastructure.
  • Identity & Storage Management: Manage LDAP/FreeIPA/AD for user identity, and administer distributed file systems such as NFS, GPFS, or Lustre.
  • Deployment & Bring-Up: Lead deployment of new GPU nodes, including BIOS configuration, NUMA tuning, GPU topology validation, and cluster integration.
  • Driver & Kernel Management: Build and optimize kernel modules, and maintain GPU drivers and runtime stacks for both NVIDIA (CUDA) and AMD (ROCm).
  • Software Stack Maintenance: Maintain and optimize ML frameworks and libraries, including PyTorch, JAX, the CUDA toolkit, cuDNN, ROCm, and NCCL, plus supporting runtime systems (see the fabric sanity-check sketch after this list).
  • Advanced Debugging: Troubleshoot complex interactions involving GPUs, compilers, ML frameworks, and distributed training runtimes (e.g., vLLM compilation failures, CUDA memory leaks, ROCm kernel crashes).
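
To make the bring-up and debugging responsibilities above more concrete, here is a minimal sketch of an NCCL all-reduce sanity check using PyTorch distributed, the kind of test one might run after integrating a new GPU node. It is a sketch under assumptions (PyTorch built with CUDA/NCCL, launched via torchrun), not a description of our internal validation tooling; the tensor size and script name are arbitrary illustrative choices.

```python
"""Minimal NCCL all-reduce sanity check sketch, assuming PyTorch built with
CUDA/NCCL and a torchrun launch (which sets RANK, WORLD_SIZE, LOCAL_RANK)."""

import os

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun provides the rendezvous info; NCCL is the backend of interest here.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Every rank contributes a tensor of ones; after a SUM all-reduce each
    # element should equal the world size if the GPU fabric is healthy.
    world_size = dist.get_world_size()
    x = torch.ones(64 * 1024 * 1024, device="cuda")  # ~256 MB of float32 per rank
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()

    ok = torch.allclose(x, torch.full_like(x, float(world_size)))
    print(f"rank {dist.get_rank()}: all-reduce {'OK' if ok else 'MISMATCH'}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc-per-node=8 allreduce_check.py`, a mismatch, error, or hang here is a quick signal to look at NVLink/NVSwitch, NCCL configuration, or the RDMA fabric before suspecting the training framework itself.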

Benefits

  • Medical, dental, and vision insurance
  • 401k plan
  • Daily lunch, snacks, and beverages
  • Flexible time off
  • Competitive salary and equity