HPC/ML Infrastructure Engineer

Spellbrush
San Francisco, CA
Onsite

About The Position

We’re looking for an experienced HPC infrastructure engineer to lead bringup, administration, and operations on what is probably the largest anime AI training cluster in the world. You’ll serve as the bridge between our researchers and the bare GPU machines, helping to make sure that SLURM jobs are running, parallel filesystems are serving, network is transmitting, and that the anime models are training.

Requirements

  • Familiarity with the modern HPC software landscape, including SLURM deploys through Slinky on K8s, provisioning through Warewulf/MAAS/Ansible, filesystems through WEKA/VAST/Ceph, VPN and access through Tailscale, and monitoring via the Grafana/Prometheus stack.
  • Strong Linux sysadmin skills, including wrangling LDAP, triaging dmesg, and setting sticky bits on directories.
  • Comfortable working with physical computers and datacenters.
  • Experience working on small, fast-paced teams.
  • Ability to directly help AI researchers train image models.

Nice To Haves

  • Love of anime and the anime aesthetic.
  • Experience with SLURM deploys through Slinky on K8s.
  • Experience with provisioning through Warewulf/MAAS/Ansible.
  • Experience with filesystems through WEKA/VAST/Ceph.
  • Experience with VPN and access through Tailscale.
  • Experience with monitoring via the Grafana/Prometheus stack.
  • Experience with wrangling LDAP.
  • Experience with triaging dmesg.
  • Experience setting sticky bits on directories for misbehaving users and tools.
  • Experience with bringing up and managing clusters.
  • Experience with racking, stacking, and provisioning HGX-based nodes.
  • Experience with VLAN design.
  • Experience with fiber routing.

Responsibilities

  • Lead bringup, administration, and operations on a large-scale AI training cluster.
  • Serve as the bridge between researchers and bare GPU machines.
  • Ensure SLURM jobs are running.
  • Ensure parallel filesystems are serving data.
  • Ensure the network is transmitting data.
  • Support the training of anime models.

Benefits

  • Visa sponsorship is available.