HPC/ML Infrastructure Engineer

Spellbrush•San Francisco, CA

Onsite

About The Position

We’re looking for an experienced HPC infrastructure engineer to lead bringup, administration, and operations on what is probably the largest anime AI training cluster in the world. You’ll serve as the bridge between our researchers and the bare GPU machines, helping to make sure that SLURM jobs are running, parallel filesystems are serving, the network is transmitting, and the anime models are training.

You may be a good fit if:

You love anime and the anime aesthetic. This is probably one of the only jobs in the world where you will get to combine your love of anime and large-scale GPU systems.
You’re familiar with the modern HPC software landscape. Once upon a time, our team could install SLURM on a few bare metal nodes and get away with it. Now the landscape has become unbelievably complex, with SLURM deploys through Slinky on K8s, provisioning through Warewulf/MAAS/Ansible, filesystems through WEKA/VAST/Ceph, VPN and access through Tailscale, and monitoring via the Grafana/Prometheus stack. We’re looking for someone with relevant experience up and down the stack (and maybe a papercut or two to show for it!).
You also know the traditional sysadmin landscape. Bringing up and managing a cluster still requires good old Linux sysadmin skills, including wrangling LDAP, triaging dmesg, and setting sticky bits on directories for misbehaving users and tools.
You’re not afraid of physical computers. We’re building out edge datacenters and our CEO is still personally racking, stacking, and provisioning HGX-based nodes in our living room. Also his VLAN design sucks and he’s bad at fiber routing. Please send help.
You’re comfortable working on small, fast-paced teams. We currently have a very tiny research team, and you’ll be directly helping our AI researchers train the best anime image model in the world.

We also believe in the unmatched speed of in-person teams, and prefer on-site collaboration in either our primary research office in Tokyo (downtown Akihabara) or San Francisco (Dogpatch!). The Bay Area is strongly preferred, as we have physical hardware there. Visa sponsorship is available.

Requirements

Familiarity with the modern HPC software landscape.
Experience with SLURM deployments (e.g., Slinky on K8s).
Experience with provisioning tools (e.g., Warewulf/MAAS/Ansible).
Experience with filesystems (e.g., WEKA/VAST/Ceph).
Experience with VPN and access tools (e.g., Tailscale).
Experience with monitoring stacks (e.g., Grafana/Prometheus).
Strong Linux sysadmin skills.
Experience with LDAP.
Experience triaging kernel logs with dmesg.
Experience with physical computer hardware management.
Comfort working on small, fast-paced teams.
Ability to work on-site.

Nice To Haves

Love of anime and the anime aesthetic.
Experience with edge datacenters.
Experience with HGX-based nodes.

Responsibilities

Lead bringup, administration, and operations on a large-scale AI training cluster.
Serve as the bridge between researchers and GPU hardware.
Ensure SLURM jobs are running.
Maintain parallel filesystems.
Keep the cluster network up and transmitting.
Support anime model training.
Install and manage SLURM deployments.
Handle provisioning through Warewulf/MAAS/Ansible.
Manage filesystems through WEKA/VAST/Ceph.
Configure VPN and access through Tailscale.
Implement monitoring via Grafana/Prometheus stack.
Perform Linux sysadmin tasks including wrangling LDAP, triaging dmesg, and managing directory permissions.
Work with physical computers, including racking, stacking, and provisioning nodes.
Collaborate with AI researchers.