Member of Technical Staff, Cluster Administration

Inferact•San Francisco, CA

$200,000 - $400,000•Hybrid

About The Position

Inferact is seeking a hands-on cluster administration engineer to manage and operate its high-performance GPU compute infrastructure. The role keeps Inferact's engineering teams productive by maintaining the health, availability, observability, and usability of costly GPU and HPC clusters across multiple cloud and dedicated compute providers. The engineer is responsible for cluster health, GPU availability, monitoring, alerting, scheduling, access, diagnostics, and incident response. The position involves close collaboration with engineering leadership and infrastructure owners to standardize how compute is provisioned, operated, debugged, and scaled across providers, which directly affects how quickly Inferact can build and improve vLLM systems.

Requirements

Bachelor's degree or equivalent experience in computer science, engineering, systems administration, or a similar field.
Hands-on experience administering large compute clusters, HPC environments, university or research clusters, supercomputing systems, or production GPU clusters.
Strong Linux systems administration fundamentals across networking, processes, storage, package management, shell scripting, logs, access control, and system debugging.
Experience operating GPU servers, including driver management, GPU health monitoring, hardware diagnostics, and handling node failures, memory errors, and scheduler issues.
Experience with cluster scheduling and resource allocation using SLURM, Kubernetes, or equivalent tooling.
Ability to own urgent infrastructure incidents end-to-end when compute issues are blocking engineering teams.
Ability to automate operational workflows using Bash, Python, Ansible, Terraform, Helm, or similar tooling.

Nice To Haves

Experience operating GPU compute across providers such as Lambda, CoreWeave, Crusoe, Nebius, Together, Fireworks, RunPod, or similar environments.
Experience improving cluster utilization, reducing idle or unavailable GPU capacity, and debugging scheduling or resource contention issues.
Familiarity with high-performance GPU networking such as InfiniBand, RoCE, NVLink / NVSwitch, RDMA, NCCL, or equivalent systems.
Experience with storage for HPC or ML workloads, including NFS, Lustre, Ceph, distributed filesystems, or other high-throughput storage systems.
Experience managing secure access, identity, permissions, SSH, VPNs, bastion hosts, secrets, and basic infrastructure security hygiene.
Background in research computing, scientific computing, ML infrastructure, SRE, platform engineering, or infrastructure operations for engineering-heavy teams.
Experience managing GPU or HPC infrastructure in a university lab, national lab, research institution, AI infrastructure company, hedge fund, HFT firm, or large-scale ML platform team.
Track record of building monitoring, alerting, runbooks, health checks, or remediation workflows that materially reduced operational toil or incident resolution time.
Experience operating Kubernetes clusters for ML or GPU workloads at meaningful scale.
Experience standardizing provisioning, diagnostics, monitoring, and operating patterns across multiple compute providers.
Experience carrying real operational responsibility for infrastructure used by many engineers or researchers.

Responsibilities

Own and operate high-performance GPU compute infrastructure.
Maintain cluster health and GPU availability, and own monitoring, alerting, scheduling, access, diagnostics, and incident response.
Work with engineering leadership and infrastructure owners to standardize compute provisioning, operation, debugging, and scaling across providers.
Directly impact the speed at which Inferact can build, test, and improve vLLM systems.
Take ownership of urgent infrastructure incidents end-to-end when compute issues are blocking engineering teams.
Automate operational workflows using Bash, Python, Ansible, Terraform, Helm, or similar tooling.
Improve cluster utilization, reduce idle or unavailable GPU capacity, and debug scheduling or resource contention issues.
Manage secure access, identity, permissions, SSH, VPNs, bastion hosts, and secrets, and maintain basic infrastructure security hygiene.
Build monitoring, alerting, runbooks, health checks, and remediation workflows that materially reduce operational toil and incident resolution time.
Standardize provisioning, diagnostics, monitoring, and operating patterns across multiple compute providers.
Carry real operational responsibility for infrastructure used by many engineers or researchers.