Inferact is seeking a hands-on cluster administration engineer to manage and operate its high-performance GPU compute infrastructure. The role is crucial to the productivity of Inferact's engineering teams: it keeps expensive, high-performance GPU and HPC clusters healthy, available, observable, and usable across multiple cloud and dedicated compute providers. The engineer will be responsible for cluster health, GPU availability, monitoring, alerting, scheduling, access, diagnostics, and incident response. The position involves close collaboration with engineering leadership and infrastructure owners to standardize how compute is provisioned, operated, debugged, and scaled across providers, which directly affects how quickly vLLM systems are developed and improved.
Job Type
Full-time
Career Level
Mid Level
Education Level
Associate degree