As a Machine Learning Systems Administrator - HPC Infrastructure, you will be responsible for maintaining and developing the core infrastructure behind our machine learning research and production efforts. You’ll work closely with various training and inference teams to ensure the smooth operation of our systems while laying the groundwork for scalable, secure, and efficient workflows. You’ll work across: Administration and automation of our Linux-based cluster environments Managing user onboarding/offboarding, security auditing, and access control Monitoring system resources and job scheduling Supporting and improving developer workflows (e.g., VSCode compatibility, Docker) Enabling and supporting AI/ML workloads, including large-scale training jobs Comfortable operating across a wide range of infrastructure concerns and excited to own and improve critical systems. You’ll have a significant impact on both developer productivity and training and inference performance.
Stand Out From the Crowd
Upload your resume and get instant feedback on how well it matches this job.
Job Type
Full-time
Career Level
Mid Level
Education Level
No Education Listed