We are seeking a highly skilled Senior Site Reliability Engineer to join our Technical Operations team and lead reliability, scalability, and performance initiatives for AI/ML workloads across multi-cloud and on-prem environments. This role will focus on building and maintaining resilient infrastructure for advanced data science workflows, including NVIDIA DGX systems, leveraging platforms such as Domino Data Lab, Slurm, and NVIDIA Base Command, while driving automation, observability, and networking optimization.
Job Type
Full-time
Career Level
Mid Level
Number of Employees
1-10 employees