We are seeking a highly skilled Senior Site Reliability Engineer to join our Technical Operations team and lead reliability, scalability, and performance initiatives for AI/ML workloads across multi-cloud and on-prem environments. This role will focus on building and maintaining resilient infrastructure for advanced data science workflows, including NVIDIA DGX systems, leveraging platforms such as Domino Data Lab, Slurm, and NVIDIA Base Command, while driving automation, observability, and networking optimization.