Large-scale machine learning (ML) training requires a huge infrastructure footprint, spanning thousands of machines and even datacenters, all of which is connected by an equally large and dense networking infrastructure. Join us to directly enable the next generation of Google's ML/Supercomputer infrastructure. Our mission is to find ways to increase availability, reduce risk to production traffic, and operate ML hardware more efficiently over its lifecycle. We are on the critical path for delivering new ML infrastructure, while also helping to reduce infrastructure costs. We build key software components that make Cloud and ML infrastructure reliable, highly available, and cost-efficient for Google's internal and external customers. Please note that the compensation details listed in US role postings reflect the base salary only, and do not include bonus, equity, or benefits.
Education Level
No Education Listed
Number of Employees
5,001-10,000 employees