Zoox is looking for a software engineer to work on our custom High-Performance Computing infrastructure and its supporting ecosystem of tools and services. This infrastructure is central to machine learning workflows across all Zoox software divisions, from data engineering to computer vision perception to simulation and more. You will take on a breadth of end-to-end responsibilities including distributed system design, algorithmic job scheduling, and adaptive cloud scaling in support of all of Zoox’s computational needs.

In this role, you will:
Design, build, and optimize a petabyte-scale, in-house HPC storage infrastructure, ensuring high performance and reliability for our machine learning workloads across both cloud and on-premise data centers.
Drive GPU efficiency by strategically collocating storage and compute, architecting a storage layer that keeps tens of thousands of GPUs fully utilized and prevents bottlenecks.
Drive key initiatives in training and storage optimization by partnering with ML practitioners, applying your deep understanding of frameworks such as PyTorch and TensorFlow to meet their evolving demands.
Investigate and adopt new distributed system paradigms and cutting-edge technologies to ensure our infrastructure can scale to meet ever-growing computational and storage demands.
Create production-grade web service APIs, SDKs, and other essential tools to deliver a world-class developer experience for all software teams at Zoox.
Job Type: Full-time
Career Level: Senior
Number of Employees: 1,001-5,000