We are looking for a Principal Software Engineer with experience building highly scalable, reliable software. We are building a powerful operational automation platform for GPU clusters that improves their performance and utilization while reducing operational toil.

In this role, you will architect the product to discover cluster resources such as hosts, GPUs, and switches, and to automate debug and repair actions on those resources. You will design the platform to support GPU clusters across different Cloud Service Providers (CSPs) and platforms such as Kubernetes and Slurm. You will also develop a distributed workflow execution runtime that runs actions in parallel and with fault tolerance across a large number of resources.

You will operate critical software services with high availability and reliability for customers, and you will influence the product roadmap in collaboration with teams across departments, with the goal of reducing Site Reliability Engineering (SRE) toil and improving hardware utilization. You will optimize system performance to increase scalability and improve the user experience, and you will lead and deliver high-impact projects with high quality, performance, and stability while minimizing resource consumption.

You will also raise the productivity and creativity of the technical staff by improving engineering practices, mentoring junior engineers, and providing high-quality design and code reviews. Programming in systems languages such as Go and Rust will be a key part of the role.