CoreWeave is seeking a Staff Software Engineer to join its Applied Training team. The role addresses a specific problem: AI labs spend valuable research time on cluster setup and operations instead of model training. The team is building a Kubernetes-native research cluster platform and a sandbox client for agentic training and evaluation, with the goal of giving customers research infrastructure comparable to that found in frontier labs.

The engineer will be an early member of a small team, contributing to the roadmap, working closely with customers and internal teams, and potentially owning a specific project such as the research cluster platform or the sandbox infrastructure.

For the research cluster platform, responsibilities include designing and building a complete research cluster experience: the CLI, the job configuration schema, Kubernetes operators, and daemons. This work targets researcher pain points such as code distribution, checkpoint-triggered evaluation, cross-cluster scheduling, and programmatic job control.

For the sandbox infrastructure, the role involves owning the Python SDK and collaborating with the backend team to enable large-scale RL training runs that use isolated containers for agent rollouts and benchmarks.

The engineer will also write documentation on running popular OSS training frameworks on CoreWeave to help customers, and will work directly with large AI labs to understand their internal supercomputing stacks and feed that knowledge into product development.
Job Type
Full-time
Career Level
Senior
Education Level
No Education Listed