The Fleet Reliability Operations team is the heart of CoreWeave’s capacity delivery and maintenance effort. This team is responsible for provisioning, updating, and triaging server nodes, and for executing the processes and tooling that configure and validate our server fleet. The team is first in line to respond to hardware issues in production, and is empowered to drive the design and priority of automation and observability for our server fleet lifecycle.

We are seeking an Operations Manager for the Fleet Reliability Operations team who can help us maintain and improve our high volume of delivery and scale as we 10x the size of our fleet. This individual will develop a strong pipeline of talent, manage onboarding and training, provide process and thought leadership across the team’s domain, and champion reliability and customer satisfaction.

As the manager of this team, you would have the opportunity to:

Build and lead a 24/7 team of process-oriented, reliability- and observability-focused engineers.
Lead the socialization and documentation of clear and consistent processes for provisioning, validating, and troubleshooting nodes in our server fleet.
Think critically about and advocate for process and automation improvements, prioritizing event-driven automated remediation as the end goal.
Provide a 24/7 engineering support function for high-criticality, time-sensitive node delivery and maintenance.
Drive and improve our program of onboarding, documentation, enablement, and performance management to help your team members achieve new heights of personal growth and capability.
Drive the culture and tone for how your team keeps score, both in how they communicate with and support each other and in how they enable the rest of CoreWeave.
Job Type
Full-time
Career Level
Manager
Education Level
No Education Listed