Storage Reliability Engineer

CoreWeave•Sunnyvale, CA

1d•$139,000 - $204,000•Hybrid

About The Position

CoreWeave’s Storage Reliability team sits at the intersection of infrastructure engineering, operations, and customer enablement. The team is responsible for ensuring the stability, performance, and operational excellence of the storage systems powering some of the world’s largest AI workloads. We work directly with production systems at scale, partnering closely with engineering, solutions, and customer-facing teams to maintain reliability while continuously improving the tooling, automation, and observability that support our storage platform.

About the role: As a Storage Reliability Engineer, you will operate and support mission-critical storage systems that power large-scale AI and data-intensive workloads. You will work hands-on with production infrastructure, triaging complex incidents, debugging issues across the application, system, and kernel layers, and contributing fixes and improvements to the storage stack. This role sits at the boundary between engineering and operations, turning real-world production learnings into long-term reliability improvements through tooling, automation, and operational best practices. You’ll also partner closely with internal teams and customers to diagnose and resolve complex deployment and performance issues.

Requirements

Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience
5+ years of experience working with storage systems, distributed infrastructure, or low-level systems in production environments
Strong debugging and troubleshooting skills across user space and kernel space, including experience analyzing core dumps
Hands-on experience working with Kubernetes and Kubernetes CSI drivers
Experience working with storage protocols and APIs such as NFS and/or S3
Proficiency in systems programming and debugging in Go or a comparable language
Strong understanding of Linux internals, system performance, and system behavior under load
Experience operating production systems within an on-call rotation and responding to high-impact incidents
Demonstrated experience building tooling, automation, or diagnostics to improve reliability and operational efficiency
Experience supporting complex infrastructure deployments in collaboration with customer-facing or solutions engineering teams

Nice To Haves

Experience working with distributed storage systems in large-scale production environments
Experience contributing fixes or improvements to storage infrastructure or storage-related services
Experience building observability tooling or reliability frameworks for infrastructure systems
Experience supporting AI, HPC, or other high-performance computing workloads

Responsibilities

Operate and support mission-critical storage systems that power large-scale AI and data-intensive workloads
Work hands-on with production infrastructure: triage complex incidents, debug issues across the application, system, and kernel layers, and contribute fixes and improvements to the storage stack
Turn real-world production learnings into long-term reliability improvements through tooling, automation, and operational best practices
Partner closely with internal teams and customers to diagnose and resolve complex deployment and performance issues