Storage Reliability Engineer

CoreWeave
Sunnyvale, CA
Posted 1d ago
$139,000 - $204,000
Hybrid

About The Position

CoreWeave’s Storage Reliability team sits at the intersection of infrastructure engineering, operations, and customer enablement. The team is responsible for ensuring the stability, performance, and operational excellence of the storage systems powering some of the world’s largest AI workloads. We work directly with production systems at scale, partnering closely with engineering, solutions, and customer-facing teams to maintain reliability while continuously improving the tooling, automation, and observability that support our storage platform.

About the role: As a Storage Reliability Engineer, you will operate and support mission-critical storage systems that power large-scale AI and data-intensive workloads. You will work hands-on with production infrastructure, triaging complex incidents, debugging issues across the application, system, and kernel layers, and contributing fixes and improvements to the storage stack. This role sits at the boundary between engineering and operations, turning real-world production learnings into long-term reliability improvements through tooling, automation, and operational best practices. You’ll also partner closely with internal teams and customers to diagnose and resolve complex deployment and performance issues.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience
  • 5+ years of experience working with storage systems, distributed infrastructure, or low-level systems in production environments
  • Strong debugging and troubleshooting skills across user space and kernel space, including experience analyzing core dumps
  • Hands-on experience working with Kubernetes and Kubernetes CSI drivers
  • Experience working with storage protocols and APIs such as NFS and/or S3
  • Proficiency in systems programming and debugging in Go or a comparable language
  • Strong understanding of Linux internals, system performance, and system behavior under load
  • Experience operating production systems within an on-call rotation and responding to high-impact incidents
  • Demonstrated experience building tooling, automation, or diagnostics to improve reliability and operational efficiency
  • Experience supporting complex infrastructure deployments in collaboration with customer-facing or solutions engineering teams

Nice To Haves

  • Experience working with distributed storage systems in large-scale production environments
  • Experience contributing fixes or improvements to storage infrastructure or storage-related services
  • Experience building observability tooling or reliability frameworks for infrastructure systems
  • Experience supporting AI, HPC, or other high-performance computing workloads

Responsibilities

  • Operate and support mission-critical storage systems that power large-scale AI and data-intensive workloads
  • Work hands-on with production infrastructure: triage complex incidents, debug issues across the application, system, and kernel layers, and contribute fixes and improvements to the storage stack
  • Turn real-world production learnings into long-term reliability improvements through tooling, automation, and operational best practices
  • Partner closely with internal teams and customers to diagnose and resolve complex deployment and performance issues

Benefits

  • Medical, dental, and vision insurance - 100% paid for by CoreWeave
  • Company-paid Life Insurance
  • Voluntary supplemental life insurance
  • Short- and long-term disability insurance
  • Flexible Spending Account
  • Health Savings Account
  • Tuition Reimbursement
  • Ability to Participate in Employee Stock Purchase Program (ESPP)
  • Mental Wellness Benefits through Spring Health
  • Family-Forming support provided by Carrot
  • Paid Parental Leave
  • Flexible, full-service childcare support with Kinside
  • 401(k) with a generous employer match
  • Flexible PTO
  • Catered lunch each day in our office and data center locations
  • A casual work environment
  • A work culture focused on innovative disruption