Senior Site Reliability Engineer, Storage

Crusoe
Sunnyvale, CA
$166,000 - $201,000

About The Position

At Crusoe Energy Systems, our Site Reliability Engineering (SRE) team plays a mission-critical role in maintaining the performance and reliability of our AI-optimized cloud infrastructure. The Storage-focused SRE role is responsible for ensuring the availability, performance, and scalability of Crusoe’s cloud storage products and services, which power compute-intensive, latency-sensitive workloads for AI and HPC use cases. This role directly supports our vertically integrated, sustainable cloud platform by building and optimizing distributed, fault-tolerant storage systems at scale.

Requirements

  • 5+ years of professional experience in SRE, systems, or storage engineering.
  • Hands-on experience with distributed storage systems (e.g., Ceph, GlusterFS, OpenEBS) and deep understanding of object, block, and file storage paradigms.
  • Proficiency in a programming language such as Python, Go, Java, or C.
  • Experience with Infrastructure as Code and deployment tooling such as Terraform, Ansible, or Puppet.
  • Deep knowledge of Linux internals with a focus on I/O subsystems, memory management, and storage scheduling.
  • Familiarity with storage protocols like NFS, SMB, iSCSI, or NVMe-oF.
  • Strong experience working with containerized workloads and orchestration platforms (e.g., Kubernetes, Docker).
  • Excellent incident response, troubleshooting, and documentation practices.
  • Experience building and operating managed storage services at scale, such as object, file, and block storage (e.g., AWS, GCP, Azure).
  • Excellent communication skills.
  • Must be able to pass a background check.
  • Embody the Company values.

Nice To Haves

  • Contributions to open-source storage projects or the Linux storage stack.
  • Experience with hybrid storage models across on-prem and cloud environments.
  • Familiarity with high-throughput network topologies for storage backplanes (e.g., RoCE, RDMA, InfiniBand).

Responsibilities

  • Build automation and self-healing tools to monitor and maintain Crusoe’s distributed cloud storage infrastructure, which includes block, file, and object storage systems.
  • Drive reliability initiatives focused on data replication, encryption, backup and restore strategies, and robust failover mechanisms.
  • Collaborate closely with storage engineers to implement and maintain high-performance NVMe- and SSD-backed volumes that support large-scale AI compute clusters.
  • Support user-facing storage services with a focus on availability, performance tuning, and adherence to error budgets.
  • Investigate and resolve storage-related incidents using deep telemetry, logs, and performance profiling, and partner with hardware and kernel teams to diagnose low-level I/O issues and optimize I/O paths, cache policies, and file systems.
  • Contribute to the architecture of fault-tolerant, scalable storage backends tailored for AI-first cloud environments.

Benefits

  • Industry-competitive pay
  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company-paid commuter benefit of $300 per month