Senior Storage Systems Engineer

Crusoe • San Francisco, CA

About The Position

At Crusoe, we are on a mission to align the future of computing with the future of the climate. As a Senior Storage Systems Engineer, you will be the primary operator of our high-performance data layer. This role focuses on the availability, scaling, and operational excellence of our all-flash storage ecosystems (specifically VAST Data or Pure Storage), ensuring they deliver the sub-millisecond latency required for world-class AI training and HPC workloads. You will lead the day-to-day administration of our global storage footprint, serving as the subject matter expert for our flash-based platforms. Your work ensures that our sustainable GPU clusters have the reliable, high-throughput data backbone needed to power the AI revolution.

Requirements

5–8+ years of experience in Storage Administration, including at least 3 years of hands-on experience managing VAST Data or Pure Storage in a production environment.
Deep understanding of NFS over RDMA, SMB, and NVMe-oF, and how they are implemented within VAST and Pure architectures.
Strong command of the Linux CLI, specifically for mounting, tuning, and troubleshooting high-performance file systems.
Understanding of how storage interacts with InfiniBand and RoCE fabrics to ensure low-latency data delivery to GPU nodes.
Proficiency in Python, Bash, or similar for automating volume creation, quota management, and reporting via storage APIs.
A meticulous approach to capacity planning and documentation, ensuring the environment remains stable as we add petabytes of scale.

Nice To Haves

Experience with Pure1 or VAST VMS/Insight for predictive analytics and capacity forecasting.
Familiarity with Slurm or Kubernetes (CSI) integration with high-performance storage.
Prior experience in a large-scale environment (multi-petabyte footprints).

Responsibilities

Own the end-to-end management of VAST Data (Universal Storage) and Pure Storage (FlashBlade/FlashArray) environments, including initial setup, volume provisioning, and export management.
Proactively monitor VAST and Pure clusters for IOPS, throughput, and latency bottlenecks, ensuring storage performance stays ahead of GPU demand.
Execute software upgrades (Purity//FB, VAST OS), expansion of D-Nodes/C-Nodes, and hardware refreshes with zero downtime for our AI customers.
Manage snapshots, replication policies, and data reduction (deduplication/compression) strategies to optimize TCO while ensuring 100% data durability.
Act as the lead technical point of contact for storage incidents, working directly with VAST and Pure support engineering to resolve complex fabric or metadata issues.
Use REST APIs and Python to automate provisioning and integrate storage health metrics into our centralized observability stack (Grafana/Prometheus).