Senior Storage Engineer

Hydra Host•Miami, FL

1d•$150,000 - $200,000

About The Position

Hydra Host is a Founders Fund-backed NVIDIA cloud partner building the infrastructure platform that powers AI at scale. We connect AI Factories (high-performance GPU data centers) with the teams that depend on them: research labs training foundation models, enterprises running production inference, and developer platforms demanding scalable compute capacity. Through its Brokkr platform, Hydra Host is building the next-generation bare-metal GPU infrastructure network and marketplace. The company enables independent data centers to monetize GPU capacity while providing enterprises with scalable, high-performance access to NVIDIA-based compute (e.g., H100, H200, B200, L40S, RTX 4090).

As we expand our infrastructure capabilities, Hydra Host is seeking a Senior Storage Engineer to lead the architecture, development, and deployment of our next-generation AI/HPC storage platform. In this role, you will design and build Hydra Host’s first production-grade storage platform from the ground up, supporting the company’s rapidly expanding network of bare-metal GPU clusters. You’ll own the architecture, technology selection, implementation, and evolution of this platform, defining how Hydra Host manages data for large-scale, distributed AI workloads across global data centers.

This is a senior, hands-on role for an engineer who has built storage systems for GPU clusters before, with deep expertise in both block and object storage and a strong understanding of parallel file systems, performance optimization, and large-scale orchestration.

Requirements

8+ years of progressive, hands-on experience designing and implementing high-performance storage systems for compute clusters in HPC, AI, or bare-metal cloud environments.
Proven track record building storage infrastructure from scratch, not just operating existing systems.
Deep expertise in block storage (NVMe, SAN, Ceph, distributed block systems) and object storage (S3, MinIO, Ceph Object Gateway, etc.).
Strong background in parallel file systems (WekaIO, BeeGFS, Lustre, Spectrum Scale, or similar) supporting GPU or AI cluster workloads.
Solid foundation in Linux systems engineering, automation, and scripting for distributed environments.
Familiarity with BMC, Redfish APIs, and OEM server firmware for bare-metal management.
Deep understanding of AI/ML data pipelines: model checkpointing, data locality, and multi-tiered storage optimization.
Excellent problem-solving, debugging, and communication skills, with the ability to translate technical decisions into clear architectural direction.

Nice To Haves

Experience building storage solutions for large-scale GPU or HPC infrastructure.
History of technical leadership or mentorship, such as growing teams or owning a product roadmap.
Experience evaluating and managing vendor relationships and negotiating storage hardware/software contracts.
Contributions to open-source HPC or storage projects (Ceph, Lustre, BeeGFS, etc.).
Familiarity with confidential computing, secure data handling, or high-availability architectures.

Responsibilities

Define, architect, and implement Hydra Host’s first production storage platform tailored for bare-metal GPU clusters and AI/HPC workloads.
Lead all technical decisions around storage stack design, from hardware infrastructure to parallel file system orchestration and performance tuning.
Select, build, and maintain storage solutions spanning both the block storage layer (NVMe, SAN, Ceph, etc.) and the object storage layer (S3-compatible, custom, or Ceph Object Gateway).
Design for high-throughput, low-latency access, supporting large datasets, rapid checkpointing, and parallel access for distributed AI training workloads.
Integrate and optimize parallel file systems such as Lustre, BeeGFS, Spectrum Scale, WekaIO, or CephFS, ensuring maximum performance and fault tolerance.
Ensure compatibility across Hydra Host’s diverse GPU/OEM ecosystem, accounting for vendor-specific firmware, BMC/Redfish APIs, and hardware configurations.
Develop automation, observability, and management tooling for storage, focusing on reliability, scalability, and efficiency.
Act as both builder and architect: stay deeply hands-on in deployment, troubleshooting, and optimization while guiding the long-term storage roadmap.
Collaborate cross-functionally with GPU, HPC, and platform engineering teams to integrate storage with compute and network layers.
Interface with customers and product leadership to define feature priorities, performance benchmarks, and future enhancements.