HPC Storage Engineer | Experienced Hire

Susquehanna International Group, LLP | Bala Cynwyd (Philadelphia Area), PA
Onsite

About The Position

We are looking for an experienced HPC Storage Engineer to design, implement, and optimize the storage and data movement infrastructure that underpins our high-performance computing (HPC) environment. This role focuses on distributed and parallel filesystems, storage systems, and large-scale data movement, ensuring reliable, high-throughput access to data for compute-intensive workloads. You will work closely with HPC platform engineers, compute and networking teams, and application users to deliver scalable, performant, and resilient storage solutions that tightly integrate the storage layer with compute nodes.

Requirements

  • Hands-on experience with parallel or distributed filesystems in production environments
  • Strong understanding of Linux systems administration
  • Experience with high-performance I/O, data locality, and throughput optimization
  • Proficiency in large-scale distributed systems development, preferably in C++
  • Proven ability to troubleshoot complex performance and reliability issues across storage and compute stacks
  • Experience with data transfer and movement tools

Nice To Haves

  • Familiarity with object storage and hierarchical storage management (HSM)
  • Experience integrating storage with HPC schedulers (e.g., Slurm) and compute workflows
  • Background supporting scientific, ML/AI, or other data-intensive workloads

Responsibilities

  • Design, deploy, and operate HPC storage systems and parallel/distributed filesystems (e.g., Lustre, GPFS/IBM Spectrum Scale, BeeGFS, Ceph)
  • Own data movement workflows across environments, including data ingest, replication, tiering, and archiving
  • Optimize filesystem and storage performance for large-scale parallel workloads
  • Design and tune load-balancing strategies across storage targets, metadata services, and data movement pipelines to ensure even utilization, high throughput, and predictable performance at scale
  • Troubleshoot storage, I/O, and data movement issues across HPC compute clusters
  • Develop and maintain automation for storage provisioning, monitoring, and lifecycle management
  • Partner with compute and networking teams to ensure end-to-end performance and reliability
  • Advise users and application teams on best practices for I/O patterns, data layout, and performance tuning
  • Evaluate and integrate new storage technologies and architectures as requirements evolve

What This Job Offers

Job Type

Full-time

Career Level

Mid Level

Education Level

No Education Listed

Number of Employees

1,001-5,000 employees

© 2024 Teal Labs, Inc