Sr. HPC Systems Architect (Storage)

KLA • Ann Arbor, MI

Onsite

About The Position

This role provides senior technical leadership for the architecture, deployment, and long‑term scalability of large‑scale HPC storage and compute platforms. The architect owns systems end‑to‑end, from early architectural definition through full production, partnering across engineering, manufacturing, and strategic vendors to deliver highly available, high‑performance infrastructure at scale. The scope emphasizes deep technical ownership, architectural decision‑making, and solving complex infrastructure challenges in live production environments. The work directly shapes critical HPC platforms built for adaptability, scale, and operational excellence, with real‑world impact across core products and technologies.

Requirements

Proven experience with HPC systems, storage, or large‑scale Linux infrastructure
Deep, hands‑on expertise in HPC storage and Linux‑based infrastructure
Strong, distro‑agnostic Linux experience (Rocky, RHEL, SUSE, Ubuntu)
Proven experience designing and operating large‑scale parallel storage systems
Strong understanding of HPC hardware platforms (servers, GPUs, networking, storage, BIOS/BMC)
Advanced Linux systems knowledge (PXE/netboot, systemd, HA concepts)
Solid networking fundamentals (TCP/IP, DNS, DHCP, LDAP, HTTP)
Strong scripting skills in Shell and Python
Experience with configuration management and automation (Salt, Puppet, Chef, etc.)
Minimum of 8 years of related experience with a Bachelor's degree, 6 years with a Master's degree, or 3 years with a PhD; or equivalent experience

Nice To Haves

Strong DevOps and automation mentality (CI/CD pipelines, Git, infrastructure as code)
Experience with containers for HPC (Singularity, Docker)
Monitoring and observability experience (Prometheus, Grafana)
Familiarity with Apache/Nginx and supporting infrastructure services

Responsibilities

Lead the design, implementation, and ongoing support of high‑performance computing (HPC) clusters, taking accountability for system performance, reliability, and scalability
Serve as a technical authority for HPC storage, with deep hands‑on expertise in parallel file systems such as Lustre, GPFS, and BeeGFS
Apply advanced systems knowledge across CPU and GPU architectures, high‑bandwidth interconnects, and robust storage subsystems to deliver balanced, high‑performance solutions
Lead the creation of hardware BOMs for HPC clusters, working directly with vendors and coordinating hardware release activities
Design, configure, and optimize Linux operating systems for HPC environments
Translate project specifications and performance requirements into subsystem and system‑level designs, driving execution while meeting technical and schedule commitments
Support the design, release, and transition of new systems to manufacturing and customers, providing high‑quality golden images, procedures, scripts, and documentation
Lead EOL part re‑qualification activities to ensure long‑term system viability and supportability