Staff HPC Engineer

KLA • Milpitas, CA

6d • $162,700 - $284,700 • Onsite

About The Position

The Staff HPC Engineer designs, builds, optimizes, and supports large-scale compute environments used for scientific computing, AI/ML workloads, simulation, and data-intensive research. This role blends systems engineering, performance tuning, cluster architecture, and hands-on troubleshooting. The engineer partners with researchers, developers, and IT teams to deliver reliable, scalable, high-performance compute infrastructure.

Requirements

Extensive experience with Linux systems engineering in large‑scale compute environments.
Solid understanding of distributed systems and cloud infrastructure.
Deep knowledge of HPC schedulers (Slurm preferred), MPI stacks, and parallel computing models.
Strong understanding of high‑speed interconnects (InfiniBand, RoCE) and distributed storage systems.
Proficiency in programming and scripting languages (Python, Go, Bash) and automation frameworks.
Experience with GPUs (NVIDIA CUDA, MIG, NVLink) and accelerator‑based computing.
Familiarity with containerization (Singularity/Apptainer, Docker) in HPC contexts.
Strong troubleshooting skills across hardware, OS, and application layers.
Understanding of networking fundamentals (TCP/IP, DNS, load balancing).
Background in high-availability and distributed systems at scale.
Excellent communication and cross‑functional collaboration skills.
Ability to translate research needs into technical solutions.
Strong ownership mindset and ability to lead complex initiatives.

Responsibilities

Design and implement HPC clusters, including compute, storage, networking, and job‑scheduling components.
Evaluate and integrate new technologies (GPUs, accelerators, interconnects, filesystems).
Develop automation for cluster provisioning, configuration, and lifecycle management.
Architect solutions for large‑scale parallel workloads, AI/ML pipelines, and data‑intensive applications.
Profile and tune applications for CPU, GPU, memory, and I/O performance.
Optimize workloads built on MPI, OpenMP, CUDA, and other parallel programming frameworks.
Benchmark hardware and software stacks to guide procurement and architecture decisions.
Maintain and monitor HPC clusters, job schedulers (Slurm, PBS, LSF), and distributed filesystems (Lustre, GPFS, BeeGFS).
Troubleshoot complex system issues across compute, storage, and network layers.
Implement security best practices, patching, and compliance controls.
Ensure high availability and efficient resource utilization.
Build and maintain CI/CD pipelines for HPC‑related software and infrastructure.
Use tools such as Ansible, Terraform, Kubernetes, or custom scripts to automate workflows.
Develop monitoring and observability solutions (Prometheus, Grafana, ELK, etc.).
Work closely with researchers, data scientists, and engineering teams to support workload optimization.
Provide technical leadership, mentorship, and guidance to junior engineers.
Document architectures, procedures, and best practices.
Participate in capacity planning and long‑term HPC strategy.