HPC Systems Engineer

KLA•Milpitas, CA

1d•$159,500 - $271,200•Onsite

About The Position

In this role, you will lead the architecture, deployment, and operational support of a high-performance computing (HPC) cluster platform used across IC fabrication facilities and mask shops globally. You will partner with engineering stakeholders to gather requirements, design scalable solutions, and drive implementation from concept through production. This role requires a strong balance of systems architecture, hands-on engineering, and operational excellence in complex HPC environments.

Requirements

Deep expertise in Linux operating systems (SUSE, Red Hat, Rocky Linux, Ubuntu)
Strong experience architecting and maintaining robust storage systems
Solid understanding of HPC hardware ecosystems, including servers, GPUs, networking, storage, schedulers, BIOS, and BMC
Experience with virtualization technologies such as VMware, Proxmox, or XCP-ng
Strong understanding of TCP/IP fundamentals and network protocols (DNS, DHCP, HTTP, LDAP, SMTP)
Experience with file sharing technologies (NFS, CIFS)
Familiarity with net boot/PXE and high-availability Linux configurations
Proficiency in scripting and development using Shell and Python
Experience with configuration management tools (Ansible, Salt, Chef, Puppet)
Strong DevOps mindset, including CI/CD pipelines and Git-based repositories
Experience with HPC schedulers (SGE, SLURM)
Familiarity with web servers and traffic management (Apache, Nginx, reverse proxy, load balancing via HAProxy)
Monitoring and observability tools (Prometheus, Grafana, Nagios)
Database experience with MySQL
Doctorate (Academic) Degree and related work experience of 3+ years; Master's Level Degree and related work experience of 6+ years; Bachelor's Level Degree and related work experience of 8+ years

Responsibilities

Design and architect scalable, high-performance HPC cluster solutions for global manufacturing environments
Lead deployment, configuration, and lifecycle management of cluster infrastructure
Collaborate with developers and cross-functional teams to understand requirements and translate them into technical solutions
Drive solutions from design through production, including implementation, validation, and support
Ensure system reliability, performance, and availability across compute, storage, and networking layers
Support ongoing operations, troubleshooting, and continuous improvement of HPC systems
Contribute to automation, standardization, and DevOps best practices across the platform