Staff HPC Infrastructure Engineer

Guardant Health•Palo Alto, CA

66d•Hybrid

About The Position

You enjoy an agile, very fast paced and highly technical environment. You are a self-driven accomplished technologist who strives to be ever improving your skills, value to the company and improve the computational infrastructure. You are dedicated to engineering excellence yet pragmatic and flexible. You have the ability to maintain the day-to-day support SLA while running various key projects that move the business forward.

Requirements

B.S. in Computer Science or related field
4+ years of TCP/IP networking experience
2+ years of RDMA networking experience
4+ years of Linux/Unix administration, knowledge of Unix network protocols, TCP/IP network fundamentals, core infrastructure technologies and virtualization
2+ years of large-scale data storage and compute clusters (HPC) infrastructure
2+ years working in and with on-premise and cloud-based (AWS, Google, IBM and Azure) data-centers
2+ years of building software release and ops processes and automation toolset
2+ years providing documentation of system administration

Nice To Haves

Cisco Certified Network Professional certification
Experience with Arista and compatible networking, up to and including 400 gb/s links
Experience with Mellanox infiniband fabric
Experience administering IBM’s General Parallel File System
Experience administering SLURM scheduler
Experience with using warewulf
Experience with cloud bursting technologies
Experience with wide area file systems
Experience with docker and container technologies
Experience with Kubernetes
Operating infrastructure compliant with HIPAA and SOX standards

Responsibilities

Act as a technical lead in day to day operations
Help manage the HPC interconnects
Help integrate the HPC systems with the bandwidth on-demand system
Help integrate the HPC system with the single namespace storage system
Help integrate cloud bursting as part of the HPC abstraction work
Work with the networking infrastructure team to manage and optimize the connectivity to and from the HPC systems and locales
Help manage multiple HPC clusters and cluster file systems.
Help research, develop and implement the next generation HPC solution
Troubleshoot the production system stack down to source code level e.g. shell scripts, python and others.
Maintain, monitor, and support the infrastructure environment and/or facilities.
Use and maintain enhanced production monitoring and additional capability.
Support improvements for increased system reliability and performance.
Support multiple systems or applications of medium to high complex (complexity defined by size, technology used, and system feeds and interfaces) with multiple concurrent users, ensuring control, integrity, and accessibility.
Support systems at remote locations, including internationally
Work with offsite consultants to maintain the infrastructure
Work with vendors to troubleshoot, upgrade and repair systems as needed
Participate in a 24/7 on-call rotation