Staff HPC Infrastructure Engineer

Guardant Health, Palo Alto, CA
Posted 22d ago, Hybrid

About The Position

You enjoy an agile, very fast-paced, and highly technical environment. You are a self-driven, accomplished technologist who strives to keep improving your skills, your value to the company, and the computational infrastructure. You are dedicated to engineering excellence, yet pragmatic and flexible. You can maintain the day-to-day support SLA while running key projects that move the business forward.

Requirements

  • B.S. in Computer Science or related field
  • 4+ years of TCP/IP networking experience
  • 2+ years of RDMA networking experience
  • 4+ years of Linux/Unix administration, with knowledge of Unix network protocols, TCP/IP network fundamentals, core infrastructure technologies, and virtualization
  • 2+ years of experience with large-scale data storage and compute cluster (HPC) infrastructure
  • 2+ years working in and with on-premises and cloud-based (AWS, Google, IBM and Azure) data centers
  • 2+ years building software release and ops processes and automation tooling
  • 2+ years writing system administration documentation

Nice To Haves

  • Cisco Certified Network Professional certification
  • Experience with Arista and compatible networking, up to and including 400 Gb/s links
  • Experience with Mellanox InfiniBand fabric
  • Experience administering IBM’s General Parallel File System (GPFS)
  • Experience administering the Slurm scheduler
  • Experience using Warewulf
  • Experience with cloud bursting technologies
  • Experience with wide area file systems
  • Experience with Docker and container technologies
  • Experience with Kubernetes
  • Experience operating infrastructure compliant with HIPAA and SOX standards

Responsibilities

  • Act as a technical lead in day-to-day operations
  • Help manage the HPC interconnects
  • Help integrate the HPC systems with the bandwidth on-demand system
  • Help integrate the HPC system with the single namespace storage system
  • Help integrate cloud bursting as part of the HPC abstraction work
  • Work with the networking infrastructure team to manage and optimize the connectivity to and from the HPC systems and locales
  • Help manage multiple HPC clusters and cluster file systems
  • Help research, develop and implement the next generation HPC solution
  • Troubleshoot the production system stack down to the source code level (e.g., shell scripts, Python, and others)
  • Maintain, monitor, and support the infrastructure environment and/or facilities
  • Use and maintain enhanced production monitoring and related capabilities
  • Support improvements for increased system reliability and performance
  • Support multiple systems or applications of medium to high complexity (defined by size, technology used, and system feeds and interfaces) with multiple concurrent users, ensuring control, integrity, and accessibility
  • Support systems at remote locations, including internationally
  • Work with offsite consultants to maintain the infrastructure
  • Work with vendors to troubleshoot, upgrade and repair systems as needed
  • Participate in a 24/7 on-call rotation

What This Job Offers

Job Type: Full-time
Career Level: Mid Level
Number of Employees: 1,001-5,000