HPC System Administrator Consultant

Louisiana State University

Onsite

About The Position

This position is for a "hands-on" IT Consultant in the High Performance Computing group in the Information Technology Services Department at LSU. The HPC IT Consultant specializes in hardware architecture and advanced troubleshooting to support, optimize, and maintain research computing infrastructure. This role is responsible for enabling both existing and emerging high performance computing initiatives through direct technical support, training, and maintenance of system hardware and software. This role is designed for a technical expert who is as comfortable troubleshooting physical hardware in the data center as writing complex automation scripts in a Red Hat Enterprise Linux (RHEL) environment. All Information Technology Services employees are expected to demonstrate a commitment to exemplary customer service in all facets of their work.

Requirements

Bachelor's degree with 3 years of experience (a Ph.D. in Computational Science, Engineering, or another computationally intensive discipline substitutes for 2 years of experience).
Experience in IT systems administration in Linux/HPC environments.
Strong knowledge of Linux/Unix operating systems.
Expertise in scripting and programming in bash and other languages.
Experience with HPC cluster resource managers and other management software such as Kickstart, DNF, RACADM, SSH keys, Ansible, etc.
Experience working with, managing, and repairing hardware in large complex HPC systems.
Proven experience troubleshooting complex hardware, networking, and performance issues in Linux-based HPC environments.
Candidates who have relevant experience in key job responsibilities are encouraged to apply; a degree is not required as long as the candidate meets the required years of experience specified in the job description.

Nice To Haves

Master's degree in Computational Science, Engineering, or a related computationally intensive discipline.
5 years of experience in Linux system administration in a large HPC deployment.
Experience in scripting and programming.
Experience with PBS/Torque and Moab, InfiniBand, and Lustre filesystems.
Experience working with computational research projects utilizing large and complex HPC systems.
Scripting skills in bash or similar language.
Experience with filesystems and hardware such as Lustre, GPFS, NAS, DDN, Panasas.
Experience with large-scale Linux deployments; RHEL, Fedora, or CentOS preferred.
Experience with Grid Computing systems and software such as XSEDE, TeraGrid, Open Science Grid.
Knowledge of HPC cluster resource managers such as Torque, SLURM, and Condor.
Experience with scientific application portals.
Well versed in computer fundamentals and protocols.
Experience with virtualization technologies (KVM).
Experience with containerization such as Docker and Singularity.
Experience working with large complex HPC systems.

Responsibilities

Provide expertise and leadership in verifying the quality of operations of Linux supercomputers, infrastructure systems, and other research computing systems. This includes, but is not limited to, performing daily system checks, analyzing system logs, troubleshooting hardware and software problems, monitoring and analyzing storage, infrastructure, and job performance, helping users recognize job performance problems, writing scripts to enhance monitoring, and responding to unplanned system events such as power outages. This may require travel to various HPC sites to maintain the physical installation of systems located off-site.
Proactively perform hardware maintenance on the clusters, cluster infrastructure, and other systems as needed. This includes diagnosing and fixing problems, which includes, but is not limited to, running diagnostics, re-seating DIMMs, replacing hard disks, calling vendors for RMA support, replacing motherboards, and handling return shipping of replacement parts.
Plan and perform software maintenance on both the clusters and the cluster infrastructure as needed. This includes, but is not limited to, installing operating systems, installing security patches, installing or upgrading drivers, upgrading firmware, installing or upgrading software licenses, and installing or upgrading software specific to HPC cluster management.
Investigate, architect, and implement new technology as appropriate to add new features to both the user environment and our deployment environment. This requires the ability to work without training to take a new technology from installation through to production. This also includes the ability to develop and document procedures related to that technology and to train other members of the group.
Respond to tickets, which include complaints, requests, troubleshooting, and assessments of storage options.
Provide training to groups or individuals as needed.
Other duties as assigned.