System Administrators (HPC) must provide High Performance Computing (HPC) services in the form of HPC enhanced sustainment capabilities to two geographically dispersed areas. These capabilities include multi-vendor HPC servers, HPC clusters, and SPD servers. The systems run Red Hat, CentOS, SUSE, and custom vendor-specific operating systems, with high-speed shared storage (Lustre and GPFS, for example) and dedicated high-speed, low-latency network interconnects such as InfiniBand and Slingshot.

High-speed shared parallel storage uses Lustre to provide performant shared storage between two or more HPCs in a data center. An interconnect service integrates HPC systems with a dedicated high-speed network that connects several storage appliances to dedicated HPC LNETs. These appliances would be available to various HPCs to enhance their capabilities.

The HPC Operations Team must implement and manage the monitoring capability required to track the health, status, and performance of the entire system, including its subcomponents (environmental, compute, storage, networks, and applications), using various COTS and GOTS toolsets such as Nagios, Splunk, and Prometheus.

System Administrators (HPC) must support the HPC and ABS (ABUNDANTSHIELD high-speed shared parallel storage) SRE teams and follow Government-designated policies and procedures developed to enhance the teams’ ability to perform their sustainment responsibilities and to improve customer mission operations. These policies and procedures, which the contractor must follow, include ticket tracking processes, change approval and change management processes, coding specifications, and support for (and adherence to) the processes and policies associated with the Government-designated (and deployed) base SRE tools.
Job Type
Full-time
Career Level
Mid Level