IT044 High Performance Computing (HPC) and Storage System Administrator

Support•Greenbelt, MD

4h•Onsite

About The Position

This job description is for a High Performance Computing and Storage System Administrator to support the operations of the Integrated Modeling Computing Center (IMCC), formerly known as the NASA Center for Climate Simulation (NCCS). The IMCC will directly support the Integrated Modeling Virtual Institute (IMVI) in meeting NASA's Earth science modeling needs. The sections below describe the required technical skills and the core duties and responsibilities. Ideal candidates have excellent communication and problem-solving skills and the ability to work efficiently within a high-performing team environment.

Requirements

Advanced, production-level expertise in enterprise Linux distributions (RHEL, Rocky Linux, AlmaLinux, or Ubuntu Server), including expert-level command-line proficiency, kernel tuning, and automation scripting (Bash, Python).
Hands-on experience in the design, deployment, scaling, and/or optimization of high-performance file systems.
Experience in deploying, configuring, and operating IBM Spectrum Scale and/or Lustre.
Working familiarity with HPC resource management, including experience with Slurm.
Strong foundation in core security practices, including firewalls, identity management (LDAP/Active Directory), access control lists (ACLs), SSH hardening, and continuous patch management cycles.
Experience operating within modern Agile frameworks (Scrum, Kanban), including working in iterative workflows, participating in sprint reviews, and using collaborative project boards (Jira, GitLab) to track milestones.
Proficiency in configuring and maintaining GPU-accelerated computing environments, including driver installation/management, CUDA or similar library configuration, and performance tuning for accelerated workloads.
An MS degree and 5+ years of experience in relevant work areas.
U.S. citizenship required.
Ability to obtain and maintain a Tier 1 or Tier 2 Investigation through NASA.

Responsibilities

Perform day-to-day operations and management of large-scale supercomputing clusters to meet required availability and performance levels, including but not limited to integration, provisioning, software stack deployment, updates, hardware and software maintenance, and decommissioning.
Deploy, configure, tune, maintain, and operate large-scale parallel file systems.
Manage, configure, optimize, and troubleshoot cluster management and job scheduling software.
Proactively implement security updates, coordinate systematic operating system kernel patches, and mitigate vulnerabilities across computing and storage environments without compromising system stability.
Coordinate vendor-supported maintenance schedules, conduct hardware and software diagnostics, and participate in rapid-response resolution during service degradations or system outages.
Provide specialized, tiered technical assistance ranging from software provisioning and workflow optimization to expert-level troubleshooting of complex research computing problems.
Provision, configure, and maintain GPU-accelerated computing systems, including driver management, library configuration, and performance optimization for workload acceleration.