We are seeking a highly skilled and motivated HPC Systems Administrator to manage and support our high-performance computing (HPC) environment. The role involves maintaining and optimizing four unique HPC clusters, Globus data transfer nodes, GPU nodes, monitoring systems, IBM ESS storage appliances, GPFS (General Parallel File System), and the PBS Pro scheduler, as well as ensuring compliance with security and identity management requirements such as LDAP integration, Multi-Factor Authentication (MFA), and HSPD-12. The ideal candidate will ensure the reliability, performance, and scalability of our HPC infrastructure to support advanced computational workloads.

Key Responsibilities

HPC Cluster Management: Administer and maintain four unique HPC clusters, ensuring optimal performance and uptime. Perform system upgrades, patching, and configuration management. Troubleshoot and resolve hardware and software issues.

Data Transfer Nodes & Globus: Manage Globus data transfer nodes to facilitate efficient and secure data movement. Monitor and optimize data transfer performance across the clusters.

GPU Nodes Administration: Configure and maintain GPU nodes for computational workloads. Optimize GPU utilization for machine learning, AI, and other GPU-intensive applications.

Monitoring & Visualization: Implement and maintain monitoring tools such as Grafana to track system health and performance. Develop dashboards and alerts for proactive issue resolution.

Storage Management: Administer IBM ESS storage appliances and GPFS (Spectrum Scale) to ensure high availability and performance. Monitor storage usage and plan for capacity upgrades as needed.

Job Scheduling: Manage and optimize the PBS Pro scheduler for efficient job queuing and resource allocation. Troubleshoot scheduling issues and implement policies to improve throughput.

Identity & Access Management: Implement and manage LDAP integration for centralized authentication and directory services. Administer Linux account management, including user provisioning, permissions, and access controls. Configure and support Multi-Factor Authentication (MFA) solutions to enhance system security. Ensure compliance with HSPD-12 standards for identity verification and access control.

Documentation & Reporting: Maintain detailed documentation of system configurations, processes, and procedures. Generate regular reports on system performance, utilization, and incidents.

Collaboration & Support: Work closely with researchers, developers, and other stakeholders to understand their computational needs. Provide technical support and training to users of the HPC systems.

Security & Compliance: Implement security best practices to protect sensitive data and computational resources. Ensure compliance with organizational policies, industry standards, and government regulations such as HSPD-12.

May be required to perform other duties as assigned.
Job Type
Full-time
Career Level
Mid Level
Number of Employees
1,001-5,000 employees