HPC Systems Administrator

Argonne National Laboratory, Lemont, IL
1d · Onsite

About The Position

We are seeking a highly skilled and motivated HPC Systems Administrator to manage and support our high-performance computing (HPC) environment. The role involves maintaining and optimizing four unique HPC clusters, Globus data transfer nodes, GPU nodes, monitoring systems, IBM ESS storage appliances, GPFS (General Parallel File System), and the PBS Pro scheduler, as well as ensuring compliance with security and identity management requirements, including LDAP integration, Multi-Factor Authentication (MFA), and HSPD-12 compliance. The ideal candidate will ensure the reliability, performance, and scalability of our HPC infrastructure to support advanced computational workloads.

Requirements

  • Bachelor's degree and 6+ years of experience, Master's degree and 4+ years of experience, or equivalent
  • Bachelor's degree in Computer Science, Information Technology, or a related field.
  • 5+ years of experience in HPC systems administration or a similar role.
  • Proficiency in Linux/Unix system administration.
  • Experience with Globus, GPU nodes, and HPC cluster management.
  • Strong knowledge of IBM ESS storage appliances and GPFS.
  • Familiarity with PBS Pro scheduler and job queuing systems.
  • Expertise in LDAP integration, Linux account management, and Multi-Factor Authentication (MFA).
  • Hands-on experience with monitoring tools like Grafana.
  • Knowledge of HSPD-12 compliance requirements and implementation.
  • Excellent problem-solving and analytical skills.
  • Ability to work independently and manage multiple priorities.
  • Attention to detail and commitment to quality.
  • Ability to model Argonne’s core values of impact, safety, respect, integrity, and teamwork.
  • Interpersonal skills, oral and written communication skills, and ability to interact with people at all levels both within and outside the laboratory.

Nice To Haves

  • Master's degree in a relevant field.
  • Certifications in HPC, Linux, or storage technologies.
  • Experience with scripting languages (e.g., Python, Bash) for automation.
  • Knowledge of networking protocols and security practices.

Responsibilities

  • HPC Cluster Management: Administer and maintain four unique HPC clusters, ensuring optimal performance and uptime.
  • Perform system upgrades, patching, and configuration management.
  • Troubleshoot and resolve hardware and software issues.
  • Data Transfer Nodes & Globus: Manage Globus data transfer nodes to facilitate efficient and secure data movement.
  • Monitor and optimize data transfer performance across the clusters.
  • GPU Nodes Administration: Configure and maintain GPU nodes for computational workloads.
  • Optimize GPU utilization for machine learning, AI, and other GPU-intensive applications.
  • Monitoring & Visualization: Implement and maintain monitoring tools such as Grafana to track system health and performance.
  • Develop dashboards and alerts for proactive issue resolution.
  • Storage Management: Administer IBM ESS storage appliances and GPFS (Spectrum Scale) to ensure high availability and performance.
  • Monitor storage usage and plan for capacity upgrades as needed.
  • Job Scheduling: Manage and optimize PBS Pro scheduler for efficient job queuing and resource allocation.
  • Troubleshoot scheduling issues and implement policies to improve throughput.
  • Identity & Access Management: Implement and manage LDAP integration for centralized authentication and directory services.
  • Administer Linux account management, including user provisioning, permissions, and access controls.
  • Configure and support Multi-Factor Authentication (MFA) solutions to enhance system security.
  • Ensure compliance with HSPD-12 standards for identity verification and access control.
  • Documentation & Reporting: Maintain detailed documentation of system configurations, processes, and procedures.
  • Generate regular reports on system performance, utilization, and incidents.
  • Collaboration & Support: Work closely with researchers, developers, and other stakeholders to understand their computational needs.
  • Provide technical support and training to users of the HPC systems.
  • Security & Compliance: Implement security best practices to protect sensitive data and computational resources.
  • Ensure compliance with organizational policies, industry standards, and government regulations such as HSPD-12.
  • May be required to perform other duties as assigned.

Benefits

  • Comprehensive benefits are part of the total rewards package.
  • As an equal employment opportunity employer, and in accordance with our core values of impact, safety, respect, integrity and teamwork, Argonne National Laboratory is committed to a safe and welcoming workplace that fosters collaborative scientific discovery and innovation.
  • Argonne encourages everyone to apply for employment.
  • Argonne is committed to nondiscrimination and considers all qualified applicants for employment without regard to any characteristic protected by law.

What This Job Offers

Job Type

Full-time

Career Level

Mid Level

Number of Employees

1,001-5,000 employees

© 2024 Teal Labs, Inc