Senior High Performance Computing Administrator

Yale University | New Haven, CT
179d | $81,900 - $163,425 | Hybrid

About The Position

The Yale Center for Research Computing (YCRC) is looking for a versatile system administrator/engineer to help ensure that Yale's exceptional faculty and students have the infrastructure they need to propel discovery and scholarship to improve the world. Join our growing team of system specialists, research facilitators, and project administration experts, focusing your work especially on GPU infrastructure enhancements and improvements as part of Yale's comprehensive campus investment in AI. As an experienced subject matter expert, you will help lead the system design, deployment, and support of YCRC's AI-focused research cluster and storage infrastructure.

This role is both systems- and researcher-facing, so frequent interaction with other systems team members, research support specialists, and researchers is a routine part of the job. You will be expected to stay current on developments and trends in accelerator and overall high performance computing technologies, processes, and methodologies. We will look to you for insights on evolving tradeoffs in areas such as accelerator-based memory, precision, interconnects, power consumption, and cost.

This is a hybrid position, with YCRC's office space being on the Yale campus. As part of the systems team, you will be expected to provide on-site equipment maintenance as needed. Infrastructure is hosted at a Yale data center in West Haven, CT, and at the Massachusetts Green High Performance Computing Center (MGHPCC) in Holyoke, MA.

Requirements

  • Expertise in administration of HPC Linux clusters, including managing and configuring cluster provisioning and management tools and batch schedulers.
  • Experience with high-speed networking such as InfiniBand and high-speed Ethernet.
  • Experience with large storage systems and parallel file systems such as GPFS and Lustre.
  • Expertise in Linux system administration, including managing the operating system, networking, storage, and security.
  • Expertise in automation and scripting in at least one scripting language.
  • Ability to work in a team environment in a fast-moving technology field.
  • Excellent verbal and written communication skills.
  • Ability to interact well with team members and end users.
  • Ability to work independently and across units.
  • Attention to detail.

Nice To Haves

  • Experience with GPUs.
  • Ability to specify new systems, especially for AI and ML.
  • Experience configuring, deploying, and supporting large-scale systems in a research environment.
  • Expertise in computer security in large, multi-user Linux environments.
  • Experience with remote administration and with installing and troubleshooting hardware.
  • Expertise securing large Linux environments.

Responsibilities

  • Design, implement and advance core HPC systems such as the HPC provisioning system, the resource-management system, account/user lifecycle management, and user authentication and authorization systems.
  • Design, configure and support HPC clusters.
  • Install, administer and maintain hardware, system software, networking, accounts, and security measures.
  • Design and implement our parallel storage and backup systems in collaboration with the team.
  • Diagnose and correct system issues, whether they affect correct operation or performance.
  • Restore system integrity as quickly as possible after an outage to minimize downtime.
  • Triage and solve user-submitted tickets, especially when they relate to infrastructure.
  • Track resource usage using monitoring and queuing software.
  • Develop and maintain documentation for team members and end users.
  • Research developments in HPC architecture and new technologies, processes, and methodologies.
  • Patch system firmware and software as needed.
  • Determine specifications for new systems together with the team, and tailor them to meet business needs.
  • Conduct training and user education.
  • Perform other duties as assigned.