Johns Hopkins University
Full-time • Mid Level
Baltimore, MD
5,001-10,000 employees
Educational Services

IT@JH Research Computing is seeking a Sr. HPC Systems Engineer who will design, build, and maintain advanced high-performance computing environments supporting Johns Hopkins University's research mission. This position focuses on the reliable operation, configuration, and optimization of HPC and AI systems, including multi-node CPU and GPU clusters, high-speed InfiniBand and Ethernet networks, and large-scale parallel and object storage. The engineer implements and automates secure, efficient, and reproducible computing platforms used by faculty, researchers, and students across diverse scientific disciplines.

Assignments include both ticket-based support and project-based deployments. The role operates with moderate independence, collaborates closely with the IT Architect for Research Computing, and reports to the IT Manager for Research Computing to ensure scalable, sustainable, and high-performance systems that enable cutting-edge scientific discovery.

Specific Duties & Responsibilities

In addition to the duties described above, the engineer will:

  • Support and administer production systems used by researchers and Research Centers.
  • Provide technical leadership/project management for system configuration, implementation, management, and user support for both new and existing systems.
  • Research and recommend new functionality for HPC management and administration tools by exploring system-wide impacts and working with functional users to define current and future processes.
  • Apply expertise in architecting, operating, and debugging large-scale HPC network and storage infrastructure, including MPI, NCCL, RDMA, InfiniBand, and parallel file systems.
  • Work with scientific support specialists to assign tasks to, and provide oversight of, the HPC engineering team that supports researchers using a broad spectrum of applications from diverse fields.
  • Analyze results of server monitoring and implement changes to improve performance, processing, and utilization.
  • Propose, maintain, and enforce policies, practices, and security procedures.
  • Provide break/fix support, setup/installation support, escalation support, and solutions support.
  • Collaborate closely with a variety of stakeholders, both internal and external, on all aspects of projects.
  • Other duties as assigned.
  • Deploy, configure, and maintain large-scale Linux-based HPC clusters comprising CPU and GPU nodes, high-speed interconnects, and parallel file systems.
  • Implement and optimize workload schedulers (Slurm) and job submission policies to maximize system throughput and fair-share usage (see the scheduler sketch after this list).
  • Administer and monitor distributed storage systems (GPFS, Lustre, WekaFS, Ceph, MinIO) to ensure reliability and performance across multi-petabyte environments.
  • Maintain high-speed fabric and network infrastructure (InfiniBand, Ethernet) to support low-latency data transfer and MPI workloads.
  • Support research groups in deploying, testing, and optimizing scientific applications and AI/ML workflows on shared computing resources.
  • Develop and maintain automation and monitoring frameworks for system provisioning, metrics collection, and alerting (Prometheus, Grafana, ELK); see the monitoring sketch after this list.
  • Participate in capacity planning, hardware lifecycle management, and evaluation of new technologies in collaboration with architects and management.
  • Ensure security and compliance through configuration hardening, patch management, and integration with campus identity and access control systems.
  • Document system designs, procedures, and troubleshooting guides to support knowledge transfer and team continuity.
  • Contribute to a collaborative engineering culture that emphasizes service quality, innovation, and continuous improvement in research computing operations.
  • Participate in on-call rotation to ensure high availability and timely response to system alerts.
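
As a minimal sketch of the day-to-day scheduler-facing tooling referenced in the Slurm bullet above, the following Python example summarizes jobs and nodes per partition. It assumes only that the standard Slurm client commands (squeue, sinfo) are on the PATH; the output fields and formatting are illustrative choices, not a JHU standard.

```python
"""Illustrative sketch: summarize Slurm queue pressure per partition."""
import subprocess
from collections import Counter

def slurm_lines(cmd):
    # Run a Slurm client command and return its non-empty output lines.
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return [line for line in out.splitlines() if line.strip()]

def queue_summary():
    # squeue -h suppresses the header; %P = partition, %T = job state.
    counts = Counter(tuple(line.split()[:2])
                     for line in slurm_lines(["squeue", "-h", "-o", "%P %T"]))
    for (partition, state), n in sorted(counts.items()):
        print(f"{partition:<20} {state:<12} {n}")

def node_summary():
    # sinfo -h -o: %P = partition, %D = node count, %T = node state.
    for line in slurm_lines(["sinfo", "-h", "-o", "%P %D %T"]):
        partition, count, state = line.split()[:3]
        print(f"{partition:<20} {count:>6} nodes  {state}")

if __name__ == "__main__":
    print("== Jobs by partition and state ==")
    queue_summary()
    print("== Nodes by partition and state ==")
    node_summary()
```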
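Similarly, a minimal sketch of the metrics-collection work named in the monitoring bullet (Prometheus, Grafana, ELK): it assumes the prometheus_client Python package and the nvidia-smi CLI are installed, and the metric name, port, and polling interval are arbitrary illustrative assumptions rather than a site convention.

```python
"""Illustrative sketch: expose GPU utilization as Prometheus metrics."""
import subprocess
import time

from prometheus_client import Gauge, start_http_server

# Hypothetical metric name chosen for illustration only.
GPU_UTIL = Gauge("node_gpu_utilization_percent",
                 "GPU utilization reported by nvidia-smi", ["gpu"])

def poll_gpus():
    # nvidia-smi prints one CSV line per GPU containing just the utilization number.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True).stdout
    for idx, value in enumerate(out.splitlines()):
        GPU_UTIL.labels(gpu=str(idx)).set(float(value.strip()))

if __name__ == "__main__":
    start_http_server(9101)   # Port for Prometheus to scrape (arbitrary choice).
    while True:
        poll_gpus()
        time.sleep(15)
```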
Minimum Qualifications

  • Bachelor's Degree.
  • Six years of related experience.
  • Additional education may substitute for required experience, and additional related experience may substitute for required education beyond a high school diploma/graduation equivalent, to the extent permitted by the JHU equivalency formula.
Preferred Qualifications

  • Eight or more years of experience in high-performance computing systems administration or engineering, including experience with cluster management, workload scheduling (e.g., Slurm), and distributed or parallel storage.
  • Deep proficiency in Linux systems administration, configuration management (Ansible, Puppet, or Salt), performance monitoring, and tuning for HPC workloads.
  • Experience with high-speed interconnects (InfiniBand, 100/400 Gb Ethernet) and parallel file systems (e.g., GPFS, Lustre, BeeGFS, or WekaFS).
  • Working knowledge of containerization and orchestration (Singularity, Docker, Kubernetes for HPC).
  • Ability to automate deployments and routine operations through scripting (Bash, Python); see the scripting sketch after this list.
  • Familiarity with data-center operations, GPU acceleration, and research software environments (e.g., CUDA, MPI, AI/ML frameworks).
  • Strong analytical and troubleshooting skills, with proven ability to support complex research workloads in multi-user, multi-tenant environments.
  • Experience collaborating with faculty and research groups to translate scientific requirements into practical and performant computing solutions.
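
As a minimal illustration of the scripting and interconnect familiarity listed above, the sketch below flags InfiniBand ports that are not in the Active state. It assumes the ibstat tool (infiniband-diags) is available and that its output contains per-port "State:" lines, as it typically does; the exit-code convention is an arbitrary choice for illustration.

```python
"""Illustrative sketch: flag InfiniBand ports that are not Active."""
import subprocess
import sys

def inactive_ib_ports():
    # ibstat reports one "State:" line per port; anything not Active is suspect.
    out = subprocess.run(["ibstat"], check=True,
                         capture_output=True, text=True).stdout
    return [line.strip() for line in out.splitlines()
            if line.strip().startswith("State:") and "Active" not in line]

if __name__ == "__main__":
    problems = inactive_ib_ports()
    if problems:
        print("InfiniBand ports not Active:")
        for p in problems:
            print(" ", p)
        sys.exit(1)
    print("All InfiniBand ports report Active.")
```

A check like this could run from cron or a node-level monitoring agent and feed the same alerting stack described in the duties above.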