System Administrator

MGIS • Ottawa, ON

About The Position

MGIS is seeking a System Administrator, Level 2, to manage High Performance Computing (HPC) clusters and support the scientists who rely on them. This role blends HPC system administration with hands-on user support — helping researchers install, run, and debug applications on HPC infrastructure so they can focus on their science instead of IT issues. HPC environments in scope include clustered CPU/GPU systems with job schedulers and attached parallel storage (e.g., Lustre, GPFS).

Requirements

Solid experience administering Linux-based HPC clusters (CPU/GPU nodes, schedulers, parallel storage)
Hands-on experience with job schedulers such as PBS Pro/Torque, SLURM, or SGE
Experience troubleshooting CUDA installations, GPU failures, and driver issues
Familiarity with scientific computing toolchains — compilers (GNU, Intel), MPI implementations, EasyBuild, and Spack
Experience supporting researchers or end-users with application builds and runtime issues
Working knowledge of version control and configuration management tools (Git, Ansible, MS DevOps)
Comfortable working independently and producing clear technical documentation
Eligible to obtain and maintain a Secret-level security clearance

Responsibilities

Maintain the HPC cluster — hardware, image management, local networking, scheduler, and backups
Troubleshoot environment incidents to ensure a quick return to normal operations
Meet with scientists to evaluate their HPC support requirements
Develop task plans to meet researchers' needs, consulting the Technical Authority for approval
Support application builds, installs, and runtime troubleshooting across GNU, Intel, and NVIDIA toolchains, including Fortran
Support open-source and commercial software, including Python/Anaconda installs, Bash scripting, build/make tools, EasyBuild, Spack, and MPI implementations (MPICH, Open MPI, Intel MPI, HP-MPI)
Assist with compilation and runtime of in-house developed applications
Manage Linux OS patching schedules and maintain system reliability
Manage user accounts (creation, deletion) and environment modules
Manage configuration via Git, MS DevOps, and Ansible playbooks
Manage RPM/DEB packages and troubleshoot ThinLinc
Troubleshoot jobs on schedulers (PBS Pro/Torque, SLURM, SGE)
Ensure reliable CUDA installs; troubleshoot GPU failures and CUDA software/driver issues
Provide hardware support, including memory upgrades, storage arrays, power/network cabling, and iLO
Document every process and task to support enterprise knowledge continuity
Submit weekly progress reports to the Technical Authority