System Administrator

MGIS · Ottawa, ON

About The Position

MGIS is seeking a System Administrator, Level 2, to manage High Performance Computing (HPC) clusters and support the scientists who rely on them. This role blends HPC system administration with hands-on user support — helping researchers install, run, and debug applications on HPC infrastructure so they can focus on their science instead of IT issues. HPC environments in scope include clustered CPU/GPU systems with job schedulers and attached parallel storage (e.g., Lustre, GPFS).

Requirements

  • Solid experience administering Linux-based HPC clusters (CPU/GPU nodes, schedulers, parallel storage)
  • Hands-on experience with job schedulers such as PBS Pro/Torque, SLURM, or SGE
  • Experience troubleshooting CUDA installations, GPU failures, and driver issues
  • Familiarity with scientific computing toolchains — compilers (GNU, Intel), MPI implementations, EasyBuild, and Spack
  • Experience supporting researchers or end-users with application builds and runtime issues
  • Working knowledge of version control and configuration management tools (Git, Ansible, MS DevOps)
  • Comfortable working independently and producing clear technical documentation
  • Eligible to obtain and maintain a Secret-level security clearance

Responsibilities

  • Maintain the HPC cluster — hardware, image management, local networking, scheduler, and backups
  • Troubleshoot environment incidents to ensure a quick return to normal operations
  • Meet with scientists to evaluate their HPC support requirements
  • Develop task plans to meet researchers' needs, consulting the Technical Authority for approval
  • Support application builds, installs, and runtime troubleshooting (GNU, Intel, Fortran, Nvidia)
  • Support open-source and commercial software, including Python/Anaconda installs, Bash scripting, build/make tools, EasyBuild, Spack, and MPI implementations (MPICH, Open MPI, Intel MPI, HP-MPI)
  • Assist with compilation and runtime of in-house developed applications
  • Manage Linux OS patching schedules and reliability
  • Manage user accounts (creation, deletion) and environment modules
  • Manage configuration via Git, MS DevOps, and Ansible Playbooks
  • Manage RPM/DEB packages and troubleshoot ThinLinc
  • Troubleshoot jobs on schedulers (PBS Pro/Torque, SLURM, SGE)
  • Ensure reliable CUDA installs; troubleshoot GPU failures and CUDA software/driver issues
  • Provide hardware support: memory upgrades, storage arrays, power/network cabling, iLO
  • Document every process and task to support enterprise knowledge continuity
  • Submit weekly progress reports to the Technical Authority