High Performance Computing (HPC) Platform Engineer

Numerical Algorithms Group

8h•$150,000•Hybrid

About The Position

Are you a High-Performance Computing (HPC) Platform Engineer who enjoys collaborating with a team of skilled, friendly, and supportive colleagues on a wide range of technically challenging projects? Do you have the expertise to build, operate, and optimize HPC systems? If so, we’d love to hear from you.

This role offers the opportunity to work on complex high-performance computing environments that support seismic data processing, reservoir visualization, and well-planning workflows running across distributed HPC systems. You’ll be involved in building and tuning HPC platforms, diagnosing and resolving performance bottlenecks, and enabling users to get the most out of advanced computing systems.

If you are seeking to join a long-established, successful company that values teamwork, offers family-friendly and flexible working arrangements, and supports a healthy work-life balance, then nAG could be the perfect fit. As nAG is a market leader in technical software and high-performance computing, this is an exciting opportunity to play a key role in contributing to and shaping our growing HPC Services team.

The ideal candidate will be an innovative thinker who can go beyond traditional approaches. You should also have strong communication skills, with the ability to explain complex technical concepts to both technical and non-technical audiences. This role requires flexibility and the ability to manage multiple projects while prioritizing business outcomes.

Requirements

Bachelor’s degree (or equivalent) in Computer Engineering, Computer Science, or a related engineering discipline
5+ years of experience deploying and administering HPC clusters
Strong communication skills, including the ability to understand computational scientists and their domain-specific terminology
Solid understanding of HPC and accelerated computing within engineering or academic research environments
Experience with network-distributed, multi-node applications
Understanding of core HPC components, including network fabrics, types of parallel filesystems, memory hierarchies, and accelerators
Deep understanding of operating systems, computer networks, and HPC applications
C/C++/Python/Bash programming and scripting experience
Experience with automation tools, including GitLab CI/CD pipelines
Experience with scheduling and resource management systems (e.g. Slurm)
Experience installing, supporting, and upgrading Lustre filesystems
Strong Linux systems administration skills
Experience with HPC workflows using MPI
Experience with package management tools such as Conda, Spack, and RPM
Ability to manage multiple priorities effectively in a dynamic environment

Nice To Haves

Strong understanding of and hands-on experience with Slurm configuration
Exposure to container technologies for HPC applications
Broad experience across HPC systems, with the ability to diagnose and resolve complex runtime issues across systems, networks, file systems, and applications
Experience designing HPC clusters, including considerations for power, cooling, networking, compute, and storage

Responsibilities

Configuring, optimizing, and managing HPC clusters, storage systems, and networking components to ensure performance and scalability
Supporting HPC environment operations, maintenance, and continuous improvement
Troubleshooting hardware and software issues across the HPC stack
Implementing security measures and maintaining system integrity
Collaborating with data scientists, researchers, and domain specialists to support and streamline workflows
Monitoring system performance and identifying opportunities for improvement
Performing system upgrades and ensuring platforms meet user and organizational needs

Benefits

Competitive salary (dependent on your experience)
401(k) plan with company match up to 5%
Health, dental, life, and short-term and long-term disability insurance
20 vacation days, plus 4 additional days that must be taken between the Christmas and New Year’s holidays
Paid sick days
Maternity and paternity leave
