High Performance Computing (HPC) Platform Engineer

Numerical Algorithms Group
$150,000 | Hybrid

About The Position

Are you a High-Performance Computing (HPC) Platform Engineer who enjoys collaborating with a team of skilled, friendly, and supportive colleagues on a wide range of technically challenging projects? Do you have the expertise to build, operate, and optimize HPC systems? If so, we’d love to hear from you.

This role offers the opportunity to work on complex high-performance computing environments that support seismic data processing, reservoir visualization, and well-planning workflows running across distributed HPC systems. You’ll be involved in building and tuning HPC platforms, diagnosing and resolving performance bottlenecks, and enabling users to get the most out of advanced computing systems.

If you are seeking to join a long-established, successful company that values teamwork, offers family-friendly and flexible working arrangements, and supports a healthy work-life balance, then nAG could be the perfect fit. nAG is a market leader in technical software and high-performance computing, and this is an exciting opportunity to play a key role in contributing to and shaping our growing HPC Services team.

The ideal candidate will be an innovative thinker who can go beyond traditional approaches. You should also have strong communication skills, with the ability to explain complex technical concepts to both technical and non-technical audiences. This role requires flexibility and the ability to manage multiple projects while prioritizing business outcomes.

Requirements

  • Bachelor’s degree (or equivalent) in Computer Engineering, Computer Science, or a related engineering discipline
  • 5+ years of experience deploying and administering HPC clusters
  • Great communication skills including the ability to understand computational scientists and their domain-specific terminology
  • Solid understanding of HPC and accelerated computing within engineering or academic research environments
  • Experience with network-distributed, multi-node applications
  • Understanding of core HPC components, including network fabrics, types of parallel filesystems, memory hierarchies, and accelerators
  • Deep understanding of operating systems, computer networks, and HPC applications
  • C/C++/Python/Bash programming and scripting experience
  • Experience with automation tools, including GitLab CI/CD pipelines
  • Experience with scheduling and resource management systems (e.g., Slurm)
  • Experience installing, supporting, and upgrading Lustre filesystems
  • Strong Linux systems administration skills
  • Experience with HPC workflows using MPI
  • Experience with package management tools such as Conda, Spack, and RPM
  • Ability to manage multiple priorities effectively in a dynamic environment

Nice To Haves

  • Strong understanding of and hands-on experience with Slurm configuration
  • Exposure to container technologies for HPC applications
  • Broad experience across HPC systems, with the ability to diagnose and resolve complex runtime issues across systems, networks, filesystems, and applications
  • Experience designing HPC clusters, including considerations for power, cooling, networking, compute, and storage

Responsibilities

  • Configuring, optimizing, and managing HPC clusters, storage systems, and networking components to ensure performance and scalability
  • Supporting HPC environment operations, maintenance, and continuous improvement
  • Troubleshooting hardware and software issues across the HPC stack
  • Implementing security measures and maintaining system integrity
  • Collaborating with data scientists, researchers, and domain specialists to support and streamline workflows
  • Monitoring system performance and identifying opportunities for improvement
  • Performing system upgrades and ensuring platforms meet user and organizational needs

Benefits

  • competitive salary (dependent on your experience)
  • 401(k) plan with company match up to 5%
  • health/dental/life/short-term and long-term disability insurance
  • 20 vacation days, plus 4 additional days that must be taken between the Christmas and New Year’s holidays
  • paid sick days
  • maternity and paternity leave