HPC Systems Engineer

KLA, Ann Arbor, MI

About The Position

We’re looking for an HPC Systems Engineer to help power the compute infrastructure behind our R&D innovation. In this role, you’ll support and evolve a high-performance Linux cluster used for physics modeling, simulation, algorithm development, and machine-learning workloads, enabling hundreds of engineers to do their best work every day. You’ll play a key role in driving the reliability, performance, and scalability of a shared, mission-critical HPC environment, partnering closely with infrastructure, DevOps, and application teams to keep the platform fast, resilient, and ready for the most demanding computational challenges.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience
  • 3+ years of hands‑on Linux systems administration experience
  • Direct experience working with HPC or large‑scale compute environments
  • Practical experience with at least one HPC scheduler (SLURM, LSF, PBS, or similar)
  • Strong Linux troubleshooting skills (processes, memory, I/O, networking, performance analysis)
  • Comfort working in CLI‑driven, production infrastructure environments

Nice To Haves

  • Experience supporting GPU‑accelerated workloads (CUDA, drivers, GPU scheduling concepts)
  • Familiarity with parallel computing or scientific/engineering workloads
  • Experience with cluster storage systems (e.g., Lustre, BeeGFS, NFS, or high‑performance NAS/SAN)
  • Exposure to automation tools (Ansible, scripting, Infrastructure‑as‑Code concepts)
  • Familiarity with containers in HPC contexts (Singularity / Apptainer, rootless containers)
  • Experience supporting internal developer or research communities

Responsibilities

  • HPC Platform Operations
      • Operate and maintain a large-scale Linux-based HPC cluster used for internal R&D workloads
      • Manage compute nodes, login nodes, and supporting infrastructure in a multi-tenant environment
      • Monitor cluster health, performance, and capacity; respond to incidents and degradations
  • Scheduler & Workload Management
      • Configure, tune, and support HPC job schedulers (e.g., SLURM, LSF, PBS, or equivalent)
      • Assist users with job submission issues, resource requests, and queue optimization
      • Help optimize scheduler policies to balance throughput, fairness, and utilization
  • Linux Systems Engineering
      • Install, configure, and maintain Linux operating systems across compute and service nodes
      • Manage OS updates, kernel changes, drivers (including GPU drivers where applicable), and system hardening
      • Troubleshoot complex Linux performance, networking, storage, and process-level issues
  • Performance & Scaling
      • Support high-throughput and parallel workloads across CPU and GPU resources
      • Diagnose performance bottlenecks across compute, storage, network, and scheduler layers
      • Assist with scaling activities such as node expansions, re-provisioning, and hardware refreshes
  • Automation & Reliability
      • Use automation and configuration management tools to ensure consistency across the cluster
      • Contribute to scripting and tooling for node provisioning, validation, and lifecycle management
      • Participate in on-call or escalation rotations as required to support a production R&D platform
  • Collaboration & User Support
      • Partner with internal engineering teams to understand workload requirements and usage patterns
      • Provide guidance and best practices for running workloads efficiently on shared HPC systems
      • Contribute to internal documentation and operational runbooks

Benefits

  • Medical, dental, vision, life, and other voluntary benefits
  • 401(k) including company matching
  • Employee stock purchase program (ESPP)
  • Student debt assistance
  • Tuition reimbursement program
  • Development and career growth opportunities and programs
  • Financial planning benefits
  • Wellness benefits including an employee assistance program (EAP)
  • Paid time off and paid company holidays
  • Family care and bonding leave

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Number of Employees: 5,001-10,000
