HPC Systems Engineer

KLA, Ann Arbor, MI

About The Position

We’re looking for an HPC Systems Engineer to help power the compute infrastructure behind our R&D innovation! In this role, you’ll support and evolve a high‑performance Linux cluster used for physics modeling, simulation, algorithm development, and machine‑learning workloads, enabling hundreds of engineers to do their best work every day. You’ll play a key role in driving the reliability, performance, and scalability of a shared, mission‑critical HPC environment, partnering closely with infrastructure, DevOps, and application teams to keep the platform fast, resilient, and ready for the most demanding computational challenges!

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience
  • 3+ years of hands‑on Linux systems administration experience
  • Direct experience working with HPC or large‑scale compute environments
  • Practical experience with at least one HPC scheduler (SLURM, LSF, PBS, or similar)
  • Strong Linux troubleshooting skills (processes, memory, I/O, networking, performance analysis)
  • Comfort working in CLI‑driven, production infrastructure environments

Nice To Haves

  • Experience supporting GPU‑accelerated workloads (CUDA, drivers, GPU scheduling concepts)
  • Familiarity with parallel computing or scientific/engineering workloads
  • Experience with cluster storage systems (e.g., Lustre, BeeGFS, NFS, or high‑performance NAS/SAN)
  • Exposure to automation tools (Ansible, scripting, Infrastructure‑as‑Code concepts)
  • Familiarity with containers in HPC contexts (Singularity / Apptainer, rootless containers)
  • Experience supporting internal developer or research communities

Responsibilities

  • HPC Platform Operations
      • Operate and maintain a large-scale Linux-based HPC cluster used for internal R&D workloads
      • Manage compute nodes, login nodes, and supporting infrastructure in a multi-tenant environment
      • Monitor cluster health, performance, and capacity; respond to incidents and degradations
  • Scheduler & Workload Management
      • Configure, tune, and support HPC job schedulers (e.g., SLURM, LSF, PBS, or equivalent)
      • Assist users with job submission issues, resource requests, and queue optimization
      • Help optimize scheduler policies to balance throughput, fairness, and utilization
  • Linux Systems Engineering
      • Install, configure, and maintain Linux operating systems across compute and service nodes
      • Manage OS updates, kernel changes, drivers (including GPU drivers where applicable), and system hardening
      • Troubleshoot complex Linux performance, networking, storage, and process-level issues
  • Performance & Scaling
      • Support high-throughput and parallel workloads across CPU and GPU resources
      • Diagnose performance bottlenecks across compute, storage, network, and scheduler layers
      • Assist with scaling activities such as node expansions, re-provisioning, and hardware refreshes
  • Automation & Reliability
      • Use automation and configuration management tools to ensure consistency across the cluster
      • Contribute to scripting and tooling for node provisioning, validation, and lifecycle management
      • Participate in on-call or escalation rotations as required to support a production R&D platform
  • Collaboration & User Support
      • Partner with internal engineering teams to understand workload requirements and usage patterns
      • Provide guidance and best practices for running workloads efficiently on shared HPC systems
      • Contribute to internal documentation and operational runbooks

Benefits

  • medical
  • dental
  • vision
  • life
  • other voluntary benefits
  • 401(k), including company matching
  • employee stock purchase program (ESPP)
  • student debt assistance
  • tuition reimbursement program
  • development and career growth opportunities and programs
  • financial planning benefits
  • wellness benefits including an employee assistance program (EAP)
  • paid time off and paid company holidays
  • family care and bonding leave