Principal Engineer - HPC Operations

Core42 US Services LLC

About The Position

We are seeking a highly skilled Principal Engineer – HPC Operations to oversee the daily operations and support of high-performance computing clusters designed to power large-scale AI and ML workloads. This role ensures stable, secure, and high-performing infrastructure leveraging technologies such as Slurm, Kubernetes, and modern MLOps platforms. The ideal candidate will bring deep technical expertise in HPC and a strong operational mindset to drive continuous improvement and automation across globally distributed environments. Responsibilities will extend to collaborating with multidisciplinary teams, leading complex projects, implementing cutting-edge technologies, and providing mentorship to operations engineers.

Requirements

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
  • Minimum of 8 years of experience in HPC operations, systems engineering, or DevOps roles, with at least 2 years in a leadership or ownership capacity.
  • Advanced expertise in configuring, optimizing, and maintaining complex HPC environments, including hardware, software, and storage systems.
  • Hands-on experience managing Slurm clusters and/or Kubernetes-based environments for AI/ML workloads.
  • In-depth knowledge of GPU resource management, workload schedulers, and performance tuning for AI/ML workloads.
  • Proficiency with monitoring and observability frameworks such as Prometheus, Grafana, and DCGM.
  • Strong scripting and automation skills, including Python, Bash, Ansible, and Terraform.
  • Solid understanding of Linux (RHEL/CentOS/Ubuntu), networking technologies (RDMA, InfiniBand, RoCE), and storage solutions (NFS, Lustre, Ceph).

Responsibilities

  • Oversee the daily operational management of HPC infrastructure, including compute, storage, networking, and scheduler components (e.g., Slurm, Kubernetes).
  • Drive efforts to optimize the efficiency and performance of HPC systems, maximizing resource utilization and minimizing downtime.
  • Serve as the primary technical contact for in-scope planned HPC deployments.
  • Serve as the primary technical escalation point for L2 support teams, ensuring rapid and effective resolution of incidents and service requests.
  • Continuously monitor system health, performance, and resource utilization using advanced monitoring tools (e.g., Prometheus, Grafana, DCGM).
  • Manage user environments for AI/ML workloads, including container orchestration (e.g., Docker, Kubernetes) and workflow tools (e.g., MLflow, Kubeflow).
  • Define and enforce job scheduling policies, priorities, and partitions within Slurm and/or Kubernetes environments to ensure resource fairness, efficiency, and workload optimization.
  • Lead root cause analysis (RCA) of operational issues, contributing to post-mortem documentation and driving continuous improvement initiatives.
  • Provide mentorship and technical guidance to junior engineers, fostering skills development and knowledge sharing across teams.
  • Participate in on-call rotation as necessary.
  • Ensure adherence to security and operational policies, assisting in audits and maintaining documentation for change and incident management processes.

Benefits

  • Bonus
  • LTIP (long-term incentive plan)
  • Benefits