Senior HPC Engineer

Millennium
New York, NY
Onsite

About The Position

Millennium’s Infrastructure organization designs, engineers, and operates a robust global computing platform supporting the firm’s quantitative research. We are seeking a Senior HPC Engineer for a senior, hands-on role building and evolving large-scale, high-throughput HPC and GPU platforms that underpin AI- and machine-learning-driven research.

In this role, you will be part of a small, senior HPC team, taking end-to-end ownership of a significant area of the platform while collaborating closely with other subject-matter experts. You will be a systems-level engineer who is comfortable owning complex technical decisions and designing and building production infrastructure, rather than advising from the sidelines.

We aim to build infrastructure that is reliable, understandable, and adaptable, and we value engineers who care about simplicity, clarity, and maintainability as much as raw performance. We recognize that strong candidates may bring different experiences, perspectives, and working styles.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related technical field; a Master’s or PhD is a plus.
  • Typically 7+ years of hands-on experience designing, building, and operating HPC or large-scale compute environments.
  • Deep, practical experience with at least one major HPC scheduler (such as Slurm), including using it to operate large-scale or high-throughput clusters in production.
  • Hands-on experience with GPU-accelerated computing, including NVIDIA GPUs and associated software ecosystems.
  • Strong Linux systems engineering skills and comfort working close to the operating system, drivers, and hardware.
  • Experience designing or operating high-performance storage systems, including parallel or scale-out file systems.
  • Curious, evidence-driven problem solving, including experimenting with different approaches and using data to inform decisions.
  • A collaborative working style that values listening, respectful discussion, and incorporating different perspectives — whether you are quieter and more reflective or more vocal in group settings.
  • Clear written and verbal communication skills, and an ability to explain complex ideas in a way that works for different audiences.
  • A strong sense of ownership for outcomes, paired with openness to feedback, learning, and evolving systems over time.

Nice To Haves

  • Experience with Kubernetes, Run:ai, or other workload orchestration platforms alongside traditional HPC schedulers.
  • Familiarity with Lustre, GPFS / Spectrum Scale, or similar high-performance storage technologies.
  • Exposure to cloud-based HPC environments (e.g., GCP or other major cloud providers).
  • Experience supporting quantitative research, finance, or other demanding compute-intensive workloads.
  • Interest in applying AI or ML techniques to infrastructure (for example, optimization, anomaly detection, or predictive analysis).

Responsibilities

  • Design, build, and operate large-scale, high-throughput HPC and GPU clusters (for example, tens of thousands of CPU cores and hundreds of GPUs) supporting AI and machine-learning workloads.
  • Collaborate with other HPC engineers and subject-matter experts to co-design system architectures, review designs, and share knowledge.
  • Partner with storage specialists to architect and maintain high-performance, low-latency storage solutions, including parallel or scale-out file systems.
  • Work closely with researchers, data scientists, and engineers to understand computational needs and translate them into effective, scalable system designs.
  • Monitor, analyze, and optimize performance across compute, scheduling, networking, and storage layers.
  • Build and maintain automation and infrastructure-as-code for provisioning, configuration, monitoring, and lifecycle management, with an emphasis on repeatability and simplicity.
  • Participate in design reviews, operational discussions, and post-incident reviews with a focus on learning, collaboration, and system improvement rather than blame.
  • Explore alternative approaches to scheduling, data layout, cluster architectures, and GPU utilization through small experiments or prototypes, using data to guide decisions.
  • Produce clear documentation, diagrams, and reusable tooling that enable others to operate, debug, and extend the platform.
  • Stay current with advancements in HPC, GPU computing, networking, and storage, and help assess where new technologies can add real value.

Benefits

  • Discretionary performance bonus
  • Comprehensive benefits package