Senior HPC Engineer

Millennium
New York, NY
32d
$175,000 - $250,000

About The Position

Millennium's Infrastructure organization is dedicated to designing, engineering, supporting, and managing a robust server estate, systems virtualization, and core enterprise services. We are seeking a Senior HPC Engineer for a hands-on technical leadership position to support WorldQuant's initiative to maintain leadership in financial research. This role is pivotal in designing, building, and maintaining our cutting-edge High-Performance Computing (HPC) and GPU clusters, which are essential for our AI and Machine Learning initiatives. The ideal candidate will have a strong background in HPC environments, with specific expertise in GPU-accelerated computing and advanced storage solutions. You will be responsible for ensuring the reliability, scalability, and performance of our computational infrastructure.

You will join a highly specialized team of exceptionally talented yet refreshingly humble individuals from diverse disciplines. We believe that delivering exceptional services requires the ability to make meaningful changes across the entire stack. Our mission is to solve real business challenges, reduce operational complexities, and foster a collaborative, team-driven environment that promotes mutual growth and success.

Requirements

  • A Bachelor’s degree in Computer Science, Engineering, or a related field.
  • A minimum of 7 years of progressive experience in designing, building, and managing complex HPC environments.
  • Proven experience with GPU-accelerated computing, including NVIDIA GPUs and associated software (e.g., CUDA).
  • Deep expertise in high-performance storage systems and parallel file systems (e.g., Lustre, GPFS/Spectrum Scale).
  • Strong proficiency in Linux/Unix operating systems, scripting languages, and configuration management platforms.
  • Experience with cluster management and scheduling software (e.g., Kubernetes, Run.io), with a strong preference for Slurm.
  • Familiarity with high-speed interconnects like InfiniBand or RoCE.
  • An understanding of AI technologies and their applications in infrastructure automation and management, plus experience with, or a strong interest in, implementing AI/ML solutions for infrastructure optimization, anomaly detection, or predictive analytics.
  • A passion for technology and automation, with a deep sense of curiosity and ownership.
  • A hands-on approach to problem-solving and a demonstrable enthusiasm for technology.
  • Excellent verbal and written communication skills.

Nice To Haves

  • Master’s or Ph.D. in a relevant technical field.
  • Experience in a buy-side financial organization.
  • Experience with cloud-based HPC, preferably with GCP.
  • Knowledge of containerization technologies such as Docker and Singularity.

Responsibilities

  • Design and Implementation: Lead the architectural design, implementation, and maintenance of large-scale HPC and GPU clusters.
  • Storage Management: In collaboration with the storage team, architect and manage high-performance storage solutions tailored for GPU-intensive workloads, ensuring low-latency data access and high throughput.
  • System Optimization: Monitor, analyze, and tune the performance of the HPC environment, including compute nodes, networking fabrics, and parallel file systems.
  • Automation: Develop and maintain automation scripts and tools for provisioning, configuration management, and monitoring of the HPC infrastructure.
  • Collaboration: Work closely with researchers, data scientists, and software engineers to understand their computational needs and provide a robust and efficient platform to accelerate their work.
  • Troubleshooting: Provide expert-level support for complex issues related to hardware, software, and networking within the HPC ecosystem.
  • Technology Evaluation: Stay current with emerging technologies and industry trends in HPC, GPU computing, and storage, and conduct evaluations to recommend new solutions.
  • Knowledge Sharing: Contribute to organizational knowledge through documentation, education, and maintainable code, and provide guidance to the team in your areas of subject matter expertise.