GPU Systems Engineer

Hudson River Trading
Seattle, WA
$200,000 - $300,000

About The Position

Hudson River Trading (HRT) is looking for GPU Systems Engineers to help scale and evolve our exceptionally sophisticated HPC/AI research environment. Joining our Research and Development team, you will collaborate with experts responsible for the compute, storage, operating systems, and automation tools that enable our trading and research to run 24/7 across the globe. We design, grow, and operate infrastructure at a large scale, including triple-digit petabyte-scale storage and massive CPU and GPU clusters in globally distributed data centers. As such, this is a high-impact role with broad scope, from HPC/AI cluster design and performance tuning, to troubleshooting and automation for thousands of nodes.

Requirements

  • 5+ years of experience in large-scale Linux systems engineering in HPC, AI or distributed infrastructure roles
  • Extensive experience in Linux system installation, performance tuning, and troubleshooting
  • Expertise in troubleshooting distributed GPU workloads
  • Deep knowledge of GPU optimization and performance tuning
  • Proficiency in Python scripting and automation frameworks
  • CUDA or C/C++ experience is a plus
  • Experience with NVIDIA technologies beyond CUDA, such as NCCL, GPUDirect RDMA, and NVLink
  • Familiarity with configuration management tools (e.g. Salt, Ansible, Puppet, Chef)
  • Comfortable diagnosing complex system issues at the hardware, OS, and network levels
  • Strong communication and organizational skills; able to collaborate across diverse technical teams
  • Ability to thrive in fast-paced environments and enthusiasm for high-impact work

Responsibilities

  • Design, build, and optimize large-scale distributed GPU compute clusters
  • Identify and resolve performance bottlenecks in GPU workloads across compute, storage, and networking layers
  • Collaborate with research and development teams to profile, benchmark, and fine-tune GPU-based workloads
  • Automate system deployment, monitoring, and troubleshooting across thousands of nodes
  • Collaborate with research and engineering teams to support evolving workloads
  • Own critical infrastructure projects from concept to implementation and support
  • Test and deploy new hardware and software, and partner with vendors to resolve complex issues

Benefits

  • Medical insurance
  • Dental insurance
  • Vision insurance
  • Basic life insurance
  • Enrollment in the company’s 401(k) plan
  • 20 vacation days annually
  • 10 paid holidays annually
  • Sick leave
  • Parental leave
  • Discretionary performance-based bonuses