Software Engineer - GPU reliability

Hudson River Trading
New York City, NY
$200,000 - $300,000

About The Position

Hudson River Trading (HRT) is seeking a Software Engineer focused on GPU reliability to join our Systems Development team. The Systems Development team builds and maintains the platform that is shared by all Systems teams to provision, monitor, and manage HRT’s server and network infrastructure. In this role, your main focus will be to develop tools in Python to analyze the performance of GPU hardware and build creative solutions to improve observability, reliability, and efficiency of the fleet. You’ll work closely with other engineering teams to deeply understand research and trading workflows and ensure that GPU infrastructure is utilized optimally. Strong Python skills and development experience are required, along with Unix experience and a background in managing GPU hardware at scale. This role offers a unique opportunity to make a significant impact on a critical part of our existing and growing infrastructure.

Requirements

  • BS and/or MS in computer science or a related field
  • 2+ years of relevant experience, including programming in Python and managing GPUs
  • Experience using automation to solve problems and improve process efficiency
  • Experience working with, troubleshooting, tuning, and deploying various types of GPU hardware
  • Strong grasp of computer science fundamentals and software design patterns
  • Solid understanding of Linux/UNIX operating systems
  • Familiarity with open-source software
  • Ability to debug and analyze problems quickly
  • Skilled at balancing multiple tasks while maintaining meticulous attention to detail
  • Ability to operate effectively as a team player and also work independently
  • Ability to learn at a fast pace and apply new skills effectively

Nice To Haves

  • Understanding of the Debian operating system
  • Familiarity with systems configuration management and monitoring technologies
  • Familiarity with continuous integration and continuous deployment tools and processes
  • Understanding of networking protocols

Responsibilities

  • Building and maintaining tools and software features to automate systems engineering workflows related to GPU management, monitoring, metrics collection, maintenance, and network configuration
  • Troubleshooting software and hardware bugs on a fleet of GPU devices, including application, network, operating system, and/or kernel issues
  • Working across HRT’s engineering teams to tune workloads and processes to use GPUs more efficiently
  • Analyzing GPU job statistics to identify trends and areas for improvement

Benefits

  • medical
  • dental
  • vision
  • basic life insurance
  • enrollment in our company’s retirement savings plans
  • sick and parental leave
  • other paid time off (including 20 vacation days and 10 paid holidays in the US)