Software Engineer - GPU Reliability

Hudson River Trading, New York, NY

About The Position

Hudson River Trading (HRT) is seeking a Software Engineer focused on GPU reliability to join our Systems Development team. The Systems Development team builds and maintains the platform that all Systems teams share to provision, monitor, and manage HRT’s server and network infrastructure. In this role, your main focus will be to develop tools in Python to analyze the performance of GPU hardware and build creative solutions to improve the observability, reliability, and efficiency of the fleet. You’ll work closely with other engineering teams to deeply understand research and trading workflows and ensure that GPU infrastructure is utilized optimally. Strong Python skills and development experience are required, along with Unix experience and a background in managing GPU hardware at scale.

Requirements

  • BS and/or MS in computer science or a related field
  • 2+ years of relevant experience, including programming in Python and managing GPUs
  • Experience using automation to solve problems and improve process efficiency
  • Experience working with, troubleshooting, tuning, and deploying various types of GPU hardware
  • Strong grasp of computer science fundamentals and software design patterns
  • Solid understanding of Linux/UNIX operating systems
  • Familiarity with open-source software
  • Ability to debug and analyze problems quickly
  • Skilled at balancing multiple tasks while maintaining meticulous attention to detail
  • Ability to operate effectively as a team player and also work independently
  • Ability to learn at a fast pace and apply new skills effectively

Nice To Haves

  • Understanding of Debian operating system
  • Familiarity with systems configuration management and monitoring technologies
  • Familiarity with continuous integration and continuous deployment tools and processes
  • Understanding of networking protocols

Responsibilities

  • Building and maintaining tools and software features to automate systems engineering workflows related to GPU management, monitoring, metrics collection, maintenance, and network configuration
  • Troubleshooting software and hardware bugs on a fleet of GPU devices, including application, network, operating system, and/or kernel issues
  • Working across HRT’s engineering teams to tune workloads and processes to use GPUs more efficiently
  • Analyzing GPU job statistics to identify trends and areas for improvement

Benefits

  • This role is eligible for discretionary performance-based bonuses and a competitive benefits package, which includes medical, dental, vision, basic life insurance, and enrollment in our company’s retirement savings plans.
  • Employees will receive sick and parental leave, as well as other paid time off (including 20 vacation days and 10 paid holidays in the US).
  • Please note that benefits and time off policies will vary across non-US locations.