Voleon is a technology company that applies state-of-the-art machine learning techniques to real-world problems in finance. For more than a decade, we have led our industry and worked at the frontier of applying machine learning to investment management. We have become a multibillion-dollar asset manager, and we have ambitious goals for the future.

As a Senior Cluster Site Reliability Engineer (SRE), you will help scale our research compute cluster to meet our growing needs, and you will apply your engineering skills to ensure high uptime, reliability, and robustness. Our research clusters are at the core of our R&D, and you will be directly responsible for keeping this key resource available and performant. Your work will provide a world-class HPC platform that lets researchers focus on cutting-edge machine learning problems at scale. You will support both on-prem and cloud infrastructure and work to provide the best experience to our technical staff. You will apply infrastructure as code (IaC), automation, and SRE principles to refine a product that operates 24/7 to support Voleon.

The Cluster Operations team works on the front line to triage and mitigate real-time operational issues. You will be an integral member of this team, solving day-to-day issues with high urgency while also engineering systemic improvements and architectural fixes to prevent recurring issues. You will collaborate with engineering teams to improve monitoring and telemetry. You will help design and oversee operational frameworks to ensure the cluster operates within a set of rigorous SLAs.
Job Type: Full-time
Career Level: Senior
Education Level: Bachelor's degree
Number of Employees: 11-50