About The Position

As Point72 reimagines the future of investing, our Technology team is constantly evolving the firm’s IT infrastructure and engineering capabilities, keeping us at the forefront of a rapidly changing technology landscape. We’re a team of experts who experiment and work to discover new ways to harness open-source solutions, modern cloud architectures, and sophisticated Artificial Intelligence (AI) solutions, all while embracing enterprise agile methodologies. Our commitment to building and innovating in the AI space provides a framework that drives smarter decision making and enhances how we build and operate our platforms and applications. As a member of Point72’s Technology team, you’ll be encouraged and supported in your professional development from day one, helping you advance your technical skills, contribute innovative ideas, and satisfy your intellectual curiosity, all while delivering real business impact for our multi-billion-dollar global business.

Requirements

  • Bachelor's or master's degree in computer science, electrical engineering, or a related technical field
  • 3–7 years of experience building and maintaining scalable compute or machine learning infrastructure systems
  • Deep understanding of distributed systems, container orchestration (Kubernetes), and public cloud platforms such as AWS, Google Cloud Platform, or Azure
  • Hands-on experience with machine learning operations and infrastructure tools such as MLflow, Ray, Airflow, Kubeflow, and Terraform
  • Strong understanding of reinforcement learning concepts and their infrastructure implications
  • Proficiency in Python and systems-level programming in one or more languages such as Go, C++, or Rust
  • Strong debugging, performance profiling, and optimization skills across GPU and CPU compute stacks
  • Experience implementing monitoring, observability, and cost-optimization for GPU/accelerator-based compute environments
  • Excellent collaboration and communication skills with a systems-thinking mindset
  • Commitment to the highest ethical standards

Responsibilities

  • Design and implement high-performance infrastructure to support large-scale generative AI and machine learning workloads, enabling faster model iteration and real business impact
  • Design and operate distributed systems for model training, hyperparameter tuning, inference, and data preprocessing pipelines to deliver reliable end-to-end machine learning (ML) workflows
  • Collaborate with ML researchers and engineers to productionize models, optimizing compute utilization, training throughput, and inference latency
  • Develop and automate deployment, orchestration, and CI/CD pipelines for models and data workflows using container orchestration and infrastructure-as-code (IaC)
  • Implement observability, monitoring, and cost-management strategies for GPU and accelerator compute environments to maintain predictable performance and spend
  • Evaluate, integrate, and benchmark emerging hardware and software technologies across cloud and on-prem environments to improve scalability and throughput
  • Drive security, compliance, and operational runbooks for GenAI infrastructure including access controls, secrets management, and incident response procedures
  • Troubleshoot, profile, and optimize performance across GPU and CPU compute stacks to remove bottlenecks and increase reliability
  • Document architecture and operational practices, and mentor engineers to expand team capability and accelerate adoption of production-ready GenAI infrastructure

Benefits

  • Fully paid health care benefits
  • Generous parental and family leave policies
  • Volunteer opportunities
  • Support for employee-led affinity groups representing women, people of color, and the LGBT+ community
  • Mental and physical wellness programs
  • Tuition assistance
  • A 401(k) savings program with an employer match and more