Research Engineer, Training & Inference

Harmonic
Palo Alto, CA

About The Position

We are developing reinforcement learning systems at a scale where standard abstractions frequently fail. Unlike labs that operate primarily through high-level wrappers, we own our entire RL stack, from low-level environment simulators and custom communication primitives to our distributed training loops and inference engines. We are seeking engineers who treat existing libraries as a baseline and the hardware's peak performance as the true target. You will be responsible for the architecture powering our agents, with a relentless focus on maximizing the throughput of our reinforcement learning and production workflows.

Requirements

  • BS in Computer Science or a related technical field, or equivalent industry experience
  • 2+ years of relevant, hands-on industry experience
  • Proficiency in Python
  • Experience building or maintaining components within ML frameworks (e.g., PyTorch, JAX, or TensorFlow)
  • Either an understanding of distributed training concepts and collective communication primitives (e.g., NCCL), or practical experience deploying and profiling models on GPU-accelerated cloud infrastructure

Nice To Haves

  • MS or PhD in Computer Science, Mathematics, or a related field
  • 5+ years of relevant, hands-on industry experience
  • Proficiency in C++
  • Experience writing or improving kernels (Triton, CuTeDSL, TileLang, CUDA, CUTLASS, ThunderKittens) to resolve low-level bottlenecks
  • Proven success deploying performant inference at scale using open-source or custom inference engines, routers, etc.
  • Direct experience scaling models via FSDP, tensor parallelism, or related sharding techniques on multi-node GPU clusters
  • Experience designing reinforcement learning systems for high-throughput training and asynchronous data sampling

Responsibilities

  • Maintain and optimize our proprietary RL training and serving infrastructure
  • Refactor any layer, from the Python API down to the CUDA kernels, to achieve peak performance for foundation model workloads
  • Maximize the throughput of our reinforcement learning system, from data generation to model training, with sharded multi-node training and inference algorithms
  • Optimize our inference stack for high-throughput reinforcement learning and low-latency LLM production traffic
  • Tune the inference engine, router, and scheduler, down to custom kernels when necessary
  • Identify and resolve performance bottlenecks within our distributed clusters, balancing memory constraints against compute-heavy training cycles to ensure optimal throughput and memory efficiency for multi-billion-parameter models

Benefits

  • Unlimited PTO
  • 401(k) matching
  • 100% employer-paid health, vision, and dental benefits for employees and 50% coverage for dependents
  • Varied health coverage options
  • Health Savings Account (HSA) available for qualifying health plans