About The Position

Centific AI Research seeks a PhD Research Intern to design and evaluate reinforcement learning (RL) systems for agentic AI workflows. You will develop RL environments, reward models, and post-training pipelines for LLM-based agents, translating research into practical enterprise solutions.

The scope of work covers end-to-end RL pipelines for agentic systems (simulation → training → evaluation); alignment of LLM-based agents using RLHF, DPO, PPO, and emerging methods; design of reward functions, verifiers, and evaluation frameworks; simulation environments (digital twins) for enterprise workflows; and scalable training and inference for RL-based systems.

Example projects include building a custom RL environment that simulates a real-world enterprise workflow and training an agent with PPO or GRPO; developing a reward modeling pipeline from human feedback and evaluating alignment improvements; creating an evaluation harness that measures reasoning, task success, and policy safety; and prototyping an agentic system with tool use and multi-step reasoning, integrated with RL training. You will also document experiments, ablations, and findings for research and productionization.

Requirements

  • PhD candidate in CS, ML, or related field with research in reinforcement learning or agentic AI.
  • Strong Python and PyTorch skills with GPU-based training experience.
  • Solid understanding of RL fundamentals (MDPs, policy gradients, value methods).
  • Experience with LLMs and post-training techniques (RLHF, DPO, PPO, etc.).
  • Strong experimentation practices (ablation, reproducibility, clear reporting).

Nice To Haves

  • Experience with RL environment and training libraries (Gymnasium, RLlib, Stable Baselines).
  • Research in offline RL, model-based RL, or hierarchical RL.
  • Publications at top ML and NLP conferences (NeurIPS, ICML, ICLR, ACL).
  • Experience with simulation, synthetic data, or multi-agent systems.
  • Experience with distributed training and large-scale experimentation.

Responsibilities

  • Design and evaluate reinforcement learning (RL) systems for agentic AI workflows.
  • Develop RL environments, reward models, and post-training pipelines for LLM-based agents.
  • Translate research into practical enterprise solutions.
  • Build end-to-end RL pipelines for agentic systems (simulation → training → evaluation).
  • Align LLM-based agents using RLHF, DPO, PPO, and emerging methods.
  • Design reward functions, verifiers, and evaluation frameworks.
  • Create simulation environments (digital twins) for enterprise workflows.
  • Implement scalable training and inference for RL-based systems.
  • Build custom RL environments simulating real-world enterprise workflows and train agents using PPO or GRPO.
  • Develop reward modeling pipelines from human feedback and evaluate alignment improvements.
  • Create evaluation harnesses measuring reasoning, task success, and policy safety.
  • Prototype agentic systems with tool use and multi-step reasoning, integrated with RL training.
  • Document experiments, ablations, and findings for research and productionization.

Benefits

  • Competitive stipend
  • Impactful, real-world projects
  • Mentorship from researchers and engineers
  • Access to modern GPU infrastructure
  • Opportunities to publish and present research