Responsibilities
- Drive the entire alignment stack, spanning instruction tuning, RLHF, and RLAIF, to push the model toward high factual accuracy and robust instruction following.
- Lead research efforts to design next-generation reward models and optimization objectives that significantly improve performance on human preference (HP) evaluations.
- Curate high-quality training data and design synthetic data pipelines that close complex reasoning and behavioral gaps.
- Optimize large-scale RL pipelines for stability and efficiency, ensuring rapid iteration cycles for model improvements.
- Collaborate closely with pre-training and evaluation teams to create tight feedback loops that translate alignment research into generalizable model gains.
Job Type: Full-time
Career Level: Mid Level
Number of Employees: 51-100