This role involves designing and building coding benchmarks and evaluation pipelines that test frontier AI models on real software engineering work. The goal is to create benchmarks that evaluate how well models reason about code, debug, and produce production-quality code. Responsibilities include analyzing model-generated code, constructing evaluation scenarios, and providing technical feedback on model performance. The ultimate aim is to develop benchmarks that effectively differentiate the capabilities of AI models and inform the training and improvement of future generations.
Job Type
Full-time
Career Level
Senior
Education Level
Associate degree