Scientific AI Evaluation & Computational Problem Designer

Weekday AI

Remote

About The Position

This role focuses on designing rigorous, research-grade computational problems that assess how effectively AI systems can use real scientific software tools to solve complex challenges. Unlike traditional annotation roles, it requires creating original, graduate-level problems rooted in real-world scientific workflows. You will refine these problems iteratively by calibrating them against state-of-the-art AI models, so that each one reaches the intended balance of difficulty, depth, and reasoning complexity.

Requirements

Graduate-level expertise (MS or PhD preferred) in a relevant STEM field
Hands-on experience using scientific software libraries for real research problems
Strong Python programming skills, including building computational workflows and validators
Ability to design challenging problems that require deep reasoning rather than surface-level solutions
Familiarity with edge cases, limitations, and practical challenges of scientific tools
Demonstrated proficiency with at least one relevant scientific library (via research, open-source work, or industry experience)
Ability to work independently and iterate based on feedback
Comfort working in Linux/terminal environments and remote compute setups
Availability of at least 15–20 hours per week

Nice To Haves

Experience across multiple domains or tools
Background in evaluation frameworks or benchmarking
Experience in teaching, pedagogy, or problem-set design
Familiarity with reproducible research practices and containerized environments

Responsibilities

Design advanced computational problems requiring the use of domain-specific scientific software
Create tasks that test both precise execution (multi-step workflows, simulations) and strategic reasoning (experiment design, inference from partial data)
Develop problem setups, solution pathways, and validation mechanisms
Calibrate and refine tasks based on model performance to achieve target difficulty levels
Ensure problems emphasize reasoning strategy over brute-force computation