AI evaluations are at a critical inflection point. Static benchmarks are saturated: benchmarks like MMLU, HumanEval, and SWE-Bench have reached their limits as models grow increasingly familiar with public test data and become capable of autonomously finding answer keys online. The gap between benchmark scores and real developer experience is widening, making it hard to tell which problems are truly "solved" and which are worth deeper investment.

GitHub is uniquely positioned to lead the industry through this transition. We have direct feedback and deep insight into real production workflows from millions of developers, and the scale to build evaluation systems that truly reflect developer success. We're looking for a Principal Technical Program Manager to help us build the future of AI evaluation.

The Applied Science team for GitHub Copilot sits at the intersection of frontier AI research and the world's largest developer platform. We ship AI-powered experiences (e.g., code completion, code review, and coding agents) used by millions of professional developers every day. As a member of the team, you will help lead GitHub Copilot's AI evaluation strategy end-to-end: from benchmark design and lifecycle governance, through evaluation infrastructure and internal adoption, to community engagement and public transparency. You are the person who ensures that every model swap, product harness, and feature launch is measured against what actually matters to developers, and that the world can see the results.
Career Level
Principal