Product Manager, Public Sector GenAI Test & Evaluation (T&E)

Scale AI • St. Louis, MO

Remote

About The Position

At Scale, our mission is to develop reliable AI systems for the world’s most important decisions. The Public Sector team is at the forefront of this mission, partnering with government agencies to deploy mission-critical agentic solutions. The Public Sector GenAI T&E Product Manager will be a high-horsepower technical leader who defines the vision and owns the roadmap for our evaluation capabilities.

This role requires thriving in unscripted, high-stakes environments. You will be the primary owner of the T&E tech stack: the robust infrastructure required to continuously measure, improve, and prove the superiority and sustained performance of our agentic applications. Working across multiple engineering organizations at Scale, you will identify bottlenecks, distill technical friction into actionable plans, and drive execution. You will also partner with Scale’s commercial and public sector teams to define requirements, ensuring our evaluation services are robust enough for the most demanding government use cases. Key objectives include refining the tech stack that lets ML teams hill-climb on evaluation metrics and surfacing critical performance information to stakeholders.

Requirements

3+ years of experience in software engineering, systems architecture, or highly technical program management.
Ability to read code, understand system architecture, and participate in technical design reviews alongside engineering teams.
Proven experience designing, owning the roadmap for, or operating the infrastructure required to continuously measure, improve, and show the performance of AI applications.
Demonstrated experience taking a vaguely defined problem (e.g., "our evaluation cycles are too slow") and delivering a technical roadmap, resource requirements, and measurable success metrics within a narrow time window.
Proven track record of taking a project from "stalled/undefined" to "shipped" in a high-pressure environment.
Ability to point to at least two instances where you inherited a failing project and saw it through to production.
Led multiple projects that required direct alignment between at least three distinct engineering organizations (e.g., Infrastructure, ML Research, and Product).
Experience using technical project management tools (e.g., Linear) to provide consistent weekly reporting on delivery velocity and blockers to executive stakeholders.

Nice To Haves

Active Secret, Top Secret, or TS/SCI clearance.
Practical experience developing or evaluating features built specifically on LLMs, RAG, or autonomous agent workflows.
Advanced degree in Computer Science, Engineering, or a related field.
2+ years of experience working with DoD, IC, or Civil agencies on mission-critical software deployments.

Responsibilities

Defining the vision and owning the roadmap for our evaluation capabilities.
Being the primary owner for the T&E tech stack—the robust infrastructure required to continuously measure, improve, and prove the superiority and sustained performance of our agentic applications.
Identifying bottlenecks, distilling technical friction into actionable plans, and driving execution across multiple engineering organizations.
Defining requirements to ensure evaluation services are robust enough for the most demanding government use cases.
Refining the tech stack that lets ML teams hill-climb on evaluation metrics.
Surfacing critical performance information to stakeholders.
Using technical project management tools (e.g., Linear) to provide consistent weekly reporting on delivery velocity and blockers to executive stakeholders.