Engineering Manager, Evaluation Platform

Procore Technologies•Austin, TX

$168,560 - $231,770•Hybrid

About The Position

We’re looking for an Engineering Manager for our Evaluation Platform team to join Procore’s Construction Intelligence organization. In this role, you’ll build the infrastructure and tooling that enable users and internal teams to measure, benchmark, and improve the quality of AI agents, including Search Agent, RFI Create Agent, Invoice Agent, and future agentic products. You will own the end-to-end evaluation lifecycle, from defining quality metrics and building evaluation frameworks to delivering intuitive interfaces that surface actionable insights about agent performance. This position reports to the Sr. Director of the Procore AI Engineering team and is a hybrid role, with 2 days per week in our Austin office. We’re looking for someone to join us immediately.

Requirements

5+ years managing engineering teams or technical leads, with 7+ years total in software engineering.
Experience building evaluation, quality measurement, or observability platforms for LLM-based or agentic systems (RAG pipelines, multi-step agents, tool-use agents).
Strong understanding of evaluation methodologies: precision/recall, LLM-as-judge, human annotation, A/B testing, and statistical significance frameworks.
Proven ability to translate ambiguous problem spaces into clear technical strategies and executable roadmaps.
Hands-on technical depth in backend systems, data pipelines, or distributed infrastructure (Python, Go, or similar).
Familiarity with evaluation frameworks such as RAGAS, DeepEval, LangFuse, or custom eval harnesses.
Background in search relevance (NDCG, MRR) or information retrieval quality systems.

Nice To Haves

Experience with construction-tech, procurement, or enterprise B2B SaaS domains.

Responsibilities

Lead and grow a team of engineers focused on evaluation infrastructure, quality measurement, and developer tooling for AI agents.
Define the technical vision and roadmap for the Evaluation Platform — covering offline evaluations (batch benchmarks, regression suites) and online evaluations (live traffic quality monitoring, A/B testing).
Partner with AI/ML, Product, and Agent teams to define quality metrics for agents (relevance, accuracy, latency, safety, user satisfaction, token usage) and build automated pipelines to compute them at scale.
Design and deliver user-facing evaluation tools that allow customers and internal teams to assess agent output quality, compare model versions, and identify regressions.
Build frameworks for human-in-the-loop evaluation — annotation workflows, rating interfaces, and inter-rater reliability measurement.
Establish CI/CD quality gates so that new agent versions cannot ship without passing evaluation thresholds.
Drive engineering excellence: code quality, system reliability, test coverage, on-call health, and technical debt management.
Recruit, mentor, and develop engineers — fostering a culture of ownership, curiosity, and rigorous experimentation.