Senior Member of Technical Staff, AI Quality

Harper • San Francisco, CA

1d • $176,000 - $253,000 • Onsite

About The Position

Harper is revolutionizing commercial insurance distribution with AI, aiming for 90%+ AI-led operations. The company is growing rapidly and serves millions of businesses, many of which are underinsured. Today, AI engineers rely on subjective assessments ('vibes') to evaluate changes to prompts, tools, or models. This role establishes objective, data-driven quality metrics that show whether AI systems are improving and that prevent regressions. The goal is to turn agent quality from a subjective impression into a measurable number, so that the systems stay reliable and scale without proportional headcount increases. The AI systems span the entire insurance lifecycle, including operator guidance, risk matching, autonomous communications, and voice AI.

Requirements

3-6 years of software engineering experience.
Production LLM/agent eval experience, including capability and regression suite design, LLM-as-judge graders, and golden datasets.
Familiarity with at least one major eval framework.
Strong written communication skills for documenting eval rubrics and failure-mode taxonomies.
Based in San Francisco or willing to relocate.

Nice To Haves

Open-source contributions to eval frameworks.
Red-team/adversarial-testing experience for LLM systems.
Voice AI eval experience (latency, interruption handling, transcription accuracy).
ML eval/observability background.

Responsibilities

Build capability and regression eval suites for assigned agents (intake, submissions, placements, renewals, CRM, or voice).
Curate golden datasets of real failure modes drawn from customer transcripts, underwriter interactions, and call recordings (20-50 high-quality cases per agent).
Design graders, starting with deterministic methods (string match, state check, tool-call assertions) and moving to human-calibrated LLM-as-judge graders where deterministic methods are insufficient.
Implement pre-merge eval gates in CI to block PRs that do not meet quality thresholds.
Set up production trajectory monitoring using online evaluators to detect drift in live systems within hours.
Convert operational findings and flagged failures into regression tests to ensure repeat issues are permanently addressed.
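
For candidates unfamiliar with the terminology, the deterministic graders and pre-merge gate described above might look roughly like the following minimal Python sketch. Everything in it is an illustrative assumption, not Harper's actual code: the `AgentRun` record, the grader function names, the `create_quote` tool, and the 0.9 threshold are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """Hypothetical record of one agent run on a golden-dataset case."""
    output: str                                      # final text the agent produced
    final_state: dict = field(default_factory=dict)  # system state after the run
    tool_calls: list = field(default_factory=list)   # (tool_name, args_dict) tuples


def grade_string_match(run: AgentRun, expected: str) -> bool:
    """String match: the output contains the expected text (case-insensitive)."""
    return expected.lower() in run.output.lower()


def grade_state(run: AgentRun, expected_state: dict) -> bool:
    """State check: every expected key has the expected value."""
    return all(run.final_state.get(k) == v for k, v in expected_state.items())


def grade_tool_call(run: AgentRun, tool_name: str, required_args: dict) -> bool:
    """Tool-call assertion: some call to tool_name carries the required args."""
    return any(
        name == tool_name
        and all(args.get(k) == v for k, v in required_args.items())
        for name, args in run.tool_calls
    )


def pass_rate(results: list) -> float:
    """Fraction of passing cases; an empty suite counts as 0.0."""
    return sum(results) / len(results) if results else 0.0


def gate(results: list, threshold: float) -> bool:
    """Pre-merge gate: True only if the pass rate meets the threshold."""
    return pass_rate(results) >= threshold


# Hypothetical golden case for a renewals agent.
run = AgentRun(
    output="Your renewal quote is ready.",
    final_state={"status": "quoted"},
    tool_calls=[("create_quote", {"policy_type": "general_liability"})],
)
case_passed = (
    grade_string_match(run, "renewal quote")
    and grade_state(run, {"status": "quoted"})
    and grade_tool_call(run, "create_quote", {"policy_type": "general_liability"})
)
```

In a CI job, a script along these lines would collect one boolean per golden case and exit non-zero when `gate(results, 0.9)` returns `False`, which blocks the PR. Cases that cannot be judged this way are the ones that would fall to an LLM-as-judge grader.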