AI Evaluation Lead

fintentional.ai•US Remote

3d•$120,000 - $140,000•Remote

About The Position

Hence is building the Financial Answer Machine, an intelligent guide designed to help people navigate a new financial reality. Underpinned by a proprietary financial system, we are turning “average” advice into personalized, multi-modal financial power. We are looking for founding team members to help us build a bridge between the intelligence of AI and the rigid accuracy required for financial freedom. This is a rare opportunity to join at Day Zero and architect a business designed for outsized impact and massive scale.

Requirements

Experience working on AI or ML system quality in a context where outputs had real stakes.
Ability to think analytically about what the data is and is not telling you.
Comfortable making judgment calls in ambiguous situations rather than waiting for the answer to be obvious.
Enough AI/ML fluency to reason about why a system is producing what it is producing, not just whether the output looks right.
Enough personal finance literacy to read an advice response and have a genuine opinion about whether it is directionally sound.
Fluency with how LLM-based systems behave in production, including output variance, failure modes, and the limits of automated scoring.
Ability to assess whether an eval framework is measuring the right things, not just whether it is running correctly.
Comfortable working with behavioral and interaction data to surface patterns and quality signals.
Familiarity with evaluation and observability tooling.

Nice To Haves

Model evaluation or QA on a consumer-facing AI product, particularly in a regulated or high-stakes context.
Model risk or validation with LLM or generative AI exposure.
Data science or analytics with ownership of production AI system quality.
Operations quality control built around AI- or ML-generated outputs.
Financial services or fintech product roles where you developed both analytical depth and personal finance domain familiarity.

Responsibilities

Define and validate the evaluation set: what cases we should be testing, whether coverage is sufficient across domains, and where the current framework has gaps.
Analyze scoring results to identify the highest-frequency case types, patterns in what performs well versus poorly, and anomalies that warrant closer review.
Assess whether current measures are detecting the right failure modes or whether new measures are needed.
Review flagged cases and make judgment calls on what the results mean and what should be done about them, drawing on both data and domain knowledge.
Own the criteria and calibration for when human review is triggered: defining what rises to that level, what does not, and ensuring the threshold stays well-calibrated as the platform scales.
Partner with subject matter experts on cases that require deeper domain judgment, and incorporate their input into evaluation design.
Ensure evaluation coverage keeps pace with new domain additions and model changes before they ship.
Translate findings into specific, actionable recommendations for the AI/ML team on what needs to change in the system.
Evolve the evaluation framework as the system grows, new domains are added, and user patterns shift.