AI Evaluation Lead

fintentional.ai
US Remote
$120,000 - $140,000

About The Position

Hence is building the Financial Answer Machine, an intelligent guide designed to help people navigate a new financial reality. Underpinned by a proprietary financial system, we are turning “average” advice into personalized, multi-modal financial power. We are looking for founding team members to help us build a bridge between the intelligence of AI and the rigid accuracy required for financial freedom. This is a rare opportunity to join at Day Zero and architect a business designed for outsized impact and massive scale.

Requirements

  • Experience working on AI or ML system quality in a context where outputs had real stakes.
  • Ability to think analytically about what the data is and is not telling you.
  • Comfortable making judgment calls in ambiguous situations rather than waiting for the answer to be obvious.
  • Enough AI/ML fluency to reason about why a system is producing what it is producing, not just whether the output looks right.
  • Enough personal finance literacy to read an advice response and have a genuine opinion about whether it is directionally sound.
  • Fluency with how LLM-based systems behave in production, including output variance, failure modes, and the limits of automated scoring.
  • Ability to assess whether an eval framework is measuring the right things, not just whether it is running correctly.
  • Comfortable working with behavioral and interaction data to surface patterns and quality signals.
  • Familiarity with evaluation and observability tooling.

Nice To Haves

  • Model evaluation or QA on a consumer-facing AI product, particularly in a regulated or high-stakes context.
  • Model risk or validation with LLM or generative AI exposure.
  • Data science or analytics with ownership of production AI system quality.
  • Operations quality control built around AI- or ML-generated outputs.
  • Financial services or fintech product roles where you developed both analytical depth and personal finance domain familiarity.

Responsibilities

  • Define and validate the evaluation set: what cases we should be testing, whether coverage is sufficient across domains, and where the current framework has gaps.
  • Analyze scoring results to identify highest-frequency case types, patterns in what is performing well versus poorly, and anomalies that warrant closer review.
  • Assess whether current measures are detecting the right failure modes or whether new measures are needed.
  • Review flagged cases and make judgment calls on what the results mean and what should be done about them, drawing on both data and domain knowledge.
  • Own the criteria and calibration for when human review is triggered: defining what rises to that level, what does not, and ensuring the threshold stays well-calibrated as the platform scales.
  • Partner with subject matter experts on cases that require deeper domain judgment, and incorporate their input into evaluation design.
  • Ensure evaluation coverage keeps pace with new domain additions and model changes before they ship.
  • Translate findings into specific, actionable recommendations for the AI/ML team on what needs to change in the system.
  • Evolve the evaluation framework as the system grows, new domains are added, and user patterns shift.

Benefits

  • Early-stage option equity
  • Periodic in-person get-togethers
  • Clear writing
  • High ownership
  • Fast iteration
  • Direct communication
  • Thoughtful async collaboration
  • Broad ownership
  • Frequent context shifts
  • High degree of autonomy