AI Engineer, Quality (Evals)

Fieldguide · San Francisco, CA
Onsite

About The Position

Fieldguide is establishing a new state of trust for global commerce and capital markets by automating and streamlining the work of assurance and audit practitioners, specifically within cybersecurity, privacy, and financial audit. They build software for the people who enable trust between businesses. Fieldguide is based in San Francisco, CA, and is backed by top investors including Growth Equity at Goldman Sachs Alternatives, Bessemer Venture Partners, 8VC, Floodgate, Y Combinator, DNX Ventures, Global Founders Capital, Justin Kan, Elad Gil, and more.

The company values diversity and aims to build an inclusive, driven, humble, and supportive team. As an early-stage startup employee, you will help build the future of business trust, making audit practitioners' lives easier by streamlining their work and improving their work-life balance.

Fieldguide is building AI agents for complex audit and advisory workflows in a rapidly transforming $100B+ market. Over 50 of the top 100 accounting and consulting firms use their platform.

As an AI Engineer, Quality, you will own the evaluation infrastructure that ensures the reliability of AI agents at enterprise scale. The role focuses on establishing evaluations as a core engineering capability: building a unified evaluation platform, automated pipelines, and production feedback loops so new models can be evaluated quickly across critical workflows. You will work at the intersection of ML engineering, observability, and quality assurance to meet rigorous customer standards.

Fieldguide is hiring across all levels, with seniority calibrated during interviews. While the company otherwise operates remote-first, this specific role emphasizes in-person collaboration at their San Francisco, CA office.

Requirements

  • You are an engineer who believes that evaluations are foundational to building reliable AI systems, not a nice-to-have
  • Evaluation-first mindset: You understand that for an AI company, being unable to evaluate a new model quickly is unacceptable
  • AI-native instincts: You treat LLMs, agents, and automation as fundamental building blocks of the engineering craft
  • Data-driven rigor: You make decisions based on metrics and are obsessed with measuring what matters
  • Production-oriented: You understand that evaluations must work on real production behavior, not just offline datasets
  • Strong product judgment: You can decide what matters and why, not just how to implement it, without waiting for guidance
  • Bias to building: You move fast and build working systems rather than perfect specifications
  • Multiple years of experience shipping production software in complex, real-world systems
  • Experience with TypeScript, React, Python, and Postgres
  • Built and deployed LLM-powered features serving production traffic
  • Implemented evaluation frameworks for model outputs and agent behaviors (a minimal harness sketch follows this list)
  • Designed observability or tracing infrastructure for AI/ML systems
  • Worked with vector databases, embedding models, and RAG architectures
  • Experience with evaluation platforms (LangSmith, Langfuse, or similar)
  • Comfort operating in ambiguity and taking responsibility for outcomes
  • Deep empathy for professional-grade, mission-critical software
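
To make that expectation concrete, here is a minimal, framework-agnostic sketch of an evaluation harness. Everything in it (the hypothetical `run_agent` callable, the case shape, the exact-match grader) is an illustrative assumption, not Fieldguide's actual stack:

```python
"""Minimal offline evaluation harness, for illustration only."""
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    input: str
    expected: str

@dataclass
class EvalResult:
    case: EvalCase
    output: str
    passed: bool

def exact_match(output: str, expected: str) -> bool:
    # Simplest possible grader; real harnesses layer rubric- or
    # model-based graders on top of checks like this.
    return output.strip().lower() == expected.strip().lower()

def run_eval(cases: list[EvalCase],
             run_agent: Callable[[str], str]) -> list[EvalResult]:
    # Run the system under test on every case and grade each output.
    results = []
    for case in cases:
        output = run_agent(case.input)
        results.append(EvalResult(case, output, exact_match(output, case.expected)))
    return results

if __name__ == "__main__":
    # Toy system under test and a two-case dataset, purely illustrative.
    cases = [
        EvalCase("Is SOC 2 a cybersecurity framework?", "yes"),
        EvalCase("Does SOC 2 cover financial statement audits?", "no"),
    ]
    results = run_eval(cases, run_agent=lambda question: "yes")
    pass_rate = sum(r.passed for r in results) / len(results)
    print(f"pass rate: {pass_rate:.0%}")
```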

Nice To Haves

  • Experience with audit and accounting workflows

Responsibilities

  • Design and build a unified evaluation platform that serves as the single source of truth for all of our agentic systems and audit workflows
  • Build observability systems that surface agent behavior, trace execution, and failure modes in production, and feedback loops that turn production failures into first-class evaluation cases
  • Own the evaluation infrastructure stack including integration with LangSmith and LangGraph
  • Translate customer problems into concrete agent behaviors and workflows
  • Integrate and orchestrate LLMs, tools, retrieval systems, and logic into cohesive, reliable agent experiences
  • Build automated pipelines that evaluate new models against all critical workflows within hours of release
  • Design evaluation harnesses for our most complex agentic systems and workflows
  • Implement comparison frameworks that measure effectiveness, consistency, latency, and cost across model versions (see the comparison sketch after this list)
  • Design guardrails and monitoring systems that catch quality regressions before they reach customers (see the regression-gate sketch after this list)
  • Use AI as core leverage in how you design, build, test, and iterate
  • Prototype quickly to resolve uncertainty, then harden systems for enterprise-grade reliability
  • Build evaluations, feedback mechanisms, and guardrails so agents improve over time
  • Work with SMEs and ML Engineers to create evaluation datasets by curating production traces (see the trace-curation sketch after this list)
  • Design prompts, retrieval pipelines, and agent orchestration systems that perform reliably at scale
  • Define and document evaluation standards, best practices, and processes for the engineering organization
  • Advocate for evaluation-driven development and make it easy for the team to write and run evals
  • Partner with product and ML engineers to integrate evaluation requirements into agent development from day one
  • Take full ownership of large product areas rather than executing on narrow tasks
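
As a rough illustration of the comparison frameworks mentioned above, the sketch below measures pass rate (effectiveness), agreement across repeated runs (consistency), latency, and cost for candidate models. The callable signature returning an answer plus a dollar cost is an assumption made for the sketch, not a real client API:

```python
"""Sketch of a model-comparison harness (illustrative, not Fieldguide's code)."""
import statistics
import time
from typing import Callable

def compare(models: dict[str, Callable[[str], tuple[str, float]]],
            cases: list[tuple[str, str]], runs: int = 3) -> None:
    for name, model in models.items():
        passes, consistent, latencies, cost = 0, 0, [], 0.0
        for question, expected in cases:
            answers = []
            for _ in range(runs):
                start = time.perf_counter()
                answer, usd = model(question)
                latencies.append(time.perf_counter() - start)
                cost += usd
                answers.append(answer)
            passes += answers[0].strip() == expected      # effectiveness
            consistent += len(set(answers)) == 1          # same answer every run?
        print(f"{name}: pass={passes}/{len(cases)} "
              f"consistency={consistent}/{len(cases)} "
              f"p50 latency={statistics.median(latencies) * 1e3:.1f}ms "
              f"cost=${cost:.4f}")

if __name__ == "__main__":
    def fake_model(question: str) -> tuple[str, float]:
        return "yes", 0.0001  # stand-in for a real model client
    compare({"model-a": fake_model, "model-b": fake_model},
            cases=[("Is SOC 2 a cybersecurity framework?", "yes")])
```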
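
The guardrail responsibility can be illustrated with a small regression gate that blocks a release when a workflow's pass rate drops below its last accepted baseline. The workflow names, tolerance, and data shapes are assumptions for the sketch; a real pipeline would load the baseline from storage and the current rates from a fresh eval run:

```python
"""Illustrative regression gate for CI (assumed names and thresholds)."""
import sys

TOLERANCE = 0.02  # allow two points of run-to-run noise before flagging

def gate(baseline: dict[str, float], current: dict[str, float]) -> int:
    # Flag any workflow whose pass rate fell below baseline minus tolerance.
    failed = [wf for wf, rate in current.items()
              if rate < baseline.get(wf, 0.0) - TOLERANCE]
    for wf in failed:
        print(f"REGRESSION {wf}: {current[wf]:.2%} < baseline {baseline[wf]:.2%}")
    return 1 if failed else 0  # nonzero exit fails the CI step

if __name__ == "__main__":
    sys.exit(gate(baseline={"soc2_walkthrough": 0.95},
                  current={"soc2_walkthrough": 0.91}))
```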
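
Finally, the trace-curation loop mentioned above could look roughly like the following: promote production traces that a user flagged and an SME corrected into first-class eval cases. The trace schema (`feedback`, `sme_correction` fields) is invented for the sketch and does not reflect a real tracing backend:

```python
"""Illustrative sketch: turn flagged production traces into eval cases."""
from dataclasses import dataclass

@dataclass
class EvalCase:
    input: str
    expected: str       # corrected output, supplied by an SME during review
    source_trace: str   # trace id, kept for provenance

def curate(traces: list[dict]) -> list[EvalCase]:
    cases = []
    for trace in traces:
        # Promote only traces a user flagged and an SME has corrected.
        if trace.get("feedback") == "thumbs_down" and trace.get("sme_correction"):
            cases.append(EvalCase(trace["input"], trace["sme_correction"], trace["id"]))
    return cases

if __name__ == "__main__":
    traces = [
        {"id": "tr_1", "input": "Summarize control C-12",
         "feedback": "thumbs_down",
         "sme_correction": "C-12 covers quarterly access reviews."},
        {"id": "tr_2", "input": "List in-scope systems",
         "feedback": "thumbs_up"},
    ]
    for case in curate(traces):
        print(case.source_trace, "->", case.input)
```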

Benefits

  • Competitive compensation packages with meaningful ownership
  • Flexible PTO
  • 401k
  • Wellness benefits, including a bundle of free therapy sessions
  • Technology & Work from Home reimbursement
  • Flexible work schedules