About The Position

abra R&D is seeking an AI Evaluation & Reliability Engineer to contribute to the development of a next-generation agentic analytics platform, the first real-time database optimized for AI agents at scale. This is a senior role focused on defining and building the methodologies for measuring, validating, monitoring, and improving AI agents in production environments. The position sits at the intersection of LLM systems, evaluation research, and production-grade engineering. The engineer will be responsible for designing evaluation methodologies, building LLM-as-a-judge systems, and developing agent-based testing frameworks to ensure the correctness, robustness, and reliability of complex multi-agent workflows operating on real-time data.

Requirements

  • 4–8+ years of experience in software engineering, AI systems, or evaluation/QA engineering
  • Strong programming skills in Python
  • Hands-on experience working with LLMs in production environments
  • Experience building evaluation systems, automation frameworks, or testing infrastructure
  • Strong understanding of prompt engineering, tool use, and agent behavior
  • Ability to think in terms of metrics, correctness, and system reliability

Nice To Haves

  • Experience with LLM evaluation frameworks (Opik, LangSmith, etc.)
  • Experience with Google ADK or similar agent frameworks
  • Experience implementing LLM-as-a-judge or ranking systems
  • Background in data systems, analytics, or real-time pipelines
  • Experience with multi-agent systems
  • Familiarity with statistical evaluation methods or experimentation (A/B testing, scoring systems)

Responsibilities

  • Design and implement evaluation frameworks for AI agents and multi-agent systems
  • Build LLM-as-a-judge pipelines to assess correctness, reasoning, and output quality
  • Develop agent-based evaluation systems (agents evaluating agents) for scalable testing
  • Define metrics, benchmarks, scorecards, and methodologies for agent reliability and performance
  • Build data-driven evaluation pipelines using synthetic and real-world datasets
  • Identify and analyze failure modes, edge cases, and non-deterministic behaviors
  • Improve agent robustness, consistency, and reliability in production environments
  • Work with tools such as Google ADK, Opik, and related evaluation frameworks
  • Collaborate closely with AI, platform, and database teams to shape agent–data interaction quality