About The Position

abra R&D is looking for a Senior AI Evaluation & Reliability Engineer to help build its next-generation agentic analytics platform: the first real-time database optimized for AI agents at scale. In this role, you will define and build how AI agents are measured, validated, monitored, and improved in production, sitting at the intersection of LLM systems, evaluation research, and production-grade engineering. You will design evaluation methodologies, build LLM-as-a-judge systems, and develop agent-based testing frameworks that ensure the correctness, robustness, and reliability of complex multi-agent workflows operating on real-time data.

Requirements

  • 4–8+ years of experience in software engineering, AI systems, or evaluation/QA engineering
  • Strong programming skills in Python
  • Hands-on experience working with LLMs in production environments
  • Experience building evaluation systems, automation frameworks, or testing infrastructure
  • Strong understanding of prompt engineering, tool use, and agent behavior
  • Ability to think in terms of metrics, correctness, and system reliability

Nice To Haves

  • Experience with LLM evaluation frameworks (Opik, LangSmith, etc.)
  • Experience with Google ADK / agent frameworks
  • Experience implementing LLM-as-a-judge or ranking systems
  • Background in data systems, analytics, or real-time pipelines
  • Experience with multi-agent systems
  • Familiarity with statistical evaluation methods or experimentation (A/B testing, scoring systems)

Responsibilities

  • Design and implement evaluation frameworks for AI agents and multi-agent systems
  • Build LLM-as-a-judge pipelines to assess correctness, reasoning quality, and output quality
  • Develop agent-based evaluation systems (agents evaluating agents) for scalable testing
  • Define metrics, benchmarks, scorecards, and methodologies for agent reliability and performance
  • Build data-driven evaluation pipelines using synthetic and real-world datasets
  • Identify and analyze failure modes, edge cases, and non-deterministic behaviors
  • Improve agent robustness, consistency, and reliability in production environments
  • Work with tools such as Google ADK, Opik, and related evaluation frameworks
  • Collaborate closely with AI, platform, and database teams to shape agent–data interaction quality
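For candidates unfamiliar with the pattern, a minimal LLM-as-a-judge scoring pipeline of the kind described above might look like the sketch below. This is an illustrative assumption, not abra's actual methodology: `call_judge_model` is a stub standing in for a real LLM API call, and the rubric, 1–5 scale, and pass threshold are hypothetical.

```python
import json
from dataclasses import dataclass

# Hypothetical rubric prompt; a production system would tune this carefully.
JUDGE_PROMPT = """You are an impartial evaluator.
Score the answer from 1 to 5 for correctness and for reasoning quality.
Question: {question}
Answer: {answer}
Respond with JSON: {{"correctness": <int>, "reasoning": <int>}}"""

@dataclass
class JudgeScore:
    correctness: int
    reasoning: int

def call_judge_model(prompt: str) -> str:
    # Stub: in production this would call an LLM API.
    # Returns a fixed verdict here so the sketch is runnable.
    return '{"correctness": 4, "reasoning": 3}'

def judge(question: str, answer: str) -> JudgeScore:
    raw = call_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    data = json.loads(raw)
    # Clamp to the 1-5 scale to guard against malformed judge output.
    return JudgeScore(
        correctness=max(1, min(5, int(data["correctness"]))),
        reasoning=max(1, min(5, int(data["reasoning"]))),
    )

def aggregate(scores: list[JudgeScore]) -> dict:
    # Simple scorecard: per-dimension means plus a pass rate
    # (here, "pass" means correctness >= 4).
    n = len(scores)
    return {
        "mean_correctness": sum(s.correctness for s in scores) / n,
        "mean_reasoning": sum(s.reasoning for s in scores) / n,
        "pass_rate": sum(s.correctness >= 4 for s in scores) / n,
    }
```

A real pipeline would add batched judging over synthetic and production datasets, inter-judge agreement checks, and failure-mode tagging, which is roughly the scope of the responsibilities listed above.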