AI Eval / Testing (Eval Engineer)

NTT DATA Services
Dallas, TX
Onsite

About The Position

We are looking for an AI Evaluation & Test Engineer to ensure generative AI models and applications are safe, accurate, trustworthy, and deliver an elegant user experience. This role validates AI models and agents for accuracy, safety, bias, and performance through structured testing, benchmarking, and continuous evaluation pipelines. The engineer will be responsible for building and maintaining AI evaluation pipelines, implementing traces, spans, and session tracking for observability, defining AI quality metrics and KPIs, implementing evaluation and testing automation, defining and implementing release gates in the CI/CD pipeline, finding creative ways to break products, and assisting in root cause analysis and troubleshooting of bugs and field issues. The role also involves collaborating with cross-functional teammates from product, engineering, linguistics, and customer support to shape human-AI interaction paradigms and ensure desired outcomes and user experiences.

Requirements

  • 5+ years of strong proficiency in Python and testing frameworks like pytest.
  • 5+ years of hands-on experience with evaluation tools like LangSmith, DeepEval, TruLens, or Promptfoo.
  • 3 to 5 years of experience with agentic workflows built on LangChain, CrewAI, or LlamaIndex.
  • Understanding of tracing and session tracking to map how errors propagate in RAG systems.
  • 5+ years of experience applying software testing fundamentals, including writing test plans, executing test cases, and generating detailed reports and dashboards.
  • Strong analytical and debugging skills, and attention to detail.
  • 5+ years of proficiency in Python, scripting, and software testing automation frameworks and tools such as pytest, Selenium, and Robot Framework.
  • Working knowledge of generative AI models, AI agents, and related concepts such as retrieval-augmented generation (RAG), prompt engineering, context engineering, explainability, traceability, observability, guardrails, reasoning, and specificity.
  • Sound understanding of the fundamental differences in the approach for testing conventional software versus evaluating generative AI systems.
  • Team player with excellent interpersonal skills and the ability to collaborate effectively with remote and cross-functional team members.
  • Go-getter attitude and ability to flourish in a fast-paced startup environment.

Nice To Haves

  • Experience with AI evaluation frameworks such as Arize, Braintrust, DeepEval, LangSmith, or Ragas.
  • AI safety and red teaming experience, e.g., prompt injection, jailbreak, adversarial, and stress testing.
  • Familiarity with different AI evaluation methods, e.g., human-in-the-loop and LLM-as-a-judge.

Responsibilities

  • Build and maintain AI evaluation pipelines to test, measure, and evaluate the behavior and performance of AI systems.
  • Implement traces, spans, and session tracking for observability and identify error propagation in multi-step pipelines.
  • Define AI quality metrics and KPIs around factuality, faithfulness, toxicity, grounding precision/recall, latency, cost, etc., with clear acceptance bars.
  • Implement evaluation and testing automation to enable end-to-end system and regression testing at scale.
  • Define criteria for and implement release gates in the CI/CD pipeline.
  • Find creative ways to break products.
  • Assist in root cause analysis and troubleshooting of bugs and field issues.
  • Collaborate with cross-functional teammates from product, engineering, linguistics, and customer support to shape human-AI interaction paradigms and ensure that our AI models and applications deliver the desired outcomes and user experience.
  • Design prompts to manipulate agent behavior, stress-test edge cases, and expose security vulnerabilities (e.g., prompt injection or PII leakage) before deployment.
  • Build and maintain automated regression testing, CI/CD release gates, and testing data sets (golden sets) to measure system drift.
  • Implement "LLM-as-a-judge" frameworks, rule-based checks, and human-in-the-loop scoring rubrics to objectively evaluate open-ended AI outputs.
  • Trace multi-turn conversations and agent tool interactions to diagnose when and why the AI chose the wrong path.
  • Establish and monitor AI KPIs such as factual accuracy, latency, cost, and grounding precision.

Compliance Requirements

  • Compliance with the client’s responsible AI principles and Acceptable Use Policy
  • Adherence to data residency, privacy (GDPR, HIPAA where applicable), and 21 CFR Part 11 controls where in scope
  • Third-party risk assessment and SOC 2 Type II (or equivalent) certification
  • Disclosure of subcontractors and offshore delivery locations
  • Disclosure of model providers, training data practices, and any use of client data for model improvement (opt-out required)