Lead Quality Engineer - AI

Wolters Kluwer · Coppell, TX
Hybrid

About The Position

We are seeking a Lead AI Quality Engineer to ensure the quality, reliability, and trustworthiness of AI-powered product experiences within Wolters Kluwer Tax and Accounting. This role goes beyond validating that buttons click: you will design tests that confirm the system behaves correctly, measuring retrieval accuracy, citation correctness, and overall alignment of responses with user intent. You will be a key contributor in helping us deliver a system customers can trust.

Requirements

  • Bachelor's degree in Computer Science or equivalent
  • 5+ years of experience in software testing, quality engineering, or equivalent engineering roles, with a focus on validation and reliability
  • Experience with AI evaluation frameworks (e.g., LlamaIndex evals, OpenAI Evals, Ragas, TruLens, or custom harnesses)
  • Strong skills in Python testing frameworks (Pytest, unittest, or equivalent)
  • Experience testing web applications and APIs
  • Familiarity with AI/ML or non-deterministic system testing
  • Knowledge of CI/CD pipelines, Git, and automated regression testing
  • Strong analytical skills: able to define metrics and success criteria where outputs aren’t deterministic
  • Comfortable working in a fast-paced Agile environment with weekly sprints, pairing, and close collaboration with PM/UX/Dev

Nice To Haves

  • Knowledge of retrieval-augmented generation (RAG) pipelines
  • Experience with metrics/observability tooling (Grafana, Prometheus, Datadog)
  • Familiarity with containerized environments (Docker, Kubernetes)
  • Exposure to performance/load testing tools (Locust, k6, JMeter)

Responsibilities

  • Design and implement evaluation harnesses to measure retrieval accuracy, citation correctness, response quality, and overall system behavior
  • Develop automated tests for APIs, ingestion pipelines, and chat workflows
  • Collaborate with developers and product managers to define quality metrics (accuracy, latency, cost, hallucination rate)
  • Analyze logs, traces, and feedback signals to identify root causes of failures in AI-driven responses
  • Create regression suites to ensure changes to prompts, chunking, or embeddings don’t break existing behavior
  • Validate REST APIs and service integrations for resilience, correctness, and security
  • Contribute to observability by instrumenting metrics and dashboards for system performance
  • Participate in sprint planning and retrospectives, ensuring testability is built into features from day one

Benefits

  • Medical, Dental, & Vision Plans
  • 401(k)
  • FSA/HSA
  • Commuter Benefits
  • Tuition Assistance Plan
  • Vacation and Sick Time
  • Paid Parental Leave