Senior Software Engineer in Test (AI Agentic Systems)

Collective Health • Plano, TX

1d • $99,200 - $136,400 • Hybrid

About The Position

This is not a traditional QA role. You will be the quality owner for an LLM-based multi-agent pipeline that autonomously adjudicates health insurance claims for self-funded plan sponsors. You will build a Three-Tier Evaluation Framework to ensure our Gemini-powered agents reason correctly, call tools accurately, and produce outcomes that can withstand a Department of Labor (DOL) audit. You will work at the intersection of Vertex AI, healthcare compliance, and high-scale data engineering. Your work directly determines whether claims are paid correctly and whether the company can withstand a DOL or state Department of Insurance (DOI) audit. The stakes are real, the domain is hard, and the problems are genuinely novel.

Requirements

Python SDET Expertise: Expert in Python and pytest, specifically building custom mocking frameworks for external APIs (Vertex AI/ADK).
AI/LLM Observability: Hands-on experience with Vertex AI Experiments, Auto-SxS, and Cloud Logging for trace analysis.
Data Literacy: Expert-level SQL (BigQuery) and Pandas skills to "diff" massive datasets and identify adjudication discrepancies.
Prompt Engineering for QA: Ability to analyze "System Instructions" and refine prompts based on failed test cases to close logic gaps.
Architectural Testing: Experience testing multi-layer systems involving RAG (Vertex AI Search), state management (LangGraph), and function calling.

Nice To Haves

Healthcare/Claims Domain: Familiarity with claims adjudication concepts (pend reason codes, COB, eligibility, stop-loss).
Compliance Knowledge: Understanding of HIPAA/PHI handling and writing test evidence for regulatory bodies (DOL/DOI).
Human-in-the-Loop Testing: Experience in "Shadow Mode" monitoring, that is, comparing agent decisions against human expert (MCA) baselines.

Responsibilities

Outcome Evaluation (The "What"):
  Golden Set Governance: Build and maintain a versioned library of "Grounding Data" results by working with senior claims examiners to define "Ground Truth."
  Model-as-a-Judge Automation: Design automated "LLM-grading-LLM" workflows using custom rubrics to score factual grounding and policy compliance.
  Semantic Assertion Framework: Develop testing libraries that move beyond string matching to validate semantic equivalence and numerical accuracy in agent outputs.
Trajectory Evaluation (The "How"):
  Function-Call Auditing: Use Vertex AI traces to programmatically verify that mandatory tools (via MCP) were invoked with correct arguments.
  Orchestration Logic Validation: Assert that agents respect defined priorities across the four architectural layers: Data & Knowledge, Orchestration, Agentic Reasoning, and Tooling.
  Reasoning Trace Auditing: Ensure every autonomous decision is traceable to a specific SOP sentence and a live API data point.
Continuous Automated Regression (The "Always"):
  CI/CD Integration: Ensure every prompt or model update in Vertex AI Prompt Management triggers an automated regression run.
  Auto-SxS: Own the automated pairwise comparison process to detect logic drift between "New" and "Production" agent versions.
  Mocking & Resilience: Build a Vertex AI/ADK mocking layer to simulate model responses, allowing for thousands of logic tests in seconds with zero API costs.
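
To give a flavor of the work, here is a minimal sketch of how function-call auditing, numerical assertions, and a mocked model layer might fit together in a test. Everything in it is an assumption made for illustration: the ToolCall and AgentResult shapes, the FakeAgent class, and the tool names are hypothetical and do not reflect the actual Vertex AI/ADK trace format or our internal framework.

```python
import math
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One tool invocation recorded in an agent trace (hypothetical shape)."""
    name: str
    args: dict


@dataclass
class AgentResult:
    """Scripted agent output: the tool calls made and the final decision."""
    tool_calls: list
    decision: dict


class FakeAgent:
    """Stand-in for a real model client; returns scripted results, no API calls."""

    def __init__(self, scripted):
        self._scripted = scripted  # maps claim_id -> AgentResult

    def adjudicate(self, claim_id):
        return self._scripted[claim_id]


def missing_tool_calls(result, required):
    """Return the required (name, args) pairs that have no matching call.

    A call matches when its name is equal and every required argument is
    present with the same value; extra arguments on the call are allowed.
    """
    missing = []
    for name, args in required:
        matched = any(
            call.name == name
            and all(k in call.args and call.args[k] == v for k, v in args.items())
            for call in result.tool_calls
        )
        if not matched:
            missing.append((name, args))
    return missing


def amounts_match(actual, expected, tolerance=0.005):
    """Compare dollar amounts numerically instead of by string equality."""
    return math.isclose(actual, expected, abs_tol=tolerance)


# Example usage with a scripted response for one hypothetical claim.
agent = FakeAgent({
    "C-1": AgentResult(
        tool_calls=[ToolCall("check_eligibility", {"member_id": "M-9"})],
        decision={"paid_amount": 120.50},
    )
})
result = agent.adjudicate("C-1")
assert missing_tool_calls(result, [("check_eligibility", {"member_id": "M-9"})]) == []
assert amounts_match(result.decision["paid_amount"], 120.5)
```

Because the fake agent is just a dictionary lookup, checks like these can run in bulk inside pytest without touching a live model. A real mocking layer would also need to replay realistic trace structures, which this sketch does not attempt.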