About The Position

In this role, you will validate AI and ML-powered healthcare solutions across the full development lifecycle to ensure data quality, model performance, reliability, and safe deployment in production environments. You will design and execute data-driven and automated test strategies, including model evaluation, prompt regression testing, dataset profiling, and end-to-end pipeline validation. You will partner with data science, engineering, product, and security teams to define measurable quality gates and deliver compliant, explainable, and dependable AI experiences that drive client value.

Requirements

  • Relevant degree preferred.
  • 2 or more years of relevant experience required.
  • Experience validating ML or generative AI-based applications, including model evaluation and data quality assessment, required.
  • Proficiency in Python, SQL, and test automation frameworks.
  • Experience evaluating LLM systems, including prompt regression testing and automated or human-in-the-loop judging methodologies.
  • Familiarity with RAG evaluation concepts, including retrieval quality, context relevance, faithfulness, and safety testing.
  • Experience designing AI evaluation metrics, including ranking, calibration, and reliability measures.
  • Experience building model monitoring dashboards and production health reporting.
  • Understanding of Agile methodologies and CI/CD practices.
  • Strong analytical, documentation, and communication skills.
  • Self-starter who thrives in a fast-paced, iterative environment and drives quality initiatives end-to-end amid ambiguity and shifting priorities.
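To make the prompt regression testing called out above concrete, here is a minimal sketch of the core check: comparing per-case evaluation scores for a new prompt or model version against a stored baseline and flagging meaningful drops. The case names, scores, and 0.05 tolerance are illustrative assumptions, not part of the posting.

```python
# Minimal prompt regression check: flag any test case whose score
# dropped by more than a tolerance versus the recorded baseline.
from dataclasses import dataclass


@dataclass
class RegressionResult:
    case_id: str
    baseline: float
    current: float
    regressed: bool


def detect_regressions(baseline_scores, current_scores, tolerance=0.05):
    """Compare current scores to baseline; mark cases that degraded."""
    results = []
    for case_id, base in baseline_scores.items():
        cur = current_scores.get(case_id, 0.0)  # missing case counts as a failure
        results.append(RegressionResult(case_id, base, cur, base - cur > tolerance))
    return results


# Hypothetical evaluation scores for two test cases across prompt versions.
baseline = {"triage-note": 0.92, "icd-coding": 0.88}
current = {"triage-note": 0.93, "icd-coding": 0.71}
flagged = [r.case_id for r in detect_regressions(baseline, current) if r.regressed]
print(flagged)  # ['icd-coding'] — its 0.17 drop exceeds the 0.05 tolerance
```

In practice the scores would come from an automated scorer or human-in-the-loop judging run, and the suite would gate the release in CI/CD.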

Responsibilities

  • Develop and execute test strategies for ML and generative AI-powered applications.
  • Design and maintain evaluation frameworks for large language models (LLMs), including automated scoring and LLM-as-a-judge methodologies.
  • Develop prompt regression test suites to detect performance degradation across model and prompt versions.
  • Evaluate generative AI systems for hallucination risk, factual consistency, grounding accuracy, and safety compliance.
  • Conduct model evaluation, regression testing, and drift monitoring in development and production environments.
  • Build dashboards and monitoring tools to detect degraded evaluation scores, drift, or safety risks and support proactive triage.
  • Design and implement proactive AI-driven alerting and recommendation systems embedded within dashboards and user workflows.
  • Automate dashboard metric generation and refresh pipelines using Python and data workflows.
  • Partner with cross-functional teams to define AI quality standards, acceptance criteria, and release gates.
  • Investigate defects, analyze root causes, and recommend corrective actions to improve reliability and performance.
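The drift monitoring responsibility above can be sketched with the population stability index (PSI), one common measure of distribution shift between a reference sample and recent production data. The bin count and the 0.2 alert threshold are conventional illustrative choices, not values specified by this role.

```python
# Illustrative drift check: population stability index (PSI) between a
# reference (e.g. validation) score sample and a production sample.
import math


def psi(reference, production, bins=10):
    """Population stability index between two numeric samples."""
    lo = min(min(reference), min(production))
    hi = max(max(reference), max(production))
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def frac(sample, i):
        # Fraction of the sample falling in bin i (last bin includes hi).
        count = sum(
            1 for x in sample
            if lo + i * width <= x < lo + (i + 1) * width
            or (i == bins - 1 and x == hi)
        )
        return max(count / len(sample), 1e-6)  # floor to avoid log(0)

    return sum(
        (frac(production, i) - frac(reference, i))
        * math.log(frac(production, i) / frac(reference, i))
        for i in range(bins)
    )


ref = [0.1 * i for i in range(100)]   # stable reference scores
same = list(ref)                      # no drift
shifted = [x + 5.0 for x in ref]      # strong distribution shift
print(psi(ref, same) < 0.1)     # True: distributions match
print(psi(ref, shifted) > 0.2)  # True: shift exceeds the alert threshold
```

A production dashboard would compute this on a schedule over model inputs or output scores and raise an alert when the index crosses the chosen threshold.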

Benefits

  • Vizient has a comprehensive benefits plan!
  • Please view our benefits here: http://www.vizientinc.com/about-us/careers
  • This position is also incentive eligible.