Lead QA Engineer (AI & Agentic Systems)

Solstice • New York, NY

3d • $160,000 - $300,000 • Onsite

About The Position

Solstice is redefining how life sciences organizations commercialize their therapeutics by building a commercial engine that allows pharmaceutical marketers to launch campaigns at 100x the speed. We are hiring our first dedicated QA lead to own quality for the AI that powers Solstice. Our platform generates regulated pharmaceutical marketing content for the brands we work with, so when the output is wrong, say an unsupported claim or a missing safety disclosure, it becomes a real compliance problem and not just a bug to file.

This role is about more than the models; it's also about ordinary product reliability, ensuring the app functions as expected and that frontend tweaks or new features don't cause production issues. You'll own end-to-end testing with Playwright and perform hands-on manual testing to catch what automation misses. This is a senior, hands-on engineering job where most of your time will be spent writing code and building test infrastructure, but you'll also engage in manual testing when it's the most efficient way to identify a problem. Your work will span the entire product, from backend services to the customer-facing frontend.

Requirements

Strong Python skills and experience building test infrastructure that runs automatically in CI/CD.
Strong end-to-end and UI test automation experience, particularly with Playwright.
A genuine manual and exploratory QA discipline, including the ability to test features by hand, find edge cases, and own release sign-off.
Experience testing non-deterministic, ML, or LLM-based systems, or the willingness to build this capability from scratch.
Comfort with evaluation methods such as golden datasets, LLM-as-judge (rubric, pairwise, reference-based), and calibrating judges against human or expert labels.
A statistical approach to quality, considering variance, pass@k, and regression detection, rather than simple pass/fail.
An instinct for error analysis, including reading traces, grouping failures, and converting significant issues into permanent tests.
Independent thinking and a strong sense of ownership, with the ability to make decisions and build a quality function from the ground up.
Clear communication skills, including the ability to explain technical risk to non-technical stakeholders and to give and receive direct feedback.
A serious work ethic.

Nice To Haves

Hands-on experience with agent or LLM frameworks such as LangChain or LangGraph, and an understanding of how agentic systems fail.
Experience with eval and LLM-observability tools such as LangSmith, Langfuse, Arize Phoenix, Braintrust, RAGAS, Promptfoo, OpenAI Evals, or similar.
Enough comfort in a modern frontend stack (TypeScript and React) to write meaningful UI tests and quickly reproduce bugs.
Adversarial or red-team testing experience, including identifying outputs that are technically correct but cross safety or regulatory lines.
Backend experience with async Python services and task queues.
Experience in a regulated industry such as pharma, healthcare, or finance.
MLOps or LLMOps experience, including defining quality SLOs.

Responsibilities

Build our evaluation systems to score quality and determine what is good enough to ship, given the probabilistic nature of AI outputs.
Develop tooling to ensure model and prompt changes are safe, flagging drops in quality, cost increases, or latency regressions.
Test agents for common failure modes such as drifting off goals, looping on tool calls, picking the wrong tool, or being hijacked by malicious instructions.
Own the testing of compliance-critical paths, including verifying claims against approved source material.
Build and maintain a Playwright suite for end-to-end testing of user flows, from login to content creation and review.
Perform hands-on manual and exploratory QA to find edge cases and ensure release quality.
Implement CI/CD quality gates, including automated tests, linting, and type checks.
Use production tracing and monitoring to detect drift and turn what you find into new regression tests.
Harden background jobs to ensure they can survive retries, timeouts, and worker crashes without data loss or duplication.
Set the standard for testing practices within the team and help foster a culture of trust in the code.