Member of Technical Staff (QA Engineer - Agentic Systems)

Solstice • New York City, NY

2d • $160,000 - $300,000 • Onsite

About The Position

Solstice is redefining how life sciences organizations commercialize their therapeutics by building a commercial engine that lets pharmaceutical marketers launch campaigns at 100x the speed. We are looking for our first dedicated QA lead to own quality for the AI that powers Solstice. Our platform generates regulated pharmaceutical marketing content, so accuracy and compliance are critical.

The role covers building evaluation systems for probabilistic AI outputs, making model and prompt changes safe, testing for agent failures, and protecting compliance-critical paths. It also covers owning end-to-end testing, performing manual and exploratory QA, implementing CI/CD quality gates, using production as a test bed, hardening background jobs, and setting the overall testing standard for the company. This is a senior, hands-on engineering job that requires both coding and manual testing skills across the entire product stack.

Requirements

Strong Python skills and experience building test infrastructure that runs automatically in CI/CD.
Strong end-to-end and UI test automation skills, especially with Playwright.
Genuine manual and exploratory QA discipline, including owning release sign-off.
Experience testing non-deterministic, ML, or LLM-based systems, or the appetite to build this capability.
Comfort with evaluation methods: golden datasets, LLM-as-judge (rubric, pairwise, reference-based), and calibrating judges.
A statistical way of thinking about quality (variance, pass@k, regression detection).
Instinct for error analysis: reading traces, grouping failures, and turning them into permanent tests.
Independent thinking and a strong sense of ownership.
Clear communication skills, able to explain technical risk to non-technical people.
A serious work ethic.

Nice To Haves

Hands-on experience with agent or LLM frameworks (LangChain, LangGraph).
Experience with eval and LLM-observability tools (LangSmith, Langfuse, Arize Phoenix, Braintrust, RAGAS, Promptfoo, OpenAI Evals, or similar).
Comfort in a modern frontend stack (TypeScript and React).
Adversarial or red-team testing experience.
Backend experience with async Python services and task queues.
Experience in a regulated industry (pharma, healthcare, finance).
MLOps or LLMOps experience, including defining quality SLOs.

Responsibilities

Build our evaluation systems to score quality and determine what is good enough to ship.
Make model and prompt changes safe by flagging drops in quality, cost, or latency.
Test the agents for common failure modes such as drifting from their goals, looping, picking the wrong tool, or being hijacked.
Protect the compliance-critical paths by checking generated content against approved source material for unsupported claims or missing disclosures.
Own end-to-end testing across the app, building and maintaining a Playwright suite for user flows.
Run hands-on manual and exploratory QA to find edge cases and be the last set of eyes before shipping.
Get CI/CD quality gates in place, including automated tests and linting.
Use production signals to create drift detection and new regression tests.
Harden background jobs to ensure they survive retries, timeouts, and worker crashes without data loss or duplication.
Set the testing bar for the company and help the team write trustworthy code.