AI Quality Engineer

Rootly • Office

About The Position

Rootly is building the AI-native future of incident management, and we need someone who can push our AI to its limits before our customers do. As our AI Quality Engineer, you'll own the evaluation and optimization of Rootly's agentic AI features -- designing test scenarios, running adversarial prompts, interpreting outputs, and working directly with engineering and product to close the loop on performance. This isn't traditional QA. You'll spend your days thinking like an attacker, a confused user, and a power user all at once -- probing how our AI agents reason, make decisions, and handle edge cases across complex incident workflows.

Requirements

5+ years in QA, product operations, AI/ML evaluation, or a closely related role
Hands-on experience testing or evaluating LLM-powered or agentic AI products
Strong prompt engineering instincts -- you understand how wording, context, and structure affect model behaviour
Comfortable writing scripts or working with evaluation tools (Python a plus; not required to be a full-stack engineer)
Sharp analytical thinking; you can spot a subtle reasoning failure and articulate exactly why it's a problem
Clear written communicator; able to translate AI behaviour findings for both technical and non-technical audiences

Nice To Haves

Familiarity with incident management, DevOps, or IT operations workflows is a strong asset
Experience with evaluation frameworks (e.g. LangSmith, PromptFlow, Braintrust, or similar)
Exposure to red-teaming or adversarial testing of AI systems
Comfortable writing E2E tests with Playwright
Background working at a B2B SaaS or developer-tools company
Familiar with mobile app testing (iOS/Android)

Responsibilities

Design and execute prompt-based test scenarios that cover happy paths, edge cases, and adversarial inputs across Rootly's agentic AI features
Evaluate AI outputs for accuracy, relevance, consistency, and alignment with expected workflow behaviour
Build and maintain an evaluation framework: structured test libraries, scoring rubrics, and regression suites to track AI performance over time
Identify failure modes, hallucinations, reasoning gaps, and unexpected agent behaviours; document findings and work with engineers to resolve them
Partner with Product and Engineering on new AI feature releases, contributing to acceptance criteria and quality gates before launch
Define and track quality metrics (accuracy rates, failure frequency, regression trends) and report findings to stakeholders
Stay current on LLM evaluation techniques, prompt engineering best practices, and agentic testing methodologies

Benefits

Competitive compensation and early equity in a fast-growing, venture-backed company.
Comprehensive medical, dental, and vision coverage.
3 weeks of vacation, plus unlimited sick and mental health days, and a company-wide end-of-year shutdown to recharge.
$500 stipend for home office setup.
Unlimited token usage and access to AI tools.
