AI Evaluation Engineer

Distyl AI•San Francisco, CA

1d•Hybrid

About The Position

Distyl is an applied AI technology company that partners with ambitious institutions to rearchitect critical operations for the frontier of AI. It researches and deploys technologies that power AI-native operations, spanning research into self-constructing systems, the development of reliable AI system execution, and products that transform mission-critical workflows. Distyl's technologies impact large-scale operations, including consumer interactions, supply chain transactions, and patient journeys. The company is backed by prominent investors, has a 100% production deployment success rate for its customers, and operates as a profitable enterprise AI company.

At Distyl, AI systems are built using Evaluation-Driven Development, where evaluation is the primary mechanism for iterating on, improving, and trusting AI behavior in production. AI Evaluation Engineers are responsible for designing and implementing these evaluation systems. They are hands-on engineers who write production Python code, build evaluation pipelines, and use structured signals to guide system design, prompt iteration, and deployment decisions for customer-facing AI systems. This role is ideal for engineers who believe that AI systems improve when measurement is tightly coupled to development and who want to apply this philosophy to impactful systems.

Requirements

2+ years of software engineering experience
Strong Python Engineering Skills: You write clean, maintainable Python and are comfortable building evaluation and experimentation pipelines that run in production environments. You treat evaluation code with the same rigor as application code
Experience with Evaluation-Driven or Experiment-Driven Development: You have used structured evaluation or experimentation frameworks to drive system iteration, and you understand the pitfalls of overfitting to metrics that don’t reflect real outcomes
Ability to Translate Human Judgment into Code: You work with subject matter experts to elicit high-quality judgments and encode them into test cases, scoring functions, and graders that scale
Systems-Oriented Mindset: You understand how evaluation interacts with prompts, agents, data, and deployment. You design evaluation systems that support fast iteration while maintaining trust and safety in production
AI-Native Working Style: You use AI tools to generate tests, analyze failures, explore edge cases, and accelerate debugging and iteration

Responsibilities

Design and implement evaluation frameworks that enable Evaluation-Driven Development for AI systems deployed in customer environments
Define how system quality is measured in each domain, ensuring that evaluation signals reflect real user needs, domain constraints, and business objectives
Build and maintain golden test cases and regression suites in Python, using both human-authored and AI-assisted test generation to capture critical behaviors and edge cases. These test suites are treated as first-class system components that evolve alongside the AI system itself
Develop and maintain evaluation pipelines, both offline and online, that integrate directly into system iteration loops. Evaluation results inform prompt design, agent logic, model selection, and release readiness, ensuring that system changes are driven by measurable improvements rather than intuition alone
Define, calibrate, and operate LLM-based graders, aligning automated judgments with expert human assessments. Investigate where evaluation signals diverge from real-world outcomes and refine grading approaches to maintain signal quality as systems and domains evolve
Work closely with Forward Deployed AI Engineers, Architects, Product Engineers, AI Strategists, and domain experts to ensure evaluation frameworks meaningfully guide system development and deployment in production

Benefits

Meaningful equity
Comprehensive benefits package
100% covered medical, dental, and vision for employees and dependents
401(k) with additional perks (e.g., commuter benefits, in‑office lunch)
Access to state‑of‑the‑art models, generous usage of modern AI tools, and real‑world business problems
Ownership of high‑impact projects across top enterprises
A mission‑driven, fast‑moving culture that prizes curiosity, pragmatism, and excellence
