AI Engineer, Agents & Evaluation

Guild.ai
San Francisco, CA

About The Position

We are seeking our first AI Engineer specializing in agents and evaluation. This foundational role will shape how we build, measure, and scale intelligent systems. You will write the playbook for high-performance AI agents, tackling the hard problem of helping developers understand, evolve, and operate sophisticated systems built on autonomous, event-driven AI. You will develop the evaluation frameworks, task harnesses, and orchestration strategies that make our agents reliable, testable, and genuinely valuable. Your work will directly improve our agents while also producing reusable benchmarks and artifacts that advance the broader foundation model ecosystem. This role is ideal for someone who excels at designing experiments, building systems, and bridging theory and code in a research-engineering capacity, especially in a 0-to-1 environment.

Requirements

  • MS or Ph.D. in a relevant field (e.g., Computer Science, Machine Learning, NLP) or equivalent practical experience
  • Strong background in machine learning and large language models, with research and hands-on implementation experience
  • 2–5 years working with LLM technology
  • Familiarity with prompting and interaction patterns
  • Familiarity with agent and tool orchestration strategies
  • Familiarity with evaluation strategies for complex, open-ended tasks
  • Proficiency in writing production-quality code, especially in Python
  • Comfort working with TypeScript or modern web/backend stacks
  • Experience designing and running experiments, and interpreting results in real-world settings
  • Self-motivated, comfortable operating in an unstructured, high-ambiguity environment
  • Strong communication skills and ability to translate vague goals into concrete, testable setups

Nice To Haves

  • Experience building agentic systems (tool-using agents, workflows, or multi-agent systems) in real products
  • Prior work on model evaluation frameworks, benchmarking, or reliability/robustness testing
  • Familiarity with modern ML tooling (training/inference stacks, experiment tracking, data pipelines)
  • Contributions to open-source LLM, tooling, or evaluation projects
  • Experience at an early-stage startup or research lab where you owned projects end-to-end

Responsibilities

  • Design and implement task-specific evaluations to measure and improve agent quality, driving concrete iteration and sparking innovation.
  • Define tasks, collect and curate balanced datasets, and build robust evaluation harnesses for use across agents and modeling approaches.
  • Develop and utilize frameworks and tools for running evaluations at scale to tune existing agents and guide the development of new ones.
  • Investigate and implement orchestration patterns (tooling, routing, decomposition, multi-agent setups) for agents to handle complex, multi-step tasks.
  • Experiment with post-training techniques (fine-tuning, preference optimization, reward shaping, distillation) for high-performance models.
  • Design, run, and analyze experiments rigorously, translating results into actionable recommendations for model configurations, prompts, and system design.
  • Collaborate with founders, product, and infrastructure engineers to ensure alignment between evaluations, agents, and platform primitives.

Benefits

  • Significant equity in an early-stage, venture-backed startup
  • Comprehensive Health Benefits (Medical, Dental, Vision)
  • Flexible PTO to ensure you have the time you need to recharge