AI Engineer, Agents & Evaluation

Guild.ai
San Francisco, CA

About The Position

We are seeking our first AI Engineer specializing in agents and evaluation. This foundational role will shape how we build, measure, and scale intelligent systems. You will write the playbook for high-performance AI agents, tackling the hard problem of helping developers understand, evolve, and operate sophisticated systems built on autonomous, event-driven AI. You will develop the evaluation frameworks, task harnesses, and orchestration strategies that make our agents reliable, testable, and genuinely valuable. Your work will directly improve our agents while also producing reusable benchmarks and artifacts that advance the broader foundation model ecosystem. This role is ideal for someone who excels at designing experiments, building systems, and bridging theory and code in a research-engineering capacity, especially in a 0-to-1 environment.

Requirements

  • MS or Ph.D. in a relevant field (e.g., Computer Science, Machine Learning, NLP) or equivalent practical experience
  • Strong background in machine learning and large language models, with research and hands-on implementation experience
  • 2–5 years working with LLM technology
  • Familiarity with prompting and interaction patterns
  • Familiarity with agent and tool orchestration strategies
  • Familiarity with evaluation strategies for complex, open-ended tasks
  • Proficiency in writing production-quality code, especially in Python
  • Comfort working with TypeScript or modern web/backend stacks
  • Experience designing and running experiments, and interpreting results in real-world settings
  • Self-motivated, comfortable operating in an unstructured, high-ambiguity environment
  • Strong communication skills and ability to translate vague goals into concrete, testable setups

Nice To Haves

  • Experience building agentic systems (tool-using agents, workflows, or multi-agent systems) in real products
  • Prior work on model evaluation frameworks, benchmarking, or reliability/robustness testing
  • Familiarity with modern ML tooling (training/inference stacks, experiment tracking, data pipelines)
  • Contributions to open-source LLM, tooling, or evaluation projects
  • Experience at an early-stage startup or research lab where you owned projects end-to-end

Responsibilities

  • Design and implement task-specific evaluations to measure and improve agent quality, driving concrete iteration and sparking innovation.
  • Define tasks, collect and curate balanced datasets, and build robust evaluation harnesses for use across agents and modeling approaches.
  • Develop and utilize frameworks and tools for running evaluations at scale to tune existing agents and guide the development of new ones.
  • Investigate and implement orchestration patterns (tooling, routing, decomposition, multi-agent setups) for agents to handle complex, multi-step tasks.
  • Experiment with post-training techniques (fine-tuning, preference optimization, reward shaping, distillation) for high-performance models.
  • Design, run, and analyze experiments rigorously, translating results into actionable recommendations for model configurations, prompts, and system design.
  • Collaborate with founders, product, and infrastructure engineers to ensure alignment between evaluations, agents, and platform primitives.

Benefits

  • Significant equity in an early-stage, venture-backed startup
  • Comprehensive Health Benefits (Medical, Dental, Vision)
  • Flexible PTO to ensure you have the time you need to recharge