We're building the engine that judges how good our agents actually are. Claims have to be data-driven: you can't build on what you can't see, so how can you honestly say one version is 10% better than the last? Evaluation runs both before and after we ship. This role owns the runtime side: judging agents live in production, from the traces they generate while serving real traffic. The hard part is the data. Agent behaviour generates verbose, high-cardinality traces, and we need a system that can analyze them in real time and deliver actionable insights with low latency. Join us to build it: the engineering looks a lot like site reliability engineering meeting user analytics, combining high-throughput, low-latency data processing with the evaluation of user behaviour and outcomes.
Job Type
Full-time
Career Level
Senior
Education Level
No Education Listed