Machine Learning Engineer, LLM Evals & Observability

Glean · Mountain View, CA
Hybrid

About The Position

Glean is seeking a Machine Learning Engineer focused on LLM Evals & Observability. This role is central to measuring and improving the quality of Glean's AI Assistant and Agents. The team owns evaluation pipelines, quality eval-sets, LLM-powered judges, agent observability, and the tooling engineers use to understand changes and their impact. The role is a blend of infrastructure engineering, applied ML, and direct product impact, aimed at making AI quality measurable and driving improvements.

Requirements

  • 2+ years of software engineering experience with strong coding skills.
  • Strong backend fundamentals in Go and Python.
  • Comfortable with distributed data pipelines.
  • Experience working with LLM evaluation, reinforcement learning from human feedback, natural language processing, or other large systems involving machine learning.
  • Analytical rigor: the ability to reason carefully about what offline metrics predict about real user experience.
  • Ability to thrive in a customer-focused, tight-knit, and cross-functional environment.
  • Team player willing to take on whatever is most impactful for the company.
  • A strong commitment to quality, both in the systems you build and in the product being measured.

Responsibilities

  • Design and curate evaluation datasets, including sampling strategies, query diversity, and golden sets for reliable coverage of real assistant behavior.
  • Build and maintain large-scale evaluation pipelines to measure assistant quality across thousands of real user queries.
  • Develop LLM-powered judges to score metrics like correctness, completeness, and response quality, aligning them with human judgment.
  • Evaluate new models and product changes before shipping, providing quality signals to gate launches and prevent regressions.
  • Build observability infrastructure for AI agents, including trace enrichment, data pipelines, and dashboards for inspectable assistant behavior.
  • Close the loop between quality measurement and improvement using eval results, customer feedback, and techniques like automated prompt iteration.
  • Collaborate with engineers across the company to integrate evals as a first-class part of the shipping process.

Benefits

  • Competitive compensation
  • Medical, Vision, and Dental coverage
  • Generous time-off policy
  • Opportunity to contribute to your 401(k) plan
  • Home office improvement stipend
  • Annual education stipend
  • Annual wellness stipend
  • Healthy lunches daily