RAG and Evaluation Engineer

LTS

Remote

About The Position

LTS is seeking a RAG and Evaluation Engineer to join a small, senior engineering team applying frontier AI to one of the most consequential legacy systems still running in production today. The mission is to build agents that read, translate, and modernize a decades-old codebase that millions of people quietly depend on. The work has executive backing, real users, and a customer who knows exactly what they’re buying. The team is small by design, with every seat carrying unusual leverage. The team hires people who are already deep in this work and use AI tooling natively: agents in parallel, model as collaborator, no exceptions.

Requirements

Bachelor’s degree in Computer Science, Engineering, Information Science, or a related field, plus 4 years of professional software engineering experience; equivalent experience may substitute for the degree requirement.
Has shipped a production RAG system whose quality they can describe in numbers (rigor matters more than scale).
Ability to work in a fast-paced, collaborative environment.
Production experience with retrieval pipelines — ingestion, chunking, embedding, hybrid retrieval, reranking.
Strong applied evaluation skills — benchmark design, regression detection, LLM-as-judge patterns.
Knows when BM25 beats embeddings and when neither is enough.
Measures everything they ship; opinions about chunking are backed by benchmarks.
Patient with detail; comfortable defining metrics before the team has agreed on them.
Heavy native use of AI tooling: agents in parallel, model as collaborator.
Strong TypeScript or Python.
Demonstrated experience in a remote work environment.

Nice To Haves

Code-as-corpus retrieval (search over source code rather than prose).
Applied IR or search-engine background.
Synthetic data generation and LLM-as-judge patterns.
Open-source contributions to retrieval, eval, or RAG tooling.
Experience integrating retrieval feedback loops with production usage.
Healthcare IT or legacy modernization domain experience.
Public technical writing or conference talks on retrieval or evaluation.

Responsibilities

Own the knowledge surface — ingestion pipelines for source code, structured metadata, technical documentation, patches, and additional corpora the customer provides.
Own retrieval quality — chunking, embeddings, hybrid retrieval, reranking, and freshness.
Own the eval harness — benchmarks for translation accuracy, dependency-map correctness, and overall agent quality.
Run A/B testing and regression detection across prompts, retrieval, and model changes.
Operate the feedback loop from production usage back into evals and retrieval.
Define what “good” means for the platform when no one else has a clear view, so the team can tell whether the agent is actually improving.
Pair with the Agent Engineers on the prompt-and-eval iteration cycle.