AI Evaluation Scientist

Steampunk
McLean, VA
$105,000 - $145,000

About The Position

We are looking for an AI Evaluation Scientist to design and execute evaluation processes that ensure our predictive and generative AI systems are accurate, reliable, safe, and aligned with mission requirements. This role is essential for establishing trust in AI solutions and supporting continuous improvement across the AI lifecycle. The AI Evaluation Scientist will work closely with engineers, data scientists, governance analysts, and product teams to develop evaluation metrics, build test harnesses, analyze model behavior, and support responsible deployment. You will contribute to the growth of our AI & Data Exploitation Practice!

Requirements

  • Ability to hold a position of public trust with the U.S. government.
  • Bachelor’s degree in Computer Science, Statistics, Machine Learning, Cognitive Science, Human-Computer Interaction, Data Science, or a related field and 5+ years of experience; OR Master’s degree in Computer Science, Statistics, Machine Learning, Cognitive Science, Human-Computer Interaction, Data Science, or a related field and 3+ years of experience.
  • 2+ years of experience evaluating machine learning models, NLP systems, or generative AI models (LLMs preferred).
  • Familiarity with evaluation metrics, statistical testing, dataset creation, and experimental design for AI systems.
  • Proficiency in Python and relevant libraries such as PyTorch, Hugging Face, scikit-learn, and LangChain.
  • Proficiency in AI evaluation frameworks such as Ragas.
  • Experience analyzing structured and unstructured data, including text, documents, and embeddings.
  • Understanding of LLM behavior, prompt evaluation, retrieval pipelines, or RAG architectures.
  • Exposure to responsible AI concepts and governance-aligned evaluation criteria (e.g., fairness, transparency, reliability).
  • Strong analytical skills with the ability to interpret model weaknesses, extract insights, and recommend actionable improvements.
  • Excellent written and verbal communication skills, with the ability to present evaluation findings clearly to technical and non-technical stakeholders.

Nice To Haves

  • Experience working in agile or iterative development environments.
  • Familiarity with the OWASP Top 10 for LLM Applications.
  • Relevant certifications (helpful but not required): NIST AI RMF (AISIC), INFORMS CAP, AWS/Azure/Google ML certifications.

Responsibilities

  • Implement evaluation frameworks for AI models, including accuracy, robustness, relevance, bias, hallucination rate, and safety metrics.
  • Build and maintain automated evaluation scripts, tests, and pipelines that assess AI model outputs and detect performance drift over time.
  • Develop benchmark datasets, challenge sets, and scenario-based test cases tailored to mission and user needs.
  • Perform structured error analysis and behavioral audits of LLMs, retrieval-augmented generation (RAG) systems, and predictive models, documenting findings and improvement recommendations.
  • Collaborate with AI Developers, LLMOps Engineers, and Data Scientists to support iterative experimentation, model hardening, and quality improvements.
  • Contribute to the design of human-in-the-loop evaluation workflows, integrating qualitative and quantitative insight into evaluation reports.
  • Assist in mapping evaluation outcomes to responsible AI principles such as fairness, transparency, reliability, and safety.
  • Partner with AI Governance Analysts to ensure evaluation outputs support compliance, documentation, and risk assessments.
  • Stay current with emerging evaluation tools, frameworks, metrics, and research related to LLM assessment and generative AI reliability.
  • Document evaluation processes, criteria, and results for both technical and non-technical audiences.