
Evaluating RAG (Retrieval + Answer Quality)

If you don’t evaluate RAG, you end up guessing:

- “Did chunking help?”
- “Is hybrid search better?”
- “Did the prompt change actually improve groundedness?”

Evaluation doesn’t need to be fancy. A small, honest test set plus a repeatable script is enough to drive big improvements.


What to evaluate

Retrieval quality (before the LLM)

Common signals:

- Recall@K: did you retrieve at least one “correct” chunk in the top K?
- MRR (mean reciprocal rank): how early does the first correct result appear?
- Diversity: are the results all from the same document?
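MRR in particular is easy to compute per question: it is the reciprocal of the rank of the first relevant result. A minimal sketch, assuming “relevant” means the retrieved source appears in the labeled expected sources:

```python
def mrr(retrieved_sources: list[str], expected_sources: list[str]) -> float:
    """Reciprocal rank of the first retrieved source that matches a label."""
    expected = {s.lower() for s in expected_sources}
    for rank, source in enumerate(retrieved_sources, start=1):
        if source.lower() in expected:
            return 1.0 / rank
    return 0.0
```

Averaging this over all labeled questions gives the mean reciprocal rank; a correct hit at rank 1 scores 1.0, at rank 2 scores 0.5, and so on.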

Answer quality (after the LLM)

Common signals:

- Groundedness / faithfulness: does the answer only use the context?
- Relevance: does it answer the question?
- Citation correctness: do citations point to supporting chunks?
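Citation correctness has a cheap, deterministic lower bound you can check without an LLM: every source the answer cites should at least appear among the retrieved chunks. A sketch, assuming a `cited_sources` list has already been extracted from the answer upstream:

```python
def citation_precision(cited_sources: list[str], retrieved_sources: list[str]) -> float:
    """Fraction of cited sources that actually appear in the retrieved context."""
    if not cited_sources:
        return 0.0  # an answer with no citations gets no credit
    retrieved = {s.lower() for s in retrieved_sources}
    hits = sum(1 for s in cited_sources if s.lower() in retrieved)
    return hits / len(cited_sources)
```

This only catches hallucinated sources, not misattributed claims; checking that a cited chunk really supports the claim still needs a human or an LLM judge.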


Build a minimal test set (questions.jsonl)

Create a small file, e.g. questions.jsonl:

{"id":"pricing_sso","question":"Which plan supports SSO?","expected_sources":["pricing.md"],"notes":"Should mention Enterprise only."}
{"id":"reset_password","question":"How do I reset my password?","expected_sources":["account.md"]}

Tips:

- Start with 20–50 questions.
- Include both “easy” and “hard” questions.
- Add ambiguous questions you expect users to ask.

If you don’t know the exact chunk IDs yet, start by using expected_sources (file names / URLs) as the supervision signal.


A minimal retrieval eval harness

This script assumes you have a retrieve(question, k) function that returns chunks with a source field.

from __future__ import annotations

import json
from pathlib import Path
from typing import Any


def load_jsonl(path: str) -> list[dict[str, Any]]:
    items: list[dict[str, Any]] = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        items.append(json.loads(line))
    return items


def recall_at_k(retrieved_sources: list[str], expected_sources: list[str]) -> float:
    expected = {s.lower() for s in expected_sources}
    got = {s.lower() for s in retrieved_sources}
    return 1.0 if expected.intersection(got) else 0.0


def evaluate_retrieval(questions_path: str, *, k: int = 8) -> None:
    qs = load_jsonl(questions_path)
    scores: list[float] = []

    for q in qs:
        question = q["question"]
        expected_sources = q.get("expected_sources") or []

        # TODO: replace with your actual retrieval function
        retrieved = retrieve(question, k=k)  # noqa: F821

        retrieved_sources = [c["source"] for c in retrieved]
        if expected_sources:
            score = recall_at_k(retrieved_sources, expected_sources)
            scores.append(score)

        print("Q:", question)
        print("Retrieved sources:", retrieved_sources[:5])
        print("---")

    if scores:
        print(f"Recall@{k}: {sum(scores) / len(scores):.3f} ({len(scores)} labeled questions)")
    else:
        print("No labeled questions found (missing expected_sources).")
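To smoke-test the harness before wiring in a real index, you can drop in a stub retriever. This is a hypothetical stand-in, not a real retrieval implementation — swap it for your actual `retrieve` before trusting any numbers:

```python
def retrieve(question: str, *, k: int = 8) -> list[dict]:
    """Stub retriever: returns canned chunks so the harness runs end to end."""
    canned = [
        {"source": "pricing.md", "text": "SSO is available on the Enterprise plan."},
        {"source": "account.md", "text": "Reset your password from Settings."},
    ]
    return canned[:k]

retrieved = retrieve("Which plan supports SSO?", k=2)
sources = [c["source"] for c in retrieved]
```

With this stub in scope, `evaluate_retrieval("questions.jsonl")` runs end to end, which is useful for catching bugs in the loading and scoring logic itself.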

LLM-as-judge (optional, use carefully)

LLM judges can help you scale qualitative checks (groundedness, citation correctness), but:

- they can be noisy
- they can be biased by your prompt

If you use LLM judges, make the rubric explicit:

Rate 0–2 for each:
1) Groundedness: uses only context
2) Answer relevance: answers the question
3) Citation correctness: citations support claims
Return JSON with the scores and a short reason.
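If the judge returns JSON, parse it defensively: reject malformed output and clamp scores to the rubric’s range so a misbehaving judge cannot skew your averages. A minimal sketch; the key names and the `judge_reply` string are assumptions for illustration:

```python
from __future__ import annotations

import json

EXPECTED_KEYS = ("groundedness", "relevance", "citations")


def parse_judge(reply: str) -> dict[str, int] | None:
    """Parse a judge's JSON reply; return None if malformed or incomplete."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not all(k in data for k in EXPECTED_KEYS):
        return None
    # Clamp each score into the rubric's 0-2 range.
    return {k: max(0, min(2, int(data[k]))) for k in EXPECTED_KEYS}


judge_reply = '{"groundedness": 2, "relevance": 3, "citations": 1, "reason": "ok"}'
scores = parse_judge(judge_reply)
```

Treat `None` results as “needs human review” rather than silently dropping them — a spike in unparseable replies is itself a signal that the judge prompt has drifted.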

Always keep a small set of human-reviewed examples as a reality check.


Iteration loop (the only thing that matters)

1) Change one thing (chunk size, overlap, filters, hybrid search, prompt)
2) Re-run the retrieval eval
3) Spot-check answers on a small subset
4) Keep changes that improve metrics and reduce obvious failures
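Step 4 is easier if you keep per-question scores from each run and diff them, so you see which questions regressed rather than just the average. A minimal sketch, assuming hypothetical run dicts keyed by question id:

```python
def diff_runs(
    baseline: dict[str, float], candidate: dict[str, float]
) -> dict[str, list[str]]:
    """Bucket question ids by whether the candidate run improved or regressed."""
    out: dict[str, list[str]] = {"improved": [], "regressed": [], "unchanged": []}
    for qid, base_score in baseline.items():
        delta = candidate.get(qid, 0.0) - base_score
        key = "improved" if delta > 0 else "regressed" if delta < 0 else "unchanged"
        out[key].append(qid)
    return out


report = diff_runs(
    {"pricing_sso": 0.0, "reset_password": 1.0},
    {"pricing_sso": 1.0, "reset_password": 1.0},
)
```

An average that went up while two previously-passing questions broke is a regression in disguise; the per-question diff is what catches it.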


Next steps