Evaluating RAG (Retrieval + Answer Quality)¶
If you don’t evaluate RAG, you end up guessing:

- “Did chunking help?”
- “Is hybrid search better?”
- “Did the prompt change actually improve groundedness?”
Evaluation doesn’t need to be fancy. A small, honest test set plus a repeatable script is enough to drive big improvements.
What to evaluate¶
Retrieval quality (before the LLM)¶
Common signals:

- Recall@K: did you retrieve at least one “correct” chunk in the top K?
- MRR (mean reciprocal rank): how early does the first correct result appear?
- Diversity: are the results all from the same document?
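As a minimal sketch, the per-question reciprocal rank behind MRR can be computed like this (assuming each retrieved result is represented by its source string):

```python
def reciprocal_rank(retrieved_sources: list[str], expected_sources: list[str]) -> float:
    """1 / rank of the first correct source, or 0.0 if none was retrieved."""
    expected = {s.lower() for s in expected_sources}
    for rank, source in enumerate(retrieved_sources, start=1):
        if source.lower() in expected:
            return 1.0 / rank
    return 0.0


# MRR is simply the mean of reciprocal_rank over all labeled questions.
```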
Answer quality (after the LLM)¶
Common signals:

- Groundedness / faithfulness: does the answer use only the retrieved context?
- Relevance: does it actually answer the question?
- Citation correctness: do citations point to chunks that support the claims?
Build a minimal test set (questions.jsonl)¶
Create a small file, e.g. `questions.jsonl`:

```json
{"id":"pricing_sso","question":"Which plan supports SSO?","expected_sources":["pricing.md"],"notes":"Should mention Enterprise only."}
{"id":"reset_password","question":"How do I reset my password?","expected_sources":["account.md"]}
```
Tips:

- Start with 20–50 questions.
- Include both “easy” and “hard” questions.
- Add ambiguous questions you expect real users to ask.
If you don’t know the exact chunk IDs yet, start by using `expected_sources` (file names / URLs) as the supervision signal.
A minimal retrieval eval harness¶
This script assumes you have a `retrieve(question, k)` function that returns chunks with a `source` field.
```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any


def load_jsonl(path: str) -> list[dict[str, Any]]:
    items: list[dict[str, Any]] = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        items.append(json.loads(line))
    return items


def recall_at_k(retrieved_sources: list[str], expected_sources: list[str]) -> float:
    expected = {s.lower() for s in expected_sources}
    got = {s.lower() for s in retrieved_sources}
    return 1.0 if expected.intersection(got) else 0.0


def evaluate_retrieval(questions_path: str, *, k: int = 8) -> None:
    qs = load_jsonl(questions_path)
    scores: list[float] = []
    for q in qs:
        question = q["question"]
        expected_sources = q.get("expected_sources") or []

        # TODO: replace with your actual retrieval function
        retrieved = retrieve(question, k=k)  # noqa: F821
        retrieved_sources = [c["source"] for c in retrieved]

        if expected_sources:
            score = recall_at_k(retrieved_sources, expected_sources)
            scores.append(score)

        print("Q:", question)
        print("Retrieved sources:", retrieved_sources[:5])
        print("---")

    if scores:
        print(f"Recall@{k}: {sum(scores) / len(scores):.3f} ({len(scores)} labeled questions)")
    else:
        print("No labeled questions found (missing expected_sources).")
```
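To smoke-test the harness before wiring in a real retriever, you can stub `retrieve` with a toy keyword-overlap search over an in-memory corpus. The corpus and scoring below are purely illustrative, not a real retrieval strategy:

```python
# Toy in-memory corpus: source -> text (illustrative only).
CORPUS = {
    "pricing.md": "SSO is available on the Enterprise plan only.",
    "account.md": "To reset your password, use the 'Forgot password' link.",
}


def retrieve(question: str, k: int = 8) -> list[dict[str, str]]:
    """Rank chunks by naive word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        CORPUS.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return [{"source": src, "text": text} for src, text in scored[:k]]
```

With this stub in scope, `evaluate_retrieval("questions.jsonl")` runs end to end, and you can swap in your real retriever without touching the metrics code.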
LLM-as-judge (optional, use carefully)¶
LLM judges can help you scale qualitative checks (groundedness, citations), but:

- they can be noisy
- they can be biased by your prompt
If you use LLM judges, make the rubric explicit:
```text
Rate 0–2 for each:

1) Groundedness: uses only context
2) Answer relevance: answers the question
3) Citation correctness: citations support claims

Return JSON with the scores and a short reason.
```
Always keep a small set of human-reviewed examples as a reality check.
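Since judge output is just model text, validate it before trusting it. A minimal sketch, assuming the judge returns JSON with the three rubric scores (the field names below are an assumption; adapt them to your judge prompt):

```python
import json

# Assumed field names for the three rubric criteria (adapt to your prompt).
RUBRIC_FIELDS = ("groundedness", "relevance", "citation_correctness")


def parse_judge_reply(reply: str) -> dict[str, int]:
    """Parse a judge reply into 0-2 integer scores; raise on anything malformed."""
    data = json.loads(reply)
    scores: dict[str, int] = {}
    for field in RUBRIC_FIELDS:
        value = int(data[field])
        if not 0 <= value <= 2:
            raise ValueError(f"{field} out of range: {value}")
        scores[field] = value
    return scores
```

Rejecting malformed replies outright (rather than guessing a score) keeps judge noise out of your aggregates.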
Iteration loop (the only thing that matters)¶
1) Change one thing (chunk size, overlap, filters, hybrid, prompt)
2) Re-run the retrieval eval
3) Spot-check answers on a small subset
4) Keep changes that improve metrics and reduce obvious failures
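One low-tech way to make this loop stick is to log every eval run with the config that produced it, so you compare runs instead of trusting memory. A sketch (the file name and fields are illustrative):

```python
import json
import time
from pathlib import Path


def log_run(config: dict, metrics: dict, path: str = "eval_runs.jsonl") -> None:
    """Append one eval run (config + metrics) as a JSONL row."""
    row = {"ts": time.time(), "config": config, "metrics": metrics}
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(row) + "\n")
```

After a few iterations, `eval_runs.jsonl` becomes a simple audit trail of which knob changed which metric.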
Next steps¶
- Improve retrieval knobs systematically: Retrieval Strategies
- Use Postgres hybrid retrieval: SQL for RAG