Testing RAG Components¶
RAG systems are easy to build but hard to maintain without tests. Without them, a chunking change breaks retrieval silently, or a prompt tweak tanks answer quality with no warning.
The RAG testing pyramid¶
```
┌─────────────────────────────┐
│ Regression eval (pytest)    │  ← assert recall@8 ≥ threshold
├─────────────────────────────┤
│ Integration tests           │  ← ingest → retrieve round-trip (real DB)
├─────────────────────────────┤
│ Unit tests                  │  ← chunker, formatter (no DB, no API)
└─────────────────────────────┘
```
Start at the bottom. Unit tests are fast and deterministic. Integration tests catch wiring bugs. Regression eval catches quality regressions.
Install¶
```bash
uv pip install pytest pytest-mock psycopg[binary] openai
```
Recommended project layout¶
```
rag-project/
├── rag/
│   ├── __init__.py
│   ├── chunker.py
│   ├── embed.py
│   ├── ingest.py
│   ├── retrieve.py
│   └── generate.py
├── tests/
│   ├── fixtures/
│   │   ├── sample.md         # small test document
│   │   └── questions.jsonl   # labeled eval questions
│   ├── test_chunker.py
│   ├── test_formatter.py
│   ├── test_generate.py
│   ├── test_integration.py
│   └── test_retrieval_eval.py
├── eval.py                   # eval harness (see evaluating-rag.md)
└── .env
```
Unit tests: chunker¶
```python
# tests/test_chunker.py
import pytest

from rag.chunker import chunk_text  # your chunk_text(text, max_tokens, overlap) function


def test_chunk_empty_string():
    assert chunk_text("", max_tokens=200) == []


def test_chunk_respects_max_tokens():
    # 100 words → should be split into multiple chunks at max_tokens=20
    text = " ".join(["word"] * 100)
    chunks = chunk_text(text, max_tokens=20)
    assert len(chunks) > 1
    for chunk in chunks:
        word_count = len(chunk.split())
        assert word_count <= 25  # small buffer for overlap


def test_chunk_overlap():
    # With overlap, adjacent chunks should share some tokens
    text = " ".join([f"w{i}" for i in range(60)])
    chunks = chunk_text(text, max_tokens=20, overlap=5)
    assert len(chunks) >= 2
    # last tokens of chunk[0] should appear in chunk[1]
    end_of_first = chunks[0].split()[-5:]
    start_of_second = chunks[1].split()[:10]
    assert any(t in start_of_second for t in end_of_first)


def test_chunk_single_chunk():
    # Short text should return a single chunk
    text = "This is short."
    chunks = chunk_text(text, max_tokens=200)
    assert len(chunks) == 1
    assert chunks[0] == text
```
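If you don't have a chunker yet, here is a minimal word-based sketch that satisfies the tests above. It assumes the signature `chunk_text(text, max_tokens, overlap=0)` and approximates "tokens" as whitespace-separated words; a real implementation would count tokens with a tokenizer such as tiktoken.

```python
def chunk_text(text: str, max_tokens: int, overlap: int = 0) -> list[str]:
    """Split text into word-based chunks of at most max_tokens words,
    with `overlap` words repeated between adjacent chunks."""
    words = text.split()
    if not words:
        return []
    if len(words) <= max_tokens:
        return [text]  # short text: single chunk, returned verbatim
    step = max_tokens - overlap  # how far the window advances each chunk
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + max_tokens]
        if piece:
            chunks.append(" ".join(piece))
        if start + max_tokens >= len(words):
            break  # last window already covered the tail
    return chunks
```

Note the single-chunk case returns the original string unchanged (not re-joined words), which is what `test_chunk_single_chunk` asserts.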
Unit tests: context formatter¶
```python
# tests/test_formatter.py
from rag.generate import format_context  # your format_context(chunks) function


def test_context_formatter():
    chunks = [
        {"id": 1, "source": "docs/readme.md", "content": "Hello world."},
        {"id": 2, "source": "docs/api.md", "content": "POST /query"},
    ]
    ctx = format_context(chunks)
    assert "chunk_id=1" in ctx
    assert 'source="docs/readme.md"' in ctx
    assert "Hello world." in ctx
    assert "chunk_id=2" in ctx
    assert "BEGIN CONTEXT" in ctx
    assert "END CONTEXT" in ctx


def test_context_formatter_empty():
    ctx = format_context([])
    assert "BEGIN CONTEXT" in ctx
    assert "END CONTEXT" in ctx
```
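A minimal `format_context` sketch that passes these assertions, assuming chunks are dicts with `id`, `source`, and `content` keys (the exact tag format is a design choice, not a requirement):

```python
def format_context(chunks: list[dict]) -> str:
    """Wrap each chunk in a tagged header line between sentinel markers,
    so the model (and the tests) can identify chunk boundaries."""
    lines = ["BEGIN CONTEXT"]
    for c in chunks:
        lines.append(f'[chunk_id={c["id"]} source="{c["source"]}"]')
        lines.append(c["content"])
    lines.append("END CONTEXT")
    return "\n".join(lines)
```

The sentinel lines survive even for an empty chunk list, which is exactly what `test_context_formatter_empty` checks.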
Unit tests: mock OpenAI API¶
Use pytest-mock to avoid real API calls in unit tests:
```python
# tests/test_generate.py
from rag.generate import answer_question


def test_answer_question_calls_openai(mocker):
    mock_create = mocker.patch("rag.generate.client.chat.completions.create")
    mock_create.return_value.choices = [
        mocker.Mock(message=mocker.Mock(content="The answer is 42."))
    ]
    chunks = [{"id": 1, "source": "docs/faq.md", "content": "The answer is 42."}]
    result = answer_question("What is the answer?", chunks=chunks)
    assert result == "The answer is 42."
    mock_create.assert_called_once()


def test_answer_question_passes_context_to_prompt(mocker):
    mock_create = mocker.patch("rag.generate.client.chat.completions.create")
    mock_create.return_value.choices = [
        mocker.Mock(message=mocker.Mock(content="Answer."))
    ]
    chunks = [{"id": 7, "source": "pricing.md", "content": "Enterprise costs $500/mo."}]
    answer_question("How much is Enterprise?", chunks=chunks)
    # Verify the call included the chunk content in the prompt
    call_args = mock_create.call_args
    messages = call_args.kwargs["messages"]
    user_content = next(m["content"] for m in messages if m["role"] == "user")
    assert "Enterprise costs $500/mo." in user_content
```
Integration test: ingest → retrieve round-trip¶
This test requires a real pgvector instance (use Docker):
```bash
docker run -d --name pgvector-test \
  -e POSTGRES_PASSWORD=password -e POSTGRES_DB=ragtest \
  -p 5433:5432 pgvector/pgvector:pg16
```
```python
# tests/test_integration.py
import psycopg
import pytest

from rag.ingest import ingest_document  # your ingest function
from rag.retrieve import retrieve  # your retrieve function

TEST_DB = "postgresql://postgres:password@localhost:5433/ragtest"


@pytest.fixture(scope="module")
def db():
    """Create schema and tear down after tests."""
    with psycopg.connect(TEST_DB) as conn:
        conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
        conn.execute("""
            CREATE TABLE IF NOT EXISTS chunks (
                id SERIAL PRIMARY KEY,
                source TEXT,
                content TEXT,
                embedding VECTOR(1536)
            )
        """)
        conn.commit()
        yield conn
        conn.execute("DROP TABLE IF EXISTS chunks")
        conn.commit()


def test_ingest_and_retrieve(db):
    """Ingest a document and verify retrieval finds the right chunk."""
    ingest_document(
        content="Enterprise plan includes SSO and advanced security.",
        source="pricing.md",
        conn=db,
    )
    chunks = retrieve("Which plan includes SSO?", k=3, conn=db)
    assert len(chunks) > 0
    sources = [c["source"] for c in chunks]
    assert "pricing.md" in sources
```
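A possible shape for the `retrieve` function exercised above, using pgvector's `<=>` cosine-distance operator against the `chunks` table from the fixture. This is a sketch, not the article's implementation: the `embed` parameter is an assumption added for testability (the real module may call the embeddings API internally, and the real signature takes only `question`, `k`, and `conn`).

```python
def retrieve(question: str, k: int, conn, embed) -> list[dict]:
    """Embed the question and return the k nearest chunks by cosine distance."""
    qvec = embed(question)  # e.g. a 1536-dim list from text-embedding-3-small
    rows = conn.execute(
        """
        SELECT id, source, content
        FROM chunks
        ORDER BY embedding <=> %s::vector  -- pgvector cosine distance
        LIMIT %s
        """,
        (str(qvec), k),
    ).fetchall()
    return [{"id": r[0], "source": r[1], "content": r[2]} for r in rows]
```

Injecting `embed` lets unit tests stub the embedding call and a fake connection, so only the integration test ever touches the real database and API.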
Real embeddings in integration tests
Integration tests call the real OpenAI API. Set OPENAI_API_KEY in your .env and run infrequently (not on every commit). Use the unit tests (with mocking) for fast CI.
Regression eval as a pytest test¶
Gate retrieval quality in CI. This test will fail if recall drops below your threshold:
```python
# tests/test_retrieval_eval.py
import pytest

from eval import load_jsonl, recall_at_k  # the eval harness from evaluating-rag.md
from rag.retrieve import retrieve

QUESTIONS_PATH = "tests/fixtures/questions.jsonl"
RECALL_THRESHOLD = 0.80


@pytest.mark.eval
def test_recall_at_8():
    """Recall@8 must stay above threshold."""
    qs = load_jsonl(QUESTIONS_PATH)
    labeled = [q for q in qs if q.get("expected_sources")]
    if not labeled:
        pytest.skip("No labeled questions found in fixtures.")
    scores = []
    for q in labeled:
        chunks = retrieve(q["question"], k=8)
        sources = [c["source"] for c in chunks]
        scores.append(recall_at_k(sources, q["expected_sources"]))
    recall = sum(scores) / len(scores)
    assert recall >= RECALL_THRESHOLD, (
        f"Recall@8 = {recall:.3f} dropped below threshold {RECALL_THRESHOLD}. "
        "Check recent changes to chunking, embedding model, or retrieval config."
    )
```
Run only the eval tests separately (they're slow):
```bash
uv run pytest tests/test_retrieval_eval.py -v -m eval
```
The test is already decorated with @pytest.mark.eval; register that marker in pytest.ini so fast CI can deselect it:
```ini
# pytest.ini
[pytest]
markers =
    eval: slow evaluation tests that require a database and OpenAI API
```
```bash
# Fast CI (no eval)
uv run pytest tests/ -v -m "not eval"

# Full CI including eval
uv run pytest tests/ -v
```
Summary¶
| Test type | Speed | Cost | When to run |
|---|---|---|---|
| Unit tests | Fast (<1s) | Free (no API) | Every commit |
| Integration tests | Medium (5–30s) | Minimal API cost | PR merge |
| Regression eval | Slow (minutes) | Real API cost | Daily / release |
Next steps¶
- Add structured logging to each test run: Monitoring & Observability
- Improve what the eval tests measure: Evaluating RAG