Testing RAG Components¶
RAG systems are easy to build but hard to maintain without tests. Without them, a chunking change breaks retrieval silently, or a prompt tweak tanks answer quality with no warning.
The RAG testing pyramid¶
```
┌─────────────────────────────┐
│ Regression eval (pytest)    │  ← assert recall@8 ≥ threshold
├─────────────────────────────┤
│ Integration tests           │  ← ingest → retrieve round-trip (real DB)
├─────────────────────────────┤
│ Unit tests                  │  ← chunker, formatter (no DB, no API)
└─────────────────────────────┘
```
Start at the bottom. Unit tests are fast and deterministic. Integration tests catch wiring bugs. Regression eval catches quality regressions.
Install¶
```bash
uv pip install pytest pytest-mock psycopg[binary] openai
```
Recommended project layout¶
```
rag-project/
├── rag/
│   ├── __init__.py
│   ├── chunker.py
│   ├── embed.py
│   ├── ingest.py
│   ├── retrieve.py
│   └── generate.py
├── tests/
│   ├── fixtures/
│   │   ├── sample.md         # small test document
│   │   └── questions.jsonl   # labeled eval questions
│   ├── test_chunker.py
│   ├── test_formatter.py
│   ├── test_generate.py
│   ├── test_integration.py
│   └── test_retrieval_eval.py
├── eval.py                   # eval harness (see evaluating-rag.md)
└── .env
```
Unit tests: chunker¶
```python
# tests/test_chunker.py
import pytest

from rag.chunker import chunk_text  # your chunk_text(text, max_tokens, overlap) function


def test_chunk_empty_string():
    assert chunk_text("", max_tokens=200) == []


def test_chunk_respects_max_tokens():
    # 100 words → should be split into multiple chunks at max_tokens=20
    text = " ".join(["word"] * 100)
    chunks = chunk_text(text, max_tokens=20)
    assert len(chunks) > 1
    for chunk in chunks:
        word_count = len(chunk.split())
        assert word_count <= 25  # small buffer for overlap


def test_chunk_overlap():
    # With overlap, adjacent chunks should share some tokens
    text = " ".join([f"w{i}" for i in range(60)])
    chunks = chunk_text(text, max_tokens=20, overlap=5)
    assert len(chunks) >= 2
    # last tokens of chunk[0] should appear in chunk[1]
    end_of_first = chunks[0].split()[-5:]
    start_of_second = chunks[1].split()[:10]
    assert any(t in start_of_second for t in end_of_first)


def test_chunk_single_chunk():
    # Short text should return a single chunk
    text = "This is short."
    chunks = chunk_text(text, max_tokens=200)
    assert len(chunks) == 1
    assert chunks[0] == text
```
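If you don't have a chunker yet, here is a minimal word-based sketch that satisfies the tests above. It assumes the signature `chunk_text(text, max_tokens, overlap=0)` and approximates "tokens" as whitespace-separated words; a real implementation would count tokens with a tokenizer such as tiktoken.

```python
def chunk_text(text: str, max_tokens: int, overlap: int = 0) -> list[str]:
    """Split text into word-based chunks of at most max_tokens words,
    with `overlap` words repeated between adjacent chunks."""
    words = text.split()
    if not words:
        return []
    if len(words) <= max_tokens:
        return [text]  # short text: single chunk, returned verbatim
    step = max_tokens - overlap  # how far the window advances each chunk
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + max_tokens]
        if piece:
            chunks.append(" ".join(piece))
        if start + max_tokens >= len(words):
            break  # last window already covered the tail
    return chunks
```

Note the single-chunk case returns the original string unchanged (not re-joined words), which is what `test_chunk_single_chunk` asserts.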
Unit tests: context formatter¶
```python
# tests/test_formatter.py
from rag.generate import format_context  # your format_context(chunks) function


def test_context_formatter():
    chunks = [
        {"id": 1, "source": "docs/readme.md", "content": "Hello world."},
        {"id": 2, "source": "docs/api.md", "content": "POST /query"},
    ]
    ctx = format_context(chunks)
    assert "chunk_id=1" in ctx
    assert 'source="docs/readme.md"' in ctx
    assert "Hello world." in ctx
    assert "chunk_id=2" in ctx
    assert "BEGIN CONTEXT" in ctx
    assert "END CONTEXT" in ctx


def test_context_formatter_empty():
    ctx = format_context([])
    assert "BEGIN CONTEXT" in ctx
    assert "END CONTEXT" in ctx
```
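A minimal `format_context` sketch that passes these assertions, assuming chunks are dicts with `id`, `source`, and `content` keys (the exact tag format is a design choice, not a requirement):

```python
def format_context(chunks: list[dict]) -> str:
    """Wrap each chunk in a tagged header line between sentinel markers,
    so the model (and the tests) can identify chunk boundaries."""
    lines = ["BEGIN CONTEXT"]
    for c in chunks:
        lines.append(f'[chunk_id={c["id"]} source="{c["source"]}"]')
        lines.append(c["content"])
    lines.append("END CONTEXT")
    return "\n".join(lines)
```

The sentinel lines survive even for an empty chunk list, which is exactly what `test_context_formatter_empty` checks.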
Unit tests: mock OpenAI API¶
Use pytest-mock to avoid real API calls in unit tests:
```python
# tests/test_generate.py
from rag.generate import answer_question


def test_answer_question_calls_openai(mocker):
    mock_create = mocker.patch("rag.generate.client.chat.completions.create")
    mock_create.return_value.choices = [
        mocker.Mock(message=mocker.Mock(content="The answer is 42."))
    ]
    chunks = [{"id": 1, "source": "docs/faq.md", "content": "The answer is 42."}]
    result = answer_question("What is the answer?", chunks=chunks)
    assert result == "The answer is 42."
    mock_create.assert_called_once()


def test_answer_question_passes_context_to_prompt(mocker):
    mock_create = mocker.patch("rag.generate.client.chat.completions.create")
    mock_create.return_value.choices = [
        mocker.Mock(message=mocker.Mock(content="Answer."))
    ]
    chunks = [{"id": 7, "source": "pricing.md", "content": "Enterprise costs $500/mo."}]
    answer_question("How much is Enterprise?", chunks=chunks)
    # Verify the call included the chunk content in the prompt
    call_args = mock_create.call_args
    messages = call_args.kwargs["messages"]
    user_content = next(m["content"] for m in messages if m["role"] == "user")
    assert "Enterprise costs $500/mo." in user_content
```
Integration test: ingest → retrieve round-trip¶
This test requires a real pgvector instance (use Docker):
```bash
docker run -d --name pgvector-test \
  -e POSTGRES_PASSWORD=password -e POSTGRES_DB=ragtest \
  -p 5433:5432 pgvector/pgvector:pg16
```
```python
# tests/test_integration.py
import psycopg
import pytest

from rag.ingest import ingest_document  # your ingest function
from rag.retrieve import retrieve  # your retrieve function

TEST_DB = "postgresql://postgres:password@localhost:5433/ragtest"


@pytest.fixture(scope="module")
def db():
    """Create schema and tear down after tests."""
    with psycopg.connect(TEST_DB) as conn:
        conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
        conn.execute("""
            CREATE TABLE IF NOT EXISTS chunks (
                id SERIAL PRIMARY KEY,
                source TEXT,
                content TEXT,
                embedding VECTOR(1536)
            )
        """)
        conn.commit()
        yield conn
        conn.execute("DROP TABLE IF EXISTS chunks")
        conn.commit()


def test_ingest_and_retrieve(db):
    """Ingest a document and verify retrieval finds the right chunk."""
    ingest_document(
        content="Enterprise plan includes SSO and advanced security.",
        source="pricing.md",
        conn=db,
    )
    chunks = retrieve("Which plan includes SSO?", k=3, conn=db)
    assert len(chunks) > 0
    sources = [c["source"] for c in chunks]
    assert "pricing.md" in sources
```
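A possible shape for the `retrieve` function exercised above, using pgvector's `<=>` cosine-distance operator against the `chunks` table from the fixture. This is a sketch, not the article's implementation: the `embed` parameter is an assumption added for testability (the real module may call the embeddings API internally, and the real signature takes only `question`, `k`, and `conn`).

```python
def retrieve(question: str, k: int, conn, embed) -> list[dict]:
    """Embed the question and return the k nearest chunks by cosine distance."""
    qvec = embed(question)  # e.g. a 1536-dim list from text-embedding-3-small
    rows = conn.execute(
        """
        SELECT id, source, content
        FROM chunks
        ORDER BY embedding <=> %s::vector  -- pgvector cosine distance
        LIMIT %s
        """,
        (str(qvec), k),
    ).fetchall()
    return [{"id": r[0], "source": r[1], "content": r[2]} for r in rows]
```

Injecting `embed` lets unit tests stub the embedding call and a fake connection, so only the integration test ever touches the real database and API.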
Real embeddings in integration tests
Integration tests call the real OpenAI API. Set OPENAI_API_KEY in your .env and run infrequently (not on every commit). Use the unit tests (with mocking) for fast CI.
Regression eval as a pytest test¶
Gate retrieval quality in CI. This test will fail if recall drops below your threshold:
```python
# tests/test_retrieval_eval.py
import pytest

from eval import load_jsonl, recall_at_k  # the eval harness from evaluating-rag.md
from rag.retrieve import retrieve

QUESTIONS_PATH = "tests/fixtures/questions.jsonl"
RECALL_THRESHOLD = 0.80


@pytest.mark.eval
def test_recall_at_8():
    """Recall@8 must stay above threshold."""
    qs = load_jsonl(QUESTIONS_PATH)
    labeled = [q for q in qs if q.get("expected_sources")]
    if not labeled:
        pytest.skip("No labeled questions found in fixtures.")
    scores = []
    for q in labeled:
        chunks = retrieve(q["question"], k=8)
        sources = [c["source"] for c in chunks]
        scores.append(recall_at_k(sources, q["expected_sources"]))
    recall = sum(scores) / len(scores)
    assert recall >= RECALL_THRESHOLD, (
        f"Recall@8 = {recall:.3f} dropped below threshold {RECALL_THRESHOLD}. "
        "Check recent changes to chunking, embedding model, or retrieval config."
    )
```
Run only the eval tests separately (they're slow):
```bash
uv run pytest tests/test_retrieval_eval.py -v -m eval
```
The test is already decorated with @pytest.mark.eval; register that marker in pytest.ini so fast CI can deselect it:
```ini
# pytest.ini
[pytest]
markers =
    eval: slow evaluation tests that require a database and OpenAI API
```
```bash
# Fast CI (no eval)
uv run pytest tests/ -v -m "not eval"

# Full CI including eval
uv run pytest tests/ -v
```
Summary¶
| Test type | Speed | Cost | When to run |
|---|---|---|---|
| Unit tests | Fast (<1s) | Free (no API) | Every commit |
| Integration tests | Medium (5–30s) | Minimal API cost | PR merge |
| Regression eval | Slow (minutes) | Real API cost | Daily / release |
Next steps¶
- Add structured logging to each test run: Monitoring & Observability
- Improve what the eval tests measure: Evaluating RAG