Retrieval Strategies (Beyond Top‑K)¶
Most RAG systems start with:
1) embed the query
2) retrieve top‑K chunks by vector similarity
3) paste those chunks into a prompt
That works, but “top‑K” alone often fails in predictable ways:

- you get many chunks from the same doc (low diversity)
- you miss exact keywords (IDs, error codes)
- you retrieve near-duplicates (wasted context)
- you retrieve the right chunks but in the wrong order or missing context
This page gives a practical “retrieval ladder” you can climb as your system grows.
The retrieval ladder¶
1) Top‑K dense retrieval (baseline)
2) Metadata filters (scope the search)
3) Diversity (MMR / per-doc caps) (reduce duplicates)
4) Hybrid retrieval (FTS + vectors) (catch exact keywords)
5) Reranking (sort candidates by a stronger model)
6) Query rewriting / multi-query (improve recall)
You don’t need all of them. Add the next rung only when you see a clear failure mode.
1) Top‑K dense retrieval¶
Baseline pgvector query:
SELECT id, doc_id, section_path, content
FROM chunks
ORDER BY embedding <=> $1::vector
LIMIT 8;
Good defaults
- start with k=6–10
- keep chunks ~200–400 tokens (see: Chunking)
2) Metadata filters (huge win)¶
Filters improve both quality and latency because you search a smaller space.
Examples:
-- Only retrieve from a specific product
SELECT id, content
FROM chunks
WHERE metadata @> '{"product":"enterprise"}'::jsonb
ORDER BY embedding <=> $1::vector
LIMIT 8;
-- Only retrieve from a subset of sources
SELECT id, content
FROM chunks
WHERE metadata->>'source_type' = 'docs'
ORDER BY embedding <=> $1::vector
LIMIT 8;
3) Diversity: per-doc caps and MMR¶
Per-document cap (simple and effective)¶
If your top results come from one long document, cap results per doc_id.
One SQL approach:
WITH ranked AS (
  SELECT
    id,
    doc_id,
    content,
    row_number() OVER (PARTITION BY doc_id ORDER BY embedding <=> $1::vector) AS doc_rnk,
    embedding <=> $1::vector AS distance
  FROM chunks
)
SELECT id, doc_id, content
FROM ranked
WHERE doc_rnk <= 2
ORDER BY distance
LIMIT 8;
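The same cap also works as a post-processing step in application code, which is handy when you fetch a wide candidate set and trim it locally. A minimal sketch, assuming each hit is a `(chunk_id, doc_id, distance)` tuple already sorted by ascending distance:

```python
from collections import defaultdict

def cap_per_doc(hits, max_per_doc=2, k=8):
    """Keep at most `max_per_doc` chunks from any one document.

    `hits` must already be sorted best-first (ascending distance),
    e.g. the raw top-50 from a vector search; returns up to `k` survivors.
    """
    seen = defaultdict(int)
    out = []
    for chunk_id, doc_id, distance in hits:
        if seen[doc_id] < max_per_doc:
            seen[doc_id] += 1
            out.append((chunk_id, doc_id, distance))
        if len(out) == k:
            break
    return out
```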
MMR (Maximal Marginal Relevance)¶
MMR trades off:

- relevance to the query
- novelty vs. already-selected chunks
MMR is typically done in application code (not SQL). High-level pseudocode:
selected = []
while len(selected) < k:
    pick chunk that maximizes:
        lambda * sim(query, chunk) - (1 - lambda) * max(sim(chunk, s) for s in selected)
Start with lambda=0.7.
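The pseudocode above translates directly to code. A minimal sketch using cosine similarity over plain Python lists (no external dependencies; a real system would use your stored embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / ((na * nb) or 1.0)  # guard against zero vectors

def mmr(query_vec, chunk_vecs, k=8, lam=0.7):
    """Greedy MMR: balance query relevance against novelty.

    Returns indices into `chunk_vecs` in selection order.
    """
    selected = []
    candidates = list(range(len(chunk_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, chunk_vecs[i])
            redundancy = max(
                (cosine(chunk_vecs[i], chunk_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With a lower lambda, a near-duplicate of an already-selected chunk loses to a less relevant but novel one.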
4) Hybrid retrieval (dense + keyword)¶
Use hybrid retrieval when users type:

- exact names (“OAuth”, “SAML”, “HIPAA”)
- error codes (“E11000”, “403”)
- IDs (“INV-1029”)
Postgres makes hybrid retrieval easy: run full-text search and vector search separately, then fuse the two ranked lists. A strong default is RRF (Reciprocal Rank Fusion).
See: SQL for RAG
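If you'd rather fuse in application code than in SQL, RRF is only a few lines. A minimal sketch (k=60 is the conventional smoothing constant from the original RRF formulation):

```python
def rrf_fuse(rankings, k=60, top_n=8):
    """Merge several best-first ranked lists of chunk ids with RRF.

    A chunk's fused score is the sum of 1 / (k + rank) over every
    list it appears in, so items ranked well by multiple retrievers
    float to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A chunk that appears in both the FTS and vector lists beats one that appears in only one, even if neither of its individual ranks is best.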
5) Reranking¶
Vector similarity is a fast filter, not a perfect ranker.
A common pattern:

1) retrieve top 50–200 candidates (fast)
2) rerank to top 6–10 using a stronger model (cross-encoder or LLM)
Reranking helps when:

- the correct chunk is in the candidate set but not in the top‑K
- your content is repetitive and hard to distinguish
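The two-stage pattern is mostly glue code. A minimal sketch where `score_fn` is a stand-in (an assumption here) for whatever stronger model you plug in, e.g. a cross-encoder's `predict` or an LLM judge:

```python
def rerank(query, candidates, score_fn, top_k=8):
    """Stage 2 of retrieve-then-rerank.

    `candidates` is a wide list of (chunk_id, text) pairs from fast
    retrieval; `score_fn(query, text)` returns a relevance score,
    higher is better. Returns the `top_k` best pairs.
    """
    scored = [(score_fn(query, text), chunk_id, text) for chunk_id, text in candidates]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(chunk_id, text) for _, chunk_id, text in scored[:top_k]]
```

Because the candidate set is small (50–200 items), even a slow cross-encoder call per candidate keeps total latency manageable.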
6) Query rewriting / multi-query¶
If the user query is ambiguous or too short, rewrite it.
Two practical patterns:

- Rewrite: “Expand this query into a search query for our docs…”
- Multi-query: generate 3–5 variants, retrieve for each, then merge + dedupe
Only do this when you see recall issues; it increases latency and token usage.
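Generating the variants is an LLM call (not shown); the merge + dedupe step is plain code. A minimal sketch that round-robins across the per-variant result lists so every variant contributes, deduplicating by chunk id:

```python
from itertools import zip_longest

def merge_multi_query(results_per_variant, top_n=8):
    """Merge hits from several query variants, deduplicating by chunk id.

    Each element of `results_per_variant` is a best-first list of
    (chunk_id, content) pairs for one rewritten query. Interleaving
    round-robin keeps one variant's long list from crowding out the rest.
    """
    seen = set()
    merged = []
    for round_hits in zip_longest(*results_per_variant):
        for hit in round_hits:
            if hit is None:
                continue  # this variant's list is exhausted
            chunk_id, content = hit
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append((chunk_id, content))
    return merged[:top_n]
```

RRF fusion (above) is a reasonable alternative merge strategy here too.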
Debug checklist (“why does retrieval look wrong?”)¶
1) Inspect the top retrieved chunks (before prompting).
2) Check if parsing added boilerplate noise.
3) Check chunk size: too small/too big?
4) Add metadata filters (product/version/source).
5) Add per-doc caps or MMR if results are duplicates.
6) Add hybrid retrieval if exact keywords are missed.
7) Consider reranking if “almost right” results keep showing up.
Next steps¶
- Make answers grounded and cite sources: Prompt Engineering for RAG
- Measure improvements objectively: Evaluating RAG