
Retrieval Strategies (Beyond Top‑K)

Most RAG systems start with:

1) embed the query
2) retrieve top‑K chunks by vector similarity
3) paste those chunks into a prompt

That works, but “top‑K” alone often fails in predictable ways:

- you get many chunks from the same doc (low diversity)
- you miss exact keywords (IDs, error codes)
- you retrieve near-duplicates (wasted context)
- you retrieve the right chunks, but in the wrong order or stripped of the surrounding context they need

This page gives a practical “retrieval ladder” you can climb as your system grows.


The retrieval ladder

1) Top‑K dense retrieval (baseline)
2) Metadata filters (scope the search)
3) Diversity (MMR / per-doc caps) (reduce duplicates)
4) Hybrid retrieval (FTS + vectors) (catch exact keywords)
5) Reranking (sort candidates by a stronger model)
6) Query rewriting / multi-query (improve recall)

You don’t need all of them. Add the next rung only when you see a clear failure mode.


1) Top‑K dense retrieval

Baseline pgvector query:

```sql
SELECT id, doc_id, section_path, content
FROM chunks
ORDER BY embedding <=> $1::vector
LIMIT 8;
```

Good defaults:

- start with k=6–10
- keep chunks ~200–400 tokens (see: Chunking)


2) Metadata filters (huge win)

Filters improve both quality and latency because you search a smaller space.

Examples:

```sql
-- Only retrieve from a specific product
SELECT id, content
FROM chunks
WHERE metadata @> '{"product":"enterprise"}'::jsonb
ORDER BY embedding <=> $1::vector
LIMIT 8;
```

```sql
-- Only retrieve from a subset of sources
SELECT id, content
FROM chunks
WHERE metadata->>'source_type' = 'docs'
ORDER BY embedding <=> $1::vector
LIMIT 8;
```

3) Diversity: per-doc caps and MMR

Per-document cap (simple and effective)

If your top results come from one long document, cap results per doc_id.

One SQL approach:

```sql
WITH ranked AS (
  SELECT
    id,
    doc_id,
    content,
    row_number() OVER (PARTITION BY doc_id ORDER BY embedding <=> $1::vector) AS doc_rnk,
    embedding <=> $1::vector AS distance
  FROM chunks
)
SELECT id, doc_id, content
FROM ranked
WHERE doc_rnk <= 2
ORDER BY distance
LIMIT 8;
```

MMR (Maximal Marginal Relevance)

MMR trades off:

- relevance to the query
- novelty vs. already-selected chunks

MMR is typically done in application code (not SQL). High-level pseudocode:

selected = []
while len(selected) < k:
  pick the candidate chunk that maximizes:
    lambda * sim(query, chunk) - (1 - lambda) * max(sim(chunk, s) for s in selected)
  (treat the max over an empty selected set as 0)

Start with lambda=0.7.
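The pseudocode above can be made concrete in a few lines of pure Python. A minimal sketch — `mmr`, `cosine`, and the `(chunk_id, embedding)` candidate shape are illustrative choices, not part of any particular framework:

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def mmr(query_vec, candidates, k=8, lam=0.7):
    """Greedily select k chunks, balancing query relevance and novelty.

    candidates: list of (chunk_id, embedding) pairs from a wide retrieval.
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        def score(item):
            _, emb = item
            relevance = cosine(query_vec, emb)
            # Redundancy vs. chunks already picked; 0 on the first pick.
            redundancy = max((cosine(emb, s_emb) for _, s_emb in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [cid for cid, _ in selected]
```

Retrieve more candidates than you need (say 30–50), then run MMR down to your final k.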


4) Hybrid retrieval (dense + keyword)

Use hybrid retrieval when users type:

- exact names (“OAuth”, “SAML”, “HIPAA”)
- error codes (“E11000”, “403”)
- IDs (“INV-1029”)

Postgres makes hybrid retrieval easy: run full-text search and vector search separately, then fuse the two ranked lists. A strong default for the fusion step is RRF (Reciprocal Rank Fusion).

See: SQL for RAG
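If you prefer to fuse in application code rather than SQL, RRF is only a few lines. A minimal sketch — `rrf_fuse` is an illustrative name, and `k=60` is the commonly used smoothing constant from the original RRF formulation:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: merge several ranked id lists into one.

    rankings: list of id lists, each ordered best-first (e.g. one from
    full-text search, one from vector search). Each id accumulates
    1 / (k + rank) for every list it appears in; higher totals rank first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Items that appear high in both lists float to the top; items found by only one retriever still survive the merge.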


5) Reranking

Vector similarity is a fast filter, not a perfect ranker.

A common pattern:

1) retrieve the top 50–200 candidates (fast)
2) rerank to the top 6–10 using a stronger model (cross-encoder or LLM)

Reranking helps when:

- the correct chunk is in the candidate set but not in the top‑K
- your content is repetitive and hard to distinguish
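The candidate-then-rerank shape can be sketched with the scorer left pluggable. `rerank` and `overlap_score` are illustrative names; the token-overlap scorer here is only a stand-in so the sketch runs — in production, `score_fn` would call a cross-encoder or an LLM:

```python
def rerank(query, candidates, score_fn, top_k=8):
    """Re-sort a wide candidate set with a stronger (slower) scorer.

    candidates: list of (chunk_id, text) pairs from fast retrieval.
    score_fn(query, text) -> float, higher is more relevant.
    """
    scored = [(score_fn(query, text), cid) for cid, text in candidates]
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:top_k]]

def overlap_score(query, text):
    """Stand-in scorer: fraction of query tokens present in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / max(len(q), 1)
```

Because reranking only touches 50–200 candidates, even an expensive scorer stays affordable per query.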


6) Query rewriting / multi-query

If the user query is ambiguous or too short, rewrite it.

Two practical patterns:

- Rewrite: “Expand this query into a search query for our docs…”
- Multi-query: generate 3–5 variants, retrieve for each, then merge + dedupe

Only do this when you see recall issues; it increases latency and token usage.
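The merge + dedupe step can be as simple as first-seen-wins across the variant result lists. A minimal sketch — the function name and the `retrieve_fn` callback shape are illustrative, not a specific library API:

```python
def multi_query_retrieve(variants, retrieve_fn, k=8):
    """Retrieve for each query variant, then merge and dedupe by chunk id.

    variants: rewritten queries (e.g. 3-5 generated by an LLM).
    retrieve_fn(query) -> list of chunk ids, ordered best-first.
    First-seen-wins: earlier variants and higher ranks take priority.
    """
    seen = set()
    merged = []
    for query in variants:
        for cid in retrieve_fn(query):
            if cid not in seen:
                seen.add(cid)
                merged.append(cid)
    return merged[:k]
```

If no variant is clearly more trustworthy than the others, RRF fusion (from the hybrid-retrieval section) is a reasonable alternative to first-seen-wins.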


Debug checklist (“why does retrieval look wrong?”)

1) Inspect the top retrieved chunks (before prompting).
2) Check if parsing added boilerplate noise.
3) Check chunk size: too small or too big?
4) Add metadata filters (product/version/source).
5) Add per-doc caps or MMR if results are duplicates.
6) Add hybrid retrieval if exact keywords are missed.
7) Consider reranking if “almost right” results keep showing up.


Next steps