# Reranking Retrieved Results
Vector similarity retrieves candidates fast, but it's not a precision tool. The right chunk is often in your top-100 results but not your top-8.
Reranking fixes this by applying a stronger model after the fast ANN search.
## Bi-encoder vs cross-encoder

| | Bi-encoder (dense retrieval) | Cross-encoder (reranker) |
|---|---|---|
| How it works | Encodes query and document separately, compares vectors | Encodes query + document together, outputs a relevance score |
| Speed | Fast — precompute doc embeddings | Slow — must run on each (query, doc) pair at query time |
| Accuracy | Good recall, imprecise ranking | High precision |
| Use for | Initial fetch (top 100) | Final ranking (top 8) |
The two-stage pattern gets you both: fast recall + precise ranking.
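As a toy illustration of the pattern (the `cheap_score` and `strong_score` functions below are hypothetical stand-ins for the bi-encoder and the cross-encoder, not real models):

```python
import zlib

corpus = [f"doc-{i}" for i in range(1000)]

def cheap_score(query: str, doc: str) -> int:
    # Stand-in for fast ANN similarity: deterministic but coarse.
    return zlib.crc32(f"{query}|{doc}".encode())

def strong_score(query: str, doc: str) -> int:
    # Stand-in for the slower, more precise cross-encoder.
    return zlib.crc32(f"rerank|{query}|{doc}".encode())

def two_stage(query: str, *, fetch_k: int = 100, top_k: int = 8) -> list[str]:
    # Stage 1: cheap score over the whole corpus (in practice, an ANN index).
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:fetch_k]
    # Stage 2: strong score over just the fetch_k survivors.
    reranked = sorted(candidates, key=lambda d: strong_score(query, d), reverse=True)
    return reranked[:top_k]

top = two_stage("example query")
assert len(top) == 8 and all(doc in corpus for doc in top)
```

The final top-8 is chosen by the strong scorer, but only ever from the cheap scorer's top-100, which is exactly the recall/precision split the table describes.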
## Install

```bash
uv pip install cohere
uv pip install sentence-transformers
```
## Two-stage retrieval: fetch 100, rerank to 8

```python
from rag.retrieve import retrieve  # your existing retrieve() returning list[dict]

def retrieve_and_rerank(query: str, *, fetch_k: int = 100, top_k: int = 8) -> list[dict]:
    # Stage 1: fast ANN retrieval
    candidates = retrieve(query, k=fetch_k)
    if not candidates:
        return []
    # Stage 2: rerank (see implementations below)
    return rerank(query, candidates, top_k=top_k)
```
## Option A: Cohere Rerank API

```python
import cohere

co = cohere.ClientV2()

def rerank(query: str, candidates: list[dict], *, top_k: int = 8) -> list[dict]:
    """Rerank candidates using Cohere's rerank endpoint."""
    results = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=[c["content"] for c in candidates],
        top_n=top_k,
    )
    return [candidates[r.index] for r in results.results]
```
Cohere Rerank is fast (~200 ms for 100 docs) and requires no GPU.
## Option B: Local cross-encoder
Use sentence-transformers to run reranking entirely on your own hardware:
```python
from sentence_transformers import CrossEncoder

_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], *, top_k: int = 8) -> list[dict]:
    """Rerank candidates using a local cross-encoder model."""
    pairs = [(query, c["content"]) for c in candidates]
    scores = _model.predict(pairs)
    scored = sorted(zip(scores, candidates), reverse=True, key=lambda x: x[0])
    return [c for _, c in scored[:top_k]]
```
`cross-encoder/ms-marco-MiniLM-L-6-v2` is a small, fast model (~80 ms per 100 docs on CPU). For higher accuracy at roughly double the latency, use `cross-encoder/ms-marco-MiniLM-L-12-v2`.
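One caveat: these ms-marco cross-encoders typically return unbounded logits from `predict()`, not probabilities, so scores have no fixed scale across queries. If you want to drop clearly irrelevant candidates with an absolute cutoff, map the logits through a sigmoid first. A sketch under that assumption (the 0.5 threshold is illustrative, not a recommendation; if your model already returns values in [0, 1], skip the sigmoid):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def filter_by_relevance(scores: list[float], candidates: list[dict],
                        threshold: float = 0.5) -> list[dict]:
    """Keep candidates whose sigmoid-mapped score clears the threshold.

    `scores` are raw cross-encoder logits as returned by model.predict();
    the threshold is a hypothetical value — tune it on your eval set.
    """
    return [c for s, c in zip(scores, candidates) if sigmoid(s) >= threshold]

# Example with made-up logits: only the positive-leaning scores survive.
docs = [{"id": 1}, {"id": 2}, {"id": 3}]
kept = filter_by_relevance([2.3, -4.1, 0.7], docs)
# kept → [{"id": 1}, {"id": 3}]
```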
## Plug reranking into your existing pipeline
If your existing retrieve() function looks like this:
```python
def retrieve(query: str, k: int = 8) -> list[dict]:
    embedding = embed_texts([query])[0]
    rows = conn.execute(
        "SELECT id, content, source FROM chunks "
        "ORDER BY embedding <=> %s::vector LIMIT %s",
        (embedding, k),
    ).fetchall()
    return [{"id": r[0], "content": r[1], "source": r[2]} for r in rows]
```
Slot in reranking by changing the call site — no change to `retrieve()` itself:
```python
# Before
chunks = retrieve(query, k=8)

# After (with reranking)
chunks = retrieve_and_rerank(query, fetch_k=100, top_k=8)
```
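The reranker is also a new point of failure — an external API or a local model call that can time out. A sketch of a wrapper that degrades to the plain ANN order when reranking fails (`retrieve` and `rerank` are injected as parameters here only so the sketch is self-contained; in the real pipeline they are the functions defined above):

```python
import logging

logger = logging.getLogger(__name__)

def retrieve_and_rerank_safe(query: str, *, fetch_k: int = 100, top_k: int = 8,
                             retrieve, rerank) -> list[dict]:
    """Two-stage retrieval that falls back to the ANN order if reranking fails."""
    candidates = retrieve(query, k=fetch_k)
    if not candidates:
        return []
    try:
        return rerank(query, candidates, top_k=top_k)
    except Exception:
        # Reranker outage or timeout: serve vector-search order instead of erroring.
        logger.warning("rerank failed; falling back to ANN order", exc_info=True)
        return candidates[:top_k]

# Demo with stubs: even a failing reranker still yields top_k results.
def failing_rerank(query, candidates, *, top_k):
    raise RuntimeError("simulated reranker outage")

stub_docs = [{"id": i} for i in range(20)]
out = retrieve_and_rerank_safe("q", retrieve=lambda q, k: stub_docs[:k],
                               rerank=failing_rerank)
assert out == stub_docs[:8]
```

Whether silent degradation is the right choice depends on the product; logging the fallback keeps the failure visible to monitoring.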
## When reranking helps vs hurts
| Scenario | Reranking verdict |
|---|---|
| Corpus is large (>50k chunks) | Helps — many near-miss results |
| Content is repetitive / similar phrasing | Helps — cross-encoder sees query+doc together |
| Latency budget < 100 ms | Hurts — adds 100–500 ms; skip or cache |
| Small corpus (<5k chunks) | Neutral — dense retrieval already precise |
| Streaming user-facing API | Consider async — run retrieval and reranking in background |
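For the streaming case in the last row, the blocking rerank call can be pushed off the event loop with `asyncio.to_thread`, so the server keeps handling other work while the cross-encoder runs. A sketch with a stub standing in for the real reranker:

```python
import asyncio
import time

def blocking_rerank(query: str, candidates: list[dict], top_k: int = 8) -> list[dict]:
    # Stand-in for a CPU-bound cross-encoder call.
    time.sleep(0.05)
    return candidates[:top_k]

async def rerank_async(query: str, candidates: list[dict], top_k: int = 8) -> list[dict]:
    # Run the blocking reranker in a worker thread so the event loop
    # can keep streaming tokens or serving other requests meanwhile.
    return await asyncio.to_thread(blocking_rerank, query, candidates, top_k)

async def main() -> list[dict]:
    docs = [{"id": i} for i in range(100)]
    return await rerank_async("q", docs)

result = asyncio.run(main())
assert len(result) == 8
```

Note that `to_thread` hides the latency from other requests, not from the one being reranked — the user still waits for the reranker before the first retrieved chunk is used.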
Typical added latency:
| Model | Latency (100 docs) | Hardware |
|---|---|---|
| Cohere Rerank API | ~150–300 ms | Cloud |
| `ms-marco-MiniLM-L-6-v2` | ~60–120 ms | CPU |
| `ms-marco-MiniLM-L-12-v2` | ~120–250 ms | CPU |
| `ms-marco-MiniLM-L-6-v2` | ~10–20 ms | GPU (T4) |
## Measure the impact
Use the eval harness from Evaluating RAG to compare:
```python
# eval_compare.py
from eval import load_jsonl, recall_at_k, save_results
from rag.retrieve import retrieve
from rag.rerank import retrieve_and_rerank

qs = load_jsonl("tests/fixtures/questions.jsonl")
labeled = [q for q in qs if q.get("expected_sources")]

baseline_scores, reranked_scores = [], []
for q in labeled:
    expected = q["expected_sources"]
    chunks_base = retrieve(q["question"], k=8)
    baseline_scores.append(recall_at_k([c["source"] for c in chunks_base], expected))
    chunks_reranked = retrieve_and_rerank(q["question"], fetch_k=100, top_k=8)
    reranked_scores.append(recall_at_k([c["source"] for c in chunks_reranked], expected))

save_results(baseline_scores, k=8, run_label="baseline")
save_results(reranked_scores, k=8, run_label="reranked")
```
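The script imports its helpers from the eval harness. If you don't have that module handy, `recall_at_k` can be approximated as below (a sketch — your actual `eval` module may define it differently), and a one-line mean turns the per-question scores into a comparable summary:

```python
def recall_at_k(retrieved_sources: list[str], expected_sources: list[str]) -> float:
    """Fraction of expected sources that appear in the retrieved list.

    A plausible sketch of the helper imported above; not the harness's
    actual implementation.
    """
    if not expected_sources:
        return 0.0
    hits = sum(1 for s in expected_sources if s in retrieved_sources)
    return hits / len(expected_sources)

# Summarize a run: mean recall over all labeled questions (made-up data).
baseline_scores = [recall_at_k(["a.md", "b.md"], ["a.md"]),
                   recall_at_k(["c.md"], ["a.md", "c.md"])]
mean_recall = sum(baseline_scores) / len(baseline_scores)
print(f"baseline recall@8: {mean_recall:.2f}")  # → baseline recall@8: 0.75
```

Comparing the two mean recalls tells you whether the extra latency bought you anything on your own questions.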
## Next steps
- Add a streaming API endpoint for the reranked results: Serving RAG as an API
- Monitor latency impact in production: Monitoring & Observability