# Reranking Retrieved Results
Vector similarity retrieves candidates fast, but it's not a precision tool. The right chunk is often in your top-100 results but not your top-8.
Reranking fixes this by applying a stronger model after the fast ANN search.
## Bi-encoder vs cross-encoder

| | Bi-encoder (dense retrieval) | Cross-encoder (reranker) |
|---|---|---|
| How it works | Encodes query and document separately, compares vectors | Encodes query + document together, outputs a relevance score |
| Speed | Fast — precompute doc embeddings | Slow — must run on each (query, doc) pair at query time |
| Accuracy | Good recall, imprecise ranking | High precision |
| Use for | Initial fetch (top 100) | Final ranking (top 8) |
The two-stage pattern gets you both: fast recall + precise ranking.
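As a toy illustration of the pattern (the `cheap_score` and `strong_score` functions below are hypothetical stand-ins for the bi-encoder and the cross-encoder, not real models):

```python
import zlib

corpus = [f"doc-{i}" for i in range(1000)]

def cheap_score(query: str, doc: str) -> int:
    # Stand-in for fast ANN similarity: deterministic but coarse.
    return zlib.crc32(f"{query}|{doc}".encode())

def strong_score(query: str, doc: str) -> int:
    # Stand-in for the slower, more precise cross-encoder.
    return zlib.crc32(f"rerank|{query}|{doc}".encode())

def two_stage(query: str, *, fetch_k: int = 100, top_k: int = 8) -> list[str]:
    # Stage 1: cheap score over the whole corpus (in practice, an ANN index).
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:fetch_k]
    # Stage 2: strong score over just the fetch_k survivors.
    reranked = sorted(candidates, key=lambda d: strong_score(query, d), reverse=True)
    return reranked[:top_k]

top = two_stage("example query")
assert len(top) == 8 and all(doc in corpus for doc in top)
```

The final top-8 is chosen by the strong scorer, but only ever from the cheap scorer's top-100, which is exactly the recall/precision split the table describes.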
## Install

```bash
uv pip install cohere
uv pip install sentence-transformers
```
## Two-stage retrieval: fetch 100, rerank to 8

```python
from rag.retrieve import retrieve  # your existing retrieve() returning list[dict]

def retrieve_and_rerank(query: str, *, fetch_k: int = 100, top_k: int = 8) -> list[dict]:
    # Stage 1: fast ANN retrieval
    candidates = retrieve(query, k=fetch_k)
    if not candidates:
        return []
    # Stage 2: rerank (see implementations below)
    return rerank(query, candidates, top_k=top_k)
```
## Option A: Cohere Rerank API

```python
import cohere

co = cohere.ClientV2()

def rerank(query: str, candidates: list[dict], *, top_k: int = 8) -> list[dict]:
    """Rerank candidates using Cohere's rerank endpoint."""
    results = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=[c["content"] for c in candidates],
        top_n=top_k,
    )
    return [candidates[r.index] for r in results.results]
```
Cohere Rerank is fast (~200 ms for 100 docs) and requires no GPU.
## Option B: Local cross-encoder
Use sentence-transformers to run reranking entirely on your own hardware:
```python
from sentence_transformers import CrossEncoder

_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], *, top_k: int = 8) -> list[dict]:
    """Rerank candidates using a local cross-encoder model."""
    pairs = [(query, c["content"]) for c in candidates]
    scores = _model.predict(pairs)
    scored = sorted(zip(scores, candidates), reverse=True, key=lambda x: x[0])
    return [c for _, c in scored[:top_k]]
```
`cross-encoder/ms-marco-MiniLM-L-6-v2` is a small, fast model (~80 ms per 100 docs on CPU). For higher accuracy at roughly double the latency, use `cross-encoder/ms-marco-MiniLM-L-12-v2`.
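One caveat: these ms-marco cross-encoders typically return unbounded logits from `predict()`, not probabilities, so scores have no fixed scale across queries. If you want to drop clearly irrelevant candidates with an absolute cutoff, map the logits through a sigmoid first. A sketch under that assumption (the 0.5 threshold is illustrative, not a recommendation; if your model already returns values in [0, 1], skip the sigmoid):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def filter_by_relevance(scores: list[float], candidates: list[dict],
                        threshold: float = 0.5) -> list[dict]:
    """Keep candidates whose sigmoid-mapped score clears the threshold.

    `scores` are raw cross-encoder logits as returned by model.predict();
    the threshold is a hypothetical value — tune it on your eval set.
    """
    return [c for s, c in zip(scores, candidates) if sigmoid(s) >= threshold]

# Example with made-up logits: only the positive-leaning scores survive.
docs = [{"id": 1}, {"id": 2}, {"id": 3}]
kept = filter_by_relevance([2.3, -4.1, 0.7], docs)
# kept → [{"id": 1}, {"id": 3}]
```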
## Plug reranking into your existing pipeline
If your existing retrieve() function looks like this:
```python
def retrieve(query: str, k: int = 8) -> list[dict]:
    embedding = embed_texts([query])[0]
    rows = conn.execute(
        "SELECT id, content, source FROM chunks "
        "ORDER BY embedding <=> %s::vector LIMIT %s",
        (embedding, k),
    ).fetchall()
    return [{"id": r[0], "content": r[1], "source": r[2]} for r in rows]
```
Slot in reranking by changing the call site — no change to `retrieve()` itself:
```python
# Before
chunks = retrieve(query, k=8)

# After (with reranking)
chunks = retrieve_and_rerank(query, fetch_k=100, top_k=8)
```
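The reranker is also a new point of failure — an external API or a local model call that can time out. A sketch of a wrapper that degrades to the plain ANN order when reranking fails (`retrieve` and `rerank` are injected as parameters here only so the sketch is self-contained; in the real pipeline they are the functions defined above):

```python
import logging

logger = logging.getLogger(__name__)

def retrieve_and_rerank_safe(query: str, *, fetch_k: int = 100, top_k: int = 8,
                             retrieve, rerank) -> list[dict]:
    """Two-stage retrieval that falls back to the ANN order if reranking fails."""
    candidates = retrieve(query, k=fetch_k)
    if not candidates:
        return []
    try:
        return rerank(query, candidates, top_k=top_k)
    except Exception:
        # Reranker outage or timeout: serve vector-search order instead of erroring.
        logger.warning("rerank failed; falling back to ANN order", exc_info=True)
        return candidates[:top_k]

# Demo with stubs: even a failing reranker still yields top_k results.
def failing_rerank(query, candidates, *, top_k):
    raise RuntimeError("simulated reranker outage")

stub_docs = [{"id": i} for i in range(20)]
out = retrieve_and_rerank_safe("q", retrieve=lambda q, k: stub_docs[:k],
                               rerank=failing_rerank)
assert out == stub_docs[:8]
```

Whether silent degradation is the right choice depends on the product; logging the fallback keeps the failure visible to monitoring.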
## When reranking helps vs hurts
| Scenario | Reranking verdict |
|---|---|
| Corpus is large (>50k chunks) | Helps — many near-miss results |
| Content is repetitive / similar phrasing | Helps — cross-encoder sees query+doc together |
| Latency budget < 100 ms | Hurts — adds 100–500 ms; skip or cache |
| Small corpus (<5k chunks) | Neutral — dense retrieval already precise |
| Streaming user-facing API | Consider async — run retrieval and reranking in background |
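For the streaming case in the last row, the blocking rerank call can be pushed off the event loop with `asyncio.to_thread`, so the server keeps handling other work while the cross-encoder runs. A sketch with a stub standing in for the real reranker:

```python
import asyncio
import time

def blocking_rerank(query: str, candidates: list[dict], top_k: int = 8) -> list[dict]:
    # Stand-in for a CPU-bound cross-encoder call.
    time.sleep(0.05)
    return candidates[:top_k]

async def rerank_async(query: str, candidates: list[dict], top_k: int = 8) -> list[dict]:
    # Run the blocking reranker in a worker thread so the event loop
    # can keep streaming tokens or serving other requests meanwhile.
    return await asyncio.to_thread(blocking_rerank, query, candidates, top_k)

async def main() -> list[dict]:
    docs = [{"id": i} for i in range(100)]
    return await rerank_async("q", docs)

result = asyncio.run(main())
assert len(result) == 8
```

Note that `to_thread` hides the latency from other requests, not from the one being reranked — the user still waits for the reranker before the first retrieved chunk is used.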
Typical added latency:
| Model | Latency (100 docs) | Hardware |
|---|---|---|
| Cohere Rerank API | ~150–300 ms | Cloud |
| `ms-marco-MiniLM-L-6-v2` | ~60–120 ms | CPU |
| `ms-marco-MiniLM-L-12-v2` | ~120–250 ms | CPU |
| `ms-marco-MiniLM-L-6-v2` | ~10–20 ms | GPU (T4) |
## Measure the impact
Use the eval harness from Evaluating RAG to compare:
```python
# eval_compare.py
from eval import load_jsonl, recall_at_k, save_results
from rag.retrieve import retrieve
from rag.rerank import retrieve_and_rerank

qs = load_jsonl("tests/fixtures/questions.jsonl")
labeled = [q for q in qs if q.get("expected_sources")]

baseline_scores, reranked_scores = [], []
for q in labeled:
    expected = q["expected_sources"]
    chunks_base = retrieve(q["question"], k=8)
    baseline_scores.append(recall_at_k([c["source"] for c in chunks_base], expected))
    chunks_reranked = retrieve_and_rerank(q["question"], fetch_k=100, top_k=8)
    reranked_scores.append(recall_at_k([c["source"] for c in chunks_reranked], expected))

save_results(baseline_scores, k=8, run_label="baseline")
save_results(reranked_scores, k=8, run_label="reranked")
```
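The script imports its helpers from the eval harness. If you don't have that module handy, `recall_at_k` can be approximated as below (a sketch — your actual `eval` module may define it differently), and a one-line mean turns the per-question scores into a comparable summary:

```python
def recall_at_k(retrieved_sources: list[str], expected_sources: list[str]) -> float:
    """Fraction of expected sources that appear in the retrieved list.

    A plausible sketch of the helper imported above; not the harness's
    actual implementation.
    """
    if not expected_sources:
        return 0.0
    hits = sum(1 for s in expected_sources if s in retrieved_sources)
    return hits / len(expected_sources)

# Summarize a run: mean recall over all labeled questions (made-up data).
baseline_scores = [recall_at_k(["a.md", "b.md"], ["a.md"]),
                   recall_at_k(["c.md"], ["a.md", "c.md"])]
mean_recall = sum(baseline_scores) / len(baseline_scores)
print(f"baseline recall@8: {mean_recall:.2f}")  # → baseline recall@8: 0.75
```

Comparing the two mean recalls tells you whether the extra latency bought you anything on your own questions.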
## Next steps
- Add a streaming API endpoint for the reranked results: Serving RAG as an API
- Monitor latency impact in production: Monitoring & Observability