
Reranking Retrieved Results

Vector similarity retrieves candidates fast, but it's not a precision tool. The right chunk is often in your top-100 results but not your top-8.

Reranking fixes this by applying a stronger model after the fast ANN search.


Bi-encoder vs cross-encoder

| | Bi-encoder (dense retrieval) | Cross-encoder (reranker) |
|---|---|---|
| How it works | Encodes query and document separately, compares vectors | Encodes query + document together, outputs a relevance score |
| Speed | Fast — precompute doc embeddings | Slow — must run on each (query, doc) pair at query time |
| Accuracy | Good recall, imprecise ranking | High precision |
| Use for | Initial fetch (top 100) | Final ranking (top 8) |

The two-stage pattern gets you both: fast recall + precise ranking.
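The pattern can be sketched with toy scoring functions. Both scorers below are made up purely for illustration — in the real pipeline, stage 1 is ANN search over embeddings and stage 2 is a cross-encoder:

```python
def cheap_score(query: str, doc: str) -> int:
    # Stage-1 stand-in: crude keyword overlap (fast, imprecise)
    return len(set(query.lower().split()) & set(doc.lower().split()))


def expensive_score(query: str, doc: str) -> float:
    # Stage-2 stand-in: pretends to examine the full (query, doc) pair
    return sum(doc.lower().count(w) for w in query.lower().split()) / (
        len(doc.split()) + 1
    )


docs = [
    "reranking improves precision after dense retrieval",
    "dense retrieval uses a bi-encoder",
    "bananas are yellow",
]
query = "reranking after dense retrieval"

# Stage 1: fetch a generous candidate set with the cheap scorer
candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:2]
# Stage 2: re-order the survivors with the expensive scorer
reranked = sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)
```

The cheap scorer only has to keep the right answer in the candidate set; the expensive scorer only has to rank a short list.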


Install

uv pip install cohere
uv pip install sentence-transformers

Two-stage retrieval: fetch 100, rerank to 8

from rag.retrieve import retrieve  # your existing retrieve() returning list[dict]


def retrieve_and_rerank(query: str, *, fetch_k: int = 100, top_k: int = 8) -> list[dict]:
    # Stage 1: fast ANN retrieval
    candidates = retrieve(query, k=fetch_k)
    if not candidates:
        return []

    # Stage 2: rerank (see implementations below)
    return rerank(query, candidates, top_k=top_k)

Option A: Cohere Rerank API

import cohere

co = cohere.ClientV2()


def rerank(query: str, candidates: list[dict], *, top_k: int = 8) -> list[dict]:
    """Rerank candidates using Cohere's rerank endpoint."""
    results = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=[c["content"] for c in candidates],
        top_n=top_k,
    )
    return [candidates[r.index] for r in results.results]

Cohere Rerank is fast (~200 ms for 100 docs) and requires no GPU.
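The implementation above raises on any API failure. In production you may prefer to degrade to the plain ANN order instead. A minimal sketch — `rerank_with_fallback` and its `rerank_fn` parameter are illustrative names, not part of the Cohere SDK:

```python
def rerank_with_fallback(
    query: str,
    candidates: list[dict],
    *,
    top_k: int = 8,
    rerank_fn=None,
) -> list[dict]:
    """Try the reranker; on any failure, fall back to the ANN order.

    rerank_fn is whatever reranker you use (e.g. the Cohere-backed
    rerank() above); this wrapper only adds the failure path.
    """
    if rerank_fn is None:
        return candidates[:top_k]
    try:
        return rerank_fn(query, candidates, top_k=top_k)
    except Exception:
        # API timeout, rate limit, network error: serve the un-reranked top-k
        return candidates[:top_k]
```

The un-reranked top-k is exactly what you were serving before adding a reranker, so this failure mode is never worse than the baseline.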


Option B: Local cross-encoder

Use sentence-transformers to run reranking entirely on your own hardware:

from sentence_transformers import CrossEncoder

_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def rerank(query: str, candidates: list[dict], *, top_k: int = 8) -> list[dict]:
    """Rerank candidates using a local cross-encoder model."""
    pairs = [(query, c["content"]) for c in candidates]
    scores = _model.predict(pairs)

    scored = sorted(zip(scores, candidates), reverse=True, key=lambda x: x[0])
    return [c for _, c in scored[:top_k]]

cross-encoder/ms-marco-MiniLM-L-6-v2 is a fast, small model (~80 ms/100 docs on CPU). For higher accuracy, use cross-encoder/ms-marco-MiniLM-L-12-v2.
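Note that predict() returns raw model scores on an unbounded scale. If you want to drop weak candidates below an absolute relevance cutoff (rather than always keeping top_k), one common option is to squash scores through a sigmoid first — the 0.5 threshold below is an arbitrary example, not a tuned value:

```python
import math


def sigmoid(x: float) -> float:
    # Map a raw score (logit) into [0, 1]
    return 1.0 / (1.0 + math.exp(-x))


def filter_by_relevance(scored: list[tuple], threshold: float = 0.5) -> list[dict]:
    """Keep (score, candidate) pairs whose squashed score clears the threshold."""
    return [c for s, c in scored if sigmoid(s) >= threshold]


# Example raw scores paired with their candidates (made-up values):
scored = [(4.2, {"id": 1}), (-1.3, {"id": 2}), (0.7, {"id": 3})]
print(filter_by_relevance(scored))  # → [{'id': 1}, {'id': 3}]
```

Thresholding like this can return fewer than top_k chunks, which is often what you want when the corpus simply has no good answer.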


Plug reranking into your existing pipeline

If your existing retrieve() function looks like this:

def retrieve(query: str, k: int = 8) -> list[dict]:
    embedding = embed_texts([query])[0]
    rows = conn.execute(
        "SELECT id, content, source FROM chunks "
        "ORDER BY embedding <=> %s::vector LIMIT %s",
        (embedding, k),
    ).fetchall()
    return [{"id": r[0], "content": r[1], "source": r[2]} for r in rows]

Slot in reranking by changing the call site — no change to retrieve() itself:

# Before
chunks = retrieve(query, k=8)

# After (with reranking)
chunks = retrieve_and_rerank(query, fetch_k=100, top_k=8)

When reranking helps vs hurts

| Scenario | Reranking verdict |
|---|---|
| Corpus is large (>50k chunks) | Helps — many near-miss results |
| Content is repetitive / similar phrasing | Helps — cross-encoder sees query + doc together |
| Latency budget < 100 ms | Hurts — adds 100–500 ms; skip or cache |
| Small corpus (<5k chunks) | Neutral — dense retrieval is already precise |
| Streaming user-facing API | Consider async — run retrieval and reranking in the background |
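For the tight-latency case, caching is a cheap win: repeated queries skip both ANN search and reranking entirely. A minimal in-process sketch using the standard library (the wrapper name is illustrative):

```python
from functools import lru_cache


def make_cached(fn, maxsize: int = 1024):
    """Wrap a query -> results function with an in-process LRU cache.

    Results are frozen into tuples so cached entries can't be mutated
    by callers between hits.
    """

    @lru_cache(maxsize=maxsize)
    def cached(query: str) -> tuple:
        return tuple(fn(query))

    return cached


# Example with a stand-in retriever — swap in retrieve_and_rerank in practice:
calls = []


def fake_retrieve(query: str) -> list[dict]:
    calls.append(query)
    return [{"id": 1, "content": "..."}]


cached_retrieve = make_cached(fake_retrieve)
cached_retrieve("what is reranking")
cached_retrieve("what is reranking")
print(len(calls))  # → 1 (second call served from cache)
```

An in-process LRU only helps a single worker; for multi-worker deployments you would key a shared cache (e.g. Redis) on the normalized query instead.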

Typical added latency:

| Model | Latency (100 docs) | Hardware |
|---|---|---|
| Cohere Rerank API | ~150–300 ms | Cloud |
| ms-marco-MiniLM-L-6-v2 | ~60–120 ms | CPU |
| ms-marco-MiniLM-L-12-v2 | ~120–250 ms | CPU |
| ms-marco-MiniLM-L-6-v2 | ~10–20 ms | GPU (T4) |

Measure the impact

Use the eval harness from Evaluating RAG to compare:

# eval_compare.py
from eval import load_jsonl, recall_at_k, save_results
from rag.retrieve import retrieve
from rag.rerank import retrieve_and_rerank

qs = load_jsonl("tests/fixtures/questions.jsonl")
labeled = [q for q in qs if q.get("expected_sources")]

baseline_scores, reranked_scores = [], []

for q in labeled:
    expected = q["expected_sources"]

    chunks_base = retrieve(q["question"], k=8)
    baseline_scores.append(recall_at_k([c["source"] for c in chunks_base], expected))

    chunks_reranked = retrieve_and_rerank(q["question"], fetch_k=100, top_k=8)
    reranked_scores.append(recall_at_k([c["source"] for c in chunks_reranked], expected))

save_results(baseline_scores, k=8, run_label="baseline")
save_results(reranked_scores, k=8, run_label="reranked")
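To eyeball the comparison without opening the saved result files, you can also summarize the two score lists directly. The numbers below are illustrative, not real measurements:

```python
from statistics import mean


def summarize(baseline: list[float], reranked: list[float]) -> str:
    # Mean recall@k per run, plus the signed delta from reranking
    b, r = mean(baseline), mean(reranked)
    return f"recall@8 baseline={b:.3f} reranked={r:.3f} delta={r - b:+.3f}"


# Illustrative per-question recall scores:
baseline = [0.5, 1.0, 0.0, 0.5]
reranked = [1.0, 1.0, 0.5, 0.5]
print(summarize(baseline, reranked))
# → recall@8 baseline=0.500 reranked=0.750 delta=+0.250
```

If the delta is near zero, revisit the table above — on a small or non-repetitive corpus, reranking may not be worth the added latency.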

Next steps