RAG Engineering Mastery · 5 / 10

Re-Ranking — The Cheap Quality Win

Retrieval gets you 30 plausible chunks. A re-ranker reads them against the actual question and floats the truly relevant few to the top.

Published May 11, 2026 · 1 min read · Haythem Rehouma · Claude Mastery

Embedding search is fast but shallow: it embeds your question and each chunk separately, then measures the distance between the two vectors. A re-ranker is slow but deep: it reads the question and a chunk together and scores how well that chunk answers that question.

The pattern: retrieve wide, re-rank narrow

Retrieve broadly — top 30–50 chunks via hybrid search (recall-optimized; cast a wide net).
Re-rank those with a cross-encoder against the question.
Keep the top 3–8 for the prompt (precision-optimized).

You get the recall of wide retrieval and the precision of deep scoring, without re-ranking your whole corpus.
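The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `retrieve_then_rerank`, `toy_retrieve`, and `toy_score` are hypothetical names, and the toy scorer is plain word overlap standing in for a real cross-encoder so the example runs with the standard library only.

```python
from typing import Callable, List, Sequence, Tuple


def retrieve_then_rerank(
    question: str,
    retrieve: Callable[[str, int], Sequence[str]],
    score_pair: Callable[[str, str], float],
    retrieve_k: int = 30,
    keep_k: int = 5,
) -> List[Tuple[str, float]]:
    """Retrieve wide, re-rank narrow, and return the best (chunk, score) pairs."""
    # Step 1: recall-optimized retrieval (hybrid search in a real system).
    candidates = retrieve(question, retrieve_k)
    # Step 2: score every (question, chunk) pair together.
    scored = [(chunk, score_pair(question, chunk)) for chunk in candidates]
    # Step 3: keep only the highest-scoring few for the prompt.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep_k]


# Toy stand-ins (assumptions for illustration only).
CORPUS = [
    "Re-rankers score a question and a chunk together.",
    "Embedding search compares vectors by distance.",
    "Our office is closed on public holidays.",
    "A cross-encoder is slower than a bi-encoder.",
]


def toy_retrieve(question: str, k: int) -> Sequence[str]:
    # Pretend the first k corpus entries came back from hybrid search.
    return CORPUS[:k]


def toy_score(question: str, chunk: str) -> float:
    # Fraction of question words that also appear in the chunk.
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().rstrip(".").split())
    return len(q_words & c_words) / len(q_words)


top = retrieve_then_rerank(
    "how does a cross-encoder score a chunk",
    toy_retrieve,
    toy_score,
    retrieve_k=4,
    keep_k=2,
)
for chunk, score in top:
    print(f"{score:.2f}  {chunk}")
```

Passing the retriever and the scorer in as functions keeps the pipeline shape separate from the model choice, so a real cross-encoder can replace `toy_score` without changing the rest.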

Why it works

A bi-encoder (embeddings) must encode a chunk before it knows your question. A cross-encoder sees both at once, so it catches relevance that distance misses — negation, specificity, "this chunk is about X but doesn't answer X."
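A toy contrast can make the structural difference concrete. In the sketch below, `embed` is a bag-of-words stand-in for a bi-encoder and `joint_score` is a word-pair overlap stand-in for a cross-encoder; both are hypothetical helpers, far cruder than real models, and the order-blind embedding deliberately exaggerates the weakness. What carries over is the shape: chunk vectors are built before any question exists, while the joint scorer only runs once it has both texts.

```python
import math
from collections import Counter
from typing import Set, Tuple


def embed(text: str) -> Counter:
    # Toy "bi-encoder": each text is encoded on its own, as word counts.
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def bigrams(text: str) -> Set[Tuple[str, str]]:
    words = text.lower().split()
    return set(zip(words, words[1:]))


def joint_score(question: str, chunk: str) -> float:
    # Toy "cross-encoder": looks at question and chunk together and
    # rewards chunks that contain the question's word pairs in order.
    q = bigrams(question)
    return len(q & bigrams(chunk)) / len(q) if q else 0.0


chunks = [
    "migrating from rust to python",
    "migrating from python to rust",
]
# Index time: chunk vectors are computed before any question is known.
chunk_vectors = [embed(c) for c in chunks]

# Query time.
question = "python to rust"
q_vec = embed(question)
distance_scores = [cosine(q_vec, v) for v in chunk_vectors]
pair_scores = [joint_score(question, c) for c in chunks]

print(distance_scores)  # the two chunks tie under the toy embedding
print(pair_scores)      # the joint scorer separates them
```

Both chunks contain the same words, so the toy distance cannot tell them apart; only the scorer that sees the pair notices which direction the question asks about.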

The trade-off

Re-ranking adds latency and cost per query (you score 30–50 pairs). Tune the retrieve-width and keep-count against your eval set and latency budget — covered in articles 7 and 9.
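One way to frame that tuning is a small sweep: discard retrieve widths whose estimated re-rank latency exceeds the budget, then keep the configuration that scores best on the eval set. The sketch below assumes a simple batched latency model, and `toy_quality` is a made-up formula standing in for a real eval run; the batch size, milliseconds per batch, and budget are illustrative numbers, not measurements.

```python
import math
from typing import Callable, Iterable, Optional, Tuple


def rerank_latency_ms(retrieve_k: int, batch_size: int, ms_per_batch: float) -> float:
    """Estimated re-rank latency, assuming one scorer call per batch of pairs."""
    return math.ceil(retrieve_k / batch_size) * ms_per_batch


def pick_config(
    widths: Iterable[int],
    keeps: Iterable[int],
    quality: Callable[[int, int], float],
    latency_budget_ms: float,
    batch_size: int = 16,
    ms_per_batch: float = 40.0,
) -> Optional[Tuple[float, int, int]]:
    """Return (quality, retrieve_k, keep_k) for the best config within budget."""
    keeps = list(keeps)
    best = None
    for retrieve_k in widths:
        if rerank_latency_ms(retrieve_k, batch_size, ms_per_batch) > latency_budget_ms:
            continue  # too slow for the budget, skip this width
        for keep_k in keeps:
            if keep_k > retrieve_k:
                continue  # cannot keep more chunks than were retrieved
            q = quality(retrieve_k, keep_k)
            if best is None or q > best[0]:
                best = (q, retrieve_k, keep_k)
    return best


def toy_quality(retrieve_k: int, keep_k: int) -> float:
    # Hypothetical stand-in for "run the eval set with this configuration".
    return min(1.0, 0.5 + 0.01 * retrieve_k + 0.02 * keep_k)


best = pick_config(
    widths=[30, 40, 50],
    keeps=[3, 5, 8],
    quality=toy_quality,
    latency_budget_ms=100.0,
)
print(best)
```

With these illustrative numbers, only a width of 30 fits the 100 ms budget (two batches of 16), so the sweep never evaluates the wider settings.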

Now the retrieval is sharp. Next: making the generator actually use it — grounding and citations.
