Retrieval strategies & query rewriting
Naive RAG embeds the user question and runs one vector search. That works until paraphrases miss, compound questions need two hops, or exact tokens like ORD-1042 never appear in the embedding space. This guide is a production-first catalog of retrieval upgrades—query rewriting, decomposition, reranking, contextual indexing, sparse search, and hybrid fusion—so you can pick the right lever without turning every query into a five-LLM pipeline.
After reading, you should be able to: diagnose when single-embedding retrieval fails; implement HyDE, multi-query, step-back, and sub-question decomposition; run a two-stage retrieve→rerank pipeline with cross-encoders or vendor APIs; apply Anthropic-style contextual chunk prefixes at index time; combine BM25 and dense vectors with reciprocal rank fusion; and budget latency so rewriting runs only when evals justify it.
Why naive retrieval breaks
The default RAG path—embed the question, cosine-search the index, stuff top-K chunks into the prompt—is correct as a baseline. It is not sufficient for production Q&A over heterogeneous corpora. Failures cluster into predictable patterns you can measure before adding complexity.
A bi-encoder maps query and document into the same vector space independently. Similarity is cheap and scalable, but the query vector is built from how the user asked, not from how the answer appears in the corpus. Support tickets use informal language; policy PDFs use legal phrasing. The embedding model may bridge that gap—or may not, especially on domain jargon, rare acronyms, or questions that never appeared in training data.
Paraphrase and vocabulary mismatch
Consider a user who asks: “Can I get money back if I cancel my yearly plan mid-cycle?” The refund policy chunk says: “Annual subscriptions are eligible for pro-rata credit upon early termination.” Few token overlaps; a strong embedding model often still retrieves the right chunk. But swap in a product-specific term (“FlexAnnual tier” in the doc vs “the flexible annual billing option” in the query) and recall can drop sharply. This is the lexical gap: dense vectors handle semantic similarity; they are weaker when the match depends on exact product names, error codes, or SKU strings that are nearly unique in the corpus.
| Failure mode | Example | Why one embedding struggles | Typical fix (later sections) |
|---|---|---|---|
| Paraphrase | “cancel yearly” vs “early termination of annual subscription” | Query and doc live in different wording regions of vector space | HyDE, multi-query, hybrid BM25 |
| Multi-hop | “Does our EU DPA cover subprocessors used by the analytics integration?” | Answer requires chunk A (DPA scope) + chunk B (analytics vendor list) | Sub-question decomposition, multi-query |
| Abstract → concrete | “What’s our stance on shadow IT?” | Policy uses “unsanctioned SaaS” not “shadow IT” | Step-back prompting, HyDE |
| Exact token | “Status for ORD-1042” | Embedding treats ID as low-signal noise | Sparse BM25, hybrid RRF |
| Underspecified chunk | Chunk: “The limit is 500 requests per minute.” | No surrounding context—which API? which tier? | Contextual retrieval at index time |
Multi-hop reasoning without multi-hop retrieval
Large language models can synthesize across multiple facts once those facts are in context. Naive retrieval assumes a single hop: one query vector, one ranked list. Multi-hop questions need either (a) multiple retrieval passes with different sub-queries, (b) a graph or agent loop that retrieves, reads, and retrieves again, or (c) larger chunks that accidentally contain both facts. Option (c) scales poorly; option (b) is Track 3 (agents). This guide focuses on (a): rewriting and decomposing the query before the answer LLM call. The rewriters are themselves small LLM calls, so each one spends latency ahead of the answer.
Measure the baseline before upgrading. On a labeled set of 50–200 questions, log recall@K (is the gold chunk in top K?) and MRR (mean reciprocal rank). If recall@5 is already 90%, query rewriting adds latency and cost for marginal gain. If recall@5 is 55% on paraphrase-heavy tickets, HyDE or multi-query is justified.
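As a rough illustration of the two metrics named above, the sketch below computes recall@K (in the hit-rate sense used here: is a gold chunk in the top K?) and MRR over a labeled set. The function names and the tiny `runs` sample are hypothetical, not part of any library.

```python
def recall_at_k(ranked_ids: list[str], gold_ids: set[str], k: int) -> float:
    """1.0 if any gold chunk appears in the top k results, else 0.0."""
    return 1.0 if any(cid in gold_ids for cid in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids: list[str], gold_ids: set[str]) -> float:
    """1 / rank of the first gold chunk (1-based); 0.0 if none was retrieved."""
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in gold_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs: list[tuple[list[str], set[str]]], k: int = 5) -> dict[str, float]:
    """Average both metrics over (ranked_ids, gold_ids) pairs."""
    n = len(runs)
    if n == 0:
        return {"recall_at_k": 0.0, "mrr": 0.0}
    return {
        "recall_at_k": sum(recall_at_k(r, g, k) for r, g in runs) / n,
        "mrr": sum(reciprocal_rank(r, g) for r, g in runs) / n,
    }


# Hypothetical labeled examples: (retrieved chunk IDs in rank order, gold chunk IDs)
runs = [
    (["c7", "c2", "c9"], {"c2"}),  # gold chunk found at rank 2
    (["c1", "c4", "c5"], {"c8"}),  # gold chunk missed entirely
]
metrics = evaluate(runs, k=3)
print(metrics)  # {'recall_at_k': 0.5, 'mrr': 0.25}
```

Running the same function before and after enabling a strategy, per failure bucket, gives the comparison the paragraph above asks for.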
from openai import OpenAI

client = OpenAI()


def embed(text: str) -> list[float]:
    r = client.embeddings.create(model="text-embedding-3-small", input=text)
    return r.data[0].embedding


def naive_retrieve(question: str, store, k: int = 5) -> list[dict]:
    """Single embedding, single search: the baseline."""
    q_vec = embed(question)
    return store.similarity_search(q_vec, k=k)

# store = your VectorStore wrapper (Pinecone, pgvector, Chroma, etc.)
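@Service
public class NaiveRetriever {
    private final VectorStore vectorStore;

    public NaiveRetriever(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    public List<Document> retrieve(String question, int k) {
        // Spring AI's VectorStore embeds the query text itself, so pass the string.
        // SearchRequest.query(...).withTopK(...) is the milestone-era API; newer releases use a builder.
        return vectorStore.similaritySearch(
                SearchRequest.query(question).withTopK(k));
    }
}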
Blaming the embedding model before checking chunking. A perfect retriever cannot find an answer split across two 256-token chunks with no overlap. Fix chunk boundaries and metadata filters first; then tune retrieval strategies.
Bi-encoders are trained on (query, relevant_doc) pairs—often MS MARCO or synthetic QA. At inference, your user question is the “query” tower input, but your indexed chunks are “document” tower inputs. Asymmetric training helps, yet the query distribution at runtime (typos, abbreviations, multilingual mix) still drifts from training.
More retrieval strategies mean more LLM calls before the answer call. Each rewriter adds 200–800 ms and input tokens. Treat strategies as a menu, not a checklist—enable per route or per confidence score, not globally by default.
HyDE — Hypothetical Document Embeddings
HyDE (Gao et al., 2022) asks the LLM to write a fake answer passage, embeds that hypothetical document, and searches the index with the hypothetical vector instead of the raw question. The hypothesis reads like corpus text—closing the query/document style gap.
Intuition: vector search compares the question to stored documents. Documents are declarative, third-person, complete sentences. Questions are short, interrogative, and incomplete. HyDE translates question → pseudo-document in the same register as your KB, then embeds the pseudo-document. You never show the hypothesis to the user; it is a retrieval-only artifact.
When HyDE helps vs hurts
| Scenario | HyDE | Notes |
|---|---|---|
| Paraphrase-heavy support FAQ | Often +10–20% recall@5 | Hypothesis mirrors policy tone |
| Exact ID / SKU lookup | Can hallucinate wrong IDs | Prefer BM25 or filter metadata |
| Highly factual regulated content | Risky | Fake doc may invent compliance language |
| Low-latency chat (< 1 s budget) | Usually skip | Extra LLM + embed round trip |
Implementation sketch
- Prompt LLM: “Write a short passage that would answer this question, as it might appear in our documentation. Do not invent product names.”
- Embed the hypothetical passage (same model as index).
- Vector search with hypothetical embedding; optionally also search with original question and merge (multi-vector).
- Pass retrieved real chunks to the answer LLM—never the hypothesis.
HYDE_PROMPT = """Write a single paragraph that would appear in internal documentation
and answer the question. Use neutral policy language. Do not invent IDs or numbers.
Question: {question}
Passage:"""


def hyde_retrieve(question: str, store, k: int = 5) -> list[dict]:
    hypo = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        max_tokens=256,
        messages=[{"role": "user", "content": HYDE_PROMPT.format(question=question)}],
    ).choices[0].message.content.strip()
    # Search with the hypothesis vector, not the raw question vector
    hypo_vec = embed(hypo)
    hits = store.similarity_search(hypo_vec, k=k)
    for h in hits:
        # Keep a short preview for debugging only; never show it to the user
        h["meta"] = {**h.get("meta", {}), "hyde_hypothesis": hypo[:200]}
    return hits
public List<Document> hydeRetrieve(String question, int k) {
    String hypo = chatClient.prompt()
            .user(HYDE_PROMPT.formatted(question))
            .options(ChatOptionsBuilder.builder()
                    .withModel("gpt-4o-mini").withTemperature(0.0).withMaxTokens(256).build())
            .call()
            .content();
    // The vector store embeds the hypothetical passage; the raw question is not searched
    List<Document> hits = vectorStore.similaritySearch(
            SearchRequest.query(hypo).withTopK(k));
    // Guard the preview length: substring(0, 200) throws on passages shorter than 200 chars
    String preview = hypo.substring(0, Math.min(200, hypo.length()));
    hits.forEach(d -> d.getMetadata().put("hyde_hypothesis", preview));
    return hits;
}
HyDE adds one chat completion (~100–300 output tokens) plus one embedding call per query. At 10k queries/day on gpt-4o-mini, budget roughly $15–40/month for the rewriter alone—before answer generation. Cache hypotheses for repeated FAQ clicks.
Teams often run HyDE only when the first-pass naive retrieval score is below a threshold (e.g. top-1 cosine < 0.72). That preserves latency on easy queries while spending LLM tokens on hard ones.
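A minimal sketch of that gating pattern, assuming each hit is a dict with a cosine `score` and that the two retrievers are passed in as callables. The `gated_retrieve` name, the stub retrievers, and the 0.72 default are illustrative only; the threshold should come from your own evals.

```python
from typing import Callable

Hit = dict  # each hit is assumed to carry "id" and a cosine "score"


def gated_retrieve(
    question: str,
    naive: Callable[[str], list[Hit]],
    hyde: Callable[[str], list[Hit]],
    threshold: float = 0.72,
) -> tuple[list[Hit], str]:
    """Run the cheap path first; fall back to HyDE only on low confidence."""
    hits = naive(question)
    top_score = hits[0]["score"] if hits else 0.0
    if top_score >= threshold:
        return hits, "fast"
    return hyde(question), "hyde"


# Stub retrievers standing in for naive_retrieve / hyde_retrieve
def fake_naive(q: str) -> list[Hit]:
    return [{"id": "c1", "score": 0.9 if "refund" in q else 0.4}]


def fake_hyde(q: str) -> list[Hit]:
    return [{"id": "c2", "score": 0.8}]


easy_hits, easy_path = gated_retrieve("refund policy", fake_naive, fake_hyde)
hard_hits, hard_path = gated_retrieve("money back mid-cycle", fake_naive, fake_hyde)
print(easy_path, hard_path)  # fast hyde
```

Returning the path label alongside the hits makes it easy to log which route each query took.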
The HyDE prompt sees the user question. If the question contains PII, the hypothetical passage may echo it into logs. Redact before HyDE; do not log full hypotheses in production without a data classification review.
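One way to redact before the HyDE call is a small regex pass, sketched below. The two patterns are deliberately loose assumptions (emails and phone-like digit runs) and will not catch names, addresses, or locale-specific formats; a production system would more likely call a dedicated PII service.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose phone pattern: 8+ characters of digits and separators; tune for your locale
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def redact(question: str) -> str:
    """Replace emails and phone-like digit runs with placeholders."""
    question = EMAIL_RE.sub("[EMAIL]", question)
    return PHONE_RE.sub("[PHONE]", question)


safe = redact("Refund for jane.doe@example.com, call +1 (555) 010-9999?")
print(safe)  # Refund for [EMAIL], call [PHONE]?
```

Short IDs such as ORD-1042 pass through untouched, but long numeric order IDs could match the phone pattern, so check the patterns against real query logs before relying on them.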
Multi-query — N variants, deduped results
Multi-query retrieval asks the LLM to produce N paraphrases or angle variants of the user question, runs retrieval for each, then deduplicates and merges chunk lists before context assembly. Different wordings surface different nearest neighbors in the index.
One question hides multiple retrieval intents. “How do I reset SSO and who approves it?” blends procedure and ownership. Multi-query expansion does not fully decompose (see sub-questions below)—it generates alternate phrasings of the same intent to reduce bad luck from a single embedding direction.
Dedup and merge strategies
| Strategy | Behavior | When to use |
|---|---|---|
| Union by chunk ID | Keep first occurrence; preserve best rank score | Simple; good default |
| Max score per ID | Across variants, take max similarity per chunk | When scores are comparable across queries |
| RRF across lists | Reciprocal rank fusion on each variant’s ranking | Many variants (3–5); avoids score scale issues |
| Cap per variant | Retrieve top 3 per variant, merge to top 10 unique | Control context bloat |
MULTI_QUERY_PROMPT = """Generate {n} different search queries that would help retrieve
documentation to answer the user question. One per line, no numbering.
Question: {question}"""


def generate_query_variants(question: str, n: int = 3) -> list[str]:
    text = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.3,
        max_tokens=200,
        messages=[{"role": "user", "content": MULTI_QUERY_PROMPT.format(n=n, question=question)}],
    ).choices[0].message.content
    # Original question first, so it survives the cap below
    variants = [question] + [ln.strip() for ln in text.splitlines() if ln.strip()]
    return variants[: n + 1]


def dedupe_by_id(lists: list[list[dict]], top_n: int = 8) -> list[dict]:
    seen: dict[str, dict] = {}
    for results in lists:
        for rank, hit in enumerate(results):
            cid = hit["id"]
            prev = seen.get(cid)
            # Always track the best rank seen so far, even when the older copy is kept
            best_rank = rank if prev is None else min(prev["best_rank"], rank)
            keep = hit if prev is None or hit.get("score", 0) > prev.get("score", 0) else prev
            seen[cid] = {**keep, "best_rank": best_rank}
    ordered = sorted(seen.values(), key=lambda h: (h["best_rank"], -h.get("score", 0)))
    return ordered[:top_n]


def multi_query_retrieve(question: str, store, k_per_query: int = 4) -> list[dict]:
    variants = generate_query_variants(question)
    lists = [naive_retrieve(q, store, k=k_per_query) for q in variants]
    return dedupe_by_id(lists)
public List<Document> multiQueryRetrieve(String question, int kPerQuery, int topN) {
    List<String> variants = new ArrayList<>();
    variants.add(question);
    variants.addAll(queryGenerator.generate(question, 3));
    // Insertion-ordered union by chunk ID, keeping the higher-scoring copy
    Map<String, Document> best = new LinkedHashMap<>();
    for (String q : variants) {
        for (Document d : naiveRetriever.retrieve(q, kPerQuery)) {
            String id = d.getId();
            best.merge(id, d, (a, b) ->
                    a.getScore() >= b.getScore() ? a : b);
        }
    }
    return best.values().stream().limit(topN).toList();
}
Always include the original question as one of the variants. LLM paraphrases occasionally drift intent; the raw user wording is a free anchor you should not discard.
Retrieving top-10 per variant with five variants yields 50 chunks before dedupe, and even after dedupe 15+ chunks can remain, enough to blow the context window. Cap per variant and final K aggressively; let reranking trim (next section).
“How would you improve RAG recall?” — Mention measurable baseline, then multi-query or HyDE for paraphrase gaps, hybrid for lexical IDs, reranker for precision, and explicit latency/cost trade-offs. Avoid saying “use a better model” without retrieval mechanics.
Step-back prompting
Step-back retrieval first asks a broader question that captures the underlying principle, retrieves on that, then retrieves on the original specific question. Principle-level chunks anchor context; detail-level chunks answer the user.
Users ask narrow questions; documentation often states rules at a higher level of abstraction. “Can contractor laptops use personal Dropbox?” may not match any chunk that mentions Dropbox, because the acceptable-use policy discusses unsanctioned cloud storage under a general principle. Step-back generates “What is our policy on unsanctioned third-party cloud file storage?”, which retrieves the principle chunk; the original query retrieves exceptions and procedures.
Two-pass retrieval flow
- LLM produces a step-back question (more general, same domain).
- Retrieve top-K₁ on step-back question.
- Retrieve top-K₂ on original question.
- Merge lists (dedupe by ID); prefer diverse sources when assembling context.
- Single answer LLM call with merged context.
STEP_BACK_PROMPT = """You are an expert at generalizing questions to underlying principles.
Given a specific user question, write ONE broader question that would help retrieve
high-level policy or concept documentation. Output only the question.
Specific question: {question}"""


def step_back_question(question: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        max_tokens=120,
        messages=[{"role": "user", "content": STEP_BACK_PROMPT.format(question=question)}],
    ).choices[0].message.content.strip()


def step_back_retrieve(question: str, store) -> list[dict]:
    broad = step_back_question(question)
    principle_hits = naive_retrieve(broad, store, k=3)   # K1: principle pass
    detail_hits = naive_retrieve(question, store, k=5)   # K2: detail pass
    return dedupe_by_id([principle_hits, detail_hits], top_n=8)
public List<Document> stepBackRetrieve(String question) {
    String broad = chatClient.prompt()
            .user(STEP_BACK_PROMPT.formatted(question))
            .call().content().trim();
    List<Document> principle = naiveRetriever.retrieve(broad, 3);
    List<Document> detail = naiveRetriever.retrieve(question, 5);
    return mergeDedupe(principle, detail, 8);
}
Step-back can pull overly general chunks that dilute the context window. Limit K₁ (principle pass) to 2–3 chunks and instruct the answer model to cite the specific passage when both general and specific rules appear.
Step-back is related to “query relaxation” in classical IR. The LLM acts as a learned query expander toward hypernyms (specific → general). It differs from HyDE, which fabricates a document rather than a broader question.
Sub-question decomposition
Compound questions contain multiple independent information needs. Decomposition splits them into sub-questions, retrieves for each sub-question, and merges evidence—enabling true multi-hop retrieval in one user turn without an agent loop.
“What’s the retention period for audit logs in the EU region, and who is the DPO contact?” requires GDPR retention docs and a contacts directory. No single embedding aligns with both. Decomposition yields: (1) EU audit log retention period, (2) DPO contact information—each gets its own retrieval pass.
Decomposition vs multi-query
| Technique | Output | Best for |
|---|---|---|
| Multi-query | Paraphrases of same intent | Lexical / embedding mismatch on one fact |
| Sub-questions | Distinct sub-intents | Compound “and” / multi-part questions |
| Step-back | One broader principle question | Specific detail needs general policy context |
import json

DECOMPOSE_PROMPT = """Break the user question into independent sub-questions answerable from a knowledge base.
Return JSON: {{"sub_questions": ["...", "..."]}} Max 4 sub-questions.
Question: {question}"""


def decompose(question: str) -> list[str]:
    raw = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": DECOMPOSE_PROMPT.format(question=question)}],
    ).choices[0].message.content
    data = json.loads(raw)
    subs = data.get("sub_questions", [])
    # Fall back to the original question when the model returns nothing usable
    return subs if subs else [question]


def decomposed_retrieve(question: str, store, k_per_sub: int = 4) -> list[dict]:
    subs = decompose(question)
    lists = []
    for sq in subs:
        hits = naive_retrieve(sq, store, k=k_per_sub)
        for h in hits:
            # Remember which sub-question surfaced each chunk
            h.setdefault("meta", {})["sub_question"] = sq
        lists.append(hits)
    merged = dedupe_by_id(lists, top_n=10)
    return merged
public List<Document> decomposedRetrieve(String question, int kPerSub) {
    List<String> subs = decomposer.decompose(question);
    if (subs.isEmpty()) subs = List.of(question);
    List<List<Document>> lists = new ArrayList<>();
    for (String sq : subs) {
        List<Document> hits = naiveRetriever.retrieve(sq, kPerSub);
        hits.forEach(d -> d.getMetadata().put("sub_question", sq));
        lists.add(hits);
    }
    return mergeDedupeLists(lists, 10);
}
For answer synthesis, pass sub-question labels in context headers so the LLM maps evidence to parts of the user question: [Sub-Q: DPO contact] before each chunk group. Reduces conflation when chunks contradict across regions.
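The header idea can be sketched as a small context builder. It assumes each hit carries `meta["sub_question"]` as set during decomposed retrieval; the `build_context` name, the "general" fallback label, and the sample hits are hypothetical.

```python
def build_context(hits: list[dict]) -> str:
    """Group chunks by sub-question label and emit one [Sub-Q: ...] header per group."""
    groups: dict[str, list[str]] = {}  # dicts preserve insertion order
    for hit in hits:
        label = hit.get("meta", {}).get("sub_question", "general")
        groups.setdefault(label, []).append(hit["text"])
    sections = [f"[Sub-Q: {label}]\n" + "\n".join(texts) for label, texts in groups.items()]
    return "\n\n".join(sections)


# Hypothetical hits as returned by a decomposed retrieval pass
hits = [
    {"id": "c1", "text": "(retention policy chunk)", "meta": {"sub_question": "EU audit log retention"}},
    {"id": "c2", "text": "(contacts directory chunk)", "meta": {"sub_question": "DPO contact"}},
    {"id": "c3", "text": "(logging appendix chunk)", "meta": {"sub_question": "EU audit log retention"}},
]
context = build_context(hits)
print(context)
```

Chunks for the same sub-question end up under one header even when they were interleaved in the merged ranking.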
Enterprise search products often detect compound questions with a lightweight classifier (regex on “ and ”, “?” count, conjunction NLP) before paying for decomposition LLM calls—routing simple questions to naive retrieve only.
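A rough sketch of such a pre-filter, using only regexes. The word lists and the "conjunction plus two interrogatives" rule are assumptions for illustration; a real router would be tuned on labeled traffic and may use a trained classifier instead.

```python
import re

CONJUNCTION_RE = re.compile(r"\b(and|or|as well as|plus)\b", re.IGNORECASE)
WH_RE = re.compile(r"\b(who|what|when|where|why|how|which)\b", re.IGNORECASE)


def looks_compound(question: str) -> bool:
    """Cheap pre-filter: True when the question likely carries several intents."""
    if question.count("?") >= 2:
        return True
    # A conjunction plus two or more interrogatives suggests distinct intents
    return bool(CONJUNCTION_RE.search(question)) and len(WH_RE.findall(question)) >= 2


print(looks_compound("How do I reset SSO and who approves it?"))  # True
print(looks_compound("What is the refund policy?"))               # False
```

Only questions flagged here would pay for the decomposition LLM call; everything else goes straight to single-query retrieval.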
Over-decomposition on single-intent questions adds noise sub-queries that retrieve irrelevant chunks. Cap at 3–4 sub-questions and fall back to single-query when the model returns one sub-question identical to the original.
Reranking — two-stage top-50 → top-5
First-stage bi-encoder retrieval optimizes recall over millions of chunks in milliseconds. Reranking optimizes precision on a small candidate set using a cross-encoder or vendor rerank API that scores (query, document) pairs jointly. Production pattern: retrieve 50, rerank to 5, then prompt the LLM.
Cross-encoders concatenate query and candidate text through a transformer and emit a relevance score. They see token-level interactions (“cancel” in query aligns with “termination” in doc) that bi-encoders miss. Cost is linear in candidates—hence the two-stage funnel.
Reranker options
| Option | Hosting | Latency (50 pairs) | Notes |
|---|---|---|---|
| Cohere Rerank | API (rerank v3) | ~100–300 ms | Strong multilingual; per-search pricing |
| Jina Reranker | API or self-host | ~80–250 ms | jina-reranker-v2-base-multilingual |
| BGE reranker | Self-host (FlagEmbedding) | GPU-dependent | bge-reranker-v2-m3; data stays in VPC |
| ms-marco MiniLM | Self-host CPU | ~200–500 ms | Classic baseline; English-centric |
| Voyage rerank | API | Similar to Cohere | Pairs with Voyage embeddings |
Two-stage pipeline
- Retrieve top-N with bi-encoder (N = 30–50; tune on eval set).
- Rerank all N pairs; sort by rerank score.
- Truncate to top-K for LLM context (K = 5–8).
- Log both bi-encoder rank and rerank rank for drift analysis.
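The logging step in the list above could look roughly like the following: for every chunk that survives reranking, record where the bi-encoder had placed it. The `rank_shift_report` name and field names are hypothetical.

```python
from typing import Optional


def rank_shift_report(coarse_ids: list[str], reranked_ids: list[str]) -> list[dict]:
    """For each reranked chunk, record its first-stage rank and how far it moved."""
    coarse_rank = {cid: i for i, cid in enumerate(coarse_ids, start=1)}
    report = []
    for new_rank, cid in enumerate(reranked_ids, start=1):
        old_rank: Optional[int] = coarse_rank.get(cid)  # None if absent from stage one
        report.append({
            "id": cid,
            "bi_encoder_rank": old_rank,
            "rerank_rank": new_rank,
            # Positive means the reranker moved the chunk up
            "promoted_by": None if old_rank is None else old_rank - new_rank,
        })
    return report


report = rank_shift_report(["a", "b", "c", "d"], ["c", "a"])
print(report[0])  # {'id': 'c', 'bi_encoder_rank': 3, 'rerank_rank': 1, 'promoted_by': 2}
```

If the reranker routinely promotes chunks from rank 30+ into the top 5, the first stage is the weak link; if it rarely reorders anything, the rerank call may not be earning its latency.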
import cohere

co = cohere.Client()


def rerank_hits(question: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    if not candidates:
        return []
    docs = [c["text"] for c in candidates]
    response = co.rerank(
        model="rerank-v3.5",
        query=question,
        documents=docs,
        top_n=min(top_n, len(docs)),
    )
    out = []
    for r in response.results:
        # r.index points back into the candidates list passed in
        hit = {**candidates[r.index], "rerank_score": r.relevance_score}
        out.append(hit)
    return out


def two_stage_retrieve(question: str, store, retrieve_n: int = 50, final_k: int = 5) -> list[dict]:
    coarse = naive_retrieve(question, store, k=retrieve_n)
    return rerank_hits(question, coarse, top_n=final_k)
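// Assumes a custom RerankModel bean (for example a thin wrapper around the Cohere rerank
// endpoint); RerankRequest and the result type are placeholders, so check what your
// Spring AI version or vendor SDK actually provides
public List<Document> twoStageRetrieve(String question, int retrieveN, int finalK) {
    List<Document> coarse = naiveRetriever.retrieve(question, retrieveN);
    RerankRequest req = new RerankRequest(question, coarse, finalK);
    return rerankModel.call(req).getResults();
}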
from sentence_transformers import CrossEncoder

_reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")


def bge_rerank(question: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    if not candidates:
        return []
    # One (query, document) pair per candidate; the model scores each pair jointly
    pairs = [(question, c["text"]) for c in candidates]
    scores = _reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [{**c, "rerank_score": float(s)} for c, s in ranked[:top_n]]
// ONNX or DJL-serving deployment of bge-reranker in VPC
// Java client posts batch of {query, document} pairs to inference endpoint
// and receives float[] scores — same contract as Python CrossEncoder.predict
Cohere Rerank bills per search (query + all documents in one call). Fifty 500-token chunks ≈ 25k rerank tokens. Compare API cost vs GPU amortization for self-hosted BGE at your QPS—break-even often appears around 1–5M rerank calls/month.
Rerank after hybrid fusion, not twice. Run bi-encoder + BM25 → RRF → single rerank on the merged top-50. Reranking each rewriter variant separately multiplies cost.
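Explain the bi-encoder vs cross-encoder cost model: a bi-encoder encodes the query once and compares it against precomputed document vectors (sublinear in index size with an ANN index), while a cross-encoder needs one transformer forward pass per (query, candidate) pair, so its cost is linear in the candidate count and impractical over a whole index. The two-stage pattern is the standard answer for “how do you scale semantic search to millions of docs with high precision?”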
Contextual retrieval (Anthropic)
Chunks lose surrounding context when split. Contextual retrieval prepends a short, LLM-generated context blurb to each chunk before embedding at index time, so vectors encode “this passage is about EU GDPR retention in the logging policy” not just “retention is 90 days.”
Anthropic’s published workflow: for each chunk, prompt Claude with the full document (or parent section) and the chunk text; ask for a concise situating prefix (50–100 tokens). Store context_prefix + chunk as the embedded text; at query time search as usual. Reported gains: contextual embeddings alone cut the top-20 retrieval failure rate by about 35%; combining them with contextual BM25 cut it by about 49%, and adding a reranker brought the reduction to about 67%.
Index-time vs query-time
| Approach | When | Cost | Benefit |
|---|---|---|---|
| Contextual prefix (index) | Ingest pipeline | One LLM call per chunk (batchable) | Fixes orphan chunk embeddings permanently |
| Parent document summary | Ingest | One call per doc + attach as metadata | Cheaper; less granular |
| HyDE (query) | Each query | Per user question | Fixes query side, not chunk side |
CONTEXTUALIZE_PROMPT = """<document>
{whole_document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Give a short succinct context (50-100 tokens) to situate this chunk in the document
for search retrieval. Answer only with the context, nothing else."""


def contextualize_chunk(whole_document: str, chunk: str) -> str:
    # Truncate very long documents by characters to stay within the context window
    msg = CONTEXTUALIZE_PROMPT.format(whole_document=whole_document[:120_000], chunk=chunk)
    prefix = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        max_tokens=150,
        messages=[{"role": "user", "content": msg}],
    ).choices[0].message.content.strip()
    return f"{prefix}\n\n{chunk}"


def index_with_context(chunks: list[dict], doc_text: str, store) -> None:
    for ch in chunks:
        # Embed prefix + chunk, but store the original chunk text for the answer prompt
        embedded_text = contextualize_chunk(doc_text, ch["text"])
        vec = embed(embedded_text)
        store.upsert(id=ch["id"], vector=vec, text=ch["text"], meta={
            **ch.get("meta", {}),
            "embedded_text": embedded_text[:500],  # debug only; redact in prod logs
        })
public String contextualizeChunk(String wholeDocument, String chunk) {
    String prompt = CONTEXTUALIZE_PROMPT.formatted(
            truncate(wholeDocument, 120_000), chunk);
    String prefix = chatClient.prompt().user(prompt)
            .options(ChatOptionsBuilder.builder().withTemperature(0.0).withMaxTokens(150).build())
            .call().content().trim();
    return prefix + "\n\n" + chunk;
}

public void indexWithContext(List<Chunk> chunks, String docText) {
    for (Chunk ch : chunks) {
        // Embed prefix + chunk, but store the original chunk text
        String embeddedText = contextualizeChunk(docText, ch.getText());
        vectorStore.add(new Document(ch.getId(), ch.getText(),
                embeddingModel.embed(embeddedText), ch.getMeta()));
    }
}
Re-index when contextualization prompt or model version changes—vectors are not backward compatible. Version the index (index_version: contextual-v2-gpt4o-mini) and dual-write during migrations.
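One possible way to make that versioning automatic is to derive the tag from everything that affects the vectors, so any prompt or model change produces a new index name. The `index_version` helper and its tag layout below are assumptions for illustration, not a convention from any vector database.

```python
import hashlib


def index_version(prompt_template: str, context_model: str, embed_model: str) -> str:
    """Derive a stable version tag; any change to the inputs yields a new tag."""
    fingerprint = "\x1f".join([prompt_template, context_model, embed_model])
    digest = hashlib.sha256(fingerprint.encode("utf-8")).hexdigest()[:8]
    return f"contextual-{context_model}-{digest}"


v_old = index_version("Situate this chunk: {chunk}", "gpt-4o-mini", "text-embedding-3-small")
v_new = index_version("Situate this chunk briefly: {chunk}", "gpt-4o-mini", "text-embedding-3-small")
print(v_old != v_new)  # True: a prompt edit forces a new index version
```

During a migration, the ingest job would write to both the old and the new tag while queries stay pinned to the old one until evals pass on the new index.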
Contextual retrieval fixes the document tower of asymmetric retrieval. Query rewriting fixes the query tower. Production stacks often use both: contextual prefixes at ingest, light multi-query or rerank at query time.
One LLM call per chunk at ingest is expensive for 10M chunks. Use parent-section context instead of full document for long PDFs, batch prompts, and run contextualization only on high-traffic collections first.
Contextualization prompts include full document text. Route through the same data-residency and PII scrubbing pipeline as embedding; do not send restricted annexes to external APIs without legal review.
Sparse retrieval — BM25 & TF-IDF
Sparse methods represent text as high-dimensional vectors with mostly zero weights—classic bag-of-words with term frequency and inverse document frequency. BM25 is the production default for lexical search; dense embeddings are the default for semantic search. Serious RAG runs both.
BM25 scores documents by weighted token overlap with the query, with length normalization so long boilerplate chunks do not dominate. It excels at SKUs, error codes, legal citations, person names, and rare technical tokens that embedding models under-weight. It fails on paraphrase and cross-language semantic match—complement, not replacement, for dense retrieval.
BM25 vs TF-IDF
| Method | Scoring idea | Typical use |
|---|---|---|
| TF-IDF | Term freq × log(N/df) | Teaching baseline; sklearn prototypes |
| BM25Okapi | Saturated TF + length norm + IDF | Production sparse retrieval default |
| Elasticsearch BM25 | Same family, distributed inverted index | 10M+ docs, analyzers, facets |
Managed vs in-process
OpenSearch / Elasticsearch: inverted index at scale, analyzers (stemming, stopwords, synonyms), hybrid kNN + BM25 in one cluster. rank_bm25 in Python: fine for <500k chunks in memory on a single pod. Pick managed search when you need ACL filters, aggregations, and ops-owned scaling.
from rank_bm25 import BM25Okapi


class Bm25Index:
    def __init__(self, chunks: list[dict]):
        self.chunks = chunks
        # Whitespace tokenization; use the same tokenizer in search()
        corpus = [c["text"].lower().split() for c in chunks]
        self.bm25 = BM25Okapi(corpus)

    def search(self, query: str, k: int = 10) -> list[dict]:
        tokens = query.lower().split()
        scores = self.bm25.get_scores(tokens)
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        # Drop zero-score chunks: they share no query token
        return [
            {"id": self.chunks[i]["id"], "text": self.chunks[i]["text"],
             "score": float(scores[i]), "channel": "bm25"}
            for i in ranked if scores[i] > 0
        ]
// Elasticsearch Java API Client: BM25 match query
SearchResponse<ChunkDoc> response = esClient.search(s -> s
        .index("kb-chunks")
        .query(q -> q.match(m -> m.field("text").query(question)))
        .size(k),
    ChunkDoc.class);
List<Document> hits = response.hits().hits().stream()
        .map(h -> toDocument(h))
        .toList();
# OpenSearch 2.x: combine knn and match in one query (then RRF in app or search pipeline)
hybrid_query = {
    "query": {
        "hybrid": {
            "queries": [
                {"match": {"text": {"query": question}}},
                {"knn": {"embedding": {"vector": q_vec, "k": 50}}},
            ]
        }
    }
}
# Many teams still run two queries + RRF in application code for tunability
// Parallel: CompletableFuture for knn branch + match branch, merge with RrfMerger
CompletableFuture<List<Document>> dense = CompletableFuture.supplyAsync(
        () -> vectorStore.similaritySearch(question, 50));
CompletableFuture<List<Document>> sparse = CompletableFuture.supplyAsync(
        () -> openSearchClient.bm25Search(question, 50));
List<Document> fused = rrfMerger.merge(List.of(dense.join(), sparse.join()), 10);
Support KBs with ticket IDs and policy numbers almost always ship hybrid. Internal wiki-only RAG may stay dense-only until evals show ORD-/SKU queries failing—then add BM25 on the same chunk IDs without re-chunking.
Tokenizing BM25 with naive .split() breaks on code snippets, emails, and hyphenated product names. Mirror Elasticsearch analyzers in offline evals, or use the same analyzer at index and query time.
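As a small illustration of the problem, the regex tokenizer below keeps emails and hyphenated IDs as single tokens and strips trailing punctuation, where `.split()` would produce `ord-1042?`. The pattern is an assumption, not a mirror of any Elasticsearch analyzer; the point is only that one function serves both indexing and querying.

```python
import re

# Emails first, then words/IDs joined by '-', '_' or '.' (ORD-1042, v2.1)
TOKEN_RE = re.compile(
    r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}"  # email addresses
    r"|[a-z0-9]+(?:[-_.][a-z0-9]+)*"          # words, IDs, dotted versions
)


def tokenize(text: str) -> list[str]:
    """Lowercase and extract tokens; use the same function at index and query time."""
    return TOKEN_RE.findall(text.lower())


print("Status for ORD-1042?".lower().split())  # ['status', 'for', 'ord-1042?']
print(tokenize("Status for ORD-1042?"))        # ['status', 'for', 'ord-1042']
```

Keeping `ord-1042` whole means a query for just `1042` will not match; whether to also emit the sub-parts is a design choice to settle with evals.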
Hybrid fusion — reciprocal rank fusion (RRF)
Dense and sparse retrievers return scores on incomparable scales—cosine similarity vs BM25 raw scores. Reciprocal Rank Fusion merges ranked lists using only ranks, not raw scores: robust, parameter-light, and widely used in hybrid RAG.
RRF score formula
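For document $d$, given ranked lists from retrievers $1 \dots M$, each list contributes $\mathrm{rank}_i(d)$ (1-based; a document missing from list $i$ has rank $\infty$ and contributes 0):
$$\mathrm{RRF}(d) = \sum_{i=1}^{M} \frac{1}{k + \mathrm{rank}_i(d)}$$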
Constant k (typically 60, from the original RRF paper) dampens the impact of top-1 obsession: with k = 60, a document ranked 5th in both lists (2/65 ≈ 0.0308) beats a document ranked 1st in BM25 but 40th in dense (1/61 + 1/100 ≈ 0.0264). Documents appearing in multiple lists accumulate mass.
| Parameter | Typical value | Effect |
|---|---|---|
| k | 60 | Higher → flatter fusion; lower → top ranks dominate |
| Lists merged | dense + BM25 (+ optional multi-query lists) | More lists → more recall; tune k |
| top_n after fusion | 10–50 before rerank | Feed reranker a wide funnel |
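To make the k row of the table concrete, the sketch below scores two hypothetical documents under a very small k and under the usual k = 60. Document A is 1st in BM25 but 40th in dense; document B is 5th in both lists. The names are illustrative only.

```python
def rrf(ranks: list[int], k: int) -> float:
    """Sum of 1 / (k + rank) over the lists where the document appears (1-based ranks)."""
    return sum(1.0 / (k + r) for r in ranks)


doc_a, doc_b = [1, 40], [5, 5]  # hypothetical rank positions

for k in (1, 60):
    winner = "A" if rrf(doc_a, k) > rrf(doc_b, k) else "B"
    print(f"k={k}: A={rrf(doc_a, k):.4f} B={rrf(doc_b, k):.4f} winner={winner}")
# k=1: A=0.5244 B=0.3333 winner=A
# k=60: A=0.0264 B=0.0308 winner=B
```

With a low k the single top-1 placement dominates; at k = 60 the fusion is flatter and consistent mid-rank agreement across both lists wins.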
from collections import defaultdict


def rrf_score(rank: int, k: int = 60) -> float:
    """rank is 1-based (first result = 1)."""
    return 1.0 / (k + rank)


def rrf_merge(lists: list[list[dict]], k: int = 60, top_n: int = 10) -> list[dict]:
    """
    lists: each inner list is ordered best-first with unique chunk 'id' keys.
    Returns merged list sorted by RRF score descending.
    """
    scores: dict[str, float] = defaultdict(float)
    payload: dict[str, dict] = {}
    for results in lists:
        for rank_0, hit in enumerate(results):
            cid = hit["id"]
            scores[cid] += rrf_score(rank_0 + 1, k=k)
            # First-seen fields win on conflict; later lists only fill in missing keys.
            # Pass the dense list first to prefer dense metadata on duplicates.
            payload[cid] = {**hit, **payload.get(cid, {})}
    ordered_ids = sorted(scores.keys(), key=lambda c: scores[c], reverse=True)[:top_n]
    return [{**payload[cid], "rrf_score": scores[cid]} for cid in ordered_ids]


def hybrid_retrieve(question: str, dense_store, bm25: Bm25Index, k_each: int = 25, final_n: int = 10) -> list[dict]:
    dense_hits = naive_retrieve(question, dense_store, k=k_each)
    sparse_hits = bm25.search(question, k=k_each)
    return rrf_merge([dense_hits, sparse_hits], k=60, top_n=final_n)
public final class RrfMerger {
    private final int k;

    public RrfMerger(int k) { this.k = k; }

    private double rrfScore(int rankOneBased) {
        return 1.0 / (k + rankOneBased);
    }

    public List<Document> merge(List<List<Document>> lists, int topN) {
        Map<String, Double> scores = new HashMap<>();
        Map<String, Document> payload = new HashMap<>();
        for (List<Document> results : lists) {
            for (int i = 0; i < results.size(); i++) {
                Document d = results.get(i);
                scores.merge(d.getId(), rrfScore(i + 1), Double::sum);
                // Keep the first-seen copy: raw scores from different retrievers
                // (cosine vs BM25) are not comparable, so do not pick by getScore()
                payload.putIfAbsent(d.getId(), d);
            }
        }
        return scores.entrySet().stream()
                .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
                .limit(topN)
                .map(e -> {
                    Document doc = payload.get(e.getKey());
                    doc.getMetadata().put("rrf_score", e.getValue());
                    return doc;
                })
                .toList();
    }
}
End-to-end hybrid + rerank stack
- Dense top-25 + BM25 top-25 in parallel.
- RRF → top-15 unique chunks.
- Cross-encoder / Cohere rerank → top-5 for LLM.
- Log channel attribution (dense-only, sparse-only, both) for eval dashboards.
RRF is rank-based—no score normalization required. Alternatives include weighted linear combination after min-max scaling (fragile when score distributions shift) and learned fusion (LTR models)—heavier ops. RRF is the default hybrid merge in most RAG platforms in 2024–2026.
Tag each hit with retrieval_channels: ["dense","bm25"] before fusion. When debugging wrong answers, you instantly see whether the failure was semantic, lexical, or fusion.
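A minimal sketch of that tagging, assuming each retriever returns dicts with an `id` key. The channel map is built from the per-retriever lists before fusion and then attached to the fused hits; `tag_channels` and `annotate` are hypothetical helper names.

```python
def tag_channels(lists_by_channel: dict[str, list[dict]]) -> dict[str, list[str]]:
    """Map chunk ID -> sorted list of channels that returned it."""
    channels: dict[str, set[str]] = {}
    for channel, hits in lists_by_channel.items():
        for hit in hits:
            channels.setdefault(hit["id"], set()).add(channel)
    return {cid: sorted(names) for cid, names in channels.items()}


def annotate(fused: list[dict], channels: dict[str, list[str]]) -> list[dict]:
    """Attach retrieval_channels to each fused hit without mutating the input."""
    return [{**hit, "retrieval_channels": channels.get(hit["id"], [])} for hit in fused]


dense_hits = [{"id": "c1"}, {"id": "c2"}]
sparse_hits = [{"id": "c2"}, {"id": "c3"}]
channels = tag_channels({"dense": dense_hits, "bm25": sparse_hits})
fused = annotate([{"id": "c2"}, {"id": "c3"}], channels)
print(fused[0]["retrieval_channels"])  # ['bm25', 'dense']
```

A chunk tagged only `bm25` that turns out to be the gold answer points at a semantic miss; one tagged only `dense` points at a lexical miss.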
Hybrid doubles index storage (vectors + inverted index) and query work (two searches). The dominant cost is often still the answer LLM, not retrieval, unless you run five query rewrites on every turn. Right-size k_each before adding more lists to RRF.
What this looks like in production
Retrieval strategy is a latency and cost budget problem. The winning stack is not “enable everything”—it is a routed pipeline with metrics, fallbacks, and clear rules for when to skip rewriting.
Latency budget (example enterprise chat)
| Stage | p95 target | Skip when |
|---|---|---|
| Query rewrite (HyDE / multi-query) | 300–800 ms | Naive top-1 score > threshold; cached FAQ |
| Dense + sparse parallel | 50–150 ms | Never—core path |
| RRF merge | < 5 ms | Never—in-process |
| Rerank top-50 → 5 | 100–400 ms | Low-stakes internal bot; offline batch |
| Answer LLM | 1–4 s | N/A |
| Total retrieval | < 1.2 s p95 | Route simple queries to fast path |
When to skip query rewriting
- High-confidence retrieval — top-1 cosine ≥ tuned threshold on held-out evals.
- Metadata-filtered lookup — user picked product + region; filter narrows to <100 chunks.
- Exact keyword routes — regex detects ticket ID; go BM25-only.
- Latency SLO < 2 s end-to-end — rewriting + rerank + answer may exceed budget.
- Cached questions — same FAQ click within TTL; reuse chunk IDs.
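The rules above can be sketched as a single routing function. Everything here is illustrative: the `RouteContext` fields, the ticket-ID regex (modelled on IDs such as ORD-1042), and the default thresholds are assumptions to be replaced with values tuned on your own evals.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical ticket-ID pattern, modelled on IDs such as ORD-1042.
TICKET_ID = re.compile(r"\b[A-Z]{2,5}-\d{3,}\b")


@dataclass
class RouteContext:
    top1_cosine: float                   # best dense score from the first pass
    filtered_chunk_count: Optional[int]  # pool size after metadata filters, None if unfiltered
    latency_slo_ms: int                  # end-to-end SLO for this surface
    cache_hit: bool                      # same FAQ question seen within the TTL


def skip_rewrite_reason(
    question: str,
    ctx: RouteContext,
    cosine_threshold: float = 0.80,  # placeholder: tune on held-out evals
    small_pool_limit: int = 100,
    tight_slo_ms: int = 2000,
) -> Optional[str]:
    """Return the first rule that says to skip rewriting, or None to allow it."""
    if ctx.cache_hit:
        return "cached_question"
    if TICKET_ID.search(question):
        return "exact_keyword"  # caller should route to BM25-only
    if ctx.filtered_chunk_count is not None and ctx.filtered_chunk_count < small_pool_limit:
        return "metadata_filtered"
    if ctx.top1_cosine >= cosine_threshold:
        return "high_confidence"
    if ctx.latency_slo_ms < tight_slo_ms:
        return "tight_latency_slo"
    return None
```

Returning the reason as a string, rather than a bare boolean, lets the same value be logged as part of the strategy path, so skipped rewrites remain visible in failure analysis.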
When to enable each strategy
| Strategy | Enable when eval shows… |
|---|---|
| HyDE / multi-query | Paraphrase bucket recall@5 < 70% |
| Step-back | Policy “principle” questions miss general section |
| Decomposition | Compound questions > 15% of traffic |
| Contextual retrieval | Small orphan chunks dominate failures |
| Hybrid BM25 | ID/code/SKU queries fail dense-only |
| Rerank | Right chunk in top-50 but not top-5 |
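The triggers in this table depend on per-bucket metrics. A minimal sketch of computing recall@k and MRR per query bucket follows. The example record shape (`bucket`, `retrieved`, `relevant`) is an assumption of this illustration, and recall@k is taken here as the fraction of labelled relevant chunks that appear in the top k.

```python
from collections import defaultdict


def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant chunk IDs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def bucket_report(examples: list[dict], k: int = 5) -> dict[str, dict[str, float]]:
    """Average recall@k and MRR per bucket (e.g. paraphrase, id_lookup)."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for ex in examples:
        acc = sums[ex["bucket"]]
        acc[0] += recall_at_k(ex["retrieved"], ex["relevant"], k)
        acc[1] += reciprocal_rank(ex["retrieved"], ex["relevant"])
        acc[2] += 1
    return {b: {"recall": r / n, "mrr": m / n, "n": n} for b, (r, m, n) in sums.items()}


# Hypothetical labelled examples.
examples = [
    {"bucket": "paraphrase", "retrieved": ["a", "b", "c"], "relevant": {"b"}},
    {"bucket": "paraphrase", "retrieved": ["x", "y"], "relevant": {"z"}},
]
report = bucket_report(examples)
enable_rewrite = report["paraphrase"]["recall"] < 0.70
print(report, enable_rewrite)
```

With these two examples the paraphrase bucket has recall@5 of 0.5, which falls under the 70% line in the table, so the HyDE / multi-query row would apply.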
Observability fields (every query)
- retrieval_strategy_path — e.g. fast | hyde+hybrid+rerank
- chunk_ids, ranks per channel, rrf_score, rerank_score
- rewrite prompt version; hypothesis length (HyDE)
- retrieve_ms, rerank_ms, rewrite_ms, answer_ms
- user_feedback / thumbs — link to retrieved set for failure mining
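A minimal sketch of a per-query log record that captures these fields is shown below. The `RetrievalLog` class and its `timed` helper are hypothetical; only the field names are taken from the list above. Stage timings are recorded with a context manager so that each stage writes its own `*_ms` entry.

```python
import json
import time
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RetrievalLog:
    retrieval_strategy_path: str = "fast"
    chunk_ids: list[str] = field(default_factory=list)
    rewrite_prompt_version: Optional[str] = None
    timings_ms: dict[str, float] = field(default_factory=dict)

    @contextmanager
    def timed(self, stage: str):
        """Record wall-clock time for one stage as '<stage>_ms'."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings_ms[f"{stage}_ms"] = (time.perf_counter() - start) * 1000.0


log = RetrievalLog()
with log.timed("retrieve"):
    hits = ["c2", "c7"]  # stand-in for the real retrieval call
log.chunk_ids = hits
log.retrieval_strategy_path = "hybrid_only"
line = json.dumps(asdict(log))  # one JSON line per query for the log pipeline
print(line)
```

Timing in a `finally` block means a stage that raises still leaves a timing entry, which helps when investigating timeouts.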
def retrieve_for_production(question: str, store, bm25: Bm25Index, cfg) -> tuple[list[dict], dict]:
    meta = {"path": "fast"}
    coarse = hybrid_retrieve(question, store, bm25, k_each=25, final_n=50)
    # Fast path: the top hit is confident enough to skip rewriting.
    # Tune the threshold against whatever value rrf_merge leaves in "score".
    if coarse and coarse[0].get("score", 0) >= cfg.confidence_skip_rewrite:
        meta["path"] = "hybrid_only"
        final = rerank_hits(question, coarse, top_n=5) if cfg.use_rerank else coarse[:5]
        return final, meta
    # Slow path: widen recall with multi-query before the optional rerank.
    if cfg.use_multi_query:
        coarse = multi_query_retrieve(question, store)
        meta["path"] = "multi_query+hybrid"
    if cfg.use_rerank:
        final = rerank_hits(question, coarse, top_n=5)
        meta["path"] += "+rerank"
    else:
        final = coarse[:5]
    return final, meta
public RetrievalResult retrieveForProduction(String question, RetrievalConfig cfg) {
    RetrievalMeta meta = new RetrievalMeta();
    List<Document> coarse = hybridRetriever.retrieve(question, 50);
    // Fast path: the top hit is confident enough to skip rewriting.
    if (!coarse.isEmpty() && coarse.get(0).getScore() >= cfg.getConfidenceSkipRewrite()) {
        meta.setPath("hybrid_only");
        return new RetrievalResult(
            cfg.isUseRerank()
                ? reranker.rerank(question, coarse, 5)
                // Guard against fewer than 5 hits; subList(0, 5) would throw.
                : coarse.subList(0, Math.min(5, coarse.size())),
            meta);
    }
    // Slow path: widen recall with multi-query before the optional rerank.
    if (cfg.isUseMultiQuery()) {
        coarse = multiQueryRetriever.retrieve(question);
        meta.setPath("multi_query+hybrid");
    }
    List<Document> finalList = cfg.isUseRerank()
        ? reranker.rerank(question, coarse, 5)
        : coarse.subList(0, Math.min(5, coarse.size()));
    return new RetrievalResult(finalList, meta);
}
Production readiness checklist
- Labelled eval set with per-strategy recall@5 and MRR
- Fast path + slow path with explicit routing rules
- Same chunk IDs across dense, sparse, and citation UI
- Index versioning for contextual embedding migrations
- Rerank and rewrite behind feature flags per tenant
- Retrieval logs joined to answer quality (thumbs, CSAT)
- Load test: p95 retrieve+rerank under 2× expected QPS
Perplexity-style products iterate retrieval in offline replay before online A/B. Ship one strategy at a time—hybrid first, rerank second, query rewriting last—so regressions are attributable.
“Design RAG retrieval for a legal KB.” Open with chunking + hybrid + rerank, then cover contextual indexing for citations, a latency budget with a fast path, and eval metrics. Close by stating that raw user queries are never sent to a rewrite model without a logging and redaction policy.
Store the retrieval snapshot (chunk IDs + scores) on each answer record. When users report wrong answers six weeks later, you can replay with new strategies without reproducing the session.
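A snapshot record of this kind might look like the sketch below. The `RetrievalSnapshot` fields and the `replay_diff` helper are assumptions made for illustration; the only requirement taken from the text is that chunk IDs and scores are stored with each answer.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RetrievalSnapshot:
    answer_id: str
    question: str
    strategy_path: str
    index_version: str
    hits: list[dict]  # [{"chunk_id": ..., "score": ...}, ...] in final rank order

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_json(raw: str) -> "RetrievalSnapshot":
        return RetrievalSnapshot(**json.loads(raw))


def replay_diff(snapshot: RetrievalSnapshot, new_chunk_ids: list[str]) -> dict[str, list[str]]:
    """Compare the stored chunk IDs with what a new strategy retrieves."""
    old_ids = [h["chunk_id"] for h in snapshot.hits]
    return {
        "kept": [c for c in old_ids if c in new_chunk_ids],
        "dropped": [c for c in old_ids if c not in new_chunk_ids],
        "added": [c for c in new_chunk_ids if c not in old_ids],
    }


# Hypothetical stored answer and a replay under a new strategy.
snap = RetrievalSnapshot(
    answer_id="ans-1",
    question="Can I get money back if I cancel my yearly plan mid-cycle?",
    strategy_path="hybrid_only",
    index_version="v3",
    hits=[{"chunk_id": "c2", "score": 0.0325}, {"chunk_id": "c7", "score": 0.0323}],
)
restored = RetrievalSnapshot.from_json(snap.to_json())
diff = replay_diff(restored, ["c7", "c9"])
print(diff)
```

Storing the index version alongside the hits matters for replay: chunk IDs are only comparable across runs if the index that produced them can be identified.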