
Retrieval strategies & query rewriting

Naive RAG embeds the user question and runs one vector search. That works until paraphrases miss, compound questions need two hops, or exact tokens like ORD-1042 never appear in the embedding space. This guide is a production-first catalog of retrieval upgrades—query rewriting, decomposition, reranking, contextual indexing, sparse search, and hybrid fusion—so you can pick the right lever without turning every query into a five-LLM pipeline.

After reading, you should be able to: diagnose when single-embedding retrieval fails; implement HyDE, multi-query, step-back, and sub-question decomposition; run a two-stage retrieve→rerank pipeline with cross-encoders or vendor APIs; apply Anthropic-style contextual chunk prefixes at index time; combine BM25 and dense vectors with reciprocal rank fusion; and budget latency so rewriting runs only when evals justify it.

developer platform architect Track 2 HyDE Cohere Rerank BM25 + RRF LangChain 0.3+ Spring AI 1.0+

Why naive retrieval breaks

The default RAG path—embed the question, cosine-search the index, stuff top-K chunks into the prompt—is correct as a baseline. It is not sufficient for production Q&A over heterogeneous corpora. Failures cluster into predictable patterns you can measure before adding complexity.

A bi-encoder maps query and document into the same vector space independently. Similarity is cheap and scalable, but the query vector is built from how the user asked, not from how the answer appears in the corpus. Support tickets use informal language; policy PDFs use legal phrasing. The embedding model may bridge that gap—or may not, especially on domain jargon, rare acronyms, or questions that never appeared in training data.

Paraphrase and vocabulary mismatch

Consider a user who asks: “Can I get money back if I cancel my yearly plan mid-cycle?” The refund policy chunk says: “Annual subscriptions are eligible for pro-rata credit upon early termination.” Few tokens overlap, yet a strong embedding model often still retrieves the right chunk. But swap in a product-specific term (“FlexAnnual tier” in the doc vs “the flexible annual billing option” in the query) and recall can drop sharply. This is the lexical gap: dense vectors handle semantic similarity, but they are weaker when the match depends on exact product names, error codes, or SKU strings that are nearly unique in the corpus.

Failure mode | Example | Why one embedding struggles | Typical fix (later sections)
Paraphrase | “cancel yearly” vs “early termination of annual subscription” | Query and doc live in different wording regions of vector space | HyDE, multi-query, hybrid BM25
Multi-hop | “Does our EU DPA cover subprocessors used by the analytics integration?” | Answer requires chunk A (DPA scope) + chunk B (analytics vendor list) | Sub-question decomposition, multi-query
Abstract → concrete | “What’s our stance on shadow IT?” | Policy uses “unsanctioned SaaS” not “shadow IT” | Step-back prompting, HyDE
Exact token | “Status for ORD-1042” | Embedding treats ID as low-signal noise | Sparse BM25, hybrid RRF
Underspecified chunk | Chunk: “The limit is 500 requests per minute.” | No surrounding context: which API? which tier? | Contextual retrieval at index time

Multi-hop reasoning without multi-hop retrieval

Large language models can synthesize across multiple facts once those facts are in context. Naive retrieval assumes a single hop: one query vector, one ranked list. Multi-hop questions need either (a) multiple retrieval passes with different sub-queries, (b) a graph or agent loop that retrieves, reads, and retrieves again, or (c) larger chunks that accidentally contain both facts. Option (c) scales poorly; option (b) is Track 3 (agents). This guide focuses on (a): rewriting and decomposing the query before retrieval, so that a single answer LLM call sees all the evidence.

Measure the baseline before upgrading. On a labeled set of 50–200 questions, log recall@K (is the gold chunk in top K?) and MRR (mean reciprocal rank). If recall@5 is already 90%, query rewriting adds latency and cost for marginal gain. If recall@5 is 55% on paraphrase-heavy tickets, HyDE or multi-query is justified.
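The two metrics above can be computed with a few lines of Python. This is a minimal sketch, assuming each eval item is a pair of (ranked chunk IDs returned by the retriever, set of gold chunk IDs); the function names and the data layout are illustrative, not part of any library.

```python
from typing import Sequence

def recall_at_k(ranked_ids: Sequence[str], gold_ids: set[str], k: int) -> float:
    """1.0 if any gold chunk appears in the top k results, else 0.0."""
    return 1.0 if any(cid in gold_ids for cid in ranked_ids[:k]) else 0.0

def reciprocal_rank(ranked_ids: Sequence[str], gold_ids: set[str]) -> float:
    """1 / rank of the first gold chunk (1-based); 0.0 if none was retrieved."""
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in gold_ids:
            return 1.0 / rank
    return 0.0

def evaluate(runs: list[tuple[list[str], set[str]]], k: int = 5) -> dict[str, float]:
    """Average recall@k and MRR over a labeled set of (ranked_ids, gold_ids) pairs."""
    if not runs:
        return {"recall@k": 0.0, "mrr": 0.0}
    n = len(runs)
    return {
        "recall@k": sum(recall_at_k(r, g, k) for r, g in runs) / n,
        "mrr": sum(reciprocal_rank(r, g) for r, g in runs) / n,
    }
```

Run it once on the naive retriever and again after each upgrade, on the same labeled set, so that every added strategy has to earn its latency.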

Naive retrieve — baseline to beat
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    r = client.embeddings.create(model="text-embedding-3-small", input=text)
    return r.data[0].embedding

def naive_retrieve(question: str, store, k: int = 5) -> list[dict]:
    """Single embedding, single search — baseline."""
    q_vec = embed(question)
    return store.similarity_search(q_vec, k=k)

# store = your VectorStore wrapper (Pinecone, pgvector, Chroma, etc.)
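// Java (Spring AI 1.0 style): the VectorStore embeds the query text itself,
// so no separate EmbeddingModel call is needed for a plain similarity search.
@Service
public class NaiveRetriever {
    private final VectorStore vectorStore;

    public NaiveRetriever(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    public List<Document> retrieve(String question, int k) {
        return vectorStore.similaritySearch(
            SearchRequest.builder().query(question).topK(k).build());
    }
}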
⚠️ Pitfall

Blaming the embedding model before checking chunking. A perfect retriever cannot find an answer split across two 256-token chunks with no overlap. Fix chunk boundaries and metadata filters first; then tune retrieval strategies.

🔬 Under the Hood

Bi-encoders are trained on (query, relevant_doc) pairs—often MS MARCO or synthetic QA. At inference, your user question is the “query” tower input, but your indexed chunks are “document” tower inputs. Asymmetric training helps, yet the query distribution at runtime (typos, abbreviations, multilingual mix) still drifts from training.

⚖️ Trade-off

More retrieval strategies mean more LLM calls before the answer call. Each rewriter adds 200–800 ms and input tokens. Treat strategies as a menu, not a checklist—enable per route or per confidence score, not globally by default.

HyDE — Hypothetical Document Embeddings

HyDE (Gao et al., 2022) asks the LLM to write a fake answer passage, embeds that hypothetical document, and searches the index with the hypothetical vector instead of the raw question. The hypothesis reads like corpus text—closing the query/document style gap.

Intuition: vector search compares the question to stored documents. Documents are declarative, third-person, complete sentences. Questions are short, interrogative, and incomplete. HyDE translates question → pseudo-document in the same register as your KB, then embeds the pseudo-document. You never show the hypothesis to the user; it is a retrieval-only artifact.

When HyDE helps vs hurts

Scenario | HyDE | Notes
Paraphrase-heavy support FAQ | Often +10–20% recall@5 | Hypothesis mirrors policy tone
Exact ID / SKU lookup | Can hallucinate wrong IDs | Prefer BM25 or filter metadata
Highly factual regulated content | Risky | Fake doc may invent compliance language
Low-latency chat (< 1 s budget) | Usually skip | Extra LLM + embed round trip

Implementation sketch

  1. Prompt LLM: “Write a short passage that would answer this question, as it might appear in our documentation. Do not invent product names.”
  2. Embed the hypothetical passage (same model as index).
  3. Vector search with hypothetical embedding; optionally also search with original question and merge (multi-vector).
  4. Pass retrieved real chunks to the answer LLM—never the hypothesis.
HyDE retrieve
HYDE_PROMPT = """Write a single paragraph that would appear in internal documentation
and answer the question. Use neutral policy language. Do not invent IDs or numbers.

Question: {question}

Passage:"""

def hyde_retrieve(question: str, store, k: int = 5) -> list[dict]:
    hypo = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        max_tokens=256,
        messages=[{"role": "user", "content": HYDE_PROMPT.format(question=question)}],
    ).choices[0].message.content.strip()

    hypo_vec = embed(hypo)
    hits = store.similarity_search(hypo_vec, k=k)
    for h in hits:
        h["meta"] = {**h.get("meta", {}), "hyde_hypothesis": hypo[:200]}
    return hits
// Assumes HYDE_PROMPT is a Java constant that uses %s for the question.
public List<Document> hydeRetrieve(String question, int k) {
    String hypo = chatClient.prompt()
        .user(HYDE_PROMPT.formatted(question))
        .options(ChatOptions.builder()
            .model("gpt-4o-mini").temperature(0.0).maxTokens(256).build())
        .call()
        .content();

    // The VectorStore embeds the query text, so pass the hypothesis as the query.
    List<Document> hits = vectorStore.similaritySearch(
        SearchRequest.builder().query(hypo).topK(k).build());
    // Guard the preview length: substring(0, 200) throws on shorter hypotheses.
    String preview = hypo.substring(0, Math.min(200, hypo.length()));
    hits.forEach(d -> d.getMetadata().put("hyde_hypothesis", preview));
    return hits;
}
💰 Cost

HyDE adds one chat completion (~100–300 output tokens) plus one embedding call per query. At 10k queries/day on gpt-4o-mini, budget roughly $15–40/month for the rewriter alone—before answer generation. Cache hypotheses for repeated FAQ clicks.

📦 Real World

Teams often run HyDE only when the first-pass naive retrieval score is below a threshold (e.g. top-1 cosine < 0.72). That preserves latency on easy queries while spending LLM tokens on hard ones.
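The gating idea can be sketched as a small router. This is a minimal sketch under stated assumptions: both retrievers are passed in as callables that return hits with a "score" key, the 0.72 threshold is only the example value from the text and has to be calibrated per embedding model, and gated_retrieve is a hypothetical helper name.

```python
from typing import Callable

Hit = dict  # expected shape: {"id": str, "score": float, ...}

def gated_retrieve(
    question: str,
    first_pass: Callable[[str], list[Hit]],
    fallback: Callable[[str], list[Hit]],
    min_top1_score: float = 0.72,
) -> tuple[list[Hit], str]:
    """Run the cheap retriever first; escalate only when its top-1 score is weak.

    Returns the chosen hits plus the route name so it can be logged.
    """
    hits = first_pass(question)
    top1 = hits[0].get("score", 0.0) if hits else 0.0
    if top1 >= min_top1_score:
        return hits, "first_pass"
    # Weak or empty first pass: spend the extra LLM + embedding round trip.
    return fallback(question), "fallback"
```

In practice first_pass would wrap the naive retriever and fallback would wrap HyDE (or multi-query). Logging the returned route name shows what fraction of traffic actually pays for rewriting.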

🔒 Security

The HyDE prompt sees the user question. If the question contains PII, the hypothetical passage may echo it into logs. Redact before HyDE; do not log full hypotheses in production without a data classification review.
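A redaction pass before the rewriter call might look like the sketch below. It is illustrative only: the two regular expressions cover emails and phone-like digit runs, the placeholder tokens are arbitrary, and a production system would more likely rely on a dedicated PII detection service than on hand-written patterns.

```python
import re

# Illustrative patterns only; they will miss many PII formats.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
_PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Replace emails and phone-like numbers before the text reaches an LLM or a log."""
    text = _EMAIL.sub("[EMAIL]", text)
    return _PHONE.sub("[PHONE]", text)
```

Short identifiers such as ORD-1042 are left intact because the phone pattern requires at least nine characters of digits and separators, so exact-token retrieval still works on the redacted question.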

Multi-query — N variants, deduped results

Multi-query retrieval asks the LLM to produce N paraphrases or angle variants of the user question, runs retrieval for each, then deduplicates and merges chunk lists before context assembly. Different wordings surface different nearest neighbors in the index.

A single embedding captures only one phrasing of the question. Multi-query expansion does not decompose a compound question such as “How do I reset SSO and who approves it?”, which blends procedure and ownership (see sub-questions below). It generates alternate phrasings of the same intent to reduce bad luck from a single embedding direction.

Dedup and merge strategies

Strategy | Behavior | When to use
Union by chunk ID | Keep first occurrence; preserve best rank score | Simple; good default
Max score per ID | Across variants, take max similarity per chunk | When scores are comparable across queries
RRF across lists | Reciprocal rank fusion on each variant’s ranking | Many variants (3–5); avoids score scale issues
Cap per variant | Retrieve top 3 per variant, merge to top 10 unique | Control context bloat
Multi-query with dedupe by chunk ID
MULTI_QUERY_PROMPT = """Generate {n} different search queries that would help retrieve
documentation to answer the user question. One per line, no numbering.

Question: {question}"""

def generate_query_variants(question: str, n: int = 3) -> list[str]:
    text = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.3,
        max_tokens=200,
        messages=[{"role": "user", "content": MULTI_QUERY_PROMPT.format(n=n, question=question)}],
    ).choices[0].message.content
    variants = [question] + [ln.strip() for ln in text.splitlines() if ln.strip()]
    return variants[: n + 1]

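def dedupe_by_id(lists: list[list[dict]], top_n: int = 8) -> list[dict]:
    """Union by chunk ID, keeping each chunk's best rank and best-scoring copy."""
    seen: dict[str, dict] = {}
    for results in lists:
        for rank, hit in enumerate(results):
            cid = hit["id"]
            prev = seen.get(cid)
            if prev is None:
                seen[cid] = {**hit, "best_rank": rank}
                continue
            # Track best rank and best score independently: a later list may
            # rank the chunk higher even when its similarity score is lower.
            keep = hit if hit.get("score", 0) > prev.get("score", 0) else prev
            seen[cid] = {**keep, "best_rank": min(prev["best_rank"], rank)}
    ordered = sorted(seen.values(), key=lambda h: (h["best_rank"], -h.get("score", 0)))
    return ordered[:top_n]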

def multi_query_retrieve(question: str, store, k_per_query: int = 4) -> list[dict]:
    variants = generate_query_variants(question)
    lists = [naive_retrieve(q, store, k=k_per_query) for q in variants]
    return dedupe_by_id(lists)
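public List<Document> multiQueryRetrieve(String question, int kPerQuery, int topN) {
    List<String> variants = new ArrayList<>();
    variants.add(question);
    variants.addAll(queryGenerator.generate(question, 3));

    // Keep the best-scoring copy of each chunk across all variants.
    Map<String, Document> best = new LinkedHashMap<>();
    for (String q : variants) {
        for (Document d : naiveRetriever.retrieve(q, kPerQuery)) {
            best.merge(d.getId(), d, (a, b) -> score(a) >= score(b) ? a : b);
        }
    }
    // Sort by score before truncating; insertion order alone is not a ranking.
    return best.values().stream()
        .sorted(Comparator.comparingDouble((Document d) -> score(d)).reversed())
        .limit(topN)
        .toList();
}

// getScore() may be null when the store does not report a similarity score.
private static double score(Document d) {
    return d.getScore() != null ? d.getScore() : 0.0;
}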
💡 Pro Tip

Always include the original question as one of the variants. LLM paraphrases occasionally drift intent; the raw user wording is a free anchor you should not discard.

⚠️ Pitfall

Retrieving top-10 per variant with five variants yields 50 chunks before dedupe, and dedupe can still leave 15+ chunks, which blows the context window. Cap per variant and final K aggressively; let reranking trim (next section).

🎯 Interview Tip

“How would you improve RAG recall?” — Mention measurable baseline, then multi-query or HyDE for paraphrase gaps, hybrid for lexical IDs, reranker for precision, and explicit latency/cost trade-offs. Avoid saying “use a better model” without retrieval mechanics.

Step-back prompting

Step-back retrieval first asks a broader question that captures the underlying principle, retrieves on that, then retrieves on the original specific question. Principle-level chunks anchor context; detail-level chunks answer the user.

Users ask narrow questions; documentation often states rules at a higher level of abstraction. “Can contractor laptops use personal Dropbox?” may not match any chunk that mentions Dropbox, because the acceptable-use policy discusses unsanctioned cloud storage under a general principle. Step-back generates “What is our policy on unsanctioned third-party cloud file storage?”, which retrieves the principle chunk; the original query retrieves exceptions and procedures.

Two-pass retrieval flow

  1. LLM produces a step-back question (more general, same domain).
  2. Retrieve top-K₁ on step-back question.
  3. Retrieve top-K₂ on original question.
  4. Merge lists (dedupe by ID); prefer diverse sources when assembling context.
  5. Single answer LLM call with merged context.
Step-back dual retrieval
STEP_BACK_PROMPT = """You are an expert at generalizing questions to underlying principles.
Given a specific user question, write ONE broader question that would help retrieve
high-level policy or concept documentation. Output only the question.

Specific question: {question}"""

def step_back_question(question: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        max_tokens=120,
        messages=[{"role": "user", "content": STEP_BACK_PROMPT.format(question=question)}],
    ).choices[0].message.content.strip()

def step_back_retrieve(question: str, store) -> list[dict]:
    broad = step_back_question(question)
    principle_hits = naive_retrieve(broad, store, k=3)
    detail_hits = naive_retrieve(question, store, k=5)
    return dedupe_by_id([principle_hits, detail_hits], top_n=8)
// Assumes STEP_BACK_PROMPT is a Java constant that uses %s for the question.
public List<Document> stepBackRetrieve(String question) {
    String broad = chatClient.prompt()
        .user(STEP_BACK_PROMPT.formatted(question))
        .call().content().trim();

    List<Document> principle = naiveRetriever.retrieve(broad, 3);
    List<Document> detail = naiveRetriever.retrieve(question, 5);
    return mergeDedupe(principle, detail, 8);
}
⚖️ Trade-off

Step-back can pull overly general chunks that dilute the context window. Limit K₁ (principle pass) to 2–3 chunks and instruct the answer model to cite the specific passage when both general and specific rules appear.

🔬 Under the Hood

Step-back is related to “query relaxation” in classical IR. The LLM acts as a learned query expander toward hypernyms (specific → general). It differs from HyDE, which fabricates a document rather than a broader question.

Sub-question decomposition

Compound questions contain multiple independent information needs. Decomposition splits them into sub-questions, retrieves for each sub-question, and merges evidence—enabling true multi-hop retrieval in one user turn without an agent loop.

“What’s the retention period for audit logs in the EU region, and who is the DPO contact?” requires GDPR retention docs and a contacts directory. No single embedding aligns with both. Decomposition yields: (1) EU audit log retention period, (2) DPO contact information—each gets its own retrieval pass.

Decomposition vs multi-query

Technique | Output | Best for
Multi-query | Paraphrases of same intent | Lexical / embedding mismatch on one fact
Sub-questions | Distinct sub-intents | Compound “and” / multi-part questions
Step-back | One broader principle question | Specific detail needs general policy context
Decompose → retrieve per sub-question → merge
DECOMPOSE_PROMPT = """Break the user question into independent sub-questions answerable from a knowledge base.
Return JSON: {{"sub_questions": ["...", "..."]}} Max 4 sub-questions.

Question: {question}"""

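import json

def decompose(question: str, max_subs: int = 4) -> list[str]:
    raw = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": DECOMPOSE_PROMPT.format(question=question)}],
    ).choices[0].message.content
    try:
        subs = json.loads(raw).get("sub_questions", [])
    except (json.JSONDecodeError, TypeError, AttributeError):
        subs = []  # malformed or empty model output: fall back to the original question
    if not isinstance(subs, list):
        subs = []
    # Keep non-empty strings only and enforce the cap the prompt asks for.
    subs = [s.strip() for s in subs if isinstance(s, str) and s.strip()][:max_subs]
    return subs or [question]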

def decomposed_retrieve(question: str, store, k_per_sub: int = 4) -> list[dict]:
    subs = decompose(question)
    lists = []
    for sq in subs:
        hits = naive_retrieve(sq, store, k=k_per_sub)
        for h in hits:
            h.setdefault("meta", {})["sub_question"] = sq
        lists.append(hits)
    merged = dedupe_by_id(lists, top_n=10)
    return merged
public List<Document> decomposedRetrieve(String question, int kPerSub) {
    List<String> subs = decomposer.decompose(question);
    if (subs.isEmpty()) subs = List.of(question);

    List<List<Document>> lists = new ArrayList<>();
    for (String sq : subs) {
        List<Document> hits = naiveRetriever.retrieve(sq, kPerSub);
        hits.forEach(d -> d.getMetadata().put("sub_question", sq));
        lists.add(hits);
    }
    return mergeDedupeLists(lists, 10);
}

For answer synthesis, pass sub-question labels in context headers so the LLM maps evidence to parts of the user question, for example [Sub-Q: DPO contact] before each chunk group. This reduces conflation when chunks contradict across regions.
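One way to assemble such labeled context is sketched here. It assumes hits are dicts with a "text" key and an optional meta["sub_question"] label, as in the Python retrieval code in this section; build_context, the "General" fallback label, and the "---" separator are illustrative choices.

```python
def build_context(hits: list[dict]) -> str:
    """Group chunks under a [Sub-Q: ...] header, preserving first-seen order."""
    groups: dict[str, list[str]] = {}
    for hit in hits:
        label = hit.get("meta", {}).get("sub_question", "General")
        groups.setdefault(label, []).append(hit["text"])

    blocks = []
    for label, texts in groups.items():
        # One header per sub-question, chunks separated by a visible divider.
        blocks.append(f"[Sub-Q: {label}]\n" + "\n---\n".join(texts))
    return "\n\n".join(blocks)
```

Note that a dedupe step keeps only one copy of a chunk that served two sub-questions, so its label reflects whichever copy survived. If that matters, collect labels into a list before merging.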

📦 Real World

Enterprise search products often detect compound questions with a lightweight classifier (regex on “ and ”, “?” count, conjunction NLP) before paying for decomposition LLM calls—routing simple questions to naive retrieve only.
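A cheap pre-filter of this kind can be a few lines of regex. The heuristic below is a sketch with assumed rules (two or more question marks, or a conjunction together with two or more question words); it is not a trained classifier and should be checked against your own compound-question examples before it gates anything.

```python
import re

_CONJUNCTION = re.compile(r"\b(and|or|as well as|plus|also)\b", re.IGNORECASE)
_QUESTION_WORD = re.compile(r"\b(who|what|when|where|why|how|which)\b", re.IGNORECASE)

def looks_compound(question: str) -> bool:
    """Heuristic: does the question probably contain more than one information need?"""
    if question.count("?") >= 2:
        return True
    has_conjunction = bool(_CONJUNCTION.search(question))
    question_words = len(_QUESTION_WORD.findall(question))
    return has_conjunction and question_words >= 2
```

Requiring two question words keeps phrases like “terms and conditions” from triggering decomposition on their own.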

⚠️ Pitfall

Over-decomposition on single-intent questions adds noise sub-queries that retrieve irrelevant chunks. Cap at 3–4 sub-questions and fall back to single-query when the model returns one sub-question identical to the original.

Reranking — two-stage top-50 → top-5

First-stage bi-encoder retrieval optimizes recall over millions of chunks in milliseconds. Reranking optimizes precision on a small candidate set using a cross-encoder or vendor rerank API that scores (query, document) pairs jointly. Production pattern: retrieve 50, rerank to 5, then prompt the LLM.

Cross-encoders concatenate query and candidate text through a transformer and emit a relevance score. They see token-level interactions (“cancel” in query aligns with “termination” in doc) that bi-encoders miss. Cost is linear in candidates—hence the two-stage funnel.

Reranker options

Option | Hosting | Latency (50 pairs) | Notes
Cohere Rerank | API (rerank v3) | ~100–300 ms | Strong multilingual; per-search pricing
Jina Reranker | API or self-host | ~80–250 ms | jina-reranker-v2-base-multilingual
BGE reranker | Self-host (FlagEmbedding) | GPU-dependent | bge-reranker-v2-m3; data stays in VPC
ms-marco MiniLM | Self-host CPU | ~200–500 ms | Classic baseline; English-centric
Voyage rerank | API | Similar to Cohere | Pairs with Voyage embeddings

Two-stage pipeline

  1. Retrieve top-N with bi-encoder (N = 30–50; tune on eval set).
  2. Rerank all N pairs; sort by rerank score.
  3. Truncate to top-K for LLM context (K = 5–8).
  4. Log both bi-encoder rank and rerank rank for drift analysis.
Cohere rerank — top-50 → top-5
import cohere

co = cohere.Client()

def rerank_hits(question: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    if not candidates:
        return []
    docs = [c["text"] for c in candidates]
    response = co.rerank(
        model="rerank-v3.5",
        query=question,
        documents=docs,
        top_n=min(top_n, len(docs)),
    )
    out = []
    for r in response.results:
        hit = {**candidates[r.index], "rerank_score": r.relevance_score}
        out.append(hit)
    return out

def two_stage_retrieve(question: str, store, retrieve_n: int = 50, final_k: int = 5) -> list[dict]:
    coarse = naive_retrieve(question, store, k=retrieve_n)
    return rerank_hits(question, coarse, top_n=final_k)
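// Two-stage retrieve with a rerank step. RerankModel and RerankRequest are
// application-defined types in this sketch (for example a thin wrapper over the
// Cohere rerank endpoint); verify what your Spring AI version provides.
public List<Document> twoStageRetrieve(String question, int retrieveN, int finalK) {
    List<Document> coarse = naiveRetriever.retrieve(question, retrieveN);
    RerankRequest req = new RerankRequest(question, coarse, finalK);
    return rerankModel.call(req).getResults();
}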
Local BGE cross-encoder (self-hosted)
from sentence_transformers import CrossEncoder

_reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def bge_rerank(question: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    pairs = [(question, c["text"]) for c in candidates]
    scores = _reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [{**c, "rerank_score": float(s)} for c, s in ranked[:top_n]]
// ONNX or DJL-serving deployment of bge-reranker in VPC
// Java client posts batch of {query, document} pairs to inference endpoint
// and receives float[] scores — same contract as Python CrossEncoder.predict
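Step 4 of the two-stage pipeline (logging both ranks) can be sketched as a small helper that compares the first-stage order with the reranked order. The record fields and the function name are illustrative assumptions; the only requirement is that hits carry a stable "id".

```python
def rank_shift_records(coarse: list[dict], reranked: list[dict]) -> list[dict]:
    """For each reranked hit, record its first-stage rank, rerank rank, and the shift.

    Ranks are 1-based. A positive shift means the reranker promoted the chunk.
    """
    first_stage_rank = {hit["id"]: i for i, hit in enumerate(coarse, start=1)}
    records = []
    for new_rank, hit in enumerate(reranked, start=1):
        old_rank = first_stage_rank.get(hit["id"])
        records.append({
            "id": hit["id"],
            "bi_encoder_rank": old_rank,
            "rerank_rank": new_rank,
            "shift": None if old_rank is None else old_rank - new_rank,
        })
    return records
```

If the final top-K is routinely drawn from first-stage ranks 30 and beyond, the first stage is probably too narrow; if it almost never moves past rank 10, retrieve_n can likely shrink.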
💰 Cost

Cohere Rerank bills per search (query + all documents in one call). Fifty 500-token chunks ≈ 25k rerank tokens. Compare API cost vs GPU amortization for self-hosted BGE at your QPS—break-even often appears around 1–5M rerank calls/month.

💡 Pro Tip

Rerank after hybrid fusion, not twice. Run bi-encoder + BM25 → RRF → single rerank on the merged top-50. Reranking each rewriter variant separately multiplies cost.

🎯 Interview Tip

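Explain the cost asymmetry between bi-encoders and cross-encoders. A bi-encoder embeds the query once and compares it with precomputed document vectors (brute force is O(index size) cheap dot products; ANN indexes are sublinear). A cross-encoder needs one transformer forward pass per (query, document) pair, so scoring the whole index is infeasible, while reranking K candidates costs O(K) passes. The two-stage pattern is the standard answer for “how do you scale semantic search to millions of docs with high precision?”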

Contextual retrieval (Anthropic)

Chunks lose surrounding context when split. Contextual retrieval prepends a short, LLM-generated context blurb to each chunk before embedding at index time, so vectors encode “this passage is about EU GDPR retention in the logging policy” not just “retention is 90 days.”

Anthropic’s published workflow: for each chunk, prompt Claude with the full document (or parent section) and the chunk text; ask for a concise situating prefix (50–100 tokens). Store context_prefix + chunk as the embedded text (and as the BM25-indexed text, if you run sparse search); at query time search as usual. Reported gains on top-20 retrieval failure rate: about 35% fewer failures with contextual embeddings alone, about 49% fewer when combined with contextual BM25, and about 67% fewer with reranking added.

Index-time vs query-time

Approach | When | Cost | Benefit
Contextual prefix (index) | Ingest pipeline | One LLM call per chunk (batchable) | Fixes orphan chunk embeddings permanently
Parent document summary | Ingest | One call per doc + attach as metadata | Cheaper; less granular
HyDE (query) | Each query | Per user question | Fixes query side, not chunk side
Generate contextual prefix at index time
CONTEXTUALIZE_PROMPT = """<document>
{whole_document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Give a short succinct context (50-100 tokens) to situate this chunk in the document
for search retrieval. Answer only with the context, nothing else."""

def contextualize_chunk(whole_document: str, chunk: str) -> str:
    msg = CONTEXTUALIZE_PROMPT.format(whole_document=whole_document[:120_000], chunk=chunk)
    prefix = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        max_tokens=150,
        messages=[{"role": "user", "content": msg}],
    ).choices[0].message.content.strip()
    return f"{prefix}\n\n{chunk}"

def index_with_context(chunks: list[dict], doc_text: str, store) -> None:
    for ch in chunks:
        embedded_text = contextualize_chunk(doc_text, ch["text"])
        vec = embed(embedded_text)
        store.upsert(id=ch["id"], vector=vec, text=ch["text"], meta={
            **ch.get("meta", {}),
            "embedded_text": embedded_text[:500],  # debug only; redact in prod logs
        })
// Assumes CONTEXTUALIZE_PROMPT is a Java constant with two %s placeholders.
public String contextualizeChunk(String wholeDocument, String chunk) {
    String prompt = CONTEXTUALIZE_PROMPT.formatted(
        truncate(wholeDocument, 120_000), chunk);
    String prefix = chatClient.prompt().user(prompt)
        .options(ChatOptions.builder().temperature(0.0).maxTokens(150).build())
        .call().content().trim();
    return prefix + "\n\n" + chunk;
}

public void indexWithContext(List<Chunk> chunks, String docText) {
    for (Chunk ch : chunks) {
        String embeddedText = contextualizeChunk(docText, ch.getText());
        // VectorStore.add embeds the Document text, so store the contextualized
        // text as content and keep the raw chunk in metadata for prompting.
        Map<String, Object> meta = new HashMap<>(ch.getMeta());
        meta.put("raw_text", ch.getText());
        vectorStore.add(List.of(new Document(ch.getId(), embeddedText, meta)));
    }
}

Re-index when contextualization prompt or model version changes—vectors are not backward compatible. Version the index (index_version: contextual-v2-gpt4o-mini) and dual-write during migrations.

🔬 Under the Hood

Contextual retrieval fixes the document tower of asymmetric retrieval. Query rewriting fixes the query tower. Production stacks often use both: contextual prefixes at ingest, light multi-query or rerank at query time.

⚖️ Trade-off

One LLM call per chunk at ingest is expensive for 10M chunks. Use parent-section context instead of full document for long PDFs, batch prompts, and run contextualization only on high-traffic collections first.

🔒 Security

Contextualization prompts include full document text. Route through the same data-residency and PII scrubbing pipeline as embedding; do not send restricted annexes to external APIs without legal review.

Sparse retrieval — BM25 & TF-IDF

Sparse methods represent text as high-dimensional vectors with mostly zero weights—classic bag-of-words with term frequency and inverse document frequency. BM25 is the production default for lexical search; dense embeddings are the default for semantic search. Serious RAG runs both.

BM25 scores documents by weighted token overlap with the query, with length normalization so long boilerplate chunks do not dominate. It excels at SKUs, error codes, legal citations, person names, and rare technical tokens that embedding models under-weight. It fails on paraphrase and cross-language semantic match—complement, not replacement, for dense retrieval.
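To make the scoring idea concrete, here is a minimal pure-Python BM25 sketch. It assumes pre-tokenized input and uses the Lucene-style non-negative IDF, ln(1 + (N - df + 0.5) / (df + 0.5)), so its numbers will not match rank_bm25's BM25Okapi exactly; k1 and b are the usual free parameters.

```python
import math
from collections import Counter

def bm25_scores(
    query_tokens: list[str],
    corpus_tokens: list[list[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> list[float]:
    """Score every document in corpus_tokens against query_tokens with BM25."""
    if not corpus_tokens:
        return []
    n_docs = len(corpus_tokens)
    avg_len = (sum(len(doc) for doc in corpus_tokens) / n_docs) or 1.0

    # Document frequency: in how many documents does each term occur?
    df: Counter = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        # Length normalization: longer-than-average docs are penalized via b.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Saturating term frequency: repeated terms add less and less.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores
```

A rare token such as an order ID gets a high IDF, which is why a chunk containing it outranks chunks that only share common words with the query.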

BM25 vs TF-IDF

Method | Scoring idea | Typical use
TF-IDF | Term freq × log(N/df) | Teaching baseline; sklearn prototypes
BM25Okapi | Saturated TF + length norm + IDF | Production sparse retrieval default
Elasticsearch BM25 | Same family, distributed inverted index | 10M+ docs, analyzers, facets

Managed vs in-process

OpenSearch / Elasticsearch: inverted index at scale, analyzers (stemming, stopwords, synonyms), hybrid kNN + BM25 in one cluster. rank_bm25 in Python: fine for <500k chunks in memory on a single pod. Pick managed search when you need ACL filters, aggregations, and ops-owned scaling.

In-process BM25 with rank_bm25
from rank_bm25 import BM25Okapi

class Bm25Index:
    def __init__(self, chunks: list[dict]):
        self.chunks = chunks
        corpus = [c["text"].lower().split() for c in chunks]
        self.bm25 = BM25Okapi(corpus)

    def search(self, query: str, k: int = 10) -> list[dict]:
        tokens = query.lower().split()
        scores = self.bm25.get_scores(tokens)
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        return [
            {"id": self.chunks[i]["id"], "text": self.chunks[i]["text"],
             "score": float(scores[i]), "channel": "bm25"}
            for i in ranked if scores[i] > 0
        ]
// Elasticsearch Java API Client — BM25 query
SearchResponse<ChunkDoc> response = esClient.search(s -> s
    .index("kb-chunks")
    .query(q -> q.match(m -> m.field("text").query(question)))
    .size(k),
    ChunkDoc.class);

List<Document> hits = response.hits().hits().stream()
    .map(h -> toDocument(h))
    .toList();
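OpenSearch hybrid query (kNN + BM25)
# OpenSearch 2.x hybrid query: knn and match sub-queries in one request. It is meant to run
# through a search pipeline that normalizes and combines the sub-query scores; RRF can also be done in app code.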
hybrid_query = {
    "query": {
        "hybrid": {
            "queries": [
                {"match": {"text": {"query": question}}},
                {"knn": {"embedding": {"vector": q_vec, "k": 50}}},
            ]
        }
    }
}
# Many teams still run two queries + RRF in application code for tunability
// Parallel dense and sparse branches, merged with RrfMerger.
// openSearchClient.bm25Search is an application-level wrapper, not an OpenSearch client method.
CompletableFuture<List<Document>> dense = CompletableFuture.supplyAsync(
    () -> vectorStore.similaritySearch(
        SearchRequest.builder().query(question).topK(50).build()));
CompletableFuture<List<Document>> sparse = CompletableFuture.supplyAsync(
    () -> openSearchClient.bm25Search(question, 50));
List<Document> fused = rrfMerger.merge(List.of(dense.join(), sparse.join()), 10);
📦 Real World

Support KBs with ticket IDs and policy numbers almost always ship hybrid. Internal wiki-only RAG may stay dense-only until evals show ORD-/SKU queries failing—then add BM25 on the same chunk IDs without re-chunking.

⚠️ Pitfall

Tokenizing BM25 with naive .split() breaks on code snippets, emails, and hyphenated product names. Mirror Elasticsearch analyzers in offline evals, or use the same analyzer at index and query time.
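A slightly more careful tokenizer than .split() can be sketched with one regular expression. The pattern below is an assumption about what matters in a support corpus (it keeps hyphenated IDs, dotted versions, and emails as single tokens and strips surrounding punctuation); it is not an Elasticsearch analyzer and does no stemming.

```python
import re

# Alphanumeric runs, optionally joined by -, _, . or @ (ORD-1042, v2.1, a@b.com).
_TOKEN = re.compile(r"[a-z0-9]+(?:[-_.@][a-z0-9]+)*")

def tokenize(text: str) -> list[str]:
    """Lowercase and split into tokens, keeping IDs and emails intact."""
    return _TOKEN.findall(text.lower())
```

With naive splitting, the query “Status for ORD-1042?” produces the token “ord-1042?”, which never matches the indexed “ord-1042”. Whatever tokenizer you choose, apply the same function when building the index and when tokenizing the query.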

Hybrid fusion — reciprocal rank fusion (RRF)

Dense and sparse retrievers return scores on incomparable scales—cosine similarity vs BM25 raw scores. Reciprocal Rank Fusion merges ranked lists using only ranks, not raw scores: robust, parameter-light, and widely used in hybrid RAG.

RRF score formula

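For document d, given ranked lists from retrievers 1…M, let $\mathrm{rank}_i(d)$ be the 1-based rank of d in list i (a list that does not contain d contributes 0, i.e. rank = ∞):

$$\mathrm{RRF}(d) = \sum_{i=1}^{M} \frac{1}{k + \mathrm{rank}_i(d)}$$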

The constant k (typically 60, from the original RRF paper) dampens top-1 obsession. With k = 60, a document ranked 5th in both BM25 and dense scores 2/65 ≈ 0.0308 and beats a document ranked 1st in BM25 but 40th in dense (1/61 + 1/100 ≈ 0.0264). Documents that appear in multiple lists accumulate mass.

Parameter | Typical value | Effect
k | 60 | Higher → flatter fusion; lower → top ranks dominate
Lists merged | dense + BM25 (+ optional multi-query lists) | More lists → more recall; tune k
top_n after fusion | 10–50 before rerank | Feed reranker a wide funnel
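A quick arithmetic check shows how k shapes the fusion. The sketch compares a consensus document (ranked 5th in two lists) with a one-list winner (ranked 1st and 40th) at a very small k and at the usual k = 60; the helper name rrf is illustrative.

```python
def rrf(ranks: list[int], k: int = 60) -> float:
    """RRF score for one document, given its 1-based rank in each list."""
    return sum(1.0 / (k + r) for r in ranks)

consensus = [5, 5]    # 5th in both BM25 and dense
one_sided = [1, 40]   # 1st in BM25, 40th in dense

for k in (1, 60):
    print(k, round(rrf(consensus, k), 4), round(rrf(one_sided, k), 4))
# prints:
# 1 0.3333 0.5244
# 60 0.0308 0.0264
```

At k = 1 the single first place dominates; at k = 60 the document that both retrievers agree on wins, which is the behavior the table describes as flatter fusion.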
RRF merge — full Python implementation
from collections import defaultdict

def rrf_score(rank: int, k: int = 60) -> float:
    """rank is 1-based (first result = 1)."""
    return 1.0 / (k + rank)

def rrf_merge(lists: list[list[dict]], k: int = 60, top_n: int = 10) -> list[dict]:
    """
    lists: each inner list is ordered best-first with unique chunk 'id' keys.
    Returns merged list sorted by RRF score descending.
    """
    scores: dict[str, float] = defaultdict(float)
    payload: dict[str, dict] = {}

    for results in lists:
        for rank_0, hit in enumerate(results):
            cid = hit["id"]
            scores[cid] += rrf_score(rank_0 + 1, k=k)
            # merge payloads; on conflicting fields the earlier list wins
            # (dense, when called as rrf_merge([dense_hits, sparse_hits]))
            payload[cid] = {**hit, **payload.get(cid, {})}

    ordered_ids = sorted(scores.keys(), key=lambda c: scores[c], reverse=True)[:top_n]
    return [{**payload[cid], "rrf_score": scores[cid]} for cid in ordered_ids]

def hybrid_retrieve(question: str, dense_store, bm25: Bm25Index, k_each: int = 25, final_n: int = 10) -> list[dict]:
    dense_hits = naive_retrieve(question, dense_store, k=k_each)
    sparse_hits = bm25.search(question, k=k_each)
    return rrf_merge([dense_hits, sparse_hits], k=60, top_n=final_n)
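
// Java equivalent; assumes Spring AI's Document type
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.springframework.ai.document.Document;

public final class RrfMerger {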
    private final int k;

    public RrfMerger(int k) { this.k = k; }

    private double rrfScore(int rankOneBased) {
        return 1.0 / (k + rankOneBased);
    }

    public List<Document> merge(List<List<Document>> lists, int topN) {
        Map<String, Double> scores = new HashMap<>();
        Map<String, Document> payload = new HashMap<>();
        for (List<Document> results : lists) {
            for (int i = 0; i < results.size(); i++) {
                Document d = results.get(i);
                scores.merge(d.getId(), rrfScore(i + 1), Double::sum);
                // keep the first-seen copy; dense and BM25 scores are on different
                // scales, so comparing getScore() across lists is not meaningful
                payload.putIfAbsent(d.getId(), d);
            }
        }
        return scores.entrySet().stream()
            .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
            .limit(topN)
            .map(e -> {
                Document doc = payload.get(e.getKey());
                doc.getMetadata().put("rrf_score", e.getValue());
                return doc;
            })
            .toList();
    }
}

End-to-end hybrid + rerank stack

  1. Dense top-25 + BM25 top-25 in parallel.
  2. RRF → top-15 unique chunks.
  3. Cross-encoder / Cohere rerank → top-5 for LLM.
  4. Log channel attribution (dense-only, sparse-only, both) for eval dashboards.
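Steps 1 to 3 above can be sketched as follows. This is a minimal illustration under stated assumptions: `dense_search`, `sparse_search`, and `rerank` are hypothetical callables you would supply (the stubs at the bottom only stand in for real retrievers), and channel attribution logging from step 4 is left out.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Hit = dict  # each hit is assumed to carry at least an "id" key

def rrf_fuse(lists: list[list[Hit]], k: int = 60, top_n: int = 15) -> list[Hit]:
    """Rank-only fusion: sum 1 / (k + rank) per chunk id across all lists."""
    scores: dict[str, float] = {}
    first_seen: dict[str, Hit] = {}
    for results in lists:
        for rank, hit in enumerate(results, start=1):
            scores[hit["id"]] = scores.get(hit["id"], 0.0) + 1.0 / (k + rank)
            first_seen.setdefault(hit["id"], hit)
    ordered = sorted(scores, key=lambda cid: scores[cid], reverse=True)[:top_n]
    return [{**first_seen[cid], "rrf_score": scores[cid]} for cid in ordered]

def hybrid_then_rerank(
    question: str,
    dense_search: Callable[[str, int], list[Hit]],
    sparse_search: Callable[[str, int], list[Hit]],
    rerank: Callable[[str, list[Hit]], list[Hit]],
    k_each: int = 25,
    fuse_n: int = 15,
    final_n: int = 5,
) -> list[Hit]:
    # Step 1: run both searches concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        dense_future = pool.submit(dense_search, question, k_each)
        sparse_future = pool.submit(sparse_search, question, k_each)
        dense_hits, sparse_hits = dense_future.result(), sparse_future.result()
    # Step 2: fuse by rank and keep a wide funnel of unique chunks.
    fused = rrf_fuse([dense_hits, sparse_hits], top_n=fuse_n)
    # Step 3: rerank the funnel and keep only what the LLM will see.
    return rerank(question, fused)[:final_n]

# Stub retrievers and reranker, for illustration only.
def fake_dense(q: str, k: int) -> list[Hit]:
    return [{"id": "c1"}, {"id": "c2"}, {"id": "c3"}][:k]

def fake_sparse(q: str, k: int) -> list[Hit]:
    return [{"id": "c3"}, {"id": "c4"}][:k]

def fake_rerank(q: str, hits: list[Hit]) -> list[Hit]:
    return sorted(hits, key=lambda h: h["id"])  # placeholder ordering

top = hybrid_then_rerank("refund policy", fake_dense, fake_sparse, fake_rerank, final_n=2)
```

A thread pool suits this step because both searches are typically I/O-bound network calls; with an async client you would gather the two coroutines instead.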
🔬 Under the Hood

RRF is rank-based, so no score normalization is required. Alternatives include a weighted linear combination after min-max scaling (fragile when score distributions shift) and learned fusion with learning-to-rank (LTR) models, which carries a heavier operational burden. RRF is a common default for hybrid merge and ships as a built-in option in engines such as Elasticsearch and Azure AI Search.
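The fragility of min-max weighted fusion is easy to reproduce. The sketch below is illustrative only: the function names, the equal weighting `alpha = 0.5`, and all scores are made-up assumptions. It shows how one outlier BM25 score reorders two documents whose own scores never changed.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Scale scores to [0, 1] within a single result list."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def weighted_fusion(dense: dict[str, float], sparse: dict[str, float],
                    alpha: float = 0.5) -> list[str]:
    """Linear combination of min-max scaled scores; missing docs count as 0."""
    d, s = min_max(dense), min_max(sparse)
    fused = {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
             for doc in set(d) | set(s)}
    return sorted(fused, key=lambda doc: fused[doc], reverse=True)

dense = {"a": 0.80, "b": 0.78, "c": 0.70}      # cosine similarities
sparse = {"a": 8.0, "b": 12.0, "c": 6.0}       # raw BM25 scores

before = weighted_fusion(dense, sparse)        # ['b', 'a', 'c']

# One extra document with a very high BM25 score compresses every other
# sparse score toward 0 after scaling, so "a" now outranks "b".
sparse_with_outlier = {**sparse, "z": 60.0}
after = weighted_fusion(dense, sparse_with_outlier)  # ['a', 'z', 'b', 'c']
```

Because the scaled value of every document depends on the list's minimum and maximum, the fused order is sensitive to whichever extreme scores happen to be in the result set for that query.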

💡 Pro Tip

Tag each hit with retrieval_channels: ["dense","bm25"] before fusion. When debugging wrong answers, you instantly see whether the failure was semantic, lexical, or fusion.
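One way to do this, sketched under the assumption that every hit is a dict with an `id` key: build a chunk-to-channels map from the per-channel lists before fusing, then attach it to the fused hits. The function names here are hypothetical.

```python
def tag_channels(channel_hits: dict[str, list[dict]]) -> dict[str, list[str]]:
    """Map each chunk id to the sorted list of channels that returned it."""
    seen: dict[str, set[str]] = {}
    for channel, hits in channel_hits.items():
        for hit in hits:
            seen.setdefault(hit["id"], set()).add(channel)
    return {cid: sorted(channels) for cid, channels in seen.items()}

def annotate(fused: list[dict], channels: dict[str, list[str]]) -> list[dict]:
    """Return copies of the fused hits with a retrieval_channels field added."""
    return [{**hit, "retrieval_channels": channels.get(hit["id"], [])} for hit in fused]

channels = tag_channels({
    "dense": [{"id": "c1"}, {"id": "c2"}],
    "bm25": [{"id": "c2"}, {"id": "c3"}],
})
annotated = annotate([{"id": "c2"}, {"id": "c1"}, {"id": "c3"}], channels)
# c2 was found by both channels, c1 by dense only, c3 by bm25 only.
```

Keeping the map separate from the hit payloads means the fusion step does not need to know how to merge list-valued fields.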

💰 Cost

Hybrid adds a second index (an inverted index alongside the vectors) and a second search per query. The dominant cost is often still the answer LLM rather than retrieval, unless you run five query rewrites on every turn. Right-size k_each before adding more lists to RRF.

What this looks like in production

Retrieval strategy is a latency and cost budget problem. The winning stack is not “enable everything”—it is a routed pipeline with metrics, fallbacks, and clear rules for when to skip rewriting.

Latency budget (example enterprise chat)

| Stage | p95 target | Skip when |
| --- | --- | --- |
| Query rewrite (HyDE / multi-query) | 300–800 ms | Naive top-1 score > threshold; cached FAQ |
| Dense + sparse parallel | 50–150 ms | Never (core path) |
| RRF merge | < 5 ms | Never (in-process) |
| Rerank top-50 → 5 | 100–400 ms | Low-stakes internal bot; offline batch |
| Answer LLM | 1–4 s | N/A |
| Total retrieval | < 1.2 s p95 | Route simple queries to fast path |

When to skip query rewriting

  • High-confidence retrieval — top-1 cosine ≥ tuned threshold on held-out evals.
  • Metadata-filtered lookup — user picked product + region; filter narrows to <100 chunks.
  • Exact keyword routes — regex detects ticket ID; go BM25-only.
  • Latency SLO < 2 s end-to-end — rewriting + rerank + answer may exceed budget.
  • Cached questions — same FAQ click within TTL; reuse chunk IDs.
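The rules above can be collapsed into one routing function. The following is a minimal sketch, not a definitive policy: the ID regex, the 0.82 threshold, the 800 ms rewrite cost, and all names are hypothetical placeholders that you would tune against your own evals.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical ticket/order ID shape, e.g. ORD-1042.
TICKET_ID = re.compile(r"\b[A-Z]{2,5}-\d{3,6}\b")

@dataclass
class RouteDecision:
    skip_rewrite: bool
    reason: str

def route(
    question: str,
    top1_cosine: float,
    *,
    threshold: float = 0.82,              # assumed; tune on held-out evals
    filtered_chunk_count: Optional[int] = None,
    cache_hit: bool = False,
    remaining_budget_ms: int = 2000,
    rewrite_cost_ms: int = 800,           # assumed p95 cost of a rewrite call
) -> RouteDecision:
    if cache_hit:
        return RouteDecision(True, "cached")
    if TICKET_ID.search(question):
        return RouteDecision(True, "exact_id_bm25_only")
    if filtered_chunk_count is not None and filtered_chunk_count < 100:
        return RouteDecision(True, "metadata_filtered")
    if top1_cosine >= threshold:
        return RouteDecision(True, "high_confidence")
    if remaining_budget_ms < rewrite_cost_ms:
        return RouteDecision(True, "latency_budget")
    return RouteDecision(False, "rewrite")

print(route("Where is ORD-1042?", top1_cosine=0.10))
print(route("how do refunds work", top1_cosine=0.30))
```

Returning a reason string alongside the boolean makes the decision loggable, so you can later measure how often each skip rule fires.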

When to enable each strategy

| Strategy | Enable when eval shows… |
| --- | --- |
| HyDE / multi-query | Paraphrase bucket recall@5 < 70% |
| Step-back | Policy “principle” questions miss general section |
| Decomposition | Compound questions > 15% of traffic |
| Contextual retrieval | Small orphan chunks dominate failures |
| Hybrid BM25 | ID/code/SKU queries fail dense-only |
| Rerank | Right chunk in top-50 but not top-5 |
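These triggers presuppose per-bucket metrics. A minimal sketch of how recall@5 per bucket and reciprocal rank might be computed from a labelled eval set follows; the record layout (`bucket`, `retrieved`, `relevant`) and the sample data are assumptions for illustration.

```python
from collections import defaultdict

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant chunk ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0

def per_bucket_recall(examples: list[dict], k: int = 5) -> dict[str, float]:
    """Mean recall@k for each question bucket in the eval set."""
    by_bucket: dict[str, list[float]] = defaultdict(list)
    for ex in examples:
        by_bucket[ex["bucket"]].append(recall_at_k(ex["retrieved"], ex["relevant"], k))
    return {bucket: sum(vals) / len(vals) for bucket, vals in by_bucket.items()}

# Made-up eval records: one paraphrase question misses its chunk in the top 5.
examples = [
    {"bucket": "paraphrase", "retrieved": ["c9", "c2", "c3", "c4", "c5", "c1"], "relevant": {"c1"}},
    {"bucket": "paraphrase", "retrieved": ["c1", "c2"], "relevant": {"c1"}},
    {"bucket": "id_lookup", "retrieved": ["c7"], "relevant": {"c7"}},
]
recalls = per_bucket_recall(examples)
needs_rewriting = [b for b, r in recalls.items() if r < 0.70]
```

Here the paraphrase bucket averages 0.5, below the 70% line from the table, so it is the one flagged for HyDE or multi-query experiments.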

Observability fields (every query)

  • retrieval_strategy_path — e.g. fast | hyde+hybrid+rerank
  • chunk_ids, ranks per channel, rrf_score, rerank_score
  • Rewrite prompt version; hypothesis length (HyDE)
  • retrieve_ms, rerank_ms, rewrite_ms, answer_ms
  • user_feedback / thumbs — link to retrieved set for failure mining
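A per-query log record covering these fields could look like the sketch below. The class name, field defaults, and the `timed` helper are assumptions; in practice you would emit the resulting JSON to whatever logging or tracing backend you already use.

```python
import json
import time
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field
from typing import Optional

@dataclass
class RetrievalLog:
    retrieval_strategy_path: str = "fast"
    chunk_ids: list[str] = field(default_factory=list)
    rewrite_prompt_version: Optional[str] = None
    user_feedback: Optional[str] = None
    timings_ms: dict[str, float] = field(default_factory=dict)

    @contextmanager
    def timed(self, stage: str):
        """Record wall-clock time for a stage under '<stage>_ms'."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings_ms[f"{stage}_ms"] = (time.perf_counter() - start) * 1000.0

log = RetrievalLog(retrieval_strategy_path="hybrid+rerank")
with log.timed("retrieve"):
    log.chunk_ids = ["c3", "c1"]       # stand-in for the real retrieval call
with log.timed("rerank"):
    pass                               # stand-in for the real rerank call

record = json.dumps(asdict(log))       # one JSON line per query
```

Timing in a `finally` block means a stage that raises still records how long it ran, which helps when diagnosing timeouts.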
Routed retrieval orchestrator
def retrieve_for_production(question: str, store, bm25: Bm25Index, cfg) -> tuple[list[dict], dict]:
    meta = {"path": "hybrid_only"}
    # Retrieve each channel separately so the confidence check can use the dense
    # cosine score; fused hits mix cosine and BM25 scores, which are not comparable.
    dense_hits = naive_retrieve(question, store, k=25)
    sparse_hits = bm25.search(question, k=25)
    coarse = rrf_merge([dense_hits, sparse_hits], k=60, top_n=50)

    # Assumes dense hits carry their cosine similarity under "score".
    top1_cosine = dense_hits[0].get("score", 0.0) if dense_hits else 0.0
    if top1_cosine < cfg.confidence_skip_rewrite and cfg.use_multi_query:
        # Low confidence: widen recall with rewritten queries.
        coarse = multi_query_retrieve(question, store)
        meta["path"] = "multi_query+hybrid"

    if cfg.use_rerank:
        final = rerank_hits(question, coarse, top_n=5)
        meta["path"] += "+rerank"
    else:
        final = coarse[:5]

    return final, meta
public RetrievalResult retrieveForProduction(String question, RetrievalConfig cfg) {
    RetrievalMeta meta = new RetrievalMeta();
    List<Document> coarse = hybridRetriever.retrieve(question, 50);

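    // Assumes getScore() on the top hit is the dense cosine score; RRF or BM25
    // scores are on a different scale and should not be compared to this threshold.
    Double topScore = coarse.isEmpty() ? null : coarse.get(0).getScore();
    if (topScore != null && topScore >= cfg.getConfidenceSkipRewrite()) {
        meta.setPath("hybrid_only");
        return new RetrievalResult(
            cfg.isUseRerank()
                ? reranker.rerank(question, coarse, 5)
                : coarse.subList(0, Math.min(5, coarse.size())),
            meta);
    }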
    if (cfg.isUseMultiQuery()) {
        coarse = multiQueryRetriever.retrieve(question);
        meta.setPath("multi_query+hybrid");
    }
    List<Document> finalList = cfg.isUseRerank()
        ? reranker.rerank(question, coarse, 5)
        : coarse.subList(0, Math.min(5, coarse.size()));
    return new RetrievalResult(finalList, meta);
}

Production readiness checklist

  • Labelled eval set with per-strategy recall@5 and MRR
  • Fast path + slow path with explicit routing rules
  • Same chunk IDs across dense, sparse, and citation UI
  • Index versioning for contextual embedding migrations
  • Rerank and rewrite behind feature flags per tenant
  • Retrieval logs joined to answer quality (thumbs, CSAT)
  • Load test: p95 retrieve+rerank under 2× expected QPS
📦 Real World

Perplexity-style products iterate retrieval in offline replay before online A/B. Ship one strategy at a time—hybrid first, rerank second, query rewriting last—so regressions are attributable.

🎯 Interview Tip

“Design RAG retrieval for a legal KB.” Open with chunking + hybrid + rerank, then cover contextual indexing for citations, a latency budget with a fast path, and eval metrics. Close by noting that raw user queries should not reach a rewrite LLM without a logging and redaction policy.

💡 Pro Tip

Store the retrieval snapshot (chunk IDs + scores) on each answer record. When users report wrong answers six weeks later, you can replay with new strategies without reproducing the session.
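A snapshot can be as small as a JSON blob stored next to the answer. The sketch below assumes a hypothetical record layout and a `new_retrieve` callable that wraps whichever strategy you want to replay; how the question text is stored is subject to your own redaction policy.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable

@dataclass(frozen=True)
class RetrievalSnapshot:
    answer_id: str
    question: str
    index_version: str
    strategy_path: str
    hits: list  # e.g. [{"id": "c1", "rrf_score": 0.03}, ...]

def save_snapshot(snap: RetrievalSnapshot) -> str:
    """Serialize to JSON for storage alongside the answer record."""
    return json.dumps(asdict(snap), sort_keys=True)

def load_snapshot(raw: str) -> RetrievalSnapshot:
    return RetrievalSnapshot(**json.loads(raw))

def replay_overlap(snap: RetrievalSnapshot,
                   new_retrieve: Callable[[str], list[dict]],
                   k: int = 5) -> float:
    """Fraction of the originally served top-k chunk ids a new strategy still returns."""
    old_ids = [h["id"] for h in snap.hits[:k]]
    if not old_ids:
        return 0.0
    new_ids = {h["id"] for h in new_retrieve(snap.question)[:k]}
    return sum(1 for cid in old_ids if cid in new_ids) / len(old_ids)

snap = RetrievalSnapshot(
    answer_id="ans-1",
    question="Can I cancel my yearly plan mid-cycle?",
    index_version="v3",
    strategy_path="hybrid+rerank",
    hits=[{"id": "c1", "rrf_score": 0.03}, {"id": "c2", "rrf_score": 0.02}],
)
restored = load_snapshot(save_snapshot(snap))
overlap = replay_overlap(restored, lambda q: [{"id": "c2"}, {"id": "c9"}])
```

Recording the index version with each snapshot lets you tell apart differences caused by a new strategy from differences caused by re-indexing.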