Retrieval strategies & query rewriting

Why naive retrieval breaks

The default RAG path—embed the question, cosine-search the index, stuff top-K chunks into the prompt—is correct as a baseline. It is not sufficient for production Q&A over heterogeneous corpora. Failures cluster into predictable patterns you can measure before adding complexity.

A bi-encoder maps query and document into the same vector space independently. Similarity is cheap and scalable, but the query vector is built from how the user asked, not from how the answer appears in the corpus. Support tickets use informal language; policy PDFs use legal phrasing. The embedding model may bridge that gap—or may not, especially on domain jargon, rare acronyms, or questions that never appeared in training data.

Paraphrase and vocabulary mismatch

Consider a user who asks: “Can I get money back if I cancel my yearly plan mid-cycle?” The refund policy chunk says: “Annual subscriptions are eligible for pro-rata credit upon early termination.” Few tokens overlap, yet a strong embedding model often still retrieves the right chunk. But swap in a product-specific term (“FlexAnnual tier” in the doc vs “the flexible annual billing option” in the query) and recall can drop sharply. This is the lexical gap: dense vectors handle semantic similarity well, but they are weaker when the match depends on exact product names, error codes, or SKU strings that are nearly unique in the corpus.

Failure mode	Example	Why one embedding struggles	Typical fix (later sections)
Paraphrase	“cancel yearly” vs “early termination of annual subscription”	Query and doc live in different wording regions of vector space	HyDE, multi-query, hybrid BM25
Multi-hop	“Does our EU DPA cover subprocessors used by the analytics integration?”	Answer requires chunk A (DPA scope) + chunk B (analytics vendor list)	Sub-question decomposition, multi-query
Abstract → concrete	“What’s our stance on shadow IT?”	Policy uses “unsanctioned SaaS” not “shadow IT”	Step-back prompting, HyDE
Exact token	“Status for ORD-1042”	Embedding treats ID as low-signal noise	Sparse BM25, hybrid RRF
Underspecified chunk	Chunk: “The limit is 500 requests per minute.”	No surrounding context—which API? which tier?	Contextual retrieval at index time

Multi-hop reasoning without multi-hop retrieval

Large language models can synthesize across multiple facts once those facts are in context. Naive retrieval assumes a single hop: one query vector, one ranked list. Multi-hop questions need either (a) multiple retrieval passes with different sub-queries, (b) a graph or agent loop that retrieves, reads, and retrieves again, or (c) larger chunks that accidentally contain both facts. Option (c) scales poorly; option (b) is Track 3 (agents). This guide focuses on (a): rewriting and decomposing the query before the answer LLM call.

Measure the baseline before upgrading. On a labeled set of 50–200 questions, log recall@K (is the gold chunk in top K?) and MRR (mean reciprocal rank). If recall@5 is already 90%, query rewriting adds latency and cost for marginal gain. If recall@5 is 55% on paraphrase-heavy tickets, HyDE or multi-query is justified.
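To make the two metrics concrete, here is a minimal sketch of a baseline evaluation. It assumes each labeled question has been run through the retriever and logged as a list of ranked chunk IDs plus a set of gold chunk IDs; the field names `ranked_ids` and `gold_ids` and the sample IDs are illustrative, not part of any library.

```python
from typing import Sequence


def recall_at_k(ranked_ids: Sequence[str], gold_ids: set[str], k: int) -> float:
    """1.0 if any gold chunk appears in the top-k results, else 0.0."""
    return 1.0 if any(cid in gold_ids for cid in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids: Sequence[str], gold_ids: set[str]) -> float:
    """1 / rank of the first gold chunk (1-based); 0.0 if none was retrieved."""
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in gold_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs: list[dict], k: int = 5) -> dict[str, float]:
    """Average both metrics over all labeled questions."""
    n = len(runs)
    if n == 0:
        return {"recall@k": 0.0, "mrr": 0.0}
    recall = sum(recall_at_k(r["ranked_ids"], r["gold_ids"], k) for r in runs) / n
    mrr = sum(reciprocal_rank(r["ranked_ids"], r["gold_ids"]) for r in runs) / n
    return {"recall@k": recall, "mrr": mrr}


# Hypothetical labeled runs (IDs are made up for illustration).
runs = [
    {"ranked_ids": ["c7", "c2", "c9"], "gold_ids": {"c2"}},  # gold chunk at rank 2
    {"ranked_ids": ["c1", "c4", "c5"], "gold_ids": {"c8"}},  # gold chunk missed
]
metrics = evaluate(runs, k=5)
print(metrics)  # {'recall@k': 0.5, 'mrr': 0.25}
```

Running the same function on the same labeled set before and after each strategy change gives a like-for-like comparison; bucket the questions (paraphrase, exact ID, multi-part) if you want to see which failure mode a strategy actually moves.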

from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    r = client.embeddings.create(model="text-embedding-3-small", input=text)
    return r.data[0].embedding

def naive_retrieve(question: str, store, k: int = 5) -> list[dict]:
    """Single embedding, single search — baseline."""
    q_vec = embed(question)
    return store.similarity_search(q_vec, k=k)

# store = your VectorStore wrapper (Pinecone, pgvector, Chroma, etc.)

@Service
public class NaiveRetriever {
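    private final EmbeddingModel embeddingModel;
    private final VectorStore vectorStore;

    // Constructor injection: the final fields must be assigned for the class to compile.
    public NaiveRetriever(EmbeddingModel embeddingModel, VectorStore vectorStore) {
        this.embeddingModel = embeddingModel;
        this.vectorStore = vectorStore;
    }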

    public List<Document> retrieve(String question, int k) {
        float[] qVec = embeddingModel.embed(question);
        return vectorStore.similaritySearch(
            SearchRequest.query(qVec).withTopK(k));
    }
}

⚠️ Pitfall

Blaming the embedding model before checking chunking. A perfect retriever cannot find an answer split across two 256-token chunks with no overlap. Fix chunk boundaries and metadata filters first; then tune retrieval strategies.

🔬 Under the Hood

Bi-encoders are trained on (query, relevant_doc) pairs—often MS MARCO or synthetic QA. At inference, your user question is the “query” tower input, but your indexed chunks are “document” tower inputs. Asymmetric training helps, yet the query distribution at runtime (typos, abbreviations, multilingual mix) still drifts from training.

⚖️ Trade-off

More retrieval strategies mean more LLM calls before the answer call. Each rewriter adds 200–800 ms and input tokens. Treat strategies as a menu, not a checklist—enable per route or per confidence score, not globally by default.

HyDE — Hypothetical Document Embeddings

HyDE (Gao et al., 2022) asks the LLM to write a fake answer passage, embeds that hypothetical document, and searches the index with the hypothetical vector instead of the raw question. The hypothesis reads like corpus text—closing the query/document style gap.

Intuition: vector search compares the question to stored documents. Documents are declarative, third-person, complete sentences. Questions are short, interrogative, and incomplete. HyDE translates question → pseudo-document in the same register as your KB, then embeds the pseudo-document. You never show the hypothesis to the user; it is a retrieval-only artifact.

When HyDE helps vs hurts

Scenario	HyDE	Notes
Paraphrase-heavy support FAQ	Often +10–20% recall@5	Hypothesis mirrors policy tone
Exact ID / SKU lookup	Can hallucinate wrong IDs	Prefer BM25 or filter metadata
Highly factual regulated content	Risky	Fake doc may invent compliance language
Low-latency chat (< 1 s budget)	Usually skip	Extra LLM + embed round trip

Implementation sketch

Prompt LLM: “Write a short passage that would answer this question, as it might appear in our documentation. Do not invent product names.”
Embed the hypothetical passage (same model as index).
Vector search with hypothetical embedding; optionally also search with original question and merge (multi-vector).
Pass retrieved real chunks to the answer LLM—never the hypothesis.

HYDE_PROMPT = """Write a single paragraph that would appear in internal documentation
and answer the question. Use neutral policy language. Do not invent IDs or numbers.

Question: {question}

Passage:"""

def hyde_retrieve(question: str, store, k: int = 5) -> list[dict]:
    hypo = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        max_tokens=256,
        messages=[{"role": "user", "content": HYDE_PROMPT.format(question=question)}],
    ).choices[0].message.content.strip()

    hypo_vec = embed(hypo)
    hits = store.similarity_search(hypo_vec, k=k)
    for h in hits:
        h["meta"] = {**h.get("meta", {}), "hyde_hypothesis": hypo[:200]}
    return hits

public List<Document> hydeRetrieve(String question, int k) {
    String hypo = chatClient.prompt()
        .user(HYDE_PROMPT.formatted(question))
        .options(ChatOptionsBuilder.builder()
            .withModel("gpt-4o-mini").withTemperature(0.0).withMaxTokens(256).build())
        .call()
        .getResult().getOutput().getContent();

    float[] hypoVec = embeddingModel.embed(hypo);
    List<Document> hits = vectorStore.similaritySearch(
        SearchRequest.query(hypoVec).withTopK(k));
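    // Clamp the preview length: substring(0, 200) throws on hypotheses shorter than 200 chars.
    String preview = hypo.substring(0, Math.min(200, hypo.length()));
    hits.forEach(d -> d.getMetadata().put("hyde_hypothesis", preview));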
    return hits;
}

💰 Cost

HyDE adds one chat completion (~100–300 output tokens) plus one embedding call per query. At 10k queries/day on gpt-4o-mini, budget roughly $15–40/month for the rewriter alone—before answer generation. Cache hypotheses for repeated FAQ clicks.

📦 Real World

Teams often run HyDE only when the first-pass naive retrieval score is below a threshold (e.g. top-1 cosine < 0.72). That preserves latency on easy queries while spending LLM tokens on hard ones.

🔒 Security

The HyDE prompt sees the user question. If the question contains PII, the hypothetical passage may echo it into logs. Redact before HyDE; do not log full hypotheses in production without a data classification review.

Multi-query — N variants, deduped results

Multi-query retrieval asks the LLM to produce N paraphrases or angle variants of the user question, runs retrieval for each, then deduplicates and merges chunk lists before context assembly. Different wordings surface different nearest neighbors in the index.

A single question can hide multiple retrieval intents: “How do I reset SSO and who approves it?” blends procedure and ownership. Multi-query expansion does not decompose such a question (see sub-questions below); it generates alternate phrasings of the same intent so that one unlucky embedding direction does not decide the result.

Dedup and merge strategies

Strategy	Behavior	When to use
Union by chunk ID	Keep first occurrence; preserve best rank score	Simple; good default
Max score per ID	Across variants, take max similarity per chunk	When scores are comparable across queries
RRF across lists	Reciprocal rank fusion on each variant’s ranking	Many variants (3–5); avoids score scale issues
Cap per variant	Retrieve top 3 per variant, merge to top 10 unique	Control context bloat

MULTI_QUERY_PROMPT = """Generate {n} different search queries that would help retrieve
documentation to answer the user question. One per line, no numbering.

Question: {question}"""

def generate_query_variants(question: str, n: int = 3) -> list[str]:
    text = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.3,
        max_tokens=200,
        messages=[{"role": "user", "content": MULTI_QUERY_PROMPT.format(n=n, question=question)}],
    ).choices[0].message.content
    variants = [question] + [ln.strip() for ln in text.splitlines() if ln.strip()]
    return variants[: n + 1]

def dedupe_by_id(lists: list[list[dict]], top_n: int = 8) -> list[dict]:
    """Merge ranked lists by chunk ID, keeping the best score and the best rank."""
    seen: dict[str, dict] = {}
    for results in lists:
        for rank, hit in enumerate(results):
            cid = hit["id"]
            prev = seen.get(cid)
            # Track the best (lowest) rank even when this hit does not win on score.
            best_rank = rank if prev is None else min(prev["best_rank"], rank)
            keep = hit if prev is None or hit.get("score", 0) > prev.get("score", 0) else prev
            seen[cid] = {**keep, "best_rank": best_rank}
    ordered = sorted(seen.values(), key=lambda h: (h["best_rank"], -h.get("score", 0)))
    return ordered[:top_n]

def multi_query_retrieve(question: str, store, k_per_query: int = 4) -> list[dict]:
    variants = generate_query_variants(question)
    lists = [naive_retrieve(q, store, k=k_per_query) for q in variants]
    return dedupe_by_id(lists)

public List<Document> multiQueryRetrieve(String question, int kPerQuery, int topN) {
    List<String> variants = new ArrayList<>();
    variants.add(question);
    variants.addAll(queryGenerator.generate(question, 3));

    Map<String, Document> best = new LinkedHashMap<>();
    for (String q : variants) {
        for (Document d : naiveRetriever.retrieve(q, kPerQuery)) {
            String id = d.getId();
            best.merge(id, d, (a, b) ->
                a.getScore() >= b.getScore() ? a : b);
        }
    }
    return best.values().stream().limit(topN).toList();
}

💡 Pro Tip

Always include the original question as one of the variants. LLM paraphrases occasionally drift intent; the raw user wording is a free anchor you should not discard.

⚠️ Pitfall

Retrieving top-10 per variant with five variants yields 50 chunks before dedupe, and dedupe often still leaves 15+ chunks, which bloats the context window. Cap per variant and final K aggressively; let reranking trim (next section).

🎯 Interview Tip

“How would you improve RAG recall?” — Mention measurable baseline, then multi-query or HyDE for paraphrase gaps, hybrid for lexical IDs, reranker for precision, and explicit latency/cost trade-offs. Avoid saying “use a better model” without retrieval mechanics.

Step-back prompting

Step-back retrieval first asks a broader question that captures the underlying principle, retrieves on that, then retrieves on the original specific question. Principle-level chunks anchor context; detail-level chunks answer the user.

Users ask narrow questions; documentation often states rules at a higher level of abstraction. “Can contractor laptops use personal Dropbox?” may match nothing, because the acceptable-use policy never mentions Dropbox; it discusses unsanctioned cloud storage under a general principle. Step-back generates “What is our policy on unsanctioned third-party cloud file storage?”, which retrieves the principle chunk; the original query retrieves exceptions and procedures.

Two-pass retrieval flow

LLM produces a step-back question (more general, same domain).
Retrieve top-K₁ on step-back question.
Retrieve top-K₂ on original question.
Merge lists (dedupe by ID); prefer diverse sources when assembling context.
Single answer LLM call with merged context.

STEP_BACK_PROMPT = """You are an expert at generalizing questions to underlying principles.
Given a specific user question, write ONE broader question that would help retrieve
high-level policy or concept documentation. Output only the question.

Specific question: {question}"""

def step_back_question(question: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        max_tokens=120,
        messages=[{"role": "user", "content": STEP_BACK_PROMPT.format(question=question)}],
    ).choices[0].message.content.strip()

def step_back_retrieve(question: str, store) -> list[dict]:
    broad = step_back_question(question)
    principle_hits = naive_retrieve(broad, store, k=3)
    detail_hits = naive_retrieve(question, store, k=5)
    return dedupe_by_id([principle_hits, detail_hits], top_n=8)

public List<Document> stepBackRetrieve(String question) {
    String broad = chatClient.prompt()
        .user(STEP_BACK_PROMPT.formatted(question))
        .call().getResult().getOutput().getContent().trim();

    List<Document> principle = naiveRetriever.retrieve(broad, 3);
    List<Document> detail = naiveRetriever.retrieve(question, 5);
    return mergeDedupe(principle, detail, 8);
}

⚖️ Trade-off

Step-back can pull overly general chunks that dilute the context window. Limit K₁ (principle pass) to 2–3 chunks and instruct the answer model to cite the specific passage when both general and specific rules appear.

🔬 Under the Hood

Step-back is related to “query relaxation” in classical IR. The LLM acts as a learned query expander toward hypernyms (specific → general). It differs from HyDE, which fabricates a document rather than a broader question.

Sub-question decomposition

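Compound questions contain multiple independent information needs. Decomposition splits them into sub-questions, retrieves for each sub-question, and merges evidence, so a multi-part question is covered in one user turn without an agent loop. Because the sub-questions are generated up front, this handles independent facts; a hop that depends on the answer to an earlier hop still needs iterative retrieval.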

“What’s the retention period for audit logs in the EU region, and who is the DPO contact?” requires GDPR retention docs and a contacts directory. No single embedding aligns with both. Decomposition yields: (1) EU audit log retention period, (2) DPO contact information—each gets its own retrieval pass.

Decomposition vs multi-query

Technique	Output	Best for
Multi-query	Paraphrases of same intent	Lexical / embedding mismatch on one fact
Sub-questions	Distinct sub-intents	Compound “and” / multi-part questions
Step-back	One broader principle question	Specific detail needs general policy context

DECOMPOSE_PROMPT = """Break the user question into independent sub-questions answerable from a knowledge base.
Return JSON: {{"sub_questions": ["...", "..."]}} Max 4 sub-questions.

Question: {question}"""

def decompose(question: str) -> list[str]:
    import json
    raw = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": DECOMPOSE_PROMPT.format(question=question)}],
    ).choices[0].message.content
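    try:
        data = json.loads(raw or "")
    except json.JSONDecodeError:
        return [question]  # malformed output: fall back to the original question
    subs = data.get("sub_questions", []) if isinstance(data, dict) else []
    subs = [s.strip() for s in subs if isinstance(s, str) and s.strip()]
    return subs[:4] if subs else [question]  # enforce the prompt's cap of 4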

def decomposed_retrieve(question: str, store, k_per_sub: int = 4) -> list[dict]:
    subs = decompose(question)
    lists = []
    for sq in subs:
        hits = naive_retrieve(sq, store, k=k_per_sub)
        for h in hits:
            h.setdefault("meta", {})["sub_question"] = sq
        lists.append(hits)
    merged = dedupe_by_id(lists, top_n=10)
    return merged

public List<Document> decomposedRetrieve(String question, int kPerSub) {
    List<String> subs = decomposer.decompose(question);
    if (subs.isEmpty()) subs = List.of(question);

    List<List<Document>> lists = new ArrayList<>();
    for (String sq : subs) {
        List<Document> hits = naiveRetriever.retrieve(sq, kPerSub);
        hits.forEach(d -> d.getMetadata().put("sub_question", sq));
        lists.add(hits);
    }
    return mergeDedupeLists(lists, 10);
}

For answer synthesis, pass sub-question labels in context headers so the LLM maps evidence to parts of the user question, for example [Sub-Q: DPO contact] before each chunk group. This reduces conflation when chunks contradict each other across regions.
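One way to assemble such a context is sketched below. It assumes hits are dicts with a `text` field and a `meta["sub_question"]` label as set during decomposed retrieval; the `build_context` name, the separator, and the fallback label are illustrative choices.

```python
from collections import defaultdict


def build_context(hits: list[dict]) -> str:
    """Group chunks under a [Sub-Q: ...] header, preserving first-seen order."""
    groups: dict[str, list[str]] = defaultdict(list)
    for hit in hits:
        label = hit.get("meta", {}).get("sub_question", "original question")
        groups[label].append(hit["text"])
    sections = [
        f"[Sub-Q: {label}]\n" + "\n---\n".join(texts)
        for label, texts in groups.items()
    ]
    return "\n\n".join(sections)


# Hypothetical hits for the audit-log example.
hits = [
    {"id": "a", "text": "Audit logs are retained for 90 days.",
     "meta": {"sub_question": "EU audit log retention period"}},
    {"id": "b", "text": "Contact the DPO at the privacy desk.",
     "meta": {"sub_question": "DPO contact"}},
]
context = build_context(hits)
print(context)
```

A chunk retrieved for two sub-questions keeps only one label after deduplication, so if that matters for your prompts, store the labels as a list instead of a single string.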

📦 Real World

Enterprise search products often detect compound questions with a lightweight classifier (regex on “ and ”, “?” count, conjunction NLP) before paying for decomposition LLM calls—routing simple questions to naive retrieve only.
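A rough version of such a pre-filter can be written with two regular expressions. This is a heuristic sketch rather than a description of any specific product: the word lists and the "two question words" rule are assumptions you would tune against your own traffic.

```python
import re

_CONJ = re.compile(r"\b(and|or|as well as|also)\b", re.IGNORECASE)
_WH = re.compile(r"\b(who|what|when|where|why|how|which)\b", re.IGNORECASE)


def looks_compound(question: str) -> bool:
    """Cheap check: route to decomposition only if the question looks multi-part."""
    if question.count("?") >= 2:
        return True
    # A conjunction plus two or more question words suggests two information needs.
    return bool(_CONJ.search(question)) and len(_WH.findall(question)) >= 2


print(looks_compound("How do I reset SSO and who approves it?"))  # True
print(looks_compound("What is the refund policy?"))               # False
```

The check errs toward the cheap path: a question such as “Can I cancel and get a refund?” has a conjunction but no second question word, so it stays on single-query retrieval.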

⚠️ Pitfall

Over-decomposition on single-intent questions adds noise sub-queries that retrieve irrelevant chunks. Cap at 3–4 sub-questions and fall back to single-query when the model returns one sub-question identical to the original.

Reranking — two-stage top-50 → top-5

First-stage bi-encoder retrieval optimizes recall over millions of chunks in milliseconds. Reranking optimizes precision on a small candidate set using a cross-encoder or vendor rerank API that scores (query, document) pairs jointly. Production pattern: retrieve 50, rerank to 5, then prompt the LLM.

Cross-encoders concatenate query and candidate text through a transformer and emit a relevance score. They see token-level interactions (“cancel” in query aligns with “termination” in doc) that bi-encoders miss. Cost is linear in candidates—hence the two-stage funnel.

Reranker options

Option	Hosting	Latency (50 pairs)	Notes
Cohere Rerank	API (rerank v3)	~100–300 ms	Strong multilingual; per-search pricing
Jina Reranker	API or self-host	~80–250 ms	jina-reranker-v2-base-multilingual
BGE reranker	Self-host (FlagEmbedding)	GPU-dependent	bge-reranker-v2-m3; data stays in VPC
ms-marco MiniLM	Self-host CPU	~200–500 ms	Classic baseline; English-centric
Voyage rerank	API	Similar to Cohere	Pairs with Voyage embeddings

Two-stage pipeline

Retrieve top-N with bi-encoder (N = 30–50; tune on eval set).
Rerank all N pairs; sort by rerank score.
Truncate to top-K for LLM context (K = 5–8).
Log both bi-encoder rank and rerank rank for drift analysis.
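The last step, logging both ranks, is easy to skip, so here is a small sketch of it. It assumes hits are dicts with an `id` key; the output field names (`retrieval_rank`, `rerank_rank`, `rank_gain`) are made up for this example.

```python
def with_rank_audit(coarse: list[dict], reranked: list[dict]) -> list[dict]:
    """Annotate reranked hits with their first-stage rank and the rank change."""
    first_stage_rank = {hit["id"]: i for i, hit in enumerate(coarse, start=1)}
    audited = []
    for new_rank, hit in enumerate(reranked, start=1):
        old_rank = first_stage_rank.get(hit["id"])
        audited.append({
            **hit,
            "retrieval_rank": old_rank,
            "rerank_rank": new_rank,
            # Positive = promoted by the reranker; None if the ID was not in stage one.
            "rank_gain": None if old_rank is None else old_rank - new_rank,
        })
    return audited


# Hypothetical first-stage and reranked lists.
coarse = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
reranked = [{"id": "c"}, {"id": "a"}]
audited = with_rank_audit(coarse, reranked)
print([(h["id"], h["rank_gain"]) for h in audited])  # [('c', 2), ('a', -1)]
```

If the final top-K routinely comes from deep in the first-stage list, the bi-encoder is the weak link and N should stay wide; if rank gains hover near zero, the reranker may not be earning its latency.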

import cohere

co = cohere.Client()

def rerank_hits(question: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    if not candidates:
        return []
    docs = [c["text"] for c in candidates]
    response = co.rerank(
        model="rerank-v3.5",
        query=question,
        documents=docs,
        top_n=min(top_n, len(docs)),
    )
    out = []
    for r in response.results:
        hit = {**candidates[r.index], "rerank_score": r.relevance_score}
        out.append(hit)
    return out

def two_stage_retrieve(question: str, store, retrieve_n: int = 50, final_k: int = 5) -> list[dict]:
    coarse = naive_retrieve(question, store, k=retrieve_n)
    return rerank_hits(question, coarse, top_n=final_k)

// Spring AI Rerank API (Cohere or custom RerankModel bean)
public List<Document> twoStageRetrieve(String question, int retrieveN, int finalK) {
    List<Document> coarse = naiveRetriever.retrieve(question, retrieveN);
    RerankRequest req = new RerankRequest(question, coarse, finalK);
    return rerankModel.call(req).getResults();
}

from sentence_transformers import CrossEncoder

_reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def bge_rerank(question: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    if not candidates:
        return []  # nothing to score; mirrors the guard in rerank_hits
    pairs = [(question, c["text"]) for c in candidates]
    scores = _reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [{**c, "rerank_score": float(s)} for c, s in ranked[:top_n]]

// ONNX or DJL-serving deployment of bge-reranker in VPC
// Java client posts batch of {query, document} pairs to inference endpoint
// and receives float[] scores — same contract as Python CrossEncoder.predict

💰 Cost

Cohere Rerank bills per search (query + all documents in one call). Fifty 500-token chunks ≈ 25k rerank tokens. Compare API cost vs GPU amortization for self-hosted BGE at your QPS—break-even often appears around 1–5M rerank calls/month.

💡 Pro Tip

Rerank after hybrid fusion, not twice. Run bi-encoder + BM25 → RRF → single rerank on the merged top-50. Reranking each rewriter variant separately multiplies cost.

🎯 Interview Tip

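Explain bi-encoder vs cross-encoder cost per query. A bi-encoder embeds the query once and searches precomputed document vectors, typically through an ANN index, so it scales to the whole index. A cross-encoder needs one transformer forward pass per (query, document) pair: O(N) over the full index, which is impractical, but only O(K) over a shortlist of K candidates. The two-stage pattern is the standard answer for “how do you scale semantic search to millions of docs with high precision?”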

Contextual retrieval (Anthropic)

Chunks lose surrounding context when split. Contextual retrieval prepends a short, LLM-generated context blurb to each chunk before embedding at index time, so vectors encode “this passage is about EU GDPR retention in the logging policy” not just “retention is 90 days.”

Anthropic’s published workflow: for each chunk, prompt Claude with the full document (or parent section) and the chunk text; ask for a concise situating prefix (50–100 tokens). Store context_prefix + chunk as the embedded text; at query time search as usual. Reported gains on top-20 retrieval failure rate: about 35% fewer failures with contextual embeddings alone, about 49% fewer when combined with contextual BM25, and about 67% fewer with reranking added on top.

Index-time vs query-time

Approach	When	Cost	Benefit
Contextual prefix (index)	Ingest pipeline	One LLM call per chunk (batchable)	Fixes orphan chunk embeddings permanently
Parent document summary	Ingest	One call per doc + attach as metadata	Cheaper; less granular
HyDE (query)	Each query	Per user question	Fixes query side, not chunk side

CONTEXTUALIZE_PROMPT = """<document>
{whole_document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Give a short succinct context (50-100 tokens) to situate this chunk in the document
for search retrieval. Answer only with the context, nothing else."""

def contextualize_chunk(whole_document: str, chunk: str) -> str:
    msg = CONTEXTUALIZE_PROMPT.format(whole_document=whole_document[:120_000], chunk=chunk)
    prefix = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        max_tokens=150,
        messages=[{"role": "user", "content": msg}],
    ).choices[0].message.content.strip()
    return f"{prefix}\n\n{chunk}"

def index_with_context(chunks: list[dict], doc_text: str, store) -> None:
    for ch in chunks:
        embedded_text = contextualize_chunk(doc_text, ch["text"])
        vec = embed(embedded_text)
        store.upsert(id=ch["id"], vector=vec, text=ch["text"], meta={
            **ch.get("meta", {}),
            "embedded_text": embedded_text[:500],  # debug only; redact in prod logs
        })

public String contextualizeChunk(String wholeDocument, String chunk) {
    String prompt = CONTEXTUALIZE_PROMPT.formatted(
        truncate(wholeDocument, 120_000), chunk);
    String prefix = chatClient.prompt().user(prompt)
        .options(ChatOptionsBuilder.builder().withTemperature(0.0).withMaxTokens(150).build())
        .call().getResult().getOutput().getContent().trim();
    return prefix + "\n\n" + chunk;
}

public void indexWithContext(List<Chunk> chunks, String docText) {
    for (Chunk ch : chunks) {
        String embeddedText = contextualizeChunk(docText, ch.getText());
        vectorStore.add(new Document(ch.getId(), ch.getText(),
            embeddingModel.embed(embeddedText), ch.getMeta()));
    }
}

Re-index when contextualization prompt or model version changes—vectors are not backward compatible. Version the index (index_version: contextual-v2-gpt4o-mini) and dual-write during migrations.
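One way to keep version tags honest is to derive them from the inputs that affect the vectors. The sketch below is an assumption about how you might do this, not a feature of any vector store: it hashes the contextualization prompt, the chat model name, and the embedding model name into a short tag.

```python
import hashlib


def index_version(prompt_template: str, chat_model: str, embed_model: str) -> str:
    """Derive a stable version tag; any change to the inputs yields a new tag."""
    joined = "\x1f".join([prompt_template, chat_model, embed_model])
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()[:12]
    return f"contextual-{digest}"


def needs_reindex(stored_version: str, prompt_template: str,
                  chat_model: str, embed_model: str) -> bool:
    """True when the stored index was built with different inputs."""
    return stored_version != index_version(prompt_template, chat_model, embed_model)


# Hypothetical settings for illustration.
v1 = index_version("Situate this chunk...", "gpt-4o-mini", "text-embedding-3-small")
print(needs_reindex(v1, "Situate this chunk...", "gpt-4o-mini", "text-embedding-3-small"))  # False
print(needs_reindex(v1, "Situate this chunk briefly...", "gpt-4o-mini", "text-embedding-3-small"))  # True
```

Storing the tag in chunk metadata lets a query filter on the active version while the new index is being dual-written.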

🔬 Under the Hood

Contextual retrieval fixes the document tower of asymmetric retrieval. Query rewriting fixes the query tower. Production stacks often use both: contextual prefixes at ingest, light multi-query or rerank at query time.

⚖️ Trade-off

One LLM call per chunk at ingest is expensive for 10M chunks. Use parent-section context instead of full document for long PDFs, batch prompts, and run contextualization only on high-traffic collections first.

🔒 Security

Contextualization prompts include full document text. Route through the same data-residency and PII scrubbing pipeline as embedding; do not send restricted annexes to external APIs without legal review.

Sparse retrieval — BM25 & TF-IDF

Sparse methods represent text as high-dimensional vectors with mostly zero weights—classic bag-of-words with term frequency and inverse document frequency. BM25 is the production default for lexical search; dense embeddings are the default for semantic search. Serious RAG runs both.

BM25 scores documents by weighted token overlap with the query, with length normalization so long boilerplate chunks do not dominate. It excels at SKUs, error codes, legal citations, person names, and rare technical tokens that embedding models under-weight. It fails on paraphrase and cross-language semantic match—complement, not replacement, for dense retrieval.

BM25 vs TF-IDF

Method	Scoring idea	Typical use
TF-IDF	Term freq × log(N/df)	Teaching baseline; sklearn prototypes
BM25Okapi	Saturated TF + length norm + IDF	Production sparse retrieval default
Elasticsearch BM25	Same family, distributed inverted index	10M+ docs, analyzers, facets

Managed vs in-process

OpenSearch / Elasticsearch: inverted index at scale, analyzers (stemming, stopwords, synonyms), hybrid kNN + BM25 in one cluster. rank_bm25 in Python: fine for <500k chunks in memory on a single pod. Pick managed search when you need ACL filters, aggregations, and ops-owned scaling.

from rank_bm25 import BM25Okapi

class Bm25Index:
    def __init__(self, chunks: list[dict]):
        self.chunks = chunks
        corpus = [c["text"].lower().split() for c in chunks]
        self.bm25 = BM25Okapi(corpus)

    def search(self, query: str, k: int = 10) -> list[dict]:
        tokens = query.lower().split()
        scores = self.bm25.get_scores(tokens)
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        return [
            {"id": self.chunks[i]["id"], "text": self.chunks[i]["text"],
             "score": float(scores[i]), "channel": "bm25"}
            for i in ranked if scores[i] > 0
        ]

// Elasticsearch Java API Client — BM25 query
SearchResponse<ChunkDoc> response = esClient.search(s -> s
    .index("kb-chunks")
    .query(q -> q.match(m -> m.field("text").query(question)))
    .size(k),
    ChunkDoc.class);

List<Document> hits = response.hits().hits().stream()
    .map(h -> toDocument(h))
    .toList();

# OpenSearch 2.x — combine knn and match in one query (then RRF in app or search pipeline)
hybrid_query = {
    "query": {
        "hybrid": {
            "queries": [
                {"match": {"text": {"query": question}}},
                {"knn": {"embedding": {"vector": q_vec, "k": 50}}},
            ]
        }
    }
}
# Many teams still run two queries + RRF in application code for tunability

// Parallel: CompletableFuture for knn branch + match branch, merge with RrfMerger
CompletableFuture<List<Document>> dense = CompletableFuture.supplyAsync(
    () -> vectorStore.similaritySearch(question, 50));
CompletableFuture<List<Document>> sparse = CompletableFuture.supplyAsync(
    () -> openSearchClient.bm25Search(question, 50));
List<Document> fused = rrfMerger.merge(List.of(dense.join(), sparse.join()), 10);

📦 Real World

Support KBs with ticket IDs and policy numbers almost always ship hybrid. Internal wiki-only RAG may stay dense-only until evals show ORD-/SKU queries failing—then add BM25 on the same chunk IDs without re-chunking.

⚠️ Pitfall

Tokenizing BM25 with naive .split() breaks on code snippets, emails, and hyphenated product names. Mirror Elasticsearch analyzers in offline evals, or use the same analyzer at index and query time.
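As an illustration of the difference, the following sketch swaps `.split()` for a small regex tokenizer that keeps emails and joined identifiers whole while dropping surrounding punctuation. The pattern is an assumption tuned to the examples in this guide, not an Elasticsearch analyzer equivalent.

```python
import re

# Emails first, then tokens joined by - _ . / (IDs, SKUs, versions), then plain words.
_TOKEN = re.compile(
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"  # email addresses
    r"|\w+(?:[-_./]\w+)*"            # ORD-1042, flex-annual, v2.1
)


def tokenize(text: str) -> list[str]:
    """Lowercase tokens; edge punctuation is dropped, inner joiners are kept."""
    return [m.group(0).lower() for m in _TOKEN.finditer(text)]


query = "Status for ORD-1042?"
print(query.lower().split())  # ['status', 'for', 'ord-1042?']
print(tokenize(query))        # ['status', 'for', 'ord-1042']
print(tokenize("Email billing@example.com."))  # ['email', 'billing@example.com']
```

With naive splitting, the trailing question mark makes `ord-1042?` a different token from the indexed `ord-1042`, so the exact-ID match silently fails. To use this with rank_bm25, build the corpus as `[tokenize(c["text"]) for c in chunks]` and pass `tokenize(query)` to `get_scores`, so both sides share one tokenizer.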

Hybrid fusion — reciprocal rank fusion (RRF)

Dense and sparse retrievers return scores on incomparable scales—cosine similarity vs BM25 raw scores. Reciprocal Rank Fusion merges ranked lists using only ranks, not raw scores: robust, parameter-light, and widely used in hybrid RAG.

RRF score formula

For document d, given ranked lists from retrievers 1…M, each list contributing rank_i(d) (1-based; unranked = ∞):

RRF(d) = Σ_i=1..M 1 / (k + rank_i(d))

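Constant k (typically 60, from the original RRF paper) dampens the influence of any single top rank. With k = 60, a document ranked 5th in both lists (2/65 ≈ 0.0308) beats a document ranked 1st in BM25 but 40th in dense (1/61 + 1/100 ≈ 0.0264); with a very small k the order flips and the lone top-1 wins. Documents appearing in multiple lists accumulate mass.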
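The effect of k is easiest to see by plugging numbers into the formula. The snippet below is a small worked check, using the 1st/40th and 5th/5th documents as the two cases; the helper name `rrf` is just for this example.

```python
def rrf(ranks: list[int], k: int = 60) -> float:
    """Sum of 1 / (k + rank) over the lists in which the document appears (1-based)."""
    return sum(1.0 / (k + r) for r in ranks)


split_doc = rrf([1, 40])   # 1st in BM25, 40th in dense
steady_doc = rrf([5, 5])   # 5th in both lists
print(f"k=60: split={split_doc:.4f} steady={steady_doc:.4f}")
# k=60: split=0.0264 steady=0.0308

# With a very small k the top rank dominates and the order flips.
print(f"k=1:  split={rrf([1, 40], k=1):.4f} steady={rrf([5, 5], k=1):.4f}")
# k=1:  split=0.5244 steady=0.3333
```

At k = 60 the consistently well-ranked document wins; at k = 1 the single top-1 hit wins. That is the practical meaning of "higher k gives flatter fusion".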

Parameter	Typical value	Effect
k	60	Higher → flatter fusion; lower → top ranks dominate
Lists merged	dense + BM25 (+ optional multi-query lists)	More lists → more recall; tune k
top_n after fusion	10–50 before rerank	Feed reranker a wide funnel

from collections import defaultdict

def rrf_score(rank: int, k: int = 60) -> float:
    """rank is 1-based (first result = 1)."""
    return 1.0 / (k + rank)

def rrf_merge(lists: list[list[dict]], k: int = 60, top_n: int = 10) -> list[dict]:
    """
    lists: each inner list is ordered best-first with unique chunk 'id' keys.
    Returns merged list sorted by RRF score descending.
    """
    scores: dict[str, float] = defaultdict(float)
    payload: dict[str, dict] = {}

    for results in lists:
        for rank_0, hit in enumerate(results):
            cid = hit["id"]
            scores[cid] += rrf_score(rank_0 + 1, k=k)
            # first list wins on conflicting fields (pass dense first to keep its metadata)
            payload[cid] = {**hit, **payload.get(cid, {})}

    ordered_ids = sorted(scores.keys(), key=lambda c: scores[c], reverse=True)[:top_n]
    return [{**payload[cid], "rrf_score": scores[cid]} for cid in ordered_ids]

def hybrid_retrieve(question: str, dense_store, bm25: Bm25Index, k_each: int = 25, final_n: int = 10) -> list[dict]:
    dense_hits = naive_retrieve(question, dense_store, k=k_each)
    sparse_hits = bm25.search(question, k=k_each)
    return rrf_merge([dense_hits, sparse_hits], k=60, top_n=final_n)

public final class RrfMerger {
    private final int k;

    public RrfMerger(int k) { this.k = k; }

    private double rrfScore(int rankOneBased) {
        return 1.0 / (k + rankOneBased);
    }

    public List<Document> merge(List<List<Document>> lists, int topN) {
        Map<String, Double> scores = new HashMap<>();
        Map<String, Document> payload = new HashMap<>();
        for (List<Document> results : lists) {
            for (int i = 0; i < results.size(); i++) {
                Document d = results.get(i);
                scores.merge(d.getId(), rrfScore(i + 1), Double::sum);
                // RRF uses ranks only; keep the first payload instead of comparing cross-channel scores.
                payload.putIfAbsent(d.getId(), d);
            }
        }
        return scores.entrySet().stream()
            .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
            .limit(topN)
            .map(e -> {
                Document doc = payload.get(e.getKey());
                doc.getMetadata().put("rrf_score", e.getValue());
                return doc;
            })
            .toList();
    }
}

End-to-end hybrid + rerank stack

Dense top-25 + BM25 top-25 in parallel.
RRF → top-15 unique chunks.
Cross-encoder / Cohere rerank → top-5 for LLM.
Log channel attribution (dense-only, sparse-only, both) for eval dashboards.
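
The first step above, running both searches in parallel, can be sketched with a thread pool. This is a minimal sketch, assuming each retriever is a blocking callable that takes (question, k) and returns ranked hits. The names parallel_search, dense_search, and sparse_search are hypothetical stand-ins for illustration, not functions defined elsewhere in this guide.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Hit = dict
Searcher = Callable[[str, int], list[Hit]]


def parallel_search(question: str, searchers: dict[str, Searcher], k_each: int = 25) -> dict[str, list[Hit]]:
    """Run every searcher concurrently and return hits keyed by channel name."""
    with ThreadPoolExecutor(max_workers=max(1, len(searchers))) as pool:
        futures = {name: pool.submit(fn, question, k_each) for name, fn in searchers.items()}
        # result() blocks until the worker finishes and re-raises its exception, if any.
        return {name: future.result() for name, future in futures.items()}


# Hypothetical stand-ins for a vector store query and a BM25 query.
def dense_search(question: str, k: int) -> list[Hit]:
    return [{"id": "c1", "score": 0.82}, {"id": "c2", "score": 0.74}][:k]


def sparse_search(question: str, k: int) -> list[Hit]:
    return [{"id": "c2", "score": 11.3}, {"id": "c9", "score": 7.9}][:k]


by_channel = parallel_search(
    "refund for FlexAnnual tier",
    {"dense": dense_search, "bm25": sparse_search},
)
```

With I/O-bound searches, the wall-clock time of this step is roughly that of the slower channel rather than the sum of both, which is why the two searches are usually budgeted as a single stage.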

🔬 Under the Hood

RRF is rank-based, so no score normalization is required. Alternatives include a weighted linear combination after min-max scaling (fragile when score distributions shift) and learned fusion with LTR models (heavier to operate). RRF is a common default for hybrid merging in RAG platforms as of 2024–2026.
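
The fragility of min-max fusion can be seen in a small worked comparison. This is a sketch with made-up scores, assuming the same 1 / (k + rank) formula used by the merge code above; ranked_ids, rrf, and minmax_weighted are hypothetical helper names. One BM25 score is inflated without changing the ranking, and only the linear combination reacts.

```python
from collections import defaultdict


def ranked_ids(raw: dict[str, float]) -> list[str]:
    """Chunk ids ordered best-first by raw channel score."""
    return sorted(raw, key=raw.get, reverse=True)


def rrf(lists: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal rank fusion: sum of 1 / (k + rank), with rank starting at 1."""
    fused: dict[str, float] = defaultdict(float)
    for ids in lists:
        for rank, cid in enumerate(ids, start=1):
            fused[cid] += 1.0 / (k + rank)
    return dict(fused)


def minmax_weighted(channels: list[dict[str, float]], weights: list[float]) -> dict[str, float]:
    """Weighted sum of per-channel min-max normalized scores."""
    fused: dict[str, float] = defaultdict(float)
    for raw, weight in zip(channels, weights):
        lo, hi = min(raw.values()), max(raw.values())
        span = (hi - lo) or 1.0  # avoid dividing by zero when all scores match
        for cid, score in raw.items():
            fused[cid] += weight * (score - lo) / span
    return dict(fused)


# Made-up scores for illustration only.
dense = {"a": 0.80, "b": 0.72, "c": 0.76}
bm25 = {"b": 12.0, "c": 11.0, "a": 3.0}
bm25_outlier = {**bm25, "b": 40.0}  # same ranking, one score inflated

rrf_before = rrf([ranked_ids(dense), ranked_ids(bm25)])
rrf_after = rrf([ranked_ids(dense), ranked_ids(bm25_outlier)])

linear_before = minmax_weighted([dense, bm25], [0.6, 0.4])
linear_after = minmax_weighted([dense, bm25_outlier], [0.6, 0.4])

top_before = max(linear_before, key=linear_before.get)
top_after = max(linear_after, key=linear_after.get)
```

Here rrf_before and rrf_after are identical because the ranks did not move, while the top result of the linear combination changes: the outlier compresses the normalized BM25 score of chunk c and pushes it down the fused list.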

💡 Pro Tip

Tag each hit with retrieval_channels: ["dense","bm25"] before fusion. When debugging wrong answers, you instantly see whether the failure was semantic, lexical, or fusion.
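
A minimal sketch of that tagging step, assuming hits are plain dicts with an "id" field and arrive grouped by channel name. The helpers tag_channels and attribution are hypothetical names, and the labels follow the dense-only / sparse-only / both split used for eval dashboards.

```python
def tag_channels(by_channel: dict[str, list[dict]]) -> dict[str, list[dict]]:
    """Return copies of each hit list with a retrieval_channels field added."""
    channels_by_id: dict[str, list[str]] = {}
    for channel, hits in by_channel.items():
        for hit in hits:
            found = channels_by_id.setdefault(hit["id"], [])
            if channel not in found:
                found.append(channel)
    # Copy each hit so the caller's original dicts are left untouched.
    return {
        channel: [{**hit, "retrieval_channels": list(channels_by_id[hit["id"]])} for hit in hits]
        for channel, hits in by_channel.items()
    }


def attribution(hit: dict) -> str:
    """Collapse the tag into a dashboard label."""
    channels = hit["retrieval_channels"]
    if len(channels) > 1:
        return "both"
    return "dense-only" if channels[0] == "dense" else "sparse-only"


tagged = tag_channels({
    "dense": [{"id": "c1"}, {"id": "c2"}],
    "bm25": [{"id": "c2"}, {"id": "c9"}],
})
labels = {hit["id"]: attribution(hit) for hits in tagged.values() for hit in hits}
```

Because the tag is computed before fusion, it records where each chunk came from regardless of how the merge later orders or deduplicates the hits.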

💰 Cost

Hybrid means two indexes to store (vectors + inverted index) and two searches per query. The dominant cost is often still the answer LLM, not retrieval, unless you run five query rewrites on every turn. Right-size k_each before adding more lists to RRF.

What this looks like in production

Retrieval strategy is a latency and cost budget problem. The winning stack is not “enable everything”—it is a routed pipeline with metrics, fallbacks, and clear rules for when to skip rewriting.

Latency budget (example enterprise chat)

Stage	p95 target	Skip when
Query rewrite (HyDE / multi-query)	300–800 ms	Naive top-1 score > threshold; cached FAQ
Dense + sparse parallel	50–150 ms	Never—core path
RRF merge	< 5 ms	Never—in-process
Rerank top-50 → 5	100–400 ms	Low-stakes internal bot; offline batch
Answer LLM	1–4 s	N/A
Total retrieval	< 1.2 s p95	Route simple queries to fast path

When to skip query rewriting

High-confidence retrieval — top-1 cosine ≥ tuned threshold on held-out evals.
Metadata-filtered lookup — user picked product + region; filter narrows to <100 chunks.
Exact keyword routes — regex detects ticket ID; go BM25-only.
Latency SLO < 2 s end-to-end — rewriting + rerank + answer may exceed budget.
Cached questions — same FAQ click within TTL; reuse chunk IDs.
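
The skip rules above can be sketched as a single routing function. This is an illustration under stated assumptions: the rule order, the ticket ID pattern, the 0.80 cosine threshold, and the names RouteInputs and choose_route are all hypothetical. The real threshold should come from held-out evals.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical ticket format such as SUP-10423; adjust to your own ID scheme.
TICKET_ID = re.compile(r"\b[A-Z]{2,5}-\d{3,}\b")


@dataclass
class RouteInputs:
    question: str
    top1_cosine: Optional[float]          # None if no dense probe was run
    filtered_chunk_count: Optional[int]   # None if no metadata filter applies
    cache_hit: bool
    latency_slo_ms: int


def choose_route(inp: RouteInputs, cosine_threshold: float = 0.80) -> str:
    """Return a route label; every label except 'rewrite' skips query rewriting."""
    if inp.cache_hit:
        return "cached"
    if TICKET_ID.search(inp.question):
        return "bm25_only"
    if inp.filtered_chunk_count is not None and inp.filtered_chunk_count < 100:
        return "filtered_fast"
    if inp.top1_cosine is not None and inp.top1_cosine >= cosine_threshold:
        return "fast"
    if inp.latency_slo_ms < 2000:
        return "fast"
    return "rewrite"


route = choose_route(RouteInputs("status of SUP-10423?", None, None, False, 5000))
```

Keeping the rules in one function makes the routing decision easy to log and to unit test as thresholds change.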

When to enable each strategy

Strategy	Enable when eval shows…
HyDE / multi-query	Paraphrase bucket recall@5 < 70%
Step-back	Policy “principle” questions miss general section
Decomposition	Compound questions > 15% of traffic
Contextual retrieval	Small orphan chunks dominate failures
Hybrid BM25	ID/code/SKU queries fail dense-only
Rerank	Right chunk in top-50 but not top-5
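
The thresholds in this table presuppose per-bucket metrics. A minimal sketch of computing recall@5 and MRR per failure bucket follows, assuming each labelled example carries a bucket name, the retrieved chunk IDs in rank order, and the set of relevant chunk IDs. The example data and helper names are made up for illustration.

```python
from collections import defaultdict


def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0


def per_bucket_report(examples: list[dict], k: int = 5) -> dict[str, dict[str, float]]:
    """Average recall@k and MRR for each failure bucket."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for example in examples:
        grouped[example["bucket"]].append(example)
    report = {}
    for bucket, rows in grouped.items():
        report[bucket] = {
            f"recall@{k}": sum(recall_at_k(r["retrieved"], r["relevant"], k) for r in rows) / len(rows),
            "mrr": sum(reciprocal_rank(r["retrieved"], r["relevant"]) for r in rows) / len(rows),
        }
    return report


# Made-up labelled examples.
examples = [
    {"bucket": "paraphrase", "retrieved": ["c7", "c2", "c9", "c4", "c1", "c3"], "relevant": {"c3"}},
    {"bucket": "paraphrase", "retrieved": ["c3", "c8"], "relevant": {"c3"}},
    {"bucket": "id_lookup", "retrieved": ["c5", "c6"], "relevant": {"c6"}},
]
report = per_bucket_report(examples)

# Buckets below the 70% recall@5 line are candidates for an extra strategy.
needs_work = [bucket for bucket, metrics in report.items() if metrics["recall@5"] < 0.70]
```

In this toy data the paraphrase bucket falls below the 70% line because one relevant chunk sits at rank 6, which is the situation the HyDE / multi-query row describes.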

Observability fields (every query)

retrieval_strategy_path — e.g. fast | hyde+hybrid+rerank
chunk_ids, ranks per channel, rrf_score, rerank_score
Rewrite prompt version; hypothesis length (HyDE)
retrieve_ms, rerank_ms, rewrite_ms, answer_ms
user_feedback / thumbs — link to retrieved set for failure mining
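
One way to collect these fields is a per-query record with a timing helper. This is a sketch rather than a prescribed schema: the RetrievalLog class, its field layout, and the JSON-line output are assumptions, and the sleep call only stands in for a real search.

```python
import json
import time
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RetrievalLog:
    retrieval_strategy_path: str = "fast"
    chunk_ids: list[str] = field(default_factory=list)
    rewrite_prompt_version: Optional[str] = None
    hyde_hypothesis_chars: Optional[int] = None
    timings_ms: dict[str, float] = field(default_factory=dict)
    user_feedback: Optional[str] = None

    @contextmanager
    def timed(self, stage: str):
        """Record wall-clock time for a stage as '<stage>_ms'."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings_ms[f"{stage}_ms"] = (time.perf_counter() - start) * 1000.0

    def to_json_line(self) -> str:
        return json.dumps(asdict(self))


log = RetrievalLog(retrieval_strategy_path="hyde+hybrid+rerank", rewrite_prompt_version="v3")
with log.timed("retrieve"):
    time.sleep(0.001)  # stand-in for the dense + sparse searches
    log.chunk_ids = ["c2", "c1", "c9"]

line = log.to_json_line()
```

Writing the timing in a finally block means a stage that raises still leaves its latency in the record, which helps when failures and slow queries overlap.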

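def retrieve_for_production(question: str, store, bm25: Bm25Index, cfg) -> tuple[list[dict], dict]:
    # Default label for the fall-through case: hybrid retrieval, no rewriting.
    meta = {"path": "hybrid"}
    coarse = hybrid_retrieve(question, store, bm25, k_each=25, final_n=50)

    # Fast path: the top hit is confident enough to skip query rewriting.
    # Note: "score" is the raw channel score carried through rrf_merge (not
    # rrf_score), so tune the threshold against the channel that supplies it.
    if coarse and coarse[0].get("score", 0) >= cfg.confidence_skip_rewrite:
        meta["path"] = "hybrid_only"
        final = rerank_hits(question, coarse, top_n=5) if cfg.use_rerank else coarse[:5]
        return final, meta

    # Slow path: widen recall with rewritten queries before reranking.
    if cfg.use_multi_query:
        coarse = multi_query_retrieve(question, store)
        meta["path"] = "multi_query+hybrid"

    if cfg.use_rerank:
        final = rerank_hits(question, coarse, top_n=5)
        meta["path"] += "+rerank"
    else:
        final = coarse[:5]

    return final, meta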

public RetrievalResult retrieveForProduction(String question, RetrievalConfig cfg) {
    RetrievalMeta meta = new RetrievalMeta();
    List<Document> coarse = hybridRetriever.retrieve(question, 50);

    if (!coarse.isEmpty() && coarse.get(0).getScore() >= cfg.getConfidenceSkipRewrite()) {
        meta.setPath("hybrid_only");
        return new RetrievalResult(
            // Guard the slice: subList(0, 5) throws if fewer than 5 hits came back.
            cfg.isUseRerank() ? reranker.rerank(question, coarse, 5) : coarse.subList(0, Math.min(5, coarse.size())),
            meta);
    }
    if (cfg.isUseMultiQuery()) {
        coarse = multiQueryRetriever.retrieve(question);
        meta.setPath("multi_query+hybrid");
    }
    List<Document> finalList = cfg.isUseRerank()
        ? reranker.rerank(question, coarse, 5)
        : coarse.subList(0, Math.min(5, coarse.size()));
    return new RetrievalResult(finalList, meta);
}

Production readiness checklist

Labelled eval set with per-strategy recall@5 and MRR
Fast path + slow path with explicit routing rules
Same chunk IDs across dense, sparse, and citation UI
Index versioning for contextual embedding migrations
Rerank and rewrite behind feature flags per tenant
Retrieval logs joined to answer quality (thumbs, CSAT)
Load test: p95 retrieve+rerank under 2× expected QPS

📦 Real World

Perplexity-style products iterate retrieval in offline replay before online A/B. Ship one strategy at a time—hybrid first, rerank second, query rewriting last—so regressions are attributable.

🎯 Interview Tip

“Design RAG retrieval for a legal KB.” Open with chunking + hybrid + rerank, then cover contextual indexing for citations, a latency budget with a fast path, and eval metrics. Close with privacy: raw user queries should never reach the rewrite LLM without a logging and redaction policy in place.

💡 Pro Tip

Store the retrieval snapshot (chunk IDs + scores) on each answer record. When users report wrong answers six weeks later, you can replay with new strategies without reproducing the session.
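
A minimal sketch of such a snapshot, assuming it is stored as JSON next to the answer record. The RetrievalSnapshot fields and the diff_against helper are hypothetical; whether to store the question text at all depends on your logging and redaction policy.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SnapshotHit:
    chunk_id: str
    rank: int
    score: float


@dataclass(frozen=True)
class RetrievalSnapshot:
    answer_id: str
    question: str        # store a redacted form if policy requires it
    index_version: str
    strategy_path: str
    hits: tuple[SnapshotHit, ...]


def to_json(snapshot: RetrievalSnapshot) -> str:
    return json.dumps(asdict(snapshot))


def from_json(raw: str) -> RetrievalSnapshot:
    data = json.loads(raw)
    hits = tuple(SnapshotHit(**hit) for hit in data.pop("hits"))
    return RetrievalSnapshot(hits=hits, **data)


def diff_against(snapshot: RetrievalSnapshot, new_chunk_ids: list[str]) -> dict[str, list[str]]:
    """Compare the stored chunk IDs with a replay under a new strategy."""
    old_ids = [hit.chunk_id for hit in snapshot.hits]
    return {
        "dropped": [cid for cid in old_ids if cid not in new_chunk_ids],
        "added": [cid for cid in new_chunk_ids if cid not in old_ids],
    }


snapshot = RetrievalSnapshot(
    answer_id="ans-001",
    question="Can I get money back if I cancel my yearly plan mid-cycle?",
    index_version="2025-01",
    strategy_path="hybrid_only",
    hits=(SnapshotHit("c2", 1, 0.81), SnapshotHit("c7", 2, 0.64)),
)
restored = from_json(to_json(snapshot))
changes = diff_against(restored, ["c2", "c5"])
```

Recording index_version alongside the chunk IDs matters because a replay against a re-chunked or re-embedded index is not comparable to the original retrieval.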