REFRAG

Meta’s relevance-aware pipeline: compress chunks, RL-filter, selectively expand before the decoder.

RAG vs REFRAG

RAG relies on similarity matching between the query vector and the vectors of the additional documents. However, questions are structurally very different from answers, so typical RAG systems are well suited mainly to lookup-based question answering. For instance, we cannot build a plain RAG pipeline to summarize the additional data: the similarity-matching step retrieves only the top matches, so the LLM never sees all the documents in its prompt. RAG therefore has both pros and cons:

● We never have to fine-tune the model, which saves a lot of computing power.

● But this also limits the applicability to specific types of systems.

Most of what we retrieve in RAG setups never actually helps the LLM. As discussed earlier, in classic RAG, when a query arrives:

● You encode it into a vector.

● Fetch similar chunks from the vector DB.

● Dump the retrieved context into the LLM.
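The three steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the 3-dimensional vectors are hypothetical stand-ins for real encoder output, and `cosine`, `retrieve_top_k`, and `build_prompt` are made-up helper names, not part of any library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_top_k(query_vec, chunk_vecs, k):
    """Return the indices of the k chunks most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

def build_prompt(question, chunks, top_ids):
    """Concatenate every retrieved chunk in full, as classic RAG does."""
    context = "\n\n".join(chunks[i] for i in top_ids)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Toy data: hypothetical embeddings standing in for a real encoder and vector DB.
chunks = ["Paris is the capital of France.",
          "The Eiffel Tower opened in 1889.",
          "Bananas are rich in potassium."]
chunk_vecs = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.1], [0.0, 0.1, 0.9]]
query_vec = [1.0, 0.0, 0.0]

top_ids = retrieve_top_k(query_vec, chunk_vecs, k=2)
prompt = build_prompt("What is the capital of France?", chunks, top_ids)
```

Note that the second retrieved chunk lands in the prompt in full even though it does not answer the question; every token of it still has to be processed by the LLM.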

It typically works, but at a huge cost:

● Most chunks contain irrelevant text.

● The LLM has to process far more tokens.

● You pay for compute, latency, and context.

That’s the exact problem Meta AI’s new method REFRAG solves. It fundamentally rethinks retrieval and the diagram below explains how it works.

Essentially, instead of feeding the LLM every chunk and every token, REFRAG compresses and filters context at the vector level:

● Chunk compression: Each chunk is encoded into a single compressed embedding, rather than hundreds of token embeddings.

● Relevance policy: A lightweight RL-trained policy evaluates the compressed embeddings and keeps only the most relevant chunks.

● Selective expansion: Only the chunks chosen by the RL policy are expanded back into their full token-level embeddings and passed to the LLM.

REFRAG: compress chunks to compact vectors, score relevance with a policy, expand only what matters.
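A minimal sketch of the compression and relevance-scoring ideas, under loud assumptions: mean pooling stands in for REFRAG's chunk encoder, and a plain dot-product ranking stands in for the RL-trained policy. Neither is the method from the paper; the function names and toy vectors are hypothetical and only show the data flow.

```python
def compress(token_embs):
    """Collapse a chunk's token embeddings into one vector.
    Assumption: mean pooling, used here in place of a trained encoder."""
    n = len(token_embs)
    dim = len(token_embs[0])
    return [sum(tok[d] for tok in token_embs) / n for d in range(dim)]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def select_chunks(query_vec, chunk_vecs, budget):
    """Stand-in for the relevance policy: keep the `budget` best-scoring chunks.
    The real policy is learned with RL; this heuristic is not."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: dot(query_vec, chunk_vecs[i]),
                    reverse=True)
    return set(ranked[:budget])

# Toy token-level embeddings for three retrieved chunks (2 dimensions each).
chunk_tokens = [
    [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]],   # close to the query direction
    [[0.0, 1.0], [0.2, 0.8]],               # far from the query direction
    [[0.6, 0.4], [0.4, 0.6]],               # in between
]
query_vec = [1.0, 0.0]

compressed = [compress(tokens) for tokens in chunk_tokens]  # one vector per chunk
keep = select_chunks(query_vec, compressed, budget=1)       # indices to expand
```

The policy only ever looks at one vector per chunk, so scoring stays cheap no matter how many tokens each chunk contains.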

This way, the model processes just what matters and ignores the rest. Here's the step-by-step walkthrough:

Step 1-2) Encode the docs and store them in a vector database.

Step 3-5) Encode the full user query and find the relevant chunks. Also compute token-level embeddings for the matching chunks and, in step 7, for the query.

Step 6) Use a relevance policy (trained via RL) to select chunks to keep.

Step 8) Concatenate the token-level embeddings of the input query with the token-level embeddings of the selected chunks and a compressed single-vector representation of each rejected chunk.

Step 9-10) Send the combined sequence to the LLM.
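To make step 8 concrete, here is a hedged sketch of how the decoder input could be assembled and how much shorter it gets. The sizes (a 4-token query, three 16-token chunks), the ordering (query first, chunks in retrieval order), and the `mean_pool` compressor are all assumptions for illustration, not details taken from the paper.

```python
def mean_pool(token_embs):
    """Hypothetical compressor: average a chunk's token embeddings into one vector."""
    dim = len(token_embs[0])
    return [sum(tok[d] for tok in token_embs) / len(token_embs) for d in range(dim)]

def build_decoder_input(query_tokens, chunk_tokens, keep):
    """Query tokens first, then each chunk either expanded or as a single vector."""
    sequence = list(query_tokens)
    for i, tokens in enumerate(chunk_tokens):
        if i in keep:
            sequence.extend(tokens)             # selected: full token-level embeddings
        else:
            sequence.append(mean_pool(tokens))  # rejected: one compressed vector
    return sequence

# Toy sizes (assumed): a 4-token query and three retrieved chunks of 16 tokens each.
query_tokens = [[1.0, 0.0] for _ in range(4)]
chunk_tokens = [[[float(c), float(t)] for t in range(16)] for c in range(3)]
keep = {1}  # pretend the policy selected only the second chunk

decoder_input = build_decoder_input(query_tokens, chunk_tokens, keep)
full_length = len(query_tokens) + sum(len(c) for c in chunk_tokens)
print(len(decoder_input), full_length)  # prints: 22 52
```

In this toy setup the decoder sees 22 positions instead of 52: the 4 query tokens, 16 tokens from the expanded chunk, and one vector for each of the two rejected chunks.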

The RL step makes REFRAG a more relevance-aware RAG pipeline. According to the research paper, this approach:

● achieves 30.85x faster time-to-first-token (a 3.75x improvement over the previous SOTA)

● provides 16x larger context windows

● outperforms LLaMA on 16 RAG benchmarks while using 2–4x fewer decoder tokens

● shows no accuracy loss across RAG, summarization, and multi-turn conversation tasks

In other words, the paper reports about 16x more context and roughly 30x faster time-to-first-token, with the same accuracy.

Key takeaways

Most retrieved tokens never help; REFRAG trims expansion with a learned policy.
Reported gains include faster time-to-first-token and wider effective context.