
Chunking strategies for RAG

Fixed-size, semantic, recursive, structure-aware, and LLM-driven chunking—trade-offs and when to test.

Five chunking strategies

Once the most relevant chunks are re-ranked, they are fed into the LLM. The model combines the user's original query with the retrieved chunks in a prompt template and generates a response that synthesizes information from the selected documents. The LLM uses the context provided by the chunks to produce a coherent, relevant answer that directly addresses the user's query.

Since chunking is the very first step in any RAG pipeline, it's important to understand the different ways it can be done. Here's the typical workflow of RAG:

Overview of five chunking approaches used in RAG pipelines.

Since the additional document(s) can be large, step 1 also involves chunking, in which a large document is divided into smaller, manageable pieces. This step is crucial because it ensures the text fits the input size of the embedding model. The five chunking strategies covered here are fixed-size, semantic, recursive, document structure-based, and LLM-based chunking.

1) Fixed-size chunking

Fixed-size windows with overlap so ideas split across boundaries still appear in adjacent chunks.

Split the text into uniform segments based on a pre-defined number of characters, words, or tokens. Since a direct split can disrupt the semantic flow, it is recommended to keep some overlap between consecutive chunks (the blue part in the figure).

This is simple to implement, and because all chunks are the same size, batch processing is simpler. The drawback is that a split often lands mid-sentence or mid-idea, so important information is likely to be spread across chunks.
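A minimal sketch of fixed-size chunking with overlap, counting characters for simplicity. The function name `fixed_size_chunks` and the default values for `chunk_size` and `overlap` are illustrative assumptions, not taken from the original; a real pipeline would more likely count tokens using the embedding model's tokenizer.

```python
def fixed_size_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `chunk_size` characters.

    Consecutive windows share `overlap` characters, so an idea cut at a
    boundary still appears in the neighbouring chunk. (Hypothetical helper.)
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With `chunk_size=4` and `overlap=1`, each chunk starts with the last character of the previous one. The final chunk may be shorter than `chunk_size` when the text length does not line up with the step.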

2) Semantic chunking

Segment the document into meaningful units such as sentences, paragraphs, or thematic sections. Next, create an embedding for each segment. Start with the first segment and its embedding: if it has a high cosine similarity with the second segment's embedding, both segments form a chunk.

Semantic chunking: grow a segment while cosine similarity to the next unit stays high.

Recursive chunking: respect paragraphs first, then subdivide only when a block exceeds the limit.

This continues until the cosine similarity drops significantly. The moment it does, we start a new chunk and repeat.

Unlike fixed-size chunks, this maintains the natural flow of language and preserves complete ideas. Because each chunk carries a complete idea, retrieval accuracy tends to improve, which in turn leads to more coherent and relevant responses from the LLM. A minor problem is that the method depends on a threshold to decide whether the cosine similarity has dropped significantly, and a suitable threshold can vary from document to document.
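The loop described above can be sketched as follows. This is an illustration under stated assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, each segment is compared with the segment immediately before it (one of several reasonable readings of the procedure), and the `threshold` value is arbitrary.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(segments: list[str], threshold: float = 0.3) -> list[str]:
    """Merge consecutive segments until similarity to the next one drops."""
    chunks: list[str] = []
    current: list[str] = []
    for segment in segments:
        # Assumption: compare the new segment with the previous segment only.
        if current and cosine(embed(current[-1]), embed(segment)) < threshold:
            chunks.append(" ".join(current))  # similarity dropped: close chunk
            current = []
        current.append(segment)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Swapping `embed` for a real model changes the similarity values, so the threshold would need to be re-tuned, which is the document-dependence noted above.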

3) Recursive chunking

First, chunk based on inherent separators such as paragraphs or sections. Next, split each chunk into smaller chunks if its size exceeds a pre-defined chunk-size limit. If the chunk already fits within the limit, no further splitting is done. Here's what the output could look like:

Example hierarchical splits from the deck’s recursive strategy.

Structure-aware chunking along headings, sections, or logical document outline.

As shown above, we first define two chunks (the two paragraphs in purple). Next, paragraph 1 is further split into smaller chunks.

Unlike fixed-size chunks, this approach also maintains the natural flow of language and preserves complete ideas. However, there is some extra overhead in terms of implementation and computational complexity.
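One way to sketch the recursive strategy in plain Python is shown below. The separator order (blank line, newline, sentence break, space), the `max_size` default, and the final hard character split are assumptions made for illustration; the separators themselves are dropped from the output in this simplified version.

```python
def recursive_chunks(
    text: str,
    max_size: int = 200,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split on the coarsest separator first; recurse only into oversized pieces."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_size:
        return [text]  # fits the limit: no further splitting
    if not separators:
        # No separators left: fall back to a hard character split.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]

    first, *finer = separators
    chunks: list[str] = []
    for piece in text.split(first):
        # Each piece is retried with the finer separators only if it is too big.
        chunks.extend(recursive_chunks(piece, max_size, tuple(finer)))
    return chunks
```

A paragraph that already fits the limit is returned whole, while a longer one is broken down by sentences and then words, mirroring the two-paragraph example above.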

4) Document structure-based chunking

This strategy uses the inherent structure of a document, such as headings, sections, or paragraphs, to define chunk boundaries. Because the chunks align with the document's logical sections, its structural integrity is preserved. Here's what the output could look like:

Illustration: headings define chunk borders when the source is well structured.

LLM-assisted chunking for semantically tight units—higher cost, higher fidelity boundaries.

That said, this approach assumes that the document has a clear structure, which may not be true. Chunks may also vary in length and can exceed the model's token limit. You can combine it with recursive splitting to break up sections that are too long.
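As an illustration, the sketch below treats Markdown `#` headings as the structural markers, which is an assumption: the original does not name a document format. The function name `structure_chunks` and the dictionary layout of each chunk are also hypothetical.

```python
import re

# Assumption: ATX-style Markdown headings ("# Title" through "###### Title").
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def structure_chunks(markdown: str) -> list[dict]:
    """Return one chunk per heading, holding the text up to the next heading."""
    sections = []
    current = {"heading": None, "lines": []}  # text before the first heading
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append(current)  # a new heading closes the previous section
            current = {"heading": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)
    sections.append(current)

    chunks = []
    for section in sections:
        text = "\n".join(section["lines"]).strip()
        if section["heading"] is not None or text:  # skip an empty preamble
            chunks.append({"heading": section["heading"], "text": text})
    return chunks
```

Section lengths are whatever the author wrote, so a long section can still exceed the size limit. This simple version would also mistake a `#` line inside a code fence for a heading.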

5) LLM-based chunking

Prompt the LLM to generate semantically isolated and meaningful chunks. This method offers high semantic accuracy, since the LLM can understand context and meaning beyond the simple heuristics used in the four approaches above. However, it is the most computationally demanding of the five techniques discussed here.
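A rough sketch of how this could be wired up is given below. Everything specific here is an assumption: the prompt wording, the `---` delimiter, and the `llm` parameter, which stands for any function that takes a prompt string and returns the model's text response. No particular vendor API is implied.

```python
from typing import Callable

# Hypothetical prompt; the wording and the '---' delimiter are assumptions.
PROMPT_TEMPLATE = (
    "Split the text below into self-contained chunks, each covering one idea.\n"
    "Copy the text verbatim and put a line containing only '---' between chunks.\n\n"
    "Text:\n{text}"
)


def llm_chunks(text: str, llm: Callable[[str], str], delimiter: str = "---") -> list[str]:
    """Ask an LLM to mark chunk boundaries, then split its response on them."""
    response = llm(PROMPT_TEMPLATE.format(text=text))
    # Drop empty parts, e.g. when the response ends with a trailing delimiter.
    return [part.strip() for part in response.split(delimiter) if part.strip()]
```

Passing the model in as a plain callable keeps the sketch testable with a stub. In practice the response should be validated, since a model may paraphrase or omit text instead of copying it verbatim.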

Chunking trade-offs: accuracy, throughput, and tuning effort across the five methods.

Decision framing: when chunk quality and architecture hinge on content type and stack constraints.

LLMs also typically have a limited context window, so that limit must be accounted for when passing in long documents.

Each technique has its own advantages and trade-offs. We have observed that semantic chunking works well in many cases, but you still need to test it on your own data. The choice will depend on the nature of your content, the capabilities of the embedding model, the available computational resources, and similar constraints.

Key takeaways

Chunking quality strongly affects retrieval: a single embedding for an entire document is rarely useful.
Pick a strategy based on document structure, embedding limits, and compute budget.