Chunking strategies
Vector search does not read your PDF—it reads chunks. Each chunk becomes one embedding vector. If a refund rule is split across two chunks, a question about refunds may retrieve only half the policy and the LLM will guess the rest. Chunking is the highest-leverage knob in RAG after model choice: it determines whether retrieval finds the right evidence at all.
After reading, you should be able to: explain why embedding quality depends on chunk coherence; implement fixed-size, sentence, recursive, semantic, and document-aware splitters in Python and Java; choose chunk sizes by use case; attach metadata for filtered retrieval and citations; design parent–child indexes; and evaluate chunk policy on golden documents before re-indexing production.
Why chunking matters
Embedding quality depends on chunk coherence. An embedding model compresses an entire chunk into one vector— if the chunk mixes unrelated topics, the vector becomes an ambiguous average and similarity scores flatten. Good chunks are self-contained units of meaning that a user question can match precisely.
Consider a support policy document: “Annual plans: full refund within 30 days of purchase. Monthly plans: cancel anytime; no refund for partial months.” A naive fixed-width split might cut after “30 days of” and leave “purchase. Monthly plans…” in the next chunk. The first chunk now holds an unfinished rule, and the second opens with a stray “purchase.” glued to the monthly policy. A user asking “Can I get a refund on my yearly subscription?” may retrieve the second chunk, which talks about refunds but never mentions annual plans, and the LLM may hallucinate eligibility.
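To make the boundary problem concrete, here is a minimal sketch of a blind character splitter applied to the policy text. The helper name `split_chars` and the width of 43 are illustrative assumptions; 43 is picked only so the cut lands right after “30 days of” in this short sample.

```python
def split_chars(text: str, width: int) -> list[str]:
    """Cut text into consecutive windows of `width` characters, ignoring meaning."""
    return [text[i:i + width] for i in range(0, len(text), width)]

policy = (
    "Annual plans: full refund within 30 days of purchase. "
    "Monthly plans: cancel anytime; no refund for partial months."
)

# Width 43 is an assumption chosen so the cut falls mid-rule in this sample.
pieces = split_chars(policy, 43)
for n, piece in enumerate(pieces):
    print(n, repr(piece))
```

The first piece ends mid-rule and the second begins with the orphaned word “purchase.”, so neither chunk states the annual refund rule in full.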
What a good chunk optimizes for
| Property | Why it matters | Failure mode when ignored |
|---|---|---|
| Coherence | One topic or rule per vector | Embedding dilution; wrong chunk wins on cosine |
| Completeness | Conditions stay together (time window + plan type) | LLM fills gaps from parametric knowledge |
| Retrievability | Chunk text overlaps vocabulary in real questions | Correct doc never surfaces in top-K |
| Context budget | Small enough that several chunks fit in the LLM window | Truncated context; dropped citations |
| Traceability | Metadata maps chunk → source page/section | Users cannot verify answers; compliance gaps |
Chunking sits between ingest and embed
- Extract — PDF, HTML, code, tickets; normalize to text with layout hints where possible.
- Split — apply a chunk policy (this guide); emit text + metadata records.
- Embed — batch chunks through the embedding model; store vectors with metadata.
- Retrieve — query embedding vs chunk vectors; optional rerank; pass top chunks to LLM.
Changing chunk policy after launch requires re-embedding (new vectors)—not an in-place patch. Version your collection (support_kb_v3_sentence_512) and measure recall before cutover.
A common mistake is treating chunk size as “whatever fits the embedding model max length.” Some embedding APIs accept around 8K tokens per input, but retrieval quality usually peaks at a few hundred tokens per chunk, because long chunks bury the one sentence that matched the query. Split for search, then optionally expand to parent context at generation time (see parent–child).
Bi-encoder embeddings are mean-pooled (or CLS-pooled) over all tokens in the chunk. Attention mixes token representations—outlier sentences pull the vector toward their topic. That is why mixing “Refunds” and “Shipping” in one chunk hurts both retrieval paths.
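The dilution effect can be illustrated with toy numbers. This is a sketch, not a real encoder: the 2-D vectors are made up (axis 0 stands for “refunds”, axis 1 for “shipping”), and real models pool contextualized token vectors in hundreds of dimensions.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pool(vectors: list[list[float]]) -> list[float]:
    """Average token vectors component-wise into one chunk vector."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Hypothetical token embeddings for two topics.
refund_tokens = [[1.0, 0.0], [0.9, 0.1]]
shipping_tokens = [[0.0, 1.0], [0.1, 0.9]]
query = [1.0, 0.0]  # a pure refunds question

focused = mean_pool(refund_tokens)                 # chunk about refunds only
mixed = mean_pool(refund_tokens + shipping_tokens) # chunk mixing both topics

print(round(cosine(query, focused), 3), round(cosine(query, mixed), 3))
```

The mixed chunk sits halfway between both topics, so its similarity to a refunds query drops even though it contains exactly the same refund tokens.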
“How would you debug RAG returning wrong answers?” — Start with chunk boundaries: print the retrieved chunk text, check if the answer span is split, then tune size/overlap/structure before blaming the embedding model or prompt.
Fixed-size chunks
The baseline strategy: split every N tokens (or characters) with M tokens overlap between consecutive chunks. Simple, fast, deterministic—often the first implementation and a useful A/B baseline.
Parameters
| Parameter | Typical value | Effect |
|---|---|---|
| chunk_size | 256–512 tokens | Upper bound on chunk length |
| chunk_overlap | 10–20% of size (50–100 tokens) | Duplicates boundary text in adjacent chunks |
| length_function | Token count (tiktoken), not chars | Aligns with embedder and LLM billing |
Pros and cons
| Pros | Cons |
|---|---|
| Predictable chunk count and storage cost | Cuts mid-sentence, mid-table, mid-code block |
| Easy to parallelize in ingest workers | Overlap inflates index size (same text embedded twice) |
| Works on any plain text without structure | Ignores document semantics (headings, lists) |
Use overlap so facts on boundaries appear whole in at least one chunk. Example: if a rule ends at token 512 and the next chunk starts at 513 without overlap, the embedding at the boundary may miss “annual plans only” attached to the preceding sentence.
import tiktoken
from dataclasses import dataclass

enc = tiktoken.get_encoding("cl100k_base")

@dataclass
class Chunk:
    id: str
    text: str
    token_count: int
    source: str

def token_len(text: str) -> int:
    """Count tokens with the same encoding used for chunking."""
    return len(enc.encode(text))

def chunk_fixed(text: str, source: str, size: int = 512, overlap: int = 64) -> list[Chunk]:
    # If overlap >= size the window would never advance and the loop would not end.
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    tokens = enc.encode(text)
    chunks: list[Chunk] = []
    start, i = 0, 0
    while start < len(tokens):
        end = min(start + size, len(tokens))
        piece = enc.decode(tokens[start:end])
        chunks.append(Chunk(
            id=f"{source}-{i}",
            text=piece,
            token_count=end - start,
            source=source,
        ))
        i += 1
        if end == len(tokens):
            break
        start = end - overlap  # slide window back by overlap
    return chunks
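import java.util.ArrayList;
import java.util.List;

public record TextChunk(String id, String text, int tokenCount, String source) {}

public class FixedSizeChunker {
    private final TokenCounter counter; // wraps jtokkit or provider tokenizer
    private final int size;
    private final int overlap;

    public FixedSizeChunker(TokenCounter counter, int size, int overlap) {
        // overlap >= size would stop the window from advancing
        if (overlap < 0 || overlap >= size)
            throw new IllegalArgumentException("overlap must be in [0, size)");
        this.counter = counter;
        this.size = size;
        this.overlap = overlap;
    }

    public List<TextChunk> chunk(String text, String source) {
        List<Integer> tokens = counter.encode(text);
        List<TextChunk> out = new ArrayList<>();
        int start = 0, i = 0;
        while (start < tokens.size()) {
            int end = Math.min(start + size, tokens.size());
            String piece = counter.decode(tokens.subList(start, end));
            out.add(new TextChunk(source + "-" + i, piece, end - start, source));
            i++;
            if (end == tokens.size()) break;
            start = end - overlap; // slide window back by overlap
        }
        return out;
    }
}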
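More overlap improves recall on boundary facts but increases embedding cost and index size. With a 512-token chunk and 128-token overlap, 25% of each chunk repeats its predecessor and the window advances only 384 tokens at a time, so the index holds about 33% more chunks than a split without overlap. That is acceptable for high-stakes policies, wasteful for millions of log lines. Prefer sentence-aware splits before cranking overlap past 20%.
Estimate index cost: total_tokens × size / (size − overlap) × embed_price_per_million / 1,000,000. The shortcut 1 + overlap/size is close for small overlaps but underestimates as overlap grows. Halving chunk size doubles chunk count, and a fixed token overlap then duplicates a larger share of each chunk, so run the math before shrinking chunks to “improve recall.”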
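The cost estimate can also be computed exactly by counting sliding windows. The sketch below assumes the same windowing as the fixed-size chunker above; the function name `estimate_embed_cost` and the price of 0.02 per million tokens are hypothetical placeholders, not a quoted rate.

```python
def estimate_embed_cost(total_tokens: int, size: int, overlap: int,
                        price_per_million: float) -> dict:
    """Count sliding-window chunks and the tokens actually sent to the embedder."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    stride = size - overlap
    if total_tokens <= size:
        n_chunks = 1 if total_tokens > 0 else 0
    else:
        # Integer ceiling of (total_tokens - size) / stride, plus the first window.
        n_chunks = 1 + -(-(total_tokens - size) // stride)
    # Every chunk after the first re-embeds `overlap` tokens.
    embedded = total_tokens + overlap * max(n_chunks - 1, 0)
    return {
        "chunks": n_chunks,
        "embedded_tokens": embedded,
        "cost": embedded / 1_000_000 * price_per_million,
    }

# Hypothetical corpus of 1M tokens and a placeholder price.
est = estimate_embed_cost(1_000_000, size=512, overlap=64, price_per_million=0.02)
print(est)
```

For this input the embedded-token multiplier comes out near size / (size − overlap) ≈ 1.143, slightly above the 1 + overlap/size shortcut of 1.125.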
Sentence splitting
Pack sentences into token-budget chunks instead of blind windows. Respecting NLTK or spaCy sentence boundaries keeps clauses intact and usually reduces overlap needs.
Boundary detectors
| Tool | Approach | Best for | Caveats |
|---|---|---|---|
| Regex ((?<=[.!?])\s+) | Split on punctuation + whitespace | Prototypes, English prose | Breaks after abbreviations (“Dr. Smith”, “e.g. refunds”); misses boundaries with no whitespace after the period |
| NLTK punkt | Unsupervised sentence tokenizer | General English, lightweight | Extra dependency; download punkt data |
| spaCy (sentencizer or full pipeline) | Linguistic rules + model | Multi-language, legal/clinical text | Heavier; model load time in workers |
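A quick check of the regex detector shows which inputs it actually mishandles. This sketch uses only the standard library; the sample sentence is made up. The pattern splits after an abbreviation followed by a space, while “v2.0” and “3.5” survive because no whitespace follows the period.

```python
import re

# Same pattern as the regex row in the table: punctuation, then whitespace.
NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def split_naive(text: str) -> list[str]:
    return [s for s in NAIVE_BOUNDARY.split(text) if s]

sample = "Dr. Smith approved v2.0 on Monday. Refunds take 3.5 days."
sentences = split_naive(sample)
print(sentences)
```

The false split after “Dr.” is the kind of error a trained tokenizer such as punkt or a spaCy pipeline is meant to avoid.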
Algorithm: split document into sentences → greedily append sentences until adding the next exceeds max_tokens → flush chunk → start new buffer (optionally carry last sentence for overlap).
import nltk
import spacy
from chunking.tokens import token_len  # token counter from the fixed-size section (module path is illustrative)

# Newer NLTK releases may ask for the "punkt_tab" resource instead of "punkt".
nltk.download("punkt", quiet=True)
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def sentences_nltk(text: str) -> list[str]:
    return nltk.sent_tokenize(text)

def sentences_spacy(text: str) -> list[str]:
    return [s.text.strip() for s in nlp(text).sents if s.text.strip()]

def chunk_by_sentences(text: str, source: str, max_tokens: int = 400) -> list[dict]:
    chunks, buf, i = [], [], 0
    for sent in sentences_spacy(text):
        trial = " ".join(buf + [sent])
        if token_len(trial) > max_tokens and buf:
            # Adding this sentence would exceed the budget: flush the buffer first.
            piece = " ".join(buf)
            chunks.append({"id": f"{source}-{i}", "text": piece, "source": source})
            i += 1
            buf = [sent]
        else:
            # Note: a single sentence longer than max_tokens is kept whole.
            buf.append(sent)
    if buf:
        chunks.append({"id": f"{source}-{i}", "text": " ".join(buf), "source": source})
    return chunks
// OpenNLP or Stanford CoreNLP for sentence detection
public class SentenceChunker {
    private final SentenceDetector sentenceDetector;
    private final TokenCounter counter;
    private final int maxTokens;

    public SentenceChunker(SentenceDetector sentenceDetector, TokenCounter counter, int maxTokens) {
        this.sentenceDetector = sentenceDetector;
        this.counter = counter;
        this.maxTokens = maxTokens;
    }

    public List<TextChunk> chunk(String text, String source) {
        String[] sentences = sentenceDetector.sentDetect(text);
        List<TextChunk> chunks = new ArrayList<>();
        List<String> buf = new ArrayList<>();
        int i = 0;
        for (String sent : sentences) {
            // Buffer plus the candidate sentence, joined the same way chunks are emitted
            String trial = buf.isEmpty() ? sent : String.join(" ", buf) + " " + sent;
            if (counter.count(trial) > maxTokens && !buf.isEmpty()) {
                String piece = String.join(" ", buf);
                chunks.add(new TextChunk(source + "-" + i++, piece, counter.count(piece), source));
                buf = new ArrayList<>(List.of(sent));
            } else {
                buf.add(sent);
            }
        }
        if (!buf.isEmpty()) {
            String piece = String.join(" ", buf);
            chunks.add(new TextChunk(source + "-" + i, piece, counter.count(piece), source));
        }
        return chunks;
    }
}
For bullet lists, treat each bullet as a pseudo-sentence (split on \n[-*]) before packing— otherwise one chunk may contain ten unrelated list items and the embedding becomes useless for single-item queries.
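A minimal sketch of the bullet pre-pass, using only the standard library. The function name `split_bullets` is an assumption; the pattern treats a line starting with `-` or `*` followed by whitespace as the start of a new pseudo-sentence, and keeps any intro line as its own unit.

```python
import re

# A bullet marker at the start of a line: optional indent, "-" or "*", then spaces.
BULLET = re.compile(r"^[ \t]*[-*][ \t]+", re.MULTILINE)

def split_bullets(block: str) -> list[str]:
    """Return the intro text (if any) and each bullet item as separate units."""
    return [part.strip() for part in BULLET.split(block) if part.strip()]

block = "Refund rules:\n- Annual: 30 days\n- Monthly: none\n* Trials: n/a"
units = split_bullets(block)
print(units)
```

Each unit can then go through the same greedy packing as ordinary sentences; prefixing the intro line to every item is a reasonable variation when items are ambiguous on their own.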
Notion and Confluence exports often mix headings, bullets, and callouts. Teams often run a structure-aware pre-pass (headers → sections) and then sentence-chunk within each section, because pure sentence splitting on the whole page still merges unrelated sections.
Recursive splitting
LangChain’s RecursiveCharacterTextSplitter tries coarse separators first (paragraph breaks), then finer ones (newlines, sentences, words) until pieces fit the token budget. This is the default in many RAG tutorials for good reason—it handles mixed prose without custom rules per doc type.
Separator hierarchy (typical order)
- "\n\n": paragraph boundary
- "\n": line break (lists, poetry)
- ". ": sentence boundary (fragile but useful fallback)
- " " (single space): word boundary
- "" (empty string): character split, the last resort (CJK text without spaces)
Recursion: split text on the first separator in the list. If any piece exceeds chunk_size, re-split that piece with the next separator. Merge small pieces up to the size limit with overlap applied on the token sequence.
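The recursion can be sketched without any library. This is a simplified illustration, not LangChain's implementation: it omits overlap, measures length with `len` by default (the `length_fn` parameter is an assumption for plugging in a token counter), and keeps each separator attached to the piece on its left so that no text is lost.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ", ""), length_fn=len):
    """Split on the coarsest separator, recurse into oversized pieces, merge small ones."""
    if length_fn(text) <= max_len or not separators:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard character windows (assumes max_len chars fit the budget).
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    parts = text.split(sep)
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]
    chunks, buf = [], ""
    for piece in pieces:
        if length_fn(buf + piece) <= max_len:
            buf += piece                 # still fits: keep merging
            continue
        if buf:
            chunks.append(buf)           # flush the merged buffer
            buf = ""
        if length_fn(piece) <= max_len:
            buf = piece
        else:
            # Piece alone is too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, max_len, finer, length_fn))
    if buf:
        chunks.append(buf)
    return chunks

doc = "Para one is short.\n\nPara two is a bit longer. It has two sentences.\n\nTiny."
raw = recursive_split(doc, max_len=40)
clean = [c.strip() for c in raw if c.strip()]  # drop whitespace-only leftovers
print(clean)
```

Because separators stay attached, joining the raw chunks reproduces the input exactly, which makes a convenient sanity check when experimenting with separator orders.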
from pathlib import Path
from langchain_text_splitters import RecursiveCharacterTextSplitter

# token_len is the tiktoken-based counter defined in the fixed-size section.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    length_function=token_len,
    separators=["\n\n", "\n", ". ", " ", ""],
)
docs = splitter.create_documents(
    texts=[Path("policy.txt").read_text(encoding="utf-8")],
    metadatas=[{"source": "policy.txt"}],
)
chunks = [{"id": f"policy-{i}", "text": d.page_content, **d.metadata}
          for i, d in enumerate(docs)]
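// Spring AI TokenTextSplitter: token-budget splits. Unlike the LangChain splitter above,
// it does not recurse through a separator list and has no overlap setting.
// Parameter meanings follow Spring AI 1.x; check the version you use.
@Bean
TokenTextSplitter recursiveSplitter() {
    return new TokenTextSplitter(
        512,    // chunkSize in tokens
        64,     // minChunkSizeChars: minimum characters before a break point is accepted
        5,      // minChunkLengthToEmbed: shorter chunks are dropped
        10000,  // maxNumChunks
        true    // keepSeparator
    );
}

public List<Document> chunk(String text, Map<String, Object> meta) {
    return recursiveSplitter().split(new Document(text, meta));
}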
When recursive beats fixed-size
| Document shape | Fixed-size | Recursive |
|---|---|---|
| Long paragraphs, few headings | Mid-paragraph cuts common | ✓ Breaks on paragraph first |
| Chat transcripts with short lines | May merge unrelated turns | ✓ Respects line breaks |
| Markdown with code fences | May split inside ``` blocks | ⚠ Still weak—use document-aware |
Store splitter config in versioned YAML: chunk_policy: recursive_v2, chunk_size: 512, separators: [...]. Ingest workers read the same config; changing it triggers a new collection name and backfill job.
A common mistake is setting a character-based chunk_size=1000 while the embedder counts tokens. 1000 characters is roughly 250–400 tokens depending on language, so the real chunk size is not what the config suggests. Always align length_function with your tokenizer (tiktoken with cl100k_base for OpenAI).
Semantic chunking
Instead of fixed windows, embed each sentence (or paragraph), measure cosine distance between consecutive units, and split where similarity drops sharply—a heuristic for topic boundaries.
Algorithm sketch
- Split document into sentences (spaCy / NLTK).
- Embed each sentence with the same model used for retrieval (critical—same vector space).
- Compute cosine distance between sentence i and i+1: 1 − cos(eᵢ, eᵢ₊₁).
- Find breakpoints where distance exceeds a threshold (absolute or percentile—e.g. 95th percentile of distances).
- Merge sentences between breakpoints; sub-split any group still over token budget with recursive splitter.
Libraries: LangChain SemanticChunker, LlamaIndex SemanticSplitterNodeParser, or a 50-line NumPy script for full control.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed_batch(texts: list[str]) -> np.ndarray:
    # One request for all sentences; split into smaller batches for long documents.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def cosine_dist(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_breakpoints(sents: list[str], percentile: float = 95) -> list[int]:
    if len(sents) < 2:
        return []  # no adjacent pairs, so no distances to threshold
    vecs = embed_batch(sents)
    dists = [cosine_dist(vecs[i], vecs[i + 1]) for i in range(len(sents) - 1)]
    thresh = np.percentile(dists, percentile)
    # A breakpoint at i + 1 means "start a new chunk before sentence i + 1".
    return [i + 1 for i, d in enumerate(dists) if d >= thresh]

def chunk_semantic(text: str, source: str) -> list[dict]:
    sents = sentences_spacy(text)  # sentence splitter from the sentence-splitting section
    breaks = set(semantic_breakpoints(sents))
    groups, buf = [], []
    for i, s in enumerate(sents):
        buf.append(s)
        if (i + 1) in breaks:
            groups.append(" ".join(buf))
            buf = []
    if buf:
        groups.append(" ".join(buf))
    return [{"id": f"{source}-{i}", "text": g, "source": source}
            for i, g in enumerate(groups)]
// Spring AI EmbeddingModel + manual breakpoint detection.
// splitSentences, cosine and percentile are helpers not shown here.
public List<TextChunk> semanticChunk(String text, String source, EmbeddingModel embedder) {
    List<String> sents = sentenceChunker.splitSentences(text);
    // Spring AI 1.x returns List<float[]>; earlier milestones used a different type.
    List<float[]> vecs = embedder.embed(sents);
    List<Double> dists = new ArrayList<>();
    for (int i = 0; i < vecs.size() - 1; i++)
        dists.add(1.0 - cosine(vecs.get(i), vecs.get(i + 1)));
    // With fewer than two sentences there are no distances, so nothing can exceed the threshold.
    double thresh = dists.isEmpty() ? Double.MAX_VALUE : percentile(dists, 95);
    List<TextChunk> chunks = new ArrayList<>();
    StringBuilder buf = new StringBuilder();
    int idx = 0;
    for (int i = 0; i < sents.size(); i++) {
        buf.append(sents.get(i)).append(" ");
        if (i < dists.size() && dists.get(i) >= thresh) {
            // tokenCount is left at 0 in this sketch; fill it from your TokenCounter.
            chunks.add(new TextChunk(source + "-" + idx++, buf.toString().trim(), 0, source));
            buf = new StringBuilder();
        }
    }
    if (!buf.isEmpty())
        chunks.add(new TextChunk(source + "-" + idx, buf.toString().trim(), 0, source));
    return chunks;
}
Semantic chunking needs one embedding per sentence at index time, which is expensive on large corpora. Batch aggressively (64–256 sentences per request) and cache sentence embeddings by hash. Use it on high-value docs (policies, contracts) while keeping recursive splitting for logs and tickets.
Breakpoint methods assume adjacent-sentence distance spikes at topic shifts. Gradual transitions (two related sections) may not trigger— combine with header-aware splitting for structured docs. Percentile thresholds adapt per document better than fixed 0.5 cutoffs.
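The difference between a fixed cutoff and a per-document percentile is easy to see on toy numbers. The distances below are invented for illustration, and the small `percentile` helper uses linear interpolation between sorted values, which is an assumption about how you compute it.

```python
def percentile(values: list[float], q: float) -> float:
    """Percentile with linear interpolation between the two nearest sorted values."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def breakpoints(dists: list[float], thresh: float) -> list[int]:
    # Index i + 1 means "start a new chunk before sentence i + 1".
    return [i + 1 for i, d in enumerate(dists) if d >= thresh]

# Hypothetical adjacent-sentence distances for a tightly written document:
# every value is small, but one transition clearly stands out.
dists = [0.10, 0.12, 0.08, 0.35, 0.11, 0.09]

fixed = breakpoints(dists, 0.5)                      # absolute cutoff
adaptive = breakpoints(dists, percentile(dists, 90)) # per-document cutoff
print(fixed, adaptive)
```

The fixed cutoff finds no boundary at all, while the percentile threshold picks out the one transition that is unusual for this document.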
Document-aware chunking
Structured documents carry explicit boundaries—Markdown headings, HTML <h1>–<h3>, code functions and classes. Split on structure first, then apply token limits inside each section.
Splitters by format
| Format | Boundary signal | Tooling |
|---|---|---|
| Markdown | # / ## headers | LangChain MarkdownHeaderTextSplitter |
| HTML | Heading tags, article, section | BeautifulSoup, Unstructured.io |
| Code | AST: functions, classes, methods | tree-sitter, LangChain Language splitters |
| JSON / YAML | Top-level keys or array elements | Custom path-based chunker |
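For the JSON row, a custom path-based chunker can be very small. The sketch below is one possible shape, assuming one chunk per top-level key or array element; the function name and the `json_path` metadata field are illustrative choices.

```python
import json

def chunk_json_top_level(raw: str, source: str) -> list[dict]:
    """Emit one chunk per top-level key (object) or element (array)."""
    data = json.loads(raw)
    if isinstance(data, dict):
        items = [(f"$.{key}", value) for key, value in data.items()]
    elif isinstance(data, list):
        items = [(f"$[{i}]", value) for i, value in enumerate(data)]
    else:
        items = [("$", data)]  # scalar document: a single chunk
    return [{
        "id": f"{source}-{n}",
        # Include the path in the text so key names are searchable too.
        "text": f"{path}: {json.dumps(value)}",
        "source": source,
        "json_path": path,
    } for n, (path, value) in enumerate(items)]

raw = '{"refunds": {"annual_days": 30}, "shipping": ["ground", "air"]}'
json_chunks = chunk_json_top_level(raw, "config.json")
for c in json_chunks:
    print(c["json_path"], "->", c["text"])
```

Values that exceed the token budget would still need a second pass, for example by descending one level deeper along the path.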
import ast
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter

md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")],
)
# token_len is the tiktoken-based counter from the fixed-size section.
token_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50, length_function=token_len)

def chunk_markdown(text: str, source: str) -> list[dict]:
    sections = md_splitter.split_text(text)  # one Document per header-delimited section
    out = []
    for i, doc in enumerate(sections):
        # Enforce the token budget inside each section; header metadata is carried over.
        for j, sub in enumerate(token_splitter.split_documents([doc])):
            out.append({
                "id": f"{source}-{i}-{j}",
                "text": sub.page_content,
                "source": source,
                "section": sub.metadata.get("h2") or sub.metadata.get("h1", ""),
            })
    return out

def chunk_python_source(code: str, source: str) -> list[dict]:
    tree = ast.parse(code)
    chunks = []
    # Only top-level definitions are emitted; methods stay inside their class chunk.
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(code, node)
            chunks.append({"id": f"{source}-{node.name}", "text": segment, "source": source, "symbol": node.name})
    return chunks
// JavaParser for code; JSoup for HTML.
// MarkdownParser, Node, Heading and Paragraph stand in for your Markdown AST library.
// This sketch assumes a TextChunk variant with a fifth metadata field;
// the four-field record from the fixed-size section would need extending.
public List<TextChunk> chunkMarkdown(String md, String source) {
    MarkdownParser parser = MarkdownParser.builder().build();
    Node doc = parser.parse(md);
    List<TextChunk> chunks = new ArrayList<>();
    String currentH2 = "";
    for (Node node : doc.getChildren()) {
        if (node instanceof Heading h && h.getLevel() == 2) {
            currentH2 = h.getText().toString(); // remember the enclosing section
        } else if (node instanceof Paragraph p) {
            String text = p.getText().toString();
            for (TextChunk sub : recursiveSplitter.chunk(text, source)) {
                chunks.add(new TextChunk(sub.id(), sub.text(), sub.tokenCount(), source,
                        Map.of("section", currentH2)));
            }
        }
    }
    return chunks;
}
Preserve heading text in chunk content or metadata so retrieved snippets include context: “## Refunds. Annual plans: full refund within 30 days…” tends to rank better than an orphaned sentence without its section title.
For HTML docs, strip nav/footer boilerplate before chunking—otherwise every chunk repeats cookie banners and site chrome, polluting embeddings. Use readability extractors (trafilatura, Unstructured) in the ingest pipeline.
HTML chunking must not execute scripts. Parse with a sandboxed parser; never eval embedded JS. Redact PII blocks (SSN patterns) before chunking so sensitive spans do not land in vector stores without ACL metadata.
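As a sketch of the redaction step, the snippet below masks US Social Security numbers in the dashed form before text reaches the chunker. The pattern and the placeholder string are assumptions; it does not cover undashed or spaced variants, and production pipelines usually rely on a dedicated PII detector.

```python
import re

# Dashed US SSN form only (e.g. 123-45-6789); other formats are not covered.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact_ssn(text: str, placeholder: str = "[REDACTED-SSN]") -> tuple[str, int]:
    """Return the redacted text and how many spans were replaced."""
    return SSN_PATTERN.subn(placeholder, text)

raw_text = "SSN 123-45-6789 on file; order placed 2024-01-15."
clean_text, n_redacted = redact_ssn(raw_text)
print(clean_text, n_redacted)
```

Logging the replacement count per document gives a cheap signal for which sources need ACL review before they are indexed.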
Agentic chunking
An LLM decides chunk boundaries—given a passage, it returns split points, titles, or JSON segment objects aligned with semantic topics. Highest quality on messy unstructured text; highest cost and latency at ingest.
When it is worth the cost
| Scenario | Agentic? | Rationale |
|---|---|---|
| 10M support tickets/year | ✗ | LLM cost dominates; use recursive + metadata |
| 500-page regulatory PDF (one-time) | ✓ | High stakes; manual review of LLM splits |
| M&A data room | ✓ | Messy scans; budget for quality |
| Internal wiki with clean Markdown | ✗ | Header splitter is cheaper and deterministic |
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class Segment(BaseModel):
    title: str
    start_char: int
    end_char: int

class ChunkPlan(BaseModel):
    segments: list[Segment]

def agentic_chunk(text: str, source: str) -> list[dict]:
    # Only the first 12,000 characters are sent; longer documents need windowing.
    resp = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Split this document into coherent retrieval chunks. Return char offsets.\n\n{text[:12000]}"
        }],
        response_format=ChunkPlan,
    )
    plan = resp.choices[0].message.parsed
    # LLM-produced offsets are not guaranteed to be valid, so drop out-of-range segments.
    valid = [s for s in plan.segments if 0 <= s.start_char < s.end_char <= len(text)]
    return [{
        "id": f"{source}-{i}",
        "text": text[s.start_char:s.end_char],
        "source": source,
        "section": s.title,
    } for i, s in enumerate(valid)]
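// Spring AI ChatClient with JSON schema output.
// Assumes a TextChunk variant with a fifth metadata field.
public List<TextChunk> agenticChunk(String text, String source) {
    // substring(0, 12000) throws on shorter inputs, so clamp to the text length.
    String excerpt = text.substring(0, Math.min(text.length(), 12000));
    ChunkPlan plan = chatClient.prompt()
        .user("Split into coherent RAG chunks with titles and char offsets:\n" + excerpt)
        .call()
        .entity(ChunkPlan.class);
    List<TextChunk> out = new ArrayList<>();
    int i = 0;
    for (Segment s : plan.segments()) {
        // Skip segments whose offsets fall outside the text.
        if (s.startChar() < 0 || s.startChar() >= s.endChar() || s.endChar() > text.length())
            continue;
        out.add(new TextChunk(
            source + "-" + i++,
            text.substring(s.startChar(), s.endChar()),
            0, source,
            Map.of("section", s.title())));
    }
    return out;
}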
A 100-page doc ≈ 50K tokens input + structured output per pass—orders of magnitude more than recursive splitting. Batch docs offline; cache chunk plans by content hash; human-review splits for legal/compliance corpora before embed.
Run agentic chunking as a Kubernetes Job with rate limits and dead-letter queue—never inline on the upload API path. Failed LLM splits should fall back to recursive chunking so ingest does not stall.
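The fallback behaviour can be sketched as a chain of planners tried in order. Everything here is illustrative: planners are assumed to return `(start, end)` character offsets, `flaky_llm_planner` is a stand-in that always fails, and `fixed_planner` plays the role of the cheap deterministic fallback.

```python
def valid_segments(segments, text_len: int) -> bool:
    """Offsets must be in range, ordered, non-overlapping, and non-empty."""
    prev_end = 0
    for start, end in segments:
        if not (prev_end <= start < end <= text_len):
            return False
        prev_end = end
    return len(segments) > 0

def fixed_planner(text: str, width: int = 20):
    return [(i, min(i + width, len(text))) for i in range(0, len(text), width)]

def flaky_llm_planner(text: str):
    raise TimeoutError("LLM call timed out")  # stand-in for a failed agentic split

def chunk_with_fallback(text: str, source: str, planners) -> list[dict]:
    for name, plan in planners:
        try:
            segments = plan(text)
        except Exception:
            continue  # broad catch on purpose: any planner failure means "try the next one"
        if valid_segments(segments, len(text)):
            return [{"id": f"{source}-{i}", "text": text[s:e],
                     "source": source, "chunk_policy": name}
                    for i, (s, e) in enumerate(segments)]
    raise RuntimeError("every chunk planner failed")

doc = "Annual plans: full refund within 30 days of purchase."
result = chunk_with_fallback(doc, "policy", [
    ("agentic", flaky_llm_planner),
    ("fixed", fixed_planner),
])
print([c["chunk_policy"] for c in result])
```

Recording which planner produced each chunk makes it possible to find and re-process documents that fell back once the LLM path recovers.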
Size guidelines
Optimal chunk size depends on query type and downstream LLM use. Short chunks improve precision for factoid Q&A; longer chunks preserve narrative for summarization.
| Use case | Token range | Overlap | Splitter preference |
|---|---|---|---|
| Q&A / FAQ retrieval | 256–512 | 10–15% | Sentence or header-aware |
| Summarization / narrative | 1024–2048 | 5–10% | Paragraph / semantic groups |
| Code search | One function or class | Minimal; split mid-function only if huge | AST / tree-sitter |
| Tables / structured rows | Row group or full table | Header row repeated in each chunk | Custom—never mid-row |
| Chat / ticket threads | 1–3 turns (~200–400) | Last turn of prev chunk | Turn-boundary splitter |
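For the table row, a minimal sketch of a row-group chunker that repeats the header in every chunk. The function name, the `" | "` cell separator, and the sample rows are assumptions for illustration.

```python
def chunk_table(rows: list[list[str]], rows_per_chunk: int, source: str) -> list[dict]:
    """Group body rows and prepend the header row to each group; never cuts mid-row."""
    header, body = rows[0], rows[1:]
    chunks = []
    for n, start in enumerate(range(0, len(body), rows_per_chunk)):
        group = body[start:start + rows_per_chunk]
        lines = [" | ".join(header)] + [" | ".join(row) for row in group]
        chunks.append({"id": f"{source}-t{n}", "text": "\n".join(lines), "source": source})
    return chunks

rows = [
    ["plan", "refund window"],
    ["annual", "30 days"],
    ["monthly", "none"],
    ["trial", "n/a"],
]
table_chunks = chunk_table(rows, rows_per_chunk=2, source="pricing")
for c in table_chunks:
    print(c["id"], repr(c["text"]))
```

Repeating the header costs a few tokens per chunk but keeps column meaning attached to every row the retriever might return.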
Code: chunk by function / class
| Language | Unit | Max tokens if oversized | Metadata field |
|---|---|---|---|
| Python | FunctionDef, ClassDef | 512 then recursive | symbol, lineno |
| Java | Method, class | 512 then recursive | qualified_name |
| TypeScript | Function, class, export | 512 | file_path |
| Go | Function, struct | 512 | package |
Tune with evals, not intuition: start at 512 for general docs, run recall@5 on 50 labeled questions, then sweep {256, 512, 768, 1024} and plot the curve—most corpora show a plateau then drop when chunks get too large.
“What chunk size would you use?” — Answer with the use case: “For support FAQ I’d start at 400 tokens sentence-aware; for code, one function per chunk; I’d validate with recall@K on real queries before locking config.”
Expose chunk size and policy as Terraform variables or Helm values per environment— chunk_size_dev=256 for fast iteration, chunk_size_prod=512 after eval sign-off. Prevent accidental prod re-index with different params.
Metadata
Every chunk carries metadata for filtered retrieval, citation UI, and debugging. Without source, page, and section, users cannot trust answers—and you cannot scope queries by date or tenant.
Essential fields
| Field | Example | Used for |
|---|---|---|
| source | refund-policy.pdf | Citation label, dedupe on re-ingest |
| page | 12 | Deep link to PDF viewer |
| section | Refunds | Filtered search, breadcrumb in UI |
| updated_at | 2026-03-15T10:00:00Z | Recency boost, stale chunk purge |
| acl / tenant_id | team_42 | Pre-filter before ANN search |
| chunk_policy | recursive_v3_512 | Audit which split produced vector |
from datetime import datetime, timezone

def enrich_metadata(chunk: dict, doc_meta: dict) -> dict:
    return {
        "source": chunk["source"],
        "page": doc_meta.get("page"),
        "section": chunk.get("section", ""),
        "updated_at": doc_meta.get("updated_at", datetime.now(timezone.utc).isoformat()),
        "tenant_id": doc_meta["tenant_id"],  # required: fail loudly if missing
        "tokens": token_len(chunk["text"]),
        "chunk_policy": "recursive_v3_512",
    }

# Pinecone upsert (index and embed are assumed to be set up elsewhere)
index.upsert(vectors=[{
    "id": c["id"],
    "values": embed(c["text"]),
    "metadata": enrich_metadata(c, doc_meta),
} for c in chunks])
# pgvector: metadata in JSONB column alongside embedding
// Assumes TextChunk exposes section(); Map.of rejects null values, so page must be non-null.
public Map<String, Object> enrichMetadata(TextChunk chunk, DocMeta doc) {
    return Map.of(
        "source", chunk.source(),
        "page", doc.page(),
        "section", chunk.section(),
        "updated_at", doc.updatedAt().toString(),
        "tenant_id", doc.tenantId(),
        "tokens", chunk.tokenCount(),
        "chunk_policy", "recursive_v3_512"
    );
}

// Query with metadata filter. The builder and filter string are illustrative:
// filter syntax differs per store (Pinecone, for example, takes JSON operators such as $eq).
SearchRequest req = SearchRequest.builder()
    .queryVector(queryEmb)
    .topK(10)
    .filter("tenant_id = 'team_42' AND updated_at > '2026-01-01'")
    .build();
Perplexity-style citations show title + URL from metadata, not hallucinated links. Stripe’s internal doc search filters by team ACL at the vector store layer—chunks without tenant_id never enter the index.
Metadata filters are part of authorization—do not rely on post-retrieval filtering alone. A misconfigured ANN query without tenant filter leaks cross-customer chunks into the LLM context.
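A toy in-memory example shows why the filter has to run before the nearest-neighbor search rather than after it. The vectors, tenant names, and helper functions are invented for illustration; a real store applies the filter inside the index.

```python
def top_k(query: list[float], rows: list[dict], k: int) -> list[dict]:
    """Rank rows by dot product with the query and keep the best k."""
    score = lambda r: sum(q * v for q, v in zip(query, r["vec"]))
    return sorted(rows, key=score, reverse=True)[:k]

def search_prefilter(query, rows, tenant: str, k: int) -> list[str]:
    allowed = [r for r in rows if r["tenant_id"] == tenant]  # filter first
    return [r["id"] for r in top_k(query, allowed, k)]

def search_postfilter(query, rows, tenant: str, k: int) -> list[str]:
    hits = top_k(query, rows, k)  # other tenants' chunks are fetched here
    return [r["id"] for r in hits if r["tenant_id"] == tenant]

rows = [
    {"id": "a1", "tenant_id": "team_42", "vec": [0.6, 0.8]},
    {"id": "a2", "tenant_id": "team_42", "vec": [0.5, 0.5]},
    {"id": "b1", "tenant_id": "team_7", "vec": [1.0, 0.0]},
    {"id": "b2", "tenant_id": "team_7", "vec": [0.9, 0.1]},
]
query = [1.0, 0.0]

pre = search_prefilter(query, rows, "team_42", k=2)
post = search_postfilter(query, rows, "team_42", k=2)
print(pre, post)
```

Post-filtering here returns nothing for team_42 because another tenant's chunks filled the top-K, and those foreign chunks were already loaded into application memory before being discarded.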
Parent–child chunks
Index small child chunks for precise retrieval; when a child hits, return the parent (section or page) to the LLM for generation. Best of both worlds: sharp search, rich context.
Pattern
- Parent — logical unit (Markdown section, PDF page, full policy clause) stored in object storage or parent table.
- Child — 128–256 token sub-splits embedded and indexed in the vector store.
- Link — each child metadata includes parent_id.
- Query — ANN on children → dedupe parent IDs → fetch parent text for prompt.
| Layer | Typical size | Stored in | Role |
|---|---|---|---|
| Child | 128–256 tokens | Vector DB (with embedding) | Similarity search |
| Parent | 512–2048 tokens | Postgres / S3 / KV | LLM context assembly |
from dataclasses import asdict

def build_parent_child(sections: list[dict]) -> tuple[list[dict], list[dict]]:
    parents, children = [], []
    for sec in sections:
        pid = f"{sec['source']}#{sec['section']}"
        parents.append({"id": pid, "text": sec["text"], "source": sec["source"]})
        for child in chunk_fixed(sec["text"], pid, size=256, overlap=32):
            # chunk_fixed returns Chunk dataclasses; convert to dicts before adding the link.
            children.append({**asdict(child), "parent_id": pid})
    return parents, children

def retrieve_with_parents(query: str, k: int = 5) -> list[str]:
    # vector_search, embed and parent_store are assumed to be provided by your stack.
    child_hits = vector_search(embed(query), k=k * 3)  # over-fetch
    # dict.fromkeys dedupes parent IDs while keeping best-hit-first order.
    parent_ids = list(dict.fromkeys(h["metadata"]["parent_id"] for h in child_hits))[:k]
    return [parent_store[pid]["text"] for pid in parent_ids]
public List<String> retrieveWithParents(float[] queryEmb, int k) {
    List<ChildHit> childHits = vectorStore.similaritySearch(queryEmb, k * 3); // over-fetch
    // LinkedHashSet dedupes parent IDs while keeping best-hit-first order.
    LinkedHashSet<String> parentIds = new LinkedHashSet<>();
    for (ChildHit hit : childHits) {
        parentIds.add(hit.metadata().get("parent_id"));
        if (parentIds.size() >= k) break;
    }
    return parentIds.stream().map(parentStore::getText).toList();
}
Parent–child adds storage complexity and duplicate text (parent blob + child slices). Worth it when recall improves measurably on golden sets—especially for long policy sections where 512-token chunks dilute embeddings.
Deduplicate parent IDs before loading context—three child hits from the same section should not triple the parent text in the prompt.
What this looks like in production
Chunk policy is versioned configuration, not a notebook one-liner. Promote changes only after chunking eval on golden documents—labeled questions with expected chunk IDs or answer spans.
Chunking eval workflow
- Golden docs — 20–50 representative sources (policies, APIs, runbooks) with known tricky boundaries.
- Golden queries — real user questions; label gold_chunk_ids or acceptable parent IDs.
- Sweep policies — fixed 512, sentence 400, recursive 512, header-aware; one variable at a time.
- Metrics — recall@K, MRR, answer faithfulness when passed to LLM (optional end-to-end).
- Promote — bump chunk_policy version; new collection; backfill job; alias swap.
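Of the metrics listed, MRR rewards putting the first correct chunk near the top of the ranking. A minimal sketch, with made-up chunk IDs and function names chosen for illustration:

```python
def reciprocal_rank(gold_ids: set[str], retrieved_ids: list[str]) -> float:
    """1 / rank of the first retrieved chunk that is in the gold set, else 0."""
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in gold_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs: list[tuple[set[str], list[str]]]) -> float:
    if not runs:
        return 0.0
    return sum(reciprocal_rank(gold, got) for gold, got in runs) / len(runs)

# Hypothetical results for three golden queries: (gold IDs, ranked retrieved IDs).
runs = [
    ({"c2"}, ["c1", "c2", "c3"]),  # first hit at rank 2 -> 0.5
    ({"c9"}, ["c9"]),              # first hit at rank 1 -> 1.0
    ({"c5"}, ["c1"]),              # no hit              -> 0.0
]
mrr = mean_reciprocal_rank(runs)
print(mrr)
```

Tracking MRR next to recall@K helps distinguish a policy that finds the right chunk from one that also ranks it first.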
Ingest pipeline checklist
- Token-based limits aligned with embedding model tokenizer—not raw character counts
- Structure-aware split before overlap tuning (headers, AST, sentences)
- Metadata: source, section, updated_at, ACL on every chunk
- Content-hash skip for unchanged docs on re-ingest
- Chunk policy logged in collection metadata and chunk rows
- Fallback chain: agentic → recursive → fixed-size on failure
- Metrics: chunks_per_doc histogram, embed cost, ingest latency p95
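The content-hash skip from the checklist can be sketched with the standard library. Folding the chunk policy name into the hash is a design assumption made here so that a policy change forces re-chunking even when the text is unchanged; `seen` stands in for whatever store holds the previous hashes.

```python
import hashlib

def content_key(text: str, policy: str) -> str:
    """Hash the policy name together with the text."""
    return hashlib.sha256(f"{policy}\n{text}".encode("utf-8")).hexdigest()

def docs_to_reingest(docs: dict[str, str], seen: dict[str, str], policy: str) -> list[str]:
    """Return IDs whose content or policy changed, updating `seen` in place."""
    changed = []
    for doc_id, text in docs.items():
        key = content_key(text, policy)
        if seen.get(doc_id) != key:
            changed.append(doc_id)
            seen[doc_id] = key
    return changed

seen: dict[str, str] = {}
docs = {"refunds.md": "Annual plans: 30 days.", "shipping.md": "Ships in 2 days."}

first = docs_to_reingest(docs, seen, "recursive_v3_512")   # everything is new
second = docs_to_reingest(docs, seen, "recursive_v3_512")  # nothing changed
docs["refunds.md"] = "Annual plans: 45 days."
third = docs_to_reingest(docs, seen, "recursive_v3_512")   # one doc edited
fourth = docs_to_reingest(docs, seen, "sentence_v4_400")   # policy bump
print(first, second, third, fourth)
```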
Regression gate (CI)
Store golden queries in git; CI job chunks golden docs with candidate policy, embeds (or mocks), runs retrieval, fails if recall@5 drops more than 2 points vs baseline. Block merges that change chunk_size without updated eval artifacts.
def recall_at_k(gold_ids: set[str], retrieved_ids: list[str], k: int = 5) -> float:
    if not gold_ids:
        return 0.0  # unlabeled query: avoid dividing by zero
    top = set(retrieved_ids[:k])
    return len(gold_ids & top) / len(gold_ids)

def eval_policy(chunks: list[dict], golden: list[dict], k: int = 5) -> float:
    # vector_search and embed are assumed helpers from your retrieval stack.
    scores = []
    for row in golden:
        hits = vector_search(embed(row["query"]), index=chunks, k=k)
        scores.append(recall_at_k(set(row["gold_chunk_ids"]), [h["id"] for h in hits], k))
    return sum(scores) / len(scores)

# Compare policies before prod promotion.
# Chunk IDs differ per policy, so gold_chunk_ids must be resolved per policy
# (for example from content spans) for the comparison to be fair.
for name, chunker in [("fixed512", chunk_fixed), ("sentence400", chunk_by_sentences)]:
    chunks = chunker(corpus_text, "golden.pdf")
    print(name, eval_policy(chunks, golden_queries))
public double recallAtK(Set<String> gold, List<String> retrieved, int k) {
    if (gold.isEmpty()) return 0.0; // unlabeled query: avoid dividing by zero
    Set<String> top = retrieved.stream().limit(k).collect(Collectors.toSet());
    long hit = gold.stream().filter(top::contains).count();
    return (double) hit / gold.size();
}

@Test
void chunkPolicyRegression() {
    double baseline = evalPolicy(fixedChunker, goldenQueries);
    double candidate = evalPolicy(sentenceChunker, goldenQueries);
    // Fail if recall@5 drops by more than 2 points against the baseline.
    assertThat(candidate).isGreaterThanOrEqualTo(baseline - 0.02);
}
Changing chunk overlap alone changes chunk IDs—your golden gold_chunk_ids labels become invalid. Label by content span or parent section ID, not ephemeral chunk indices, when policies churn frequently.
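Span-based labels can be resolved to chunk IDs at eval time, whatever the policy. The sketch below assumes each chunk records its `start` and `end` character offsets in the source document, which the chunkers earlier in this guide do not emit by default; `window_chunks` is only a stand-in for a real chunker.

```python
def window_chunks(text: str, size: int, policy: str) -> list[dict]:
    """Stand-in chunker that records character offsets for each chunk."""
    return [{"id": f"{policy}-{n}", "start": s, "end": min(s + size, len(text))}
            for n, s in enumerate(range(0, len(text), size))]

def gold_ids_for_span(chunks: list[dict], span_start: int, span_end: int) -> list[str]:
    """IDs of every chunk that overlaps the labeled answer span."""
    return [c["id"] for c in chunks if c["start"] < span_end and c["end"] > span_start]

text = "Annual plans: full refund within 30 days of purchase. Monthly plans: cancel anytime."
answer = "full refund within 30 days"
span_start = text.index(answer)
span_end = span_start + len(answer)

gold_small = gold_ids_for_span(window_chunks(text, 30, "w30"), span_start, span_end)
gold_large = gold_ids_for_span(window_chunks(text, 50, "w50"), span_start, span_end)
print(gold_small, gold_large)
```

A span that maps to two chunks under one policy and one chunk under another is itself a useful signal: the first policy is splitting the answer.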
Teams store CHUNK_POLICY=v4_sentence_400 in config; ingest workers read it; PRs that modify chunk config must attach recall@5 from the regression set—same discipline as prompt versioning.
Example prod config block:
chunk_policy: markdown_header_recursive
child_size: 256
parent_max: 2048
collection: support_kb_v7