Chunking strategies

Vector search does not read your PDF—it reads chunks. Each chunk becomes one embedding vector. If a refund rule is split across two chunks, a question about refunds may retrieve only half the policy and the LLM will guess the rest. Chunking is the highest-leverage knob in RAG after model choice: it determines whether retrieval finds the right evidence at all.

After reading, you should be able to: explain why embedding quality depends on chunk coherence; implement fixed-size, sentence, recursive, semantic, and document-aware splitters in Python and Java; choose chunk sizes by use case; attach metadata for filtered retrieval and citations; design parent–child indexes; and evaluate chunk policy on golden documents before re-indexing production.

developer platform architect Track 2 LangChain 0.3+ Spring AI 1.0+ tiktoken RecursiveCharacterTextSplitter

Why chunking matters

Embedding quality depends on chunk coherence. An embedding model compresses an entire chunk into one vector. If the chunk mixes unrelated topics, the vector becomes an ambiguous average and similarity scores flatten. Good chunks are self-contained units of meaning that a user question can match precisely.

Consider a support policy document: “Annual plans: full refund within 30 days of purchase. Monthly plans: cancel anytime; no refund for partial months.” A naive 200-character split might cut after “30 days of” and leave “purchase. Monthly plans…” in the next chunk. A user asking “Can I get a refund on my yearly subscription?” retrieves the first chunk, which breaks off before saying what the 30 days are counted from and omits the monthly exclusion, so the LLM may hallucinate eligibility.

What a good chunk optimizes for

Property | Why it matters | Failure mode when ignored
Coherence | One topic or rule per vector | Embedding dilution; wrong chunk wins on cosine
Completeness | Conditions stay together (time window + plan type) | LLM fills gaps from parametric knowledge
Retrievability | Chunk text overlaps vocabulary in real questions | Correct doc never surfaces in top-K
Context budget | Small enough that several chunks fit in the LLM window | Truncated context; dropped citations
Traceability | Metadata maps chunk → source page/section | Users cannot verify answers; compliance gaps

Chunking sits between ingest and embed

  1. Extract — PDF, HTML, code, tickets; normalize to text with layout hints where possible.
  2. Split — apply a chunk policy (this guide); emit text + metadata records.
  3. Embed — batch chunks through the embedding model; store vectors with metadata.
  4. Retrieve — query embedding vs chunk vectors; optional rerank; pass top chunks to LLM.

Changing chunk policy after launch requires re-embedding (new vectors)—not an in-place patch. Version your collection (support_kb_v3_sentence_512) and measure recall before cutover.

⚠️ Pitfall

Treating chunk size as “whatever fits the embedding model max length.” Many hosted embedding APIs accept around 8K tokens per input, but retrieval quality usually peaks at a few hundred tokens per chunk, because long chunks bury the one sentence that matched the query. Split for search, then optionally expand to parent context at generation time (see parent–child).

🔬 Under the Hood

Bi-encoder embeddings are mean-pooled (or CLS-pooled) over all tokens in the chunk. Attention mixes token representations—outlier sentences pull the vector toward their topic. That is why mixing “Refunds” and “Shipping” in one chunk hurts both retrieval paths.
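
The dilution effect can be illustrated with a toy calculation. This is a minimal sketch that assumes plain mean pooling over hand-picked 2-D vectors (hypothetical values); real encoders also apply attention before pooling, so treat the numbers as an illustration only.

```python
import math

def mean_pool(vectors: list[list[float]]) -> list[float]:
    """Average sentence vectors component-wise (stand-in for chunk pooling)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Toy "topic space": axis 0 = refunds, axis 1 = shipping (hypothetical values).
refund_sentence = [1.0, 0.0]
shipping_sentence = [0.0, 1.0]
query_about_refunds = [1.0, 0.0]

focused_chunk = mean_pool([refund_sentence, refund_sentence])
mixed_chunk = mean_pool([refund_sentence, shipping_sentence])

focused_score = cosine(query_about_refunds, focused_chunk)
mixed_score = cosine(query_about_refunds, mixed_chunk)
print(round(focused_score, 3), round(mixed_score, 3))  # prints: 1.0 0.707
```

The mixed chunk still matches the refund query, but with a lower score, so a focused chunk from another document can outrank it.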

🎯 Interview Tip

“How would you debug RAG returning wrong answers?” — Start with chunk boundaries: print the retrieved chunk text, check if the answer span is split, then tune size/overlap/structure before blaming the embedding model or prompt.

Fixed-size chunks

The baseline strategy: split every N tokens (or characters) with M tokens overlap between consecutive chunks. Simple, fast, deterministic—often the first implementation and a useful A/B baseline.

Parameters

Parameter | Typical value | Effect
chunk_size | 256–512 tokens | Upper bound on chunk length
chunk_overlap | 10–20% of size (50–100 tokens) | Duplicates boundary text in adjacent chunks
length_function | Token count (tiktoken), not chars | Aligns with embedder and LLM billing

Pros and cons

Pros | Cons
Predictable chunk count and storage cost | Cuts mid-sentence, mid-table, mid-code block
Easy to parallelize in ingest workers | Overlap inflates index size (same text embedded twice)
Works on any plain text without structure | Ignores document semantics (headings, lists)

Use overlap so facts on boundaries appear whole in at least one chunk. Example: if a rule straddles token 512 and the next chunk starts at 513 without overlap, neither chunk holds the whole rule; the qualifier “annual plans only” lands in one chunk and the sentence it modifies in the other.

Fixed-size token chunks with overlap
import tiktoken
from dataclasses import dataclass

enc = tiktoken.get_encoding("cl100k_base")

@dataclass
class Chunk:
    id: str
    text: str
    token_count: int
    source: str

def token_len(text: str) -> int:
    return len(enc.encode(text))

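def chunk_fixed(text: str, source: str, size: int = 512, overlap: int = 64) -> list[Chunk]:
    # overlap >= size would stop the window from advancing (infinite loop)
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    tokens = enc.encode(text)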
    chunks: list[Chunk] = []
    start, i = 0, 0
    while start < len(tokens):
        end = min(start + size, len(tokens))
        piece = enc.decode(tokens[start:end])
        chunks.append(Chunk(
            id=f"{source}-{i}",
            text=piece,
            token_count=end - start,
            source=source,
        ))
        i += 1
        if end == len(tokens):
            break
        start = end - overlap  # slide window back by overlap
    return chunks
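// Java: same sliding window; TokenCounter is an assumed wrapper interface
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// metadata holds optional fields such as "section"; the 4-arg constructor covers chunks without any
public record TextChunk(String id, String text, int tokenCount, String source,
                        Map<String, Object> metadata) {
  public TextChunk(String id, String text, int tokenCount, String source) {
    this(id, text, tokenCount, source, Map.of());
  }
  public String section() {
    return String.valueOf(metadata.getOrDefault("section", ""));
  }
}

public class FixedSizeChunker {
  private final TokenCounter counter; // wraps jtokkit or provider tokenizer
  private final int size;
  private final int overlap;

  public FixedSizeChunker(TokenCounter counter, int size, int overlap) {
    if (overlap < 0 || overlap >= size)
      throw new IllegalArgumentException("overlap must satisfy 0 <= overlap < size");
    this.counter = counter;
    this.size = size;
    this.overlap = overlap;
  }
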
  public List<TextChunk> chunk(String text, String source) {
    List<Integer> tokens = counter.encode(text);
    List<TextChunk> out = new ArrayList<>();
    int start = 0, i = 0;
    while (start < tokens.size()) {
      int end = Math.min(start + size, tokens.size());
      String piece = counter.decode(tokens.subList(start, end));
      out.add(new TextChunk(source + "-" + i, piece, end - start, source));
      i++;
      if (end == tokens.size()) break;
      start = end - overlap;
    }
    return out;
  }
}
⚖️ Trade-off

More overlap improves recall on boundary facts but raises embed cost and index size by a factor of size / (size − overlap). A 512-token chunk with 128-token overlap duplicates 25% of every chunk, which works out to about 33% more embedded tokens: acceptable for high-stakes policies, wasteful for millions of log lines. Prefer sentence-aware splits before cranking overlap past 20%.

💰 Cost

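Estimate index cost: (total_tokens / 1,000,000) × size / (size − overlap) × embed_price_per_million. For small overlaps the middle factor is close to 1 + overlap/size. Halving chunk size doubles chunk count, and if the overlap stays fixed in tokens it also raises the duplicated share, so run the math before shrinking chunks to “improve recall.”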
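
The estimate can be checked exactly for a sliding window. The sketch below assumes the same windowing as the fixed-size chunker above (stride = size − overlap, last chunk truncated); the function names and the price argument are illustrative, not from any library.

```python
import math

def chunk_count(total_tokens: int, size: int, overlap: int) -> int:
    """Number of sliding-window chunks for one document."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if total_tokens <= size:
        return 1 if total_tokens > 0 else 0
    stride = size - overlap
    return 1 + math.ceil((total_tokens - size) / stride)

def embedded_tokens(total_tokens: int, size: int, overlap: int) -> int:
    """Tokens actually sent to the embedder, counting overlap duplicates."""
    n = chunk_count(total_tokens, size, overlap)
    if n <= 1:
        return total_tokens
    # Every chunk after the first re-embeds `overlap` tokens.
    return total_tokens + (n - 1) * overlap

def embed_cost(total_tokens: int, size: int, overlap: int, price_per_million: float) -> float:
    """Cost in the currency of price_per_million (a hypothetical price input)."""
    return embedded_tokens(total_tokens, size, overlap) / 1_000_000 * price_per_million
```

For a single 1,000,000-token document at size 512 and overlap 128, this gives 2,604 chunks and 1,333,184 embedded tokens, compared with 1,250,000 from the 1 + overlap/size shortcut.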

Sentence splitting

Pack sentences into token-budget chunks instead of blind windows. Respecting NLTK or spaCy sentence boundaries keeps clauses intact and usually reduces overlap needs.

Boundary detectors

Tool | Approach | Best for | Caveats
Regex ((?<=[.!?])\s+) | Split on punctuation + whitespace | Prototypes, English prose | Breaks on abbreviations such as “Dr. Smith”; “v2.0” and decimals survive only because the pattern requires whitespace after the punctuation
NLTK punkt | Unsupervised sentence tokenizer | General English, lightweight | Extra dependency; download punkt data (punkt_tab on NLTK 3.9+)
spaCy (sentencizer or full pipeline) | Linguistic rules + model | Multi-language, legal/clinical text | Heavier; model load time in workers

Algorithm: split document into sentences → greedily append sentences until adding the next exceeds max_tokens → flush chunk → start new buffer (optionally carry last sentence for overlap).
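
The optional overlap step (carrying the last sentence into the next chunk) can be sketched with the standard library alone. This sketch assumes a naive regex splitter and a whitespace word count as a stand-in tokenizer; `pack_sentences` and `carry_last` are illustrative names.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive regex splitter; breaks on abbreviations such as "Dr." (prototype only)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def count_tokens(text: str) -> int:
    """Stand-in tokenizer: whitespace words. Swap in the embedder's tokenizer."""
    return len(text.split())

def pack_sentences(text: str, max_tokens: int = 400, carry_last: bool = True) -> list[str]:
    chunks: list[str] = []
    buf: list[str] = []
    for sent in split_sentences(text):
        if buf and count_tokens(" ".join(buf + [sent])) > max_tokens:
            chunks.append(" ".join(buf))  # flush the full buffer
            carried = buf[-1]
            # Carry the last sentence forward only if it leaves room for the new one.
            if carry_last and count_tokens(carried + " " + sent) <= max_tokens:
                buf = [carried, sent]
            else:
                buf = [sent]
        else:
            buf.append(sent)
    if buf:
        chunks.append(" ".join(buf))
    return chunks
```

A single sentence longer than max_tokens still becomes its own oversized chunk here; a production splitter would sub-split it.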

Sentence-aware chunking — NLTK and spaCy
import nltk
import spacy
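from chunking.tokens import token_len  # hypothetical module holding token_len from the fixed-size section

# NLTK 3.9+ loads "punkt_tab"; older releases load "punkt"
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)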
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def sentences_nltk(text: str) -> list[str]:
    return nltk.sent_tokenize(text)

def sentences_spacy(text: str) -> list[str]:
    return [s.text.strip() for s in nlp(text).sents if s.text.strip()]

def chunk_by_sentences(text: str, source: str, max_tokens: int = 400) -> list[dict]:
    chunks, buf, i = [], [], 0
    for sent in sentences_spacy(text):
        trial = " ".join(buf + [sent])
        if token_len(trial) > max_tokens and buf:
            piece = " ".join(buf)
            chunks.append({"id": f"{source}-{i}", "text": piece, "source": source})
            i += 1
            buf = [sent]
        else:
            buf.append(sent)
    if buf:
        chunks.append({"id": f"{source}-{i}", "text": " ".join(buf), "source": source})
    return chunks
// OpenNLP or Stanford CoreNLP for sentence detection
public class SentenceChunker {
  private final SentenceDetector sentenceDetector;
  private final TokenCounter counter;
  private final int maxTokens;

  public List<TextChunk> chunk(String text, String source) {
    String[] sentences = sentenceDetector.sentDetect(text);
    List<TextChunk> chunks = new ArrayList<>();
    List<String> buf = new ArrayList<>();
    int i = 0;
    for (String sent : sentences) {
      // Build the candidate text without relying on an undefined concat helper
      String trial = buf.isEmpty() ? sent : String.join(" ", buf) + " " + sent;
      if (counter.count(trial) > maxTokens && !buf.isEmpty()) {
        String piece = String.join(" ", buf);
        chunks.add(new TextChunk(source + "-" + i++, piece, counter.count(piece), source));
        buf = new ArrayList<>(List.of(sent));
      } else {
        buf.add(sent);
      }
    }
    if (!buf.isEmpty()) {
      String piece = String.join(" ", buf);
      chunks.add(new TextChunk(source + "-" + i, piece, counter.count(piece), source));
    }
    return chunks;
  }
}
💡 Pro Tip

For bullet lists, treat each bullet as a pseudo-sentence (split on \n[-*]) before packing. Otherwise one chunk may contain ten unrelated list items, and the embedding becomes useless for single-item queries.
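
A minimal sketch of that pre-split, assuming Markdown-style "-" or "*" markers followed by a space; `split_bullets` is an illustrative helper name.

```python
import re

def split_bullets(block: str) -> list[str]:
    """Split a Markdown-style list into one string per bullet item."""
    # Split at newlines that are followed by a "-" or "*" marker and a space.
    parts = re.split(r"\n(?=[ \t]*[-*] )", block.strip())
    # Strip the marker itself so only the item text is packed and embedded.
    return [re.sub(r"^[ \t]*[-*] +", "", p).strip() for p in parts if p.strip()]
```

Each returned item can then go through the sentence packer as if it were one sentence.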

📦 Real World

Notion and Confluence exports often mix headings, bullets, and callouts. Teams often run a structure-aware pre-pass (headers → sections), then sentence chunking within each section; pure sentence splitting on the whole page still merges unrelated sections.

Recursive splitting

LangChain’s RecursiveCharacterTextSplitter tries coarse separators first (paragraph breaks), then finer ones (newlines, sentences, words) until pieces fit the token budget. This is the default in many RAG tutorials for good reason—it handles mixed prose without custom rules per doc type.

Separator hierarchy (typical order)

  1. "\n\n": paragraph boundary
  2. "\n": line break (lists, poetry)
  3. ". ": sentence boundary (fragile but useful fallback)
  4. " ": word boundary
  5. "": character split, the last resort (CJK text without spaces)

Recursion: split text on the first separator in the list. If any piece exceeds chunk_size, re-split that piece with the next separator. Merge adjacent small pieces up to the size limit; chunk_overlap is filled by carrying trailing pieces into the next chunk, with all lengths measured by length_function.
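
The recursion can be sketched from scratch to make the control flow visible. This is a simplified illustration, not LangChain's implementation: it omits overlap, drops the separator at split points, and counts whitespace words as tokens.

```python
def count_tokens(text: str) -> int:
    """Stand-in tokenizer: whitespace words."""
    return len(text.split())

def recursive_split(text: str, max_tokens: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    if count_tokens(text) <= max_tokens:
        return [text.strip()] if text.strip() else []
    if not separators:
        # Last resort: hard split on words.
        words = text.split()
        return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]
    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    buf = ""
    for piece in text.split(sep):
        if not piece.strip():
            continue
        if count_tokens(piece) > max_tokens:
            # Oversized piece: flush what we have, then retry with finer separators.
            if buf:
                chunks.append(buf)
                buf = ""
            chunks.extend(recursive_split(piece, max_tokens, finer))
            continue
        candidate = piece if not buf else buf + sep + piece
        if count_tokens(candidate) > max_tokens:
            chunks.append(buf)  # buf is non-empty here because piece alone fits
            buf = piece
        else:
            buf = candidate     # merge small neighbours up to the budget
    if buf:
        chunks.append(buf)
    return chunks
```

Short paragraphs are merged at the coarse level, and only the paragraph that exceeds the budget falls through to word-level splitting.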

RecursiveCharacterTextSplitter — LangChain and Spring AI
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    length_function=token_len,
    separators=["\n\n", "\n", ". ", " ", ""],
)

docs = splitter.create_documents(
    texts=[open("policy.txt").read()],
    metadatas=[{"source": "policy.txt"}],
)
chunks = [{"id": f"policy-{i}", "text": d.page_content, **d.metadata}
          for i, d in enumerate(docs)]
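// Spring AI TokenTextSplitter: token-based splitting. It does not walk a separator
// hierarchy and has no overlap setting, so treat it as the closest built-in, not an exact equivalent.
@Bean
TokenTextSplitter recursiveSplitter() {
  return new TokenTextSplitter(
      512,   // chunkSize: target chunk size in tokens
      64,    // minChunkSizeChars: minimum chunk size in characters (not an overlap)
      5,     // minChunkLengthToEmbed: drop chunks shorter than this
      10000, // maxNumChunks: cap on chunks per document
      true   // keepSeparator
  );
}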

public List<Document> chunk(String text, Map<String, Object> meta) {
  return recursiveSplitter().split(new Document(text, meta));
}

When recursive beats fixed-size

Document shape | Fixed-size | Recursive
Long paragraphs, few headings | Mid-paragraph cuts common | ✓ Breaks on paragraph first
Chat transcripts with short lines | May merge unrelated turns | ✓ Respects line breaks
Markdown with code fences | May split inside ``` blocks | ⚠ Still weak; use document-aware
⚙️ Config

Store splitter config in versioned YAML: chunk_policy: recursive_v2, chunk_size: 512, separators: [...]. Ingest workers read the same config; changing it triggers a new collection name and backfill job.

⚠️ Pitfall

Using character chunk_size=1000 while the embedder counts tokens—1000 characters ≈ 250–400 tokens depending on language. Always align length_function with your tokenizer (tiktoken, cl100k_base for OpenAI).

Semantic chunking

Instead of fixed windows, embed each sentence (or paragraph), measure cosine distance between consecutive units, and split where similarity drops sharply—a heuristic for topic boundaries.

Algorithm sketch

  1. Split document into sentences (spaCy / NLTK).
  2. Embed each sentence with the same model used for retrieval (critical—same vector space).
  3. Compute cosine distance between sentence i and i+1: 1 − cos(eᵢ, eᵢ₊₁).
  4. Find breakpoints where distance exceeds a threshold (absolute or percentile—e.g. 95th percentile of distances).
  5. Merge sentences between breakpoints; sub-split any group still over token budget with recursive splitter.

Libraries: LangChain SemanticChunker, LlamaIndex SemanticSplitterNodeParser, or a 50-line NumPy script for full control.

Semantic breakpoints via cosine distance spikes
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed_batch(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def cosine_dist(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_breakpoints(sents: list[str], percentile: float = 95) -> list[int]:
    if len(sents) < 2:
        return []  # no adjacent pair to compare, so no distances to threshold
    vecs = embed_batch(sents)
    dists = [cosine_dist(vecs[i], vecs[i + 1]) for i in range(len(sents) - 1)]
    thresh = np.percentile(dists, percentile)
    return [i + 1 for i, d in enumerate(dists) if d >= thresh]

def chunk_semantic(text: str, source: str) -> list[dict]:
    sents = sentences_spacy(text)
    breaks = set(semantic_breakpoints(sents))
    groups, buf = [], []
    for i, s in enumerate(sents):
        buf.append(s)
        if (i + 1) in breaks:
            groups.append(" ".join(buf))
            buf = []
    if buf:
        groups.append(" ".join(buf))
    return [{"id": f"{source}-{i}", "text": g, "source": source}
            for i, g in enumerate(groups)]
// Spring AI EmbeddingModel + manual breakpoint detection
public List<TextChunk> semanticChunk(String text, String source, EmbeddingModel embedder) {
  List<String> sents = sentenceChunker.splitSentences(text);
  List<float[]> vecs = embedder.embed(sents); // Spring AI 1.0 returns List<float[]>
  List<Double> dists = new ArrayList<>();
  for (int i = 0; i < vecs.size() - 1; i++)
    dists.add(1.0 - cosine(vecs.get(i), vecs.get(i + 1)));
  double thresh = percentile(dists, 95);
  List<TextChunk> chunks = new ArrayList<>();
  StringBuilder buf = new StringBuilder();
  int idx = 0;
  for (int i = 0; i < sents.size(); i++) {
    buf.append(sents.get(i)).append(" ");
    if (i < dists.size() && dists.get(i) >= thresh) {
      chunks.add(new TextChunk(source + "-" + idx++, buf.toString().trim(), 0, source));
      buf = new StringBuilder();
    }
  }
  if (!buf.isEmpty())
    chunks.add(new TextChunk(source + "-" + idx, buf.toString().trim(), 0, source));
  return chunks;
}
⚖️ Trade-off

Semantic chunking embeds every sentence at index time, which gets expensive on large corpora. Batch aggressively (64–256 sentences per request) and cache sentence embeddings by hash. Use it on high-value docs (policies, contracts) while keeping recursive splitting for logs and tickets.
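
Batching and hash-based caching can be combined in one helper. This sketch assumes the caller supplies `embed_batch` (any function mapping a list of strings to a list of vectors) and an in-memory dict as the cache; a real pipeline would likely use a persistent store.

```python
import hashlib
from typing import Callable

Vector = list[float]

def embed_with_cache(
    sentences: list[str],
    embed_batch: Callable[[list[str]], list[Vector]],
    cache: dict[str, Vector],
    batch_size: int = 128,
) -> list[Vector]:
    keys = [hashlib.sha256(s.encode("utf-8")).hexdigest() for s in sentences]
    # Collect each uncached sentence once, preserving first-seen order.
    missing: dict[str, str] = {}
    for key, sent in zip(keys, sentences):
        if key not in cache and key not in missing:
            missing[key] = sent
    pending = list(missing.items())
    for i in range(0, len(pending), batch_size):
        batch = pending[i:i + batch_size]
        vectors = embed_batch([sent for _, sent in batch])
        for (key, _), vec in zip(batch, vectors):
            cache[key] = vec
    # Return vectors in the original sentence order, duplicates included.
    return [cache[k] for k in keys]
```

Repeated boilerplate sentences (disclaimers, signatures) are embedded once, and a re-run over unchanged text makes no embedding calls.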

🔬 Under the Hood

Breakpoint methods assume adjacent-sentence distance spikes at topic shifts. Gradual transitions (two related sections) may not trigger a split, so combine with header-aware splitting for structured docs. Percentile thresholds adapt per document better than fixed 0.5 cutoffs.

Document-aware chunking

Structured documents carry explicit boundaries: Markdown headings, HTML <h1>–<h3>, code functions and classes. Split on structure first, then apply token limits inside each section.

Splitters by format

Format | Boundary signal | Tooling
Markdown | # / ## headers | LangChain MarkdownHeaderTextSplitter
HTML | Heading tags, article, section | BeautifulSoup, Unstructured.io
Code | AST: functions, classes, methods | tree-sitter, LangChain Language splitters
JSON / YAML | Top-level keys or array elements | Custom path-based chunker
Markdown headers + Python AST code chunks
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
import ast

md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")],
)
token_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50, length_function=token_len)

def chunk_markdown(text: str, source: str) -> list[dict]:
    sections = md_splitter.split_text(text)
    out = []
    for i, doc in enumerate(sections):
        for j, sub in enumerate(token_splitter.split_documents([doc])):
            out.append({
                "id": f"{source}-{i}-{j}",
                "text": sub.page_content,
                "source": source,
                "section": sub.metadata.get("h2") or sub.metadata.get("h1", ""),
            })
    return out

def chunk_python_source(code: str, source: str) -> list[dict]:
    tree = ast.parse(code)
    chunks = []
    for i, node in enumerate(tree.body):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(code, node)
            chunks.append({"id": f"{source}-{node.name}", "text": segment, "source": source, "symbol": node.name})
    return chunks
// JavaParser for code; JSoup for HTML
public List<TextChunk> chunkMarkdown(String md, String source) {
  MarkdownParser parser = MarkdownParser.builder().build();
  Node doc = parser.parse(md);
  List<TextChunk> chunks = new ArrayList<>();
  String currentH2 = "";
  for (Node node : doc.getChildren()) {
    if (node instanceof Heading h && h.getLevel() == 2)
      currentH2 = h.getText().toString();
    else if (node instanceof Paragraph p) {
      String text = p.getText().toString();
      for (TextChunk sub : recursiveSplitter.chunk(text, source)) {
        chunks.add(new TextChunk(sub.id(), sub.text(), sub.tokenCount(), source,
            Map.of("section", currentH2)));
      }
    }
  }
  return chunks;
}

Preserve heading text in chunk content or metadata so retrieved snippets include context: “Refunds > Annual plans: full refund within 30 days…” ranks better than an orphaned sentence without its section title.
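
One way to keep the heading with the chunk is to prepend a breadcrumb before embedding. This sketch assumes chunks are dicts that carry optional "h1", "h2", and "h3" keys alongside "text"; the helper name is illustrative.

```python
def with_heading_context(chunk: dict) -> str:
    """Build the text to embed: heading breadcrumb first, then the chunk body."""
    headings = [chunk.get(level, "") for level in ("h1", "h2", "h3")]
    breadcrumb = " > ".join(h for h in headings if h)
    return f"{breadcrumb}\n{chunk['text']}" if breadcrumb else chunk["text"]
```

Embedding the breadcrumb adds section vocabulary to every chunk, at the cost of a few extra tokens per chunk.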

💡 Pro Tip

For HTML docs, strip nav/footer boilerplate before chunking—otherwise every chunk repeats cookie banners and site chrome, polluting embeddings. Use readability extractors (trafilatura, Unstructured) in the ingest pipeline.

🔒 Security

HTML chunking must not execute scripts. Parse with a sandboxed parser; never eval embedded JS. Redact PII blocks (SSN patterns) before chunking so sensitive spans do not land in vector stores without ACL metadata.
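
A minimal redaction pass for the SSN case might look like the sketch below. It assumes the common NNN-NN-NNNN layout only; other identifiers and formats need their own patterns or a dedicated PII detector.

```python
import re

# US SSN in the common NNN-NN-NNNN layout; other formats need their own patterns.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact_ssn(text: str, placeholder: str = "[REDACTED-SSN]") -> tuple[str, int]:
    """Return the redacted text and the number of replacements made."""
    return SSN_PATTERN.subn(placeholder, text)
```

Returning the replacement count makes it easy to log how many spans were removed per document without logging the values themselves.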

Agentic chunking

An LLM decides chunk boundaries—given a passage, it returns split points, titles, or JSON segment objects aligned with semantic topics. Highest quality on messy unstructured text; highest cost and latency at ingest.

When it is worth the cost

Scenario | Agentic? | Rationale
10M support tickets/year | No | LLM cost dominates; use recursive + metadata
500-page regulatory PDF (one-time) | Yes | High stakes; manual review of LLM splits
M&A data room | Yes | Messy scans; budget for quality
Internal wiki with clean Markdown | No | Header splitter is cheaper and deterministic
LLM-proposed chunk boundaries (structured output)
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class Segment(BaseModel):
    title: str
    start_char: int
    end_char: int

class ChunkPlan(BaseModel):
    segments: list[Segment]

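def agentic_chunk(text: str, source: str, max_chars: int = 12000) -> list[dict]:
    window = text[:max_chars]  # text beyond max_chars is not chunked; window long docs upstream
    resp = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Split this document into coherent retrieval chunks. Return char offsets.\n\n{window}"
        }],
        response_format=ChunkPlan,
    )
    plan = resp.choices[0].message.parsed
    if plan is None:  # refusal or unparsable output; caller should fall back
        return []
    chunks = []
    for i, s in enumerate(plan.segments):
        # Model-reported offsets are unreliable: clamp to the window and skip empty spans
        start = max(0, min(s.start_char, len(window)))
        end = max(start, min(s.end_char, len(window)))
        if end > start:
            chunks.append({"id": f"{source}-{i}", "text": window[start:end],
                           "source": source, "section": s.title})
    return chunks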
// Spring AI ChatClient with JSON schema output
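public List<TextChunk> agenticChunk(String text, String source) {
  // substring(0, 12000) throws on shorter inputs, so clamp the window first
  String window = text.substring(0, Math.min(text.length(), 12000));
  ChunkPlan plan = chatClient.prompt()
      .user("Split into coherent RAG chunks with titles and char offsets:\n" + window)
      .call()
      .entity(ChunkPlan.class);
  List<TextChunk> out = new ArrayList<>();
  int i = 0;
  for (Segment s : plan.segments()) {
    // Clamp model-reported offsets and skip empty spans
    int start = Math.max(0, Math.min(s.startChar(), window.length()));
    int end = Math.max(start, Math.min(s.endChar(), window.length()));
    if (end == start) continue;
    out.add(new TextChunk(
        source + "-" + i++,
        window.substring(start, end),
        0, source,
        Map.of("section", s.title())));
  }
  return out;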
}
💰 Cost

A 100-page doc ≈ 50K tokens input + structured output per pass—orders of magnitude more than recursive splitting. Batch docs offline; cache chunk plans by content hash; human-review splits for legal/compliance corpora before embed.

☸️ Operations

Run agentic chunking as a Kubernetes Job with rate limits and dead-letter queue—never inline on the upload API path. Failed LLM splits should fall back to recursive chunking so ingest does not stall.
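
The fallback behaviour can be expressed as an ordered chain of chunkers. This is a sketch under the assumption that every chunker shares the signature (text, source) -> list of dicts; `chunk_with_fallback` is an illustrative name.

```python
import logging
from typing import Callable

Chunker = Callable[[str, str], list[dict]]
log = logging.getLogger("ingest")

def chunk_with_fallback(
    text: str, source: str, chunkers: list[tuple[str, Chunker]]
) -> tuple[str, list[dict]]:
    """Try each (name, chunker) in order; return the first non-empty result."""
    for name, chunker in chunkers:
        try:
            chunks = chunker(text, source)
        except Exception as exc:  # LLM timeout, schema error, rate limit, ...
            log.warning("chunker %s failed for %s: %s", name, source, exc)
            continue
        if chunks:
            return name, chunks
    raise RuntimeError(f"all chunkers failed for {source}")
```

Returning the name of the chunker that succeeded lets the ingest job record which policy actually produced each document's chunks.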

Size guidelines

Optimal chunk size depends on query type and downstream LLM use. Short chunks improve precision for factoid Q&A; longer chunks preserve narrative for summarization.

Use case | Token range | Overlap | Splitter preference
Q&A / FAQ retrieval | 256–512 | 10–15% | Sentence or header-aware
Summarization / narrative | 1024–2048 | 5–10% | Paragraph / semantic groups
Code search | One function or class | Minimal; split mid-function only if huge | AST / tree-sitter
Tables / structured rows | Row group or full table | Header row repeated in each chunk | Custom; never mid-row
Chat / ticket threads | 1–3 turns (~200–400) | Last turn of prev chunk | Turn-boundary splitter
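
The table guideline (whole rows only, header repeated in every chunk) can be sketched as follows. It assumes the table has already been parsed into a header list and row lists; the " | " rendering and the `rows_per_chunk` default are arbitrary choices for illustration.

```python
def chunk_table(header: list[str], rows: list[list[str]], rows_per_chunk: int = 20) -> list[str]:
    """Group whole rows into chunks and repeat the header line in each one."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    header_line = " | ".join(header)
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start:start + rows_per_chunk]
        lines = [header_line] + [" | ".join(row) for row in group]
        chunks.append("\n".join(lines))
    return chunks
```

Because each chunk carries the column names, a retrieved row group stays interpretable without the rest of the table.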

Code: chunk by function / class

Language | Unit | Max tokens if oversized | Metadata field
Python | FunctionDef, ClassDef | 512 then recursive | symbol, lineno
Java | Method, class | 512 then recursive | qualified_name
TypeScript | Function, class, export | 512 | file_path
Go | Function, struct | 512 | package

Tune with evals, not intuition: start at 512 for general docs, run recall@5 on 50 labeled questions, then sweep {256, 512, 768, 1024} and plot the curve—most corpora show a plateau then drop when chunks get too large.

🎯 Interview Tip

“What chunk size would you use?” — Answer with the use case: “For support FAQ I’d start at 400 tokens sentence-aware; for code, one function per chunk; I’d validate with recall@K on real queries before locking config.”

🏗️ Infrastructure

Expose chunk size and policy as Terraform variables or Helm values per environment: chunk_size_dev=256 for fast iteration, chunk_size_prod=512 after eval sign-off. Prevent accidental prod re-index with different params.

Metadata

Every chunk carries metadata for filtered retrieval, citation UI, and debugging. Without source, page, and section, users cannot trust answers—and you cannot scope queries by date or tenant.

Essential fields

Field | Example | Used for
source | refund-policy.pdf | Citation label, dedupe on re-ingest
page | 12 | Deep link to PDF viewer
section | Refunds | Filtered search, breadcrumb in UI
updated_at | 2026-03-15T10:00:00Z | Recency boost, stale chunk purge
acl / tenant_id | team_42 | Pre-filter before ANN search
chunk_policy | recursive_v3_512 | Audit which split produced vector
Upsert chunks with metadata — Pinecone / pgvector
from datetime import datetime, timezone

def enrich_metadata(chunk: dict, doc_meta: dict) -> dict:
    meta = {
        "source": chunk["source"],
        "page": doc_meta.get("page"),
        "section": chunk.get("section", ""),
        "updated_at": doc_meta.get("updated_at", datetime.now(timezone.utc).isoformat()),
        "tenant_id": doc_meta["tenant_id"],
        "tokens": token_len(chunk["text"]),
        "chunk_policy": "recursive_v3_512",
    }
    # Drop unset fields (e.g. page on non-paginated docs); Pinecone rejects null metadata values
    return {k: v for k, v in meta.items() if v is not None}

# Pinecone upsert
index.upsert(vectors=[{
    "id": c["id"],
    "values": embed(c["text"]),
    "metadata": enrich_metadata(c, doc_meta),
} for c in chunks])

# pgvector: metadata in JSONB column alongside embedding
public Map<String, Object> enrichMetadata(TextChunk chunk, DocMeta doc) {
  return Map.of(
      "source", chunk.source(),
      "page", doc.page(),
      "section", chunk.section(),
      "updated_at", doc.updatedAt().toString(),
      "tenant_id", doc.tenantId(),
      "tokens", chunk.tokenCount(),
      "chunk_policy", "recursive_v3_512"
  );
}

// Query with metadata filter (illustrative builder and filter syntax; Pinecone itself expects MongoDB-style operators such as $eq and $gt)
SearchRequest req = SearchRequest.builder()
    .queryVector(queryEmb)
    .topK(10)
    .filter("tenant_id = 'team_42' AND updated_at > '2026-01-01'")
    .build();
📦 Real World

Perplexity-style citations show title + URL from metadata, not hallucinated links. Stripe’s internal doc search filters by team ACL at the vector store layer—chunks without tenant_id never enter the index.

🔒 Security

Metadata filters are part of authorization—do not rely on post-retrieval filtering alone. A misconfigured ANN query without tenant filter leaks cross-customer chunks into the LLM context.
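
The ordering matters: the tenant check has to run before ranking, not after. The sketch below uses brute-force cosine search over an in-memory list as a stand-in for the vector store's filtered ANN query; the chunk layout (id, vector, metadata) is an assumption for illustration.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def search_for_tenant(query_vec: list[float], chunks: list[dict],
                      tenant_id: str, k: int = 5) -> list[dict]:
    # Authorization step: only this tenant's chunks are ever candidates.
    # Chunks with no tenant_id are excluded rather than treated as public.
    allowed = [c for c in chunks if c["metadata"].get("tenant_id") == tenant_id]
    ranked = sorted(allowed, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return ranked[:k]
```

Filtering after a global top-K would both risk leaking other tenants' chunks into intermediate results and return fewer than k hits when other tenants dominate the ranking.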

Parent–child chunks

Index small child chunks for precise retrieval; when a child hits, return the parent (section or page) to the LLM for generation. Best of both worlds: sharp search, rich context.

Pattern

  1. Parent — logical unit (Markdown section, PDF page, full policy clause) stored in object storage or parent table.
  2. Child — 128–256 token sub-splits embedded and indexed in the vector store.
  3. Link — each child metadata includes parent_id.
  4. Query — ANN on children → dedupe parent IDs → fetch parent text for prompt.

Layer | Typical size | Stored in | Role
Child | 128–256 tokens | Vector DB (with embedding) | Similarity search
Parent | 512–2048 tokens | Postgres / S3 / KV | LLM context assembly
Parent–child index and retrieval expansion
from dataclasses import asdict

def build_parent_child(sections: list[dict]) -> tuple[list[dict], list[dict]]:
    parents, children = [], []
    for sec in sections:
        pid = f"{sec['source']}#{sec['section']}"
        parents.append({"id": pid, "text": sec["text"], "source": sec["source"]})
        for chunk in chunk_fixed(sec["text"], pid, size=256, overlap=32):
            child = asdict(chunk)  # chunk_fixed returns Chunk dataclasses, not dicts
            child["parent_id"] = pid
            children.append(child)
    return parents, children

def retrieve_with_parents(query: str, k: int = 5) -> list[str]:
    child_hits = vector_search(embed(query), k=k * 3)  # over-fetch
    parent_ids = list(dict.fromkeys(h["metadata"]["parent_id"] for h in child_hits))[:k]
    return [parent_store[pid]["text"] for pid in parent_ids]
public List<String> retrieveWithParents(float[] queryEmb, int k) {
  List<ChildHit> childHits = vectorStore.similaritySearch(queryEmb, k * 3);
  LinkedHashSet<String> parentIds = new LinkedHashSet<>();
  for (ChildHit hit : childHits) {
    parentIds.add(hit.metadata().get("parent_id"));
    if (parentIds.size() >= k) break;
  }
  return parentIds.stream().map(parentStore::getText).toList();
}
⚖️ Trade-off

Parent–child adds storage complexity and duplicate text (parent blob + child slices). Worth it when recall improves measurably on golden sets—especially for long policy sections where 512-token chunks dilute embeddings.

💡 Pro Tip

Deduplicate parent IDs before loading context—three child hits from the same section should not triple the parent text in the prompt.

What this looks like in production

Chunk policy is versioned configuration, not a notebook one-liner. Promote changes only after chunking eval on golden documents—labeled questions with expected chunk IDs or answer spans.

Chunking eval workflow

  1. Golden docs — 20–50 representative sources (policies, APIs, runbooks) with known tricky boundaries.
  2. Golden queries — real user questions; label gold_chunk_ids or acceptable parent IDs.
  3. Sweep policies — fixed 512, sentence 400, recursive 512, header-aware; one variable at a time.
  4. Metrics — recall@K, MRR, answer faithfulness when passed to LLM (optional end-to-end).
  5. Promote — bump chunk_policy version; new collection; backfill job; alias swap.
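
Of the metrics in step 4, MRR is the one not shown in the recall example further down. A minimal sketch, assuming each golden row provides a set of gold chunk IDs and the ranked list of retrieved IDs:

```python
def reciprocal_rank(gold_ids: set[str], retrieved_ids: list[str]) -> float:
    """1 / rank of the first relevant hit, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in gold_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs: list[tuple[set[str], list[str]]]) -> float:
    """Average reciprocal rank over (gold_ids, retrieved_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(gold, retrieved) for gold, retrieved in runs) / len(runs)
```

Recall@K says whether the right chunk made the cut; MRR also rewards ranking it near the top, which matters when only the first few chunks fit in the prompt.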

Ingest pipeline checklist

  • Token-based limits aligned with embedding model tokenizer—not raw character counts
  • Structure-aware split before overlap tuning (headers, AST, sentences)
  • Metadata: source, section, updated_at, ACL on every chunk
  • Content-hash skip for unchanged docs on re-ingest
  • Chunk policy logged in collection metadata and chunk rows
  • Fallback chain: agentic → recursive → fixed-size on failure
  • Metrics: chunks_per_doc histogram, embed cost, ingest latency p95
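
The content-hash skip from the checklist can be sketched with hashlib. This assumes an in-memory dict keyed by document ID; in production the digest would more likely live in the document table and be written only after the ingest succeeds.

```python
import hashlib

def content_hash(text: str, chunk_policy: str) -> str:
    """Hash the whitespace-normalized text together with the policy name."""
    normalized = " ".join(text.split())  # ignore whitespace-only differences
    return hashlib.sha256(f"{chunk_policy}\n{normalized}".encode("utf-8")).hexdigest()

def needs_reingest(doc_id: str, text: str, chunk_policy: str, seen: dict[str, str]) -> bool:
    """True if the document or the chunk policy changed since the last ingest."""
    digest = content_hash(text, chunk_policy)
    if seen.get(doc_id) == digest:
        return False
    seen[doc_id] = digest
    return True
```

Including the policy name in the hash means a policy bump forces a re-chunk even when the source text is unchanged.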

Regression gate (CI)

Store golden queries in git; CI job chunks golden docs with candidate policy, embeds (or mocks), runs retrieval, fails if recall@5 drops more than 2 points vs baseline. Block merges that change chunk_size without updated eval artifacts.

Recall@K eval for chunk policy comparison
def recall_at_k(gold_ids: set[str], retrieved_ids: list[str], k: int = 5) -> float:
    if not gold_ids:
        raise ValueError("golden row has no gold_chunk_ids")  # avoid dividing by zero
    top = set(retrieved_ids[:k])
    return len(gold_ids & top) / len(gold_ids)

def eval_policy(chunks: list[dict], golden: list[dict], k: int = 5) -> float:
    scores = []
    for row in golden:
        hits = vector_search(embed(row["query"]), index=chunks, k=k)
        scores.append(recall_at_k(set(row["gold_chunk_ids"]), [h["id"] for h in hits], k))
    return sum(scores) / len(scores)

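# Compare policies before prod promotion. Chunk IDs differ per policy, so gold_chunk_ids
# must be labeled per policy or derived from content spans.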
for name, chunker in [("fixed512", chunk_fixed), ("sentence400", chunk_by_sentences)]:
    chunks = chunker(corpus_text, "golden.pdf")
    print(name, eval_policy(chunks, golden_queries))
public double recallAtK(Set<String> gold, List<String> retrieved, int k) {
  Set<String> top = retrieved.stream().limit(k).collect(Collectors.toSet());
  long hit = gold.stream().filter(top::contains).count();
  return (double) hit / gold.size();
}

@Test
void chunkPolicyRegression() {
  double baseline = evalPolicy(fixedChunker, goldenQueries);
  double candidate = evalPolicy(sentenceChunker, goldenQueries);
  assertThat(candidate).isGreaterThanOrEqualTo(baseline - 0.02);
}
⚠️ Pitfall

Changing chunk overlap alone changes chunk IDs—your golden gold_chunk_ids labels become invalid. Label by content span or parent section ID, not ephemeral chunk indices, when policies churn frequently.
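
Span-based labels can be resolved to chunk IDs at eval time. This sketch assumes each chunk records its character range in the source document under hypothetical "start" and "end" keys (end exclusive), which the chunkers in this guide would need to be extended to emit.

```python
def chunks_covering_span(chunks: list[dict], span_start: int, span_end: int) -> list[str]:
    """IDs of chunks whose [start, end) range overlaps the gold answer span."""
    return [c["id"] for c in chunks if c["start"] < span_end and c["end"] > span_start]

def span_fully_contained(chunks: list[dict], span_start: int, span_end: int) -> bool:
    """True if at least one chunk holds the whole gold span without a split."""
    return any(c["start"] <= span_start and c["end"] >= span_end for c in chunks)
```

The first function yields gold IDs for recall@K under any policy; the second flags policies that split an answer span across chunk boundaries.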

📦 Real World

Teams store CHUNK_POLICY=v4_sentence_400 in config; ingest workers read it; PRs that modify chunk config must attach recall@5 from the regression set—same discipline as prompt versioning.

⚙️ Config

Example prod config block:

chunk_policy: markdown_header_recursive
child_size: 256
parent_max: 2048
collection: support_kb_v7