Chunking strategies

Why chunking matters

Embedding quality depends on chunk coherence. An embedding model compresses an entire chunk into one vector. If the chunk mixes unrelated topics, the vector becomes an ambiguous average and similarity scores flatten. Good chunks are self-contained units of meaning that a user question can match precisely.

Consider a support policy document: “Annual plans: full refund within 30 days of purchase. Monthly plans: cancel anytime; no refund for partial months.” A naive 200-character split might cut after “30 days of” and leave “purchase. Monthly plans…” in the next chunk. Neither chunk then states the annual rule whole: the first loses the end of the refund window, and the second mentions refunds without the word “annual”. A user asking “Can I get a refund on my yearly subscription?” can retrieve either fragment, and the LLM may hallucinate eligibility.
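
A minimal sketch of that failure, assuming a plain character-window splitter. The helper name `split_chars` and the 48-character width are illustrative choices (the width is picked only to force a cut inside this short string):

```python
POLICY = (
    "Annual plans: full refund within 30 days of purchase. "
    "Monthly plans: cancel anytime; no refund for partial months."
)

def split_chars(text: str, width: int) -> list[str]:
    """Cut text into consecutive windows of `width` characters, ignoring meaning."""
    return [text[i:i + width] for i in range(0, len(text), width)]

chunks = split_chars(POLICY, width=48)
for c in chunks:
    print(repr(c))

# The annual rule is retrievable as a unit only if one chunk holds the whole sentence.
annual_rule = "Annual plans: full refund within 30 days of purchase."
rule_is_whole = any(annual_rule in c for c in chunks)
print("annual rule intact in some chunk:", rule_is_whole)
```

Printing the chunks next to the rule you expect to retrieve is usually the quickest way to spot a severed condition.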

What a good chunk optimizes for

Property	Why it matters	Failure mode when ignored
Coherence	One topic or rule per vector	Embedding dilution; wrong chunk wins on cosine
Completeness	Conditions stay together (time window + plan type)	LLM fills gaps from parametric knowledge
Retrievability	Chunk text overlaps vocabulary in real questions	Correct doc never surfaces in top-K
Context budget	Small enough that several chunks fit in the LLM window	Truncated context; dropped citations
Traceability	Metadata maps chunk → source page/section	Users cannot verify answers; compliance gaps

Chunking sits between ingest and embed

Extract — PDF, HTML, code, tickets; normalize to text with layout hints where possible.
Split — apply a chunk policy (this guide); emit text + metadata records.
Embed — batch chunks through the embedding model; store vectors with metadata.
Retrieve — query embedding vs chunk vectors; optional rerank; pass top chunks to LLM.

Changing chunk policy after launch requires re-embedding (new vectors)—not an in-place patch. Version your collection (support_kb_v3_sentence_512) and measure recall before cutover.

⚠️ Pitfall

Treating chunk size as “whatever fits the embedding model max length.” Many embedding APIs accept inputs of roughly 8K tokens or more, but retrieval quality usually peaks at a few hundred tokens per chunk, because long chunks bury the one sentence that matched the query. Split for search, then optionally expand to parent context at generation time (see parent–child).

🔬 Under the Hood

Bi-encoder embeddings are mean-pooled (or CLS-pooled) over all tokens in the chunk. Attention mixes token representations—outlier sentences pull the vector toward their topic. That is why mixing “Refunds” and “Shipping” in one chunk hurts both retrieval paths.
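
The dilution effect can be illustrated with a toy model. This is a sketch under strong assumptions: the 2-D "token embeddings" are made-up numbers (axis 0 standing for refunds, axis 1 for shipping), and real encoders mix tokens through attention before pooling, so only the direction of the effect carries over.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_pool(vectors: list[list[float]]) -> list[float]:
    """Average token vectors component-wise into one chunk vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Hypothetical token vectors: axis 0 ~ "refunds", axis 1 ~ "shipping".
refund_tokens = [[0.9, 0.1], [1.0, 0.0], [0.8, 0.2]]
shipping_tokens = [[0.1, 0.9], [0.0, 1.0], [0.2, 0.8]]
query_refund = [1.0, 0.0]

focused = mean_pool(refund_tokens)                  # chunk about refunds only
mixed = mean_pool(refund_tokens + shipping_tokens)  # chunk mixing both topics

print("focused chunk vs refund query:", round(cosine(focused, query_refund), 3))
print("mixed chunk vs refund query:  ", round(cosine(mixed, query_refund), 3))
```

The mixed chunk scores lower against the refund query than the focused one, and by symmetry it scores equally poorly against a shipping query.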

🎯 Interview Tip

“How would you debug RAG returning wrong answers?” — Start with chunk boundaries: print the retrieved chunk text, check if the answer span is split, then tune size/overlap/structure before blaming the embedding model or prompt.

Fixed-size chunks

The baseline strategy: split every N tokens (or characters) with M tokens overlap between consecutive chunks. Simple, fast, deterministic—often the first implementation and a useful A/B baseline.

Parameters

Parameter	Typical value	Effect
chunk_size	256–512 tokens	Upper bound on chunk length
chunk_overlap	10–20% of size (50–100 tokens)	Duplicates boundary text in adjacent chunks
length_function	Token count (tiktoken), not chars	Aligns with embedder and LLM billing

Pros and cons

Pros	Cons
Predictable chunk count and storage cost	Cuts mid-sentence, mid-table, mid-code block
Easy to parallelize in ingest workers	Overlap inflates index size (same text embedded twice)
Works on any plain text without structure	Ignores document semantics (headings, lists)

Use overlap so facts on boundaries appear whole in at least one chunk. Example: if a rule straddles token 512 and the next chunk starts at 513 with no overlap, neither chunk holds the whole rule, and the second chunk may lack the qualifier “annual plans only” from the preceding sentence.

import tiktoken
from dataclasses import dataclass

enc = tiktoken.get_encoding("cl100k_base")

@dataclass
class Chunk:
    id: str
    text: str
    token_count: int
    source: str

def token_len(text: str) -> int:
    return len(enc.encode(text))

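def chunk_fixed(text: str, source: str, size: int = 512, overlap: int = 64) -> list[Chunk]:
    if not 0 <= overlap < size:
        # overlap >= size would stop the window from advancing (infinite loop)
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    tokens = enc.encode(text)
    chunks: list[Chunk] = []
    start, i = 0, 0
    while start < len(tokens):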
        end = min(start + size, len(tokens))
        piece = enc.decode(tokens[start:end])
        chunks.append(Chunk(
            id=f"{source}-{i}",
            text=piece,
            token_count=end - start,
            source=source,
        ))
        i += 1
        if end == len(tokens):
            break
        start = end - overlap  # slide window back by overlap
    return chunks

public record TextChunk(String id, String text, int tokenCount, String source) {}

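public class FixedSizeChunker {
  private final TokenCounter counter; // wraps jtokkit or provider tokenizer
  private final int size;
  private final int overlap;

  public FixedSizeChunker(TokenCounter counter, int size, int overlap) {
    // overlap >= size would stop the window from advancing
    if (overlap < 0 || overlap >= size)
      throw new IllegalArgumentException("overlap must satisfy 0 <= overlap < size");
    this.counter = counter;
    this.size = size;
    this.overlap = overlap;
  }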

  public List<TextChunk> chunk(String text, String source) {
    List<Integer> tokens = counter.encode(text);
    List<TextChunk> out = new ArrayList<>();
    int start = 0, i = 0;
    while (start < tokens.size()) {
      int end = Math.min(start + size, tokens.size());
      String piece = counter.decode(tokens.subList(start, end));
      out.add(new TextChunk(source + "-" + i, piece, end - start, source));
      i++;
      if (end == tokens.size()) break;
      start = end - overlap;
    }
    return out;
  }
}

⚖️ Trade-off

More overlap improves recall on boundary facts but increases embed cost and index size: the chunk count grows by a factor of size / (size − overlap). A 512-token chunk with 128-token overlap repeats 25% of each chunk and produces about 33% more chunks (512 / 384). That is acceptable for high-stakes policies and wasteful for millions of log lines. Prefer sentence-aware splits before cranking overlap past 20%.

💰 Cost

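Estimate index cost: total_tokens × size / (size − overlap) ÷ 1,000,000 × embed_price_per_million. The simpler factor (1 + overlap/size) is a first-order approximation that underestimates at high overlap. Halving chunk size doubles chunk count, and with a fixed overlap in tokens it also raises the duplicated share, so run the math before shrinking chunks to “improve recall.”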
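
The estimate can be made exact for sliding fixed-size windows by counting chunks directly. The sketch below assumes the windowing used in the fixed-size chunker of this section (each chunk starts `overlap` tokens before the previous one ended); the function names and the price of 0.02 per million tokens are placeholders, not a quoted rate.

```python
import math

def chunk_count(total_tokens: int, size: int, overlap: int) -> int:
    """Number of sliding windows needed to cover total_tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= size:
        return 1
    stride = size - overlap  # new tokens contributed by each later chunk
    return 1 + math.ceil((total_tokens - size) / stride)

def embedded_tokens(total_tokens: int, size: int, overlap: int) -> int:
    """Tokens actually sent to the embedder, counting duplicated overlap."""
    n = chunk_count(total_tokens, size, overlap)
    # Every chunk after the first re-embeds `overlap` tokens.
    return 0 if n == 0 else total_tokens + (n - 1) * overlap

def embed_cost(total_tokens: int, size: int, overlap: int, price_per_million: float) -> float:
    return embedded_tokens(total_tokens, size, overlap) / 1_000_000 * price_per_million

corpus = 1_000_000
for size, overlap in [(512, 0), (512, 64), (512, 128), (256, 64)]:
    sent = embedded_tokens(corpus, size, overlap)
    cost = embed_cost(corpus, size, overlap, price_per_million=0.02)  # placeholder price
    print(f"size={size} overlap={overlap} chunks={chunk_count(corpus, size, overlap)} "
          f"embedded={sent} cost={cost:.4f}")
```

Running the loop for candidate policies before a re-index shows how quickly a fixed overlap in tokens becomes expensive as chunks shrink.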

Sentence splitting

Pack sentences into token-budget chunks instead of blind windows. Respecting NLTK or spaCy sentence boundaries keeps clauses intact and usually reduces overlap needs.

Boundary detectors

Tool	Approach	Best for	Caveats
Regex ((?<=[.!?])\s+)	Split on punctuation + whitespace	Prototypes, English prose	Breaks on abbreviations followed by a space (“Dr. Smith”, “approx. 5”)
NLTK punkt	Unsupervised sentence tokenizer	General English, lightweight	Extra dependency; download punkt data
spaCy (sentencizer or full pipeline)	Linguistic rules + model	Multi-language, legal/clinical text	Heavier; model load time in workers
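
A short sketch of the regex detector's main caveat, using only the standard library. The sample sentence and the abbreviation list in the patched pattern are illustrative assumptions; a real abbreviation list would be longer, which is the point at which NLTK or spaCy becomes the better choice.

```python
import re

# The naive boundary from the table: sentence punctuation followed by whitespace.
NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

# Patched variant that refuses to split right after a few known titles.
GUARDED_BOUNDARY = re.compile(r"(?<!Dr\.)(?<!Mr\.)(?<!Ms\.)(?<=[.!?])\s+")

def split_sentences(text: str, boundary: re.Pattern) -> list[str]:
    return [s for s in boundary.split(text) if s]

sample = "Dr. Smith approved v2.0. Refunds take 5 days."

naive = split_sentences(sample, NAIVE_BOUNDARY)
guarded = split_sentences(sample, GUARDED_BOUNDARY)

print(naive)    # the title "Dr." is cut off from the name
print(guarded)  # two sentences
```

The version string survives in both cases because no whitespace follows its inner dot; the damage comes from abbreviations that are followed by a space.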

Algorithm: split document into sentences → greedily append sentences until adding the next exceeds max_tokens → flush chunk → start new buffer (optionally carry last sentence for overlap).

import nltk
import spacy
from chunking.tokens import token_len  # from fixed-size section

nltk.download("punkt", quiet=True)  # newer NLTK releases need "punkt_tab" instead
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def sentences_nltk(text: str) -> list[str]:
    return nltk.sent_tokenize(text)

def sentences_spacy(text: str) -> list[str]:
    return [s.text.strip() for s in nlp(text).sents if s.text.strip()]

def chunk_by_sentences(text: str, source: str, max_tokens: int = 400) -> list[dict]:
    chunks, buf, i = [], [], 0
    for sent in sentences_spacy(text):
        trial = " ".join(buf + [sent])
        if token_len(trial) > max_tokens and buf:
            piece = " ".join(buf)
            chunks.append({"id": f"{source}-{i}", "text": piece, "source": source})
            i += 1
            buf = [sent]
        else:
            buf.append(sent)
    if buf:
        chunks.append({"id": f"{source}-{i}", "text": " ".join(buf), "source": source})
    return chunks

// OpenNLP or Stanford CoreNLP for sentence detection
public class SentenceChunker {
  private final SentenceDetector sentenceDetector;
  private final TokenCounter counter;
  private final int maxTokens;

  public List<TextChunk> chunk(String text, String source) {
    String[] sentences = sentenceDetector.sentDetect(text);
    List<TextChunk> chunks = new ArrayList<>();
    List<String> buf = new ArrayList<>();
    int i = 0;
    for (String sent : sentences) {
      String trial = buf.isEmpty() ? sent : String.join(" ", buf) + " " + sent;
      if (counter.count(trial) > maxTokens && !buf.isEmpty()) {
        String piece = String.join(" ", buf);
        chunks.add(new TextChunk(source + "-" + i++, piece, counter.count(piece), source));
        buf = new ArrayList<>(List.of(sent));
      } else {
        buf.add(sent);
      }
    }
    if (!buf.isEmpty()) {
      String piece = String.join(" ", buf);
      chunks.add(new TextChunk(source + "-" + i, piece, counter.count(piece), source));
    }
    return chunks;
  }
}
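
The greedy packing above starts each new buffer empty. A variant that carries the last sentence forward as overlap might look like the sketch below. It is dependency-free on purpose: the default `count` function counts whitespace-separated words as a stand-in for a real tokenizer, and `pack_sentences` and `carry` are hypothetical names.

```python
from typing import Callable

def pack_sentences(
    sentences: list[str],
    max_tokens: int,
    carry: int = 1,
    count: Callable[[str], int] = lambda s: len(s.split()),  # stand-in for a tokenizer
) -> list[str]:
    """Greedily pack sentences; repeat the last `carry` sentences in the next chunk."""
    chunks: list[str] = []
    buf: list[str] = []
    for sent in sentences:
        if buf and count(" ".join(buf + [sent])) > max_tokens:
            chunks.append(" ".join(buf))
            buf = buf[-carry:] if carry > 0 else []
            # Drop the carried sentences if they leave no room for the next one.
            if buf and count(" ".join(buf + [sent])) > max_tokens:
                buf = []
        buf.append(sent)  # an oversized single sentence becomes its own chunk
    if buf:
        chunks.append(" ".join(buf))
    return chunks

sents = [
    "Annual plans get a full refund.",
    "The window is 30 days.",
    "Monthly plans can cancel anytime.",
    "Partial months are not refunded.",
]
packed = pack_sentences(sents, max_tokens=12, carry=1)
for chunk in packed:
    print(chunk)
```

Every chunk after the first opens with the sentence that closed the previous one, so a condition stated at a boundary appears next to both of its neighbours.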

💡 Pro Tip

For bullet lists, treat each bullet as a pseudo-sentence (split on \n[-*]) before packing. Otherwise one chunk may contain ten unrelated list items and the embedding becomes useless for single-item queries.
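
A minimal sketch of that pre-pass, assuming bullets start a line with "- " or "* ". The helper name `split_bullets` is hypothetical, and numbered or nested lists would need their own patterns.

```python
import re

# Split at a newline that is immediately followed by a bullet marker and a space.
BULLET_START = re.compile(r"\n(?=[-*] )")
BULLET_MARKER = re.compile(r"^[-*] ")

def split_bullets(text: str) -> list[str]:
    """Return one pseudo-sentence per bullet, with the marker removed."""
    items = BULLET_START.split(text.strip())
    return [BULLET_MARKER.sub("", item).strip() for item in items if item.strip()]

notes = "- Refunds within 30 days\n- Free shipping over $50\n* Support hours: 9-5"
bullets = split_bullets(notes)
print(bullets)
```

The resulting items can be fed to the sentence packer in place of detector output, so each bullet is kept or moved as a unit.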

📦 Real World

Notion and Confluence exports often mix headings, bullets, and callouts. Teams run a structure-aware pre-pass (headers → sections), then sentence chunking within each section, because pure sentence splitting on the whole page still merges unrelated sections.

Recursive splitting

LangChain’s RecursiveCharacterTextSplitter tries coarse separators first (paragraph breaks), then finer ones (newlines, sentences, words) until pieces fit the size budget (characters by default; tokens when length_function counts tokens). This is the default in many RAG tutorials for good reason: it handles mixed prose without custom rules per doc type.

Separator hierarchy (typical order)

"\n\n": paragraph boundary
"\n": line break (lists, poetry)
". ": sentence boundary (fragile but useful fallback)
" " (single space): word boundary
"" (empty string): character split, last resort (CJK text without spaces)

Recursion: split text on the first separator in the list. If any piece exceeds chunk_size, re-split that piece with the next separator. Then merge adjacent small pieces up to the size limit, repeating up to chunk_overlap units (as measured by length_function) at the start of the next chunk.
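
To make the recursion concrete, here is a simplified from-scratch sketch. It is not LangChain's implementation: it measures length in characters, applies no overlap, and drops the separator at chunk boundaries. The name `recursive_split` is hypothetical.

```python
def recursive_split(
    text: str,
    max_len: int,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ", ""),
) -> list[str]:
    """Split on the coarsest separator first; re-split oversized parts with finer ones."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard character windows.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    buf = ""
    for part in text.split(sep):
        if len(part) > max_len:
            # Too large even alone: flush the buffer, then recurse with finer separators.
            if buf:
                chunks.append(buf)
                buf = ""
            chunks.extend(recursive_split(part, max_len, finer))
        elif not buf:
            buf = part
        elif len(buf) + len(sep) + len(part) <= max_len:
            buf = buf + sep + part  # merge small neighbours up to the limit
        else:
            chunks.append(buf)
            buf = part
    if buf:
        chunks.append(buf)
    return [c for c in chunks if c.strip()]

doc = (
    "Refunds are allowed within 30 days.\n\n"
    "Shipping is free over fifty dollars. Returns need a receipt."
)
pieces = recursive_split(doc, max_len=40)
for p in pieces:
    print(len(p), repr(p))
```

The first paragraph fits and stays whole; the second is too long, so only that paragraph falls through to the sentence separator.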

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    length_function=token_len,
    separators=["\n\n", "\n", ". ", " ", ""],
)

docs = splitter.create_documents(
    texts=[open("policy.txt").read()],
    metadatas=[{"source": "policy.txt"}],
)
chunks = [{"id": f"policy-{i}", "text": d.page_content, **d.metadata}
          for i, d in enumerate(docs)]

// Spring AI TokenTextSplitter: token-based splits (no separator hierarchy and no overlap setting)
@Bean
TokenTextSplitter recursiveSplitter() {
  return new TokenTextSplitter(
      512,   // chunkSize: target tokens per chunk
      64,    // minChunkSizeChars: minimum characters before a punctuation break is accepted
      5,     // minChunkLengthToEmbed: shorter chunks are dropped
      10000, // maxNumChunks
      true   // keepSeparator
  );
}

public List<Document> chunk(String text, Map<String, Object> meta) {
  return recursiveSplitter().split(new Document(text, meta));
}

When recursive beats fixed-size

Document shape	Fixed-size	Recursive
Long paragraphs, few headings	Mid-paragraph cuts common	✓ Breaks on paragraph first
Chat transcripts with short lines	May merge unrelated turns	✓ Respects line breaks
Markdown with code fences	May split inside ``` blocks	⚠ Still weak—use document-aware

⚙️ Config

Store splitter config in versioned YAML: chunk_policy: recursive_v2, chunk_size: 512, separators: [...]. Ingest workers read the same config; changing it triggers a new collection name and backfill job.

⚠️ Pitfall

Using character chunk_size=1000 while the embedder counts tokens—1000 characters ≈ 250–400 tokens depending on language. Always align length_function with your tokenizer (tiktoken, cl100k_base for OpenAI).

Semantic chunking

Instead of fixed windows, embed each sentence (or paragraph), measure cosine distance between consecutive units, and split where similarity drops sharply—a heuristic for topic boundaries.

Algorithm sketch

Split document into sentences (spaCy / NLTK).
Embed each sentence with the same model used for retrieval (critical—same vector space).
Compute cosine distance between sentence i and i+1: 1 − cos(eᵢ, eᵢ₊₁).
Find breakpoints where distance exceeds a threshold (absolute or percentile—e.g. 95th percentile of distances).
Merge sentences between breakpoints; sub-split any group still over token budget with recursive splitter.

Libraries: LangChain SemanticChunker, LlamaIndex SemanticSplitterNodeParser, or a 50-line NumPy script for full control.

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed_batch(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def cosine_dist(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_breakpoints(sents: list[str], percentile: float = 95) -> list[int]:
    if len(sents) < 2:
        return []  # no adjacent pairs, so there are no distances to threshold
    vecs = embed_batch(sents)
    dists = [cosine_dist(vecs[i], vecs[i + 1]) for i in range(len(sents) - 1)]
    thresh = np.percentile(dists, percentile)
    return [i + 1 for i, d in enumerate(dists) if d >= thresh]

def chunk_semantic(text: str, source: str) -> list[dict]:
    sents = sentences_spacy(text)
    breaks = set(semantic_breakpoints(sents))
    groups, buf = [], []
    for i, s in enumerate(sents):
        buf.append(s)
        if (i + 1) in breaks:
            groups.append(" ".join(buf))
            buf = []
    if buf:
        groups.append(" ".join(buf))
    return [{"id": f"{source}-{i}", "text": g, "source": source}
            for i, g in enumerate(groups)]

// Spring AI EmbeddingModel + manual breakpoint detection
public List<TextChunk> semanticChunk(String text, String source, EmbeddingModel embedder) {
  List<String> sents = sentenceChunker.splitSentences(text);
  List<float[]> vecs = embedder.embed(sents); // Spring AI 1.x returns List<float[]>
  List<Double> dists = new ArrayList<>();
  for (int i = 0; i < vecs.size() - 1; i++)
    dists.add(1.0 - cosine(vecs.get(i), vecs.get(i + 1)));
  double thresh = dists.isEmpty() ? Double.MAX_VALUE : percentile(dists, 95);
  List<TextChunk> chunks = new ArrayList<>();
  StringBuilder buf = new StringBuilder();
  int idx = 0;
  for (int i = 0; i < sents.size(); i++) {
    buf.append(sents.get(i)).append(" ");
    if (i < dists.size() && dists.get(i) >= thresh) {
      chunks.add(new TextChunk(source + "-" + idx++, buf.toString().trim(), 0, source));
      buf = new StringBuilder();
    }
  }
  if (!buf.isEmpty())
    chunks.add(new TextChunk(source + "-" + idx, buf.toString().trim(), 0, source));
  return chunks;
}

⚖️ Trade-off

Semantic chunking costs one embedding per sentence at index time, which is expensive on large corpora. Batch aggressively (64–256 sentences per request) and cache sentence embeddings by hash. Use it on high-value docs (policies, contracts) while keeping recursive splitting for logs and tickets.

🔬 Under the Hood

Breakpoint methods assume adjacent-sentence distance spikes at topic shifts. Gradual transitions (two related sections) may not trigger a break, so combine with header-aware splitting for structured docs. Percentile thresholds adapt per document better than fixed 0.5 cutoffs.
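
A small sketch of why a per-document percentile tends to behave better than a fixed cutoff. The distance lists are made-up values for two hypothetical documents, and `percentile` is a hand-written linear-interpolation helper rather than a library call.

```python
def percentile(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    rank = (len(ordered) - 1) * pct / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

def breakpoints(dists: list[float], threshold: float) -> list[int]:
    """Index of the sentence that starts a new chunk, for each distance over threshold."""
    return [i + 1 for i, d in enumerate(dists) if d >= threshold]

# Hypothetical adjacent-sentence distances; each document has one real topic shift.
terse_doc = [0.05, 0.06, 0.30, 0.04, 0.07]     # tightly written, shift at 0.30
rambling_doc = [0.55, 0.60, 0.95, 0.58, 0.62]  # noisy prose, shift at 0.95

for name, dists in [("terse", terse_doc), ("rambling", rambling_doc)]:
    fixed = breakpoints(dists, 0.5)
    adaptive = breakpoints(dists, percentile(dists, 95))
    print(name, "fixed 0.5 ->", fixed, "| 95th percentile ->", adaptive)
```

With the fixed cutoff the terse document is never split and the rambling one is split after every sentence, while the percentile threshold isolates the single spike in both.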

Document-aware chunking

Structured documents carry explicit boundaries—Markdown headings, HTML <h1>–<h3>, code functions and classes. Split on structure first, then apply token limits inside each section.

Splitters by format

Format	Boundary signal	Tooling
Markdown	# / ## headers	LangChain MarkdownHeaderTextSplitter
HTML	Heading tags, article, section	BeautifulSoup, Unstructured.io
Code	AST: functions, classes, methods	tree-sitter, LangChain Language splitters
JSON / YAML	Top-level keys or array elements	Custom path-based chunker
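
For the HTML row, a dependency-free sketch using the standard library's `html.parser`, which only tokenizes markup and never runs scripts. The class name, the set of skipped tags, and the output shape are assumptions for illustration; BeautifulSoup or Unstructured handle malformed pages far more robustly.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3"}
SKIP = {"script", "style", "nav", "footer"}  # boilerplate that should not be embedded

class HeadingSplitter(HTMLParser):
    """Collect body text into one section per h1-h3 heading."""

    def __init__(self) -> None:
        super().__init__()
        self.sections: list[dict] = []
        self._heading = ""
        self._body: list[str] = []
        self._in_heading = False
        self._skip_depth = 0

    def flush(self) -> None:
        text = " ".join(" ".join(self._body).split())  # normalize whitespace
        if text:
            self.sections.append({"section": self._heading, "text": text})
        self._body = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif tag in HEADINGS and not self._skip_depth:
            self.flush()  # close the previous section
            self._heading, self._in_heading = "", True

    def handle_endtag(self, tag):
        if tag in SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_heading:
            self._heading = (self._heading + " " + data.strip()).strip()
        else:
            self._body.append(data)

def split_html(html: str) -> list[dict]:
    parser = HeadingSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit the final section
    return parser.sections

page = (
    "<nav>Home | Docs</nav><h2>Refunds</h2><p>Annual plans: 30 days.</p>"
    "<script>track()</script><h2>Shipping</h2><p>Free over $50.</p>"
    "<footer>Example Inc</footer>"
)
sections = split_html(page)
print(sections)
```

Each section can then go through a token-limited splitter, with the `section` value kept as metadata.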

from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
import ast

md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")],
)
token_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50, length_function=token_len)

def chunk_markdown(text: str, source: str) -> list[dict]:
    sections = md_splitter.split_text(text)
    out = []
    for i, doc in enumerate(sections):
        for j, sub in enumerate(token_splitter.split_documents([doc])):
            out.append({
                "id": f"{source}-{i}-{j}",
                "text": sub.page_content,
                "source": source,
                "section": sub.metadata.get("h2") or sub.metadata.get("h1", ""),
            })
    return out

def chunk_python_source(code: str, source: str) -> list[dict]:
    tree = ast.parse(code)
    chunks = []
    for node in tree.body:  # top-level definitions only; methods stay inside their class
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(code, node)
            chunks.append({"id": f"{source}-{node.name}", "text": segment, "source": source,
                           "symbol": node.name, "lineno": node.lineno})
    return chunks

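// JavaParser for code; JSoup for HTML.
// Sketch: TextChunk is assumed here to take a trailing metadata Map
// (the record in the fixed-size section has only four components).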
public List<TextChunk> chunkMarkdown(String md, String source) {
  MarkdownParser parser = MarkdownParser.builder().build();
  Node doc = parser.parse(md);
  List<TextChunk> chunks = new ArrayList<>();
  String currentH2 = "";
  for (Node node : doc.getChildren()) {
    if (node instanceof Heading h && h.getLevel() == 2)
      currentH2 = h.getText().toString();
    else if (node instanceof Paragraph p) {
      String text = p.getText().toString();
      for (TextChunk sub : recursiveSplitter.chunk(text, source)) {
        chunks.add(new TextChunk(sub.id(), sub.text(), sub.tokenCount(), source,
            Map.of("section", currentH2)));
      }
    }
  }
  return chunks;
}

Preserve heading text in chunk content or metadata so retrieved snippets include context: a chunk that reads “## Refunds” followed by “Annual plans: full refund within 30 days…” ranks better than an orphaned sentence without its section title.
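
One way to do this is to build the text sent to the embedder from the heading path plus the body, while storing the body unchanged. The sketch below assumes header metadata keyed "h1", "h2", "h3" as in the Markdown example; the function name and the " > " separator are arbitrary choices.

```python
def text_for_embedding(text: str, headers: dict[str, str]) -> str:
    """Prefix the chunk body with its heading path, e.g. 'Billing > Refunds'."""
    path = " > ".join(headers[k] for k in ("h1", "h2", "h3") if headers.get(k))
    return f"{path}\n{text}" if path else text

body = "Annual plans: full refund within 30 days of purchase."
embed_input = text_for_embedding(body, {"h1": "Billing", "h2": "Refunds"})
print(embed_input)
```

Keeping the stored text and the embedded text separate means the citation UI shows the original passage while the vector still carries the section context.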

💡 Pro Tip

For HTML docs, strip nav/footer boilerplate before chunking—otherwise every chunk repeats cookie banners and site chrome, polluting embeddings. Use readability extractors (trafilatura, Unstructured) in the ingest pipeline.

🔒 Security

HTML chunking must not execute scripts. Parse with a sandboxed parser; never eval embedded JS. Redact PII blocks (SSN patterns) before chunking so sensitive spans do not land in vector stores without ACL metadata.
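
A minimal sketch of pattern-based redaction for the SSN case mentioned above. It assumes the common dashed format only; unformatted numbers and other PII types need dedicated detectors, so treat this as a first filter rather than a guarantee. The placeholder token is an arbitrary choice.

```python
import re

# US SSN written as 3-2-4 digits with dashes, bounded so longer digit runs do not match.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact_ssn(text: str, placeholder: str = "[REDACTED_SSN]") -> str:
    """Replace dashed SSN-like spans before the text is chunked and embedded."""
    return SSN_PATTERN.sub(placeholder, text)

raw = "Customer SSN 123-45-6789 on file; callback number 555-123-4567."
clean = redact_ssn(raw)
print(clean)
```

Running redaction before the splitter matters because overlap would otherwise copy the same sensitive span into several chunks.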

Agentic chunking

An LLM decides chunk boundaries—given a passage, it returns split points, titles, or JSON segment objects aligned with semantic topics. Highest quality on messy unstructured text; highest cost and latency at ingest.

When it is worth the cost

Scenario	Agentic?	Rationale
10M support tickets/year	✗	LLM cost dominates; use recursive + metadata
500-page regulatory PDF (one-time)	✓	High stakes; manual review of LLM splits
M&A data room	✓	Messy scans; budget for quality
Internal wiki with clean Markdown	✗	Header splitter is cheaper and deterministic

from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class Segment(BaseModel):
    title: str
    start_char: int
    end_char: int

class ChunkPlan(BaseModel):
    segments: list[Segment]

def agentic_chunk(text: str, source: str) -> list[dict]:
    resp = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Split this document into coherent retrieval chunks. Return char offsets.\n\n{text[:12000]}"
        }],
        response_format=ChunkPlan,
    )
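    plan = resp.choices[0].message.parsed
    if plan is None:  # refusal or output that did not match the schema
        return []
    limit = min(len(text), 12000)  # offsets only make sense inside the excerpt sent above
    return [{
        "id": f"{source}-{i}",
        "text": text[s.start_char:s.end_char],
        "source": source,
        "section": s.title,
    } for i, s in enumerate(plan.segments)
      if 0 <= s.start_char < s.end_char <= limit]  # skip offsets the model got wrong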

// Spring AI ChatClient with JSON schema output
public List<TextChunk> agenticChunk(String text, String source) {
  ChunkPlan plan = chatClient.prompt()
      .user("Split into coherent RAG chunks with titles and char offsets:\n"
          + text.substring(0, Math.min(text.length(), 12000))) // substring throws on short input otherwise
      .call()
      .entity(ChunkPlan.class);
  List<TextChunk> out = new ArrayList<>();
  int i = 0;
  for (Segment s : plan.segments()) {
    out.add(new TextChunk(
        source + "-" + i++,
        text.substring(s.startChar(), s.endChar()),
        0, source,
        Map.of("section", s.title())));
  }
  return out;
}

💰 Cost

A 100-page doc ≈ 50K tokens input + structured output per pass—orders of magnitude more than recursive splitting. Batch docs offline; cache chunk plans by content hash; human-review splits for legal/compliance corpora before embed.

☸️ Operations

Run agentic chunking as a Kubernetes Job with rate limits and dead-letter queue—never inline on the upload API path. Failed LLM splits should fall back to recursive chunking so ingest does not stall.
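
The fallback behaviour can be sketched as an ordered list of chunkers tried in turn. Everything here is illustrative: `chunk_with_fallback`, the chunker names, and the stand-in chunkers in the usage lines are hypothetical, and a production worker would also record which policy produced each chunk.

```python
import logging
from typing import Callable

Chunker = Callable[[str, str], list[dict]]
log = logging.getLogger("ingest")

def chunk_with_fallback(
    text: str, source: str, chunkers: list[tuple[str, Chunker]]
) -> tuple[str, list[dict]]:
    """Try chunkers in order; return the name of the one that worked and its chunks."""
    for name, chunker in chunkers:
        try:
            chunks = chunker(text, source)
        except Exception:
            log.exception("chunker %s failed for %s", name, source)
            continue
        if chunks:
            return name, chunks
        log.warning("chunker %s returned no chunks for %s", name, source)
    raise RuntimeError(f"all chunkers failed for {source}")

# Stand-in chunkers for demonstration only.
def flaky_agentic(text: str, source: str) -> list[dict]:
    raise TimeoutError("LLM call timed out")

def simple_paragraphs(text: str, source: str) -> list[dict]:
    parts = [p for p in text.split("\n\n") if p.strip()]
    return [{"id": f"{source}-{i}", "text": p, "source": source} for i, p in enumerate(parts)]

used, result = chunk_with_fallback(
    "Refund rules.\n\nShipping rules.",
    "policy.txt",
    [("agentic", flaky_agentic), ("recursive", simple_paragraphs)],
)
print(used, len(result))
```

Returning the name of the chunker that succeeded lets the caller write the real policy into chunk metadata instead of assuming the first choice ran.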

Size guidelines

Optimal chunk size depends on query type and downstream LLM use. Short chunks improve precision for factoid Q&A; longer chunks preserve narrative for summarization.

Use case	Token range	Overlap	Splitter preference
Q&A / FAQ retrieval	256–512	10–15%	Sentence or header-aware
Summarization / narrative	1024–2048	5–10%	Paragraph / semantic groups
Code search	One function or class	Minimal; split mid-function only if huge	AST / tree-sitter
Tables / structured rows	Row group or full table	Header row repeated in each chunk	Custom—never mid-row
Chat / ticket threads	1–3 turns (~200–400)	Last turn of prev chunk	Turn-boundary splitter
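
For the table row in the guideline above, a sketch of row-group chunking that repeats the header in every chunk and never cuts inside a row. The pipe-joined text layout and the name `chunk_table` are assumptions; the right serialization depends on what your embedder handles well.

```python
def chunk_table(header: list[str], rows: list[list[str]], rows_per_chunk: int = 20) -> list[str]:
    """Group whole rows into chunks, each prefixed with the header row."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    head = " | ".join(header)
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start:start + rows_per_chunk]
        chunks.append("\n".join([head] + [" | ".join(r) for r in group]))
    return chunks

table_chunks = chunk_table(
    header=["Plan", "Refund window"],
    rows=[["Annual", "30 days"], ["Monthly", "None"], ["Trial", "Not applicable"]],
    rows_per_chunk=2,
)
for chunk in table_chunks:
    print(chunk, end="\n\n")
```

Because each chunk carries the column names, a retrieved row group can be read without the rest of the table.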

Code: chunk by function / class

Language	Unit	Max tokens if oversized	Metadata field
Python	FunctionDef, ClassDef	512 then recursive	symbol, lineno
Java	Method, class	512 then recursive	qualified_name
TypeScript	Function, class, export	512	file_path
Go	Function, struct	512	package

Tune with evals, not intuition: start at 512 for general docs, run recall@5 on 50 labeled questions, then sweep {256, 512, 768, 1024} and plot the curve—most corpora show a plateau then drop when chunks get too large.

🎯 Interview Tip

“What chunk size would you use?” — Answer with the use case: “For support FAQ I’d start at 400 tokens sentence-aware; for code, one function per chunk; I’d validate with recall@K on real queries before locking config.”

🏗️ Infrastructure

Expose chunk size and policy as Terraform variables or Helm values per environment: chunk_size_dev=256 for fast iteration, chunk_size_prod=512 after eval sign-off. This prevents an accidental prod re-index with different params.

Metadata

Every chunk carries metadata for filtered retrieval, citation UI, and debugging. Without source, page, and section, users cannot trust answers—and you cannot scope queries by date or tenant.

Essential fields

Field	Example	Used for
source	refund-policy.pdf	Citation label, dedupe on re-ingest
page	12	Deep link to PDF viewer
section	Refunds	Filtered search, breadcrumb in UI
updated_at	2026-03-15T10:00:00Z	Recency boost, stale chunk purge
acl / tenant_id	team_42	Pre-filter before ANN search
chunk_policy	recursive_v3_512	Audit which split produced vector

from datetime import datetime, timezone

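def enrich_metadata(chunk: dict, doc_meta: dict) -> dict:
    meta = {
        "source": chunk["source"],
        "page": doc_meta.get("page"),
        "section": chunk.get("section", ""),
        "updated_at": doc_meta.get("updated_at", datetime.now(timezone.utc).isoformat()),
        "tenant_id": doc_meta["tenant_id"],  # required: fail loudly if missing
        "tokens": token_len(chunk["text"]),
        "chunk_policy": "recursive_v3_512",
    }
    # Drop unset fields: some vector stores (Pinecone among them) reject null metadata values.
    return {k: v for k, v in meta.items() if v is not None}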

# Pinecone upsert
index.upsert(vectors=[{
    "id": c["id"],
    "values": embed(c["text"]),
    "metadata": enrich_metadata(c, doc_meta),
} for c in chunks])

# pgvector: metadata in JSONB column alongside embedding

public Map<String, Object> enrichMetadata(TextChunk chunk, DocMeta doc) {
  return Map.of(
      "source", chunk.source(),
      "page", doc.page(),
      "section", chunk.section(),
      "updated_at", doc.updatedAt().toString(),
      "tenant_id", doc.tenantId(),
      "tokens", chunk.tokenCount(),
      "chunk_policy", "recursive_v3_512"
  );
}

// Query with metadata filter (illustrative builder and filter syntax;
// Pinecone itself takes Mongo-style operators such as $eq and $gt)
SearchRequest req = SearchRequest.builder()
    .queryVector(queryEmb)
    .topK(10)
    .filter("tenant_id = 'team_42' AND updated_at > '2026-01-01'")
    .build();

📦 Real World

Perplexity-style citations show title + URL from metadata, not hallucinated links. Stripe’s internal doc search filters by team ACL at the vector store layer—chunks without tenant_id never enter the index.

🔒 Security

Metadata filters are part of authorization—do not rely on post-retrieval filtering alone. A misconfigured ANN query without tenant filter leaks cross-customer chunks into the LLM context.
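
The difference between filtering before and after ranking can be shown with an in-memory stand-in for a vector store. The rows, vectors, and function names below are made up for illustration; a real store applies the filter inside the ANN query.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank(query_vec: list[float], rows: list[dict]) -> list[dict]:
    return sorted(rows, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)

def search_prefiltered(query_vec, rows, tenant_id, k=2):
    """Restrict to the tenant first, then take the top k."""
    allowed = [r for r in rows if r["metadata"].get("tenant_id") == tenant_id]
    return rank(query_vec, allowed)[:k]

def search_postfiltered(query_vec, rows, tenant_id, k=2):
    """Anti-pattern: take the global top k, then drop other tenants' rows."""
    top = rank(query_vec, rows)[:k]
    return [r for r in top if r["metadata"].get("tenant_id") == tenant_id]

rows = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"tenant_id": "team_42"}},
    {"id": "b", "vector": [0.9, 0.1], "metadata": {"tenant_id": "team_7"}},
    {"id": "c", "vector": [0.0, 1.0], "metadata": {"tenant_id": "team_42"}},
    {"id": "d", "vector": [0.99, 0.05], "metadata": {}},  # missing tenant_id
]
query = [1.0, 0.0]
pre_ids = [r["id"] for r in search_prefiltered(query, rows, "team_42")]
post_ids = [r["id"] for r in search_postfiltered(query, rows, "team_42")]
print("pre-filter: ", pre_ids)
print("post-filter:", post_ids)
```

Post-filtering returns fewer than k results here because other tenants' rows crowd the global top k, and any code path that forgets the second step would hand those rows to the LLM.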

Parent–child chunks

Index small child chunks for precise retrieval; when a child hits, return the parent (section or page) to the LLM for generation. Best of both worlds: sharp search, rich context.

Pattern

Parent — logical unit (Markdown section, PDF page, full policy clause) stored in object storage or parent table.
Child — 128–256 token sub-splits embedded and indexed in the vector store.
Link — each child metadata includes parent_id.
Query — ANN on children → dedupe parent IDs → fetch parent text for prompt.

Layer	Typical size	Stored in	Role
Child	128–256 tokens	Vector DB (with embedding)	Similarity search
Parent	512–2048 tokens	Postgres / S3 / KV	LLM context assembly

from dataclasses import asdict

def build_parent_child(sections: list[dict]) -> tuple[list[dict], list[dict]]:
    parents, children = [], []
    for sec in sections:
        pid = f"{sec['source']}#{sec['section']}"
        parents.append({"id": pid, "text": sec["text"], "source": sec["source"]})
        for child in chunk_fixed(sec["text"], pid, size=256, overlap=32):
            row = asdict(child)  # chunk_fixed returns Chunk dataclasses, not dicts
            row["parent_id"] = pid
            children.append(row)
    return parents, children

def retrieve_with_parents(query: str, k: int = 5) -> list[str]:
    child_hits = vector_search(embed(query), k=k * 3)  # over-fetch
    parent_ids = list(dict.fromkeys(h["metadata"]["parent_id"] for h in child_hits))[:k]
    return [parent_store[pid]["text"] for pid in parent_ids]

public List<String> retrieveWithParents(float[] queryEmb, int k) {
  List<ChildHit> childHits = vectorStore.similaritySearch(queryEmb, k * 3);
  LinkedHashSet<String> parentIds = new LinkedHashSet<>();
  for (ChildHit hit : childHits) {
    parentIds.add(hit.metadata().get("parent_id"));
    if (parentIds.size() >= k) break;
  }
  return parentIds.stream().map(parentStore::getText).toList();
}

⚖️ Trade-off

Parent–child adds storage complexity and duplicate text (parent blob + child slices). Worth it when recall improves measurably on golden sets—especially for long policy sections where 512-token chunks dilute embeddings.

💡 Pro Tip

Deduplicate parent IDs before loading context—three child hits from the same section should not triple the parent text in the prompt.

What this looks like in production

Chunk policy is versioned configuration, not a notebook one-liner. Promote changes only after chunking eval on golden documents—labeled questions with expected chunk IDs or answer spans.

Chunking eval workflow

Golden docs — 20–50 representative sources (policies, APIs, runbooks) with known tricky boundaries.
Golden queries — real user questions; label gold_chunk_ids or acceptable parent IDs.
Sweep policies — fixed 512, sentence 400, recursive 512, header-aware; one variable at a time.
Metrics — recall@K, MRR, answer faithfulness when passed to LLM (optional end-to-end).
Promote — bump chunk_policy version; new collection; backfill job; alias swap.
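
Of the metrics listed, MRR is the one that rewards ranking the right chunk first rather than merely inside the top K. A minimal sketch, with hypothetical function names and hand-written example runs:

```python
def reciprocal_rank(gold_ids: set[str], retrieved_ids: list[str]) -> float:
    """1 / rank of the first relevant hit, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in gold_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs: list[tuple[set[str], list[str]]]) -> float:
    """Average reciprocal rank over (gold_ids, retrieved_ids) pairs."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank(gold, got) for gold, got in runs) / len(runs)

runs = [
    ({"a"}, ["a", "b"]),  # correct chunk ranked first  -> 1.0
    ({"c"}, ["x", "c"]),  # correct chunk ranked second -> 0.5
    ({"z"}, ["x", "y"]),  # correct chunk missing       -> 0.0
]
mrr = mean_reciprocal_rank(runs)
print(round(mrr, 3))
```

Two policies with the same recall@5 can differ noticeably in MRR, which matters when only the first one or two chunks fit the prompt.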

Ingest pipeline checklist

Token-based limits aligned with embedding model tokenizer—not raw character counts
Structure-aware split before overlap tuning (headers, AST, sentences)
Metadata: source, section, updated_at, ACL on every chunk
Content-hash skip for unchanged docs on re-ingest
Chunk policy logged in collection metadata and chunk rows
Fallback chain: agentic → recursive → fixed-size on failure
Metrics: chunks_per_doc histogram, embed cost, ingest latency p95
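
The content-hash skip from the checklist can be sketched as follows. The in-memory `seen` dict stands in for whatever table or key-value store the ingest workers share, and folding the chunk policy into the hash is a design assumption so that a policy bump forces re-chunking even when the text is unchanged.

```python
import hashlib

def content_hash(text: str, chunk_policy: str) -> str:
    """Hash the document text together with the policy that will chunk it."""
    payload = f"{chunk_policy}\n{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def needs_reingest(source: str, text: str, chunk_policy: str, seen: dict[str, str]) -> bool:
    """Return True (and record the new hash) when the doc or the policy changed."""
    digest = content_hash(text, chunk_policy)
    if seen.get(source) == digest:
        return False
    seen[source] = digest
    return True

seen: dict[str, str] = {}
first = needs_reingest("policy.txt", "Refund within 30 days.", "recursive_v3_512", seen)
repeat = needs_reingest("policy.txt", "Refund within 30 days.", "recursive_v3_512", seen)
edited = needs_reingest("policy.txt", "Refund within 14 days.", "recursive_v3_512", seen)
bumped = needs_reingest("policy.txt", "Refund within 14 days.", "sentence_v4_400", seen)
print(first, repeat, edited, bumped)
```

Unchanged documents are skipped, while either a text edit or a policy change triggers a fresh chunk and embed pass.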

Regression gate (CI)

Store golden queries in git; CI job chunks golden docs with candidate policy, embeds (or mocks), runs retrieval, fails if recall@5 drops more than 2 points vs baseline. Block merges that change chunk_size without updated eval artifacts.

def recall_at_k(gold_ids: set[str], retrieved_ids: list[str], k: int = 5) -> float:
    if not gold_ids:
        raise ValueError("gold_ids is empty: fix the label rather than scoring it")
    top = set(retrieved_ids[:k])
    return len(gold_ids & top) / len(gold_ids)

def eval_policy(chunks: list[dict], golden: list[dict], k: int = 5) -> float:
    scores = []
    for row in golden:
        hits = vector_search(embed(row["query"]), index=chunks, k=k)
        scores.append(recall_at_k(set(row["gold_chunk_ids"]), [h["id"] for h in hits], k))
    return sum(scores) / len(scores)

# Compare policies before prod promotion
for name, chunker in [("fixed512", chunk_fixed), ("sentence400", chunk_by_sentences)]:
    chunks = chunker(corpus_text, "golden.pdf")
    print(name, eval_policy(chunks, golden_queries))

public double recallAtK(Set<String> gold, List<String> retrieved, int k) {
  Set<String> top = retrieved.stream().limit(k).collect(Collectors.toSet());
  long hit = gold.stream().filter(top::contains).count();
  return (double) hit / gold.size();
}

@Test
void chunkPolicyRegression() {
  double baseline = evalPolicy(fixedChunker, goldenQueries);
  double candidate = evalPolicy(sentenceChunker, goldenQueries);
  assertThat(candidate).isGreaterThanOrEqualTo(baseline - 0.02);
}

⚠️ Pitfall

Changing chunk overlap alone changes chunk IDs, so your gold_chunk_ids labels become invalid. Label by content span or parent section ID, not ephemeral chunk indices, when policies churn frequently.
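
Labeling by content span can be sketched like this: store the answer's character offsets once, then derive gold chunk IDs for whatever policy is under test. The sketch assumes each chunk record carries `start` and `end` character offsets into the source document (fields the chunkers in this guide do not emit as written); the function name is hypothetical.

```python
def gold_ids_for_span(
    chunks: list[dict], span_start: int, span_end: int, require_whole: bool = True
) -> set[str]:
    """IDs of chunks that contain the labeled span (or merely overlap it)."""
    ids = set()
    for c in chunks:
        if require_whole:
            hit = c["start"] <= span_start and span_end <= c["end"]
        else:
            hit = c["start"] < span_end and span_start < c["end"]
        if hit:
            ids.add(c["id"])
    return ids

# The same 100-character document chunked under two hypothetical policies.
policy_a = [{"id": "a-0", "start": 0, "end": 50}, {"id": "a-1", "start": 50, "end": 100}]
policy_b = [
    {"id": "b-0", "start": 0, "end": 40},
    {"id": "b-1", "start": 30, "end": 70},
    {"id": "b-2", "start": 60, "end": 100},
]
answer_span = (35, 45)  # labeled once, independent of any chunk policy

gold_a = gold_ids_for_span(policy_a, *answer_span)
gold_b = gold_ids_for_span(policy_b, *answer_span)
gold_b_overlap = gold_ids_for_span(policy_b, *answer_span, require_whole=False)
print(gold_a, gold_b, gold_b_overlap)
```

An empty result under `require_whole=True` is itself a useful signal: it means the candidate policy splits the answer across chunks.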

📦 Real World

Teams store CHUNK_POLICY=v4_sentence_400 in config; ingest workers read it; PRs that modify chunk config must attach recall@5 from the regression set—same discipline as prompt versioning.

⚙️ Config

Example prod config block:

chunk_policy: markdown_header_recursive
child_size: 256
parent_max: 2048
collection: support_kb_v7