Long context windows do not mean long reliable context. This guide covers lost-in-the-middle placement, RAG assembly strategies, multi-turn memory, a ~5000-token budget table, LLMLingua compression, and the production audit checklist platform teams use before prompts hit peak traffic.
After reading, you should be able to: explain U-shaped recall and optimize RAG chunk order; implement relevance, recency, and hierarchical assembly; combine sliding windows, summarization, entity memory, and vector memory; enforce a five-bucket token budget with response reserve; choose compression vs retrieve-less with LLMLingua guardrails; and run a context audit before release.
Lost in the middle: U-shaped attention and RAG placement
Lost in the middle (Liu et al., 2023) shows LLMs recall facts at the start and end of long contexts better than the middle. For RAG, that means chunk order is not cosmetic—it changes faithfulness.
You retrieved five perfect passages and concatenated them by relevance score. The model cited chunk five and ignored chunk two, even though chunk two contained the refund policy clause. That is not a retrieval failure; it is a context placement failure driven by uneven attention over long prompts.
The U-shaped recall curve
Across many models and tasks, accuracy vs. position in context follows a U-shape:
Primacy — early tokens (system prompt, first RAG chunks) get strong attention.
Recency — tokens just before the user question get strong attention.
Middle slump — facts buried between ~20% and ~80% of context length are recalled least reliably.
| Position in context | Typical recall | RAG implication |
| --- | --- | --- |
| First 10–15% | High | Put policy blocks, safety rules, and highest-confidence citations here |
| Middle 50–70% | Lowest | Avoid placing the sole copy of critical facts here |
| Last 10–15% (before query) | High | Repeat key constraints; place user question adjacent to freshest evidence |
RAG-specific placement tactics
Sandwich critical chunks — duplicate the top-1 passage at start and end if it is the only source for a compliance answer.
Reorder after rerank — do not stuff by raw similarity score alone; interleave high-signal chunks at boundaries.
Shorter context beats clever order — if you need six chunks and only two matter, drop four instead of hoping middle attention improves.
Separate “rules” from “evidence” — system message holds immutable policy; RAG block holds citations—never mix so rules drift to the middle.
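One way to read "interleave high-signal chunks at boundaries" is to alternate ranked chunks between the front and the back of the prompt, so the weakest chunks land in the middle. The sketch below assumes the input is already sorted best-first by the reranker; the function name `edge_first_order` is hypothetical, and this is only one possible policy.

```python
def edge_first_order(ranked: list[str]) -> list[str]:
    """Alternate chunks between the front and the back of the prompt.

    `ranked` is assumed to be sorted best-first (for example by reranker score).
    Rank 1 goes first, rank 2 goes last, rank 3 goes second, and so on, so the
    weakest chunks end up in the middle of the assembled context.
    """
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(ranked):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]


print(edge_first_order(["r1", "r2", "r3", "r4", "r5"]))
# ['r1', 'r3', 'r5', 'r4', 'r2']
```

Whether this beats strict relevance order depends on the model and prompt length, so it is worth checking against an ordering ablation on your own golden set.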
Measuring placement impact
Run an ablation eval: same retrieval set, three orderings (relevance, reverse, random). If faithfulness swings >10 points,
your pipeline is order-sensitive—fix assembly before buying a bigger context window.
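The ablation above can be scripted in a few lines. This is a minimal sketch, assuming the retrieval set arrives best-first and that you already have a way to score faithfulness per ordering; the helper names and the example scores are illustrative, not measured results.

```python
import random


def ablation_orderings(chunks: list[str], seed: int = 0) -> dict[str, list[str]]:
    """Build the three orderings from one retrieval set (assumed best-first)."""
    shuffled = chunks[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the run repeatable
    return {
        "relevance": chunks[:],
        "reverse": chunks[::-1],
        "random": shuffled,
    }


def faithfulness_swing(scores: dict[str, float]) -> float:
    """Spread between the best and worst ordering, in percentage points."""
    return max(scores.values()) - min(scores.values())


def is_order_sensitive(scores: dict[str, float], threshold: float = 10.0) -> bool:
    return faithfulness_swing(scores) > threshold


# Illustrative numbers only; real scores come from your own golden set.
example = {"relevance": 82.0, "reverse": 74.0, "random": 69.0}
print(faithfulness_swing(example), is_order_sensitive(example))  # 13.0 True
```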
Chunk reordering for boundary emphasis
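# Python
def sandwich_context(chunks: list[str], max_tokens: int, tokenizer) -> str:
    """Place highest-relevance chunk at start AND end; fill middle with remainder."""
    if not chunks:
        return ""
    primary, *rest = chunks
    primary_tokens = len(tokenizer.encode(primary))
    if primary_tokens > max_tokens:
        return ""
    # Reserve room for the closing copy up front so middle chunks cannot crowd it out.
    repeat = bool(rest) and 2 * primary_tokens <= max_tokens
    used = primary_tokens * (2 if repeat else 1)
    middle = []
    for c in rest:
        t = len(tokenizer.encode(c))
        if used + t > max_tokens:
            break
        middle.append(c)
        used += t
    # Repeat the primary chunk only when something actually sits between the copies.
    out = [primary] + middle + ([primary] if repeat and middle else [])
    return "\n\n---\n\n".join(out)  # separator tokens are not counted against max_tokens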
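// Java (Tokenizer is an assumed interface exposing count(String))
public String sandwichContext(List<String> chunks, int maxTokens, Tokenizer tok) {
    if (chunks.isEmpty()) return "";
    var primary = chunks.get(0);
    var rest = chunks.subList(1, chunks.size());
    int primaryTokens = tok.count(primary);
    if (primaryTokens > maxTokens) return "";
    // Reserve room for the closing copy up front so middle chunks cannot crowd it out.
    boolean repeat = !rest.isEmpty() && 2 * primaryTokens <= maxTokens;
    int used = primaryTokens * (repeat ? 2 : 1);
    var out = new ArrayList<String>();
    out.add(primary);
    for (var c : rest) {
        int t = tok.count(c);
        if (used + t > maxTokens) break;
        out.add(c);
        used += t;
    }
    // Repeat the primary chunk only when something actually sits between the copies.
    if (repeat && out.size() > 1) out.add(primary);
    return String.join("\n\n---\n\n", out);
}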
🔬 Under the Hood
Attention is not uniform over sequence length—training data bias (recent tokens matter for next-token prediction) and positional encoding limits create the U-curve. Long-context models reduce but do not eliminate the effect; eval on your average stuffed prompt length.
⚠️ Pitfall
Assuming a 128K window means 128K usable facts. Effective reliable context for multi-doc QA is often 8K–32K depending on model and task—test with needle-in-haystack probes at production token counts.
🎯 Interview Tip
“Why did RAG miss an obvious chunk?” — Check retrieval first, then ordering (lost in the middle), then compression artifacts. Mention sandwiching and shorter context as fixes.
📦 Real World
Legal copilots reorder clauses so defined terms appear in the first and last RAG slots; middle holds supporting case law only. Citation rate on defined terms jumped 18% vs relevance-only ordering in one internal A/B.
FAQ: How big should the RAG bucket be?
Start at 40–45% of your working input budget. If faithfulness fails on multi-doc questions, improve rerank before expanding RAG at the expense of history.
FAQ: Does prompt caching change budgeting?
Cached system tokens still occupy context window slots—they are cheaper in dollars but not in attention. See model selection & cost for cache economics.
FAQ: Entity memory vs fine-tuning?
Entity memory updates per session instantly; fine-tuning is for stable style and format. Use memory for user-specific facts fine-tuning should never see (PII).
Needle-in-haystack eval protocol
Build a synthetic eval: insert a unique “needle” sentence (e.g. NEEDLE-7f3a: refund cap is $250) into retrieved chunks at positions 1, 3, 5, and 8. Measure answer rate when user asks for the needle fact. Plot accuracy vs position—you should see a U-curve within 30 minutes of scripting.
| Model family | Middle penalty (typical) | Mitigation priority |
| --- | --- | --- |
| GPT-4 class | 15–25% vs edges | Sandwich + shorten |
| Claude 3.5+ | 10–18% | XML source tags + recap |
| Open-weight 32K | 25–40% | Aggressive retrieve-less |
Interaction with citations
When forcing models to cite chunk IDs, place the citation instruction in the system message and repeat cite [source_id] in the final user wrapper. Models cite edge chunks more often—if the true source sat in the middle, users see wrong citations even when prose is partially correct.
Needle position benchmark
# Python
NEEDLE = "NEEDLE-7f3a: refund cap is $250 for enterprise tier."

def needle_eval(chunks: list[str], position: int, llm, ask: str) -> bool:
    test = chunks[:]
    test.insert(position, NEEDLE)
    context = "\n\n".join(test)
    ans = llm.answer(context, ask)  # llm is an assumed client with answer(context, question)
    # Check the fact only; a correct answer rarely repeats the needle ID.
    return "250" in ans

# Insert points: start, middle, and end of the base chunks.
positions = [0, len(base_chunks) // 2, len(base_chunks)]
for p in positions:
    print(p, needle_eval(base_chunks, p, llm, "What is the enterprise refund cap?"))
// Java (LlmClient is an assumed interface exposing answer(context, question))
String NEEDLE = "NEEDLE-7f3a: refund cap is $250 for enterprise tier.";

boolean needleEval(List<String> chunks, int position, LlmClient llm, String ask) {
    var test = new ArrayList<>(chunks);
    test.add(position, NEEDLE);
    var ans = llm.answer(String.join("\n\n", test), ask);
    // Check the fact only; a correct answer rarely repeats the needle ID.
    return ans.contains("250");
}
Assembly strategies: relevance, recency, and hierarchy
How you assemble retrieved chunks, summaries, and conversation history determines answer quality as much as retrieval itself. Production systems combine relevance ordering, recency bias, and hierarchical summary + detail layouts.
ContextBuilder is the underrated stage between rerank and generate. Teams obsess over embed models;
they ship with "\n\n".join(chunks) and wonder why answers feel stale or incomplete.
Assembly is where you encode what the model should read first and what can be compressed.
Relevance ordering
Default: sort by reranker score descending. Variants:
| Strategy | Order | When to use |
| --- | --- | --- |
| Strict relevance | Best → worst | Short context (<8K), single-fact lookup |
| Boundary sandwich | Best, middle fillers, best repeat | Compliance, one golden source |
| Diversity interleave | Round-robin across sources | Multi-doc synthesis, avoid one doc dominating |
| Metadata-first | Group by doc type (policy → examples → logs) | Support playbooks with layered content |
Recency bias
For time-sensitive corpora (incidents, pricing, release notes), boost recent documents in assembly even if embedding score is slightly lower:
Apply score' = score + λ · exp(-age_days / τ) before final ordering.
Pin “current version” doc IDs in the first slot regardless of score.
Stamp each chunk with indexed_at in the prompt so the model can reason about freshness.
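A small worked example makes the recency formula concrete. The values λ = 0.15 and τ = 30 days are the ones this guide's assembly pipeline uses; the two chunk scores and ages are made up for illustration.

```python
import math


def recency_adjusted(score: float, age_days: float, lam: float = 0.15, tau: float = 30.0) -> float:
    """score' = score + lam * exp(-age_days / tau)."""
    return score + lam * math.exp(-age_days / tau)


# A slightly weaker but fresh chunk can overtake an older one.
old = recency_adjusted(0.80, age_days=180)
fresh = recency_adjusted(0.72, age_days=2)
print(round(old, 3), round(fresh, 3), fresh > old)
```

With these inputs the six-month-old chunk gains almost nothing from the boost, while the two-day-old chunk gains close to the full λ, which is enough to flip the order.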
Hierarchical summary + details
Borrowed from RAPTOR and GraphRAG: give the model a map before terrain.
Executive summary block — 200–400 tokens synthesizing all retrieved themes (generated offline or at query time).
Section headers — XML or markdown headers per source: <source id="...">.
Leaf details — full chunks under headers; trim leaves if over budget before dropping summary.
Context assembly pipeline
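# Python
import math
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Chunk:
    text: str
    score: float
    source_id: str
    indexed_at: datetime  # assumed timezone-aware (UTC)

def assemble(chunks: list[Chunk], budget: int, tok) -> str:
    now = datetime.now(timezone.utc)

    def rank(c: Chunk) -> float:
        age = (now - c.indexed_at).days
        recency = math.exp(-age / 30.0)   # tau = 30 days
        return c.score + 0.15 * recency   # lambda = 0.15

    ranked = sorted(chunks, key=rank, reverse=True)
    # summarize_themes is an assumed helper (offline job or extra LLM call).
    summary = summarize_themes([c.text[:500] for c in ranked[:6]])
    # Tag names below are a reconstruction; adapt them to your prompt format.
    parts = [f"<summary>\n{summary}\n</summary>"]
    used = len(tok.encode(parts[0]))
    for c in ranked:
        block = (
            f'<source id="{c.source_id}" indexed_at="{c.indexed_at.date().isoformat()}">'
            f"\n{c.text}\n</source>"
        )
        t = len(tok.encode(block))
        if used + t > budget:
            break
        parts.append(block)
        used += t
    return "\n".join(parts)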
// Java (Chunk is an assumed record; summarizeThemes and recencyBoost are assumed helpers)
public String assemble(List<Chunk> chunks, int budget, Tokenizer tok) {
    var now = Instant.now();
    var ranked = chunks.stream()
        .sorted(Comparator.comparingDouble(
            (Chunk c) -> c.score() + 0.15 * recencyBoost(c.indexedAt(), now)).reversed())
        .toList();
    var summary = summarizeThemes(ranked.stream()
        .limit(6)
        .map(c -> c.text().substring(0, Math.min(500, c.text().length())))
        .toList());
    // Tag names below are a reconstruction; adapt them to your prompt format.
    var parts = new ArrayList<String>();
    parts.add("<summary>\n" + summary + "\n</summary>");
    int used = tok.count(parts.get(0));
    for (var c : ranked) {
        var block = "<source id=\"" + c.sourceId() + "\" indexed_at=\"" + c.indexedAt() + "\">\n"
            + c.text() + "\n</source>";
        int t = tok.count(block);
        if (used + t > budget) break;
        parts.add(block);
        used += t;
    }
    return String.join("\n", parts);
}
💡 Pro Tip
Log assembled context hash + chunk IDs per request. When users report wrong answers, replay assembly without re-running retrieval—isolates builder bugs in minutes.
⚖️ Trade-off
Query-time executive summaries cost an extra LLM call (~300–800ms). Precompute per-collection summaries at ingest for stable corpora; generate live only for dynamic multi-tool retrieval.
💰 Cost
Hierarchical assembly with live summarization adds 500–1500 input tokens per query. Cache summaries keyed by hash(chunk_ids) for repeat questions in the same session.
Anti-patterns in context assembly
Alphabetical source order — arbitrary; hides relevance signal from the model.
Chronological only — old but irrelevant docs crowd out fresh policy unless recency-weighted.
Single blob without headers — model cannot attribute facts; citation rate collapses.
Duplicate near-identical chunks — wastes tokens; use dedupe by minhash before assembly.
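For the near-duplicate case, a small sketch shows the idea. Note that it uses exact Jaccard similarity over word shingles rather than MinHash; MinHash approximates the same similarity and becomes worthwhile when the chunk count is large. The 0.8 threshold and the function names are assumptions.

```python
def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams; short texts fall back to a single shingle."""
    words = text.lower().split()
    if len(words) <= size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dedupe_chunks(chunks: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first of any group of near-identical chunks (input assumed best-first)."""
    kept: list[str] = []
    kept_shingles: list[set] = []
    for chunk in chunks:
        sig = shingles(chunk)
        if any(jaccard(sig, seen) >= threshold for seen in kept_shingles):
            continue  # near-duplicate of a higher-ranked chunk
        kept.append(chunk)
        kept_shingles.append(sig)
    return kept
```

Because the input is assumed to be sorted best-first, the copy that survives is always the higher-ranked one.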
XML vs markdown wrappers
| Format | Pros | Cons |
| --- | --- | --- |
| XML tags | Familiar to Claude models; strict boundaries | Verbose token overhead |
| Markdown headers | Compact; readable in logs | Weaker boundary on messy HTML RAG |
| JSON array | Machine-parseable | Higher escape overhead for prose |
🎯 Interview Tip
“How do you assemble RAG context?” — Relevance + recency scoring, hierarchical summary, XML boundaries, dedupe, log assembly hash, lost-in-the-middle ordering.
Conversation management: windows, summaries, and memory
Multi-turn products must decide what to keep, compress, or retrieve from memory. Combine sliding windows, rolling summarization, structured entity memory, and vector memory for long sessions without blowing the token budget.
Every turn resends full chat history unless you trim. A 40-turn support thread can exceed 30K tokens before RAG arrives—
starving retrieval budget and triggering silent middle-attention loss. Conversation management is budget allocation across time.
Sliding window
Keep system prompt + last N turns or last T tokens. Simple, predictable, loses early facts.
| Variant | Keeps | Drops | Risk |
| --- | --- | --- | --- |
| Turn window | Last 6 user/assistant pairs | Older dialogue | User references message 1 |
| Token window | Newest messages until budget | Oldest variable content | Splits mid-conversation tool results |
| Role-priority | System + tools + recent user | Verbose assistant monologues | May drop useful assistant caveats |
Summarization
When the window slides, compress dropped turns into a rolling conversation_summary system message.
Refresh summary every K turns or when history exceeds 50% of conversation budget.
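The refresh rule can be expressed as a small predicate. This is a sketch; the default of K = 6 turns is an assumption, while the 50% fill ratio comes from the text above.

```python
def should_refresh_summary(
    turns_since_refresh: int,
    history_tokens: int,
    conversation_budget: int,
    every_k_turns: int = 6,
    fill_ratio: float = 0.5,
) -> bool:
    """Refresh every K turns, or sooner once history passes the fill ratio."""
    if turns_since_refresh >= every_k_turns:
        return True
    return history_tokens > fill_ratio * conversation_budget


print(should_refresh_summary(turns_since_refresh=2, history_tokens=600, conversation_budget=1000))
# True
```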
Entity memory
Structured store of facts the user stated or the system extracted: account ID, product SKU, open ticket, preferences.
Inject as a compact JSON or bullet block—far denser than raw chat logs.
Extract with tool call or structured output after each user turn.
Merge with CRDT-style last-write-wins per key.
Cap entity block (e.g. 800 tokens); evict low-salience keys by LRU.
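The merge, cap, and eviction rules above fit naturally on an ordered dictionary. The sketch below is one possible shape, not a reference implementation: the class name is hypothetical, recency of use stands in for salience, and the default token counter is a rough four-characters-per-token estimate that should be swapped for your provider's tokenizer.

```python
import json
from collections import OrderedDict


class EntityMemory:
    """Last-write-wins entity store with a token cap and LRU eviction."""

    def __init__(self, max_tokens: int = 800, count_tokens=None):
        # Assumption: ~4 characters per token stands in for a real tokenizer.
        self._count = count_tokens or (lambda s: max(1, len(s) // 4))
        self._max_tokens = max_tokens
        self._facts = OrderedDict()

    def upsert(self, key: str, value: str) -> None:
        self._facts[key] = value        # last write wins per key
        self._facts.move_to_end(key)    # mark as most recently used
        self._evict()

    def touch(self, key: str) -> None:
        """Call when a fact is actually used, so it is not evicted early."""
        if key in self._facts:
            self._facts.move_to_end(key)

    def render(self) -> str:
        """Compact JSON block to inject into the prompt."""
        return json.dumps(self._facts, ensure_ascii=False)

    def _evict(self) -> None:
        # Drop least recently used keys until the rendered block fits the cap.
        while len(self._facts) > 1 and self._count(self.render()) > self._max_tokens:
            self._facts.popitem(last=False)
```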
Vector memory
Embed past turns or summaries; retrieve top-k memories relevant to current query. Scales to months of history without linear token growth.
On each turn, embed user + assistant summary snippet.
Store in per-user namespace with TTL and consent flags.
At query time, retrieve 3–5 memory hits and prepend under <prior_context>.
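The three steps above can be sketched with a toy in-memory store. Everything here is illustrative: embeddings are assumed to be computed elsewhere and passed in as plain float lists, TTL and consent flags are omitted for brevity, and a real deployment would use a vector database with per-tenant namespaces.

```python
import math
from collections import defaultdict


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VectorMemory:
    """Toy per-user memory store with cosine-similarity search."""

    def __init__(self):
        # One namespace per user so a search can never cross tenants.
        self._by_user = defaultdict(list)

    def add(self, user_id: str, text: str, embedding: list[float]) -> None:
        self._by_user[user_id].append((embedding, text))

    def search(self, user_id: str, query_embedding: list[float], k: int = 4) -> list[str]:
        scored = [(cosine(query_embedding, emb), text)
                  for emb, text in self._by_user.get(user_id, [])]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]

    def forget(self, user_id: str) -> None:
        """Delete the whole namespace for a user."""
        self._by_user.pop(user_id, None)


def format_prior_context(memories: list[str]) -> str:
    body = "\n".join(f"- {m}" for m in memories)
    return f"<prior_context>\n{body}\n</prior_context>"
```

Scoping the lookup by `user_id` inside `search` means tenant isolation does not depend on callers remembering to filter.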
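// Java (Message, MemoryIndex, toJson, and formatMemories are assumed helpers)
public class ConversationMemory {
    private final List<Message> messages = new ArrayList<>();
    private String summary = "";
    private final Map<String, String> entities = new LinkedHashMap<>();
    private int turnCount;
    private final int maxHistoryTokens;
    private final int summaryEvery;

    public ConversationMemory(int maxHistoryTokens, int summaryEvery) {
        this.maxHistoryTokens = maxHistoryTokens;
        this.summaryEvery = summaryEvery;
    }

    public List<Message> buildContext(String query, MemoryIndex index) {
        var memHits = index.search(query, 4);  // top-k vector memories for this query
        var blocks = new ArrayList<Message>();
        blocks.add(Message.system("Entities:\n" + toJson(entities)));
        blocks.add(Message.system("Prior summary:\n" + summary));
        blocks.add(Message.system(formatMemories(memHits)));
        blocks.addAll(messages);  // sliding-window history goes last, closest to the query
        return blocks;
    }
}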
🔒 Security
Vector memory must be tenant-scoped. Never retrieve another user's embeddings—even if queries are similar. Encrypt memory at rest; honor deletion within 30 days for GDPR erasure requests.
📦 Real World
B2B copilots often use entity memory for account context (fixed 500 tokens) + vector memory for past meetings + sliding window for the current session—keeps p95 prompt size stable across 200+ turn threads.
🎯 Interview Tip
“How do you handle long chat history?” — Sliding window baseline, summarization for dropped content, structured entity store for IDs/facts, vector memory for scale. Mention token budget table.
Tool result context
Agentic flows stuff tool JSON into history—often the fastest budget overrun. Rules:
Truncate tool payloads to schema-relevant fields before append.
Summarize large tool results in a side channel message, not raw dump.
Expire tool results from history after the turn unless user references them.
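A minimal sketch of the first and third rules follows. The message fields `turn` and `tool_call_id` are assumptions about how history is stored, and the hard character cap is a blunt fallback: a truncated payload is no longer valid JSON, so prefer selecting fewer fields where possible.

```python
import json


def prune_tool_result(payload: dict, keep_fields: list[str], max_chars: int = 600) -> str:
    """Keep only the fields the model needs, then hard-cap the serialized size."""
    slim = {k: payload[k] for k in keep_fields if k in payload}
    text = json.dumps(slim, separators=(",", ":"), ensure_ascii=False)
    if len(text) > max_chars:
        text = text[:max_chars] + "...[truncated]"
    return text


def expire_tool_messages(history: list[dict], current_turn: int, referenced_ids: set[str]) -> list[dict]:
    """Drop tool messages from earlier turns unless the user referenced them."""
    return [
        m for m in history
        if m.get("role") != "tool"
        or m.get("turn") == current_turn
        or m.get("tool_call_id") in referenced_ids
    ]


raw = {"order_id": "O-1", "status": "shipped", "debug_trace": "x" * 5000}
print(prune_tool_result(raw, ["order_id", "status"]))
# {"order_id":"O-1","status":"shipped"}
```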
Memory consent and retention
| Memory type | Default TTL | User control |
| --- | --- | --- |
| Sliding window | Session | Clear chat |
| Rolling summary | 30 days | Export + delete |
| Entity store | Account lifetime | Per-field opt-out |
| Vector memory | 90 days | Forget me deletes namespace |
Sliding token window trimmer
# Python
def trim_to_budget(messages: list[dict], budget: int, tok) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def count(msgs):
        # +4 per message approximates chat-format overhead; the exact figure varies by model.
        return sum(len(tok.encode(m["content"])) for m in msgs) + 4 * len(msgs)

    kept = []
    for msg in reversed(rest):  # walk newest to oldest
        if count(system + [msg] + kept) > budget:
            break
        kept.insert(0, msg)  # keep chronological order
    return system + kept
// Java (assumes Tokenizer.count(List<Message>) sums tokens across messages)
public List<Message> trimToBudget(List<Message> messages, int budget, Tokenizer tok) {
    var system = messages.stream().filter(m -> "system".equals(m.role())).toList();
    var rest = messages.stream().filter(m -> !"system".equals(m.role())).toList();
    var kept = new ArrayList<Message>();
    for (int i = rest.size() - 1; i >= 0; i--) {  // walk newest to oldest
        var trial = new ArrayList<Message>(system);
        trial.add(rest.get(i));
        trial.addAll(kept);
        if (tok.count(trial) > budget) break;
        kept.add(0, rest.get(i));  // keep chronological order
    }
    var out = new ArrayList<Message>(system);
    out.addAll(kept);
    return out;
}
Token budget: allocating ~5000 tokens
Treat context as a fixed pie. For a typical 8K–16K effective window, this guide uses a ~5000 token working budget for input assembly—reserving room for model overhead, tools, and completion.
Models advertise 128K–200K context. Production prompts should rarely use it all: cost scales linearly with input tokens, latency rises, and lost-in-the-middle effects worsen. Platform teams publish an internal budget table every product must fit.
Reference budget (~5000 input tokens)
| Bucket | Tokens | % of budget | Contents |
| --- | --- | --- | --- |
| System + policy | 600–900 | 12–18% | Role, safety rules, output format, tool policy |
| RAG / knowledge | 1800–2400 | 36–48% | Retrieved chunks, citations, collection preamble |
| Conversation history | 800–1200 | 16–24% | Recent turns + rolling summary + entity block |
| Current user message | 200–600 | 4–12% | Question, attachments excerpt, structured form fields |
| Response buffer | 800–1200 | 16–24% | Reserved max_tokens; not sent as input but counts toward window |
| Overhead / tools | 200–400 | 4–8% | JSON schemas, few-shot examples, markup wrappers |
Adjusting by product shape
| Product | Shift tokens toward | Shrink |
| --- | --- | --- |
| FAQ bot | RAG (3000+) | History (400) |
| Long support thread | History + summary (1500) | RAG (1500) with better retrieval |
| Code assistant | User paste + tool results | RAG unless docs mode on |
| Agent with tools | Tool schemas + scratchpad | RAG (retrieve on demand) |
Enforcement in code
Count tokens before the API call. Fail closed: trim RAG first (lowest salience chunks), then history, never system policy without explicit version bump.
Token budget enforcer
# Python
from dataclasses import dataclass

class BudgetError(Exception):
    """Raised when a request does not fit its token budget."""

@dataclass
class TokenBudget:
    total: int = 5000
    system: int = 800
    rag: int = 2200
    history: int = 1000
    user: int = 400
    response_reserve: int = 1000

    def validate(self, counts: dict[str, int], model_limit: int) -> None:
        input_sum = sum(counts.values())
        if input_sum + self.response_reserve > model_limit:
            raise BudgetError(f"input {input_sum} + reserve {self.response_reserve} > {model_limit}")
        for key, cap in [("system", self.system), ("rag", self.rag), ("history", self.history), ("user", self.user)]:
            if counts.get(key, 0) > cap:
                raise BudgetError(f"{key} {counts[key]} exceeds cap {cap}")

def fit_rag(chunks: list[tuple[str, float]], cap: int, tok) -> list[str]:
    """chunks are (text, rerank_score) pairs; keep the best ones that fit under cap."""
    kept, used = [], 0
    for text, _score in sorted(chunks, key=lambda pair: pair[1], reverse=True):
        t = len(tok.encode(text))
        if used + t > cap:
            continue  # skip this chunk but still try smaller, lower-ranked ones
        kept.append(text)
        used += t
    return kept
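// Java (BudgetException is an assumed unchecked exception)
public record TokenBudget(int total, int system, int rag, int history, int user, int responseReserve) {
    public void validate(Map<String, Integer> counts, int modelLimit) {
        int inputSum = counts.values().stream().mapToInt(Integer::intValue).sum();
        if (inputSum + responseReserve > modelLimit)
            throw new BudgetException("over model limit");
        // Check every bucket against its cap, not only RAG.
        var caps = Map.of("system", system, "rag", rag, "history", history, "user", user);
        for (var e : caps.entrySet()) {
            if (counts.getOrDefault(e.getKey(), 0) > e.getValue())
                throw new BudgetException(e.getKey() + " over cap");
        }
    }
}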
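The fail-closed trimming order from "Enforcement in code" (RAG first, then history, never system policy) could be sketched as follows. The bucket layout, the function name, and the local `BudgetError` class are assumptions made so the example is self-contained.

```python
class BudgetError(Exception):
    """Raised when the prompt cannot fit without touching system policy."""


def total_tokens(buckets: dict) -> int:
    return sum(tokens for items in buckets.values() for _, tokens in items)


def trim_to_input_budget(buckets: dict, input_budget: int):
    """Trim RAG first, then history; never trim system policy or the user message.

    `buckets` maps a bucket name to a list of (text, token_count) pairs.
    Assumptions: "rag" is sorted best-first and "history" oldest-first.
    Returns the trimmed buckets plus a log of which bucket each drop came from.
    """
    trimmed = {name: list(items) for name, items in buckets.items()}
    dropped: list[str] = []
    while total_tokens(trimmed) > input_budget and trimmed.get("rag"):
        trimmed["rag"].pop()          # lowest-salience chunk sits last
        dropped.append("rag")
    while total_tokens(trimmed) > input_budget and trimmed.get("history"):
        trimmed["history"].pop(0)     # oldest turn goes first
        dropped.append("history")
    if total_tokens(trimmed) > input_budget:
        raise BudgetError("system + user alone exceed the input budget")
    return trimmed, dropped
```

Returning the drop log alongside the trimmed buckets gives the debug trace a direct answer to "which bucket was trimmed".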
💰 Cost
At $2.50/1M input tokens, shaving 2000 tokens per request × 1M requests/month saves ~$5000/month. Budget tables turn cost conversations into engineering tickets with clear levers.
💡 Pro Tip
Expose bucket fill levels in debug UI for internal users—when RAG is consistently at 98% cap, retrieval is returning too much noise; tighten k or add compression.
⚠️ Pitfall
Forgetting tool definitions in the budget. A 12-tool agent can burn 3000+ tokens before the user speaks—account for tools in system/overhead or use dynamic tool loading.
Worked example: support bot turn
Model limit 8192; target input 5000; reserve 1000 for completion.
| Bucket | Actual | Cap | Action |
| --- | --- | --- | --- |
| System | 780 | 800 | OK |
| RAG | 2580 | 2200 | Drop lowest rerank 2 chunks |
| History | 920 | 1000 | OK |
| User | 310 | 400 | OK |
Dynamic reallocation
When user message is tiny (50 tokens), borrow 200 tokens into RAG for that request only—log budget_borrowed=true for analytics. Never borrow from response reserve.
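A per-request borrow can be kept very small. The sketch below assumes the default caps from this guide's budget (400 for the user message, 2200 for RAG) and a 200-token borrow limit; the function name and return shape are hypothetical.

```python
def borrow_for_rag(user_tokens: int, user_cap: int = 400, rag_cap: int = 2200, max_borrow: int = 200) -> dict:
    """Lend unused user-message budget to RAG for one request only.

    The response reserve is not an input here, so it can never be borrowed from.
    """
    unused = max(0, user_cap - user_tokens)
    borrowed = min(unused, max_borrow)
    return {
        "rag_cap": rag_cap + borrowed,
        "user_cap": user_cap - borrowed,
        "budget_borrowed": borrowed > 0,
    }


print(borrow_for_rag(user_tokens=50))
# {'rag_cap': 2400, 'user_cap': 200, 'budget_borrowed': True}
```

The caps are returned rather than mutated so the shared budget profile stays unchanged for the next request.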
📦 Real World
Gateways enforce per-tenant budget profiles—free tier gets 2500 input cap, enterprise 8000—same codebase, different TokenBudget config in JWT claims.
Context compression: LLMLingua and when to compress
LLMLingua and related techniques remove low-information tokens from prompts while preserving task performance. Know when to compress vs when to retrieve less—compression is not a substitute for bad retrieval.
You have 12 reranked chunks totaling 9000 tokens; budget allows 2200. Options: (a) drop chunks, (b) compress each chunk,
(c) extract sentences. Compression wins when chunks are individually valuable but verbose; dropping wins when whole documents are redundant.
LLMLingua family
LLMLingua: a small LM scores token importance; drops tokens with controllable compression ratio (2×–10×).
LongLLMLingua: question-aware compression with document reordering for long inputs; preserves question-relevant spans better in RAG settings.
LLMLingua-2: faster token classification; suited for online serving with GPU budget.
| Approach | Latency | Faithfulness risk | Best when |
| --- | --- | --- | --- |
| Retrieve fewer chunks | Lowest | Miss relevant doc | Reranker already sharp; k was too high |
| Sentence extraction | Low | Breaks cross-sentence refs | FAQ paragraphs, policy sections |
| LLMLingua compress | Medium (GPU) | Drops rare tokens (IDs, codes) | Verbose logs, narrative docs |
| Abstractive summarize | High (LLM call) | Hallucinate summary facts | Many chunks same theme |
Decision flow
If MRR@k is low → fix retrieval/chunking, do not compress garbage.
If chunks redundant → reduce k or cluster then summarize once.
If chunks unique and verbose → LLMLingua per chunk at 4× ratio; validate IDs survive.
If compliance requires verbatim quotes → no lossy compression; drop or split across turns.
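The decision flow can be written down as a single function so the choice is logged per request. All thresholds below are assumptions to tune on your own data (the 400-token verbosity cut-off mirrors the one used elsewhere in this guide), and the fall-through to retrieving fewer chunks is a default chosen for this sketch.

```python
def choose_strategy(
    mrr_at_k: float,
    redundancy: float,
    avg_chunk_tokens: float,
    verbatim_required: bool,
    mrr_floor: float = 0.5,
    redundancy_ceiling: float = 0.6,
    verbose_tokens: int = 400,
) -> str:
    """Map retrieval signals to one context-reduction strategy."""
    if mrr_at_k < mrr_floor:
        return "fix_retrieval"                      # do not compress garbage
    if redundancy > redundancy_ceiling:
        return "reduce_k_or_cluster_then_summarize"
    if verbatim_required:
        return "drop_or_split_no_lossy_compression"
    if avg_chunk_tokens > verbose_tokens:
        return "compress_per_chunk_4x_then_validate_ids"
    return "retrieve_fewer_chunks"


print(choose_strategy(mrr_at_k=0.8, redundancy=0.2, avg_chunk_tokens=750, verbatim_required=False))
# compress_per_chunk_4x_then_validate_ids
```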
Protecting structured tokens
Post-compression regex audit for SKU patterns, UUIDs, dollar amounts, and legal section numbers. If more than 30% of the structured tokens matched in the original chunk are missing after compression, fall back to sentence extraction for that chunk.
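A regex audit of this kind might look like the sketch below. The patterns are illustrative only and will need tuning to your own identifier formats, and the 30% threshold is read here as the share of structured tokens from the original that no longer appear verbatim after compression.

```python
import re

# Illustrative patterns only; tune them to your own identifiers.
PATTERNS = {
    "uuid": re.compile(r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"),
    "dollar": re.compile(r"\$\d[\d,]*(?:\.\d+)?"),
    "sku": re.compile(r"\bSKU-[A-Z0-9]+\b"),
    "section": re.compile(r"§\s?\d+(?:\.\d+)*"),
}


def structured_tokens(text: str) -> set[str]:
    return {m.group(0) for p in PATTERNS.values() for m in p.finditer(text)}


def survival_rate(original: str, compressed: str) -> float:
    """Share of structured tokens from the original that still appear verbatim."""
    wanted = structured_tokens(original)
    if not wanted:
        return 1.0
    return sum(1 for tok in wanted if tok in compressed) / len(wanted)


def needs_fallback(original: str, compressed: str, max_loss: float = 0.30) -> bool:
    """True when too many structured tokens were lost to trust the compressed chunk."""
    return (1.0 - survival_rate(original, compressed)) > max_loss
```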
LLMLingua-style compression hook
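# Python
import re

def extract_codes(text: str) -> list[str]:
    # Illustrative: hyphenated alphanumeric tokens that contain a digit (SKU-4821, 7f3a, 250).
    tokens = re.findall(r"\b[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*\b", text)
    return [t for t in tokens if any(ch.isdigit() for ch in t)]

def compress_context(text: str, target_ratio: float, compressor, tok) -> str:
    """target_ratio 0.25 = keep ~25% of tokens."""
    original = len(tok.encode(text))
    target = max(64, int(original * target_ratio))
    # LLMLingua's PromptCompressor.compress_prompt returns a dict; the text is under "compressed_prompt".
    result = compressor.compress_prompt(text, target_token=target)
    compressed = result["compressed_prompt"]
    # Guardrails: preserve alphanumeric codes
    for token in extract_codes(text):
        if token not in compressed:
            compressed += f"\n[PRESERVED_CODE: {token}]"
    return compressed

def should_compress(chunks: list[str], budget: int, tok) -> bool:
    total = sum(len(tok.encode(c)) for c in chunks)
    if total <= budget:
        return False
    # Compress when avg chunk > 400 tokens and rerank top-5 already included
    return len(chunks) <= 8 and (total / len(chunks)) > 400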
// Java (Compressor and extractCodes are assumed helpers)
public String compressContext(String text, double targetRatio, Compressor compressor, Tokenizer tok) {
    int original = tok.count(text);
    int target = Math.max(64, (int) (original * targetRatio));
    String compressed = compressor.compress(text, target);
    // Guardrail: re-append any code the compressor dropped.
    for (String code : extractCodes(text)) {
        if (!compressed.contains(code))
            compressed += "\n[PRESERVED_CODE: " + code + "]";
    }
    return compressed;
}
🔬 Under the Hood
LLMLingua pairs a budget controller, which assigns different compression ratios to different parts of the prompt, with token-level pruning guided by perplexity from a small LM: tokens the small model finds highly predictable carry little information and get dropped first.
⚖️ Trade-off
Compression adds serving complexity (GPU for local model or extra vendor). Retrieve-less is simpler—try dropping k by 2 and measure faithfulness before enabling LLMLingua in prod.
📦 Real World
Observability vendors compress log snippets 4× before RAG on incident queries; they keep retrieve-less for metrics docs under 300 tokens/chunk—hybrid pipeline per doc type.
LLMLingua deployment modes
Sidecar GPU — compress microservice between rerank and generate; 20–40ms at 4× on A10.
Batch pre-compress at ingest — store compressed + raw; serve compressed for browse, raw for legal export.
Vendor API — emerging; evaluate faithfulness on your domain before commit.
Compression vs retrieve-less matrix
| Signal | Prefer retrieve-less | Prefer compress |
| --- | --- | --- |
| High chunk overlap | ✓ | |
| Verbose single doc | | ✓ |
| Low MRR@k | Fix retrieval first | Never |
| IDs in every answer | ✓ | Risky |
⚠️ Pitfall
Compressing already-short chunks (<200 tokens) often deletes negation words (“not eligible”)—skip compression below minimum chunk size.
Production: context audit checklist
Before shipping—or after a faithfulness regression—run this context audit. Then continue to advanced prompting for injection defenses and CI prompt testing.
Context audit checklist
Published token budget table per product surface (system / RAG / history / user / reserve).
Pre-flight token count with the same tokenizer the provider uses (tiktoken, Anthropic counter, etc.).
RAG assembly logged: chunk IDs, order, total tokens, compression ratio if any.
Lost-in-the-middle ablation: relevance vs sandwich order on golden set.
Conversation memory: entity store scoped per user; vector memory TTL documented.
Compression guardrails for IDs, codes, and regulated verbatim text.
Alert when p95 input tokens > 80% of model limit—headroom for tools and spikes.
Debug trace shows which bucket was trimmed when over budget (RAG vs history).
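The p95 headroom item is easy to check offline from logged token counts. This sketch uses a nearest-rank percentile, which is adequate for an audit script; the function names are hypothetical, and in production the same check would normally live in your metrics system.

```python
import math


def percentile(values: list[int], pct: int) -> int:
    """Nearest-rank percentile over a list of token counts."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def headroom_alert(input_token_counts: list[int], model_limit: int, max_fill: float = 0.80) -> bool:
    """True when p95 input tokens exceed the allowed share of the model limit."""
    return percentile(input_token_counts, 95) > max_fill * model_limit


samples = list(range(1, 101))
print(percentile(samples, 95))  # 95
```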
| Signal | Threshold | Action |
| --- | --- | --- |
| Empty RAG after trim | >1% of requests | Lower k at retrieve or raise RAG bucket |
| Summary staleness | >20 turns without refresh | Force rolling summary job |
| Memory retrieval miss | User repeats fact 3× | Entity extraction failure; fix parser |
| Compression code loss | Any regulated SKU miss | Disable lossy compression for that doc class |
📦 Production checklist
Context engineering is production-ready when every request has a traceable budget breakdown and you can answer
“why was this chunk excluded?” without reproducing the user’s full thread.
🎯 Interview Tip
“Design context for a support bot with RAG.” — Budget table, sandwich ordering, entity memory for account ID, compress logs not policy, audit checklist.
🔒 Security
Audit that trimmed history never drops security instructions. System policy version must be pinned—not subject to LRU eviction in conversation manager.
Research notes on context length
Liu et al. (2023) tested multi-document QA with gold passages placed at varying positions—performance dropped sharply when gold was in the middle third. Replicate with your tokenizer and model because absolute numbers shift by family.
Anthropic and OpenAI long-context models improved middle recall but RAG systems still benefit from explicit ordering policy—do not rely on marketing context limits as engineering limits.
Operational metrics
Track context_tokens_by_bucket as Prometheus histograms. Slice by product and model—budget violations often cluster on one feature flag cohort.
Set SLO: 99% of requests fit budget without dropping system policy. Alert on compression ratio p95 > 6× (signals broken retrieval).