Context window management

Lost in the middle: U-shaped attention and RAG placement

Lost in the middle (Liu et al., 2023) shows LLMs recall facts at the start and end of long contexts better than the middle. For RAG, that means chunk order is not cosmetic—it changes faithfulness.

You retrieved five strong passages and concatenated them in relevance order, best first and weakest last. The model cited chunk five and ignored chunk two, even though chunk two contained the refund policy clause. That is not a retrieval failure; it is a context placement failure driven by uneven attention over long prompts.

The U-shaped recall curve

Across many models and tasks, accuracy vs. position in context follows a U-shape:

Primacy — early tokens (system prompt, first RAG chunks) get strong attention.
Recency — tokens just before the user question get strong attention.
Middle slump — facts buried between ~20% and ~80% of context length are recalled least reliably.

Position in context	Typical recall	RAG implication
First 10–15%	High	Put policy blocks, safety rules, and highest-confidence citations here
Middle 50–70%	Lowest	Avoid placing sole copy of critical facts here
Last 10–15% (before query)	High	Repeat key constraints; place user question adjacent to freshest evidence

RAG-specific placement tactics

Sandwich critical chunks — duplicate the top-1 passage at start and end if it is the only source for a compliance answer.
Reorder after rerank — do not stuff by raw similarity score alone; interleave high-signal chunks at boundaries.
Shorter context beats clever order — if you need six chunks and only two matter, drop four instead of hoping middle attention improves.
Separate “rules” from “evidence” — system message holds immutable policy; RAG block holds citations—never mix so rules drift to the middle.
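One way to read "interleave high-signal chunks at boundaries" is to alternate the reranked list between the front and the back of the prompt, so the weakest chunks end up in the middle. The sketch below assumes the input is already sorted best-first; the function name `edges_first` is illustrative, not part of any library.

```python
def edges_first(ranked: list[str]) -> list[str]:
    """Reorder best-first chunks so the strongest sit at both ends."""
    front, back = [], []
    for i, chunk in enumerate(ranked):
        # Even ranks fill the front, odd ranks fill the back.
        (front if i % 2 == 0 else back).append(chunk)
    # Reverse the back half so the second-best chunk lands in the last slot.
    return front + back[::-1]


print(edges_first(["c1", "c2", "c3", "c4", "c5"]))
# ['c1', 'c3', 'c5', 'c4', 'c2']
```

The best chunk opens the block, the second-best closes it, and the weakest chunk (`c5`) sits in the middle where recall is lowest.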

Measuring placement impact

Run an ablation eval: same retrieval set, three orderings (relevance, reverse, random). If faithfulness swings >10 points, your pipeline is order-sensitive—fix assembly before buying a bigger context window.
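A minimal sketch of that ablation, assuming you already have a `score_fn` that takes an ordered chunk list and returns a faithfulness score (0 to 100) on your golden set. Both `ordering_ablation` and `score_fn` are hypothetical names; the toy scorer at the bottom only exists to make the example runnable.

```python
import random
from typing import Callable


def ordering_ablation(
    chunks: list[str],
    score_fn: Callable[[list[str]], float],
    seed: int = 0,
) -> dict[str, float]:
    """Score the same retrieval set under three orderings and report the swing."""
    shuffled = chunks[:]
    random.Random(seed).shuffle(shuffled)  # seeded so runs are repeatable
    orderings = {
        "relevance": chunks,        # input is assumed to be sorted best-first
        "reverse": chunks[::-1],
        "random": shuffled,
    }
    scores = {name: score_fn(order) for name, order in orderings.items()}
    swing = max(scores.values()) - min(scores.values())
    scores["swing"] = swing
    return scores


# Toy scorer: pretends the score depends only on where the gold chunk sits.
toy = lambda order: order.index("gold") * 10.0
result = ordering_ablation(["gold", "a", "b"], toy)
print(result["swing"] > 10)  # True means the pipeline is order-sensitive
```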

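def sandwich_context(chunks: list[str], max_tokens: int, tokenizer) -> str:
    """Place highest-relevance chunk at start AND end; fill middle with remainder."""
    if not chunks:
        return ""
    primary, *rest = chunks
    primary_t = len(tokenizer.encode(primary))
    if primary_t > max_tokens:
        return ""
    # Reserve room for the closing copy up front. Appending it last and breaking
    # on overflow would drop it first, silently degrading to relevance order.
    repeat = bool(rest) and 2 * primary_t <= max_tokens
    used = primary_t * (2 if repeat else 1)
    middle = []
    for c in rest:
        t = len(tokenizer.encode(c))
        if used + t > max_tokens:
            break
        middle.append(c)
        used += t
    # Repeat the primary only when something separates the two copies.
    tail = [primary] if repeat and middle else []
    return "\n\n---\n\n".join([primary, *middle, *tail])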

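public String sandwichContext(List<String> chunks, int maxTokens, Tokenizer tok) {
  if (chunks.isEmpty()) return "";
  String primary = chunks.get(0);
  int primaryTokens = tok.count(primary);
  if (primaryTokens > maxTokens) return "";
  List<String> rest = chunks.subList(1, chunks.size());
  // Reserve room for the closing copy before filling the middle.
  boolean repeat = !rest.isEmpty() && 2 * primaryTokens <= maxTokens;
  int used = primaryTokens * (repeat ? 2 : 1);
  var out = new ArrayList<String>();
  out.add(primary);
  for (String c : rest) {
    int t = tok.count(c);
    if (used + t > maxTokens) break;
    out.add(c);
    used += t;
  }
  // Repeat the primary only when something separates the two copies.
  if (repeat && out.size() > 1) out.add(primary);
  return String.join("\n\n---\n\n", out);
}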

🔬 Under the Hood

Attention is not uniform over sequence length—training data bias (recent tokens matter for next-token prediction) and positional encoding limits create the U-curve. Long-context models reduce but do not eliminate the effect; eval on your average stuffed prompt length.

⚠️ Pitfall

Assuming a 128K window means 128K usable facts. Effective reliable context for multi-doc QA is often 8K–32K depending on model and task—test with needle-in-haystack probes at production token counts.

🎯 Interview Tip

“Why did RAG miss an obvious chunk?” — Check retrieval first, then ordering (lost in the middle), then compression artifacts. Mention sandwiching and shorter context as fixes.

📦 Real World

Legal copilots reorder clauses so defined terms appear in the first and last RAG slots; middle holds supporting case law only. Citation rate on defined terms jumped 18% vs relevance-only ordering in one internal A/B.

FAQ: How big should the RAG bucket be?

Start at 40–45% of your working input budget. If faithfulness fails on multi-doc questions, improve rerank before expanding RAG at the expense of history.

FAQ: Does prompt caching change budgeting?

Cached system tokens still occupy context window slots—they are cheaper in dollars but not in attention. See model selection & cost for cache economics.

FAQ: Entity memory vs fine-tuning?

Entity memory updates instantly, per session; fine-tuning is for stable style and format. Use memory for user-specific facts that fine-tuning should never see, such as PII.

Needle-in-haystack eval protocol

Build a synthetic eval: insert a unique “needle” sentence (e.g. NEEDLE-7f3a: refund cap is $250) into retrieved chunks at positions 1, 3, 5, and 8. Measure answer rate when user asks for the needle fact. Plot accuracy vs position—you should see a U-curve within 30 minutes of scripting.

Model family	Middle penalty (typical)	Mitigation priority
GPT-4 class	15–25% vs edges	Sandwich + shorten
Claude 3.5+	10–18%	XML source tags + recap
Open-weight 32K	25–40%	Aggressive retrieve-less

Interaction with citations

When forcing models to cite chunk IDs, place the citation instruction in the system message and repeat cite [source_id] in the final user wrapper. Models cite edge chunks more often—if the true source sat in the middle, users see wrong citations even when prose is partially correct.
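A small post-generation check can surface this failure mode. The sketch below assumes citations appear in the answer as bracketed IDs such as `[doc-a]` and that you know the order in which source IDs were placed in the prompt; `audit_citations` and the ID format are illustrative assumptions.

```python
import re


def audit_citations(answer: str, context_ids: list[str]) -> dict:
    """Compare cited source IDs against the IDs actually placed in the prompt."""
    cited = re.findall(r"\[([A-Za-z0-9_\-]+)\]", answer)
    known = set(context_ids)
    last = len(context_ids) - 1
    return {
        "cited": cited,
        # IDs the model cited that were never in the context.
        "unknown": [c for c in cited if c not in known],
        # True when every citation points at the first or last slot,
        # a hint that middle sources are being ignored.
        "edge_only": bool(cited) and all(
            c in known and context_ids.index(c) in (0, last) for c in cited
        ),
    }


report = audit_citations("Cap is $250 [doc-a]. See [doc-z].", ["doc-a", "doc-b", "doc-c"])
print(report["unknown"])  # ['doc-z']
```

Tracking the `edge_only` rate across a golden set where the true source sits mid-prompt gives a rough measure of how often citations drift to the boundaries.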

NEEDLE = "NEEDLE-7f3a: refund cap is $250 for enterprise tier."

def needle_eval(chunks: list[str], position: int, llm, ask: str) -> bool:
    test = chunks[:]
    test.insert(position, NEEDLE)
    context = "\n\n".join(test)
    ans = llm.answer(context, ask)
    # Check the fact only: a correct answer rarely echoes the needle ID.
    return "250" in ans

# Start, middle, and end; inserting at len(base_chunks) appends after the last chunk.
positions = [0, len(base_chunks) // 2, len(base_chunks)]
for p in positions:
    print(p, needle_eval(base_chunks, p, llm, "What is the enterprise refund cap?"))

String NEEDLE = "NEEDLE-7f3a: refund cap is $250 for enterprise tier.";
boolean needleEval(List<String> chunks, int position, LlmClient llm, String ask) {
  var test = new ArrayList<>(chunks);
  test.add(position, NEEDLE);
  var ans = llm.answer(String.join("\n\n", test), ask);
  // Check the fact only: a correct answer rarely echoes the needle ID.
  return ans.contains("250");
}

Assembly strategies: relevance, recency, and hierarchy

How you assemble retrieved chunks, summaries, and conversation history determines answer quality as much as retrieval itself. Production systems combine relevance ordering, recency bias, and hierarchical summary + detail layouts.

ContextBuilder is the underrated stage between rerank and generate. Teams obsess over embed models; they ship with "\n\n".join(chunks) and wonder why answers feel stale or incomplete. Assembly is where you encode what the model should read first and what can be compressed.

Relevance ordering

Default: sort by reranker score descending. Variants:

Strategy	Order	When to use
Strict relevance	Best → worst	Short context (<8K), single-fact lookup
Boundary sandwich	Best, middle fillers, best repeat	Compliance, one golden source
Diversity interleave	Round-robin across sources	Multi-doc synthesis, avoid one doc dominating
Metadata-first	Group by doc type (policy → examples → logs)	Support playbooks with layered content
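The diversity interleave row can be sketched as a round-robin over per-source queues. This assumes chunks arrive as `(source_id, text)` pairs already sorted best-first; the function name and the tuple shape are illustrative choices, not a fixed interface.

```python
from collections import defaultdict


def interleave_by_source(chunks: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Round-robin across sources so no single document dominates the prompt."""
    by_source: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for source_id, text in chunks:
        by_source[source_id].append((source_id, text))
    # Dicts keep insertion order, so the source of the best chunk goes first.
    queues = list(by_source.values())
    out = []
    while any(queues):
        for q in queues:
            if q:
                out.append(q.pop(0))
    return out


ranked = [("A", "a1"), ("A", "a2"), ("B", "b1"), ("A", "a3"), ("C", "c1")]
print([text for _, text in interleave_by_source(ranked)])
# ['a1', 'b1', 'c1', 'a2', 'a3']
```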

Recency bias

For time-sensitive corpora (incidents, pricing, release notes), boost recent documents in assembly even if embedding score is slightly lower:

Apply score' = score + λ · exp(-age_days / τ) before final ordering.
Pin “current version” doc IDs in the first slot regardless of score.
Stamp each chunk with indexed_at in the prompt so the model can reason about freshness.
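To see how the boost behaves, here is a worked instance of the formula with λ = 0.15 and τ = 30 days. Those constants and the two example scores are illustrative; tune them against your own corpus.

```python
import math


def recency_adjusted(score: float, age_days: float, lam: float = 0.15, tau: float = 30.0) -> float:
    """score' = score + lam * exp(-age_days / tau)."""
    return score + lam * math.exp(-age_days / tau)


old = recency_adjusted(0.82, age_days=365)  # boost has decayed to almost nothing
new = recency_adjusted(0.78, age_days=3)    # boost is close to the full 0.15
print(round(old, 3), round(new, 3))  # 0.82 0.916
```

The three-day-old document overtakes the year-old one despite a lower raw score. A smaller λ or a shorter τ makes that flip rarer.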

Hierarchical summary + details

Borrowed from RAPTOR and GraphRAG: give the model a map before terrain.

Executive summary block — 200–400 tokens synthesizing all retrieved themes (generated offline or at query time).
Section headers — XML or markdown headers per source: <source id="...">.
Leaf details — full chunks under headers; trim leaves if over budget before dropping summary.

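import math
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Chunk:
    text: str
    score: float
    source_id: str
    indexed_at: datetime  # assumed timezone-aware (UTC)

def assemble(chunks: list[Chunk], budget: int, tok) -> str:
    now = datetime.now(timezone.utc)
    def rank(c: Chunk) -> float:
        # score' = score + lambda * exp(-age_days / tau), with lambda = 0.15, tau = 30
        age = max((now - c.indexed_at).days, 0)
        recency = math.exp(-age / 30.0)
        return c.score + 0.15 * recency
    ranked = sorted(chunks, key=rank, reverse=True)
    # summarize_themes is an external helper (LLM call or precomputed lookup).
    summary = summarize_themes([c.text[:500] for c in ranked[:6]])
    # Tag names are illustrative; any consistent wrapper works.
    parts = [f"<summary>\n{summary}\n</summary>"]
    used = len(tok.encode(parts[0]))
    for c in ranked:
        block = f'<source id="{c.source_id}">\n{c.text}\n</source>'
        t = len(tok.encode(block))
        if used + t > budget:
            break
        parts.append(block)
        used += t
    return "\n".join(parts)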

public String assemble(List<Chunk> chunks, int budget, Tokenizer tok) {
  var now = Instant.now();
  // The explicit lambda parameter type lets the compiler infer Comparator<Chunk> before reversed().
  var ranked = chunks.stream()
      .sorted(Comparator.comparingDouble((Chunk c) -> c.score() + 0.15 * recencyBoost(c.indexedAt(), now)).reversed())
      .toList();
  var summary = summarizeThemes(ranked.stream().limit(6).map(c -> c.text().substring(0, Math.min(500, c.text().length()))).toList());
  var parts = new ArrayList<String>();
  // Tag names are illustrative; any consistent wrapper works.
  parts.add("<summary>\n" + summary + "\n</summary>");
  int used = tok.count(parts.get(0));
  for (var c : ranked) {
    // sourceId() is assumed to be an accessor on Chunk.
    var block = "<source id=\"" + c.sourceId() + "\">\n" + c.text() + "\n</source>";
    int t = tok.count(block);
    if (used + t > budget) break;
    parts.add(block);
    used += t;
  }
  return String.join("\n", parts);
}

💡 Pro Tip

Log assembled context hash + chunk IDs per request. When users report wrong answers, replay assembly without re-running retrieval—isolates builder bugs in minutes.
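A minimal sketch of such a log record, using only the standard library. The field names and the `assembly_record` helper are assumptions; adapt them to whatever structured logger you already run.

```python
import hashlib
import json


def assembly_record(request_id: str, chunk_ids: list[str], context: str) -> str:
    """Build one JSON log line that lets you replay assembly for a request."""
    record = {
        "request_id": request_id,
        # Order matters for replay, so keep the list exactly as assembled.
        "chunk_ids": chunk_ids,
        "context_sha256": hashlib.sha256(context.encode("utf-8")).hexdigest(),
        "context_chars": len(context),
    }
    return json.dumps(record, sort_keys=True)


print(assembly_record("req-1", ["doc-a#3", "doc-b#1"], "assembled prompt text"))
```

During replay, rebuilding the context from the logged chunk IDs and comparing hashes tells you whether the builder or the retrieval layer changed.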

⚖️ Trade-off

Query-time executive summaries cost an extra LLM call (~300–800ms). Precompute per-collection summaries at ingest for stable corpora; generate live only for dynamic multi-tool retrieval.

💰 Cost

Hierarchical assembly with live summarization adds 500–1500 input tokens per query. Cache summaries keyed by hash(chunk_ids) for repeat questions in the same session.
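One caveat when implementing the cache key: Python's built-in `hash()` is salted per process for strings, so it is not stable across workers. The sketch below uses a SHA-256 digest instead. `SummaryCache` is a hypothetical in-memory stand-in for whatever cache you actually use, and sorting the IDs assumes the summary does not depend on chunk order.

```python
import hashlib
from typing import Callable


class SummaryCache:
    """In-memory cache of summaries keyed by the set of chunk IDs."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    @staticmethod
    def key(chunk_ids: list[str]) -> str:
        # Sort so the same chunks in a different order share one entry.
        joined = "\x1f".join(sorted(chunk_ids))
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()

    def get_or_create(self, chunk_ids: list[str], summarize: Callable[[], str]) -> str:
        k = self.key(chunk_ids)
        if k not in self._store:
            self._store[k] = summarize()  # the expensive LLM call happens only on a miss
        return self._store[k]


cache = SummaryCache()
calls = []
make = lambda: calls.append(1) or "summary text"
cache.get_or_create(["c2", "c1"], make)
cache.get_or_create(["c1", "c2"], make)
print(len(calls))  # 1
```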

Anti-patterns in context assembly

Alphabetical source order — arbitrary; hides relevance signal from the model.
Chronological only — old but irrelevant docs crowd out fresh policy unless recency-weighted.
Single blob without headers — model cannot attribute facts; citation rate collapses.
Duplicate near-identical chunks — wastes tokens; use dedupe by minhash before assembly.
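For the dedupe step, the sketch below computes exact Jaccard similarity over word shingles, which is the quantity MinHash approximates. It is a simpler stand-in, not MinHash itself, and is fine for the handful of chunks in one prompt; the 0.8 threshold and three-word shingles are illustrative defaults.

```python
def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of overlapping n-word windows, lowercased."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dedupe(chunks: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first of any group of near-identical chunks (input is best-first)."""
    kept, kept_shingles = [], []
    for chunk in chunks:
        s = shingles(chunk)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(chunk)
            kept_shingles.append(s)
    return kept


docs = [
    "refunds are capped at 250 dollars for the enterprise tier",
    "Refunds are capped at 250 dollars for the enterprise tier",
    "shipping takes five business days in most regions",
]
print(len(dedupe(docs)))  # 2
```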

XML vs markdown wrappers

Format	Pros	Cons
XML tags	Recommended in Anthropic's prompting guidance for Claude; strict boundaries	Verbose token overhead
Markdown headers	Compact; readable in logs	Weaker boundary on messy HTML RAG
JSON array	Machine-parseable	Higher escape overhead for prose

🎯 Interview Tip

“How do you assemble RAG context?” — Relevance + recency scoring, hierarchical summary, XML boundaries, dedupe, log assembly hash, lost-in-the-middle ordering.

Conversation management: windows, summaries, and memory

Multi-turn products must decide what to keep, compress, or retrieve from memory. Combine sliding windows, rolling summarization, structured entity memory, and vector memory for long sessions without blowing the token budget.

Every turn resends full chat history unless you trim. A 40-turn support thread can exceed 30K tokens before RAG arrives, starving the retrieval budget and triggering silent middle-attention loss. Conversation management is budget allocation across time.

Sliding window

Keep system prompt + last N turns or last T tokens. Simple, predictable, loses early facts.

Variant	Keeps	Drops	Risk
Turn window	Last 6 user/assistant pairs	Older dialogue	User references message 1
Token window	Newest messages until budget	Oldest variable content	Splits mid-conversation tool results
Role-priority	System + tools + recent user	Verbose assistant monologues	May drop useful assistant caveats
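The turn-window variant can be sketched as follows, assuming OpenAI-style message dicts with a `role` key. The function walks backwards counting user messages so the window always opens on a user turn; `turn_window` is an illustrative name.

```python
def turn_window(messages: list[dict], max_pairs: int = 6) -> list[dict]:
    """Keep system messages plus the last max_pairs user/assistant exchanges."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    if max_pairs <= 0:
        return system
    users_seen, start = 0, len(dialogue)
    for i in range(len(dialogue) - 1, -1, -1):
        if dialogue[i]["role"] == "user":
            users_seen += 1
            start = i  # window begins at this user message
            if users_seen == max_pairs:
                break
    return system + dialogue[start:]


history = [{"role": "system", "content": "policy"}]
for n in (1, 2, 3):
    history += [
        {"role": "user", "content": f"u{n}"},
        {"role": "assistant", "content": f"a{n}"},
    ]
print([m["content"] for m in turn_window(history, max_pairs=2)])
# ['policy', 'u2', 'a2', 'u3', 'a3']
```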

Summarization

When the window slides, compress dropped turns into a rolling conversation_summary system message. Refresh summary every K turns or when history exceeds 50% of conversation budget.

Entity memory

Structured store of facts the user stated or the system extracted: account ID, product SKU, open ticket, preferences. Inject as a compact JSON or bullet block—far denser than raw chat logs.

Extract with tool call or structured output after each user turn.
Merge with CRDT-style last-write-wins per key.
Cap entity block (e.g. 800 tokens); evict low-salience keys by LRU.
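The merge and eviction rules above might look like the sketch below. It assumes each extracted fact carries a timestamp, uses a whitespace word count as a stand-in for a real tokenizer, and treats recency of use as the only salience signal; `EntityMemory` is a hypothetical class.

```python
from collections import OrderedDict
from typing import Callable


class EntityMemory:
    """Per-key last-write-wins store with an LRU-evicted token cap."""

    def __init__(self, max_tokens: int, count_tokens: Callable[[str], int] = lambda s: len(s.split())):
        self.max_tokens = max_tokens
        self.count_tokens = count_tokens  # swap in your tokenizer here
        self._facts: "OrderedDict[str, tuple[str, float]]" = OrderedDict()

    def merge(self, key: str, value: str, ts: float) -> None:
        current = self._facts.get(key)
        if current is None or ts >= current[1]:
            self._facts[key] = (value, ts)  # newer write wins; stale writes are ignored
        self._facts.move_to_end(key)        # mark as most recently used
        self._evict()

    def _size(self) -> int:
        return sum(self.count_tokens(f"{k}: {v}") for k, (v, _) in self._facts.items())

    def _evict(self) -> None:
        while len(self._facts) > 1 and self._size() > self.max_tokens:
            self._facts.popitem(last=False)  # drop the least recently used key

    def render(self) -> str:
        return "\n".join(f"- {k}: {v}" for k, (v, _) in self._facts.items())


mem = EntityMemory(max_tokens=6)
mem.merge("account_id", "A-17", ts=1.0)
mem.merge("sku", "X9", ts=2.0)
mem.merge("account_id", "A-99", ts=0.5)  # older timestamp, so A-17 is kept
mem.merge("ticket", "T-5", ts=3.0)
mem.merge("plan", "pro", ts=4.0)         # pushes the block over the cap, evicting sku
print(mem.render())
```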

Vector memory

Embed past turns or summaries; retrieve top-k memories relevant to current query. Scales to months of history without linear token growth.

On each turn, embed user + assistant summary snippet.
Store in per-user namespace with TTL and consent flags.
At query time, retrieve 3–5 memory hits and prepend under <prior_context>.

import json

class ConversationMemory:
    def __init__(self, max_history_tokens: int, summary_every: int = 8):
        self.messages: list[dict] = []
        self.summary: str = ""
        self.entities: dict[str, str] = {}
        self.turn_count = 0
        self.max_history_tokens = max_history_tokens
        self.summary_every = summary_every

    def add_turn(self, user: str, assistant: str, llm, tok, memory_index):
        # extract_entities, summarize_pair, rolling_summary, and format_memories
        # are application-level helpers that are not defined here.
        self.turn_count += 1
        self.messages.extend([
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ])
        self.entities.update(extract_entities(user, llm))
        memory_index.upsert(self.turn_count, summarize_pair(user, assistant))
        if self.turn_count % self.summary_every == 0:
            # trim_to_budget returns the messages that fit, so the dropped turns
            # are whatever precedes them in the list.
            kept = trim_to_budget(self.messages, self.max_history_tokens, tok)
            dropped = self.messages[: len(self.messages) - len(kept)]
            if dropped:
                self.summary = rolling_summary(llm, self.summary, dropped)
            self.messages = kept

    def build_context(self, query: str, memory_index) -> list[dict]:
        # query drives memory retrieval only; the caller appends it as the final user message.
        mem_hits = memory_index.search(query, k=4)
        entity_block = json.dumps(self.entities, indent=0)
        return [
            {"role": "system", "content": f"Entities:\n{entity_block}"},
            {"role": "system", "content": f"Prior summary:\n{self.summary}"},
            {"role": "system", "content": format_memories(mem_hits)},
            *self.messages,
        ]

public class ConversationMemory {
  private final List<Message> messages = new ArrayList<>();
  private String summary = "";
  private final Map<String, String> entities = new LinkedHashMap<>();
  private int turnCount;
  private final int maxHistoryTokens;
  private final int summaryEvery;

  // Final fields must be assigned in a constructor.
  public ConversationMemory(int maxHistoryTokens, int summaryEvery) {
    this.maxHistoryTokens = maxHistoryTokens;
    this.summaryEvery = summaryEvery;
  }

  public List<Message> buildContext(String query, MemoryIndex index) {
    var memHits = index.search(query, 4);
    var blocks = new ArrayList<Message>();
    blocks.add(Message.system("Entities:\n" + toJson(entities)));
    blocks.add(Message.system("Prior summary:\n" + summary));
    blocks.add(Message.system(formatMemories(memHits)));
    blocks.addAll(messages);
    return blocks;
  }
}

🔒 Security

Vector memory must be tenant-scoped. Never retrieve another user's embeddings, even if queries are similar. Encrypt memory at rest, and honor GDPR erasure requests without undue delay (at the latest within one month of the request).

📦 Real World

B2B copilots often use entity memory for account context (fixed 500 tokens) + vector memory for past meetings + sliding window for the current session—keeps p95 prompt size stable across 200+ turn threads.

🎯 Interview Tip

“How do you handle long chat history?” — Sliding window baseline, summarization for dropped content, structured entity store for IDs/facts, vector memory for scale. Mention token budget table.

Tool result context

Agentic flows stuff tool JSON into history—often the fastest budget overrun. Rules:

Truncate tool payloads to schema-relevant fields before append.
Summarize large tool results in a side channel message, not raw dump.
Expire tool results from history after the turn unless user references them.
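A rough sketch of the first and third rules. It assumes tool results are dicts and that history uses a `tool` role; the helper names, the 400-character cap, and the stub text are illustrative. Stale results are replaced with a stub rather than deleted, because some chat APIs reject a tool call whose paired result message is missing.

```python
import json


def prune_tool_result(payload: dict, keep: set[str], max_chars: int = 400) -> str:
    """Keep only schema-relevant fields, then hard-cap the serialized size."""
    slim = {k: v for k, v in payload.items() if k in keep}
    text = json.dumps(slim, separators=(",", ":"), sort_keys=True)
    if len(text) > max_chars:
        text = text[:max_chars] + "...[truncated]"
    return text


def expire_tool_results(messages: list[dict], keep_last: int = 1) -> list[dict]:
    """Stub out all but the most recent keep_last tool results."""
    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    stale = set(tool_idx[:-keep_last] if keep_last > 0 else tool_idx)
    return [
        {**m, "content": "[tool result expired]"} if i in stale else m
        for i, m in enumerate(messages)
    ]


raw = {"id": 1, "status": "open", "debug": "x" * 1000}
print(prune_tool_result(raw, keep={"id", "status"}))
# {"id":1,"status":"open"}
```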

Memory consent and retention

Memory type	Default TTL	User control
Sliding window	Session	Clear chat
Rolling summary	30 days	Export + delete
Entity store	Account lifetime	Per-field opt-out
Vector memory	90 days	Forget me deletes namespace

def trim_to_budget(messages: list[dict], budget: int, tok) -> list[dict]:
    """Keep all system messages plus the newest non-system messages that fit."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    def count(msgs):
        # +4 per message approximates role and formatting overhead.
        return sum(len(tok.encode(m["content"])) for m in msgs) + 4 * len(msgs)
    kept = []
    for msg in reversed(rest):
        # kept is already chronological, so the candidate goes in front of it.
        if count(system + [msg] + kept) > budget:
            break
        kept.insert(0, msg)
    return system + kept

public List<Message> trimToBudget(List<Message> messages, int budget, Tokenizer tok) {
  var system = messages.stream().filter(m -> "system".equals(m.role())).toList();
  var rest = messages.stream().filter(m -> !"system".equals(m.role())).toList();
  var kept = new ArrayList<Message>();
  for (int i = rest.size() - 1; i >= 0; i--) {
    var trial = new ArrayList<Message>(system);
    trial.add(rest.get(i));
    trial.addAll(kept);
    // Assumes Tokenizer has a count(List<Message>) overload that adds per-message overhead.
    if (tok.count(trial) > budget) break;
    kept.add(0, rest.get(i));
  }
  var out = new ArrayList<Message>(system);
  out.addAll(kept);
  return out;
}

Token budget: allocating ~5000 tokens

Treat context as a fixed pie. For a typical 8K–16K effective window, this guide uses a ~5000 token working budget for input assembly—reserving room for model overhead, tools, and completion.

Models advertise 128K–200K context. Production prompts should rarely use it all: cost scales linearly with input tokens, latency rises, and lost-in-the-middle effects worsen. Platform teams publish an internal budget table that every product must fit.

Reference budget (~5000 input tokens)

Bucket	Tokens	% of budget	Contents
System + policy	600–900	12–18%	Role, safety rules, output format, tool policy
RAG / knowledge	1800–2400	36–48%	Retrieved chunks, citations, collection preamble
Conversation history	800–1200	16–24%	Recent turns + rolling summary + entity block
Current user message	200–600	4–12%	Question, attachments excerpt, structured form fields
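Response buffer	800–1200	16–24%	Reserved max_tokens; not sent as input, so it is added on top of the ~5000 input tokens when checking the model window (percentage shown relative to the input budget)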
Overhead / tools	200–400	4–8%	JSON schemas, few-shot examples, markup wrappers

Adjusting by product shape

Product	Shift tokens toward	Shrink
FAQ bot	RAG (3000+)	History (400)
Long support thread	History + summary (1500)	RAG (1500) with better retrieval
Code assistant	User paste + tool results	RAG unless docs mode on
Agent with tools	Tool schemas + scratchpad	RAG (retrieve on demand)

Enforcement in code

Count tokens before the API call. Fail closed: trim RAG first (lowest salience chunks), then history, never system policy without explicit version bump.

from dataclasses import dataclass

class BudgetError(ValueError):
    """Raised when a request does not fit its token budget."""

@dataclass
class TokenBudget:
    total: int = 5000
    system: int = 800
    rag: int = 2200
    history: int = 1000
    user: int = 400
    response_reserve: int = 1000

    def validate(self, counts: dict[str, int], model_limit: int) -> None:
        input_sum = sum(counts.values())
        if input_sum > self.total:
            raise BudgetError(f"input {input_sum} exceeds working budget {self.total}")
        if input_sum + self.response_reserve > model_limit:
            raise BudgetError(f"input {input_sum} + reserve {self.response_reserve} > {model_limit}")
        for key, cap in [("system", self.system), ("rag", self.rag), ("history", self.history), ("user", self.user)]:
            if counts.get(key, 0) > cap:
                raise BudgetError(f"{key} {counts[key]} exceeds cap {cap}")

def fit_rag(chunks: list[str], cap: int, tok) -> list[str]:
    """Greedy fill: skip a chunk that does not fit and keep trying smaller ones."""
    kept, used = [], 0
    # score() looks up each chunk's rerank score; it is supplied by the pipeline.
    for c in sorted(chunks, key=score, reverse=True):
        t = len(tok.encode(c))
        if used + t > cap:
            continue
        kept.append(c)
        used += t
    return kept

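public record TokenBudget(int total, int system, int rag, int history, int user, int responseReserve) {
  public void validate(Map<String, Integer> counts, int modelLimit) {
    int inputSum = counts.values().stream().mapToInt(Integer::intValue).sum();
    if (inputSum > total)
      throw new BudgetException("over working budget");
    if (inputSum + responseReserve > modelLimit)
      throw new BudgetException("over model limit");
    // Check every bucket cap, not only RAG.
    var caps = Map.of("system", system, "rag", rag, "history", history, "user", user);
    for (var e : caps.entrySet()) {
      if (counts.getOrDefault(e.getKey(), 0) > e.getValue())
        throw new BudgetException(e.getKey() + " over cap");
    }
  }
}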

💰 Cost

At $2.50/1M input tokens, shaving 2000 tokens per request × 1M requests/month saves ~$5000/month. Budget tables turn cost conversations into engineering tickets with clear levers.
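The arithmetic behind that figure, as a small helper you can rerun with your own prices and volumes. The function name is illustrative.

```python
def monthly_savings(tokens_saved_per_request: int, requests_per_month: int, usd_per_million_tokens: float) -> float:
    """Dollars saved per month from trimming input tokens on every request."""
    tokens_saved = tokens_saved_per_request * requests_per_month
    return tokens_saved / 1_000_000 * usd_per_million_tokens


print(monthly_savings(2000, 1_000_000, 2.50))  # 5000.0
```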

💡 Pro Tip

Expose bucket fill levels in debug UI for internal users—when RAG is consistently at 98% cap, retrieval is returning too much noise; tighten k or add compression.

⚠️ Pitfall

Forgetting tool definitions in the budget. A 12-tool agent can burn 3000+ tokens before the user speaks—account for tools in system/overhead or use dynamic tool loading.

Worked example: support bot turn

Model limit 8192; target input 5000; reserve 1000 for completion.

Bucket	Actual	Cap	Action
System	780	800	OK
RAG	2580	2200	Drop lowest rerank 2 chunks
History	920	1000	OK
User	310	400	OK
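The table can be reproduced in a few lines. The per-chunk token sizes below are hypothetical, chosen only so they sum to the 2580 shown for RAG; the helper drops chunks from the low-rerank end until the bucket fits its cap.

```python
def trim_rag_to_cap(chunk_tokens: list[int], cap: int) -> tuple[list[int], int]:
    """Drop lowest-ranked chunks (end of the list) until the total fits the cap."""
    kept = list(chunk_tokens)
    dropped = 0
    while kept and sum(kept) > cap:
        kept.pop()
        dropped += 1
    return kept, dropped


actual = {"system": 780, "history": 920, "user": 310}
rag_chunks = [520, 480, 450, 430, 380, 320]  # hypothetical sizes, best-first, sum = 2580

kept, dropped = trim_rag_to_cap(rag_chunks, cap=2200)
input_total = sum(actual.values()) + sum(kept)
print(dropped, sum(kept), input_total)  # 2 1880 3890

# Both limits from the example hold after the trim.
assert input_total <= 5000
assert input_total + 1000 <= 8192
```

Dropping one chunk leaves RAG at 2260, still over the cap, which is why the action column says two.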

Dynamic reallocation

When user message is tiny (50 tokens), borrow 200 tokens into RAG for that request only—log budget_borrowed=true for analytics. Never borrow from response reserve.
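A sketch of the per-request borrow, assuming the caps from the reference budget (user 400, RAG 2200). The threshold for "tiny" and the function name are assumptions; the response reserve is deliberately not a parameter so it cannot be borrowed from.

```python
def rag_cap_for_request(
    user_tokens: int,
    user_cap: int = 400,
    rag_cap: int = 2200,
    tiny_threshold: int = 50,
    max_borrow: int = 200,
) -> dict:
    """Lend unused user-bucket tokens to RAG for this request only."""
    borrowed = 0
    if user_tokens <= tiny_threshold:
        borrowed = min(max_borrow, user_cap - user_tokens)
    return {
        "rag_cap": rag_cap + borrowed,
        "budget_borrowed": borrowed > 0,  # log this flag for analytics
    }


print(rag_cap_for_request(50))   # {'rag_cap': 2400, 'budget_borrowed': True}
print(rag_cap_for_request(310))  # {'rag_cap': 2200, 'budget_borrowed': False}
```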

📦 Real World

Gateways enforce per-tenant budget profiles—free tier gets 2500 input cap, enterprise 8000—same codebase, different TokenBudget config in JWT claims.

Context compression: LLMLingua and when to compress

LLMLingua and related techniques remove low-information tokens from prompts while preserving task performance. Know when to compress vs when to retrieve less—compression is not a substitute for bad retrieval.

You have 12 reranked chunks totaling 9000 tokens; budget allows 2200. Options: (a) drop chunks, (b) compress each chunk, (c) extract sentences. Compression wins when chunks are individually valuable but verbose; dropping wins when whole documents are redundant.

LLMLingua family

LLMLingua: small LM scores token importance; drops tokens with controllable compression ratio (2×–10×).
LongLLMLingua: question-aware compression with document reordering for long inputs; preserves question-relevant spans better in RAG settings.
LLMLingua-2: faster token classification; suited for online serving with GPU budget.

Approach	Latency	Faithfulness risk	Best when
Retrieve fewer chunks	Lowest	Miss relevant doc	Reranker already sharp; k was too high
Sentence extraction	Low	Breaks cross-sentence refs	FAQ paragraphs, policy sections
LLMLingua compress	Medium (GPU)	Drops rare tokens (IDs, codes)	Verbose logs, narrative docs
Abstractive summarize	High (LLM call)	Hallucinate summary facts	Many chunks same theme

Decision flow

If MRR@k is low → fix retrieval/chunking, do not compress garbage.
If chunks redundant → reduce k or cluster then summarize once.
If chunks unique and verbose → LLMLingua per chunk at 4× ratio; validate IDs survive.
If compliance requires verbatim quotes → no lossy compression; drop or split across turns.
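The decision flow can be written down as a small function so the routing is testable. All thresholds here are placeholders to calibrate on your own data, and the verbatim check is placed ahead of the lossy options so regulated text never reaches a compressor. The inputs (`redundancy` as a 0 to 1 overlap estimate, for instance) are assumed to be computed elsewhere.

```python
def choose_strategy(
    mrr_at_k: float,
    redundancy: float,
    avg_chunk_tokens: float,
    verbatim_required: bool,
    mrr_floor: float = 0.5,
    redundancy_ceiling: float = 0.6,
    verbose_tokens: int = 400,
) -> str:
    """Route an over-budget retrieval set to one remediation."""
    if mrr_at_k < mrr_floor:
        return "fix_retrieval"            # do not compress garbage
    if verbatim_required:
        return "drop_or_split"            # no lossy compression
    if redundancy > redundancy_ceiling:
        return "reduce_k_or_cluster_summarize"
    if avg_chunk_tokens > verbose_tokens:
        return "compress_per_chunk_4x"    # then validate that IDs survive
    return "no_action"


print(choose_strategy(mrr_at_k=0.8, redundancy=0.2, avg_chunk_tokens=750, verbatim_required=False))
# compress_per_chunk_4x
```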

Protecting structured tokens

Post-compression regex audit for SKU patterns, UUIDs, dollar amounts, and legal section numbers. If density drops >30%, fall back to sentence extraction for that chunk.
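One possible shape for that audit. The regular expressions are illustrative (the `SKU-` prefix in particular is a made-up format), and "density drops >30%" is interpreted here as more than 30% of the structured tokens found in the original going missing from the compressed text.

```python
import re

CODE_PATTERNS = [
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b",  # UUID
    r"\$\d+(?:,\d{3})*(?:\.\d+)?",   # dollar amounts such as $250 or $1,250.50
    r"\bSKU-[A-Z0-9]+\b",            # hypothetical SKU format
    r"§\s?\d+(?:\.\d+)*",            # legal section numbers such as § 4.2
]


def extract_codes(text: str) -> list[str]:
    """Collect structured tokens that must survive compression verbatim."""
    found = []
    for pattern in CODE_PATTERNS:
        found.extend(re.findall(pattern, text))
    return found


def code_survival(original: str, compressed: str) -> float:
    """Fraction of the original's structured tokens still present after compression."""
    codes = extract_codes(original)
    if not codes:
        return 1.0
    return sum(1 for c in codes if c in compressed) / len(codes)


def needs_fallback(original: str, compressed: str, max_drop: float = 0.30) -> bool:
    """True when too many codes were lost and sentence extraction should be used."""
    return (1.0 - code_survival(original, compressed)) > max_drop


src = "Order SKU-AB12 costs $1,250.50 under § 4.2, ref 123e4567-e89b-12d3-a456-426614174000."
print(code_survival(src, "SKU-AB12 $1,250.50"))   # 0.5
print(needs_fallback(src, "SKU-AB12 $1,250.50"))  # True
```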

def compress_context(text: str, target_ratio: float, compressor, tok) -> str:
    """target_ratio 0.25 = keep ~25% of tokens."""
    original = len(tok.encode(text))
    target = max(64, int(original * target_ratio))
    result = compressor.compress_prompt(text, target_token=target)
    # llmlingua's PromptCompressor returns a dict; other compressors may return a str.
    compressed = result["compressed_prompt"] if isinstance(result, dict) else result
    # Guardrails: preserve alphanumeric codes
    for token in extract_codes(text):
        if token not in compressed:
            compressed += f"\n[PRESERVED_CODE: {token}]"
    return compressed

def should_compress(chunks: list[str], budget: int, tok) -> bool:
    total = sum(len(tok.encode(c)) for c in chunks)
    if total <= budget:
        return False
    # Compress when avg chunk > 400 tokens and rerank top-5 already included
    return len(chunks) <= 8 and (total / len(chunks)) > 400

public String compressContext(String text, double targetRatio, Compressor compressor, Tokenizer tok) {
  int original = tok.count(text);
  int target = Math.max(64, (int) (original * targetRatio));
  String compressed = compressor.compress(text, target);
  for (String code : extractCodes(text)) {
    if (!compressed.contains(code))
      compressed += "\n[PRESERVED_CODE: " + code + "]";
  }
  return compressed;
}

🔬 Under the Hood

LLMLingua pairs a budget controller, which allocates the compression budget across prompt components, with token-level pruning guided by a small LM's perplexity: highly predictable (low-perplexity) tokens carry the least information and get dropped first.

⚖️ Trade-off

Compression adds serving complexity (GPU for local model or extra vendor). Retrieve-less is simpler—try dropping k by 2 and measure faithfulness before enabling LLMLingua in prod.

📦 Real World

Observability vendors compress log snippets 4× before RAG on incident queries; they keep retrieve-less for metrics docs under 300 tokens/chunk—hybrid pipeline per doc type.

LLMLingua deployment modes

Sidecar GPU — compress microservice between rerank and generate; 20–40ms at 4× on A10.
Batch pre-compress at ingest — store compressed + raw; serve compressed for browse, raw for legal export.
Vendor API — emerging; evaluate faithfulness on your domain before commit.

Compression vs retrieve-less matrix

Signal	Prefer retrieve-less	Prefer compress
High chunk overlap	✓
Verbose single doc		✓
Low MRR@k	Fix retrieval first	Never
IDs in every answer	✓	Risky

⚠️ Pitfall

Compressing already-short chunks (<200 tokens) often deletes negation words (“not eligible”)—skip compression below minimum chunk size.

Production: context audit checklist

Before shipping—or after a faithfulness regression—run this context audit. Then continue to advanced prompting for injection defenses and CI prompt testing.

Context audit checklist

Published token budget table per product surface (system / RAG / history / user / reserve).
Pre-flight token count with the same tokenizer the provider uses (tiktoken, Anthropic counter, etc.).
RAG assembly logged: chunk IDs, order, total tokens, compression ratio if any.
Lost-in-the-middle ablation: relevance vs sandwich order on golden set.
Conversation memory: entity store scoped per user; vector memory TTL documented.
Compression guardrails for IDs, codes, and regulated verbatim text.
Alert when p95 input tokens > 80% of model limit—headroom for tools and spikes.
Debug trace shows which bucket was trimmed when over budget (RAG vs history).
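The p95 headroom alert from the checklist can be prototyped offline before wiring it into a metrics stack. This sketch uses the standard library's `statistics.quantiles`; the 80% threshold comes from the checklist, while the function names and the sample data are illustrative.

```python
import statistics


def p95(values: list[float]) -> float:
    """95th percentile using linear interpolation between observed points."""
    if not values:
        return 0.0
    if len(values) < 2:
        return float(values[0])
    # n=100 yields 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(values, n=100, method="inclusive")[94]


def headroom_alert(input_token_counts: list[float], model_limit: int, threshold: float = 0.80) -> bool:
    """True when p95 input size leaves less than (1 - threshold) of the window free."""
    return p95(input_token_counts) > threshold * model_limit


print(headroom_alert([7000] * 100, model_limit=8192))  # True
print(headroom_alert([3000] * 100, model_limit=8192))  # False
```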

Signal	Threshold	Action
Empty RAG after trim	>1% of requests	Lower k at retrieve or raise RAG bucket
Summary staleness	>20 turns without refresh	Force rolling summary job
Memory retrieval miss	User repeats fact 3×	Entity extraction failure—fix parser
Compression code loss	Any regulated SKU miss	Disable lossy compression for that doc class

📦 Production checklist

Context engineering is production-ready when every request has a traceable budget breakdown and you can answer “why was this chunk excluded?” without reproducing the user’s full thread.

🎯 Interview Tip

“Design context for a support bot with RAG.” — Budget table, sandwich ordering, entity memory for account ID, compress logs not policy, audit checklist.

🔒 Security

Audit that trimmed history never drops security instructions. System policy version must be pinned—not subject to LRU eviction in conversation manager.

Research notes on context length

Liu et al. (2023) tested multi-document QA with gold passages placed at varying positions—performance dropped sharply when gold was in the middle third. Replicate with your tokenizer and model because absolute numbers shift by family.

Anthropic and OpenAI long-context models improved middle recall but RAG systems still benefit from explicit ordering policy—do not rely on marketing context limits as engineering limits.

Operational metrics

Track context_tokens_by_bucket as Prometheus histograms. Slice by product and model—budget violations often cluster on one feature flag cohort.

Set SLO: 99% of requests fit budget without dropping system policy. Alert on compression ratio p95 > 6× (signals broken retrieval).

Cross-track links

Token counting primitives: APIs, tokens & context. RAG assembly inputs: Retrieval strategies.

Output formatting that survives truncation: Reliable output formatting (previous guide).