sharpbyte.dev
← RAG
RAG · topic 12 of 13

RAG + cache-augmented generation (CAG)

Cache stable knowledge in KV memory; keep volatile facts on the retrieval path.

RAG vs CAG and cache-augmented generation

RAG vs CAG

RAG changed how we build knowledge-grounded systems, but it still has a weakness.

Every time a query comes in, the model often re-fetches the same context from the vector DB, which can be expensive, redundant, and slow. Cache-Augmented Generation (CAG) fixes this. It lets the model “remember” stable information by caching it directly in the model’s key-value memory. And you can take this one step ahead by fusing RAG and CAG as depicted below: Here’s how it works in simple terms:

● In a regular RAG setup, your query goes to the vector database, retrieves relevant chunks, and feeds them to the LLM.

● But in RAG + CAG, you divide your knowledge into two layers.

○ The static, rarely changing data, like company policies or reference guides, gets cached once inside the model’s KV memory.

RAG + CAG: cache stable knowledge in KV memory while dynamic facts still flow through retrieval.
RAG + CAG: cache stable knowledge in KV memory while dynamic facts still flow through retrieval.

○ The dynamic, frequently updated data, like recent customer interactions or live documents, continues to be fetched via retrieval. This way, the model doesn’t have to reprocess the same static information every time. It uses it instantly from cache, and supplements it with whatever’s new via retrieval to give faster inference. The key here is to be selective about what you cache. You should only include stable, high-value knowledge that doesn’t change often. If you cache everything, you’ll hit context limits, so separating “cold” (cacheable) and “hot” (retrievable) data is what keeps this system reliable. You can also see this in practice above. Many APIs like OpenAI and Anthropic already support prompt caching, so you can start experimenting right away.

Key takeaways

  • Separating stationary policies from live feeds cuts redundant vector trips.
  • Prompt-caching APIs mirror the “cache what repeats” mental model.