sharpbyte.dev
RAG · topic 3 of 13

Vector databases in RAG

Static training corpora, private data, context windows—and where vector memory fits.

Why LLMs need an external memory layer

The purpose of vector databases in RAG

At this point, it is worth looking at how exactly LLMs take advantage of vector databases. A common point of confusion is this: once an LLM has been trained, its weights are all it needs for text generation, so where do vector databases fit in?

To begin, an LLM is deployed after learning from a static snapshot of the corpus it was fed during training. For instance, if the training data ends on 31 January 2024 and we use the model a week later, it will know nothing about what happened during that week.

Retraining a new model (or adapting the latest version) every single day on new data is impractical and not cost-effective. In fact, LLMs can take weeks to train.

LLMs are frozen after training; their weights don’t automatically include events after the cutoff.
Daily drift: new data after deployment isn’t in the model unless you add a retrieval or training loop.

Also, what if we open-source the LLM and someone else wants to use it on a privately held dataset that, of course, was never shown during training? As expected, the LLM will know nothing about it.

But is it really our objective to train an LLM to know every single thing in the world? Not at all. The objective is rather to help the LLM learn the overall structure of the language and how to understand and generate it. Once the model has been trained on a sufficiently large corpus, we can expect it to have a decent level of language understanding and generation ability.

Thus, if we could find a way for LLMs to look up information they were not trained on and use it during text generation, without training the model again, that would be ideal.

Private or tenant-specific corpora were never in the base model—retrieval closes that gap.
Goal: strong general language skills in the weights; fresh facts supplied at inference via lookup.

One way could be to provide that information in the prompt itself. But LLMs have a limit on the context window (the number of tokens they can accept), and the additional information can easily exceed that limit.

Vector databases solve this problem. As discussed earlier, vector databases store information in the form of vectors, where each vector captures semantic information about the piece of text being encoded. Thus, we can maintain our available information in a vector database by encoding it into vectors using an embedding model.

When the application needs this information at query time, it encodes the prompt with the same embedding model and runs an approximate similarity search against the vector database to find the content most similar to the query vector.
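To make the encode-then-search flow concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: `embed` is a toy hashed bag-of-words function standing in for a real embedding model, `ToyVectorStore` is a hypothetical in-memory class rather than any real vector database, and the search is an exact brute-force scan, whereas a production vector database would use an approximate nearest-neighbor index.

```python
import hashlib
import math
import re

DIM = 512  # toy vector width, chosen arbitrarily for this sketch


def embed(text):
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # Hash each token to a fixed bucket so the same word always lands in the same slot.
        bucket = int(hashlib.sha256(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


class ToyVectorStore:
    """Hypothetical in-memory store: keeps each vector next to its original text."""

    def __init__(self):
        self._rows = []  # list of (vector, payload) pairs

    def add(self, text):
        # Encode once at indexing time and keep the raw text as the payload.
        self._rows.append((embed(text), text))

    def search(self, query, k=2):
        # Vectors are unit length, so the dot product equals cosine similarity.
        q = embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), payload)
            for vec, payload in self._rows
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = ToyVectorStore()
for doc in [
    "To reset your password, open Settings and choose Reset password.",
    "Invoices are emailed on the first day of each month.",
    "The office is closed on public holidays.",
]:
    store.add(doc)

results = store.search("How do I reset my password?", k=2)
for score, payload in results:
    print(f"{score:.3f}  {payload}")
```

The toy embedding only rewards shared words, so it cannot match paraphrases the way a learned embedding model can. The overall shape is the point: encode the documents once, encode the query the same way, and rank stored vectors by similarity.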

Whole-document prompts blow past context limits—vector stores hold the corpus compactly.
Encode documents once; at query time, match the question embedding to the nearest stored chunks.

Once the approximate nearest neighbors have been retrieved, we gather the original content associated with those vectors. This raw data was stored alongside each vector when the data was indexed in the vector database (it is stored as the payload, which we will cover during implementation).

Because the search is driven by the query vector, which represents the topic the user is asking about, the retrieved content is relevant to that topic. We then augment the user's actual prompt with the retrieved content and pass the combined text to the LLM as input.
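The augmentation step can be sketched as a small function that joins retrieved payloads with the user's question. This is only an illustration under stated assumptions: `build_prompt` and `max_context_chars` are hypothetical names, the prompt wording is arbitrary, and a character count is used as a crude proxy for a token budget (a real system would count tokens with the model's tokenizer).

```python
def build_prompt(question, passages, max_context_chars=2000):
    """Join retrieved passages with the user question, within a size budget.

    `passages` is assumed to be ordered best match first, as returned by a
    similarity search. Characters stand in for tokens in this sketch.
    """
    kept = []
    used = 0
    for passage in passages:
        # Stop adding context once the next passage would exceed the budget.
        if used + len(passage) > max_context_chars:
            break
        kept.append(passage)
        used += len(passage)

    # Number the passages so the model (and the reader) can tell them apart.
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(kept, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = [
    "To reset your password, open Settings and choose Reset password.",
    "Password resets require access to your registered email address.",
]
prompt = build_prompt("How do I reset my password?", retrieved)
print(prompt)
```

The resulting string is what gets sent to the LLM in place of the bare question. Dropping the lowest-ranked passages first is one simple way to stay within the context window.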

Approximate nearest neighbors return payloads (original text/metadata) paired with each vector.
Retrieved passages are concatenated with the user message before the model answers.

Key takeaways

  • Deployed LLMs reflect a frozen training snapshot; vector stores inject fresher or private evidence.
  • Similarity search retrieves payload text to pack into prompts when raw docs exceed context limits.