Static training corpora, private data, context windows—and where vector memory fits.
The purpose of vector databases in RAG

At this point, it is worth looking at how exactly LLMs take advantage of vector databases. A common point of confusion is this: once we have trained our LLM, it has a set of model weights for text generation. Where do vector databases fit in? Let's understand this.

To begin, we must understand that an LLM is deployed after learning from a static snapshot of the corpus it was fed during training. For instance, if the model was trained on data up to 31st Jan 2024 and we use it, say, a week later, it will have no clue about what happened in the intervening days. Repeatedly training a new model (or adapting the latest version) every single day on new data is impractical and prohibitively expensive. In fact, LLMs can take weeks to train.
Also, what if we open-sourced the LLM and someone else wanted to use it on their privately held dataset, which, of course, was not seen during training? As expected, the LLM will have no clue about it.

But if you think about it, is it really our objective to train an LLM to know every single thing in the world? Not at all. Instead, the goal is to help the LLM learn the overall structure of the language, and how to understand and generate it. So, once we have trained the model on a sufficiently large training corpus, we can expect it to have decent language understanding and generation capabilities. Thus, if we could figure out a way for LLMs to look up new information they were not trained on and use it in text generation (without training the model again), that would be great!
One way could be to provide that information in the prompt itself. But since LLMs have a limit on the context window (the number of tokens they can accept), the additional information can exceed that limit.

Vector databases solve this problem. As discussed earlier, vector databases store information in the form of vectors, where each vector captures semantic information about the piece of text being encoded. Thus, we can maintain our available information in a vector database by encoding it into vectors using an embedding model. When the LLM needs this information, the prompt is encoded into a query vector with the same embedding model, and an approximate similarity search over the vector database finds the stored content that is most similar to that query vector.
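The encode-then-search idea described above can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: `toy_embed` is a hypothetical stand-in for a real embedding model (it just counts words over a fixed vocabulary), and `search` performs an exact brute-force cosine-similarity scan rather than the approximate nearest-neighbor search a real vector database would use.

```python
import math
from collections import Counter


def toy_embed(text, vocab):
    """Hypothetical stand-in for an embedding model: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, documents, vocab, top_k=2):
    """Exact (brute-force) search: score every document against the query vector."""
    query_vec = toy_embed(query, vocab)
    scored = [
        (cosine_similarity(query_vec, toy_embed(doc, vocab)), doc)
        for doc in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Illustrative documents, invented for this sketch.
documents = [
    "the refund policy allows returns within 30 days",
    "our office is closed on public holidays",
    "shipping takes five business days",
]
vocab = sorted({word for doc in documents for word in doc.split()})

results = search("what is the refund policy", documents, vocab, top_k=1)
print(results[0][1])
```

A real system would replace `toy_embed` with a trained embedding model and the linear scan with an approximate index, but the flow (encode the stored text, encode the query, rank by similarity) stays the same.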
Once the approximate nearest neighbors have been retrieved, we gather the raw context corresponding to those specific vectors. This context was stored alongside the vectors when the data was indexed in the vector database (the raw data is stored as a payload, which we will learn about during implementation). The search process above retrieves context that is similar to the query vector, which represents the topic the LLM is interested in. We can then augment the actual prompt provided by the user with this retrieved content and give the combined text as input to the LLM.
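The retrieve-and-augment step can be sketched as follows. This is a minimal illustration, not a definitive implementation: the `Record` class, the hand-written three-dimensional vectors, the helper names `retrieve_payloads` and `build_augmented_prompt`, and the prompt template are all assumptions made for the example, and a dot-product ranking over a Python list stands in for the vector database's search.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """One indexed item: a vector plus the raw text stored with it as payload."""
    vector: List[float]
    payload: str


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve_payloads(query_vector, records, top_k):
    """Rank records by dot product with the query and return the top payloads."""
    ranked = sorted(records, key=lambda r: dot(query_vector, r.vector), reverse=True)
    return [r.payload for r in ranked[:top_k]]


def build_augmented_prompt(user_prompt, payloads):
    """Combine the retrieved context with the user's prompt (hypothetical template)."""
    context = "\n".join(f"- {text}" for text in payloads)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_prompt}"
    )


# Hand-written vectors and texts, invented for this sketch.
records = [
    Record([0.9, 0.1, 0.0], "Refunds are accepted within 30 days of purchase."),
    Record([0.1, 0.9, 0.0], "The office is closed on public holidays."),
    Record([0.8, 0.0, 0.2], "Refunds are issued to the original payment method."),
]

query_vector = [1.0, 0.0, 0.0]  # assumed embedding of the user prompt
user_prompt = "How do refunds work?"

payloads = retrieve_payloads(query_vector, records, top_k=2)
prompt = build_augmented_prompt(user_prompt, payloads)
print(prompt)
```

Only the two refund-related payloads end up in the final prompt, so the text passed to the LLM stays well within the context window while still carrying the information it was never trained on.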