Three ways to add knowledge: full weights, adapters, or retrieval—pros, costs, and limits.
But this comes at the cost of increased latency and more LLM usage.
Full-model Fine-tuning vs. LoRA vs. RAG
All three techniques are used to augment the knowledge of an existing model with additional data.
1) Full fine-tuning
Fine-tuning means adjusting the weights of a pre-trained model on a new dataset for better performance on that data. This technique has been used successfully for a long time, but problems arise when we apply it to much larger models such as LLMs, primarily because of their size, the cost of updating all of the weights, and the cost of storing and maintaining a separate full-size copy for every fine-tuned model.
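To make those costs concrete, here is a rough back-of-the-envelope sketch. It assumes 32-bit weights and gradients plus an Adam-style optimizer that keeps two extra values per weight, and it ignores activations entirely. The 7B-parameter model, the function names, and the byte counts are illustrative assumptions, not measurements of any specific model.

```python
def full_finetune_memory_gb(n_params, bytes_weight=4, bytes_grad=4, bytes_optimizer=8):
    """Rough training memory in GB (10**9 bytes), ignoring activations.

    Assumes fp32 weights and gradients, and an Adam-style optimizer
    that stores two fp32 moments per weight (8 bytes).
    """
    return n_params * (bytes_weight + bytes_grad + bytes_optimizer) / 1e9


def checkpoint_storage_gb(n_params, n_models, bytes_per_param=2):
    """Disk needed to keep n_models full fine-tuned copies (fp16 assumed)."""
    return n_params * bytes_per_param * n_models / 1e9


seven_b = 7_000_000_000  # hypothetical 7B-parameter model

print(full_finetune_memory_gb(seven_b))    # 112.0
print(checkpoint_storage_gb(seven_b, 10))  # 140.0
```

Under these assumptions, training needs on the order of 112 GB before a single activation is stored, and keeping ten task-specific copies takes about 140 GB of disk.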
2) LoRA fine-tuning
LoRA fine-tuning addresses the limitations of traditional fine-tuning. The core idea is to keep the weight matrices of the original model frozen and, for some or all of them, learn the weight update as a product of two much smaller low-rank matrices, training only those. For instance, in the graphic below, the bottom network represents the large pre-trained model, and the top network represents the model with LoRA layers.
The idea is to train only the LoRA network and freeze the large model. Looking at the above visual, you might think: But the LoRA model has more neurons than the original model. How does that help? To understand this, keep in mind that neurons have nothing to do with the memory of the network. They are just used to illustrate the dimensionality transformation from one layer to another. It is the weight matrices (the connections between two layers) that take up memory. Thus, we should compare these connections instead:
Looking at the above visual, it is clear that the LoRA network has far fewer connections than the original model.
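The same comparison can be made with numbers. The following is a minimal NumPy sketch of a single LoRA-style layer, not the implementation of any particular library. The class name `LoRALinear`, the layer size (1024 x 1024), the rank (8), and the scaling factor `alpha` are all assumptions chosen for illustration. The frozen matrix `W` stands in for a pre-trained weight, and only the two small matrices `A` and `B` would be trained.

```python
import numpy as np


class LoRALinear:
    """Frozen dense layer W plus a trainable low-rank update B @ A."""

    def __init__(self, d_in, d_out, rank, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen "pre-trained" weight (random here, standing in for real weights).
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
        # Trainable low-rank factors. B starts at zero, so the layer
        # initially behaves exactly like the frozen one.
        self.A = rng.standard_normal((rank, d_in)) * 0.01
        self.B = np.zeros((d_out, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # x has shape (batch, d_in): frozen path plus scaled low-rank path.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def frozen_params(self):
        return self.W.size

    def trainable_params(self):
        return self.A.size + self.B.size


layer = LoRALinear(d_in=1024, d_out=1024, rank=8)
x = np.ones((2, 1024))

print(layer.frozen_params())     # 1048576
print(layer.trainable_params())  # 16384
print(layer.forward(x).shape)    # (2, 1024)
```

With these sizes, the trainable connections are about 1.6% of the frozen ones, which is the gap the visual is meant to convey. Starting `B` at zero is a common choice because the adapted layer then reproduces the pre-trained output before any training step.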
3) RAG
Retrieval-augmented generation (RAG) is another pretty cool way to augment neural networks with additional information, without having to fine-tune the model. The accompanying visual illustrates it in 7 steps:
Step 1-2: Embed the additional data and store the embeddings in a vector database.
(This is done only once for existing data. If the data keeps evolving, embed just the new documents and add them to the vector database. There is no need to re-embed the entire dataset.)
Step 3: Use the same embedding model to embed the user query.
Step 4-5: Find the nearest neighbors of the embedded query in the vector database.
Step 6-7: Provide the original query and the retrieved documents (for more context) to the LLM to get a response.
In fact, even its name entirely justifies what we do with this technique:
Retrieval: Accessing and retrieving information from a knowledge source, such as a database or memory.
Augmented: Enhancing or enriching something, in this case, the text generation process, with additional information or context.
Generation: The process of creating or producing something, in this context, generating text or language.
Of course, there are many problems with RAG too, such as: