Full fine-tuning vs LoRA vs RAG

Three ways to add knowledge: full weights, adapters, or retrieval—pros, costs, and limits.

Comparing augmentation strategies

But this comes at the cost of increased latency and more LLM usage.

All three techniques are used to augment the knowledge of an existing model with additional data.

1) Full fine-tuning

Three augmentation paths: full fine-tuning, LoRA adapters, and retrieval without weight updates.

Fine-tuning means adjusting the weights of a pre-trained model on a new dataset for better performance. While this technique has been used successfully for a long time, problems arise when we apply it to much larger models such as LLMs, primarily because of their size, the cost involved in fine-tuning all weights, and the cost involved in maintaining all the large fine-tuned models.
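To get a feel for the maintenance cost, here is a rough back-of-the-envelope sketch. The model size (7B parameters), the 16-bit weight format, and the number of fine-tuned variants are all assumptions picked for illustration, not figures from this article.

```python
# Rough storage arithmetic for keeping several fully fine-tuned copies.
# All numbers below are hypothetical, chosen only for illustration.

def checkpoint_size_gb(num_params: int, bytes_per_param: int = 2) -> float:
    """Size of one full set of weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

num_params = 7_000_000_000   # assumed 7B-parameter model
num_tasks = 5                # assumed number of separately fine-tuned variants

per_copy = checkpoint_size_gb(num_params)   # 16-bit weights: 2 bytes each
total = per_copy * num_tasks                # every variant is a full copy

print(f"one copy: {per_copy:.0f} GB, {num_tasks} copies: {total:.0f} GB")
```

Every fully fine-tuned variant is a complete copy of the weights, so storage grows linearly with the number of tasks.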

2) LoRA fine-tuning

LoRA fine-tuning addresses the limitations of traditional fine-tuning. The core idea is to leave the original weight matrices (some or all) as they are and learn the change to each one as a product of two low-rank matrices, training only those instead. For instance, in the graphic below, the bottom network represents the large pre-trained model, and the top network represents the model with LoRA layers.

Full fine-tuning adjusts every weight—powerful but expensive for large LLMs.

The idea is to train only the LoRA network and freeze the large model. Looking at the above visual, you might think: but the LoRA model has more neurons than the original model, so how does that help? To understand this, note that neurons have nothing to do with the memory of the network. They are just used to illustrate the dimensionality transformation from one layer to another. It is the weight matrices (the connections between two layers) that take up memory. Thus, we should be comparing these connections instead:

LoRA adds trainable low-rank matrices while keeping the backbone frozen.

Looking at the above visual, it is clear that the LoRA network has far fewer connections.
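The connection count can be made concrete with a small sketch. It assumes the usual LoRA setup, where a frozen weight matrix W of shape d_out x d_in gets a trainable update B·A with B of shape d_out x r and A of shape r x d_in. The layer size (4096) and rank (8) are hypothetical, and the helper names are made up for this example.

```python
import random

def full_params(d_out: int, d_in: int) -> int:
    """Connections in one dense layer: every input connects to every output."""
    return d_out * d_in

def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Trainable connections in the two low-rank matrices A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r

def matvec(M, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x):
    """y = W x + B (A x). W stays frozen; only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]

# Parameter comparison for one assumed 4096 x 4096 layer at rank 8.
full = full_params(4096, 4096)      # 16,777,216
lora = lora_params(4096, 4096, 8)   # 65,536
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")

# Tiny forward pass. B starts at zero (a common initialization), so the
# adapted layer initially behaves exactly like the frozen one.
d_in, d_out, r = 6, 4, 2
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]
x = [1.0] * d_in
y = lora_forward(W, A, B, x)
```

With these assumed sizes the low-rank pair has 256 times fewer trainable connections than the full matrix, which is the comparison the visual is making.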

3) RAG

Retrieval-augmented generation (RAG) is another pretty cool way to augment neural networks with additional information, without having to fine-tune the model. This is illustrated below. There are 7 steps, which are also marked in the visual:

RAG augments at inference by pulling external passages into the prompt.

Seven-step RAG walkthrough labeled on the diagram: ingest, embed, store, query, retrieve, prompt, answer.

Step 1-2: Take additional data, and dump it in a vector database after embedding. (This is only done once. If the data is evolving, just keep dumping the embeddings into the vector database. There’s no need to repeat this for the entire data.)

Step 3: Use the same embedding model to embed the user query.

Step 4-5: Find the nearest neighbors in the vector database to the embedded query.

Step 6-7: Provide the original query and the retrieved documents (for more context) to the LLM to get a response.

In fact, even its name entirely justifies what we do with this technique:

Retrieval: Accessing and retrieving information from a knowledge source, such as a database or memory.

Augmented: Enhancing or enriching something, in this case, the text generation process, with additional information or context.

Generation: The process of creating or producing something, in this context, generating text or language.

Of course, there are many problems with RAG too, such as:

Retrieval, augmentation, and generation spelled out as the RAG acronym implies.
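The seven steps above can be sketched end to end in a few lines. This is a toy illustration under loud assumptions: `embed` is a bag-of-words stand-in for a real embedding model, `ToyVectorStore` is an in-memory list instead of a real vector database, and `fake_llm` is a placeholder for an actual LLM call. All three names are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ToyVectorStore:
    """In-memory stand-in for a vector database (steps 1-2 and 4-5)."""
    def __init__(self):
        self.items = []                      # list of (embedding, document)

    def add(self, docs):
        # Only the new documents are embedded; existing entries are untouched.
        for d in docs:
            self.items.append((embed(d), d))

    def search(self, query_vec, k=2):
        ranked = sorted(self.items, key=lambda it: cosine(query_vec, it[0]), reverse=True)
        return [doc for _, doc in ranked[:k]]

def build_prompt(query, docs):
    """Step 6: combine the original query with the retrieved documents."""
    context = "\n".join(f"- {d}" for d in docs)
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

def fake_llm(prompt):
    """Step 7 placeholder: a real system would send the prompt to an LLM."""
    return f"[LLM would answer from a prompt of {len(prompt)} characters]"

# Steps 1-2: embed the additional data and store it (done once).
store = ToyVectorStore()
store.add([
    "LoRA trains small low-rank matrices and freezes the base model.",
    "RAG retrieves documents from a vector database at query time.",
    "Full fine-tuning updates every weight of the model.",
])
# Evolving data: just add the new documents, no need to redo the rest.
store.add(["New documents can be embedded and appended later."])

# Step 3: embed the query with the same embedding function.
query = "How does RAG use a vector database?"
# Steps 4-5: nearest neighbors of the embedded query.
retrieved = store.search(embed(query), k=2)
# Steps 6-7: prompt the LLM with the query plus retrieved context.
prompt = build_prompt(query, retrieved)
answer = fake_llm(prompt)
print(retrieved[0])
```

Note that the model weights never change here: all the new knowledge reaches the LLM through the prompt.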

Key takeaways

Fine-tuning bakes knowledge into weights; RAG keeps weights frozen and supplies evidence per query.
RAG excels at lookup QA but may miss global corpus understanding in one prompt window.