Chunk, embed, store, query, retrieve, re-rank, and generate—end-to-end pipeline anatomy.
Consequently, the LLM can easily incorporate this info while generating text because it now has the relevant details available in the prompt. Now that we understand the purpose, let's get into the technical details.
Workflow of a RAG system
To build a RAG system, it's crucial to understand the foundational components that go into it and how they interact. Thus, in this section, let's explore each element in detail. Here's an architecture diagram of a typical RAG setup:
Let's break it down step by step. We start with some external knowledge that wasn't seen during training, which we want to augment the LLM with:
#1) Create chunks
The first step is to break this additional knowledge into chunks before embedding and storing it in the vector database. We do this because the additional document(s) can be pretty large, so we must ensure that each piece of text fits the input size of the embedding model. Moreover, if we don't chunk, the entire document gets a single embedding that blends all of its topics together, which is of little practical use for retrieving the specific context a query needs.
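To make the chunking step concrete, here is a minimal sketch of one common strategy: fixed-size, word-based chunks with some overlap. The function name `chunk_text` and the `chunk_size` and `overlap` parameters are assumptions made for illustration, not something the workflow above prescribes; real pipelines often split on tokens, sentences, or document structure instead.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into overlapping chunks of at most `chunk_size` words.

    Hypothetical helper: sizes are counted in whitespace-separated words,
    which only approximates the token limit of a real embedding model.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.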
#2) Generate embeddings
After chunking, we embed the chunks using an embedding model.
Since these are “context embedding models” (not word embedding models), models like bi-encoders are highly relevant here.
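The sketch below only illustrates the interface of this step: text goes in, a fixed-length vector comes out. It is not a context embedding model. `toy_embed` is a hypothetical hashed bag-of-words function used as a stand-in so the example runs without downloading a model; in a real system, a trained bi-encoder would take its place and would capture meaning rather than just word identity.

```python
import hashlib
import math
from typing import List


def toy_embed(text: str, dim: int = 64) -> List[float]:
    """Map text to a unit-length vector by hashing each word into a bucket.

    Toy stand-in for an embedding model (assumption for illustration):
    texts that share words get similar vectors, but synonyms do not.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        # hashlib gives a hash that is stable across Python processes,
        # unlike the built-in hash() for strings.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0

    norm = math.sqrt(sum(x * x for x in vec))
    # Normalize so that a dot product between two vectors is their cosine similarity.
    return [x / norm for x in vec] if norm > 0 else vec
```

Whatever model is used, every chunk must be embedded into the same number of dimensions so the vectors can be compared later.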
#3) Store embeddings in a vector database
These embeddings are then stored in the vector database. A vector database thus acts as a memory for your RAG application, since this is precisely where we store all the additional knowledge that will be used to answer the user's query. A vector database also stores the metadata and original content along with the vector embeddings. With that, our vector database has been created and populated; more information can be added later if needed. Now, we move to the query step.
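As a rough picture of what gets stored, the sketch below keeps each vector next to its original text and metadata in a plain Python list. `Record` and `InMemoryVectorStore` are hypothetical names for illustration only; an actual vector database adds persistence, indexing, and filtering on top of this basic idea.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Sequence


@dataclass
class Record:
    vector: List[float]            # the chunk's embedding
    text: str                      # the original chunk content
    metadata: Dict[str, Any] = field(default_factory=dict)  # e.g. source file, page


class InMemoryVectorStore:
    """Minimal illustrative store: a list of records, no index, no persistence."""

    def __init__(self) -> None:
        self.records: List[Record] = []

    def add(self, vector: Sequence[float], text: str,
            metadata: Optional[Dict[str, Any]] = None) -> int:
        """Store one embedded chunk and return its integer id."""
        if self.records and len(vector) != len(self.records[0].vector):
            raise ValueError("all vectors must share the same dimension")
        self.records.append(Record(list(vector), text, dict(metadata or {})))
        return len(self.records) - 1
```

Keeping the original text alongside the vector matters because the vector alone cannot be turned back into the chunk that will later be placed in the prompt.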
#4) User input query
Next, the user inputs a query, a string representing the information they're seeking.
#5) Embed the query
This query is transformed into a vector using the same embedding model we used to embed the chunks earlier in step 2, so that the query and the chunks live in the same vector space and can be compared directly.
#6) Retrieve similar chunks
The vectorized query is then compared against our existing vectors in the database to find the most similar information. The vector database returns the k (a pre-defined parameter) most similar documents/chunks (using approximate nearest neighbor search).
It is expected that these retrieved documents contain information related to the query, providing a basis for the final response generation.
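The retrieval step can be sketched as a similarity ranking. Note the swap: the version below does an exact, brute-force cosine-similarity search over every stored vector, not the approximate nearest neighbor search a vector database would typically use. The function names `cosine_similarity` and `top_k` are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          chunk_vecs: Sequence[Sequence[float]],
          k: int = 3) -> List[Tuple[int, float]]:
    """Return (chunk index, similarity) for the k most similar chunks.

    Exact search: every stored vector is scored, which costs O(n) per query.
    """
    scored = [(i, cosine_similarity(query_vec, vec))
              for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Brute force is fine for a few thousand chunks. Approximate indexes trade a small loss in recall for much faster lookups once the collection grows large.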
#7) Re-rank the chunks
After retrieval, the selected chunks might need further refinement to ensure the most relevant information is prioritized. In this re-ranking step, a more sophisticated model (often a cross-encoder) evaluates each retrieved chunk together with the query and assigns it a relevance score. The chunks are then reordered by this score so that the most relevant ones are prioritized for the response generation. That said, not every RAG app implements this step; many simply rely on the similarity scores obtained in step 6 while retrieving the relevant context from the vector database.
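The shape of the re-ranking step can be sketched as follows. The key difference from step 6 is that the scoring function sees the query and the chunk together, rather than comparing two independently computed vectors. Here `rerank` is a hypothetical helper, and `overlap_score` is a deliberately simple word-overlap stand-in for a cross-encoder, used only so the example runs without a model.

```python
from typing import Callable, List, Optional, Tuple


def rerank(query: str,
           chunks: List[str],
           score_fn: Callable[[str, str], float],
           top_n: Optional[int] = None) -> List[Tuple[str, float]]:
    """Score each (query, chunk) pair jointly and sort by descending score."""
    scored = [(chunk, score_fn(query, chunk)) for chunk in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_n is None else scored[:top_n]


def overlap_score(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)


retrieved = [
    "the weather is nice",
    "a vector database acts as memory",
    "a database stores rows",
]
reranked = rerank("vector database memory", retrieved, overlap_score, top_n=2)
```

Because the scorer runs once per retrieved chunk, re-ranking is usually applied only to the small candidate list from step 6, not to the whole database.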
#8) Generate the final response