sharpbyte.dev

RAG system workflow

Chunk, embed, store, query, retrieve, re-rank, and generate—end-to-end pipeline anatomy.

Consequently, the LLM can easily incorporate this information while generating text because it now has the relevant details available in the prompt. Now that we understand the purpose, let's get into the technical details.

Workflow of a RAG system

To build a RAG system, it's crucial to understand the foundational components that go into it and how they interact. In this section, let's explore each element in detail. Here's an architecture diagram of a typical RAG setup:

Architecture of a typical RAG stack: ingest, index, retrieve, then generate.
High-level flow from external knowledge to grounded model output.

Let's break it down step by step. We start with some external knowledge that the LLM did not see during training and that we want to augment it with:

#1) Create chunks

The first step is to break this additional knowledge into chunks before embedding and storing it in the vector database. We do this because the additional document(s) can be pretty large, and each piece of text must fit within the input size of the embedding model. Moreover, if we don't chunk, the entire document will have a single embedding, which is too coarse to retrieve the specific passages relevant to a query.
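To make the chunking step concrete, here is a minimal sketch of one simple strategy: fixed-size character windows with a small overlap. The function name `chunk_text`, the `chunk_size` and `overlap` parameters, and the use of overlap at all are assumptions made for illustration; they are not prescribed by the workflow above, and real pipelines often split on sentences or tokens instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters, so a sentence cut at a
    boundary still appears whole in at least one neighbouring chunk.
    (Hypothetical helper; the sizes are illustrative, not recommendations.)
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Tiny example so the windows are easy to inspect by eye.
print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
# -> ['abcd', 'defg', 'ghij']
```

Each returned chunk is short enough to embed on its own, which is what keeps retrieval granular in the later steps.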

#2) Generate embeddings

After chunking, we embed the chunks using an embedding model.

Step 1 — chunk documents so each piece fits the embedding model and retrieval stays granular.
Step 2 — embed chunks with a context or bi-encoder-style embedding model.
Step 3 — store vectors plus metadata/payload in the vector database as the app’s memory.

Since these are “context embedding models” (not word embedding models), models like bi-encoders are highly relevant here.
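The interface of the embedding step can be sketched without any external library. The `toy_embed` function below is a hypothetical stand-in, not a context embedding model: it hashes words into a fixed-size vector and ignores word order and meaning. A real system would call a bi-encoder here instead. What the sketch does show is the contract the rest of the pipeline depends on: text goes in, a fixed-length normalized vector comes out, and the same text always maps to the same vector.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Map text to a unit-length vector of size `dim`.

    ASSUMPTION: this is a hashed bag-of-words toy used only to illustrate the
    interface. It is NOT a context embedding model and captures no semantics.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        # md5 gives a hash that is stable across runs (unlike built-in hash()).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        vec[index] += 1.0

    norm = math.sqrt(sum(v * v for v in vec))
    # Leave the all-zero vector as is for empty input.
    return [v / norm for v in vec] if norm else vec


chunks = ["A vector database stores embeddings.", "Chunks are embedded one by one."]
chunk_vectors = [toy_embed(chunk) for chunk in chunks]
print(len(chunk_vectors), len(chunk_vectors[0]))
```

Swapping in a real model would only change the body of the embedding function; every chunk still yields exactly one vector.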

#3) Store embeddings in a vector database

These embeddings are then stored in the vector database. The vector database thus acts as the memory of your RAG application, since this is precisely where we store all the additional knowledge used to answer the user's query. Along with the vector embeddings, a vector database also stores the metadata and original content. With that, our vector database has been created and populated. More information can be added to it later if needed. Now, we move to the query step.

#4) User input query

Step 4 — the live user query enters the retrieval stage.
Step 5 — embed the query with the same model used for document chunks.

Next, the user inputs a query, a string representing the information they're seeking.

#5) Embed the query

This query is transformed into a vector using the same embedding model we used to embed the chunks earlier in Step 2.

#6) Retrieve similar chunks

The vectorized query is then compared against the existing vectors in the database to find the most similar information. The vector database returns the k most similar chunks, where k is a pre-defined parameter, typically using approximate nearest neighbor (ANN) search.

Step 6 — ANN search returns the top-k most similar chunks for the prompt.
Step 7 (optional) — cross-encoder re-ranking to reorder chunks by finer relevance.
Step 8 — the LLM reads query + retrieved context and writes the final answer.

It is expected that these retrieved documents contain information related to the query, providing a basis for the final response generation.
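The retrieval step can be sketched as a top-k search by cosine similarity. Note the swap: the sketch below performs an exact, brute-force comparison against every stored vector, whereas a vector database would normally use an approximate nearest neighbor index to avoid scanning everything. The function names and the `(vector, text)` pair layout are assumptions for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float],
          indexed_chunks: list[tuple[list[float], str]],
          k: int = 3) -> list[tuple[float, str]]:
    """Return the k (score, text) pairs most similar to the query vector.

    Exact brute-force scan; an ANN index would trade a little accuracy
    for much faster lookups on large collections.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for vec, text in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


index = [([1.0, 0.0], "chunk A"), ([0.0, 1.0], "chunk B"), ([1.0, 1.0], "chunk C")]
for score, text in top_k([1.0, 0.0], index, k=2):
    print(f"{score:.3f}  {text}")
```

In this toy index, the query vector points in the same direction as chunk A, so chunk A ranks first and the diagonal chunk C second.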

#7) Re-rank the chunks

After retrieval, the selected chunks might need further refinement to ensure the most relevant information is prioritized. In this re-ranking step, a more sophisticated model (often a cross-encoder) evaluates each retrieved chunk together with the query and assigns it a relevance score. The chunks are then reordered so that the most relevant ones are prioritized for response generation. That said, not every RAG app implements this step; many rely solely on the similarity scores obtained in Step 6 when retrieving context from the vector database.
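The shape of the re-ranking step can be sketched as "score each (query, chunk) pair jointly, then sort". In the sketch below, `overlap_score` is a hypothetical stand-in based on word overlap; it is not a cross-encoder and is far weaker than one. The point is the `score_fn` parameter, which is where a real cross-encoder's scoring call would be plugged in.

```python
from typing import Callable, Optional


def overlap_score(query: str, chunk: str) -> float:
    """Jaccard overlap between query words and chunk words.

    ASSUMPTION: a toy relevance score used only so the example runs;
    a real re-ranker would score the pair with a cross-encoder model.
    """
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    if not q or not c:
        return 0.0
    return len(q & c) / len(q | c)


def rerank(query: str,
           chunks: list[str],
           score_fn: Callable[[str, str], float] = overlap_score,
           top_n: Optional[int] = None) -> list[str]:
    """Reorder retrieved chunks by score_fn(query, chunk), best first."""
    ordered = sorted(chunks, key=lambda chunk: score_fn(query, chunk), reverse=True)
    return ordered if top_n is None else ordered[:top_n]


retrieved = [
    "cats sleep a lot",
    "a vector database acts as memory",
    "database",
]
print(rerank("vector database memory", retrieved, top_n=2))
# -> ['a vector database acts as memory', 'database']
```

Unlike the retrieval step, the scorer here sees the query and the chunk together, which is what lets a stronger model judge relevance more finely than vector similarity alone.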

#8) Generate the final response

Finally, the user's query and the retrieved (and optionally re-ranked) chunks are combined into a single prompt, and the LLM reads this prompt to write the final response.

End-to-end picture before the dedicated chunking strategies section.
Why chunking is always step one: size limits, embedding quality, and retrieval precision.
Prompt assembly: user question plus selected chunks fed to the generator.
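Prompt assembly, where the user question and the selected chunks are fed to the generator, can be sketched as plain string formatting. The `build_prompt` name, the instruction wording, and the numbered-context layout are assumptions for illustration; the call to the LLM itself is left out because it depends on the provider you use.

```python
def build_prompt(query: str, chunks: list[str]) -> str:
    """Combine the user question and retrieved chunks into one prompt string.

    Hypothetical template: chunks are numbered so the answer can refer to them.
    """
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


prompt = build_prompt(
    "What does the vector database store?",
    ["A vector database stores embeddings.", "It also stores metadata and original content."],
)
print(prompt)
```

The resulting string is what gets sent to the model, so the chunks chosen in the retrieval and re-ranking steps directly determine what evidence the answer can draw on.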

Key takeaways

  • Indexing embeds manageable chunks; querying reuses the same encoder for symmetric search.
  • Re-ranking is optional but sharpens which evidence reaches the generator.