
How does RAG (Retrieval Augmented Generation) work?

Info

Want to know what RAG (Retrieval Augmented Generation) is? Read this article.

Retrieval Augmented Generation (RAG) is a way to give LLMs access to external knowledge. RAG works by retrieving relevant information and then passing it to the LLM together with the user's query. But how exactly?

RAG Pipeline

Here's how RAG works in detail:

[Figure: RAG-Pipeline-1]

First, your data is preprocessed, usually by splitting it into smaller chunks. Each chunk is then converted into an embedding and stored in a vector database.
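The indexing step can be sketched in a few lines of Python. This is a minimal illustration, not a production setup: `chunk_text`, `toy_embed`, and `vector_store` are hypothetical names, the "embedding" is a hashed bag-of-words that stands in for a real embedding model, and a plain list stands in for the vector database.

```python
import hashlib
import math


def chunk_text(text, chunk_size=8):
    """Split text into chunks of at most `chunk_size` words each."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def toy_embed(text, dim=64):
    """Toy stand-in for an embedding model (an assumption, not a real model).

    Hashes each word into one of `dim` buckets and L2-normalizes the counts.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


# A plain list stands in for the vector database in this sketch.
document = (
    "RAG retrieves relevant information and passes it to the LLM. "
    "Documents are chunked, embedded, and stored in a vector database."
)
vector_store = [
    {"text": chunk, "embedding": toy_embed(chunk)}
    for chunk in chunk_text(document)
]
```

A real system would likely chunk by sentences or tokens rather than a fixed word count, and would call an actual embedding model, but the shape of the data (chunk text paired with a vector) stays the same.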

[Figure: RAG-Pipeline-2]

When a user submits a query, the system converts the query into an embedding. This embedding is then used to search the vector database for the most similar embeddings.
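To make "most similar" concrete, here is a small sketch that ranks stored chunks by cosine similarity, one common choice of similarity measure (the article does not say which one is used). The three-dimensional vectors and example sentences are made up for illustration, and `cosine_similarity` and `top_k` are hypothetical helper names; a real vector database would typically use an approximate nearest-neighbor index instead of scanning every entry.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_embedding, store, k=2):
    """Return the k (score, text) pairs most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, item["embedding"]), item["text"])
        for item in store
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hand-written vectors standing in for real embeddings.
store = [
    {"text": "Cats sleep most of the day.", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Python is a programming language.", "embedding": [0.0, 0.2, 0.9]},
    {"text": "Kittens nap often.", "embedding": [0.8, 0.3, 0.1]},
]

# Pretend this is the embedding of "How much do cats sleep?"
query_embedding = [1.0, 0.2, 0.0]

results = top_k(query_embedding, store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The two cat-related chunks rank above the unrelated one because their vectors point in nearly the same direction as the query vector.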

[Figure: RAG-Pipeline-3]

The most relevant chunks are retrieved from the vector database and then passed to the LLM along with the query.

[Figure: RAG-Pipeline-4]

The LLM then generates a response based on the retrieved chunks.
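One common way to pass retrieved chunks to an LLM is to paste them into the prompt ahead of the question. The sketch below shows that idea only; `build_prompt` is a hypothetical helper, the prompt wording and example chunks are made up, and the actual LLM call is left out because it depends on which model or client you use.

```python
def build_prompt(question, chunks):
    """Combine retrieved chunks and the user's question into one prompt."""
    context = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Chunks as they might come back from the vector database (made-up data).
retrieved_chunks = [
    "Cats sleep most of the day.",
    "Kittens nap often.",
]

prompt = build_prompt("How much do cats sleep?", retrieved_chunks)
print(prompt)
```

The resulting string is what would be sent to the LLM. Numbering the chunks is optional, but it makes it easier to ask the model to point to the chunk it relied on.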

Whole RAG Pipeline

[Figure: RAG-Pipeline-Complete]

Here's the whole process in one picture!

  1. Your documents are split into pieces and turned into embeddings
  2. When someone asks a question, it gets turned into an embedding too
  3. The system finds the pieces of your documents that best match the question
  4. The AI uses these pieces to give you a helpful answer

Implement your first RAG System

Now that you understand how RAG works, let's implement your first RAG system! Read the next tutorial to learn how to build a RAG system from scratch.