What is RAG (Retrieval-Augmented Generation)? A Comprehensive Explanation from Mechanisms to Implementation

RAG is a technology that allows large language models to generate accurate answers by referencing external data. This article comprehensively explains its mechanisms, benefits, implementation methods, and specific examples.



What is RAG (Retrieval-Augmented Generation)?

RAG (Retrieval-Augmented Generation) is a technique that dramatically improves the answer accuracy of Large Language Models (LLMs). It was proposed in 2020 by researchers at Meta AI (then Facebook AI Research) and is now widely adopted as an essential architecture for putting generative AI to practical use.

In short, RAG is “a method that, in response to a user’s question, first searches for relevant information from external data sources, combines it with the LLM’s input, and generates a more accurate and reliable answer.”

Conventional LLMs generate answers based solely on the knowledge contained in their training data, so they cannot accurately answer questions about events that occurred after training or about information specific to a particular organization. RAG is the key technology that solves this fundamental problem.

Background of RAG’s Growing Attention

With the spread of generative AI, companies and developers want to integrate LLMs into their services. However, using an LLM alone presents several problems:

The Hallucination Problem
An LLM may generate information that contradicts the facts and present it as if it were true. This problem is inherent in the model's mechanism of probabilistically predicting the most plausible next token.

Stale Knowledge
An LLM's training data has a cutoff date, and information after that date is not included. An LLM alone cannot keep up with rapidly changing markets or the latest news.

Lack of Organization-Specific Information
Users also want answers drawn from private information that is not included in the LLM's training data, such as a company's internal documents, FAQs, and product catalogs.

Lack of Transparency
It is difficult to trace the basis on which an LLM generated an answer. This problem is particularly serious in industries like finance and healthcare, where the basis for answers must be clearly stated.

RAG solves these challenges with the simple yet effective approach of “referencing external data.”

How RAG Works: Three Main Steps

The processing flow of RAG is broadly composed of three steps.

Step 1: Indexing (Data Preprocessing and Storage)

The first step in building a RAG system is to format the external data you want to reference into a form usable by the LLM and store it in a searchable database.

Specifically, the following processing is performed:

First, documents (PDFs, web pages, text files, database records, etc.) are split into small chunks (text fragments). This is done both to fit within the LLM's context window (the maximum input length, measured in tokens) and to allow the search to efficiently target only the most relevant passages.
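
As a concrete illustration, here is a minimal fixed-size chunker in Python. The chunk size and overlap values are arbitrary examples; real pipelines usually split along structural boundaries such as paragraphs or sections instead.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with a small overlap so that
    content cut at a boundary still appears intact in one chunk."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```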

Next, each chunk is vectorized. Vectorization converts the meaning of a text into an array of numbers (a vector), usually by means of an embedding model. The key property is that texts with similar meanings end up close together in vector space.

Finally, these vectors are stored in a vector database. Representative vector databases include Pinecone, Weaviate, Milvus, Chroma, and Qdrant.
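
To make Step 1 concrete, the sketch below indexes the chunks with Chroma, one of the open-source databases named above, using its built-in default embedding model. The file name handbook.txt is a hypothetical example.

```python
import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient() for disk storage
collection = client.create_collection("company_docs")

# Hypothetical source document, split with the chunker sketched above.
chunks = chunk_text(open("handbook.txt", encoding="utf-8").read())

# Chroma embeds each chunk with its default embedding model and stores the vectors.
collection.add(
    documents=chunks,
    ids=[f"chunk-{i}" for i in range(len(chunks))],
)
```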

Step 2: Retrieval (Search)

When a user enters a question, the RAG system performs the following processing:

First, the user’s question is also vectorized. Then, it searches the vector database for the document chunks whose vectors are closest to (i.e., semantically most relevant to) the question’s vector. This method is called “semantic search.”

Unlike traditional keyword search, semantic search is based not only on word matches but also on context and meaning similarity. For example, for the question “When is the salary transfer date?”, it can correctly search for documents with the expression “wage payment deadline.”

To improve search accuracy, a “hybrid search” approach that combines semantic search and keyword search is also widely used.
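
Continuing the Chroma sketch from Step 1, retrieval reduces to a single query call: the database embeds the question and returns the nearest chunks. (A hybrid setup would additionally run a keyword search and merge the two result lists; a sketch of that appears in the best-practices section below.)

```python
# Semantic search: the question is embedded and compared against stored chunk vectors.
results = collection.query(
    query_texts=["When is the salary transfer date?"],
    n_results=3,  # Top-K: how many chunks to retrieve
)
retrieved_chunks = results["documents"][0]  # chunks for the first (only) query
```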

Step 3: Generation

The content of the relevant documents retrieved from the search and the user’s question are combined and input as a prompt to the LLM. Based on this information, the LLM generates an accurate and contextually appropriate answer.

Here, the LLM summarizes and integrates the content of the external data into a natural-language response. The output can also cite which document each piece of information came from, making the basis for the answer explicit.
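
As an illustration of the generation step, the sketch below passes the retrieved chunks to OpenAI's chat API; the model name is only an example, and any chat-capable LLM can be substituted.

```python
from openai import OpenAI

llm = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
context = "\n\n".join(retrieved_chunks)

response = llm.chat.completions.create(
    model="gpt-4o-mini",  # example model name; substitute any chat model
    messages=[
        {"role": "system",
         "content": "Answer using only the reference information provided."},
        {"role": "user",
         "content": f"Reference information:\n{context}\n\n"
                    "Question: When is the salary transfer date?"},
    ],
)
print(response.choices[0].message.content)
```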

Technical Components Required for RAG Implementation

The following technical components are necessary to build a RAG system.

Embedding Model

A model that converts text into vectors. Representative examples include OpenAI's text-embedding models (text-embedding-ada-002 and its newer text-embedding-3 successors), Cohere's Embed, and open-source models from Sentence Transformers. The choice of embedding model directly affects search accuracy, so it is important to select one that fits the characteristics of your documents.

Vector Database

A database for storing vectorized data and performing similarity searches at high speed. Options range from cloud services like Pinecone to open-source solutions like Chroma and Milvus. Selection should be made considering data volume, search speed, cost, and ease of operation.

LLM (Large Language Model)

The LLM used to generate answers. There are many options, such as OpenAI’s GPT-4 and GPT-3.5, Anthropic’s Claude, Google’s Gemini, and Meta’s Llama (open source). The appropriate model is selected based on use case, cost, and privacy requirements.

Orchestration Framework

A framework that coordinates the various steps of RAG. Representative examples include LangChain, LlamaIndex (formerly GPT Index), and Haystack. They let you implement the entire pipeline efficiently, from document splitting and embedding through search, prompt construction, and LLM invocation.
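
As a rough illustration of what such a framework buys you, the sketch below compresses the whole pipeline into a few lines of LangChain. LangChain's package layout changes frequently, so treat this as a sketch assuming the current langchain-openai / langchain-chroma / langchain-text-splitters split, not a definitive recipe; handbook.txt is again a hypothetical file.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma

# Split a (hypothetical) document along paragraph/sentence boundaries.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(open("handbook.txt", encoding="utf-8").read())

# Embed and store, then expose the store as a Top-K retriever.
vectorstore = Chroma.from_texts(chunks, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Retrieve, build the prompt, and generate.
docs = retriever.invoke("When is the salary transfer date?")
context = "\n\n".join(doc.page_content for doc in docs)
answer = ChatOpenAI(model="gpt-4o-mini").invoke(
    f"Answer based on this reference information:\n{context}\n\n"
    "Question: When is the salary transfer date?"
)
print(answer.content)
```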

Benefits of RAG

Improved Answer Accuracy

Because answers are generated based on external data, more accurate information can be provided than with an LLM alone. It can also handle the latest information and organization-specific information.

Suppression of Hallucinations

Because the LLM's answers are explicitly grounded in external data, the risk of generating factually incorrect answers is significantly reduced.

Transparency and Reliability

Because the source of the answer can be presented, it is easier for users to judge the reliability of the information. This is particularly valuable in industries like finance, law, and healthcare, where stating the basis is important.

Maintaining Knowledge Freshness

Simply updating the content of the external database keeps the RAG system's knowledge current. No retraining (fine-tuning) of the LLM is required.
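
In the Chroma sketch from earlier, for instance, a knowledge update is just an ordinary database write; the ID and document text below are made up for illustration.

```python
# Overwrite or add a chunk — no model retraining involved.
collection.upsert(
    ids=["chunk-policy-update"],
    documents=["(Illustrative text) Salaries are transferred on the 25th of each month."],
)
```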

Cost Efficiency

Since it is not necessary to train an LLM from scratch or fine-tune it, a high-quality AI system can be built at a relatively low cost.

Privacy Protection

Because internal data is never folded into the LLM's training data and is used only as reference information at query time, the risk of confidential information leaking into an external model can be reduced.

Drawbacks and Challenges of RAG

Dependence on Search Quality

The quality of RAG answers largely depends on the quality of information retrieved in the search step. If search accuracy is low, the LLM will generate answers based on inappropriate information, degrading the overall quality of the RAG system.

Document Preprocessing is Required

To build an effective RAG system, proper chunking of documents is crucial. If chunks are too large, irrelevant information is mixed in; if too small, context is lost. Developing an optimal splitting strategy requires effort.

Increased Latency

In addition to calling the LLM, the vector search step is added, which tends to increase response time. This point needs to be considered for applications requiring real-time performance.

Infrastructure Complexity

The overall system architecture becomes complex, involving the operation and management of the vector database, the selection and operation of the embedding model, and integration with the LLM.

Cost Considerations

Costs accumulate from the cloud usage fees for the vector database, the cost of generating embeddings, and the API usage fees for the LLM, making cost management important for large-scale operations.

RAG Implementation Examples and Use Cases

Enterprise Chatbot (Internal Knowledge Base)

Utilizing a company’s internal documents (internal regulations, FAQs, manuals, meeting minutes, etc.) as RAG data sources, a chatbot is built that allows employees to search for internal information using natural language. This is effective for supporting new employee training and automating IT help desks.

Customer Support Automation

Using product manuals, past inquiry histories, and knowledge bases as RAG data sources, it automatically generates accurate answers to customer questions. This enables faster response times and reduces the workload on operators.

Legacy Code Explanation System

Utilizing a large legacy codebase as a RAG data source, a system is built where developers can ask questions like “What does this function do?” or “Where is this variable used?” in natural language.

Market Research and Competitive Analysis

A system that uses news articles, press releases, and industry reports as RAG data sources to automatically analyze industry trends and competitor movements.

Legal and Compliance Research

Using laws, court precedents, and internal policies as RAG data sources, a system is built that allows legal and compliance departments to quickly look up relevant information.

Best Practices to Improve RAG Accuracy

Optimization of Chunk Splitting

Select an appropriate chunk size and splitting method according to the nature of the document. Generally, chunks of about 200 to 1000 tokens are used, but splitting that considers the document structure (sections, paragraphs, tables, etc.) is effective. Additionally, assigning metadata (document name, creation date, author, etc.) to chunks contributes to improved search accuracy.
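
Continuing the Chroma sketch, metadata is attached at indexing time and can then be used to filter searches; the field names and values below are illustrative.

```python
# Index chunks together with metadata describing their origin.
collection.add(
    documents=chunks,
    ids=[f"handbook-{i}" for i in range(len(chunks))],
    metadatas=[{"source": "handbook.txt", "section": "payroll"} for _ in chunks],
)

# Metadata can then constrain the search to a relevant subset.
results = collection.query(
    query_texts=["wage payment deadline"],
    n_results=3,
    where={"section": "payroll"},
)
```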

Utilizing Hybrid Search

Combining semantic search and keyword search lets you enjoy the benefits of both. Running a keyword search in conjunction is particularly effective for queries containing proper nouns or technical terms.
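
One common way to merge the two result lists is reciprocal rank fusion (RRF), which needs no score calibration between the two search systems. A minimal pure-Python version, with hypothetical result lists:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into a single ranking.
    k=60 is the constant conventionally used with RRF."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic_ids = ["chunk-3", "chunk-7", "chunk-1"]  # from vector search
keyword_ids = ["chunk-7", "chunk-9", "chunk-3"]   # from keyword (BM25) search
fused = reciprocal_rank_fusion([semantic_ids, keyword_ids])
# -> chunks appearing high in both lists ("chunk-7", "chunk-3") rank first
```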

Introduction of Re-ranking

After retrieving search results, using a re-ranking model like a cross-encoder to re-sort them by relevance further improves the quality of the context passed to the LLM.
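
A minimal re-ranking sketch using the CrossEncoder class from the sentence-transformers library; the model name is one commonly used public checkpoint, and retrieved_chunks is assumed to come from the retrieval step above.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "When is the salary transfer date?"

# A cross-encoder scores each (query, chunk) pair jointly — slower than the
# bi-encoder used for the initial vector search, but more precise.
scores = reranker.predict([(query, chunk) for chunk in retrieved_chunks])

# Reorder the chunks by relevance score, highest first.
reranked = [chunk for _, chunk in
            sorted(zip(scores, retrieved_chunks), reverse=True)]
```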

Prompt Engineering

Crafting the prompt to the LLM can significantly improve answer quality. For example, adding instructions like “Please answer based on the following reference information. If the information is not mentioned in the reference information, please answer ‘Information not found,’” can suppress hallucinations.
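
A simple prompt template of this kind might look as follows; the exact wording is, of course, just one example.

```python
RAG_PROMPT = """Answer the question using only the reference information below.
If the answer is not contained in the reference information, reply
"Information not found" instead of guessing.

Reference information:
{context}

Question: {question}"""

prompt = RAG_PROMPT.format(
    context="\n\n".join(retrieved_chunks),
    question="When is the salary transfer date?",
)
```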

Tuning the Number of Retrieved Documents (Top-K)

It is also important to appropriately set the number of relevant documents retrieved in the search step (Top-K). Too many introduce noise; too few lead to insufficient information. Generally, start with about 3 to 5 and adjust based on actual answer quality.

Comparison of RAG with Other Approaches

RAG vs. Fine-tuning

Fine-tuning is a method that re-trains the parameters of the LLM itself for specific documents or tasks. In contrast, RAG is a mechanism that references external data without changing the LLM’s parameters.

Fine-tuning is suited to optimizing for specific writing styles or output formats, but updating its knowledge requires retraining. RAG allows knowledge to be updated easily and can also make sources explicit. A hybrid approach combining the two is also effective.

RAG vs. Prompt Engineering (In-Context Learning)

The technique of including reference information directly in an LLM's input context (in-context learning) is the core idea on which RAG builds. RAG, however, refers to the entire framework that systematizes and automates this technique, making it applicable to large-scale data sources.

Conclusion

RAG (Retrieval-Augmented Generation) is a technology that effectively solves major challenges in the practical application of generative AI, such as the limitations of LLM knowledge, the hallucination problem, and privacy concerns. By dynamically referencing external data, it enables the generation of accurate, transparent, and up-to-date answers.

Its use cases range from leveraging enterprise knowledge bases and customer support to development support. RAG technology is expected to continue evolving rapidly, and it is an important concept that all organizations seeking to apply generative AI to real business should understand.

Frequently Asked Questions

What is the difference between RAG and a standard LLM?
A standard LLM generates answers based solely on its training data, whereas RAG dynamically searches an external database for relevant information and uses it during answer generation. This allows it to answer questions about up-to-date or organization-specific information accurately, and also makes it possible to cite sources.
What technical skills are required to implement RAG?
Python programming experience, experience with API integration, and basic knowledge of vector databases are recommended. Using frameworks like LangChain or LlamaIndex makes it relatively easy to create a prototype, but production deployment also requires skills in optimizing search accuracy and infrastructure management.
What file formats can be used as data sources for RAG?
Almost all formats are applicable, including PDF, Word, Excel, HTML, Markdown, text files, and database records. However, for images or scanned documents, OCR processing must be performed in advance. It is important to implement text extraction processing appropriately for each file format.
How much does it cost to build a RAG system?
It varies greatly depending on scale and requirements. A small-scale prototype using cloud services can be built from tens of thousands of yen per month, but large-scale production operations will incur costs for vector database usage, LLM API usage fees, and infrastructure costs. Utilizing open-source components can also significantly reduce costs.