Large Language Models (LLMs) have a data freshness problem. Even some of the most powerful models, like GPT-4, have no idea about recent events.
The world, according to LLMs, is frozen in time. They only know the world as it appeared through their training data.
That creates problems for any use case that relies on up-to-date information or a particular dataset. For example, you may have internal company documents you’d like to interact with via an LLM.
The first challenge is getting those documents in front of the LLM. We could try training the LLM on them, but this is time-consuming and expensive. And what happens when a new document is added? Training for every new document added is beyond inefficient; it is simply impossible.
So, how do we handle this problem? We can use retrieval augmentation. This technique allows us to retrieve relevant information from an external knowledge base and give that information to our LLM.
The external knowledge base is our “window” into the world beyond the LLM’s training data. In this chapter, we will learn all about implementing retrieval augmentation for LLMs using LangChain.
Creating the Knowledge Base
We have two primary types of knowledge for LLMs. Parametric knowledge refers to everything the LLM learned during training; it acts as a frozen snapshot of the world stored in the model's weights.
The second type of knowledge is source knowledge. This knowledge covers any information fed into the LLM via the input prompt. When we talk about retrieval augmentation, we’re talking about giving the LLM valuable source knowledge.
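To make the distinction concrete, here is a minimal sketch of the idea in Python. The retrieve and llm functions are hypothetical placeholders, not real LangChain APIs; the rest of the chapter builds the real pipeline.

def retrieve(query: str, top_k: int = 3) -> list[str]:
    # Placeholder for a knowledge base lookup (e.g. a vector search).
    return ["<relevant passage>"] * top_k

def llm(prompt: str) -> str:
    # Placeholder for a call to the LLM.
    return "<model completion>"

query = "What happened at the most recent World Cup final?"

# Parametric knowledge only: the model must answer from its training data.
answer_from_parametric = llm(query)

# Source knowledge: retrieved passages are injected directly into the prompt.
context = "\n".join(retrieve(query))
augmented_prompt = (
    "Answer the question using the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}"
)
answer_from_source = llm(augmented_prompt)

The only difference between the two calls is what goes into the prompt. Retrieval augmentation is simply about filling that prompt with the right source knowledge.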
(You can follow along with the following sections using the Jupyter notebook here!)
Getting Data for our Knowledge Base
To help our LLM, we need to give it access to relevant source knowledge. To do that, we need to create our knowledge base.
We start with a dataset. The dataset used naturally depends on the use case. It could be code documentation for an LLM that needs to help write code, company documents for an internal chatbot, or anything else.
In our example, we will use a subset of Wikipedia. To get that data, we will use the Hugging Face Datasets library like so:
In[2]:
from datasets import load_dataset
data = load_dataset("wikipedia", "20220301.simple", split='train[:10000]')
data
Out[2]:
Dataset({
    features: ['id', 'url', 'title', 'text'],
    num_rows: 10000
})
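Each record should carry the fields listed above, with the full article in the text field. As a quick sanity check, we can peek at a single record like so:

# Peek at one record to see what we are working with.
sample = data[0]
print(sample["title"])
print(sample["text"][:300])  # first few hundred characters of the article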