Retrieval Augmented Generation (RAG) is a method for improving the relevance and transparency of Large Language Model (LLM) responses. In this approach, the query is used to retrieve relevant documents from a database, and these documents are passed to the LLM as additional context. RAG therefore improves the relevance of responses by including pertinent information in the context, and it improves transparency by letting the LLM reference and cite its source documents.

While RAG is an increasingly popular method, it requires textual documents as references and cannot directly use the wealth of audio data that many organizations have or that is available online, like meeting recordings, lectures, webinars, and more.
In this tutorial, we will demonstrate how to perform Retrieval Augmented Generation with audio data by leveraging AssemblyAI’s document loader for LangChain, a popular framework that provides building blocks for LLM-based applications.
The source code for this tutorial can be found in this repo.
Getting started
To follow this tutorial, you’ll need an AssemblyAI API key. You can get one for free here if you don’t already have one. Additionally, we’ll be using GPT-3.5 for this tutorial, so you’ll need an OpenAI API key as well – sign up here if you don’t have one already.
Setting up the virtual environment
In a terminal, create a directory for this project and navigate into it:
mkdir ragaudio && cd ragaudio
Now, enter the following command to create a virtual environment called venv
python -m venv venv
Next, activate the environment. If you’re on macOS/Linux, enter
source ./venv/bin/activate
If you are on Windows, enter
.\venv\Scripts\activate.bat
Next, install the libraries we’ll need for this tutorial:
pip install assemblyai langchain openai python-dotenv chromadb sentence-transformers
Setting up the environment file
We’ll use python-dotenv to load environment variables for our project. In order to do so, we’ll need to store our environment variables in a file that the package can read. In your project directory, create a file called .env and paste your API keys as the values for the corresponding environment variables:
ASSEMBLYAI_API_KEY=your-key-here
OPENAI_API_KEY=your-key-here
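If you want to quickly verify that python-dotenv can read this file, an optional check looks something like the following (the sanity_check.py filename is just a hypothetical example and not part of the tutorial – load_dotenv() is the same call we’ll use in main.py below):
# sanity_check.py (optional, hypothetical helper)
import os

from dotenv import load_dotenv

# load_dotenv() reads the .env file in the current directory and exports
# its entries as environment variables for this process
load_dotenv()

# LangChain's ChatOpenAI and the AssemblyAI loader look for these variables by default
for key in ("ASSEMBLYAI_API_KEY", "OPENAI_API_KEY"):
    print(f"{key} set: {os.getenv(key) is not None}")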
Note: it is extremely important not to share this file or check it into source control. Anybody who has access to these keys can use your respective accounts, so it is a good idea to create a .gitignore file to make sure that this does not accidentally happen. Additionally, we can exclude the virtual environment from source control. The .gitignore file should therefore contain the following lines:
.env
venv
Writing the application
Now we’re ready to write the application. Overall, the application will work as follows:
- First, we will load our documents using AssemblyAI’s document loader for LangChain
- Next, we will split these documents into smaller chunks that can be retrieved by LangChain during RAG
- We will then embed these chunks with HuggingFace, yielding one vector for each chunk
- After that, we will store these embeddings in a Chroma vector database
- Finally, we will use LangChain’s built-in QA chain to write a simple loop that lets us query OpenAI’s GPT-3.5. Each answer will be shown along with the texts retrieved for RAG when generating it (a condensed sketch of this whole pipeline follows this list)
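To make the plan concrete before building it up step by step, here is a condensed sketch of the full pipeline. The webinar URL, chunk sizes, and example question are illustrative assumptions, and the imports are the same ones introduced in the next section:
# Condensed preview of the pipeline this tutorial builds (illustrative values)
from dotenv import load_dotenv
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import AssemblyAIAudioTranscriptLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

load_dotenv()

# 1. Load: transcribe an audio file (local path or URL) into LangChain documents
docs = AssemblyAIAudioTranscriptLoader(file_path="https://example.com/webinar.mp3").load()

# 2. Split: break the transcript into smaller, retrievable chunks
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# 3. Embed and 4. Store: one vector per chunk, kept in a Chroma vector database
db = Chroma.from_documents(chunks, HuggingFaceEmbeddings())

# 5. Query: retrieve relevant chunks and have GPT-3.5 answer, returning the sources too
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    retriever=db.as_retriever(),
    return_source_documents=True,
)
result = qa({"query": "What topics does the webinar cover?"})
print(result["result"])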
We can see this process pictorially in the diagram below:
Imports and environment variables
Open a new file in your project directory called main.py. First, we’ll add all imports required for this project. Additionally, we’ll use the load_dotenv() function in order to load the contents of our .env file as environment variables.
from dotenv import load_dotenv
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import AssemblyAIAudioTranscriptLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
load_dotenv()
Loading the documents
The AssemblyAIAudioTranscriptLoader allows us to load audio files, local or remote, into LangChain applications. In this example, we will use several audio files from LangChain’s webinar series on YouTube, but feel free