I’ve been reading more and more about LLM-based applications recently, itching
to build something useful as a learning experience. In this post, I want to
share a Retrieval Augmented Generation (RAG) system I’ve built in 100% Go and
some insights I learned along the way.
Some limitations of current LLMs
Let’s take OpenAI’s API as an example; for your hard-earned dollars, it gives
you access to powerful LLMs and related models. These LLMs have some
limitations as a general knowledge system [1]:
- They have a training cutoff date somewhere in the past; recently, OpenAI
moved the cutoff of their GPT models from 2021 to April 2023, but it’s
still not real-time.
- Even if LLMs develop more real-time training, they still only have
access to public data. They aren’t familiar with your internal documents,
which you may want to use them on.
- You pay per token, which is about 3/4 of a word; there can be different
pricing for input tokens and output tokens. The prices are low if you’re
only experimenting, but can grow fast if you’re working at scale. This may
limit how many tokens you want an LLM to crunch for you in each request.
Retrieval Augmented Generation
One of the most popular emerging techniques to address these limitations
is Retrieval Augmented Generation (RAG). Here’s a useful diagram borrowed
from a GCP blog post:
The idea is:
- We want the LLM to “ingest” a large body of text it wasn’t trained on, and
then chat with it about that text
- Even if the full body of text fits the LLM’s context window, this may be
too expensive for each query [2]
- Therefore, we’ll run a separate information retrieval stage, finding
the most relevant information for our query
- Finally, we’ll add this information as the context for our query and chat
with the LLM about it (see the sketch right after this list)
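To make the flow concrete, here’s a minimal Go sketch of the retrieve–augment–generate loop. The functions retrieveRelevantChunks and callLLM are hypothetical stand-ins I’m using for illustration (they’re not real APIs), but the prompt-assembly step in the middle is the essence of RAG:

```go
package main

import (
	"fmt"
	"strings"
)

// retrieveRelevantChunks is a hypothetical stand-in for the retrieval stage.
// A real system would rank stored text chunks by relevance to the question
// (e.g. with embeddings) and return the top few.
func retrieveRelevantChunks(question string) []string {
	return []string{"Chunk of internal documentation relevant to the question."}
}

// callLLM is a hypothetical stand-in for sending a prompt to an LLM API and
// returning its completion.
func callLLM(prompt string) string {
	return "LLM answer based on the provided context."
}

func main() {
	question := "How do I configure feature X?"

	// Retrieval: find the most relevant pieces of text for the question.
	contextChunks := retrieveRelevantChunks(question)

	// Augmentation: prepend the retrieved context to the question.
	prompt := fmt.Sprintf(
		"Use the following context to answer the question.\nContext:\n%s\n\nQuestion: %s",
		strings.Join(contextChunks, "\n"), question)

	// Generation: ask the LLM to answer using the augmented prompt.
	fmt.Println(callLLM(prompt))
}
```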
The third step in the list above is the trickiest part – finding the most
“relevant” information is difficult in the general case. Are we supposed to
build a search engine? Well, that would be one approach! Powerful full-text
search engines exist and could be helpful here, but there’s a better way
using embeddings. Read on to see how it works.
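As a quick preview before the details: the core operation embeddings enable is scoring how similar two pieces of text are by comparing their vectors, commonly with cosine similarity. Here’s a minimal, self-contained sketch of that scoring step (the tiny 3-dimensional vectors are toy values; real embeddings have hundreds or thousands of dimensions):

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity returns the cosine of the angle between two vectors;
// values closer to 1 mean the vectors (and the texts they embed) are more
// similar to each other.
func cosineSimilarity(a, b []float32) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		normA += float64(a[i]) * float64(a[i])
		normB += float64(b[i]) * float64(b[i])
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	// Toy "embeddings" for a query and two candidate chunks.
	query := []float32{0.1, 0.9, 0.2}
	chunkA := []float32{0.1, 0.8, 0.3}
	chunkB := []float32{0.9, 0.1, 0.0}

	fmt.Println("similarity to chunk A:", cosineSimilarity(query, chunkA))
	fmt.Println("similarity to chunk B:", cosineSimilarity(query, chunkB))
}
```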
Implementing RAG in Go
In the course of my research on the subject, I wrote a bunch of Python code to
perform RAG, and then ported it to Go. It was easier to find Python samples
online, but once everything clicked in my head, porting to Go was trivial.
This process led me to the following observation:
LLM-based applications like RAG are a data-pipeline task, not
a machine-learning task.
What I mean by this is that the application doesn’t crunch matrices, doesn’t
explore the best loss function or gradient update, doesn’t
train and evaluate models. It simply hooks textual tools together; LLMs
are one such textual tool, embeddings are another. Therefore, Go is very well
suited for such applications! Go is much faster than Python, just as capable
with text processing, and its easy concurrency is helpful for applications that
spend a long time waiting for network I/O.
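To illustrate that last point, here’s a minimal sketch of overlapping many slow, network-bound calls with goroutines. The embedChunk function is a hypothetical placeholder for an embeddings API request; the point is that the waits happen concurrently rather than one after another:

```go
package main

import (
	"fmt"
	"sync"
)

// embedChunk is a hypothetical stand-in for a network call to an embeddings
// API; in a real pipeline this is where most wall-clock time is spent.
func embedChunk(chunk string) []float32 {
	return []float32{float32(len(chunk))} // placeholder "embedding"
}

func main() {
	chunks := []string{"first chunk", "second chunk", "third chunk"}
	embeddings := make([][]float32, len(chunks))

	// Launch one goroutine per chunk so the network waits overlap.
	var wg sync.WaitGroup
	for i, c := range chunks {
		wg.Add(1)
		go func(i int, c string) {
			defer wg.Done()
			embeddings[i] = embedChunk(c) // each goroutine writes its own slot
		}(i, c)
	}
	wg.Wait()

	fmt.Println("embedded", len(embeddings), "chunks")
}
```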
The motivating problem
When I started hacking on thi