In recent months, it’s been hard to miss all the news about Large Language Models and the rapidly developing set of technologies around them. Although proprietary, closed-source models like GPT-4 have drawn a lot of attention, there has also been an explosion in open-source models, libraries, and tools. With all these developments, it can be hard to see how all the pieces fit together. One of the best ways to learn is by example, so let’s set ourselves a goal and see what it takes to accomplish it. We’ll summarize the technology and key ideas we use along the way. Whether you’re a language model newbie or a seasoned veteran, hopefully you can learn something as we go. Ready? Let’s dive in!
The Goal
Let’s set a well-defined goal for ourselves: building a tool that can summarize information into a shorter representation. Summarization is a broad topic: a model that’s good at summarizing news stories has different requirements from one that’s good at summarizing academic papers or software documentation. Rather than focusing on a specific domain, let’s create a tool that can be used for a variety of summarization tasks, while allowing us to invest computing power to make it work better in a given subdomain.
Let’s set a few more criteria. Our tool should:
- Be able to draw on a variety of kinds of data to improve performance on a specific subdomain of summarization
- Run on our own devices (including cloud VMs that we provision ourselves)
- Allow us to experiment using only a single machine
- Put us on the path to scale up to a cluster when we’re ready
- Be capable of leveraging state-of-the-art models for a given set of compute constraints
- Make it easy to experiment with different configurations so we can search for the right setup for a given domain
- Enable us to export our resulting model for usage in a production setting
Sound intimidating? You might be surprised how far we can get if we know where to look!
Fine-tuning
Given our goal of achieving good performance on a specific subdomain, a few options might come to mind. We could:
- Train our own model from scratch
- Use an existing model “off the shelf”
- Take an existing model and “tweak” it a bit for our custom purposes
Training a near state-of-the-art model from scratch can be complex, time-consuming, and costly, so that option is likely off the table. Using an existing model “off the shelf” is far easier, but it might not perform as well on our specific subdomain. We might be able to mitigate that somewhat with clever prompting or by combining multiple models in ingenious ways, but let’s look at the third option. This option, referred to as “fine-tuning,” offers the best of both worlds: we can leverage an existing powerful model while still achieving solid performance on our desired task.
Even once we’ve decided to fine-tune, there are multiple choices for how we can perform the training:
- Make the entire model “flexible” during training, allowing it to explore the full parameter space that it did for its initial training
- Train a smaller number of parameters than were used in the original model
While it might seem that we need the first approach to achieve full flexibility, it turns out that the second can be far cheaper (in terms of time and resource costs) while being just as powerful. Training a smaller number of parameters is generally referred to as “Parameter-Efficient Fine-Tuning,” or “PEFT” for short.
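To make the idea concrete, here’s a minimal sketch in Python of what “training a smaller number of parameters” looks like in practice: freeze every weight in an existing model, then train only a small number of new parameters of our own. The model name and the tiny adapter below are purely illustrative (the adapter isn’t even wired into the model’s forward pass here); real PEFT methods, like the one we’ll look at next, integrate their extra parameters into the model’s layers.

```python
import torch
from transformers import AutoModelForSeq2SeqLM

# Load an existing summarization-capable model (the name is just an example).
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Freeze every parameter in the base model...
for param in model.parameters():
    param.requires_grad = False

# ...then define a small trainable module of our own. This toy "adapter" is
# only here to show the parameter counts; it is not attached to the model.
hidden_size = model.config.d_model
adapter = torch.nn.Sequential(
    torch.nn.Linear(hidden_size, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, hidden_size),
)

trainable = sum(p.numel() for p in adapter.parameters())
total = sum(p.numel() for p in model.parameters())
print(f"Trainable params: {trainable:,} of {total:,} "
      f"({100 * trainable / total:.3f}%)")
```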
LoRA
There are several mechanisms for PEFT, but one method that seems to achieve some of the best overall performance as of this writing is “Low-Rank Adaptation,” or LoRA. If you’d like a detailed description, here’s a great explainer. Or, if you’re academically inclined, you can go straight to the original paper on the technique.
Modern language models have many layers that perform different operations. Each one takes the output tensors of the previous layers and produces the output tensors for the layers that follow. Many (though not all) of these layers have one or more trainable matrices that control the specific transformation they apply. Considering just a single such layer with one trainable matrix W, we can think of fine-tuning as looking for a matrix 𝚫W that we can add to the original to get the weights for the final model: W’ = W + 𝚫W.
If we just looked to find 𝚫W directly, we’d have to use just as many parameters as were in the original layer. But if we instead define 𝚫W as the product of two smaller matrices, 𝚫W = A × B, we can potentially have far fewer parameters to learn. To see how the numbers work out, let’s say 𝚫W is an N×N matrix. Given the rules of matrix multiplication, A must have N rows, and B must have N columns. But we get to choose the number of columns in A and the number of rows in B as we see fit (so long as they match up!). So A is an N×r matrix and B is an r×N matrix. The number of parameters in 𝚫W is N², but the number of parameters in A and B combined is Nr + rN = 2Nr. By choosing an r that’s much less than N, we can reduce the number of parameters we need to learn significantly!
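To put concrete numbers on that, here’s a tiny sketch; the layer width N and rank r below are made-up values chosen only for illustration.

```python
# Parameter count for learning delta_W directly vs. learning A and B,
# where delta_W = A @ B, A is N x r, and B is r x N.
N = 4096   # hypothetical layer width
r = 8      # hypothetical LoRA rank

full = N * N          # parameters if we learn delta_W directly
lora = N * r + r * N  # parameters in A and B combined (2 * N * r)

print(f"Full:  {full:,} parameters")        # 16,777,216
print(f"LoRA:  {lora:,} parameters")        # 65,536
print(f"Ratio: {full / lora:.0f}x fewer")   # 256x fewer
```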
So why not just always choose r=1? Well, the smaller r is, the less “freedom” there is for what 𝚫W can look like (formally, 𝚫W can have rank at most r). So for very small values of r, we might not be able to capture the nuances of our problem domain. In practice, though, we can typically achieve significant reductions in learnable parameters without sacrificing performance on our target problem.
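If you prefer code to equations, here’s a minimal sketch of a LoRA-style linear layer in PyTorch. The initialization and scaling conventions are simplified relative to the original paper, and in practice libraries such as Hugging Face’s peft package handle these details (and decide which layers to adapt) for you; this is just to show the W’ = W + A × B idea in code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer W plus a trainable low-rank update A @ B."""

    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        # Freeze the original weights W (and bias) -- they stay fixed.
        for param in self.base.parameters():
            param.requires_grad = False
        in_features, out_features = base.in_features, base.out_features
        # A: in_features x r, B: r x out_features. B starts at zero, so the
        # wrapped layer initially behaves exactly like the original one.
        self.A = nn.Parameter(torch.randn(in_features, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original layer output plus the low-rank update: x @ A @ B.
        return self.base(x) + x @ self.A @ self.B

# Example: wrap a 512 -> 512 projection with a rank-8 update.
layer = LoRALinear(nn.Linear(512, 512), r=8)
y = layer(torch.randn(4, 512))
print(y.shape)  # torch.Size([4, 512])
```

Only A and B receive gradients during training, so the optimizer state and gradient memory scale with 2Nr rather than N².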
As one final aside in this technical section (no more math after this, I promise!), you could imagine that after tuning we might want to actual