Will Thompson
@July 23, 2023
We are in the midst of yet another AI Summer where the possible futures enumerated in the Press seem equally amazing and terrifying. LLMs are predicted both to create immeasurable wealth for society and to compete with (or deprecate?) knowledge workers. While bond markets are trying to read the tea leaves on future Fed rate hikes, equity markets are bullish on all things AI. Many companies are rapidly adopting some form of AI play in order to appease shareholder FOMO. A large percentage of YC cohort members are, unsurprisingly, generative AI startups now. All the “MAANG”s (whatever they are called these days) seem to be building some form of giant LLM now. It’s as though crypto was forgotten overnight; the public imagination appears singularly captivated with what possibilities AI may usher forth.
The madness of crowds aside, it is worth reflecting on what we concretely know about LLMs at this point in time and how these insights sparked the latest AI fervor. This will help put into perspective the relevance of current research efforts and the possibilities that abound.
When people say “Large Language Models”, they typically are referring to a type of deep learning architecture called a Transformer. Transformers are models that work with sequence data (e.g. text, images, time series, etc.) and are part of a larger family of models called Sequence Models. Many Sequence Models can also be thought of as Language Models, or models that learn a probability distribution over the next word/pixel/value in a sequence: $P(x_t \mid x_{t-1}, \dots, x_1)$.
What differentiates the Transformer from its predecessors is its ability to learn the contextual relationship of values within a sequence through a mechanism called (self-) Attention. Unlike the Recurrent Neural Network (RNN), where the arrow of time is preserved by processing each time step serially within a sequence, Transformers can read the entire sequence at once and learn to “pay attention to” only the values that came earlier in time (via “masking”). This allows for faster training times (i.e. the whole sequence is processed in parallel) and larger model parameter counts. Transformers were once considered “large” at ~100MM+ parameters; today, published models are ~500B-1T parameters in size. Anecdotally, several papers have reported a major inflection point in Transformer behavior around ~100B+ parameters. (Note: these models are generally too large to fit into a single GPU and require the model to be broken apart and distributed across multiple nodes).
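To make the masking idea concrete, here is a minimal numpy sketch (illustrative only, not taken from any particular implementation) of single-head causal self-attention: every position scores against every other position, and a triangular mask blocks attention to future positions before the softmax. All dimensions and weights are placeholders.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Minimal single-head self-attention with a causal mask.

    x: (seq_len, d_model) input embeddings
    w_q, w_k, w_v: (d_model, d_head) projection matrices
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # (seq_len, d_head) each
    scores = q @ k.T / np.sqrt(k.shape[-1])       # (seq_len, seq_len) pairwise scores

    # Causal mask: position i may only attend to positions <= i,
    # so future positions are set to -inf before the softmax.
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf

    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                              # (seq_len, d_head)

# Toy usage with random weights
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
w = lambda: rng.normal(size=(16, 8))
print(causal_self_attention(x, w(), w(), w()).shape)  # (5, 8)
```

Because the mask is applied to a score matrix computed for the whole sequence at once, training can process every position in parallel rather than stepping through time like an RNN.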
Transformers can be generally categorized into one of three categories: “encoder only” (a la BERT); “decoder only” (a la GPT); and having an “encoder-decoder” architecture (a la T5). Although all of these architectures can be rigged for a broad range of tasks (e.g. classification, translation, etc), encoders are thought to be useful for tasks where the entire sequence needs to be understood (such as sentiment classification), whereas decoders are thought to be useful for tasks where text needs to be completed (such as completing a sentence). Encoder-decoder architectures can be applied to a variety of problems, but are most famously associated with language translation.
Decoder-only Transformers such as ChatGPT & GPT-4 are the class of LLM that is ubiquitously referred to as “generative AI”.
Since the debut of the OG Transformer paper ~6 years ago, we’ve gleaned a couple of interesting properties about this class of models.
Generalization 🧠
We learned that the same trained LLM could figure out how to complete many different tasks after being shown only a few examples for each task; that is, LLMs are few-shot learners. This meant that whatever the LLM had learned about language in its (pre-)training task (which is usually predicting the next word in a sequence), it could transfer to new tasks without needing to be trained from scratch to do said task (and with only a handful of examples).
That is, we discovered LLMs’ capacity to generalize.
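As a toy illustration (not from the original post), a few-shot prompt for sentiment classification might look like the snippet below: a handful of labeled examples are placed inline, and the model is expected to continue the pattern for the new input without any weight updates.

```python
# Illustrative few-shot prompt: the model sees a few labeled examples inline
# and completes the pattern for the final, unlabeled input -- no fine-tuning.
few_shot_prompt = """\
Review: The food was cold and the service was slow. -> Sentiment: negative
Review: Absolutely loved the atmosphere and the staff. -> Sentiment: positive
Review: It was fine, nothing special. -> Sentiment: neutral
Review: The desserts alone are worth the trip! -> Sentiment:"""

# A capable LLM is expected to continue with " positive".
```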
Power Laws in Performance 🚨
We also learned that LLMs had predictable (power law) scaling behavior. With larger training datasets, models could scale up in parameter size and become more data efficient, ultimately leading to better performance on benchmarks.
Given a dataset size and chosen model size, we can (seemingly magically) predict the performance of the model prior to (pre-)training it.
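To illustrate what “predictable scaling” means in practice, the (pre-)training loss is often modeled as a power law in parameter count. The sketch below uses constants loosely in the spirit of published scaling-law fits, but treat them as illustrative placeholders rather than values from any particular paper.

```python
def predicted_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Toy power-law loss curve: loss ~ (N_c / N)^alpha.

    The constants are illustrative placeholders, not authoritative fits.
    """
    return (n_c / n_params) ** alpha

# Predicted loss falls smoothly (and predictably) as the model grows.
for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```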
Research Trends 📈
Given these observations, a large research trend in LLMs was training progressively larger and larger LLMs and measuring their performance on benchmarks (although some papers, such as the CLIP paper, call into question whether benchmark performance actually reflects generalizability, part of a nuanced observation called “The Cheating Hypothesis”). This required splitting these models across many GPUs/TPUs (i.e. model parallelism), due in large part to a model tweak provided by the Megatron paper, innovations in model/pipeline sharding, and packages such as DeepSpeed. Quantization also reduced the memory and computational footprint of these models. And since the traditional self-attention mechanism at the core of the Transformer is $O(n^2)$ in space and time complexity (with respect to sequence length $n$), research into faster mechanisms such as Flash Attention was naturally of considerable interest. Further, innovations like ALiBi allowed for variable context windows. This opened the door to larger context windows and is the reason why today’s LLMs have context windows as large as 100k tokens.
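As a back-of-the-envelope illustration of that quadratic cost (my own sketch, assuming fp16 scores at 2 bytes each), the attention score matrix for a single head alone grows with the square of the context length:

```python
# Rough memory for one head's (seq_len x seq_len) attention score matrix in fp16.
BYTES_PER_SCORE = 2  # fp16

for seq_len in (1_024, 8_192, 32_768, 100_000):
    gib = seq_len * seq_len * BYTES_PER_SCORE / 2**30
    print(f"context {seq_len:>7,} tokens -> ~{gib:8.2f} GiB per head, per layer")
```

At 100k tokens this naive matrix alone is ~18.6 GiB per head per layer, which is why avoiding its materialization (Flash Attention) and avoiding fixed positional limits (ALiBi) mattered for long contexts.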
And given the size of these (very large) LLMs, there was interest in how to fine-tune them to new problems more efficiently. Innovations in Parameter-Efficient Fine-Tuning (PEFT), such as Adapters and LoRA, allowed for faster fine-tuning since there are far fewer parameters to adjust in these paradigms. Combined with the advent of 4- and 8-bit quantization, it’s now even possible to fine-tune a model on CPU! (note: most models are trained using 16- or 32-bit floats)
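A minimal sketch of the LoRA idea (illustrative shapes, not any particular library’s API): freeze the pretrained weight W and learn only a low-rank update BA, so the number of trainable parameters drops from d_out×d_in to r×(d_in + d_out).

```python
import numpy as np

d_in, d_out, r = 1024, 1024, 8          # illustrative dimensions and rank
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight (never updated)
A = rng.normal(size=(r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                # zero-init so the update starts as a no-op

def lora_forward(x, scale=1.0):
    # Effective weight is W + scale * (B @ A); only A and B would receive gradients.
    return W @ x + scale * (B @ (A @ x))

x = rng.normal(size=(d_in,))
y = lora_forward(x)

full = W.size                    # parameters in the frozen layer
trainable = A.size + B.size      # parameters actually fine-tuned
print(f"trainable fraction: {trainable / full:.2%}")  # ~1.56% in this toy setup
```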
[Note: This is not a comprehensive overview of research trends. For instance, there was considerable research into other areas such as LLMs’ ability to regurgitate information, adversarial attacks, and domain-specific LLMs such as Codex (for writing code), as well as early-stage multimodal LLMs (i.e. Transformers that understand images, text, etc). Further, RETRO and webGPT showed that smaller LLMs could perform as well as larger models when paired with efficient querying/information retrieval.]
[And some of these papers like Flash Attention and LoRA came chronologically after the papers discussed in the next few sections].
Yet, a major breakthrough in our understanding of LLM behavior was made with the release of the instructGPT paper.
GPT-3 (particularly the large parameter kind) already demonstrated the ability to follow natural language instructions (i.e. “prompts”), although these instructions typically needed to be carefully worded to get a desired output.
Yet, the output tended to be regurgitated language found deep in the dark corners of the Internet.