OpenAI CEO Sam Altman (left) and Meta AI chief Yann LeCun (right) have differing views on the future of large language models.
In case you haven’t heard, artificial intelligence is the hot new thing.
Generative AI seems to be on the lips of every venture capitalist, entrepreneur, Fortune 500 CEO and journalist these days, from Silicon Valley to Davos.
To those who started paying real attention to AI in 2022, it may seem that technologies like ChatGPT and Stable Diffusion came out of nowhere to take the world by storm. They didn’t.
Back in 2020, we wrote an article in this column predicting that generative AI would be one of the pillars of the next generation of artificial intelligence.
Since at least the release of GPT-2 in 2019, it has been clear to those working in the field that generative language models were poised to unleash vast economic and societal transformation. Similarly, while text-to-image models only captured the public’s attention last summer, the technology’s ascendance has appeared inevitable since OpenAI released the original DALL-E in January 2021. (We wrote an article making this argument days after the release of the original DALL-E.)
By the same token, it is important to remember that the current state of the art in AI is far from an end state for AI’s capabilities. On the contrary, the frontiers of artificial intelligence have never advanced as rapidly as they are advancing right now. As amazing as ChatGPT seems to us at the moment, it is a mere stepping stone to what comes next.
What will the next generation of large language models (LLMs) look like? The answer to this question is already out there, under development at AI startups and research groups at this very moment.
This article highlights three emerging areas that will help define the next wave of innovation in generative AI and LLMs. For those looking to remain ahead of the curve in this fast-changing world—read on.
1) Models that can generate their own training data to improve themselves.
Consider how humans think and learn. We collect knowledge and perspective from external sources of information—say, by reading a book. But we also generate novel ideas and insights on our own, by reflecting on a topic or thinking through a problem in our minds. We are able to deepen our understanding of the world via internal reflection and analysis not directly tied to any new external input.
A new avenue of AI research seeks to enable large language models to do something analogous, effectively bootstrapping their own intelligence.
As part of their training, today’s LLMs ingest much of the world’s accumulated written information (e.g., Wikipedia, books, news articles). What if these models, once trained, could use all the knowledge that they have absorbed from these sources to produce new written content—and then use that content as additional training data in order to improve themselves? Initial work suggests that this approach may be possible—and powerful.
In one recent research effort, aptly titled “Large Language Models Can Self-Improve,” a group of Google researchers built an LLM that can come up with a set of questions, generate detailed answers to those questions, filter its own answers to keep only the highest-quality output, and then fine-tune itself on those curated answers. Remarkably, this leads to new state-of-the-art performance on various language tasks. For instance, the model’s performance improved from 74.2% to 82.1% on GSM8K and from 78.2% to 83.0% on DROP, two popular benchmarks used to evaluate LLM performance.
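To make the loop concrete, here is a minimal sketch in Python. The `generate` and `fine_tune` calls are hypothetical placeholders rather than the Google team’s actual code, and the majority-vote filter shown is just one simple way to implement the “keep only the best answers” step.

```python
# Minimal sketch of a generate-filter-fine-tune self-improvement loop.
# `generate` and `fine_tune` are hypothetical placeholders, not a real API.
from collections import Counter

def generate(prompt: str, n: int = 1) -> list[str]:
    """Hypothetical call that samples n completions from the model."""
    raise NotImplementedError

def self_improve(topics: list[str], samples_per_question: int = 8) -> list[dict]:
    training_examples = []
    for topic in topics:
        # 1. The model writes its own question about the topic.
        question = generate(f"Write a challenging question about {topic}.")[0]

        # 2. It samples several step-by-step answers to that question.
        answers = generate(f"Q: {question}\nAnswer step by step.",
                           n=samples_per_question)

        # 3. Filter: keep an answer whose final line (its conclusion) matches
        #    the majority conclusion across all sampled answers.
        conclusions = [(a.strip().splitlines() or [""])[-1] for a in answers]
        majority = Counter(conclusions).most_common(1)[0][0]
        best = next(a for a, c in zip(answers, conclusions) if c == majority)

        training_examples.append({"prompt": question, "completion": best})

    # 4. Fine-tune the model on its own curated answers, e.g.:
    # fine_tune(training_examples)   # hypothetical training helper
    return training_examples
```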
Another recent work builds on an important LLM method called “instruction fine-tuning,” which lies at the core of products like ChatGPT. Whereas ChatGPT and other instruction fine-tuned models rely on human-written instructions, this research group built a model that can generate its own natural language instructions and then fine-tune itself on those instructions. The performance gains are dramatic: this method improves the performance of the base GPT-3 model by 33%, nearly matching the performance of OpenAI’s own instruction-tuned model.
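As a rough illustration of the bootstrapping idea (not the researchers’ actual pipeline), the process can be sketched as a loop in which the model proposes new instructions, answers them, and keeps only the pairs that are not near-duplicates of what it already has. The `generate` stub and the string-similarity filter below are illustrative assumptions.

```python
# Illustrative sketch of instruction bootstrapping: the model writes new
# instructions for itself, answers them, and the resulting pairs become
# fine-tuning data. `generate` is a hypothetical placeholder, not a real API.
import difflib

def generate(prompt: str) -> str:
    """Hypothetical call that returns one completion from the model."""
    raise NotImplementedError

def bootstrap_instructions(seed_tasks: list[str], rounds: int = 100,
                           similarity_cutoff: float = 0.7) -> list[dict]:
    pool = list(seed_tasks)
    data = []
    for _ in range(rounds):
        # Show the model a few existing instructions and ask for a new one.
        examples = "\n".join(f"- {t}" for t in pool[-5:])
        new_task = generate(
            f"Here are some task instructions:\n{examples}\n"
            "Write one new, different task instruction:").strip()

        # Skip near-duplicates of instructions already in the pool
        # (a simple string-similarity heuristic, purely illustrative).
        if any(difflib.SequenceMatcher(None, new_task, t).ratio() > similarity_cutoff
               for t in pool):
            continue

        response = generate(f"Instruction: {new_task}\nResponse:")
        pool.append(new_task)
        data.append({"prompt": new_task, "completion": response})
    return data  # the base model is then fine-tuned on these pairs
```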
In a thematically related work, researchers from Google and Carnegie Mellon show that if a large language model, when presented with a question, first recites to itself what it knows about the topic before responding, it provides more accurate and sophisticated responses. This can be loosely analogized to a human in conversation who, rather than blurting out the first thing that comes to mind on a topic, searches her memory and reflects on her beliefs before sharing a perspective.
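The pattern is easy to picture as a two-step prompt: first ask the model to write down what it knows, then ask the question again with that recitation in context. The sketch below uses a hypothetical `generate` placeholder and is not the authors’ code.

```python
# Sketch of a recite-then-answer prompt pattern. `generate` is a
# hypothetical placeholder for a single LLM completion call.
def generate(prompt: str) -> str:
    raise NotImplementedError

def recite_then_answer(question: str) -> str:
    # Step 1: have the model recall what it knows about the topic.
    recitation = generate(
        f"Write down the relevant facts you know about: {question}")
    # Step 2: answer the question with the recitation in context.
    return generate(
        f"Facts:\n{recitation}\n\nUsing the facts above, answer: {question}")
```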
When people first hear about this line of research, a conceptual objection often arises—isn’t this all circular? How can a model produce data that the model can then consume to improve itself? If the new data came from the model in the first place, shouldn’t the “knowledge” or “signal” that it contains already be incorporated in the model?
This objection makes sense if we conceive of large language models as databases, storing information from their training data and reproducing it in different combinations when prompted. But—uncomfortable or even eerie as it may sound—we are better off instead conceiving of large language models along the lines of the human brain (no, the analogy is of course not perfect!).
We humans ingest a tremendous amount of data from the world that alters the neural connections in our brains in imponderable, innumerable ways. Through introspection, writing, conversation—sometimes just a good night’s sleep—our brains can then produce new insights that had not previously been in our minds nor in any information source out in the world. If we internalize these new insights, they can make us wiser.
The idea that LLMs can generate their own training data is particularly important in light of the fact that the world may soon run out of text training data. This is not yet a widely appreciated problem, but it is one that many AI researchers are worried about.
By one estimate, the world’s total stock of usable text data is between 4.6 trillion and 17.2 trillion tokens. This includes all the world’s books, all scientific papers, all news articles, all of Wikipedia, all publicly available code, and much of the rest of the internet, filtered for quality (e.g., webpages, blogs, social media). Another recent estimate puts the total figure at 3.2 trillion tokens.
DeepMind’s Chinchilla, one of today’s leading LLMs, was trained on 1.4 trillion tokens.
In other words, we may be well within one order of magnitude of exhausting the world’s entire supply of useful language training data.
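A quick back-of-the-envelope check, using only the figures quoted above, shows why:

```python
# Back-of-the-envelope comparison (all figures in trillions of tokens).
chinchilla_training = 1.4                  # tokens Chinchilla was trained on
low_estimate, high_estimate = 3.2, 17.2    # estimates of the world's usable text data

print(low_estimate / chinchilla_training)   # ~2.3x what Chinchilla already used
print(high_estimate / chinchilla_training)  # ~12.3x -- roughly one order of magnitude
```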
If large language models are able to generate their own training data and use it to continue self-improving, this could render irrelevant the looming data shortage. It would represent a mind-bending leap forward for LLMs.
2) Models that can fact-check themselves.
A popular narrative these days is that ChatGPT and conversational LLMs like it are on the verge of replacing Google Search as the world’s go-to source for information, disrupting the once-mighty tech giant just as Blockbuster and Kodak were disrupted before it.
This narrative badly oversimplifies things. LLMs as they exist today will never replace Google Search. Why not? In short, because today’s LLMs make stuff up.
As powerful as they are, large language models regularly produce inaccurate, misleading or false information (and present it confidently and convincingly).
Examples abound of ChatGPT’s “hallucinations” (as these misstatements are referred to). This is not to single out ChatGPT; every generative language model in existence today hallucinates in similar ways.
To give a few examples: it recommends books that don’t exist; it insists that the