TL;DR: Recent advances have enabled AI models to transform text into other modalities. This article covers what we’ve seen, where we are now, and what’s next.
You’re reading text right now–it’s serving as a medium for me to communicate a sequence of thoughts to you. Ever since humanity became a band of degenerates that actually wrote things down instead of using their memories, we’ve been using sets of signs to transmit information. Under some definitions, you might call all of this “text.”
Today, and over the past centuries, we have encoded our knowledge of the world, our ideas, our fantasies, into writing. That is to say, much of human knowledge is now available in the form of text. We communicate in other ways too–with body language, images, sounds. But text is the most abundant medium we have of recorded communications, thoughts, and ideas because of the ease with which we can produce it.
When GPT-3 was fed the internet, it consumed our observations about the world around us, our vapid drama, our insane arguments with one another, and much more. It learned to predict the next word in sequences drawn from the tokenized chaos of human expression. In learning how we form sequences of words to communicate, a large language model learns to mimic (or “parrot”) how we joke, commiserate, command. GPT-3 kicked off something of a “revolution” by being extremely good at “text-to-text”: prompted with examples of a task (like finishing an analogy) or the beginning of a conversation, the generative model can (often) competently learn the task or continue the conversation.
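To make “text-to-text” concrete, here’s a minimal sketch of few-shot prompting with an openly available language model. GPT-3 itself sits behind an API, so the sketch uses the much smaller GPT-2 through the Hugging Face transformers library as a stand-in; the model choice and the analogy prompt are illustrative, and a model this small will often flub the task.

```python
# A minimal sketch of "text-to-text" generation with an open model (GPT-2),
# standing in for GPT-3. Prompt and model choice are illustrative only.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Few-shot style prompt: show the pattern once, then let the model continue it.
prompt = (
    "hot is to cold as up is to down\n"
    "big is to small as fast is to "
)
completion = generator(prompt, max_new_tokens=5, do_sample=False)
print(completion[0]["generated_text"])
```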
There is almost a “universality” to the ways we use text, and we have only recently reached a point where AI systems can be built to exploit how we use language to describe other modalities. The progress that enabled powerful text generation also enabled text-conditioned multimodal generation. “Text-to-text” became “text-to-X.”
In “text-to-text,” you could ask a model to riff on a description of a dog. In text-to-image, you could turn that description into its visual counterpart. Text-to-image models afforded a new ability not present in earlier image generation systems. Earlier models like GANs were trained to generate realistic images from noise inputs (plus class information, in the case of class-conditional generation). But these models did not offer the level of controllability that DALL-E 2, Imagen, and their ilk provide users: you could ask for a photo of a kangaroo with sunglasses, standing in front of a particular building, holding a sign bearing a particular phrase. Your wish was the algorithm’s command.
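To give a feel for what that controllability looks like in practice, here’s a minimal sketch of prompting an open text-to-image model. DALL-E 2 and Imagen aren’t openly available, so the sketch uses a Stable Diffusion checkpoint through the diffusers library as a stand-in; the model ID, the prompt, and the assumption of a GPU are all illustrative.

```python
# A minimal sketch of text-conditioned image generation with the open-source
# diffusers library; Stable Diffusion stands in for closed models like
# DALL-E 2 or Imagen. Model ID and prompt are illustrative only.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # a GPU is assumed here

prompt = "a photo of a kangaroo wearing sunglasses, holding a sign that says 'hello'"
image = pipe(prompt).images[0]
image.save("kangaroo.png")
```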
Soon after text-to-image became effective, more followed: text-to-video was one of the first sequels. Text-to-audio already existed; text-to-motion and text-to-3D are just a few more examples of the ways in which text is being transformed into something else.
This article is about the “Year of Text-to-Everything.” Recent developments have enabled much more effective ways of converting text into other modalities, and at a rapid pace. This is exciting and promises to enable a great number of applications, products, and more over the coming years. But we should also remember that there are limits to the “world of text”–the disembodied musings that merely describe the world without actually interacting with it. I will discuss the advancements that led to today’s moment, and also spend time considering the limitations of text-to-everything if the “representations” of textual information remain in the world of text alone.
Of course, things technically start with GPT-3. I’ll abbreviate the story since it’s been told so many times: OpenAI trains a big language model based on the transformer architecture. That model is much bigger and is trained on much more data than its predecessor, GPT-2, which OpenAI had initially deemed too dangerous to fully release (175 billion parameters vs. about 1.5 billion; 40 TB of data vs. 40 GB). It can do things like write JavaScript code that’s not entirely horrendous. Some people say: “wow, cool.” Some people say: “wow, very not cool.” Some people say: “eh.” Startups are built on the new biggest model ever, news and academic articles are written praising and criticizing the new model, and countries that are not the USA develop their own big language models to compete.
In January 2021, OpenAI introduced a new AI model called CLIP, which boasted zero-shot capabilities similar to those of GPT-3. CLIP was a step towards connecting text and other modalities–it proposed a simple, elegant method for training an image encoder and a text encoder together so that, when queried, the full system could match an image with its corresponding caption among a selection of possible captions.
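Here’s a minimal sketch of that zero-shot matching, using the publicly released CLIP weights through the Hugging Face transformers library. The image file and candidate captions are hypothetical; the point is just the shape of the computation: embed the image, embed the captions, and score their similarity.

```python
# A minimal sketch of CLIP-style zero-shot matching: score one image against a
# handful of candidate captions and pick the most likely one.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")  # hypothetical local image
captions = ["a photo of a dog", "a photo of a cat", "a diagram of a transformer"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-caption similarity scores; softmax turns them
# into probabilities over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs[0].tolist())))
```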
DALL-E, probably the first system that was “good” at producing images from text, was released on the same day as CLIP. CLIP was not part of DALL-E’s generative model itself (though it was used to rerank DALL-E’s output samples), but it played an important role in DALL-E’s successor. Of course, given its ability to generate plausible images from text prompts, DALL-E made plenty of headlines.
While some AI pioneers have lamented that deep learning is not the way to go if we want to achieve “actual” general intelligence, text-to-image is undoubtedly a problem that is amenable to the powers of deep neural networks. A number of complementary advances in deep learning enabled text-to-image models to make further leaps: diffusion models, for instance, were shown to achieve impressive image sample quality in papers such as “Diffusion Models Beat GANs on Image Synthesis.”
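As a rough intuition for how diffusion models work (this is a toy sketch, not the method of any particular paper): corrupt an image with Gaussian noise at a randomly chosen timestep, train a network to predict that noise, and at generation time run the process in reverse, starting from pure noise. The schedule, shapes, and the stand-in “model” below are all illustrative.

```python
# Toy sketch of a diffusion training step: add noise to an image at a random
# timestep t, then train a network to predict that noise. A real system uses a
# U-Net that is also conditioned on t (and, for text-to-image, on a text
# embedding); the tiny linear "model" here is just a placeholder.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # how much signal survives at step t

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 3 * 32 * 32))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(x0):
    """One denoising step on a batch of images x0 with values in [-1, 1]."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                   # a random timestep per image
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # noised image
    pred_noise = model(x_t).view_as(noise)
    loss = F.mse_loss(pred_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(training_step(torch.rand(8, 3, 32, 32) * 2 - 1))  # fake batch to exercise the code
```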
DALL-E 2, released a little over a year after DALL-E, leveraged advances in diffusion models to create images even more photorealistic than DALL-E’s. DALL-E 2 was soon upstaged by Imagen and Parti–the former used diffusion models to achieve state-of-the-art performance on benchmarks, while the latter explored a complementary autoregressive approach to image generation.
This was not the end of the story. Midjourney, a commercial diffusion model for image generation, was released by a research lab of the same name. Stable Diffusion, a model that leveraged new research on