OpenAI’s groundbreaking model DALL-E 2 hit the scene at the beginning of the month, setting a new bar for image generation and manipulation. With only a short text prompt, DALL-E 2 can generate completely new images that combine distinct and unrelated objects in semantically plausible ways, like the images below, which were generated by entering the prompt “a bowl of soup that is a portal to another dimension as digital art”.

DALL-E 2 can even modify existing images, create variations of images that maintain their salient features, and interpolate between two input images. DALL-E 2’s impressive results have many wondering exactly how such a powerful model works under the hood.
In this article, we will take an in-depth look at how DALL-E 2 manages to create astounding images like those above. Plenty of background information will be given and the explanation levels will run the gamut, so this article is suitable for readers at several levels of Machine Learning experience. Let’s dive in!
How DALL-E 2 Works: A Bird’s-Eye View
Before diving into the details of how DALL-E 2 works, let’s orient ourselves with a high-level overview of how DALL-E 2 generates images. While DALL-E 2 can perform a variety of tasks, including image manipulation and interpolation as mentioned above, we will focus on the task of image generation in this article.

At the highest level, DALL-E 2 works very simply (a short code sketch follows the list below):
- First, a text prompt is input into a text encoder that is trained to map the prompt to a representation space.
- Next, a model called the prior maps the text encoding to a corresponding image encoding that captures the semantic information of the prompt contained in the text encoding.
- Finally, an image decoding model stochastically generates an image which is a visual manifestation of this semantic information.
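To make these three steps concrete, here is a minimal sketch of the pipeline in Python. The text_encoder, prior, and decoder functions below are hypothetical stand-ins implemented as dummy NumPy stubs, not OpenAI’s actual models or API; the point is simply to show how the three stages chain together.

```python
import numpy as np

# Hypothetical stand-ins for the real models. In DALL-E 2 each of these is a
# large trained network; here they are dummy stubs so the sketch actually runs.

def text_encoder(prompt: str) -> np.ndarray:
    """Maps the text prompt to a representation space (a text embedding)."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(512)  # dummy 512-dimensional embedding

def prior(text_embedding: np.ndarray) -> np.ndarray:
    """Maps the text embedding to a corresponding image embedding."""
    return text_embedding + 0.1 * np.random.standard_normal(text_embedding.shape)

def decoder(image_embedding: np.ndarray) -> np.ndarray:
    """Stochastically generates an image from the image embedding."""
    return np.random.random((256, 256, 3))  # dummy 256x256 RGB "image"

def generate_image(prompt: str) -> np.ndarray:
    text_embedding = text_encoder(prompt)    # step 1: encode the prompt
    image_embedding = prior(text_embedding)  # step 2: text embedding to image embedding
    return decoder(image_embedding)          # step 3: decode into an image

img = generate_image("a bowl of soup that is a portal to another dimension as digital art")
print(img.shape)  # (256, 256, 3)
```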
From a bird’s-eye view, that’s all there is to it! Of course, there are plenty of interesting implementation specifics to discuss, which we will get into below. If you want a bit more detail without getting into the nitty-gritty, or you prefer to watch your content rather than read it, feel free to check out our video breakdown of DALL-E 2.
How DALL-E 2 Works: A Detailed Look
Now it’s time to dive into each of the above steps separately. Let’s get started by looking at how DALL-E 2 learns to link related textual and visual abstractions.
Step 1 – Linking Textual and Visual Semantics
After inputting “a teddy bear riding a skateboard in Times Square”, DALL-E 2 outputs the following image:

How does DALL-E 2 know how a textual concept like “teddy bear” is manifested in the visual space? The link between textual semantics and their visual representations in DALL-E 2 is learned by another OpenAI model called CLIP (Contrastive Language-Image Pre-training).
CLIP is trained on hundreds of millions of images and their associated captions, learning how much a given text snippet relates to an image. That is, rather than trying to predict a caption given an image, CLIP instead just learns how related any given caption is to an image. This contrastive rather than predictive objective allows CLIP to learn the link between textual and visual representations of the same abstract object. The entire DALL-E 2 model hinges on CLIP’s ability to learn semantics from natural language, so let’s take a look at how CLIP is trained to understand its inner workings.
CLIP Training
The fundamental principles of training CLIP are quite simple:
- First, all images and their associated captions are passed through their respective encoders, mapping all objects into an m-dimensional space.
- Then, the cosine similarity of each (image, text) pair is computed.
- The training objective is to simultaneously maximize the cosine similarity between the N correct encoded image/caption pairs and minimize the cosine similarity between the N² − N incorrect encoded image/caption pairs.
This training process is visualized below:
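To make the objective concrete, here is a minimal sketch of a single CLIP-style training step in PyTorch. The image_encoder and text_encoder arguments are hypothetical modules that map their inputs into the same m-dimensional space; this illustrates the symmetric contrastive objective described above rather than reproducing OpenAI’s training code.

```python
import torch
import torch.nn.functional as F

def clip_training_step(image_encoder, text_encoder, images, captions, temperature=0.07):
    """One contrastive step over a batch of N (image, caption) pairs.

    image_encoder / text_encoder are hypothetical modules that map their
    inputs into the same m-dimensional embedding space. (The real CLIP
    learns its temperature; it is fixed here for simplicity.)
    """
    # Encode both modalities and L2-normalize, so dot products equal cosine similarities
    image_emb = F.normalize(image_encoder(images), dim=-1)   # shape (N, m)
    text_emb = F.normalize(text_encoder(captions), dim=-1)   # shape (N, m)

    # N x N matrix of cosine similarities between every image and every caption
    logits = image_emb @ text_emb.t() / temperature

    # The N correct pairs sit on the diagonal; the N² − N off-diagonal entries
    # are incorrect pairings whose similarity the loss pushes down.
    targets = torch.arange(images.shape[0], device=logits.device)
    loss_images = F.cross_entropy(logits, targets)      # match each image to its caption
    loss_texts = F.cross_entropy(logits.t(), targets)   # match each caption to its image
    return (loss_images + loss_texts) / 2
```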
Additional Training Details
More information about the CLIP training process can be found below.
- Cosine Similarity
- The cosine similarity of two vectors is simply their dot product scaled by the product of their magnitudes. It measures the angle between two vectors in a vector space and, in the context of Machine Learning, determines how “similar” two vectors are to each other. If we consider each “direction” in the vector space as having a meaning, then the cosine similarity between two encoded vectors measures how “alike” the concepts represented by the vectors are (a small numerical example follows this list).
- Training Data
- CLIP is trained on the WebImageText dataset, which is composed of 400 million pairs of images and their corresponding natural language captions (not to be confused with Wikipedia-based Image Text).
- Parallelizability
- The parallelizability of CLIP’s training process is immediately evident – all of the encodings and cosine similarities can be computed in parallel.
- Text Encoder Architecture
- The text encoder is a Transformer
- Image Encoder Architecture
- The image encoder is a Vision Transformer
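As a quick illustration of the cosine similarity described above, here is a tiny NumPy example with arbitrary toy vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 1.0, 2.0])

# Cosine similarity: the dot product scaled by the product of the magnitudes
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # 8/9 ≈ 0.889, close to 1, so the vectors point in similar "directions"
```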
Significance of CLIP to DALL-E 2
CLIP is important to DALL-E 2 because it is what ultimately determines how semantically related a natural language snippet is to a visual concept, which is critical for text-conditional image generation.
Additional Information
CLIP’s contrastive objective allows it to understand semantic information in a way that convolutional models that learn only feature maps cannot. This disparity can easily be observed by comparing how CLIP, used in a zero-shot manner, performs across datasets relative to an ImageNet-trained ResNet-101. In particular, comparing how these models perform on ImageNet vs. ImageNet Sketch reveals this disparity well.

CLIP and an ImageNet-trained ResNet-101 perform with similar accuracy on ImageNet, but CLIP significantly outperforms the ResNet-101 on ImageNet Sketch.
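To see what “used in a zero-shot manner” means in practice, here is a brief sketch using OpenAI’s open-source clip package (this assumes the clip package, PyTorch, and Pillow are installed); the image path and candidate labels are hypothetical:

```python
import torch
import clip  # OpenAI's open-source CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical input image and candidate labels
image = preprocess(Image.open("sketch_of_a_dog.png")).unsqueeze(0).to(device)
labels = ["a dog", "a cat", "a teddy bear"]
text = clip.tokenize([f"a sketch of {label}" for label in labels]).to(device)

with torch.no_grad():
    # Similarity between the image embedding and each candidate caption embedding
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# The most similar caption is the zero-shot prediction: no dataset-specific
# training or fine-tuning was needed to classify this image.
print(dict(zip(labels, probs[0])))
```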