I got access to Dall·E 2 yesterday. Here are some pretty pictures!
My goal was to understand what DE2 could do well, and what it had trouble understanding or generating. My general hypothesis was that it would do a better job with things that are easy to find on the internet (cute animals, digital sci-fi things, famous art) and less well with more abstract or unusual things.
Here’s how it works: you put in a description of a picture, it thinks for ~20 seconds, and then it produces 10 images that are variations on that description. The diversity varies quite a bit depending on the prompt.
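For the curious: at the time, access was through a web UI only, but the same prompt-in, images-out flow maps onto OpenAI’s public Images API. Here’s a minimal sketch of the request payload you’d build; the endpoint and parameter names come from the public API and may differ from whatever the research preview used under the hood.

```python
import json

# Endpoint of OpenAI's public Images API; the preview I used was web-only,
# so scripting it like this is an assumption about the equivalent API flow.
API_URL = "https://api.openai.com/v1/images/generations"

def build_request(prompt: str, n: int = 10, size: str = "1024x1024") -> dict:
    """Build the JSON payload for one generation call: a text prompt plus
    how many variations to return (the UI shows 10 per prompt)."""
    return {"prompt": prompt, "n": n, "size": size}

payload = build_request("goldendoodle puppies playing in the grass")
print(json.dumps(payload, indent=2))
```

To actually run a generation, you’d POST this JSON to the endpoint with an `Authorization: Bearer <api key>` header.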
Let’s see some puppies!

One thing to be aware of when you see amazing pictures that DE2 generates is that there is some cherry-picking going on. It often takes a few prompts to find something awesome, so the person sharing an image might have looked at dozens or more before picking it.
Still, this is pretty great! Those are recognizably goldendoodle puppies, mostly in something approximating play position.
You can see that the proportions in the generated images are not quite right, and some of the detail is off if you look closely. For instance, the front legs are too long here, the face isn’t quite right, and the ears are a bit weird.

Still, it’s pretty amazing given that it generated this from scratch. Check out how realistic the grass looks. I also like that the background is blurred, though not quite in the way that a camera would do it — the transition is too abrupt.
Ok, but the point of this isn’t that they have a great image generation model, though they clearly do. The key thing is its magical ability to actually follow instructions or descriptions of images. Particularly interesting is compositionality: can it combine concepts to generate something it’s never seen before? Answer: yes!

The concept of “kitten” is pretty simple, though note that a kitten can be rendered in a ton of ways, from line drawings to cute art to photorealism. Pop art is more complicated: it’s a celebration of everyday images, and one of the most commonly known versions is Warhol’s grids of repeated images with neon colors that vary per cell. DE2 mostly gets those things right.

What about weird things? You can put in any input and it’ll do something.

None of those are Twitter-worthy, but with some trial and error you can get things that are interesting.

“Digital style” is one of the suggestions for getting better images.
“X in Y style” prompts are fun; they account for a lot of the DE2 images you see out in the world. Weirdly, it’s pretty sensitive to exactly the order you put things in.
Back to puppies: you get pretty different results depending on where “surrealistic” goes in the prompt, even though the rephrasings seem semantically identical, or at least very similar.



One place where DE2 clearly falls down is in generating people. I generated an image for [four people playing poker in a dark room, with the table brightly lit by an ornate chandelier], and people didn’t look human — more like the typical GAN-style images where you can see the concept but the details are all wrong.
Update: image removed because the guidelines specifically call out not sharing realistic human faces.
Anything involving people, small well-defined objects, and so on looks much more like the output of previous systems in this area. You can tell it has all the concepts, but it can’t translate them into something realistic.
This could be deliberate, for safety reasons — realistic images of people are much more open to abuse than other things. Porn, deep fakes, violence, and so on are much more worrisome with people. They also mentioned that they scrubbed out lots of bad stuff from the training data; possibly one way they did that was removing most images with people.
Things look much better with animals, and better again with an artistic style.

The cards aren’t right. Dice seem to be a lot easier.
People can also be pretty good if you don’t see faces, though the hands are definitely not right.

Stlalm Anit is my new slogan.
In general, all the writing I’ve seen it produce is bad. I think this is less about safety and more that it’s hard to learn language by looking at a lot of images. However, since DE2 is trained on paired images and text, it clearly knows a lot about language at some level; I would expect there’s plenty of data to produce coherent text. Instead it outputs nonsense, focusing on getting the fonts and the background right.


I definitely see serifs! I do not see sense.
Overall this is more powerful, flexible, and accurate than the previous best systems. It’s still easy to find holes in it, but with some patience and willingness to iterate, you can make some amazing images.
In conclusion: generating a lot of images from a new state-of-the-art image generation system is fun. Thanks for reading! If there’s interest, I can also explore in-painting. Here are a few more gratuitous pics!





Reader requests:

