I recently decided I should update the profile picture on my webpage.

As a Computer Science Professor, I figured the easiest way to produce a high-quality photo would be to generate it using DALL-E2. So I wrote a simple prompt, “Picture of a Professor named Maneesh Agrawala” and DALL-E2 made an image that is … well … stunning.

From the text prompt alone it generated a person who looks to be of Indian origin, dressed him in “professorly” attire and placed him in an academic conference room. At a lower level, the objects, the lighting, the shading and the shadows are coherent and appear to form a single unified image. I won’t quibble about the artifacts — the fingers don’t really look right, one temple of the glasses seems to be missing and of course I was hoping to look a bit cooler and younger. But overall, it is absolutely amazing that a generative AI model can produce such high-quality images, as quickly as we can think of prompt text. This is a new capability that we have never had before in human history.
And it is not just images. Modern generative AI models are black boxes that take a natural language prompt as input and transmute it into surprisingly high-quality text (GPT-4, ChatGPT), images (DALL-E2, Stable Diffusion, Midjourney), video (Make-A-Video), 3D models (DreamFusion) and even program code (Copilot, Codex).
So let’s use DALL-E2 to make another picture. This time I’d like to see what Stanford’s main quad would look like if it appeared in the style of the film Blade Runner. When I think of Stanford’s main quad I think about the facade of Memorial Church and palm trees. When I think of Blade Runner, I think of neon signs, crowded night markets, rain, and food stalls. I start with a simple prompt, “stanford memorial church with neon signage in the style of bladerunner”.

At this first iteration, the resulting images don’t really show the Stanford quad with its palm trees. So I first add “and main quad” to the prompt for iteration 2, and after inspecting those results I add “with palm trees” for iteration 3. The resulting images look more like the Stanford quad, but they don’t really look like the rainy nighttime scenes of Blade Runner. So I cycle: revise the prompt, inspect the DALL-E2 generated images, and revise the prompt again, trying to find a combination of prompt words that produces something like the image I have in mind. At iteration 21, after several hours of somewhat randomly trying different prompt words, I decide to stop.
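To make the trial-and-error concrete, here is a minimal sketch of that revise-and-inspect loop, assuming the OpenAI Python client and its DALL-E 2 image endpoint; the starting prompt, the choice of four candidates per iteration, and the manual prompt revision are just my illustration of the workflow, not part of DALL-E2 itself.

```python
# A minimal sketch of the prompt trial-and-error loop described above,
# assuming the OpenAI Python client (openai>=1.0) and the DALL-E 2 image
# endpoint. The loop itself is the point: the only control I have is the
# prompt string, and the only feedback is the finished image.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = "stanford memorial church with neon signage in the style of bladerunner"

for iteration in range(1, 22):  # I gave up at iteration 21
    response = client.images.generate(
        model="dall-e-2",
        prompt=prompt,
        n=4,                 # generate a few candidates per iteration
        size="1024x1024",
    )
    print(f"Iteration {iteration}: {prompt}")
    for image in response.data:
        print("  ", image.url)  # open and inspect each candidate by eye

    # No gradient to follow here: just a human guessing at new words.
    revision = input("Revised prompt (blank to stop): ").strip()
    if not revision:
        break
    prompt = revision
```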

The resulting image isn’t really what I had in mind. Even worse, it is unclear to me how to change the prompt to move the image toward the one I want. This is frustrating.
In fact, finding effective prompts is so difficult that there are websites and forums dedicated to collecting and sharing prompts (e.g. PromptHero, Arthub.ai, Reddit/StableDiffusion). There are also marketplaces for buying and selling prompts (e.g. PromptBase). And there is a cottage industry of research papers on prompt engineering.
To understand why writing effective prompts is hard, I think it is instructive to remember an anecdote from Don Norman’s classic book, The Design of Everyday Things. The story is about a two-compartment refrigerator he owned whose temperature he found extremely difficult to set properly. The temperature controls looked something like this:
Separate controls for the freezer and fresh food compartments suggest that each one has its own independent cooling unit. But this conceptual model is wrong. Norman explains that there is only one cooling unit; the freezer control sets the cooling unit’s temperature, while the fresh food control sets a valve that directs the cooling to the two compartments. The true system model couples the controls in a complicated way.

With an incorrect conceptual model, users cannot predict how the input controls produce the output temperature values. Instead they have to resort to an iterative, trial-and-error process of (i) setting the controls, (ii) waiting 24 hours for the temperature to stabilize and (iii) checking the resulting temperature. If the stabilized temperature is still not right, they must go back to step (i) and try again. This is frustrating.
For me there are two main takeaways from this anecdote.
- Well designed interfaces let users build a conceptual model that can predict how the input controls affect the output.
- When a conceptual model is not predictive, users are forced into using trial-and-error.
The job of an interface designer is to develop an interface that lets users build a predictive conceptual model.
Generative AI black boxes are terrible interfaces because they do not provide users with a predictive conceptual model. It is unclear how the AI converts an input natural language prompt into the output result. Even the designers of the AI usually can’t explain how this conversion occurs in a way that would allow users to build a predictive conceptual model.
I went back to DALL-E2 to see if I could get it to produce an even better picture of me, using the following prompt, “Picture of a cool, young Computer Science Professor named Maneesh Agrawala”.
But I have no idea how the prompt affects the picture. Does the word “cool” produce the sports coat and T-shirt combination, or do they come from the word “young”? How does the term “Computer Science” affect the result? Does the word “picture” imply the creation of a realistic photograph rather than an illustration? Without a predictive conceptual model I cannot answer these questions. My only recourse is trial-and-error to find the prompt that generates the image I want.
One goal of AI is to build models that are indistinguishable from humans. You might argue that natural language is what we use to work with other humans, and obviously humans are good interfaces. I disagree. Humans are also terrible interfaces for many generative tasks. And humans are terrible for exactly the same reasons that AI black boxes are terrible. As users, we often lack a conceptual model that can precisely predict how another human will convert a natural language prompt into output content.