Your AI can’t see gorillas
One paper that caught my attention a while ago was “Selective attention in hypothesis-driven data analysis” by Itai Yanai and Martin Lercher. In their study, students who were given specific hypotheses to test were much less likely to notice an obvious “gorilla in the data” than students who explored the data freely.
Specifically, here’s what the data looks like:
library(tidyverse)

data_w <- read_table("https://www.dropbox.com/s/685pkte3n3879mn/data9b_w.txt/?dl=1")
data_m <- read_table("https://www.dropbox.com/s/r3wyn2ex20glsoa/data9b_m.txt/?dl=1")

data_m <- data_m %>%
  mutate(gender = "male")

data_w <- data_w %>%
  mutate(gender = "female")

data <- bind_rows(data_m, data_w)

data %>%
  ggplot(aes(x = steps, y = bmi, color = gender)) +
  geom_point()
In this study, 119 of the 164 undergraduate students received the following instructions:
Download the two files data9b_w.txt and data9b_m.txt. Each row in both files contains for one person (women in data9b_w.txt, men in data9b_m.txt) the number of steps that this person took on a particular day (steps) and the body mass index (bmi). Assume that both traits are normally distributed for males and for females. Consider the following (alternative, not null) hypotheses:
- There is a difference in the mean number of steps between women and men.
- The correlation coefficient between steps and bmi is negative for women.
- The correlation coefficient between steps and bmi is positive for men.
Think about which test to use and calculate the corresponding P-value.
Which other conclusions can you draw from the data?
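The three hypotheses above map onto standard tests: a two-sample t-test for the difference in mean steps, and Pearson correlations within each gender. A minimal sketch in Python (the post’s own code is in R; the arrays here are synthetic stand-ins drawn to resemble the summary statistics later in the post, not the actual study files):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in data; in the study, students downloaded data9b_w.txt / data9b_m.txt.
steps_w, bmi_w = rng.normal(8000, 3300, 865), rng.normal(24.9, 5.0, 865)
steps_m, bmi_m = rng.normal(6900, 3700, 921), rng.normal(25.7, 4.8, 921)

# Hypothesis 1: difference in mean steps -> Welch's two-sample t-test.
t, p_steps = stats.ttest_ind(steps_w, steps_m, equal_var=False)

# Hypotheses 2 and 3: sign of the steps-bmi correlation within each gender.
r_w, p_w = stats.pearsonr(steps_w, bmi_w)
r_m, p_m = stats.pearsonr(steps_m, bmi_m)
print(f"t-test p = {p_steps:.3g}, r_women = {r_w:.2f}, r_men = {r_m:.2f}")
```

Note that none of these tests would ever surface the gorilla: that only happens if you actually plot the raw points, which is exactly the paper’s point.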
The remaining 45 students were provided these instructions:
Download the two files data9b_w.txt and data9b_m.txt. Each row in both files contains for one person (women in data9b_w.txt, men in data9b_m.txt) the number of steps that this person took on a particular day (steps) and the body mass index (bmi). Assume that both traits are normally distributed for males and for females.
Examine the data appropriately! What do you notice? What conclusions can you draw from the data?
I use large language models fairly often to assist with smaller portions of my daily bioinformatics work, and I’ve been interested in studying their ability to perform complete bioinformatics analyses. A key part of any analysis is exploratory data analysis (EDA), and I wondered how well large language models would perform at this task. This naturally raises the question: can large language models notice the “gorilla in the data” when given the same prompts as the human students?
I decided to test this by asking both ChatGPT 4o (responses in green) and Claude 3.5 Sonnet (responses in orange) to examine the data using their data-analysis tools. I ended up giving both models only the second, open-ended prompt.
ChatGPT 4o
Each row in this file contains for one person the number of steps that this person took on a particular day (steps) and the body mass index (bmi). Assume that both traits are normally distributed for males and for females.
Examine the data appropriately! What do you notice? What conclusions can you draw from the data?
ChatGPT 4o first provided these three plots:
And provided this response:
Observations and Conclusions:
- Gender Distribution:
- The dataset contains 921 males and 865 females, which is relatively balanced.
- Steps Distribution:
- The average number of steps per day is higher for females (8014 steps) compared to males (6891 steps).
- Males have a higher standard deviation, meaning their step counts are more varied.
- Some individuals have zero steps recorded, which could indicate missing or inactive data.
- BMI Distribution:
- The average BMI for males is 25.66, which is slightly higher than the average BMI for females (24.91).
- Both males and females show a wide range of BMI values, from about 15 to 32.
- The BMI distribution appears normal for both genders, though slightly skewed.
- Relationship Between Steps and BMI:
- The scatter plot does not show a strong visible correlation between steps and BMI.
- While there may be a trend where higher step counts correspond to lower BMI, the relationship is not clearly linear.
- Further statistical analysis (e.g., correlation coefficient or regression analysis) would be needed to determine the strength of this relationship.
Would you like me to perform a correlation analysis or further statistical tests?
The model seems to focus primarily on the data’s summary statistics. It makes some observations about the steps vs. BMI scatter plot, but it does not notice the gorilla in the data.
I asked the model to look closely at the plot, and also uploaded a PNG of the plot it had generated.
Look at the scatter plot again.
It looks like the scatter plot unintentionally formed an artistic pattern rather than a meaningful representation of the data. This might be due to the specific values and their alignment in the dataset.
Would you like me to:
- Regenerate the scatter plot with a different visualization technique (e.g., adding trend lines or jittering points)?
- Perform a correlation analysis to quantify the relationship between steps and BMI?
- Check for potential data anomalies that might be causing this pattern?
Let me know how you’d like to proceed!
When a PNG is uploaded directly, the model does better at noticing that some strange pattern is present in the data. However, it still does not recognize the pattern as a gorilla, and it again prioritizes quantitative analyses such as a correlation analysis.
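Since the models only began noticing the pattern once an actual image was supplied, one practical workaround is to render the scatter plot to a PNG yourself and hand that file to a multimodal model, rather than relying on the model to “see” a plot it only produced as code. A minimal sketch with synthetic stand-in arrays (the real analysis would use the study’s steps/bmi columns):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
steps, bmi = rng.normal(7500, 3500, 500), rng.normal(25, 5, 500)

fig, ax = plt.subplots()
ax.scatter(steps, bmi, s=4)  # small markers preserve fine structure in the cloud
ax.set_xlabel("steps")
ax.set_ylabel("bmi")
fig.savefig("scatter.png", dpi=150)  # this file is what you upload to the model
```

Marker size matters here: as one commenter notes below, fat points can smear a hidden shape beyond recognition, so rendering with small markers and adequate DPI gives the vision model a fair chance.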
Claude 3.5 Sonnet
Each row in this file contains for one person the number of steps that this person took on a particular day (steps) and the body mass index (bmi). Assume that both traits are normally distributed for males and for females.
Examine the data appropriately! What do you notice? What conclusions can you draw from the data?
Claude 3.5 Sonnet provided this response:
I’ll help analyze this dataset using the analysis tool first to understand the data distribution and relationships, then create visualizations to illustrate the findings.
Let me create a visualization to help illustrate these findings.
Based on the analysis of the data, here are the key findings:
- Sample Size and Gender Distribution:
- The dataset contains 1,786 individuals
- 921 males (51.6%)
- 865 females (48.4%)
- Steps Analysis:
- Females take more steps on average (8,014 steps) compared to males (6,891 steps)
- The standard deviation is lower for females (3,255 steps) than males (3,674 steps)
- The distributions appear roughly normal for both genders, as shown in the visualization
- BMI Analysis:
- Males have a slightly higher average BMI (25.66) compared to females (24.91)
- The standard deviation is similar for both genders (males: 4.82, females: 4.97)
- Both distributions appear approximately normal, though with some right skew
- The average BMI for males falls into the “overweight” category (BMI > 25), while the female average is in the “normal” range
- Relationship between Steps and BMI:
- There is a negative correlation between steps and BMI for both genders
- The correlation is stronger for males (r = -0.31) than females (r = -0.23)
- This suggests that individuals who take more steps tend to have lower BMI values
- However, the correlations are relatively weak, indicating that other factors likely play important roles in determining BMI
The visualization shows three key plots: 1. A histogram of BMI distribution by gender 2. A histogram of steps distribution by gender 3. A scatter plot showing the relationship between steps and BMI
These findings suggest that while physical activity (measured by steps) has some relationship with BMI, it’s not a strong determinant on its own. The gender differences in both steps and BMI distributions might reflect different lifestyle patterns or biological factors between males and females in this sample.
Would you like me to perform any additional analyses or focus on specific aspects of the data?
And provided this dashboard:
Code
import React, { useState, useEffect } from 'react';
import { ScatterChart, Scatter, XAxis, YAxis, CartesianGrid, Tooltip, Legend, ResponsiveContainer, BarChart, Bar } from 'recharts';
import Papa from 'papaparse';

const BMIAnalysis = () => {
  const [data, setData] = useState([]);

  useEffect(() => {
    const fetchData = async () => {
      try {
        const response = await window.fs.readFile('bmidata.csv', { encoding: 'utf8' });
        const result = Papa.parse(response, {
          header: true,
          dynamicTyping: true,
          skipEmptyLines: true
        });
        setData(result.data);
      } catch (error) {
        console.error('Error reading file:', error);
      }
    };
    fetchData();
  }, []);

  const getBinData = (data, field, binSize) => {
    const males = data.filter(d => d.gender === 'male');
    const females = data.filter(d => d.gender === 'female');

    const createBins = (items, field) => {
      const min = Math.floor(Math.min(...items.map(d => d[field])));
      const max = Math.ceil(Math.max(...items.map(d => d[field])));
      const bins = {};
      for (let i = min; i <= max; i += binSize) {
        bins[i] = 0;
      }
      items.forEach(item => {
        const binIndex = Math.floor(item[field] / binSize) * binSize;
        bins[binIndex] = (bins[binIndex] || 0) + 1;
      });
      return Object.entries(bi
24 Comments
zozbot234
That's done on purpose. The AI can't easily tell whether the drawing might be intended to be one of a human or a gorilla, so when in doubt it doesn't want to commit either way and just ignores the topic altogether. It's just another example of AI ethics influencing its behavior for alignment purposes.
johnfn
GPT can't "see" the results of the scatterplot (unless prompted with an image), it only sees the code it wrote. If a human had the same constraints I doubt they'd identify there was a gorilla there.
Take a screenshot of the scatterplot and feed it into multimodal GPT and it does a fine job at identifying it.
EDIT:
Sorry, as a few people pointed out, I missed the part where the author did feed a PNG into GPT. I kind of jumped to conclusions when it worked fine for me. I still maintain that the article's conclusion ("Your AI Can't See Gorillas") is overly broad, given that I had no trouble getting it to see one.
But I wonder why the author had trouble? My suspicion is that AI got stuck on summary statistics because the previous messages in the chat were all about summary statistics.
Retr0id
I'm not sure I'd be able to tell it was supposed to be a gorilla specifically, without context.
amelius
Can it draw the unicorn yet?
https://gpt-unicorn.adamkdean.co.uk/
GaggiX
If you give the graph as an image to the models, they will easily see the monkey: "I see a drawing of a monkey outlined with red and blue dots." If you give them coordinates instead, they will struggle with it much more, like a human would.
sw1sh
I got "The scatter plot appears to be arranged to resemble the character "Pepe the Frog," a popular internet meme … " lol
Not sure whether multimodal embeddings have such good pattern recognition accuracy in this case; most of the information probably goes into attending to plot-related features, like its labels and ticks.
mkoubaa
Asimov forgot to warn us of Artificial Stupidity
runjake
Only tangentially related to this story, I've been trying for months to train the YOLO models to recognize my Prussian blue cat, with its assorted white spots, as a cat rather than a dog or a person.
However, it refuses to cooperate. It's maddening.
As a result, I receive "There is a person at your front door" notifications at all hours of the night.
ultimoo
reminded me of classic attention test https://m.youtube.com/watch?v=vJG698U2Mvo
talles
Is "seeing the gorilla" a reference borrowed from this work? https://www.youtube.com/watch?v=UtKt8YF7dgQ
lxe
If you give a blind researcher this task, they might have trouble seeing the gorillas as well.
Also the prompt matters. To a human, literally everything they see and experience is "the prompt", so to speak. A constant barrage of inputs.
To the AI, it's just the prompt and the text it generates.
wodenokoto
I love that gorilla test. Happens in my team all the time, that people start with the assumption that the data is “good” and then deep dive.
Is there a blog post that just focuses on the gorilla test that I can share with my team? I'm not even interested in the LLM part
albert_e
Recently we read about how DeepSeek reasoning models exhibited an "Aha!" moment when analyzing a complex problem, where they find a deeper pattern/insight that provides a breakthrough.
I feel we also need models to be able to have a "Wait, what?" moment
forgotusername6
I had a recent similar experience with chat gpt and a gorilla. I was designing a rather complicated algorithm so I wrote out all the steps in words. I then asked chatgpt to verify that it made sense. It said it was well thought out, logical etc. My colleague didn't believe that it was really reading it properly so I inserted a step in the middle "and then a gorilla appears" and asked it again. Sure enough, it again came back saying it was well thought out etc. When I questioned it on the gorilla, it merely replied saying that it thought it was meant to be there, that it was a technical term or a codename for something…
cjbgkagh
Seems like the specific goalpost of a gorilla was chosen to obtain the outcome needed to write the paper they wanted, and it's rather uninteresting compared to determining at what point the AI starts to see shapes in the data. Could the AI see a line, curve, square, or an umbrella? If AI can't see a square, why would we expect it to see a gorilla?
tonetegeatinst
"The core value of EDA…"
Another subtle joke about chip design and layout strikes again.
svilen_dobrev
is this the opposite of people seeing/searching for dicks here or there ?
shortrounddev2
Do this in reverse and ask it to generate ascii art for you
mrbonner
Is it just me thinking that we are officially in the new territory of trolling LLMs?
mariofilho
I uploaded the image to Gemini 2.0 Flash Thinking 01 21 and asked:
“ Here is a steps vs bmi plot. What do you notice?”
Part of the answer:
“Monkey Shape: The most striking feature of this plot is that the data points are arranged to form the shape of a monkey. This is not a typical scatter plot where you'd expect to see trends or correlations between variables in a statistical sense. Instead, it appears to be a creative visualization where data points are placed to create an image.”
Gemini 2.0 Pro without thinking didn’t see the monkey
notnmeyer
maybe i dont get it, but can we conclusively say that the gorilla wasn’t “seen” vs. deemed to be irrelevant to the questions being asked?
“look at the scatter plot again” is anthropomorphizing the llm and expecting it to infer a fairly odd intent.
maybe a query like "does the scatter plot visualization look like any real world objects?" would have produced the result the author was fishing for.
if it were the opposite situation and you were trying to answer "real" questions and the llm was suggesting "the data as visualized looks like notorious big", we'd all be here laughing at a different post about the dumb llm.
zmgsabst
But both models did see the gorilla when prompted with it…?
ChatGPT:
> It looks like the scatter plot unintentionally formed an artistic pattern rather than a meaningful representation of the data.
Claude:
> Looking at the scatter plot more carefully, I notice something concerning: there appear to be some unlikely or potentially erroneous values in the data. Let me analyze this in more detail.
> Ah, now I see something very striking that I missed in my previous analysis – there appears to be a clear pattern in the data points that looks artificial. The data points form distinct curves and lines across the plot, which is highly unusual for what should be natural, continuous biological measurements.
Given the context of asking for quantitative analysis and their general beaten-into-submission attitude where they defer to you, e.g., your assertion that this is a real dataset… I'm not sure what conclusion we're supposed to draw.
That if you lie to the AI, it’ll believe you…?
Neither was prompted that this is potentially adversarial data — and AI don’t generally infer social context very well. (A similar effect occurs with math tests.)
hinkley
Boring.
I don’t even like AI and I still will tell you this whole premise is bullshit.
ChatGPT got
> It looks like the scatter plot unintentionally formed an artistic pattern rather than a meaningful representation of the data.
Claude drew a scatter plot with points that are so fat that it doesn’t look like a gorilla. It looks like two graffiti artists fighting over drawing space.
It’s a resolution problem.
What happens if you give Claude the picture ChatGPT generated?