Stuart Ritchie
Over a decade has passed since scientists realized many of their studies were failing to replicate. How well have their attempts to fix the problem actually worked?
An empty room with a large cardboard box in the center. A group of 102 undergrad students. They’re split into three groups, and asked to sit either in the box, beside the box or in the room with the box removed. They complete a task that’s supposed to measure creativity — coming up with words that link together three seemingly unrelated terms.
The results of this experiment? The students who sat beside the box had higher scores on the test than the ones in the box or those with no box present. That’s because — according to the researchers — sitting next to the box activated in the students’ minds the metaphor “thinking outside the box.” And this, through some unknown psychological mechanism, boosted their creativity.
You might be laughing at this absurd-sounding experiment. You might even think I just made it up. But I didn’t: It was published as part of a real study — one that the editors and reviewers at one of the top psychology journals, Psychological Science, deemed excellent enough to publish back in 2012.
To my knowledge, nobody has ever attempted to replicate this study — to repeat the same result in their own lab, with their own cardboard box. That’s perhaps no surprise: After all, psychology research is infamous for having undergone a “replication crisis.” That was the name that came to describe the realization — around the same time that the cardboard box study was published — that hardly any psychologists were bothering to do those all-important replication studies. Why check the validity of one another’s findings when, instead, we could be pushing on to make new and exciting discoveries?
Developments in 2011 and 2012 made this issue hard to ignore. A Dutch psychology professor, Diederik Stapel, was found to have faked dozens of studies over many years, and nobody had noticed, in part because barely anyone had tried to replicate his work (and in part because it’s really awkward to ask your boss whether he’s made up all his data). Psychologists published a provocative paper showing that they could find essentially any result they wished by using statistics in biased ways — ways that were almost certainly routine in the field. And one of those rare-as-hen’s-teeth replication attempts found that a famous study from “social priming” (the same social psychology genre as the cardboard box study) — one in which merely seeing words relating to old people made participants walk more slowly out of the lab — might have been an illusion.
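To see why biased analysis choices are so potent, here is a minimal simulation sketch of just one such trick — “optional stopping,” where the analyst peeks at the data and keeps adding participants until the result crosses the significance threshold. This is my own illustration of the general principle, not the simulations from that paper; it assumes Python with numpy and scipy installed, and the sample sizes are arbitrary.

```python
# A minimal sketch (not the cited paper's actual code) of one biased practice:
# "optional stopping." There is NO true effect, but the analyst re-runs the test
# after each batch of participants and stops as soon as p < .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

def optional_stopping_study(n_start=20, n_max=60, batch=5):
    """Two groups drawn from the SAME distribution; return True if the
    study nonetheless ends up 'statistically significant'."""
    a = list(rng.normal(size=n_start))
    b = list(rng.normal(size=n_start))
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p < 0.05:
            return True   # stop and "publish"
        if len(a) >= n_max:
            return False  # give up
        a.extend(rng.normal(size=batch))
        b.extend(rng.normal(size=batch))

runs = 2000
rate = sum(optional_stopping_study() for _ in range(runs)) / runs
print(f"False-positive rate with optional stopping: {rate:.1%}")
# With a fixed sample size and a single test, this rate would hover around 5%;
# letting the data decide when to stop pushes it substantially higher.
```

Stack a few more flexible choices on top — dropping outliers after looking at the results, testing several outcome measures and reporting only the best one — and the false-positive rate climbs further still, which is part of why so many “discoveries” later failed to replicate.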
Similar stories followed. As psychologists got their act together and tried replicating one another’s work, sometimes in large collaborations where they chose many studies from prominent journals to try to repeat, they found that approximately half the time, the older study wouldn’t replicate (and even when it did, the effects were often a lot smaller than in the original claim). Confidence in the psychological literature started to waver. Many of those “exciting discoveries” psychologists thought they’d made were potentially just statistical flukes — products of digging through statistical noise and seeing illusory patterns, like the human face people claimed to see on the surface of Mars. Worse, some of the studies might even have been entirely made up.
The replication crisis, alas, applies to a lot more of science than just silly social psychology research. Research in all fields was affected by fraud, bias, negligence and hype, as I put it in the subtitle of my book Science Fictions. In that book, I argued that perverse incentives were the ultimate reason for all the bad science: Scientists are motivated by flashy new discoveries rather than “boring” replication studies — even though those replications might produce more solid knowledge. That’s because for scientists, so much hinges on getting their papers published — particularly getting published in prestigious journals, which are on the lookout for groundbreaking, boundary-pushing results. Unfortunately, standards are so low that many of the novel results in those papers are based on flimsy studies, poor statistics, sloppy mistakes or outright fraud.
I think it’s fair to predict with confidence that, were the cardboard box study to be repeated, the results would be different. It’s the kind of study — based on tenuous reasoning about how language affects thought, with statistical tests that, when looked at in detail, are right on the very edge of being considered “statistically significant” — that would be a prime candidate for a failed replication, should anyone ever try. It’s the kind of research that psychologists now look back on with embarrassment. Of course, a decade later we’ve learned our lesson, and definitely don’t do unreplicable studies like that any more.
Right?
The problems of fraud, bias, negligence and hype in science aren’t going away anytime soon. But we can still ask to what extent things have gotten better. Are researchers doing better studies — by any measure — than they were in 2012? Has anything about the perverse publishing dynamics changed? Have all the debates (what actually counts as a replication?), criticisms (are common statistical practices actually ruining science?), and reforms (should we change the way we publish research?) that have swirled around the idea of the replication crisis made science — in psychology, or indeed in any field — more reliable? Fundamentally, how much more can we trust a study published in 2022 compared to one from 2012?
If you jumped ten years forward in time from 2012, what would you notice that’s different about the way science is published? Certainly you’d see a lot of unfamiliar terms. For instance, unless you were a clinical trialist, you likely wouldn’t recognize the term “preregistration.” This involves scientists planning out their study in detail before they collect the data, and posting the plan online for everyone to see (the idea is that this stops them “mucking about” with the data and finding spurious results). And unless you were a physicist or an economist, you might be surprised by the rise of “preprints” — working papers shared with the community for comment, discussion and even citation before formal publication. These id