On November 24, 2022, a fatal fire broke out in a residential apartment building in Urumqi, China, killing several residents who had been locked into their apartments from the outside as part of China’s zero-COVID effort. Protests against China’s COVID-19 restrictions broke out in various cities, including Urumqi, Shanghai, Beijing, and Guangzhou. As is often the case, people turned to Twitter to find out more. However, when searching for the Chinese names of these cities, many observed that news about the protests was difficult to find amid a deluge of spam and adult content using those city names as hashtags. Several analysts and news outlets suggested that this was a deliberate campaign, likely by the Chinese government, to drown out legitimate content with a flood of shady spam.
I argue that much of this “surge” in spam is illusory, due to both data bias and cognitive bias. I also argue that while the spam did drown out legitimate protest-related content, there is no evidence that it was designed to do so, nor that it was a deliberate effort by the Chinese government. To explain why, we must look at how historical social media data is biased.
Data Biases, Cognitive Biases
An underappreciated facet of social media analysis is that it is extremely dependent on when the data was gathered, because past data has often been removed or altered by users or by the platform itself. In newly gathered data, spam and other policy-violating behavior will very often appear to have increased just recently, because content enforcement, particularly for low-risk content, is not immediate: it can take several days to detect violations, write enforcement rules or train models, and then remove the content. Nor does enforcement happen continuously; content is often taken down in batches. Gathering data via APIs can itself take multiple days, so content may have changed or been removed by the time the collection mechanism even reaches it.
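To make the effect concrete, here is a minimal sketch in Python with entirely made-up numbers (the spam volume and daily removal rate are illustrative assumptions, not real Twitter figures). It shows how spam posted at a perfectly constant rate can still look like a recent surge in a freshly collected snapshot, simply because older spam has had more time to be moderated away.

```python
import random
from collections import Counter

random.seed(0)

# Illustrative numbers only -- not real Twitter figures.
DAYS = 30                  # how far back the snapshot reaches
SPAM_PER_DAY = 1000        # spam posted at a constant daily rate
DAILY_REMOVAL_RATE = 0.25  # assumed chance a surviving spam post is removed each day

def still_visible(age_days: int) -> bool:
    """Return True if a spam post of the given age survived moderation,
    assuming each post has a fixed daily chance of being taken down."""
    p_survives = (1 - DAILY_REMOVAL_RATE) ** age_days
    return random.random() < p_survives

# Count how much spam is still visible, by age, at collection time.
visible = Counter()
for age in range(DAYS):
    visible[age] = sum(still_visible(age) for _ in range(SPAM_PER_DAY))

# Spam was posted at a flat rate, yet the snapshot shows far more of it
# in the most recent days: an apparent "surge" created entirely by
# moderation lag.
for age in sorted(visible):
    print(f"{age:2d} days ago: {visible[age]:4d} spam posts still visible")
```

The same distortion shows up in practice when an identical search query is re-run days later and returns noticeably fewer of the older spam posts.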
This same effect appears in various types of data (I discuss a similar effect with user creation dates here), and becomes more pronounced the more likely the data is to contain inauthentic behavior. In a sense, historical queries are prone to a form of survivorship bias: past data has potentially been moderated, while very recent data has not. Basing an analysis on the “surviving” content can distort what really happened.
This data bias is compounded by a cognitive bias: the recency illusion, i.e. the perception that things one has recently noticed are more prevalent than they actually are. For someone with no tendency to spend time searching hashtags of Chinese cities (or in Chinese-language Twit