On December 20, 2021, I read an interesting paper from CSPI written by Leif Rasmussen. My main takeaway was that among National Science Foundation grant award abstracts there is what Rasmussen calls “Constriction of the space of ideas.” In other words, the language of NSF grant abstracts is becoming increasingly similar.
The constriction is demonstrated by an increase in cosine similarity among word embeddings and word frequency vectors between the years 1990 and 2020. Now, I know that sentence was perfectly clear, but just for fun, let’s break it down a little.
Background
Suppose we were going to rate words on a scale of one to ten in two dimensions. First, we rate how old-fashioned a word is. Second, we rate how funny the word seems to us. For each word we get a pair of numbers.
For example, if we wanted to rate “poppycock” we might say it’s a 9 for “old-fashioned” and an 8 for “funny”. Poppycock would be 9, 8. Another word, like snail, isn’t especially old-fashioned or funny, though the word has been around a while. We could call it a 5, 2.
Now that we can turn any word into a pair of numbers, we can also turn a sequence of words, called a “document”, into a pair of numbers: just convert every word in the document into its pair of numbers and then average each dimension across the words.
My snail speaks poppycock. = 7, 4
That swain is up to some jiggery-pokery. = 8, 3
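To make that concrete, here is a minimal Python sketch of the scheme. Only the scores for “snail” (5, 2) and “poppycock” (9, 8) come from the examples above; the ratings for “my” and “speaks” are placeholders I picked so the average works out to the (7, 4) in the first example.

```python
# Hypothetical two-dimensional word ratings: (old-fashioned, funny).
RATINGS = {
    "my": (6, 3),         # placeholder rating
    "snail": (5, 2),      # from the example above
    "speaks": (8, 3),     # placeholder rating
    "poppycock": (9, 8),  # from the example above
}

def document_vector(document):
    """Turn a document into one pair of numbers by averaging its word ratings."""
    words = document.lower().replace(".", "").split()
    pairs = [RATINGS[word] for word in words]
    old_fashioned = sum(p[0] for p in pairs) / len(pairs)
    funny = sum(p[1] for p in pairs) / len(pairs)
    return (old_fashioned, funny)

print(document_vector("My snail speaks poppycock."))  # (7.0, 4.0)
```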
We can treat the pair of numbers as a coordinate in two-dimensional space, and imagine an arrow drawn from the origin (0, 0) with its arrowhead at our coordinate. A document, which could be anything from a single word to a simple sentence to a full-length novel, becomes an arrow in 2-D space. Two documents become two arrows.
We can then find the cosine similarity, which is the cosine of the angle between the two arrows. There is no need to measure the angle directly: cos(θ) = (a · b) / (|a| |b|), the dot product of the two vectors divided by the product of their lengths. Because all of our ratings are positive, the cosine similarity gives us a measure, between zero and one, of how similar the two vectors are.
In our example the cosine similarity between our two vectors is 0.987. That tells us that our documents point in nearly the same direction: as far as “old-fashioned” and “funny” go, they are very similar.
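Here is the same calculation as a short Python sketch, using the standard dot-product formula above (this is just the general formula, not anything specific to the paper’s code):

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# The two document vectors from the examples above.
print(cosine_similarity((7, 4), (8, 3)))  # ~0.987
```

Note that scaling a vector doesn’t change the result; only the direction matters, which is why a short sentence and a full-length novel can still be compared this way.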