In this blog post, we explore the topic of fake GitHub stars. We will share our approach for identifying them and invite you to run this analysis on repos you are interested in. Click here to skip the background story and jump right to the code.
And if you enjoy this article, head on over to the Dagster repo and give us a real GitHub star!
- Why buy stars on GitHub?
- Let’s go star shopping…
- How can we identify these fake stars?
- Identifying obvious fakes
- Identifying sophisticated fakes
- Clustering intuition
- Improving the clustering
- The results
- Try this for yourself
GitHub stars are one of the main indicators of social proof on GitHub. At face value, they are something of a vanity metric, with no more objectivity than a Facebook “Like” or a Twitter retweet. Yet they influence serious, high stakes decisions, including which projects get used by enterprises, which startups get funded, and which companies talented professionals join.
Naturally, we encourage people interested in the Dagster project to star our repo, and we track our own GitHub star count along with that of other projects. So when we spotted some new open-source projects suddenly racking up hundreds of stars a week, we were impressed. In some cases, it looked a bit too good to be true, and the patterns seemed off: some brand-new repos would jump by several hundred stars in a couple of days, often just in time for a new release or other big announcement.
We spot-checked some of these repositories and found some suspiciously fake-looking accounts.
We were curious to find that most GitHub star analysis tools, and most articles covering the topic, fail to address the issue of fake stars.
We knew there were dubious services out there offering stars-for-cash, so we set up a dummy repo (frasermarlow/tap-bls) and purchased a bunch of stars. From these, we devised a profile for fake accounts and ran a number of repos through a test using the GitHub REST API (via pygithub) and the GitHub Archive database.
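To make this concrete, here is a minimal sketch of how star timestamps can be pulled with pygithub and bucketed by day to surface the sudden spikes described below. The token placeholder and the `stars_per_day` helper are our own illustration, not the authors' actual code; pygithub's `get_stargazers_with_dates()` returns `Stargazer` objects carrying a `starred_at` timestamp.

```python
from collections import Counter
from datetime import datetime


def stars_per_day(starred_dates):
    """Bucket star timestamps by calendar day; sharp one-day spikes on a
    brand-new repo are a first hint that stars may have been purchased."""
    return Counter(d.date().isoformat() for d in starred_dates)


if __name__ == "__main__":
    from github import Github  # pip install pygithub

    gh = Github("YOUR_GITHUB_TOKEN")  # placeholder: supply your own token
    repo = gh.get_repo("frasermarlow/tap-bls")
    # Each Stargazer has .user and .starred_at (when the star was given)
    dates = [sg.starred_at for sg in repo.get_stargazers_with_dates()]
    for day, count in sorted(stars_per_day(dates).items()):
        print(day, count)
```

Running this against the dummy repo above would show the 100 GitHub24 stars landing within a two-day window.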
So where does one buy stars? No need to go surfing the dark web. There are dozens of services available with a basic Google search.
In order to draw up a profile of a fake GitHub account used by these services, we purchased stars from the following services:
- Baddhi Shop – a specialist in low-cost faking of pretty much any online publicly influenceable metric. They will sell you 1,000 fake GitHub stars for as little as $64.
- GitHub24, a service from Möller und Ringauf GbR, is much more pricey at €0.85 per star.
To give them credit, the stars were delivered promptly to our repo. GitHub24 delivered 100 stars in 48 hours, which, if nothing else, was a major giveaway for a repo that, up until then, had only three stars. Baddhi Shop had a bigger order to fill (we bought 500 stars), and these arrived over the course of a week.
That said, you get what you pay for. A month later, all 100 GitHub24 stars still stood, but only three-quarters of the fake Baddhi Shop stars remained. We suspect the rest were purged by GitHub’s integrity teams.
We wanted to figure out how bad the fake star problem was on GitHub. To get to the bottom of this, we worked with Alana Glassco, a spam & abuse expert, to dig into the data, starting by analyzing public event data in the GitHub Archive database.
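The GitHub Archive dataset is queryable through BigQuery, where star events appear as `WatchEvent` rows in daily tables. As a hedged sketch of the kind of query involved (our illustration of the approach, not the authors' actual analysis), this builds a query for all star events on a repo for a given day:

```python
def watch_events_query(repo_name, day):
    """Build a BigQuery SQL query against the public GH Archive dataset.

    `day` is a YYYYMMDD string naming one of the githubarchive.day tables;
    stars show up as events with type 'WatchEvent'.
    """
    table = f"githubarchive.day.{day}"
    return (
        f"SELECT actor.login, created_at "
        f"FROM `{table}` "
        f"WHERE type = 'WatchEvent' AND repo.name = '{repo_name}'"
    )


if __name__ == "__main__":
    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()  # requires Google Cloud credentials
    for row in client.query(watch_events_query("frasermarlow/tap-bls", "20230115")):
        print(row.login, row.created_at)
```

The advantage of GH Archive over the live API is that it preserves events even after GitHub purges the accounts that generated them.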
You might be tempted to frame this up as a classical machine learning problem: simply buy some fake stars, and train a classifier to identify real vs fake stars. However, there are several problems with this approach.
- Which features? Spammers are adversarial and actively evade detection, so the obvious features to classify on – name, bio, etc. – are generally obfuscated.
- Label timeliness. Spammers are constantly changing their tactics to avoid detection. Labeled data may be hard to come by, and even data that is labeled may be out-of-date by the time a model is retrained.
In spam detection, we often use heuristics in conjunction with machine learning to identify spammers. In our case, we ended up with a primarily heuristics-driven approach.
After we bought the fake GitHub stars, we noticed that there were two cohorts of fake stars:
- Obvious fakes. One cohort didn’t try too hard to hide their activity. By simply looking at their profiles it was clear that they were not a real account.
- Sophisticated fakes. The other cohort was much more sophisticated, and created lots of real-looking activity to hide the fact that they were fake accounts.
We ended up with two separate heuristics to identify each cohort.
During our fake star investigation, we found lots of one-off profiles: fake GitHub accounts created for the sole purpose of “starring” just one or two GitHub repos. They show activity on one day (the day the account was created, which matches the day the target repo was starred), and nothing else.
We used the GitHub API to gather more information about these accounts, and a clear pattern emerged. These accounts were characterized by extremely limited activity:
- Created in 2022 or later
- Followers <=1
- Following <=1
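The criteria above translate into a simple filter. This is a minimal sketch assuming the profile fields come back as a dict keyed like the GitHub users API response (`created_at`, `followers`, `following`); it checks only the criteria listed here, not the authors' full heuristic:

```python
from datetime import datetime


def is_obvious_fake(profile):
    """Flag one-off accounts matching the low-activity profile above.

    `profile` is a plain dict with a few fields from the GitHub users API.
    """
    return (
        profile["created_at"].year >= 2022  # created in 2022 or later
        and profile["followers"] <= 1       # almost no followers
        and profile["following"] <= 1       # follows almost no one
    )


# Example: a throwaway-looking account created recently
suspect = {"created_at": datetime(2023, 1, 5), "followers": 0, "following": 1}
print(is_obvious_fake(suspect))  # True
```

In practice each condition is cheap to evaluate from a single API call per account, so this filter scales to scanning an entire stargazer list.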