25/11/2017

Fake it ‘til you make it: How we identified a large number of fake followers on Instagram

 

By Timo Grossenbacher, SRF Data

In Switzerland, being a so-called influencer is the dream job of the moment for a lot of young people. Getting a wealth of free products, or even hard cash, in exchange for an Instagram post is enticing, and the advertising industry seems to have discovered a new, effective way of approaching target groups. This phenomenon got a lot of media attention in Switzerland over the past summer. At the same time, accusations started appearing that the followers of many influencers are, in fact, fake. Could these accusations be true? There has never been a systematic study on the subject: as far as we know, nobody, neither inside nor outside Switzerland, has ever tried to thoroughly quantify the fake follower problem on Instagram.

So that’s what we at SRF Data did. We are a small data journalism team working for the Swiss Public Broadcast, consisting of two reporters and one frontend engineer. Last October, we published an investigation into Swiss Instagram influencers, showing that around a third of the seven million Instagram accounts we investigated were very likely fake, that is, of no actual value to advertisers.

How we conducted the analysis

Our work process can be divided into three main steps:

  • Identify a representative sample of Swiss influencers
  • Download key metrics of the audience of these influencers
  • Given such key metrics, classify the audience into fake or real

The first step was rather straightforward. We asked Le Guide Noir for a list of the top 100 Swiss influencers, which they kindly provided to us. We scanned the list and removed celebrities (for example, Roger Federer) who were famous regardless of their Instagram profiles, since we only wanted to target influencers who gained influence thanks to their activity on Instagram. We also added a few more influencers from other sources. In the end, we had a list of 115 Swiss influencers, with an audience totalling seven million accounts.

Because we couldn’t manually go through seven million Instagram profiles (that would have taken more than five years of our precious time…), we turned to a statistical learning model, a form of artificial intelligence, to do the classification work for us. To do so, the model needed to be given so-called ‘features’, that is, attributes or key metrics of the profiles to be classified. Turning back to our list of seven million followers, we scraped features such as ‘number of people following’ or ‘number of posts’ from their Instagram pages (an example here). This also worked for private profiles, but took quite a while. From the downloaded metrics (see the first table below), we engineered several other features (see the second table below).
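To make the idea of engineered features more concrete, here is a minimal sketch in Python (our actual analysis was done in R). The field names are hypothetical, only the following-to-posts ratio is taken from the description in this article, and the second ratio as well as the "+ 1" in the denominators are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """Raw metrics scraped for one follower (field names are hypothetical)."""
    n_followers: int
    n_following: int
    n_posts: int
    is_private: bool


def engineer_features(p: Profile) -> dict:
    """Turn raw metrics into model features.

    Adding 1 to each denominator is an assumption of this sketch: it keeps
    accounts with zero posts or zero followers from causing a division by zero.
    """
    return {
        "n_followers": p.n_followers,
        "n_following": p.n_following,
        "n_posts": p.n_posts,
        "is_private": int(p.is_private),
        # Ratio named in the article: accounts followed per post made.
        "following_per_post": p.n_following / (p.n_posts + 1),
        # Additional illustrative ratio (not confirmed by the article).
        "following_per_follower": p.n_following / (p.n_followers + 1),
    }


# Two made-up profiles: one that follows thousands but never posts, one ordinary.
suspicious = Profile(n_followers=12, n_following=4800, n_posts=0, is_private=False)
ordinary = Profile(n_followers=310, n_following=280, n_posts=95, is_private=True)

for profile in (suspicious, ordinary):
    print(engineer_features(profile))
```

A profile that follows thousands of accounts without ever posting ends up with a very large following-per-post value, which is exactly the kind of signal a classifier can pick up on.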

We then fed these features into our statistical model, a so-called ‘random forest’. The model went through each of the seven million rows of our follower database and, taking only seconds, classified each one as fake or not fake.

How was our model able to do this? We decided to try the random forest method straight away because it has been shown to return good results for many binary classification tasks. We then manually classified 1000 Instagram profiles as fake or real and trained the model with 700 of them. We held back the remaining 300 to assess the model’s accuracy: for these 300, we already knew the answer and could compare it against the model’s prediction. To our surprise, we achieved an overall accuracy of almost 95%, meaning that almost 95% of the 300 test accounts were correctly classified. Given such high performance, we were confident using the same model on the real data, that is, the seven million Instagram profiles.
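The train-and-test procedure described above can be sketched roughly as follows. This is not our original R code but an illustration in Python with scikit-learn. The 1000 labelled profiles are replaced by synthetic data, and the feature distributions, the share of fakes and the number of trees are all assumptions, so the accuracy it prints says nothing about the real investigation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000  # stands in for the 1000 manually labelled profiles

# Synthetic labels: roughly 30% fakes (an assumption for this sketch).
is_fake = rng.random(n) < 0.3

# Synthetic features: fakes follow many accounts and post very little.
n_following = np.where(is_fake, rng.integers(1000, 7500, n), rng.integers(50, 1500, n))
n_posts = np.where(is_fake, rng.integers(0, 10, n), rng.integers(5, 500, n))
n_followers = np.where(is_fake, rng.integers(0, 200, n), rng.integers(20, 2000, n))

X = np.column_stack([n_followers, n_following, n_posts, n_following / (n_posts + 1)])
y = is_fake.astype(int)

# 700 profiles for training, 300 held back for testing, as in the article.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=700, test_size=300, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Accuracy = share of held-out profiles whose predicted label matches the known one.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"hold-out accuracy: {accuracy:.3f}")
```

The important design choice is that the 300 test profiles are never shown to the model during training, so the accuracy measured on them is an honest estimate of how it behaves on unseen accounts.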

What determines whether an account is considered fake or not

We also looked at which features determined whether an account was classified as fake or not. The following plot gives an indication. As you can see, the ratio between how many other accounts a profile is following and how many posts it has made is the most important discriminating factor, followed by the overall number of posts made. Whether an account is private or not, on the other hand, doesn’t seem to have a lot of discriminative power.
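For readers who want to see how such a ranking can be obtained, here is a small Python sketch using scikit-learn’s impurity-based `feature_importances_`. It is an illustration on synthetic data, not our R code, and the importance measure behind our plot may differ. In the synthetic data the private flag is deliberately made unrelated to the label, which is an assumption chosen to mirror the finding above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000

# Synthetic labels and features (all distributions are assumptions).
y = (rng.random(n) < 0.3).astype(int)
fake = y == 1
n_following = np.where(fake, rng.integers(1000, 7500, n), rng.integers(50, 1500, n))
n_posts = np.where(fake, rng.integers(0, 10, n), rng.integers(5, 500, n))
is_private = rng.integers(0, 2, n)  # drawn independently of the label

feature_names = ["following_per_post", "n_posts", "is_private"]
X = np.column_stack([n_following / (n_posts + 1), n_posts, is_private])

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; higher means the feature was more
# useful for separating fake from real accounts across the trees.
ranking = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name:20s} {importance:.3f}")
```

A feature that carries no information about the label, like the private flag here, ends up near the bottom of such a ranking.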

Who has the most fake followers?

In the course of the investigation, we decided to publish a list of all surveyed influencers and their fake-follower ratio. In line with our journalistic code of conduct, we confronted all of these influencers with the results and gave them the opportunity to provide a statement, which some of them did.

Interestingly, we found a certain number of fake followers in the audience of every influencer. It seems that some sort of base rate is normal. We suspect that these fake followers are spam accounts that randomly follow people with a high overall number of followers. However, above 25-30% fake followers, things start to get fishy…
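Once every follower has been classified, the fake-follower ratio per influencer is simple bookkeeping. The sketch below shows one way to compute it in Python. The influencer names and the data are made up, and the 30% cut-off is just the upper end of the range mentioned above, not a fixed rule from our analysis.

```python
from collections import defaultdict

# (influencer, follower_classified_as_fake) pairs; entirely made-up data.
predictions = (
    [("influencer_a", True)] * 1 + [("influencer_a", False)] * 9
    + [("influencer_b", True)] * 4 + [("influencer_b", False)] * 6
)


def fake_ratio_by_influencer(rows):
    """Return the share of followers classified as fake for each influencer."""
    totals = defaultdict(int)
    fakes = defaultdict(int)
    for influencer, is_fake in rows:
        totals[influencer] += 1
        fakes[influencer] += int(is_fake)
    return {name: fakes[name] / totals[name] for name in totals}


SUSPICIOUS_ABOVE = 0.30  # assumption: upper end of the 25-30% range

ratios = fake_ratio_by_influencer(predictions)
flagged = [name for name, ratio in ratios.items() if ratio > SUSPICIOUS_ABOVE]

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.0%} fake followers")
print("above threshold:", flagged)
```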

Our results were published in a series of articles on our website srf.ch/data, in an evening TV magazine and as a web video on YouTube and Facebook. The response was huge: the influencer scene was genuinely shocked.

We conducted our analysis in the statistical software R and published the code on GitHub, as we normally do with our research. If you are interested, you can have a look at the methodology here. To conclude: while the investigation was successful and very fun too, it was also quite tedious and time-consuming. We spent a total of 300 hours on the project, not counting the hours of our intern who was also heavily involved in the project.

About the author

Timo Grossenbacher is a data journalist and lead reporter with SRF Data, the data-driven research unit of the Swiss Public Broadcast (SRF). He specializes in conducting analyses in the statistical software R and teaches data journalism at the University of Zurich as well as the University of the Arts Zurich. Visit website.

You can watch the investigation’s web video here.
