18/7/2016

Scraping 27,000 tweets for presidential sentiment analysis

 

Forget about the weather, "So, how about that Trump guy?" has become the preferred way of starting small talk for people around the globe.

The US Presidential Primaries were nothing short of a news sensation. At the end of May, right before the Democratic Party announced their presumptive nominee for the general election, I was working on a way to collect and perform sentiment analysis of tweets.

After retrieving 27,000 tweets from 3 different news networks (CNN, Fox News, MSNBC) I found some interesting similarities between Hillary Clinton and Donald Trump – the winners of the primary race.

In this article, I will show you how I collected the tweets for sentiment analysis and how I used two different APIs, along with iPython notebooks and Python, to analyze them. For example, I found that Trump was mentioned 60% of the time across all media channels, receiving the most attention.

Scraping data and using a sentiment analysis tool

I used two APIs to assist with the data collection and text analysis. Since this was a political data science project, I needed quick and free data tools to do all of the heavy lifting for me. I used ParseHub, a visual web scraping tool, to extract the tweets and text-processing.com to analyze their sentiment.

Twitter only keeps the last 800 tweets from each account online, so I scraped snapbird.org, which has an archive of the latest 3,000 tweets from every account. After downloading the ParseHub desktop app, I opened snapbird.org in it and set up instructions to get ParseHub to log in to Twitter and search for Twitter accounts.

snapbird.png

After logging in, I set up another set of instructions to get ParseHub to search for various Twitter accounts and to load all 3,000 tweets for each. I clicked on the date and text of one tweet, and ParseHub automatically detected all of the tweets for each account and extracted the data.

parsehub.png

Ultimately, I downloaded a JSON file with 27,000 tweets collected by ParseHub that I could load into an iPython notebook using ParseHub’s API options and start analyzing.

json.png
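As a rough illustration of that loading step, the sketch below flattens a nested export into one list of tweet records. The key names (`accounts`, `name`, `tweets`, `date`, `text`) and the sample tweets are assumptions made for this example; a real ParseHub export uses whatever names were given to the selections in the project.

```python
import json

# Hypothetical sample shaped like a ParseHub export. The real key names
# depend on how the selections were named in the ParseHub project.
SAMPLE_EXPORT = """
{
  "accounts": [
    {"name": "CNN", "tweets": [
      {"date": "May 10", "text": "Trump wins the West Virginia primary"},
      {"date": "May 10", "text": "Sanders beats Clinton in West Virginia"}
    ]},
    {"name": "FoxNews", "tweets": [
      {"date": "May 11", "text": "Clinton responds to Trump"}
    ]}
  ]
}
"""


def flatten_tweets(export):
    """Turn the nested export into one flat list of tweet records."""
    rows = []
    for account in export.get("accounts", []):
        for tweet in account.get("tweets", []):
            rows.append({
                "account": account["name"],
                "date": tweet.get("date", ""),
                "text": tweet.get("text", ""),
            })
    return rows


tweets = flatten_tweets(json.loads(SAMPLE_EXPORT))
print(len(tweets), "tweets loaded")
```

With a file on disk, `json.load(open(path))` would take the place of `json.loads(SAMPLE_EXPORT)`.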

I then sent each tweet to the text-processing.com sentiment analysis API with an HTTP POST. In return, I received the tone of each tweet, and the probability that each tweet was positive, negative or neutral.

http.png
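A request like the one described above might be sketched as follows. This is not necessarily the code used for the analysis: the endpoint URL, the form-encoded `text` field, and the `label` and `probability` response fields reflect text-processing.com's public documentation as best understood here, and the function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the sentiment API expects a form-encoded "text" field
# in the POST body.
API_URL = "http://text-processing.com/api/sentiment/"


def build_request(text):
    """Build the POST request for one tweet."""
    body = urllib.parse.urlencode({"text": text}).encode("utf-8")
    return urllib.request.Request(API_URL, data=body, method="POST")


def parse_response(raw):
    """Extract the label and the class probabilities from the JSON reply."""
    reply = json.loads(raw)
    return reply["label"], reply["probability"]


def get_sentiment(text, opener=urllib.request.urlopen):
    """Send one tweet to the API; `opener` can be swapped out for testing."""
    with opener(build_request(text)) as response:
        return parse_response(response.read().decode("utf-8"))
```

Calling `get_sentiment(tweet["text"])` for each tweet returns a label (`pos`, `neg` or `neutral`) plus the probabilities behind it. The free API is rate limited, so a loop over 27,000 tweets would need to pause between requests.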

Trend 1: Clinton and Trump had the most media attention

Clinton and Trump received more media attention than any of the candidates they were running against. This was the case for all three news networks.

trumpclinton.png
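One simple way to measure this kind of attention is to count how many tweets mention each candidate's name and turn the counts into shares. The sketch below assumes plain case-insensitive substring matching and uses made-up tweets; the matching rule behind the figures in this article may have differed.

```python
from collections import Counter

# Candidate surnames to look for (illustrative list).
CANDIDATES = ["Trump", "Clinton", "Sanders", "Cruz", "Rubio", "Kasich"]


def mention_counts(tweets, names=CANDIDATES):
    """Count the tweets that mention each name (case-insensitive)."""
    counts = Counter()
    for tweet in tweets:
        text = tweet["text"].lower()
        for name in names:
            if name.lower() in text:
                counts[name] += 1
    return counts


def mention_shares(counts):
    """Convert raw counts into each candidate's share of all mentions."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


# Made-up sample tweets, for illustration only.
sample = [
    {"account": "CNN", "text": "Trump holds rally in Indiana"},
    {"account": "CNN", "text": "Clinton and Trump trade barbs"},
    {"account": "MSNBC", "text": "Sanders wins Oregon"},
    {"account": "FoxNews", "text": "Trump responds to critics"},
]
counts = mention_counts(sample)
shares = mention_shares(counts)
print(counts.most_common())
```

Substring matching is crude (it would also count "Trump" inside a longer word), so a stricter version could match on word boundaries instead.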

Two of the Twitter accounts had tweets that went back far enough to show how many times the last three Republican Party candidates were mentioned before they dropped out of the race. Between February 10th and May 10th, Trump received more coverage than the three other Republican candidates (Ted Cruz, Marco Rubio and John Kasich) combined.

repub.png
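Mentions over a date range like this one can be tallied by bucketing tweets into weeks. The sketch below assumes each tweet carries an ISO-formatted `date` string (scraped dates would need converting first) and uses invented sample tweets; weekly buckets are just one reasonable choice of granularity.

```python
from collections import defaultdict
from datetime import date


def weekly_mentions(tweets, name, start, end):
    """Count tweets mentioning `name` per ISO week within [start, end]."""
    weeks = defaultdict(int)
    for tweet in tweets:
        day = date.fromisoformat(tweet["date"])
        if start <= day <= end and name.lower() in tweet["text"].lower():
            year, week, _ = day.isocalendar()
            weeks[(year, week)] += 1
    # Sort by (year, week) so the result reads chronologically.
    return dict(sorted(weeks.items()))


# Invented sample tweets, for illustration only.
sample = [
    {"date": "2016-02-10", "text": "Trump wins New Hampshire"},
    {"date": "2016-02-12", "text": "Trump and Cruz clash"},
    {"date": "2016-02-16", "text": "Rubio gains endorsements"},
    {"date": "2016-02-17", "text": "Trump leads in polls"},
    {"date": "2016-05-11", "text": "Trump meets party leaders"},  # outside range
]
trump_by_week = weekly_mentions(sample, "Trump", date(2016, 2, 10), date(2016, 5, 10))
print(trump_by_week)
```

Running the same function once per candidate gives the series needed to plot the candidates against each other.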

The gap between Clinton and Sanders was not quite as dramatic. In fact, MSNBC reported on Sanders nearly as much as Clinton. You can see that the mentions Clinton and Sanders received from MSNBC over time follow a similar pattern throughout April and May.

democrats.png

Meanwhile, Fox News reported on Sanders much less than the other two networks. In the graph below, you can see that Clinton consistently received more coverage from Fox News than her opponent during the month of May.

fox.png

Just as Trump was receiving more coverage than his opponents right before they dropped out of the race, so too was Clinton before she defeated Sanders. This inequality in media coverage seems to be an indicator of a candidate’s political popularity, and I will be monitoring this trend throughout the general election to confirm it.

Trend 2: Clinton and Trump had tweets with stronger sentiment

When Clinton or Trump was mentioned in a tweet, the tweet was more likely to have a strong sentiment (either positive or negative) than a tweet mentioning any other candidate. Reporters are required to present the news in a neutral tone, but that only accounts for their own words. The tweets I analyzed also include quotations. These quotations could have come from a candidate, been about a candidate, or both. This suggests that Trump and Clinton are both polarizing figures; they capture people’s attention with what they say and what is said about them.

To test this further, I also analyzed the sentiment of their last 3,000 tweets and compared it to that of the tweets of Sanders, who lost the Democratic nomination. His tweets were more likely to have a neutral tone than those of the other two candidates.

sent.png
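A comparison like this comes down to the share of each tone among a candidate's tweets. The sketch below assumes every tweet has already been given a `pos`, `neg` or `neutral` label by the sentiment API; the labels in the sample are invented and do not reproduce the results above.

```python
from collections import Counter

TONES = ("pos", "neg", "neutral")


def tone_shares(labels):
    """Return the fraction of tweets carrying each tone label."""
    total = len(labels)
    if total == 0:
        return {}
    counts = Counter(labels)
    return {tone: counts[tone] / total for tone in TONES}


# Invented labels, for illustration only.
labels_by_author = {
    "candidate_a": ["pos", "pos", "neg", "neutral"],
    "candidate_b": ["neutral", "neutral", "neutral", "pos"],
}
for author, labels in labels_by_author.items():
    print(author, tone_shares(labels))
```

The same function works on a subset of tweets, for example only those that mention another candidate by name.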

What may surprise you is that Donald Trump was very likely to send a positive tweet, despite being notorious for insulting others on Twitter. However, his tweets had a negative tone more often than a positive one whenever he mentioned the name of another candidate.

donaldtrump.png

We have all heard the old adage that “all publicity is good publicity”, so could the strong tone of Trump and Clinton’s tweets have grabbed the attention of the public and led to their political popularity? This is another trend that I will monitor throughout the general election. It may reveal that news stations are bringing awareness to a candidate’s campaign even when they are reporting on the controversies surrounding the candidate.

To learn more about the process I used for this analysis have a look at:

Read all of the Presidential Campaign findings here.

Image: Vince Alongi.
