Retrieve and archive Twitter data.

twarc is a command line tool and Python library for archiving Twitter JSON data.

twarc runs in a number of different modes, which are each calibrated to work within the Twitter API's rate limits:

  • Search: retrieves tweets from Twitter's search API that match a particular keyword, hashtag or other query.
  • Filter stream: uses Twitter's filter stream API to collect tweets that match a selection of keyword, user and location filters.
  • Sample stream: listens to Twitter's sample stream API for a random sample of public statuses.
  • User timeline: collects the most recent tweets posted by a particular user through Twitter's user timeline API.
  • User lookup: uses Twitter's user lookup API to collect fully hydrated user objects for up to 100 users per request, specified by a list of one or more screen names.
  • Hydrate: reads a list of tweet identifiers, uses Twitter's lookup API to fetch the full JSON for each one, and writes it to stdout as line-oriented JSON.

Unlike twarc's other modes, which focus on retrieving new data, hydrate supports the reproducibility of data analysis: researchers can share a list of tweet identifiers for others to hydrate, rather than making the raw data available in breach of the Twitter API's terms of service.
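
The list of identifiers that hydrate consumes can be produced from an existing collection. The following is a minimal sketch, assuming each input line holds one tweet's JSON with the `id_str` field from Twitter's tweet format. The function name `extract_tweet_ids` and the sample tweets are hypothetical and are not part of twarc.

```python
import io
import json


def extract_tweet_ids(lines):
    """Yield the tweet ID from each line of line-oriented tweet JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # id_str is the string form of the tweet ID in Twitter's tweet JSON.
        yield tweet["id_str"]


# Two made-up tweets standing in for a file of collected data.
sample = io.StringIO(
    '{"id_str": "1001", "text": "first example"}\n'
    '{"id_str": "1002", "text": "second example"}\n'
)
ids = list(extract_tweet_ids(sample))
print("\n".join(ids))
```

Writing one identifier per line gives a file that can be shared in place of the tweets themselves.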

In addition to retrieving tweets in real time, twarc can be used as a library to periodically collect data matching a particular search query. This is especially valuable because collection can be scheduled to fit within the Twitter API's seven-day search window.
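
One pass of such a scheduled collection might look like the sketch below. It assumes a search callable that takes a query and a `since_id` keyword; twarc's `Twarc.search` is believed to have this shape, but that should be checked against the version in use. The names `collect_new_tweets` and `fake_search` are hypothetical, and `fake_search` only stands in for a real API call so the example runs offline. The scheduler itself (cron or similar) is left out.

```python
def collect_new_tweets(search, query, since_id=None):
    """Run one collection pass; return (tweets, newest tweet ID seen)."""
    tweets = list(search(query, since_id=since_id))
    newest = since_id
    for tweet in tweets:
        # Tweet IDs grow over time, so the largest ID is the newest tweet.
        if newest is None or int(tweet["id_str"]) > int(newest):
            newest = tweet["id_str"]
    return tweets, newest


def fake_search(query, since_id=None):
    """Hypothetical stand-in for a real search call."""
    data = [{"id_str": "103"}, {"id_str": "102"}, {"id_str": "101"}]
    floor = int(since_id) if since_id is not None else 0
    return [t for t in data if int(t["id_str"]) > floor]


# First pass collects everything available; the marker records the newest ID.
first_pass, marker = collect_new_tweets(fake_search, "#elxn42")
# A later pass asks only for tweets newer than the marker.
second_pass, marker2 = collect_new_tweets(fake_search, "#elxn42", since_id=marker)
```

Running a pass more often than once every seven days, and handing each pass the marker from the previous one, is one way to avoid both gaps and duplicates.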

A case study: #elxn42 and the 2015 Canadian Election

Using twarc, Nick Ruest and Ian Milligan captured and analyzed about 4 million tweets during the 2015 Canadian Federal Election, referred to on Twitter as #elxn42.

Data was collected using twarc's search and stream modes, focusing on tweets related to #elxn42.

After the initial collection of #elxn42 tweets, Ruest and Milligan used twarc to run additional queries against the corpus and examine the election's Twitter presence through retweets, geotags, URLs, and more.

By identifying the most tweeted URLs, for instance, it is possible to see which information sources most prominently informed the Twitter conversation.
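
A ranking of this kind can be computed from line-oriented tweet JSON. The following is a minimal sketch, not the method Ruest and Milligan used, assuming each tweet carries its links under `entities.urls` with an `expanded_url` field, as in Twitter's tweet format. The function name `top_urls` and the sample tweets are hypothetical.

```python
import json
from collections import Counter


def top_urls(lines, n=10):
    """Return the n most frequently tweeted URLs as (url, count) pairs."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        for url in tweet.get("entities", {}).get("urls", []):
            # expanded_url is the full link behind Twitter's shortened URL.
            expanded = url.get("expanded_url")
            if expanded:
                counts[expanded] += 1
    return counts.most_common(n)


# Made-up tweets standing in for a collected corpus.
sample_lines = [
    json.dumps({"id_str": "1", "entities": {"urls": [{"expanded_url": "http://example.com/a"}]}}),
    json.dumps({"id_str": "2", "entities": {"urls": [{"expanded_url": "http://example.com/a"}]}}),
    json.dumps({"id_str": "3", "entities": {"urls": [{"expanded_url": "http://example.com/b"}]}}),
    json.dumps({"id_str": "4", "entities": {"urls": []}}),
]
result = top_urls(sample_lines, n=2)
print(result)
```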


Image: Top 10 URLs tweeted.

"From this we can get a sense of how social media shapes what people share", write Ruest and Milligan, "although legacy media was surprisingly well-represented in the Canadian context: the Canadian Broadcasting Corporation (especially their election day dashboard), the two highest-circulation newspapers the Globe and Mail and Toronto Star, and popular television networks CTV and Global News. While the Huffington Post’s Canadian edition made an appearance, we were surprised by the degree to which traditional media dominated."

Likewise, twarc can help journalists investigate stories by collecting Twitter conversations and narrowing them down with its command line utilities.

For more information, visit the twarc GitHub page.