What I Learned about Data this Week


While thinking about this blog and the best way of learning the basics of data journalism, I was re-reading Paul Bradshaw’s excellent (and essential) e-book Scraping for Journalists, where he says “the best advice for anyone seeking to learn scraping or data journalism is this: find a problem to solve first.” Quite right. But that may presuppose that you have time to find and solve the problem before your editor starts making dark threats and reminding you of the impending deadline. Perhaps you haven’t yet overcome the fear of the spreadsheet. (You may of course have been sent a spreadsheet which you didn’t ask for, and are now wondering what to do next.)

I suspect most of the growing band of data journalism trainers have their own favourite spreadsheet on which to break people in gently. I use the summary of donations to UK political parties at the 2010 General Election. It’s not big enough to be intimidating, the data is pretty clean, and there are some fun questions to be answered. When training in the classroom I soon found that it was more fun to let people loose on the questions than to show them the answers first. After the session they are given a copy to work on at home, so they can refresh their memory of how they got their answers (one of several bits of good practice I learned at the NICAR bootcamp and wished I’d thought of first!).

The questions and suggested methodology are given below if you want to have a go now. Answers next week.

Another great way of getting started is to look at the Guardian’s Datablog; choose a story that interests you, but don’t read more than the headline. Download the data – there’s always a link to it. Then see what stories you can find in the data (including the one the Guardian wrote) before returning to their blog to see how your findings compare with theirs.

Finally, a useful reminder of two things I did this week. Working in Serbia, and briefing some young Serbian journalists about the possibilities of data, I used my new favourite Google operator, “inurl:hrvatska”, to find the word meaning Croatia in any URL on the Serbian domain (site:rs), and limited the results to filetype:xls. The shortish list of results included a complete list of projects building cooperation between Serbia and Croatia, with the names of the experts on both sides of the border. Then we filtered using Excel’s text filter, which works well in an inflected language like Serbian, where names change their endings according to context: you can simply ask for text containing the key part of the name, which does not change. So Split can be Splita or Splitu, but you can search for anything containing the string “Split”. There wasn’t a single number in the spreadsheet, just words. Another useful reminder that data journalism doesn’t have to be scarily numerical!
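If you would rather try the same two tricks outside Google and Excel, here is a minimal Python sketch of them. The function names (build_query, contains_filter), the "title" column and the three sample rows are all invented for illustration; they are not the contents of the real spreadsheet. The filter ignores upper and lower case, which is my assumption about how a "contains" text filter is usually expected to behave.

```python
def build_query(keyword, domain, filetype):
    """Combine the three search operators from the text into one query string."""
    return f"inurl:{keyword} site:{domain} filetype:{filetype}"


def contains_filter(rows, column, stem):
    """Keep rows whose text in `column` contains `stem`, ignoring case."""
    needle = stem.casefold()
    return [row for row in rows if needle in row[column].casefold()]


# Invented placeholder rows, NOT taken from the real spreadsheet.
projects = [
    {"title": "Ekspert iz Splita"},
    {"title": "Radionica u Splitu"},
    {"title": "Seminar u Zagrebu"},
]

query = build_query("hrvatska", "rs", "xls")
matches = contains_filter(projects, "title", "Split")

print(query)
for row in matches:
    print(row["title"])
```

Because this is plain substring matching, it will also catch any unrelated word that happens to contain the same letters, so the results still need a human read-through.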

Questions about the political donations spreadsheet:

  1. What was the largest single donation?
  2. Who donated it? Which party received it?
  3. Did trade unions give only to Labour?
  4. What was the total value of donations for the election?
  5. What was the biggest non-cash donation? Who received it? How much money was involved in donating helicopter flights?
  6. Which party got the most? Which party saw a sudden increase in donations in a single week?
  7. Did anyone donate more than once?
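
For anyone who prefers code to a spreadsheet, several of these questions can be sketched in a few lines of Python using only the standard library. This is one possible approach rather than the method used in the training sessions. The column names (donor, donor_type, party, value, nature) and every row below are invented placeholders, not the real 2010 figures, so you would need to adapt them to the actual file.

```python
import csv
import io
from collections import Counter, defaultdict

# Invented placeholder rows, NOT the real 2010 data; column names are assumptions.
SAMPLE_CSV = """donor,donor_type,party,value,nature
Donor A,Individual,Party X,5000,Cash
Donor B,Trade Union,Party Y,20000,Cash
Donor A,Individual,Party X,2500,Cash
Donor C,Company,Party X,12000,Non Cash
Donor B,Trade Union,Party Z,1000,Cash
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, converting `value` to a number."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["value"] = float(row["value"])
    return rows


def largest_donation(rows):
    """Questions 1 and 2: the row with the highest value."""
    return max(rows, key=lambda row: row["value"])


def total_value(rows):
    """Question 4: the sum of all donations."""
    return sum(row["value"] for row in rows)


def totals_by_party(rows):
    """Question 6 (first part): total received by each party."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["party"]] += row["value"]
    return dict(totals)


def parties_funded_by(rows, donor_type):
    """Question 3: which parties received money from a given donor type."""
    return sorted({row["party"] for row in rows if row["donor_type"] == donor_type})


def repeat_donors(rows):
    """Question 7: donors who appear on more than one row."""
    counts = Counter(row["donor"] for row in rows)
    return sorted(donor for donor, n in counts.items() if n > 1)


rows = load_rows(SAMPLE_CSV)
print(largest_donation(rows))
print(total_value(rows))
print(totals_by_party(rows))
print(parties_funded_by(rows, "Trade Union"))
print(repeat_donors(rows))
```

For question 5, the same largest_donation function can be run on just the rows whose nature column is not cash. Question 7 relies on donor names being spelled identically every time, which is worth checking by eye before trusting the count.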