31/10/2011

How to scrape Toronto data: a basic tutorial

 

By Momoko Price. Originally published on the BuzzData blog. This post is republished with permission.

 

Since my last tutorial on graphing Toronto’s water usage data, a few others and I have noticed that water accounts in Toronto don’t indicate the number of people using them (see comment here). As a result, visualizing water usage per account (which we did last time) may give an inaccurate picture of water consumption habits per ward.

kat_water.jpg

A better way to determine comparable water use habits per ward would be to compare water consumption per person per ward (kittens not included).

To do this, we need two kinds of data:

  1. Total water consumption per ward
  2. Total population per ward

 

By dividing 1. by 2. for each ward, we would get a reasonable estimate of water consumption per person in each ward (assuming that population size has not fluctuated too much from year to year).
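As a quick illustration of that division, here is a minimal Python sketch. The ward numbers and figures are made up for the example (they are not real Toronto data), and the names `water_by_ward`, `population_by_ward` and `per_person` are placeholders.

```python
# Hypothetical figures for illustration only; NOT real Toronto data.
water_by_ward = {1: 5000000.0, 2: 4200000.0}  # total consumption per ward
population_by_ward = {1: 50000, 2: 60000}     # total population per ward


def per_person(consumption, population):
    """Divide each ward's total consumption by its population."""
    return {
        ward: consumption[ward] / population[ward]
        for ward in consumption
        # Skip wards with no (or zero) population to avoid bad divisions.
        if ward in population and population[ward] > 0
    }


usage_per_person = per_person(water_by_ward, population_by_ward)
```

A dictionary keyed by ward number keeps the two datasets lined up even if they arrive in a different order.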

The population size of each ward is not included in our water billing dataset. We need to get that data elsewhere. Specifically, we need census data!

It turns out Toronto’s ward population data from the 2006 Statistics Canada census is on the City of Toronto website. Unfortunately, the data is not available as a downloadable dataset; instead, it is spread across separate ward profile webpages. Blerg!

toronto_map.jpg

UPDATE: the awesome peeps at toronto.ca/open (who are also on BuzzData) just let me know that population data (and all kinds of other cool data) is available at map.toronto.ca/wellbeing. You don’t need to code and can mix, match and export all kinds of different indicators (screenshot above).

But if you still want to learn to scrape, which is actually kind of fun, keep reading!

 

WEB SCRAPING: WELCOME TO GEEK TERRITORY!

Since there are only 44 wards, one option is to manually copy and paste the ward names and population sizes into your own dataset in Excel. It wouldn’t take you too long, but it would be kind of annoying.

Another option is to attempt to “scrape” the data, meaning to code a script to copy the parts of the page text that you want and organize them into your own datafile. Because of the simple structure of Toronto’s websites, this is a perfect opportunity to learn how to do this. Let’s get started.

 

FIRST THINGS FIRST: PROGRAM INSTALLS

First, you need to make sure you have the right programming language installed on your computer so that when you “run” the script we’re going to write, your computer can read it. We’re going to write our script in a popular language called Python. To install it, follow the instructions for your computer’s particular OS (i.e., your version of Windows, Mac or Linux) here.

If you’re not used to installing programs on your computer (especially Windows computers), you can run into the occasional snag and get stressed out. Don’t. Google is your friend. Most of the time you can find a web page where whatever issue you’re dealing with has been discussed and resolved. For example, here’s a good step-by-step guide on installing Python for different OSes.

Next, you need to install a program called Beautiful Soup; it parses (i.e., reads) webpages for you. Find the installation directions here, and make sure to save the “BeautifulSoup.py” file in the same directory where you plan to save the script file. (Again, if you run into problems, Google is your friend! Don’t give up.)

Last, you need a good text editor for coding, if you don’t already have one. A sweet free one is Sublime Text 2 (still in beta). It comes with pretty colours. Install that, too!

Phew. That’s a lot of installs. Sorry about that. Now the fun starts.

Open Sublime Text 2 and save the open document as “ward_population.py,” and make sure it’s in the same folder as “BeautifulSoup.py.”

 

EXAMINING THE URL AND WEBPAGE STRUCTURE

Now open a web browser and go to this URL:

http://app.toronto.ca/wards/jsp/wards.jsp

Scroll down and you’ll see the ward profiles listed. Click on the Ward 1 list item. Can you find where the ward’s 2006 population is written? Take note of that.

Now let’s check a different ward profile and see if the population data is listed similarly. To do this, *don’t* just click to a different page. Check the original URL:

http://www.toronto.ca/wards2000/ward1.htm

Hm. Okay, what happens if you change it to …

http://www.toronto.ca/wards2000/ward2.htm ?

Cool! Looks like every ward profile URL is structured the exact same way! The only thing that changes in it is its ward number. This is going to be very useful for when we write our script.
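Because only the ward number changes, the full list of profile URLs can be generated rather than typed out. A minimal sketch, assuming the pattern seen for Wards 1 and 2 holds for all 44 wards (the names `BASE_URL` and `ward_urls` are illustrative):

```python
# URL pattern observed for Wards 1 and 2; assumed to hold for every ward.
BASE_URL = "http://www.toronto.ca/wards2000/ward{}.htm"


def ward_urls(first=1, last=44):
    """Build one profile URL per ward number, inclusive of both ends."""
    return [BASE_URL.format(number) for number in range(first, last + 1)]


urls = ward_urls()
```

The script can then loop over `urls` and apply the same extraction steps to each page.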

Find out where the population for Ward 2 is listed. Is it structured the same as Ward 1? Yes. This is also very important.

Now we have to start writing out our script. Go back to the Sublime Text document you saved called “ward_population.py.” Open it up.

 

TIME TO CODE!

The first thing we have to do is import the right libraries with the right methods into our script so that your computer knows what to do when it comes across specific terms in your code. Sound complicated? It’s not. All you do is write the following:

import urllib2

from BeautifulSoup import BeautifulSoup

Not too hard. Okay, now to get to the fun part. If you follow these next clips, I’ll walk you through a translation of each line of code (as best I can), one line at a time, and show you how it relates to the webpage you want to scrape. (Apologies if I’m less than eloquent. I’m new to this and was surprised at how hard it is to explain code verbally!) And make sure to change these videos to full-screen, otherwise you’ll struggle to see what I’m talking about.

Let’s go through the first few steps:

 

Following so far? Awesome. Now we need to take a look at the actual source code of the ward profile pages we’re going to scrape. Here we go:
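If you'd rather pull up a page's source from Python instead of the browser, something like the following could work. This is a rough sketch in current Python 3, where `urllib.request` takes the place of the `urllib2` import used in this tutorial. The function name `fetch_html` and the 10-second timeout are assumptions, and the 2011 ward URLs may no longer resolve.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download a page and return its source as text."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset the server declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `print(fetch_html("http://www.toronto.ca/wards2000/ward1.htm"))` would then dump the raw source to the terminal, assuming the page is still online.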

 

Make sense? I hope so. Let’s take it back to the script and see how exactly to code the extraction of the snippet we want.
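To give a feel for the extraction step in text form, here is a minimal sketch. It swaps Beautiful Soup for the standard library's `html.parser` so that it runs without extra installs, so it is not the script from the clips. The sample HTML, the sentence wording and the number in it are all hypothetical; the real ward pages may be laid out differently.

```python
from html.parser import HTMLParser

# Hypothetical page source; the real ward profile markup may differ.
SAMPLE_HTML = """
<html><body>
<h1>Ward 1 Profile</h1>
<p>Welcome to the ward profile.</p>
<p>The 2006 population of this ward was 12,345 people.</p>
</body></html>
"""


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element on a page."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None means "not currently inside a <p>"

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def find_snippet(html, keyword="population"):
    """Return the first paragraph mentioning the keyword, or None."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    for text in parser.paragraphs:
        if keyword in text.lower():
            return text
    return None


snippet = find_snippet(SAMPLE_HTML)
```

Searching by keyword rather than by position means the same function can be reused on every ward page, as long as each page words its population sentence similarly.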

 

Phew! We’re close now. Now we use a nifty little Python method called ‘split’ to break up the sentence and pick out only the words we want.

(Note: this particular video clip was made using an older version of the script, so there are a few comments included that don’t apply. Disregard lines 17, 18 and 30!)
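In text form, the `split` step might look roughly like this. The sentence is a made-up stand-in for whatever the ward page actually says, and locating the number as the word after “was” is an assumption that only fits this sample wording.

```python
# Hypothetical sentence; the wording on the real ward pages may differ.
sentence = "The 2006 population of this ward was 12,345 people."


def population_from_sentence(text):
    """Pick the number out of the sentence and convert it to an integer."""
    words = text.split()  # break the sentence on whitespace
    raw_number = words[words.index("was") + 1]  # the word right after "was"
    return int(raw_number.replace(",", ""))  # drop the thousands separator


population = population_from_sentence(sentence)
```

Stripping the comma before calling `int` matters, because `int("12,345")` raises a `ValueError`.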

 

Finally we come to the really fun part: running the script and publishing our dataset!
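For reference, saving the scraped values as a dataset could be sketched like this with Python's built-in `csv` module. The rows, the column names and the helper `write_rows` are hypothetical, and the example writes to an in-memory buffer so it can run anywhere.

```python
import csv
import io


def write_rows(rows, out):
    """Write a header line followed by one (ward, population) line per row."""
    writer = csv.writer(out)
    writer.writerow(["ward", "population_2006"])
    writer.writerows(rows)


# Hypothetical scraped values; NOT real census figures.
rows = [(1, 12345), (2, 23456)]

buffer = io.StringIO()
write_rows(rows, buffer)
csv_text = buffer.getvalue()
```

To produce a real file instead, pass `open("ward_population.csv", "w", newline="")` in place of the buffer (the filename is just an example).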

TA-DAAAAAAAAH!

Okay, wait, so why did we want this data again? Oh yeah! To visualize average water consumption per person per ward. Well, I think you can handle that on your own at this point, don’t you? Try graphing a bar chart in Excel of water consumption per person per ward in 2006. Which wards stand out?

 

NEXT UP: we’ll do some GIS mapping (sorry I didn’t get to it this week; the opportunity to demonstrate basic scraping as part of an existing project was one I didn’t want to skip.)

Want to learn more? Here are some helpful references to follow up with:

 

AMAZINGLY ACCESSIBLE DATA-VIZ BOOK:

Visualize This by Nathan Yau (guy behind the FlowingData blog)

(I started learning how to scrape using Python by reading this book, and followed Yau’s general approach — with permission — while coding the above script. I can’t say enough good things about VT if you’re new to coding and want to make visualizations. Fantastic book.)

 

COMMAND LINE SHORTCUT CHEAT SHEET:

http://lifehacker.com/5743814/become-a-command-line-ninja-with-these-time+saving-shortcuts

 

POPULAR “SOMEBODY-FOR-THE-LOVE-OF-GOD-HELP-ME” Q&A SITE:

http://stackoverflow.com/

(Lots of good questions and solutions about BeautifulSoup in there)

 

Teaser image by: Texas Web Developers.
