From Commutes to Megaregions, with open data and open source software


The modern world is defined by connection and disconnection. We live in a global context where connectivity is everything, or so we’re told. Not being connected - to wifi, to friends, to music - is considered cruel and unusual punishment. In his recent book Connectography, Parag Khanna picked up on the concept of connectivity and wrote at length about its importance. Missing from this and other accounts, however, is an in-depth empirical analysis of exactly how places are connected or not, and of the strength of these connections.

In a recent study, Garrett Dash Nelson and I took up this theme by looking at commuting in the United States in order to explore economic connectivity at the ‘megaregion’ level. To do so, we used commuting data for more than 130 million people and 4 million individual point-to-point commutes. The results gave us a new view of the economic geography of the United States: one defined by connectivity. This isn’t just an intellectual curiosity; it can help us plan new infrastructure (e.g. high-speed rail), organise economic activities (e.g. in retail planning) and define the functional economic geography we live by, in contrast to the well-understood administrative boundaries we see on maps. A final map from our study is shown below - made using open data from the American Community Survey, open source software (QGIS and Combo), plus cloud computing (Amazon Web Services).


Image: Commuter-based megaregions of the United States (Garrett Dash Nelson and Alasdair Rae).

I often say, somewhat light-heartedly, that I have data journalism envy. But it’s actually true, because the frontiers of data journalism are where some of the most exciting developments in digital storytelling and data analysis are taking place. In many ways, as John Burn-Murdoch of the Financial Times has said, data journalism is ‘social science on a deadline’ and the best of it can have a massive impact in challenging, shaping or contesting our opinions on important topics.

I’ve collaborated on a few data journalism projects over the past few years, but the cold hard truth is that I’m actually an academic working at the University of Sheffield in the UK. Although I do teach some journalism students, I work in the Department of Urban Studies and Planning, and my work mainly focuses on the analysis of geodata - from the neighbourhood to national level. Using a geographical lens, I look at things like the uneven distribution of mortgage lending, or national commuting patterns. I scaled up my UK work to look at commuting in the United States in 2015, and published a working paper on it that summer. I also published the dataset and blogged my results, as I normally do.

This data was then picked up by Garrett Dash Nelson, then a PhD candidate at the University of Wisconsin-Madison. Garrett attempted to find ‘natural’ communities within the commute data, using an algorithm called Combo, developed by researchers at MIT. This method tries to group together the places that are most strongly connected to one another, based on the strength of the commuting ties between them rather than on geographical distance. The whole dataset was too big to work with, so he looked at Massachusetts on its own and then posted his results online (extract below).


Image: Natural boundaries of Massachusetts (Garrett Dash Nelson).
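
The idea of grouping places by the strength of their commuting ties, rather than by distance, can be illustrated with a toy example. The sketch below is not Combo and not the code used in the study; it assumes a modularity-style score, a common way of judging how well a grouping captures the connections in a network. The four places, the flow counts and the two candidate groupings are all invented for illustration.

```python
from collections import defaultdict


def modularity(flows, community):
    """Weighted modularity of a grouping of places.

    flows: dict mapping (place_a, place_b) -> number of commuters,
           treated as an undirected tie with no self-loops.
    community: dict mapping place -> group label.
    """
    strength = defaultdict(float)  # total flow touching each place
    within = defaultdict(float)    # flow that stays inside each group
    total = 0.0                    # total flow in the whole network

    for (a, b), commuters in flows.items():
        strength[a] += commuters
        strength[b] += commuters
        total += commuters
        if community[a] == community[b]:
            within[community[a]] += commuters

    group_strength = defaultdict(float)
    for place, s in strength.items():
        group_strength[community[place]] += s

    # Share of flow kept inside each group, minus the share expected
    # if the same flows were spread at random.
    return sum(
        within[g] / total - (group_strength[g] / (2 * total)) ** 2
        for g in group_strength
    )


# Hypothetical commuter counts between four invented places.
flows = {("A", "B"): 900, ("C", "D"): 800, ("B", "C"): 50}

# Grouping that follows the strong ties.
by_ties = {"A": 1, "B": 1, "C": 2, "D": 2}
# Grouping that pairs B and C, imagined here to be close neighbours.
by_adjacency = {"A": 1, "B": 2, "C": 2, "D": 3}

print("ties:", round(modularity(flows, by_ties), 3))
print("adjacency:", round(modularity(flows, by_adjacency), 3))
```

On this toy data the grouping that follows the strong commuting ties scores higher than the one based on adjacency. A community detection algorithm searches for the grouping with the best score across the whole network, which is far harder with tens of thousands of places.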

I contacted Garrett via Twitter and asked whether he wanted to work with me to attempt this analysis for the whole United States. After a few direct messages, e-mails, and an initial Skype call, we began. The only problem was that the commuting dataset we used had 74,000 origins and destinations. 74,000 x 74,000 is A LOT (about 5.5 billion possible origin-destination pairs) so it would require some serious computing power - beyond what either of us could access. So, for the serious number crunching we used Amazon Web Services cloud computing running Linux, at about $20 per day over 5 separate runs. Here’s what the data looked like when we mapped raw commute flows in QGIS. This looks quite interesting, but it’s effectively just a map of the urban areas of the United States, plus an indication of the main transport corridors that bind them together. To get a sense of the functional economic geography of the United States, we had to go a step further and use a ‘community detection algorithm’ - in our case we chose Combo as it is efficient and effective, and open source.


Image: Commuting patterns in the United States (Garrett Dash Nelson and Alasdair Rae).
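
The scale problem mentioned above can be made concrete with some rough arithmetic. The figures for places and commutes are the rounded ones quoted in this article; the 8 bytes per stored value is an assumption for illustration, not a description of how the data was actually held in memory.

```python
# Rounded figures quoted in the article.
N_PLACES = 74_000             # origins and destinations
OBSERVED_FLOWS = 4_000_000    # point-to-point commutes with data

# Assumption for illustration: each stored number takes 8 bytes.
BYTES_PER_VALUE = 8

# Every origin paired with every destination.
possible_pairs = N_PLACES * N_PLACES

# Memory for a full origin-destination table, in gigabytes.
dense_gb = possible_pairs * BYTES_PER_VALUE / 1e9

# Share of possible pairs that actually carry a commute flow.
share_observed = OBSERVED_FLOWS / possible_pairs

# Memory for a list of (origin, destination, commuters) rows, in megabytes.
sparse_mb = OBSERVED_FLOWS * 3 * BYTES_PER_VALUE / 1e6

print(f"possible pairs: {possible_pairs:,}")
print(f"full table: {dense_gb:.1f} GB")
print(f"pairs with a flow: {share_observed:.3%}")
print(f"flow list: {sparse_mb:.0f} MB")
```

Under these assumptions only a tiny fraction of the possible pairs carry any commuters, so the flows themselves are small to store. The hard part is the search over groupings of 74,000 places, which is what called for the extra computing power.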

As with all data projects, there was considerable cleaning up to do and lots of manual checking and validation, but after the fifth Combo iteration we arrived at a result we were happy with. In doing so, we were keen to stress that our belief in algorithms is not blind - they must be validated and assessed using our finely tuned human faculties as well, particularly if they are to have any purchase in the real world. Nonetheless, we were able to identify 50 separate functional economic zones - or megaregions - across the United States. These vary in size and scale but they make sense from a functional point of view - and when interpreted visually. Our final results are not perfect, of course, but we think they provide a good conversation starter for understanding the true nature of economic connectivity in the United States. The map below shows the different megaregions around the Twin Cities area in Minnesota, with different colours indicating separate regions. Following this, we can see the final results of our work, with the United States divided up into named megaregions. The only exception is the sparsely populated area to the northwest, where no large megaregions could be defined.


Image: Megaregions of the Twin Cities area (Garrett Dash Nelson and Alasdair Rae).


Image: Megaregions of the United States (Garrett Dash Nelson and Alasdair Rae).

When we published our work in the open access journal PLOS ONE on 30 November 2016, we thought it might be of interest to a few like-minded researchers, and hoped it would help people think about the different geographies of the United States. We’ve been slightly overwhelmed by the response, as the paper now has more than 100,000 views, has reached millions of people on Twitter and has an Altmetric Attention Score of more than 730 (anything above 20 is considered good in academic circles). It has also featured in stories on National Geographic, WIRED, CityLab, and the Washington Post. These write-ups, from some leading data journalists, really help bring home not only the message of our research but also the benefits of open science and open publishing. In a world where a popular academic paper in a traditional journal might receive 500 views, we are very pleased to see this kind of response.

One further piece of open publishing that we need to share here is that all the data - plus lots more maps - are available on Figshare - a data repository now essential for researchers who want to share their data and allow others to use it.

Garrett and I worked on this paper for about nine months, using open data from the American Community Survey, open source software (QGIS and Combo), plus a little paid-for cloud computing (Amazon Web Services). We wrote it collaboratively in Google Docs and had regular catch-up calls via Skype. But we have not yet met in person. That might be stage two of this project. Until then, we hope that readers will be interested in our project, the results, and how it can help change our understanding of the economic geography of the United States.


Image: Megaregions of the Midwest (Garrett Dash Nelson and Alasdair Rae).

Read the full research paper here.