Getting Data: A Five Minute Field Guide
This post by Brian Boyer (Chicago Tribune), John Keefe (WNYC), Friedrich Lindenberg (Open Knowledge Foundation), Jane Park (Creative Commons), Chrys Wu (Hacks/Hackers), is an excerpt from the Data Journalism Handbook (chapter 4: Getting Data), freely available online under a Creative Commons Attribution-ShareAlike license.
Datacatalogs.org (Open Knowledge Foundation)
Looking for data on a particular topic or issue? Not sure what exists or where to find it? Don’t know where to start? In this section we look at how to get started with finding public data sources on the web.
Streamlining Your Search
While they may not always be easy to find, many databases on the web are indexed by search engines, whether the publisher intended this or not. Here are a few tips:
- When searching for data, make sure that you include both search terms relating to the content of the data you’re trying to find as well as some information on the format or source that you would expect it to be in. Google and other search engines allow you to search by file type. For example, you can look only for spreadsheets (by appending your search with ‘filetype:XLS filetype:CSV’), geodata (‘filetype:shp’), or database extracts (‘filetype:MDB, filetype:SQL, filetype:DB’). If you’re so inclined, you can even look for PDFs (‘filetype:pdf’).
- You can also search by part of a URL. Googling for ‘inurl:downloads filetype:xls’ will try to find all Excel files that have “downloads” in their web address (if you find a single download, it’s often worth just checking what other results exist for the same folder on the web server). You can also limit your search to only those results on a single domain name, by searching for, e.g. ‘site:agency.gov’.
- Another popular trick is not to search for content directly, but for places where bulk data may be available. For example, ‘site:agency.gov Directory Listing’ may give you some listings generated by the web server with easy access to raw files, while ‘site:agency.gov Database Download’ will look for intentionally created listings.
Browse Data Sites and Services
Over the last few years a number of dedicated data portals, data hubs and other data sites have appeared on the web. These are a good place to get acquainted with the kinds of data that is out there. For starters you might like to take a look at:
- Official data portals. The government’s willingness to release a given dataset will vary from country to country. A growing number of countries are launching data portals (inspired by the U.S.'s data.gov and the U.K.'s data.gov.uk) to promote the civic and commercial re-use of government information. An up-to-date, global index of such sites can be found at datacatalogs.org. Another handy site is the Guardian World Government Data, a meta search engine that includes many international government data catalogues.
- The Data Hub. A community-driven resource run by the Open Knowledge Foundation that makes it easy to find, share and reuse openly available sources of data, especially in ways that are machine-automated.
- ScraperWiki. An online tool to make the process of extracting "useful bits of data easier so they can be reused in other apps, or rummaged through by journalists and researchers." Most of the scrapers and their databases are public and can be re-used.
The World Bank and United Nations data portals provide high-level indicators for all countries, often for many years in the past.
A number of startups are emerging, that aim to build communities around data sharing and re-sale. This includes Buzzdata — a place to share and collaborate on private and public datasets — and data shops such as Infochimps and DataMarket.
DataCouch — A place to upload, refine, share & visualize your data.
An interesting Google subsidiary, Freebase, provides "an entity graph of people, places and things, built by a community that loves open data."
- Research data. There are numerous national and disciplinary aggregators of research data, such as the UK Data Archive. While there will be lots of data that is free at the point of access, there will also be much data that requires a subscription, or which cannot be reused or redistributed without asking permission first.
Ask a Forum
Search for existing answers or ask a question at Get The Data or on Quora. GetTheData is Q&A site where you can ask your data related questions, including where to find data relating to a particular issue, how to query or retrieve a particular data source, what tools to use to explore a data set in a visual way, how to cleanse data or get it into a format you can work with.
Ask a Mailing List
Mailing lists combine the wisdom of a whole community on a particular topic. For data journalists, the Data Driven Journalism List and the NICAR-L lists are excellent starting points. Both of these lists are filled with data journalists and Computer Assisted Reporting (CAR) geeks, who work on all kinds of projects. Chances are that someone may have done a story like yours, and may have an idea of where to start, if not a link to the data itself. You could also try Project Wombat (“a discussion list for difficult reference questions”), the Open Knowledge Foundation’s many mailing lists, mailing lists at theInfo, or searching for mailing lists on the topic, or in the region that you are interested in.
Hacks/Hackers is a rapidly expanding international grassroots journalism organization with dozens of chapters and thousands of members across four continents. Its mission is to create a network of journalists ("hacks") and technologists ("hackers") who rethink the future of news and information. With such a broad network — you stand a strong chance of someone knowing where to look for the thing you seek.
Ask an Expert
Professors, public servants and industry folks often know where to look. Call them. Email them. Accost them at events. Show up at their office. Ask nicely. “I’m doing a story on X. Where would I find this? Do you know who has this?”
Learn About Government IT
Understanding the technical and administrative context in which governments maintain their information is often helpful when trying to access data. Whether it’s CORDIS, COINS or THOMAS — big-acronym databases often become most useful once you understand a bit about their intended purpose.
Find government organizational charts and look for departments/units with a cross-cutting function (e.g. reporting, IT services), then explore their web sites. A lot of data is kept in multiple departments and while for one, a particular database may be their crown jewels, another may give it to you freely.
Look out for dynamic infographics on government sites. These are often powered by structured data sources/APIs that can be used independently (e.g. flight tracking applets, weather forecast java apps).
Search Again Using Phrases and Improbable Sets of Words You’ve Spotted Since Last Time
When you know more about what you are looking for, you may have a bit more luck with search engines!
Write an FOI Request
If you believe that a government body has the data you need, a Freedom of Information request may be your best tool. See next chapter for more information on how to file one.