10 tools that can help data journalists do better work, be more efficient


Originally published by Troy Thibodeaux on Poynter.org on 10 October 2011. This excerpt is republished with permission.


It’s hard to be equally good at all of the tasks that fall under data journalism. To make matters worse (or better, really), data journalists are discovering and applying new methods and tools all the time.

As a beginning data journalist, you’ll want to develop a sense of the tools others are using to do the work you admire. You won’t be able to learn them all at once, and you shouldn’t try. You should, however, develop a sort of ambient awareness of the tools in use (something like the knowledge Facebook gives you about the lives of your high-school classmates). Keep a list of tools to check out. Watch the demos and browse the documentation or code. Then, when your projects create the need, you’ll remember enough to get you started.

More immediately, though, choose one or two tools and make them part of your DNA. Pick a tool and wring from it everything you can. Read everything you can find about it. Learn every idiosyncrasy and optimization. Buy a coffee mug with the shortcut keys on it. Just be ready to pick up a new tool when you feel the pinch that says there must be an easier way. Below are 10 tools that are part of nearly every data journalist’s tool belt.


1. The spreadsheet

Almost every data journalist begins with the spreadsheet. (Disclosure: I’m an exception here, as are some other programmer-journalists. I learned to use spreadsheets to work with my colleagues who rely on them.)

The spreadsheet is a nearly universal data format, particularly if you save your data as a plain-text delimited file, such as a comma-separated values file. Everyone either has a commercial spreadsheet program already or can easily download a free one, and modern spreadsheet applications are remarkably versatile.

There are several sites and courses available to help you develop spreadsheet skills. Start with sorting, filtering and subtotals, and move on to more advanced formulas. As you learn to use formulas, try at times to type them in directly, rather than using the wizards. This practice will give you more intimate knowledge of the formulas you’re using, and it will also help you begin to express your ideas in code, which will come in handy as you pick up other tools.


2. SQL


After a while, you may begin to feel the pinch from the limitations of spreadsheets. Many data journalists move toward a relational database manager (e.g. SQLite, MySQL, PostgreSQL, Access) when they have more than two spreadsheets to join or very large data sets to query. SQL allows you to describe exactly the subset of data you want to extract or the exact changes you want to make, and it allows you to perform these queries across related data sets. You can also save your commands as a script, so you can document everything you’ve done with the data, and you can automatically repeat those steps on a future data set.

Pretty much every relational database program uses some flavor of SQL, so once you’ve learned the basics (a couple dozen key words and some punctuation), you can query databases in any number of systems, both free and commercial. Also, relational databases are frequently used to store the data in Web applications, so your knowledge of SQL can be directly useful in Web development.

Here’s a tutorial to get you started.


3. Data cleaning tools

All data sets are “dirty.” Repeat that to yourself three times whenever you open your laptop.

To clean the data and get it into a useful format, you’ll probably use a variety of tools. My favorite is Google Refine, which looks a bit like a spreadsheet but is meant for things like standardizing names so you can create reliable counts. (You may want “John Smith,” “Smith, John” and “John Q. Smith” to be counted as one person, for example, rather than three). Using Google Refine Expression Language, you’ll be able to do sophisticated data transformations, and you’ll take another step in expressing yourself in code. (Data Wrangler is a new tool with some functionality similar to Refine’s that is also worth checking out.)

Wrangler Demo Video from Stanford Visualization Group

You should also become aware of the tools in your operating system that can help manage files and the data within them. If you’re on OSX or Linux, you have sed, awk, grep and find. (There are ports for Windows, as well.) Using these utilities, you can begin to explore and massage your data without even bothering to open a spreadsheet or database program.

And while you’re looking at command-line tools, check out CSVKit, an amazing suite of tools — developed by journalists — that will help you work magic in that common format.


Read the full article on Poynter.org



Image credits: Patrick Powers