11/1/2016

How to tell if your data has been jeopardized by a human

There is no worse way to screw up data than to let a single human type it in. For example, I once acquired the complete dog licensing database for Cook County, Illinois. Instead of requiring the person registering their dog to choose a breed from a list, the creators of the system had simply given them a text field to type into. As a result, this database contained at least 250 spellings of Chihuahua. Even with the best tools available, data this messy can't be saved. It is effectively meaningless. It's not that important with dog data, but you don't want it happening with wounded soldiers or stock tickers. Beware human-entered data.
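
As a quick gut check, count how many distinct spellings a free-text field actually contains and group the near-duplicates. Below is a minimal sketch in Python; the file name dog_licenses.csv and the breed column are hypothetical stand-ins for whatever human-entered field you're inspecting.

```python
# A minimal sketch of a sanity check on a human-entered category field.
# "dog_licenses.csv" and the "breed" column are hypothetical names used
# for illustration; substitute your own file and column.
import csv
from collections import Counter

with open("dog_licenses.csv", newline="", encoding="utf-8") as f:
    breeds = [row["breed"].strip() for row in csv.DictReader(f)]

counts = Counter(breeds)
print(f"{len(counts)} distinct spellings across {len(breeds)} records")

# Group spellings that collapse to the same lowercase, punctuation-free key.
# Many variants under one key is a strong hint the field was free-typed.
variants_by_key = {}
for value in counts:
    key = "".join(ch for ch in value.lower() if ch.isalnum())
    variants_by_key.setdefault(key, []).append(value)

for key, variants in sorted(variants_by_key.items(), key=lambda kv: -len(kv[1])):
    if len(variants) > 1:
        print(f"{len(variants)} spellings that look like {key!r}: {variants[:5]}")
```

Grouping values by a lowercased, punctuation-free key is crude, but it surfaces clusters like the 250 Chihuahuas in seconds.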

Data has been manually edited

Manually edited data suffers from almost the same problems as data entered by humans, except that the editing happens after the fact, often with good intentions. In fact, data is often manually edited in an attempt to fix errors that were originally introduced by humans. Problems start to creep in when the person doing the editing doesn't have complete knowledge of the original data. I once saw someone spontaneously "correct" a name in a dataset from Smit to Smith. Was that person's name really Smith? I don't know, but I do know that value is now a problem. Without a record of that change, it's impossible to verify what it should be.

Issues with manual editing are one reason why you always want to ensure your data has well-documented provenance. A lack of provenance can be a good indication that someone may have monkeyed with it. Academics often get data from the government, monkey with it, and then redistribute it to journalists. Without any record of those changes, it's impossible to know whether they were justified. Whenever feasible, get the primary source, or at least the earliest version you can, and then do your own analysis from that.

Spelling is inconsistent

Spelling is one of the most obvious ways to tell if data has been compiled by hand. Don't just look at people's names; those are often the hardest place to detect spelling errors. Instead, look for places where city or state names aren't consistent. (Los Angelos is one very common mistake.) If you find those, you can be pretty sure the data was compiled or edited by hand, and that is always a reason to be skeptical of it. Data that has been edited by hand is the most likely to have mistakes. This doesn't mean you shouldn't use it, but you may need to manually correct those mistakes or otherwise account for them in your reporting.
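
One cheap check is to compare a column against a canonical list of allowed values and flag everything else. The sketch below is illustrative: the records, the state field, and the abbreviated VALID_STATES set are all made up, so substitute your own data and the full list.

```python
# A minimal sketch for spotting hand-typed place names. The records and
# the short VALID_STATES set are illustrative; use the full canonical list.
import difflib

VALID_STATES = {"Illinois", "California", "New York"}

records = [
    {"name": "Acme Corp", "state": "Illinois"},
    {"name": "Widgets LLC", "state": "Ilinois"},     # typo
    {"name": "Example Inc", "state": "CALIFORNIA"},  # inconsistent casing
]

for row in records:
    state = row["state"].strip()
    if state not in VALID_STATES:
        # Anything that isn't an exact match was probably typed by a person.
        guess = difflib.get_close_matches(state.title(), VALID_STATES, n=1)
        print(f"Suspicious value {state!r} in row {row['name']!r}, maybe {guess}")
```

Anything this flags was almost certainly typed or edited by hand, which is your cue to treat the rest of the file with the same skepticism.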

Name order is inconsistent

Does your data have Middle Eastern or East Asian names in it? Are you sure the surnames are always in the same place? Is it possible anyone in your dataset uses a mononym? These are the sorts of things that data creators habitually get wrong. If you're working with a list of ethnically diverse names—which is any list of names—then you should do at least a cursory review before assuming that joining the first_name and last_name columns will give you something that is appropriate to publish.
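
You can automate part of that cursory review by flagging rows that need a human look. The sketch below assumes hypothetical first_name and last_name columns and made-up rows; note that a surname stored in the wrong column will sail through any automated check, which is why the review can't be fully delegated to code.

```python
# A minimal sketch of a cursory review of name columns before publishing.
# The first_name/last_name fields and the example rows are all hypothetical.
people = [
    {"first_name": "Ana", "last_name": "García"},
    {"first_name": "Sukarno", "last_name": ""},               # mononym
    {"first_name": "Kim", "last_name": "Jong-un"},            # surname is actually Kim
    {"first_name": "Maria", "last_name": "da Silva Santos"},  # multi-part surname
]

for person in people:
    first = person["first_name"].strip()
    last = person["last_name"].strip()
    if not first or not last:
        print(f"Check by hand (possible mononym): {person}")
    elif len(first.split()) > 1 or len(last.split()) > 1:
        print(f"Check by hand (multi-part name): {person}")

# Note that the Kim Jong-un row passes both checks even though the family
# name is sitting in first_name, which is why code alone isn't enough.
```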

Date formats are inconsistent

Which date is in September:

10/9/15
9/10/15

If the first one was written by a European and the second one by an American, then they both are. Without knowing the history of the data, you can't know for sure. Know where your data came from and be sure that it was all created by folks from the same continent.
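
To see how quietly this bites, here is a minimal sketch using only Python's standard library: the same string parses without error under both conventions, so nothing in the data itself tells you which reading is right.

```python
# A minimal sketch of the day/month ambiguity, using only the standard library.
from datetime import datetime

raw = "10/9/15"

as_american = datetime.strptime(raw, "%m/%d/%y")   # month/day/year
as_european = datetime.strptime(raw, "%d/%m/%y")   # day/month/year

print(as_american.strftime("%B %d, %Y"))   # October 09, 2015
print(as_european.strftime("%B %d, %Y"))   # September 10, 2015
```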

This article was compiled from Quartz's Bad Data Guide, republished with permission. Read the full Guide here.

