15/4/2013

Setting the Record(s) Straight: Dealing with Bad Data

 

By Abbott Katz, London-based Excel instructor and freelance writer, author of Excel 2010 Made Simple and spreadsheetjournalism.com.

If you've ever received two enveloped pleas from one charity with an eye on your checkbook – and you probably have – you've taken a hit from a database that’s misfired, a loose cannon that’s been spraying surplus names all over your postal code. You know what's gone wrong, too: the database has listed you twice, either by sanctioning identical entries of your name and address, or by registering two variants of your name and/or place of residence, gulling the database into believing that you're in fact two different persons.

Either way, the point is made. Spreadsheet data redundancies and discrepancies need to be ruthlessly shown the door, lest they infiltrate the data and lay all your lovely, crafty formulas to waste. The need for data integrity is integral, a primordial piece of spreadsheet infrastructure without which your analytical engine is doomed to stall; and while infrastructure is boring, it’s necessary, too. So let’s consider three data integrity issues, all of which I’ve encountered in real-world spreadsheets, and see how they might be resolved.

Start with this very simple batch of data, representing say, serial contributions to a political campaign:

[Screenshot: sample contribution data]

Feed these into a pivot table, and you’ll get something like this:

[Screenshot: pivot table listing David Bowie twice]


Now that doesn’t look right, does it? You’ll find among the first principles of pivot tables the axiom that items gathered into the Row (or Column) Labels area appear once each – and Mr. Bowie’s gotten himself invited twice. What went wrong?

The problem – and again, I’ve bumped up against it more than once – is that the two Bowie entries were in fact respectively keyed:

David Bowie

David Bowie[space]

And to Excel those entries simply don’t match. Typing David Bowie[space] is, relative to David Bowie, tantamount to having typed Barack Obama – they’re two irretrievably separate items. The obvious workaround? Delete the space. That will work, but the remedy isn’t terribly practicable if you’re faced with 20,000 rows of names, all of which need to be vetted for precisely the same potential anomaly. The real remedy lies instead in the workings of a low-profile and concise Excel function called TRIM, which pares superfluous spaces from cell entries.

For example, say David Bowie[space] finds itself in cell A3. If so,

=TRIM(A3)

will return David Bowie, minus the trailing space. If you work proactively, then, and post TRIM in an unoccupied column, copy it down alongside the column’s worth of entries (in this case A), and finish the process with a Copy > Paste Special > Values back onto column A, your gratuitous-space problem should be cleared away – and David Bowie will haunt your pivot table just once.
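For readers working outside Excel, the same clean-up can be sketched in Python. This is a minimal stand-in for TRIM, which strips leading and trailing spaces and collapses internal runs of spaces to single spaces (note that Python’s `split()` also treats tabs and newlines as whitespace, which TRIM does not):

```python
# A rough Python analogue of Excel's TRIM: strip leading/trailing
# whitespace and collapse internal runs to a single space.
def trim(text):
    return " ".join(text.split())

names = ["David Bowie ", "David Bowie", "  David  Bowie"]
cleaned = [trim(n) for n in names]
print(cleaned)  # all three become "David Bowie"
```

Once trimmed, all three variants compare as identical – which is exactly what the pivot table needs.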

Next case. This one's the classic predicament, the one with which we opened this post. Here the data suffer from record duplication, a particular impediment to lists built atop unique entries, e.g., a membership roster or roll of university students. Here's another elementary example, which should nevertheless embody the problem:

[Screenshot: name list with duplicate records]

(Again, extrapolate to a list 20,000 names long.)

You've doubtless observed a complication here. Some records exhibit a two-field duplication, and others only one; and before you proceed you need to ask yourself exactly which duplicates you want ousted from the larger list. In this case it seems clear that you'll want to squelch only those records in which both fields are matched across two or more records – here, only Mary Walters. Again, that judgment won't come as easily with 20,000 students' worth of names, and that's where Excel's Remove Duplicates feature comes in. Click anywhere among the records, then click the Data tab > Remove Duplicates button in the Data Tools button group (again, I’m using Excel 2010). You'll see:

[Screenshot: Remove Duplicates dialog]

Tick the My data has headers box (I'd have written data have, but Microsoft didn't ask me), thus overriding the Column A and B entries above with First Name and Surname. Leave First Name and Surname ticked, because again we want to search for record duplication across both fields, an important point. Click OK and you'll see:

[Screenshot: Remove Duplicates confirmation message]

No grammar check there! But no matter; Remove Duplicates has done its job, having extirpated the redundant entry for Mary Walters.

Just keep in mind that a cogent implementation of Remove Duplicates requires a bit of a think-through. After all, a real-world university enrollment list might very well divulge any number of different students owning the same first name and surname, and so you'd need to stipulate additional duplicate-search parameters in order to snag truly redundant entries, e.g. by ticking First Name, Surname, and Address, assuming these fields feature in the data. Indeed, large organizations assign their members unique IDs precisely in order to impose one irreducibly unambiguous identifier upon each of them; think of UK National Insurance or American Social Security numbers, for example (and you're looking at someone who has both). On the other hand, of course, there's no definitive way for Excel to know whether the two instances of Mary Walters in my primitive list signify the same person or two different individuals; again, one assumes additional fields would settle the question in a real-world worksheet.
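The logic Remove Duplicates applies – drop any record whose chosen key fields exactly match an earlier record’s – can be sketched in a few lines of Python. This is an illustrative stand-in, not Excel’s actual implementation, and the field names are just those from my toy example:

```python
# A minimal stand-in for Excel's Remove Duplicates: keep the first
# occurrence of each key, drop later records whose chosen fields match.
def remove_duplicates(records, key_fields):
    seen = set()
    kept = []
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept

roster = [
    {"First Name": "Mary", "Surname": "Walters"},
    {"First Name": "Mary", "Surname": "Jones"},
    {"First Name": "Mary", "Surname": "Walters"},  # both fields duplicated
]
print(remove_duplicates(roster, ["First Name", "Surname"]))
# only the second Mary Walters is dropped
```

Widening or narrowing `key_fields` corresponds exactly to ticking or unticking columns in the dialog – the think-through about which fields constitute a true duplicate is the same.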

The third integrity issue is perhaps the most nettling, because it asks the most of the user/investigator – a good old eyeballing of the data, paired with equally retro data entry and re-entry chores. I’ll exemplify the issue by recalling a request I received from the Times Higher Education (UK) weekly to review some of their spreadsheet data in connection with their World University Rankings. I looked at ranking data for both 2011 and 2012, copying and pasting the data from the two years into one megasheet; and in the course of that perusal I began to notice that some universities were spelled differently across the two years, e.g.

University of California, Berkeley

University of California Berkeley

University of Illinois at Chicago

University of Illinois – Chicago

And yes, a lot of this:

University of Chicago

University of Chicago[space]

Now of course Excel is the paradigmatic idiot savant; it can sum 500,000 numbers at the tap of an Enter key, even as it lacks the human, inferential savvy to declare the above pairs equivalent.

What to do? Here the problem is the absence of redundancy among the pairs, and in order to achieve the necessary twinnings you could first apply the TRIM stratagem I recommended earlier. Next, you could sort the university names, closely observe how neighboring names are spelled, and decide which variant should prevail whenever a university presents itself with disparate spellings. You could also run the names through a pivot table and assign the institution name field to the Row Labels area, where again each uniquely spelled name will appear but once – and here, a doubled appearance means you have a problem. You don’t want to see Harvard’s name twice in the pivot results; you want to see it once, even as it appears twice in the data – once for 2011, once for 2012. But whatever the means for trapping errant spellings, you’re going to have to do some quaint 20th-century editing and retyping.
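The trap-the-variants step can also be sketched programmatically. A rough Python sketch, assuming (as in my megasheet) that every university should appear exactly twice – once per year – counts occurrences after trimming and flags any name that doesn’t hit that count:

```python
from collections import Counter

# Names pasted from two years' rankings; each university should
# appear exactly twice. Any other count betrays a variant spelling.
names = [
    "University of California, Berkeley",
    "University of California Berkeley",
    "University of Chicago",
    "University of Chicago ",
]
counts = Counter(" ".join(n.split()) for n in names)  # TRIM first
suspects = sorted(n for n, c in counts.items() if c != 2)
print(suspects)  # the two Berkeley spellings survive; Chicago is cleared
```

Note that trimming alone reconciles the Chicago pair, while the Berkeley pair – a comma, not a space – still demands that quaint human editing.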

Something that charity didn’t do. And that’s why they mailed you twice.
