Why data is stupid (and how to write about it better)


Flip to any political talk show — any local news segment, Wall Street Journal piece, or angry Facebook comments section — and you’ll inevitably see pundits fire off shoddy data in hopeless attempts to settle disagreements.

It’s no surprise why. As data-driven thinking proliferates, people (myself included) seem to think that they can win any argument by referencing numbers.

But even when it comes from reputable news giants or research institutions, this data usually isn’t perfect (or good, or valid, or fair). Why? Because data is stupid. Statistics are usually misleading. They’re often calculated incorrectly, based on unrepresentative samples, and lacking in crucial context.

And yet we wield data blindly and unabashedly, as trump cards and be-all-end-alls, as if we were the rare informed oracles of the universe who have special insight that others are just too dumb to share.

In today’s world, knowing a statistic is often a falsehood — an illusion. Even if you know that 31% of all Americans have never been skiing or own pets or are illiterate, you should also know something else:

It’s likely that most, if not all, of the statistics you think you know are not what you think.

In other words, they are wrong.

Image: Thanks to advances in sensors, surveys, and software, data is easier than ever to gather and visualize in beautiful ways — which also makes it easier to trust. But that’s not necessarily a good thing.

Below are a few examples: statistics that I pulled from ostensibly reliable sources, plus some thoughts on why they’re dumb:

Nuts May Lower Your Risk for Heart Disease

It’s 2 a.m. You should’ve fallen asleep hours ago. But you stumble upon an NY Times article claiming that nuts may lower your risk for heart disease. A good rule of thumb when you see causal claims about nutrition — like Chocolate Cures Cancer or Nuts Lower Risk of Heart Disease — is to close the article in your browser and then burn your computer.

The article states that “…the more nuts of all kinds that people ate, the lower their risk for cardiovascular disease…” Diving into the study itself confirms that this is indeed what the researchers found, though it’s how they found it (and how the NY Times reported it) that has issues.

The Times boasts that hundreds of thousands of people (n = 210,836) participated in the study. More participants means better data, right? No, not if the data was collected poorly.
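A toy simulation makes the point (all the numbers here are invented, not from the study): if self-reports carry a systematic bias, averaging more of them just converges on the wrong answer.

```python
import random

random.seed(0)

TRUE_MEAN = 5.0  # hypothetical true nut servings per week
BIAS = 2.0       # hypothetical: everyone over-reports by 2 servings

def survey(n):
    """Average n noisy, systematically biased self-reports."""
    reports = [random.gauss(TRUE_MEAN + BIAS, 3.0) for _ in range(n)]
    return sum(reports) / n

for n in (100, 10_000, 210_836):
    print(f"n = {n:>7}: estimated mean = {survey(n):.2f}")

# The estimate settles near 7.0, not the true 5.0: a bigger sample
# shrinks the random noise but never touches the systematic bias.
```

A bigger n buys you precision, not accuracy. If the questionnaire itself is unreliable, n = 210,836 just means you’re precisely wrong.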

In the study, the researchers measured nut consumption by giving participants a “food frequency questionnaire” every four years. Four years? I don’t remember what I ate for lunch yesterday, so I certainly wouldn’t trust my guesstimate of how frequently I’d been eating walnuts over the past four years.

I imagine, too, that many of the study participants did their best to blaze through the questionnaire as quickly as possible so that they could take their payment and get back to their busy lives. This assumed carelessness confounds the data even more.

Even if you make the insane assumption that the data is totally valid, you still aren’t allowed to conclude that nuts lower your risk of heart disease. The researchers merely found an association (that is, a correlation), not causation. It’s very possible that healthy people are simply more likely to eat nuts, and healthy people don’t get heart disease as often as unhealthy people. So it’s not that the nuts caused the health; if anything, the health led to the nuts.
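You can watch this confounding story play out in a quick simulation (all the probabilities are made up for illustration): healthiness drives both nut eating and disease risk, nuts themselves do nothing, and yet nut eaters still come out looking healthier.

```python
import random

random.seed(1)

# Toy model (numbers invented): "healthiness" drives BOTH nut eating
# and heart-disease risk; nuts themselves have zero causal effect.
people = []
for _ in range(100_000):
    healthy = random.random() < 0.5
    eats_nuts = random.random() < (0.7 if healthy else 0.3)  # healthy folks eat more nuts
    disease = random.random() < (0.05 if healthy else 0.15)  # and get less heart disease
    people.append((eats_nuts, disease))

def disease_rate(group):
    return sum(d for _, d in group) / len(group)

nut_eaters = [p for p in people if p[0]]
abstainers = [p for p in people if not p[0]]

print(f"heart disease among nut eaters: {disease_rate(nut_eaters):.1%}")
print(f"heart disease among abstainers: {disease_rate(abstainers):.1%}")
# Nut eaters show a markedly lower rate, even though eating nuts
# changes nobody's risk in this model. The correlation is real;
# the causation is entirely absent.
```

This is exactly the pattern a food frequency questionnaire would pick up, and exactly the pattern a headline would then mistranslate into “nuts lower your risk.”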

While the language in the Times article suggests causation, a less misleading report would emphasize both the data collection method (in this case, a self-reported survey) and the correlational nature of the study. A suggested (though not perfect) rewrite: “In a long-running survey, people who reported eating more nuts also had lower rates of cardiovascular disease.”

When it comes to bosses, Americans actually no longer prefer a man over a woman

Though the title of this recent CNBC article may suggest that centuries of deep-rooted gender biases have disappeared overnight, this claim (and the data it rests on) is untenable garbage.

A single-sentence summary: Gallup asked 1,028 Americans — fewer than half of whom were employed — if they preferred to work for a man or a woman, and 55% said they had no preference. The conclusion made by the researchers? Americans must no longer prefer a man over a woman.

To gather data, Gallup asked participants the following question: “If you were taking a new job and had your choice of a boss, would you prefer to work for a man or a woman?” Here are three reasons why this is a silly way to learn about preferences:

1. People lie. A lot.

It’s 2017. Many people know that they shouldn’t prefer male bosses over female bosses, whether because they think such a preference is immoral, taboo, or both.

Because the phone survey is conducted by another human and measures a socially sensitive preference, people are pressured not to tell the truth. What would the interviewer think if I told her that I preferred a male boss?

Social scientists call this the social desirability bias, and it can ruin any survey where people may feel uncomfortable being honest.

Respondents were probably even more likely to lie given the timing of the survey: people were polled at the beginning of November 2017, just as a string of high-profile sexual assault cases (for example, the Harvey Weinstein scandal) broke in the news, propagating a strong warning across the country that gendered oppression won’t be tolerated.
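Some back-of-envelope arithmetic shows how strongly this bias can distort a headline number. None of these figures are Gallup’s; they are hypothetical, chosen purely to illustrate the mechanism.

```python
# All numbers below are hypothetical, chosen only to show the mechanism.
true_no_pref = 0.35           # share of people truly indifferent
true_has_pref = 1 - true_no_pref
claims_neutral_anyway = 0.30  # share of preference-holders who, under social
                              # pressure, tell the interviewer "no preference"

observed_no_pref = true_no_pref + true_has_pref * claims_neutral_anyway
print(f"true 'no preference':     {true_no_pref:.1%}")
print(f"observed 'no preference': {observed_no_pref:.1%}")
# A 35.0% true rate shows up as 54.5% in the survey: modest social
# pressure is enough to manufacture an apparent majority.
```

Survey designers have workarounds for this (list experiments, implicit-association measures), but a direct question over the phone can’t distinguish genuine indifference from polite indifference.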

2. People also lie subconsciously

That’s because preferences come in two flavors: explicit preferences, which are the ones you think you have, and implicit preferences, which are the ones you actually have, even if you don’t know it.

Examples are proudly proclaiming that you don’t like junk food (the explicit preference) but salivating when you see a cupcake (the implicit preference), or staunchly advocating for egalitarianism (the explicit preference) but preferring the job candidate who went to your alma mater (the implicit preference).

The Gallup research fails to establish a nuanced understanding of preferences, hiding the fact that they’re explicit and thus prone to the myriad biases that plague human brains. Because only explicit preferences were measured, we should be far less confident that they’re real.

3. People are terrible at understanding their own preferences

People are known to be bad at predicting how they’ll feel in hypothetical scenarios. Psychologists call this forecasting fallacy the hot-cold empathy gap, and, if you’re like most humans, you probably experience it often. Like when you dread going to your cousin’s birthday party but enjoy it once you arrive. Or when you’re fearful about an impending breakup but find yourself happy once it occurs.

In other words, you may think you don’t care about the gender of your boss when pondering during a phone survey, but this might not be how you feel on your new manager’s first day.

The title of the article makes a bold claim lacking in nuance and bolstered by virtually no convincing evidence. Digging into the data reveals that it doesn’t actually measure what the title indicates. The solution here, of course, is to describe the results accurately: something like “In a phone survey, 55% of Americans said they had no preference between a male and a female boss.”

Yes, this is a much softer, much more boring claim, likely not even worth reporting. And I have a feeling that many articles would suffer a similar fate if reported correctly.

But at least they’d be true.

To have more informed conversations, start counting calories

If we want to have data-driven, transparent, and productive discussions about the world around us, we can’t keep reaching for plain ol’ data as if it were the panacea for argumentative conflict everywhere. We must instead use mindful statistics, which better communicate the massive uncertainty of the data around us.

Here’s an analogy: In 2014, in light of an ever-troubling American obesity epidemic, the FDA required many restaurants across the country to label their menus with calorie counts. Certain phases of the regulations are still taking effect today.

Have these regulations been successful? I have no idea, because I couldn’t find reliable data on their success or failure, but I like to think that people make more informed decisions about what to order when they have access to calorie counts.

Data should be thought about in a similar way. Would reputable news sources or political leaders make bold, causal claims about nutrition or psychology or sociology if they realized that the underlying data relied on a biased sample or a low sample size?

That’s like trusting the reading from a thermometer that you know is broken.

To be fair, I don’t think all data can be ripped apart with such scrutiny. For example, studies in particle physics are almost always more rigorous than ones in sociology, making their results less dubious. Nor should we totally discount data that falls prey to all the errors I’ve mentioned above: Data can be imperfect to different degrees.

The point here is merely that most data about most things is radically imperfect, so we shouldn’t throw it around without acknowledging its biases.

Instead, we should adopt a new vocabulary when we use numbers, one that relies heavily on words like “survey”, “correlation”, and “uncertain”, all of which remind the reader that data shouldn’t be blindly trusted, that generalizations shouldn’t be made, and that our confidence should almost always be low rather than high. Just as many restaurants now label their menu items with calorie disclaimers, we should label our statistics with bias disclaimers.

Adopting this new vocabulary is so critical because data is easily the best — if not the only — language we have for talking about the world empirically.

But just because it’s the best doesn’t mean it’s not stupid.

This article was originally published on Medium. Republished with permission. Image: Holli Conger.