Data: Dark, Found, and Crunched

Drawing on Tim Harford’s book on data and statistics, How to Make the World Add Up, let’s look at a couple of other things to watch out for when we see conclusions drawn from data.

 

Let’s start with an example:

“Consider the historical under-representation of women in clinical trials. One grim landmark was thalidomide, which was widely used by pregnant women to ease morning sickness only for it to emerge that the drug could cause severe disability and death to unborn children.”

Even if we try to collect data with the right representation of men and women, rich and poor, urban and rural, and so on, we still run into the problem that many people don’t respond, or that certain types of people can’t be found easily. All of these are forms of “dark data”, and the problem cannot be solved entirely. The general danger is that all analysis is, by definition, done with “found data” only. That is a lesson Harford asks us to remember:

“We can and should remember to ask who or what might be missing from the data we’re being told about.”
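To see how missing respondents can skew a result, here is a toy simulation. It is only a sketch under made-up assumptions: the income distribution, the 25,000 threshold and the response rates are all hypothetical numbers chosen for illustration, not figures from Harford’s book.

```python
import random

def simulate_survey(n=100_000, seed=0):
    """Compare the true average with the average of those who responded."""
    rng = random.Random(seed)
    population = []
    responded = []
    for _ in range(n):
        # Hypothetical incomes drawn from a log-normal distribution.
        income = rng.lognormvariate(10, 0.5)
        population.append(income)
        # Assumption: people with higher incomes answer the survey less often.
        p_respond = 0.8 if income < 25_000 else 0.3
        if rng.random() < p_respond:
            responded.append(income)
    true_mean = sum(population) / len(population)
    found_mean = sum(responded) / len(responded)
    return true_mean, found_mean, len(responded)

true_mean, found_mean, n_responded = simulate_survey()
print(f"True average:  {true_mean:,.0f}")
print(f"Found average: {found_mean:,.0f} (from {n_responded:,} responses)")
```

Under these assumptions the “found” average comes out lower than the true one, even with tens of thousands of responses. More data does not help when the people who are missing differ from the people who answered.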

 

Let’s move on to another lesson. Harford next talks of a belief that many hold: with the enormous datasets that Internet companies like Amazon, Google and Facebook have, the results of their data crunching must be close to correct. Aha, but all those companies use Machine Learning (ML) algorithms, where the system “learns” to make sense of the data “on its own”. The human-written code is more a set of general principles or guidelines than a specific set of rules. This creates the problem that one has no idea how the system comes to its conclusions. Google, for example, successfully identified flu trends in the US well before the health agencies could from reported cases, and did so for several years in a row. Very impressive, right? Yes, until it failed. At which point nobody (including the folks at Google) knew why it had gone wrong. The ML algorithms are a black box, based entirely on correlations, not causation. Which brings us to the lesson:

“If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down.”
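A toy example makes the point concrete. This is not a reconstruction of what Google did or of why its predictions failed; it is a minimal sketch with invented numbers. A straight line is fitted to predict flu cases from search counts, and then a hypothetical outside factor (say, heavy news coverage) adds searches that have nothing to do with actual cases.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_error(slope, intercept, xs, ys):
    """Average absolute gap between predicted and actual values."""
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(1)

# Early period (hypothetical): searches closely track flu cases.
early_cases = [rng.uniform(100, 1000) for _ in range(200)]
early_searches = [10 * c + rng.gauss(0, 200) for c in early_cases]
slope, intercept = fit_line(early_searches, early_cases)

# Later period (hypothetical): 5,000 extra searches unrelated to cases.
later_cases = [rng.uniform(100, 1000) for _ in range(200)]
later_searches = [10 * c + 5000 + rng.gauss(0, 200) for c in later_cases]

early_error = mean_abs_error(slope, intercept, early_searches, early_cases)
later_error = mean_abs_error(slope, intercept, later_searches, later_cases)
print(f"Average error, early period: {early_error:.0f} cases")
print(f"Average error, later period: {later_error:.0f} cases")
```

The fitted line knows nothing about why searches and cases moved together, so it cannot tell that the relationship has changed. It simply keeps predicting, and the error balloons.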

 

Who’d have thought that there are so many non-numerical aspects of data analysis that one should be aware of?
