Data: Dark, Found, and Crunched
From Tim Harford’s book on data and statistics, How to Make the World Add Up, let’s look at a couple more things to watch out for when we see conclusions drawn from data.
Let’s start with an example:
“Consider the historical under-representation of women in clinical trials. One grim landmark was thalidomide, which was widely used by pregnant women to ease morning sickness only for it to emerge that the drug could cause severe disability and death to unborn children.”
Even if we try to collect data with the right representation of men and women, rich and poor, urban and rural, and so on, we still run into the problem that many people don’t respond, or that certain types of people can’t easily be found. All of these are forms of “dark data”, and the problem can never be solved entirely. The general danger is that all analysis is, by definition, done with “found data” only. That is a lesson Harford asks us to remember:
“We can and should remember to ask who or what might be missing from the data we’re being told about.”
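To see how missing respondents can skew a result, here is a minimal sketch in Python. Everything in it is made up for illustration: the 0 to 10 "score", the population size, and above all the assumption that people with higher scores are more likely to respond. It is not from Harford's book; it only shows the mechanism.

```python
import random

def simulate_survey(n_people=10_000, seed=0):
    """Compare the average of everyone with the average of those who respond."""
    rng = random.Random(seed)
    # Hypothetical "true" scores on a 0-10 scale, one per person.
    scores = [rng.uniform(0, 10) for _ in range(n_people)]
    # Assumption: the higher the score, the more likely the person responds.
    found = [s for s in scores if rng.random() < s / 10]
    true_mean = sum(scores) / len(scores)
    found_mean = sum(found) / len(found)
    return true_mean, found_mean, len(found)

true_mean, found_mean, n_found = simulate_survey()
print(f"everyone: {true_mean:.2f}  respondents only: {found_mean:.2f}  (n={n_found})")
```

Under these assumptions the respondents' average comes out well above the true average, even though thousands of people answered. A large sample does not help if the people who are missing differ from the people who were found.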
Let’s move on to another lesson. Harford next talks of a belief that many hold: with the enormous datasets that Internet companies like Amazon, Google and Facebook have, their data-crunching results must be close to correct. Aha, but all those companies use Machine Learning (ML) algorithms: the system “learns” to make sense of the data “on its own”. The human-written code is more a set of general principles and guidelines than a specific set of rules. This creates the problem that one often has little idea how the system comes to its conclusions.
Google, for example, successfully identified flu trends in the US well before the health agencies could from reported cases, and did so for several years in a row. Very impressive, right? Yes, until it failed. At that point nobody (including the folks at Google) knew why it had gone wrong. The ML algorithms are a black box, based entirely on correlations, not causation. Which brings us to the lesson:
“If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down.”
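Here is a toy sketch of that lesson in Python. The numbers are invented, and the "news scare" is a hypothetical cause chosen for illustration, not a claim about what actually went wrong at Google. A straight line is fitted to past seasons in which searches happened to track cases, then applied to a season in which they no longer do.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up past seasons: searches happen to run at 10x the case count.
train_cases = [100, 200, 400, 300, 150]
train_searches = [10 * c for c in train_cases]
slope, intercept = fit_line(train_searches, train_cases)

# Made-up later season: a news scare doubles searches, flu is unchanged.
actual_cases = 200
scare_searches = 2 * 10 * actual_cases
predicted_cases = slope * scare_searches + intercept

print(f"actual: {actual_cases}, predicted from searches: {predicted_cases:.0f}")
```

The fitted line is perfect on the past data, yet it predicts roughly twice the actual number of cases once people start searching for a different reason. The model never knew why searches and flu moved together, so it has no way of noticing that the reason has changed.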
Who’d have thought that there are so many non-numerical aspects of data analysis that one should be aware of?