Superforecasting - 1: Evaluating Forecasts
When we think of forecasters, sadly, we rarely know anyone beyond those on TV. And as Philip Tetlock writes in his terrific book, Superforecasting:
“The one undeniable talent that talking heads have is their skill at telling a compelling story with conviction…”
The aim doesn’t seem to be accuracy; rather, it’s just entertainment… and certainty.
Given that all forecasting can only be probabilistic, the recipient of the forecast needs to be open to uncertainty. Not an easy thing, since not everyone can adopt the Richard Feynman attitude:
“Doubt is not a fearful thing, but a thing of very great value.”
This opens up a new question: if a weather forecaster says there’s a 70% chance of rain and it didn’t rain, was he wrong? Or was he right, since he also meant a 30% chance of no rain? Then again, doesn’t that make his prediction right no matter what happens, rain or no rain?
Tetlock and team came up with a system to answer exactly such questions and evaluate the accuracy of forecasters. First, forecasts must have clearly defined outcomes (rain or no rain), clear timelines (within a day/month), and, yes, an assigned probability of being right (65%). Second, a large number of forecasts must be made to gauge a forecaster meaningfully.
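To make that concrete, here is a minimal sketch of what a scorable forecast record might look like. The names and fields are my own, purely for illustration; this is not how the GJP actually stored its questions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Forecast:
    question: str                   # clearly defined, yes/no resolvable outcome
    deadline: date                  # clear timeline for resolution
    probability: float              # assigned chance the event happens, 0.0 to 1.0
    outcome: Optional[bool] = None  # filled in once the deadline passes

# One forecast tells you little; many such forecasts are needed
# before a forecaster can be judged meaningfully.
forecasts = [
    Forecast("Rain in the city tomorrow", date(2024, 6, 1), 0.70),
]
```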
Ok, but what’s the way to evaluate the “70% chance of rain” kind of forecasts? The answer is called “calibration”.
Perfect calibration would mean outcomes predicted with 70% probability happen 70% of the time; outcomes predicted with 60% probability happen 60% of the time; and so on. Deviations from perfect calibration would indicate under-confidence or over-confidence: neither is a good thing for a forecaster, as you might have guessed.
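As a rough illustration (my own sketch, not Tetlock’s methodology), checking calibration amounts to bucketing forecasts by their stated probability and comparing against how often the predicted event actually happened:

```python
from collections import defaultdict

def calibration_table(forecasts):
    """For each stated probability, return the observed frequency of the event.

    `forecasts` is a list of (stated_probability, outcome) pairs, where
    outcome is 1 if the predicted event happened and 0 if it didn't.
    """
    buckets = defaultdict(list)
    for prob, outcome in forecasts:
        buckets[round(prob, 1)].append(outcome)  # bucket to the nearest 10%
    return {prob: sum(buckets[prob]) / len(buckets[prob]) for prob in sorted(buckets)}

# "70% chance of rain" called on ten days, with rain on seven of them,
# is perfectly calibrated; rain on all ten would suggest under-confidence.
forecasts = [(0.7, 1)] * 7 + [(0.7, 0)] * 3 + [(0.6, 1)] * 6 + [(0.6, 0)] * 4
print(calibration_table(forecasts))  # {0.6: 0.6, 0.7: 0.7}
```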
That’s a start, but it’s not perfect. A forecaster who sticks to the safe zone of, say, 60% probability is likely to be right more often than one who calls out 80% odds. But he’s not saying anything with much confidence, so what good is that?
Tetlock added another criterion called “resolution”. If two forecasters predicted the same event and got it right, but one said 70% probability and the other said 90%, the latter gets more points. Conversely, if they’re both wrong, the latter is penalized more.
Combine the two (“calibration” and “resolution”) and you get the Brier score.
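For yes/no forecasts, the standard Brier score is the mean squared difference between the stated probability and what actually happened (1 or 0); lower is better. Here’s a small sketch of that formula, not the GJP’s actual scoring code:

```python
def brier_score(forecasts):
    """Mean squared error of probabilities vs. outcomes.

    0.0 is perfect; always saying 50% scores 0.25 no matter what happens.
    """
    return sum((prob - outcome) ** 2 for prob, outcome in forecasts) / len(forecasts)

# Two forecasters call the same event and it happens (outcome = 1):
print(round(brier_score([(0.9, 1)]), 2))  # 0.01 -- the bolder, correct call scores better
print(round(brier_score([(0.7, 1)]), 2))  # 0.09
# If the event doesn't happen (outcome = 0), the bolder call is penalized more:
print(round(brier_score([(0.9, 0)]), 2))  # 0.81
print(round(brier_score([(0.7, 0)]), 2))  # 0.49
```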
Tetlock and team now had a solid metric to evaluate forecasters. They enrolled volunteers for the Good Judgment Project (GJP), a project funded by the US government. A huge number of predictions were sought and tracked over the course of a year. The comparison was against university-affiliated teams and even professional intelligence analysts. What would be the findings?