Superforecasting - 1: Evaluating Forecasts


When we think of forecasters, sadly, we rarely know anyone beyond the ones on TV. And as Philip Tetlock writes in his terrific book, Superforecasting:
“The one undeniable talent that talking heads have is their skill at telling a compelling story with conviction…”
The aim doesn’t seem to be accuracy; rather, it’s just entertainment… and certainty.

Given that all forecasting can only be probabilistic, the recipient of the forecast needs to be open to uncertainty. Not an easy thing, since not everyone can adopt the Richard Feynman attitude:
“Doubt is not a fearful thing, but a thing of very great value.”
This opens up a new question: if a weather forecaster says there's a 70% chance of rain and it doesn't rain, was he wrong? Or was he right, since he also implied a 30% chance of no rain? Then again, doesn't that make his prediction right no matter what happens, rain or no rain?

Tetlock and his team came up with a system to answer exactly such questions and evaluate the accuracy of forecasters. First, forecasts must have clearly defined outcomes (rain or no rain), clear timelines (within a day, or a month), and, yes, an explicitly assigned probability (say, 65%). Second, a forecaster must make a huge number of forecasts before he can be gauged meaningfully.

Ok, but how do you evaluate the "70% chance of rain" kind of forecast? The answer is called "calibration". Perfect calibration means outcomes predicted with 70% probability happen 70% of the time, outcomes predicted with 60% probability happen 60% of the time, and so on. Deviations from perfect calibration indicate under-confidence or over-confidence: neither is a good thing for a forecaster, as you might have guessed.
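
To make that concrete, here is a minimal sketch (in Python, with made-up forecast data) of how such a calibration check could be run: group forecasts by the probability the forecaster stated, then compare that stated probability with how often those events actually happened.

```python
from collections import defaultdict

# Hypothetical forecasts: (stated probability, did the event happen?)
forecasts = [
    (0.7, True), (0.7, True), (0.7, False), (0.7, True), (0.7, True),
    (0.6, True), (0.6, False), (0.6, True), (0.6, False), (0.6, True),
]

# Group outcomes by the probability the forecaster assigned
buckets = defaultdict(list)
for prob, happened in forecasts:
    buckets[prob].append(happened)

# Perfect calibration: events given 70% probability happen ~70% of the time
for prob in sorted(buckets):
    outcomes = buckets[prob]
    observed = sum(outcomes) / len(outcomes)
    print(f"Predicted {prob:.0%} -> happened {observed:.0%} "
          f"of the time ({len(outcomes)} forecasts)")
```

With real data you would need far more forecasts in each bucket before the comparison means anything, which is exactly why a large volume of forecasts matters.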

That's a start, but it's not perfect. A forecaster who sticks to the safe zone of, say, 60% probability can look well calibrated while risking far less than one who calls out 80% odds. But he's not saying anything with much confidence, so what good is that?

Tetlock added another criterion called "resolution". If two forecasters predict the same event and get it right, but one said 70% probability and the other said 90%, the latter gets more credit. Conversely, if they're both wrong, the latter is penalized more heavily.
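
To see the reward and the penalty in numbers, here is a small sketch using the squared-error rule that underlies the Brier score mentioned next; the helper function and the sample values are mine.

```python
def squared_error_score(prob: float, happened: bool) -> float:
    """Squared difference between the stated probability and the outcome
    (1 if the event happened, 0 if not). Lower is better; 0 is perfect.
    (Brier's original convention sums this over both outcomes, giving a
    0-to-2 range; this sketch uses the simpler 0-to-1 form.)"""
    outcome = 1.0 if happened else 0.0
    return (prob - outcome) ** 2

# Two forecasters call the same event: one at 70%, the other at 90%
for prob in (0.7, 0.9):
    print(f"Said {prob:.0%}: "
          f"score if right = {squared_error_score(prob, True):.2f}, "
          f"if wrong = {squared_error_score(prob, False):.2f}")
# If the event happens, the 90% call scores better (0.01 vs 0.09);
# if it doesn't, it is punished harder (0.81 vs 0.49).
```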

Combine the two ("calibration" and "resolution") and you get the Brier score. Tetlock and his team now had a solid yardstick to evaluate forecasters. They enrolled volunteers for the Good Judgment Project (GJP), a project funded by the US government. A huge number of predictions were sought and tracked over the course of a year. The comparison was against university-affiliated teams and even professional intelligence analysts. What would be the findings?
