How to measure the accuracy of forecasts


“There’s a 30% chance of rain today.”

And then it didn’t rain. So, was the forecast accurate?

Or what if it did rain? Does that mean the forecast was inaccurate?

How do you hold forecasters accountable when the forecast is only a probability? The answer appears tricky, then simple, then tricky again, and ends up being simple enough to compute in a Google Spreadsheet.

It’s a journey worth taking, because building better forecasts is invaluable for businesses:

Take lead scoring: putting a value on a new sales lead, predicting the ultimate value of that lead after 9 months have passed and it has either converted or not. The forecast is the chance this lead has of converting, or the dollar value it will have. Like the weather, the lead will convert or it won’t, and if it does, it has a definite dollar value.

If you could predict the chance that a given customer might churn in the next thirty days, you could be proactive and perhaps avert the loss.

If you could predict the chance that a given customer would be amenable to an upgrade, you could focus your internal messaging efforts accordingly.

But how do you measure the accuracy of a prediction which itself is expressed only as a probability? Let’s return to the meteorologist.


Building the Model: Error

Clearly, a single data point tells you nothing. The correct interpretation of “30% chance of rain” is this: Take all the days the meteorologist predicted 30%. If the meteorologist is accurate, it should have in fact rained 30% of those times. Similarly, the forecaster will sometimes predict 0% or 10% or 50%. So we should “bucket” each of these predictions, and see what actually happened in those buckets.
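
As a minimal sketch of that bucketing idea (in Python, with a few invented days of data purely for illustration):

    # Group each day's (forecast, outcome) pair by the forecast value, then
    # compare the forecast to how often it actually rained in that bucket.
    # The history below is invented purely for illustration.
    from collections import defaultdict

    history = [(0.30, 0), (0.30, 1), (0.30, 0), (0.10, 0), (0.50, 1), (0.50, 0)]

    buckets = defaultdict(list)
    for forecast, rained in history:
        buckets[forecast].append(rained)

    for forecast, outcomes in sorted(buckets.items()):
        observed = sum(outcomes) / len(outcomes)
        print(f"predicted {forecast:.0%} on {len(outcomes)} days; it rained {observed:.0%} of them")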

What is the right math to determine “how correct” the forecaster is? As is often the answer in statistics [1], we can take the squared difference between the forecast and the actual result.

Suppose we have two forecasters, and the question is: Who is most accurate? “Error” is measured by the squared difference between the forecast and reality. Whoever has the least error is the better forecaster. Suppose on some set of days, forecaster A always predicted a 32% chance of rain, and B always predicted 25%, and suppose in reality it rained on 30% of those days. Then the errors are:

A: Predict 32%: Actual 30% → error = squared difference = (0.32-0.30)² = 0.0004
B: Predict 25%: Actual 30% → error = squared difference = (0.25-0.30)² = 0.0025
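
The same arithmetic, as a tiny Python sketch:

    # Squared-difference error for each forecaster, using the numbers above.
    actual = 0.30
    error_a = (0.32 - actual) ** 2   # 0.0004
    error_b = (0.25 - actual) ** 2   # 0.0025 -- so A has less error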

It feels like we’re finished, but we’re not.

Building the Model: Discernment

Suppose these meteorologists are in a region where it typically rains 110 days out of 365. That is, the overall climatic average for rainfall is 30%. A meteorologist would know that. So suppose a meteorologist simply predicts “30% chance of rain,” every single day, no matter what. Even if it’s actually raining, right now, predict “30%.”

Our “error” metric will confirm that this forecaster is a genius — exactly zero over a whole year of predictions! Except the forecaster isn’t a genius. In fact, this forecaster is not forecasting at all! She’s just regurgitating the historical average.

So it’s apparent that, although we do need our measure of error, there’s another concept we need to measure: The idea that the forecaster is being discerning. That the forecaster is segmenting the days, taking a strong stance about which days will rain.

The tension between error and discernment is apparent if you consider the following scenario. Suppose forecaster A always predicts the climatic average; thus A has 0 error but 0 discernment, and is useless. Now consider forecaster B, who usually predicts the climatic average, but now and then, when he’s very sure, predicts a 0% or 100% chance of rain. And suppose that when he predicts 0% the actual average is 10%, and when he predicts 100% the actual average is 90%.

B will have a worse error score, but should have a better discernment score. You would prefer to listen to forecaster B, even though he is less accurate than A. So the idea of “discernment” isn’t just a curiosity, it’s a fundamental concept in measuring how “good” a forecaster is.

How do you compute this “discernment”? We once again use squared differences, but this time we compare the observed results with the climatic average.

So, in the example above:

A: Predict 30% every time for 100 days; actual is 30%. Error = (30%-30%)² = 0. Discernment = (30%-30%)² = 0.

B: Predict 30% for 80 days; actual is 30%. Predict 0% for 10 days, actual is 10%. Predict 100% for 10 days, actual is 90%. Total error is:

1/100 * [   80*(30%-30%)²
          + 10*(0%-10%)²
          + 10*(100%-90%)²
        ] = 0.002

That’s only slightly worse than A, so that’s good. Total discernment is:

1/100 * [   80*(30%-30%)²
          + 10*(10%-30%)²
          + 10*(90%-30%)²
        ] = 0.04
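
Here’s the same pair of calculations as a short Python sketch, using the bucket counts from the example above:

    # Forecaster B: 80 days predicted 30% (observed 30%), 10 days predicted 0%
    # (observed 10%), 10 days predicted 100% (observed 90%).
    buckets = [(80, 0.30, 0.30), (10, 0.00, 0.10), (10, 1.00, 0.90)]
    total_days = sum(n for n, _, _ in buckets)
    climatic_average = 0.30

    error = sum(n * (forecast - observed) ** 2
                for n, forecast, observed in buckets) / total_days
    discernment = sum(n * (observed - climatic_average) ** 2
                      for n, _, observed in buckets) / total_days

    print(round(error, 4))        # 0.002
    print(round(discernment, 4))  # 0.04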

As expected, we see that B has slightly more error than A, but more discernment. So it’s clear that our metrics are working directionally, but how do we combine these numbers into a total “goodness” score that would definitively show that, in this case, B is “better” than A?

To answer that, it turns out there’s one more concept we need.

Building the Model: Uncertainty

Consider the life of a forecaster in Antofagasta, Chile, where on average it rains only five days a year (for a grand total of 1.7 millimeters of rainfall!). At first glance it seems easy to be a forecaster — just predict “no rain” every day.

Of course you recognize that although that forecaster would have low error, she would also be undiscerning. But wait… how could a forecaster ever be discerning in Antofagasta? To be discerning you need to make varied predictions. But the reality isn’t varied, so any predictions that were varied would necessarily be wrong! In a sense, there’s no “space” for discernment, because there’s no variation to discern between. There’s not a lot of uncertainty in the system in the first place, so there’s not much a forecaster can do to improve on guessing the climatic average.

Compare that with a forecaster in Portland, Oregon, USA, where it rains 164 days out of the year — about 45%. And there’s no “rainy season” — it’s just chaotic. Now there’s lots of room for improvement: even predicting 55% or 35% here and there will still be highly accurate but will increase discernment. And a world-class forecaster has the space to create a significant amount of discernment.

So it’s not quite fair to ask “How discerning is the forecaster?” Instead we should ask “How discerning is the forecaster, compared with how much uncertainty is inherent in the system?”

In general, the closer the climatic average is to 0% or 100%, the less uncertainty there is. Maximum uncertainty is when the climatic average is 50%, i.e. a coin flip.

This metric — “uncertainty” — is computed as

a*(1-a)

where a is the climatic average. In the 30% example, the uncertainty metric would be 0.21. The mathematical interpretation is that the maximum possible discernment is 0.21. (The minimum is always 0.) Above, forecaster B’s discernment of 0.04 is still far from the maximum possible, so B isn’t a terrific forecaster. Still, B is better than A, whose discernment is 0! In the case of a desert where only 1% of the days have rain, the uncertainty would be 0.0099 — barely any.
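
As a quick sketch, the same formula in Python:

    # Uncertainty depends only on the climatic average.
    def uncertainty(a):
        return a * (1 - a)

    print(round(uncertainty(0.30), 4))  # 0.21   (the 30% rainfall example)
    print(round(uncertainty(0.01), 4))  # 0.0099 (the desert: almost nothing left to forecast)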

Putting it together: The Total Score of “Goodness” for Forecasting

So now we can put all three of these numbers together to form a total “goodness” score. It turns out to be mathematically sound to compute it this way:

[goodness] = [uncertainty] - [discernment] + [error]

An English interpretation is:

Every forecaster’s baseline score starts with the inherent uncertainty baked into the system. Indeed, in the case of a forecaster who guesses the climatic average, both [discernment] and [error] will be 0, and thus the total score is exactly the uncertainty.

Forecasters can improve their score (i.e. lower it) by increasing discernment, i.e. removing some amount of that uncertainty. Forecasters worsen their score by being inaccurate, which shows up with a higher [error] metric.

In general, a forecaster with increasing discernment will probably introduce a little more error as well. The better forecasters increase discernment by more than the error they introduce, thus decreasing the overall score.
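
Here’s a sketch of the combined score for forecasters A and B from the example above (remember: lower is better):

    # Total score = uncertainty - discernment + error (lower is better).
    def total_score(uncertainty, discernment, error):
        return uncertainty - discernment + error

    score_a = total_score(0.21, 0.00, 0.000)  # 0.21  -- just guessing the average
    score_b = total_score(0.21, 0.04, 0.002)  # 0.172 -- better (lower), despite more error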

In a given context, uncertainty is usually fairly constant, because it’s a long-term average quantity. The other two scores vary, and here’s a handy guide to interpreting those extremes:

low error & low discernment = Useless. You’re only accurate because you’re just guessing the average.

high error & low discernment = Failure. You’re not segmenting the population, and yet you’re still worse than guessing the climatic average.

low error & high discernment = Ideal. You’re making strong, unique predictions, and you’re correct.

high error & high discernment = Try Again. You’re making strong predictions, so at least you’re trying, but you’re not guessing correctly.

Implementation in Practice

Let’s suppose you want to build a lead-scoring algorithm for your company, predicting whether a given lead will eventually convert to a paying customer.

A natural place to start is with the baseline algorithm, i.e. the useless one. So let’s build that model first.

Here’s a Google Spreadsheet of the model. Duplicate it to get an editable one of your own. You put your predictions and actuals into the first worksheet. The second worksheet computes the intermediary results. (Make sure the columns to the right on the second sheet are “Filled Down” far enough!) The third worksheet computes the three numbers above, and the total score.

For the lead-scoring exercise: Use one row on the first worksheet for each of your leads. You don’t have a prediction yet, so fill in column A with any guess, like 0.5 (a coin flip). Put the actuals in column B, i.e. a 1 or 0 depending on whether it converted. You can use additional columns for information about the lead, like ID, contact info, source, and anything else you might need as input to your forecast.

After ensuring the columns on the second worksheet are “Filled Down” sufficiently, flip over to the third worksheet. Of course your score will be terrible right now! Look at the computation of the climatic average, which in this case is simply your sales team’s overall close rate on these leads.

Your baseline algorithm is to just guess that climatic average. So, you could type that number into column A on the first worksheet, fill down, and then check that indeed the three metrics make sense: Error should be zero, Discernment should be zero, Uncertainty should come from the overall average, and your Total Score will be equal to Uncertainty.
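
If you prefer to prototype in code rather than in the spreadsheet, a minimal sketch of the same baseline check might look like this (the outcomes list below is invented, standing in for your real 1s and 0s):

    # Baseline lead-scoring model: forecast the climatic average for every lead.
    outcomes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]   # 1 = converted, 0 = didn't; invented data

    climatic_average = sum(outcomes) / len(outcomes)    # the overall close rate (0.3 here)
    forecasts = [climatic_average] * len(outcomes)      # the baseline: guess the average every time

    # With every forecast equal to the climatic average, error and discernment are
    # both 0, so the total score equals the uncertainty.
    baseline_score = climatic_average * (1 - climatic_average)
    print(baseline_score)   # this is the number to beat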

Now that you have a working model, the valuable part begins!

Note the score of this baseline algorithm — write it down somewhere. The job now is to think of a better algorithm, and prove it’s better by beating that score.

A common pattern in lead scoring is that invalid data is often associated with poorer close rates. So, a simple idea would be to check that the phone number field is not empty, or has at least eight digits. Supposing the overall climatic average is 30%, you might guess that leads with valid phone numbers convert at 40%, while those without convert at 20%.

How would you apply that idea to the spreadsheet? One way is to run the algorithm separately, then import the data into the spreadsheet. If you’re expert in spreadsheet formulas, you could add all your lead data as additional columns on the first worksheet, then use a big spreadsheet formula to compute the forecast right there! Then you can quickly test different ideas.
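
Outside the spreadsheet, a minimal sketch of that phone-number rule might look like this (the field name and the 40%/20% guesses are just the hypothetical ones from above):

    # Rule-based forecast: leads with a plausible phone number get a higher probability.
    def forecast_lead(lead):
        phone = lead.get("phone") or ""
        digits = sum(ch.isdigit() for ch in phone)
        return 0.40 if digits >= 8 else 0.20

    print(forecast_lead({"phone": "+1 512 555 0100"}))  # 0.4
    print(forecast_lead({"phone": ""}))                 # 0.2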

The second worksheet can help you iterate more quickly to better results. To see how, suppose you added the formula described just above about phone numbers. While the specific guesses of “40%” and “20%” close-rates might be directionally sensible, it’s unlikely that those are exactly the right probabilities.

So, after implementing that forecast, you flip to the second worksheet. You’ll see that every guess of “20%” is grouped into a single row, as are the guesses of “40%.” You’ll also see, computed there, the actual close rates for those two rows. The fact that those actuals, whatever they are, differ from the climatic average contributes to a higher discernment score, so that’s good. But the difference between your guesses and those actuals contributes to your error score, so that’s bad.

The fix is simple: Just “steal” those actuals and put them into your formula! So maybe it should be “if the phone number is blank, forecast 23%, else forecast 33%.” That update preserves your newfound discernment, but minimizes your error.
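
In code, that “steal the actuals” step is just a lookup that swaps each bucket’s guess for the close rate actually observed in that bucket (a sketch; the 23%/33% figures are the hypothetical actuals from above):

    # Calibrate the phone-number rule: replace each bucket's guessed probability with
    # the close rate actually observed for that bucket (hypothetical numbers from above).
    observed_close_rate = {0.20: 0.23, 0.40: 0.33}

    def calibrated_forecast(lead):
        phone = lead.get("phone") or ""
        guess = 0.40 if sum(ch.isdigit() for ch in phone) >= 8 else 0.20
        return observed_close_rate[guess]

    print(calibrated_forecast({"phone": "+1 512 555 0100"}))  # 0.33
    print(calibrated_forecast({"phone": ""}))                 # 0.23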

At this point, your total score should be less than the baseline. As they say in A/B testing, you have a new winner!

So now you’d continue iterating, trying more ideas for how to segment and forecast your leads. Sometimes you’ll have ideas you think are great, but it turns out they actually increase the total score versus a simpler algorithm. That’s because sometimes being too fancy just means you incur lots of error.

Finally, if someday you get extremely sophisticated, you might use machine learning to forecast. But you can still apply exactly the same model to answer the question: “Which algorithm is better?”

The sky’s the limit!  Now go make some good forecasts!

Further Study

The mathematical model described here was introduced by Glenn Brier in 1950; it is now known as the Brier score. Since then, it’s been extended with different scoring functions, multi-variate outcome models, and more. Here’s some further reading:

Notes:

  1. Why do we square the errors instead of using something simpler like the absolute value of the difference? There are two answers. One is that squaring intentionally exaggerates large differences, which is often a useful property. The other is that the mathematics of squared differences is much more tractable than that of absolute values. Specifically, you can algebraically rewrite squared differences, and you can apply differential calculus, neither of which works with absolute value. Computing a linear regression line with least squares, for example, is derived by using calculus to minimize the squared differences, but that same method cannot be applied to absolute differences.
  • http://nutriadmin.com Nutriadmin

    What a great read. Especially in a space where there seem to be so many misconceptions about basic probability and statistics. Loved the examples with the meteorologist – a complex discussion made accessible.

    I’ll have to try the practical implementation. I guess optimizing the score manually makes the algorithm “Human Learning” rather than “Machine Learning”. Hopefully this would help not only to get better forecasts, but also to get rid of human biases and assumptions (a.k.a. bulls**t) about what influences a given metric.

    • http://blog.asmartbear.com Jason Cohen

      Haha, yup definitely “human learning.” And an objective measure of if/when machine learning on a particular dataset is better than a human. In some cases a “combo of the two” is actually best; in fact that is how some of the leaders in big-data + machine learning actually work (e.g. Palantir).

  • http://www.wildting.com/ Dale

    Jason, I’ve been looking for something like this for a long time, mainly for testing different win probability models for my NFL Confidence Pool site (http://www.confidencepoolpicks.com/2014/10/how-accurate-are-moneyline-picks-2/ ). I could figure out which was more accurate, but I had the nagging feeling that if I would just choose 50% win probability for each game, I’d get a perfect score. I wish they would’ve taught this in school!

    (Next thing I’m trying to figure out is how to determine, for my fantasy baseball league, which stats I should look at to predict which players I should start: 15 day history, 30 day history, or full season history. Any help on that?)

    • http://blog.asmartbear.com Jason Cohen

      Hi Dale, regarding which stats to start with, perhaps the right thing is to simply try different ones and see whether one set of stats — combined with a good algorithm of course! — has better results.

  • Joannes Vermorel

    Well, there are accuracy metrics for probabilistic forecasts, see https://www.lokad.com/continuous-ranked-probability-score but again, probably overkill for business plan assessment :-) Nice read.

  • Dr Aniruddha Malpani, MD

    Shouldn’t [goodness] = [uncertainty] + [discernment] – [error] ?

  • Anny Smile

    Hmm, I’ll try this

  • Gideon Arom

    I’m confused about the formula for discernment. If it’s ([actual] – [average])^2 then it doesn’t incorporate the predictions made at all so how can it reflect the discernment of the forecaster?