**“There’s a 30% chance of rain today.”**

And then it didn’t rain. So, was the forecast accurate?

Or what if it did rain? Does that mean the forecast was inaccurate?

How do you hold forecasters accountable when the forecast is only a probability? The answer appears tricky, then simple, then tricky again, and finally turns out to be simple enough to compute in a Google Spreadsheet.

It’s a journey worth taking, because building better forecasts is invaluable for businesses:

Take lead scoring — putting a value on a new sales lead, predicting the ultimate value of that lead after 9 months have passed and it’s either converted or not. The forecast is what chance this lead has of converting, or what dollar value it has. Like the weather, the lead will convert or it won’t, and if it does, it has a definite dollar value.

If you could predict the chance that a given customer might churn in the next thirty days, you could be proactive and perhaps avert the loss.

If you could predict the chance that a given customer would be amenable to an upgrade, you could focus your internal messaging efforts accordingly.

But how do you measure the accuracy of a prediction which itself is expressed only as a probability? Let’s return to the meteorologist.

**Building the Model: Error**

Clearly, a single data point tells you nothing. The correct interpretation of “30% chance of rain” is this: Take all the days the meteorologist predicted 30%. If the meteorologist is accurate, it should have in fact rained 30% of those times. Similarly, the forecaster will sometimes predict 0% or 10% or 50%. So we should “bucket” each of these predictions, and see what actually happened in those buckets.
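The bucketing idea can be sketched in a few lines of Python. This is only an illustration: the function name `bucket_forecasts` and the ten-day data set are made up for the example, with `1` meaning it rained and `0` meaning it didn't.

```python
from collections import defaultdict

def bucket_forecasts(forecasts, outcomes):
    """Group outcomes by forecast value.

    Returns {forecast: (number_of_days, observed_rain_rate)}.
    """
    groups = defaultdict(list)
    for forecast, outcome in zip(forecasts, outcomes):
        groups[forecast].append(outcome)
    return {f: (len(obs), sum(obs) / len(obs)) for f, obs in groups.items()}

# Hypothetical data: ten days forecast at 30%, three of which were rainy.
forecasts = [0.3] * 10
outcomes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

buckets = bucket_forecasts(forecasts, outcomes)
print(buckets)  # {0.3: (10, 0.3)}
```

Here the "30%" bucket saw rain on 30% of its days, which is what an accurate forecaster's bucket should look like.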

What is the right math to determine “how correct” the forecaster is? As is often the answer in statistics ^{1}, we can take the squared difference between the forecast and the actual result.

Suppose we have two forecasters, and the question is: Who is more accurate? “Error” is measured by the squared difference between the forecast and reality. Whoever has the least error is the better forecaster. Suppose on some set of days, forecaster A always predicted a 32% chance of rain, and B always predicted 25%, and suppose in reality it rained on 30% of those days. Then the errors are:

A: Predict 32%: Actual 30% → error = squared difference = (0.32-0.30)^{2} = **0.0004**

B: Predict 25%: Actual 30% → error = squared difference = (0.25-0.30)^{2} = **0.0025**
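As a quick check of that arithmetic, here is a minimal Python sketch; `squared_error` is just a name chosen for this example.

```python
def squared_error(forecast, observed_rate):
    """Squared difference between a forecast and what actually happened."""
    return (forecast - observed_rate) ** 2

error_a = squared_error(0.32, 0.30)
error_b = squared_error(0.25, 0.30)

# Rounded to hide floating-point noise.
print(round(error_a, 6), round(error_b, 6))  # 0.0004 0.0025
```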

It feels like we’re finished, but we’re not.

**Building the Model: Discernment**

Suppose these meteorologists are in a region in which it’s typical that out of 365 days, it will rain 110 of those days. That is, the overall climatic average for rainfall is 30%. A meteorologist would know that. So suppose a meteorologist simply predicts “30% chance of rain,” every single day, no matter what. Even if it’s actually raining, right now, predict “30%.”

Our “error” metric will confirm that this forecaster is a genius — exactly zero over a whole year of predictions! Except the forecaster isn’t a genius. In fact, this forecaster is not forecasting at all! She’s just regurgitating the historical average.

So it’s apparent that, although we do need our measure of *error*, there’s another concept we need to measure: The idea that the forecaster is being *discerning*. That the forecaster is *segmenting* the days, taking a strong stance about which days will rain.

The tension between error and discernment is apparent if you consider the following scenario. Suppose forecaster A always predicts the climatic average; thus A has 0 error but 0 discernment, and is useless. Now consider forecaster B, who often predicts the climatic average, but now and then will predict a 0% or 100% chance of rain, when he’s very sure. And suppose that when he predicts 0% the actual average is 10%, and when he predicts 100% the actual average is 90%.

B will have a worse error score, but should have a better discernment score. You would prefer to listen to forecaster B, *even though he is less accurate than A.* So the idea of “discernment” isn’t just a curiosity, it’s a fundamental concept in measuring how “good” a forecaster is.

How do you compute this “discernment?” We once again use squared-differences, but this time we are comparing the *observed results* with the *climatic average*.

So, in the example above:

A: Predict 30% every time for 100 days: actual is 30%, Error = (30%-30%)^{2} = **0**. Discernment = (30%-30%)^{2} = **0**.

B: Predict 30% for 80 days; actual is 30%. Predict 0% for 10 days, actual is 10%. Predict 100% for 10 days, actual is 90%. Total error is:

1/100 * [ 80*(30%-30%)^{2}+ 10*(0%-10%)^{2}+ 10*(100%-90%)^{2}] =0.002

That’s only slightly worse than A, so that’s good. Total discernment is:

1/100 * [ 80*(30%-30%)^{2}+ 10*(10%-30%)^{2}+ 10*(90%-30%)^{2}] =0.04
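The same day-weighted sums can be written as a short Python sketch. The function name and the `(days, forecast, observed_rate)` tuple layout are assumptions made for this example, and the climatic average is passed in as the long-run 30% used in the text rather than recomputed from the 100 days.

```python
def error_and_discernment(buckets, climatic_average):
    """buckets: list of (days, forecast, observed_rate) tuples.

    Returns the day-weighted (error, discernment) pair.
    """
    total_days = sum(days for days, _, _ in buckets)
    error = sum(days * (forecast - observed) ** 2
                for days, forecast, observed in buckets) / total_days
    discernment = sum(days * (observed - climatic_average) ** 2
                      for days, _, observed in buckets) / total_days
    return error, discernment

forecaster_a = [(100, 0.30, 0.30)]
forecaster_b = [(80, 0.30, 0.30), (10, 0.00, 0.10), (10, 1.00, 0.90)]

for name, buckets in (("A", forecaster_a), ("B", forecaster_b)):
    error, discernment = error_and_discernment(buckets, climatic_average=0.30)
    print(name, round(error, 4), round(discernment, 4))
# A 0.0 0.0
# B 0.002 0.04
```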

As expected, we see that B has slightly more error than A, but more discernment. So it’s clear that our metrics are working directionally, but how do we combine these numbers into a total “goodness” score, that would definitively show that, in this case, B is “better” than A?

To answer that, it turns out there’s one more concept we need.

**Building the Model: Uncertainty**

Consider the life of a forecaster in Antofagasta, Chile, where on average it rains only five days a year (for a grand total of 1.7 *millimeters* of total rainfall!). At first glance it seems easy to be a forecaster — just predict “no rain” every day.

Of course you recognize that although that forecaster would have low error, she would also be undiscerning. But wait… how could a forecaster *ever* be discerning in Antofagasta? To be discerning you need to make varied predictions. But *the reality isn’t varied*, so any predictions that *were* varied, would *necessarily* be *wrong*! In a sense, there’s no “space” for discernment, because there’s no variation to discern between. There’s not a lot of uncertainty in the system in the first place, so there’s not much a forecaster can do to improve on guessing the climatic average.

Compare that with a forecaster in Portland, Oregon, USA, where it rains 164 days out of the year, about 45%. And there’s no “rainy season”; it’s just chaotic. Now there’s lots of room for improvement: even just predicting 55% or 35% here and there will still be highly accurate while increasing discernment. And a world-class forecaster has the space to create a significant amount of discernment.

So it’s not quite fair to ask “How discerning is the forecaster?” Instead we should ask “How discerning is the forecaster, compared with how much uncertainty is inherent in the system?”

In general, the closer the climatic average is to 0% or 100%, the less uncertainty there is. Maximum uncertainty is when the climatic average is 50%, i.e. a coin flip.

This metric — “uncertainty” — is computed as

```
a*(1-a)
```

where a is the climatic average. In the 30% example, the uncertainty metric would be 0.21. The mathematical interpretation is that the maximum possible discernment is 0.21. (The minimum is always 0.) Above, forecaster B’s discernment of 0.04 is still far from the maximum possible, so B isn’t a terrific forecaster. Still, B is better than A, whose discernment is 0! In the case of the desert where 1% of the days have rain, uncertainty would be 0.0099, barely any.
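To see how quickly uncertainty shrinks as the average moves away from 50%, here is a minimal Python sketch; the function name `uncertainty` is chosen for this example.

```python
def uncertainty(climatic_average):
    """Inherent uncertainty a*(1-a) for a yes/no outcome with average rate a."""
    return climatic_average * (1 - climatic_average)

# The desert, the 30% example, and a coin flip.
for a in (0.01, 0.30, 0.50):
    print(a, round(uncertainty(a), 4))
# 0.01 0.0099
# 0.3 0.21
# 0.5 0.25
```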

**Putting it together: The Total Score of “Goodness” for Forecasting**

So now we can put all three of these numbers together to form a total “goodness” score. It turns out to be mathematically sound to compute it this way:

```
[goodness] = [uncertainty] - [discernment] + [error]
```
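For readers who prefer symbols, the same score can be spelled out bucket by bucket. The notation is an assumption made for this sketch and does not appear elsewhere in the article: bucket k holds n_k predictions that all share the forecast f_k, ō_k is the observed rate in that bucket, ō is the climatic average, and N is the total number of predictions.

```latex
\[
\text{goodness}
= \underbrace{\bar{o}\,(1-\bar{o})}_{\text{uncertainty}}
- \underbrace{\frac{1}{N}\sum_{k} n_k\,(\bar{o}_k-\bar{o})^2}_{\text{discernment}}
+ \underbrace{\frac{1}{N}\sum_{k} n_k\,(f_k-\bar{o}_k)^2}_{\text{error}}
\]
```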

An English interpretation is:

Every forecaster’s baseline score starts with the inherent uncertainty baked into the system. Indeed, in the case of a forecaster who guesses the climatic average, both [discernment] and [error] will be 0, and thus the total score is exactly the uncertainty.

Forecasters can improve their score (i.e. lower it) by increasing discernment, i.e. removing some amount of that uncertainty. Forecasters worsen their score by being inaccurate, which shows up with a higher [error] metric.

In general, a forecaster who increases discernment will probably introduce a little more error as well. The better forecasters increase discernment by more than they increase error, thus decreasing the overall score.
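Plugging in the numbers from the two rain forecasters earlier shows the trade-off at work. This is a minimal Python sketch with a function name chosen for the example; lower is better.

```python
def total_score(uncertainty, discernment, error):
    """Total score = uncertainty - discernment + error (lower is better)."""
    return uncertainty - discernment + error

# Forecaster A: always predicts the 30% average.
score_a = total_score(uncertainty=0.21, discernment=0.0, error=0.0)
# Forecaster B: sometimes predicts 0% or 100%.
score_b = total_score(uncertainty=0.21, discernment=0.04, error=0.002)

print(round(score_a, 3), round(score_b, 3))  # 0.21 0.172
```

B pays 0.002 in extra error but gains 0.04 in discernment, so B ends up with the lower (better) score.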

In a given context, uncertainty is usually fairly constant, because it’s a long-term average quantity. The other two scores vary, and here’s a handy guide to interpreting those extremes:

low error & low discernment = **Useless.** You’re only accurate because you’re just guessing the average.

high error & low discernment = **Failure.** You’re not segmenting the population, and yet you’re still worse than guessing the climatic average.

low error & high discernment = **Ideal.** You’re making strong, unique predictions, and you’re correct.

high error & high discernment = **Try Again.** You’re making strong predictions, so at least you’re trying, but you’re not guessing correctly.

**Implementation in Practice**

Let’s suppose you want to build a lead-scoring algorithm for your company, predicting whether a given lead will eventually convert to a paying customer.

A natural place to start is with the baseline algorithm, i.e. the useless one. So let’s build that model first.

Here’s a Google Spreadsheet of the model. Duplicate it to get an editable one of your own. You put your predictions and actuals into the first worksheet. The second worksheet computes the intermediary results. (Make sure the columns to the right on the second sheet are “Filled Down” far enough!) The third worksheet computes the three numbers above, and the total score.

For the lead-scoring exercise: Use one row on the first worksheet for each of your leads. You don’t have a prediction yet, so fill in column A with any guess, like 0.5 (a coin flip). Put the actuals in column B, i.e. a 1 or 0 depending on whether it converted. You can use additional columns for information about the lead, like ID, contact info, source, and anything else you might need as input to your forecast.

After ensuring the columns on the second worksheet are “Filled Down” sufficiently, flip over to the third worksheet. Of course your score will be terrible right now! Look at the computation of the climatic average, which in this case is simply the overall close-rate by the sales team of your leads.

Your baseline algorithm is to just guess that climatic average. So, you could type that number into column A on the first worksheet, fill down, and then check that indeed the three metrics make sense: Error should be zero, Discernment should be zero, Uncertainty should come from the overall average, and your Total Score will be equal to Uncertainty.
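If you'd rather sanity-check the baseline outside the spreadsheet, the same computation might look like the Python sketch below. It is not the spreadsheet's actual formulas: the function `score_forecasts` and the ten-lead data set are hypothetical, each distinct forecast value is treated as one bucket, and the climatic average is taken from the actuals themselves.

```python
def score_forecasts(forecasts, actuals):
    """Score forecasts against 1/0 actuals, bucketing by distinct forecast value."""
    n = len(actuals)
    base_rate = sum(actuals) / n  # climatic average = overall close-rate

    groups = {}
    for forecast, actual in zip(forecasts, actuals):
        groups.setdefault(forecast, []).append(actual)

    error = sum(len(g) * (f - sum(g) / len(g)) ** 2
                for f, g in groups.items()) / n
    discernment = sum(len(g) * (sum(g) / len(g) - base_rate) ** 2
                      for g in groups.values()) / n
    uncertainty = base_rate * (1 - base_rate)
    return {
        "error": error,
        "discernment": discernment,
        "uncertainty": uncertainty,
        "total": uncertainty - discernment + error,
    }

# Hypothetical leads: 3 of 10 converted.
actuals = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
close_rate = sum(actuals) / len(actuals)

# Baseline algorithm: forecast the overall close-rate for every lead.
baseline = score_forecasts([close_rate] * len(actuals), actuals)
print(baseline)
```

With the baseline forecast, error and discernment both come out as zero and the total equals the uncertainty, which is the check described above.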

Now that you have a working model, the valuable part begins!

Note the score of this baseline algorithm — write it down somewhere. The job now is to think of a *better* algorithm, and prove it’s better by beating that score.

A common pattern in lead-scoring is that invalid data is often associated with poorer close-rates. So, a simple idea would be to check that the phone number field is not empty, or has at least eight digits. Supposing the overall climatic average is 30%, you might guess that those with valid phone numbers might convert at 40%, while those without close at 20%.
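That rule is small enough to sketch directly. The function names and the sample phone strings below are hypothetical, and "valid" is assumed to mean "contains at least eight digits," which also rules out an empty field.

```python
import re

def has_valid_phone(phone):
    """Assumed validity rule: the field contains at least eight digits."""
    digits = re.sub(r"\D", "", phone or "")
    return len(digits) >= 8

def forecast_conversion(phone, valid_rate=0.40, invalid_rate=0.20):
    """Forecast the chance of conversion from the phone field alone."""
    return valid_rate if has_valid_phone(phone) else invalid_rate

print(forecast_conversion("+1 (512) 555-0100"))  # 0.4
print(forecast_conversion(""))                   # 0.2
print(forecast_conversion("555-01"))             # 0.2
```

Keeping the 40% and 20% guesses as parameters makes it easy to swap in better numbers later.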

How would you apply that idea to the spreadsheet? One way is to run the algorithm separately, then import the data into the spreadsheet. For those who are expert in spreadsheet formulas, you could add all your lead data as additional columns on the first worksheet, then use a big spreadsheet formula to compute the forecast right there! Then you can quickly test different ideas.

The second worksheet can help you iterate more quickly to better results. To see how, suppose you added the formula described just above about phone numbers. While the specific guesses of “40%” and “20%” close-rates might be directionally sensible, it’s unlikely that those are exactly the right probabilities.

So, after implementing that forecast, you flip to the second worksheet. You’ll see that every guess of “20%” is grouped into a single row, as are the guesses of “40%.” You’ll also see, computed there, the *actual* close-rates for those two rows. Because those actuals, *whatever they are*, differ from the climatic average, they contribute to a higher discernment score, so that’s good. But the difference between your guess and those actuals contributes to your error score, so that’s bad.

The fix is simple: Just “steal” those actuals and put them into your formula! So maybe it should be “if phone number is blank, forecast 23%, else forecast 33%.” That update preserves your newfound discernment, but minimizes your error.
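The "steal the actuals" step can be sketched in Python as replacing each distinct guess with the close-rate observed in its bucket. The function name and the 200-lead data set are hypothetical, chosen so that the buckets come out at the 33% and 23% mentioned above.

```python
def recalibrate(forecasts, actuals):
    """Replace each distinct forecast with the observed rate of its bucket.

    Returns (new_forecasts, {old_forecast: observed_rate}).
    """
    groups = {}
    for forecast, actual in zip(forecasts, actuals):
        groups.setdefault(forecast, []).append(actual)
    observed = {f: sum(g) / len(g) for f, g in groups.items()}
    return [observed[f] for f in forecasts], observed

# Hypothetical leads: 100 with a valid phone (33 converted),
# then 100 without (23 converted).
forecasts = [0.4] * 100 + [0.2] * 100
actuals = [1] * 33 + [0] * 67 + [1] * 23 + [0] * 77

new_forecasts, observed = recalibrate(forecasts, actuals)
print(observed)  # {0.4: 0.33, 0.2: 0.23}
```

The leads stay in the same two groups, so discernment is unchanged, while each group's forecast now matches its actual close-rate.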

At this point, your total score should be less than the baseline. As they say in A/B testing, you have a new winner!

So now you’d continue iterating, trying more ideas for how to segment and forecast your leads. Sometimes you’ll have ideas you think are great, but it turns out they actually increase the total score compared with a simpler algorithm. That’s because sometimes being too fancy just means you incur lots of error.

Finally, if someday you get extremely sophisticated, you might use machine-learning to forecast. But you can still apply exactly the same model to answer the question: “Which algorithm is better?”
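Whatever produces the forecasts, the comparison can be scripted. When each distinct forecast value is treated as its own bucket and the climatic average is computed from the same data, the total score equals the plain average of squared differences between each forecast and its 1-or-0 outcome, so that is what this Python sketch computes. The function name and the twenty leads are hypothetical.

```python
def brier_score(forecasts, actuals):
    """Average squared difference between forecasts and 1/0 outcomes (lower is better)."""
    return sum((f - a) ** 2 for f, a in zip(forecasts, actuals)) / len(actuals)

# Hypothetical leads: 10 with a valid phone (4 converted),
# then 10 without (2 converted). Overall close-rate: 30%.
actuals = [1] * 4 + [0] * 6 + [1] * 2 + [0] * 8

baseline = [0.3] * 20                  # always guess the overall average
phone_rule = [0.4] * 10 + [0.2] * 10   # segment by phone validity

baseline_score = brier_score(baseline, actuals)
phone_score = brier_score(phone_rule, actuals)
print(round(baseline_score, 4), round(phone_score, 4))  # 0.21 0.2

winner = "phone_rule" if phone_score < baseline_score else "baseline"
print(winner)  # phone_rule
```

The baseline lands on 0.21, which is exactly the uncertainty for a 30% average, and the phone rule beats it by the 0.01 of discernment it adds.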

The sky’s the limit! Now go make some good forecasts!

**Further Study**

The mathematical model described here was invented by Glenn Brier in 1950. Since then, it’s been extended with different scoring functions, multi-variate outcome models, and more. Here’s some further reading:

- Brier’s original paper (just 3 pages; easy reading)
- Wikipedia on “Scoring Rules” generally
- A paper arguing for the Log() formula instead of the Brier formula, at least in the case of more than two options per trial.
- Stein’s Paradox – an estimator that’s always better than the historical average, but in a way that apparently can’t be true

Notes:

- Why do we *square* the errors instead of using something simpler like the absolute value of the difference? There are two answers. One is that squaring the differences intentionally exaggerates items which are *very* different from each other, and this is a useful metric. The other is that the mathematics of squared differences is much more tractable than using absolute value. Specifically, you can refactor/rewrite squared differences, and you can employ differential calculus, neither of which works with absolute value. Computing a linear regression line with least-squares, for example, is derived by using calculus to minimize the squared differences, but that same method cannot be applied to linear differences.