Gee they’re both so exciting! Who could choose! I know, I know, settle down. This is how these things go.
Anyway, the next day you have 32 clicks on variant A (“Code Review Tools”) and 19 clicks on B (“Tools for Code Review”). Is that conclusive? Has A won? Or should you let the test run longer? Or should you try completely different text?
The answer matters. If you wait too long between tests, you’re wasting time. If you don’t wait long enough for statistically conclusive results, you might think a variant is better and use that false assumption to create a new variant, and so forth, all on a wild goose chase! That’s not just a waste of time, it also prevents you from doing the correct thing, which is to come up with completely new text to test against.
Normally a formal statistical treatment would be too difficult, but I’m here to rescue you with a statistically sound yet incredibly simple formula that will tell you whether or not your A/B test results really are indicating a difference.
I’ll get to it in a minute, but I can’t help but include a more entertaining example than AdWords. Meet Hammy the Hamster, the probably-biased-but-incredibly-lovable tester of organic produce and star of a 1m30s movie.
In the movie, Hammy chooses the organic produce 8 times and the conventional 4 times. This is an A/B test, just like with AdWords… but healthier.
If you’re like me, you probably think “organic” is the clear-cut winner — after all Hammy chose it twice as often as conventional veggies. But, as so often happens with probability and statistics, you’d be wrong.
That’s because human beings are notoriously bad at guessing these things from gut feel. For example, most people are more afraid of dying in a plane crash than a car crash, even though the latter is sixty times more likely. On the other hand, we’re amazed when CNN “calls the election” for a governor with a mere 1% of the state ballots reporting in.
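Okay, okay, we suck at math. So what’s the answer? Here’s the bit you’ve been waiting for:
Define N as the total number of trials, so N = A + B. Define D as half the difference between the winner and the loser, so D = (A − B) ÷ 2. The test result is statistically significant if D² is bigger than N.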
(For the mathematical justification, see the end of the post.)
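Running the formula on Hammy’s 8-to-4 preference: N = 8+4 = 12; D = (8-4) ÷ 2 = 2; D² = 4, which is well short of 12, so Hammy hasn’t proven anything. Running it on the AdWords numbers: N = 32+19 = 51; D = (32-19) ÷ 2 = 6.5; D² = 42.25, which is smaller than 51. So your AdWords test isn’t statistically significant yet. But what if you let the test continue to run? The next day you find 30 more clicks for variant A for a total of 62, and 21 more clicks for B for a total of 40. Running the formula: N = 62+40 = 102; D = (62-40) ÷ 2 = 11; D² = 121, which is bigger than 102, so now the measured difference is significant.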
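If you’d rather let a script do the arithmetic, here is a minimal Python sketch of the rule of thumb. The function name `ab_test_is_significant` is my own invention for illustration, and the only inputs are the raw counts for each variant.

```python
def ab_test_is_significant(count_a: int, count_b: int) -> bool:
    """Rule of thumb from the text: significant when D squared exceeds N."""
    n = count_a + count_b              # N: total number of trials
    d = abs(count_a - count_b) / 2     # D: half the winner/loser difference
    return d * d > n


# The examples discussed in the text.
examples = [
    ("Hammy, organic vs. conventional", 8, 4),
    ("AdWords, day one", 32, 19),
    ("AdWords, day two", 62, 40),
]

for label, a, b in examples:
    verdict = "significant" if ab_test_is_significant(a, b) else "inconclusive"
    print(f"{label}: {a} vs. {b} is {verdict}")
```

Only the day-two AdWords numbers come out significant; Hammy and day one are both inconclusive.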
A lot of times, though, you keep running the test and it’s still not significant. That’s when you realize you’re not learning anything new; the variants you picked are not meaningfully different for your readers. That means it’s time to come up with something new.
When you start applying the formula to real-world examples, you’ll notice that when N is small it’s hard, or even impossible, to reach statistical significance. For example, say you’ve got one ad with 6 clicks and the other with 1. That’s N = 7, D = 2.5, and D² = 6.25, so the test is still inconclusive, even though A is beating B six-to-one. Trust the math here: with only a few data points, you really don’t know anything yet.
But what about the vast majority of people who don’t click either ad? Those are the “ad impressions” that didn’t lead to a click. Shouldn’t those count somehow in the statistics?
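To get a feel for how lopsided a small test has to be, you can turn the rule around. D² > N is the same as (A − B)² > 4N, so the winner needs a lead of more than 2√N. The sketch below finds the smallest winning count that passes for a given N; the helper name `min_winner_count` is hypothetical and just for illustration.

```python
from typing import Optional


def min_winner_count(n: int) -> Optional[int]:
    """Smallest winner count A (with B = n - A) for which D squared exceeds N.

    Uses the equivalent integer test (A - B)**2 > 4 * N to avoid fractions.
    Returns None when no split of n trials can be significant.
    """
    for a in range((n + 1) // 2, n + 1):
        b = n - a
        if (a - b) ** 2 > 4 * n:
            return a
    return None


for n in (4, 5, 7, 12, 51, 102):
    a = min_winner_count(n)
    if a is None:
        print(f"N = {n}: no split is significant")
    else:
        print(f"N = {n}: need at least {a} vs. {n - a}")
```

With N = 7 the winner has to sweep all seven trials, and with N = 4 or fewer nothing can pass, which is the “or even impossible” case.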
No, they shouldn’t; those are “mistrials.” To see why, consider Hammy again. That video was edited (of course), and a lot of the time Hammy didn’t pick either vegetable, opting instead to groom himself or sleep. (For the “outtakes” video and more statistics, see Hammy’s Homepage.) If Hammy doesn’t pick a vegetable during a particular trial run, it doesn’t mean anything — doesn’t mean he likes or dislikes either. It just tells us nothing at all.
Because the AdWords “click-through rate” is dependent on both the number of clicks and the number of impressions, you must not use “click-through rate” to determine statistical significance. Only the raw number of clicks can be used in the formula.
I hope this formula will help you make the right choices when running A/B tests. It’s simple enough that you have no excuse not to apply it! Human intuition sucks when it comes to these things, so let the math help you draw the right conclusions.
For the mathematically inclined: The derivation
The null hypothesis is that the results of the A/B test are due to chance alone. The statistical test we need is Pearson’s chi-square. The definition of the general statistic follows (where m = number of possible outcomes; Oᵢ = observed number of results in outcome #i; Eᵢ = expected number of results in outcome #i):

$$\chi^2 = \sum_{i=1}^{m} \frac{(O_i - E_i)^2}{E_i}$$

In the simple case of the A/B test, m = 2. From a 50/50 random process, the expected values are Eᵢ = n/2 where n = O₁+O₂. Taking A = O₁ to be the larger of the two observed values and B = O₂ to be the smaller, the (unsimplified) formula is:

$$\chi^2 = \frac{(A - n/2)^2}{n/2} + \frac{(B - n/2)^2}{n/2}$$

The squared difference between A and n/2 is the same as between B and n/2 (because A+B = n), so we can replace those squared-difference terms by a new variable D². The definition of D in the text above as (A–B)/2 comes by substituting n = A+B into D = A – n/2. Rewriting in terms of D and simplifying yields:

$$\chi^2 = \frac{D^2}{n/2} + \frac{D^2}{n/2} = \frac{4D^2}{n}$$

Now we have a simple way of computing the chi-square statistic, but we have to refer to the chi-square distribution to determine statistical significance. Specifically: What is the probability this result could have happened by chance alone?
Looking at the distribution with 1 degree of freedom (B depends on A so there’s just one degree of freedom), we need to exceed 3.8 for 95% confidence and 6.6 for 99% confidence. For my simplified rule-of-thumb purposes, I selected 4 as the critical threshold. Solving for D² completes the derivation:

$$\frac{4D^2}{n} > 4 \quad\Longrightarrow\quad D^2 > n$$
P.S. Useful side-note: If D² is more than double n, you’re well past the 99% confidence level.
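If you want the actual probability rather than a yes/no answer, here is a rough Python sketch, assuming the same 50/50 null hypothesis as the derivation. The function names are my own. It relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal, so the tail probability can be computed with `math.erfc` from the standard library instead of a lookup table.

```python
import math


def chi_square_ab(count_a: int, count_b: int) -> float:
    """Chi-square statistic for a 50/50 A/B test, computed as 4 * D**2 / n."""
    n = count_a + count_b
    if n == 0:
        raise ValueError("need at least one trial")
    d = (count_a - count_b) / 2
    return 4 * d * d / n


def p_value_1dof(chi2: float) -> float:
    """Probability of a statistic at least this large by chance (1 degree of freedom)."""
    return math.erfc(math.sqrt(chi2 / 2))


chi2 = chi_square_ab(62, 40)
print(f"chi-square = {chi2:.3f}, p = {p_value_1dof(chi2):.4f}")

# Rule-of-thumb threshold (chi-square = 4) and the P.S. case (D squared = 2n, chi-square = 8).
print(f"p at chi-square 4: {p_value_1dof(4):.4f}")
print(f"p at chi-square 8: {p_value_1dof(8):.4f}")
```

The threshold of 4 corresponds to a p-value a little under 0.05, and a statistic of 8 (D² equal to double n) gives a p-value under 0.01, which is consistent with the P.S. above. The day-two AdWords result lands at roughly 0.03.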