So you’ve got your AdWords test all set up: Will people go for the headline “Code Review Tools” or “Tools for Code Review”?

Gee they’re both so exciting! Who could choose! I know, I know, settle down. This is how these things go.

Anyway, the next day you have 32 clicks on variant A (“Code Review Tools”) and 19 clicks on B (“Tools for Code Review”). **Is that conclusive?** Has A won? Or should you let the test run longer? Or should you try completely different text?

**The answer matters.** If you wait too long between tests, you’re wasting time. If you don’t wait long enough for *statistically conclusive* results, you might *think* a variant is better and use that false assumption to create a new variant, and so forth, all on a wild goose chase! That’s not just a waste of time, it also prevents you from doing the *correct* thing, which is to come up with *completely new* text to test against.

**Normally a formal statistical treatment would be too difficult, but I’m here to rescue you** with a statistically sound yet incredibly simple formula that will tell you whether or not your A/B test results really are indicating a difference.

I’ll get to it in a minute, but I can’t help but include a more entertaining example than AdWords. Meet Hammy the Hamster, the probably-biased-but-incredibly-lovable tester of organic produce and star of a 1m30s movie.

In the movie, Hammy chooses the organic produce 8 times and the conventional 4 times. This is an A/B test, just like with AdWords… but healthier.

If you’re like me, you probably think “organic” is the clear-cut winner. After all, Hammy chose it *twice as often* as conventional veggies. But, as so often happens with probability and statistics, **you’d be wrong**.

That’s because human beings are notoriously bad at guessing these things from gut feel. For example, most people are more afraid of dying in a plane crash than a car crash, even though the latter is *sixty times* more likely. On the other hand, we’re amazed when CNN “calls the election” for a governor with a mere 1% of the state ballots reporting in.

Okay okay, we suck at math. So what’s the answer? Here’s the bit you’ve been waiting for:

*The way you determine whether an A/B test shows a statistically significant difference is:*

1. **Define N as “the number of trials.”** For Hammy this is 8+4 = **12**. For the AdWords example this is 32+19 = **51**.
2. **Define D as “half the difference between the ‘winner’ and the ‘loser’.”** For Hammy this is (8-4) ÷ 2 = **2**. For AdWords this is (32-19) ÷ 2 = **6.5**.
3. **The test result is statistically significant if D² is bigger than N.** For Hammy, D² is 4, which is not bigger than 12, so it is *not significant*. For AdWords, D² is 42.25, which is not bigger than 51, so it is *not significant*.

*(For the mathematical justification, see the end of the post.)*

So your AdWords test isn’t statistically significant yet. But what if you let the test continue to run? The next day you find 30 more clicks for variant A for a total of 62, and 21 more clicks for B for a total of 40. Running the formula: N = 62+40 = 102; D = (62-40) ÷ 2 = 11; D² = 121, which is bigger than 102, so now the measured difference *is significant*.
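The arithmetic so far can be double-checked with a few lines of Python. This is only a sketch of the rule of thumb as stated above; the function name `ab_significant` and its parameters are made up for illustration.

```python
def ab_significant(a_clicks: int, b_clicks: int) -> bool:
    """Rule of thumb from the text: significant when D^2 is bigger than N."""
    n = a_clicks + b_clicks             # N: total number of trials
    d = abs(a_clicks - b_clicks) / 2    # D: half the winner/loser difference
    return d * d > n


# The three examples worked in the text.
for label, a, b in [("Hammy", 8, 4), ("AdWords day 1", 32, 19), ("AdWords day 2", 62, 40)]:
    print(label, ab_significant(a, b))
```

Only the last of the three prints `True`, matching the hand calculations.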

A lot of times, though, you keep running the test and it’s still not significant. That’s when you realize you’re not learning anything new; the variants you picked are not meaningfully different for your readers. That means it’s time to come up with something *new*.

When you start applying the formula to real-world examples, you’ll notice that **when N is small it’s hard, or even impossible, to be statistically significant**. For example, say you’ve got one ad with 6 clicks and the other with 1. That’s N = 7, D = 2.5, and D² = 6.25, which is not bigger than 7, so the test is still inconclusive even though A is beating B six-to-one. Trust the math here: with only a few data points, you really don’t know anything yet.
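To get a feel for how lopsided a small test has to be, here is a minimal sketch that searches for the smallest winning count that passes the D² > N rule for a given N. The helper name `min_winner_clicks` is hypothetical, and the search simply applies the rule above by brute force.

```python
from typing import Optional


def min_winner_clicks(n: int) -> Optional[int]:
    """Smallest winner count (out of n total trials) with D^2 > N, else None."""
    for a in range((n + 1) // 2, n + 1):   # the winner has at least half the trials
        d = (a - (n - a)) / 2              # half the difference between winner and loser
        if d * d > n:
            return a
    return None                            # no split of n trials passes the rule


for n in (4, 7, 12, 51, 102):
    print(n, min_winner_clicks(n))
```

With four or fewer trials no split passes at all, and with N = 7 only a 7-to-0 sweep does, which is why the 6-to-1 example stays inconclusive.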

But what about the vast majority of people who don’t click either ad? Those are the “ad impressions” that didn’t lead to a click. Shouldn’t those count somehow in the statistics?

No, they shouldn’t; those are “mistrials.” To see why, consider Hammy again. That video was edited (of course), and a lot of the time Hammy didn’t pick either vegetable, opting instead to groom himself or sleep. (For the “outtakes” video and more statistics, see Hammy’s Homepage.) If Hammy doesn’t pick a vegetable during a particular trial run, it doesn’t mean anything — doesn’t mean he likes or dislikes either. It just tells us nothing at all.

Because the AdWords “click-through rate” is dependent both on the number of clicks and the number of impressions, **you must not use “click-through rate” to determine statistical significance**. Only the *raw number of clicks* can be used in the formula.

I hope this formula will help you make the right choices when running A/B tests. It’s simple enough that you have no excuse not to apply it! Human intuition sucks when it comes to these things, so let the math help you draw the right conclusions.


**For the mathematically inclined: The derivation**

The null hypothesis is that the results of the A/B test are due to chance alone. The statistical test we need is Pearson’s chi-square. The definition of the general statistic follows (where $m$ = number of possible outcomes; $O_i$ = observed number of results in outcome #$i$; $E_i$ = expected number of results in outcome #$i$):

$$\chi^2 = \sum_{i=1}^{m} \frac{(O_i - E_i)^2}{E_i}$$

In the simple case of the A/B test, $m = 2$. From a 50/50 random process, the expected values are $E_i = n/2$ where $n = O_1 + O_2$. Taking $A = O_1$ to be the larger of the two observed values and $B = O_2$ to be the smaller, the (unsimplified) formula is:

$$\chi^2 = \frac{(A - n/2)^2}{n/2} + \frac{(B - n/2)^2}{n/2}$$

The squared difference between $A$ and $n/2$ is the same as between $B$ and $n/2$ (because $A + B = n$), so we can replace those squared-difference terms by a new variable $D^2$. The definition of $D$ in the text above as $(A - B)/2$ comes by substituting $n = A + B$ into $D = A - n/2$. Rewriting in terms of $D$ and simplifying yields:

$$\chi^2 = \frac{D^2}{n/2} + \frac{D^2}{n/2} = \frac{4D^2}{n}$$

Now we have a simple way of computing the chi-square statistic, but we have to refer to the chi-square distribution to determine statistical significance. Specifically: What is the probability this result could have happened by chance alone?

Looking at the distribution with 1 degree of freedom ($B$ depends on $A$ so there’s just one degree of freedom), we need to exceed 3.8 for 95% confidence and 6.6 for 99% confidence. For my simplified rule-of-thumb purposes, I selected 4 as the critical threshold. Solving for $D^2$ completes the derivation:

$$\frac{4D^2}{n} > 4 \quad\Longrightarrow\quad D^2 > n$$

QED, suckkas!

P.S. Useful side-note: If $D^2$ is more than double $n$, you’re well past the 99% confidence level.
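For anyone who wants an actual probability rather than a yes/no answer, here is a minimal sketch of the derivation in Python. The function names are made up for illustration. It relies on one fact not stated above: for 1 degree of freedom, the chance of a chi-square value at least as large as x under the null hypothesis equals `erfc(sqrt(x / 2))`, so no statistics library is needed.

```python
import math


def chi_square_ab(a: int, b: int) -> float:
    """Pearson's chi-square for a two-outcome test with a 50/50 expectation."""
    n = a + b
    expected = n / 2
    return (a - expected) ** 2 / expected + (b - expected) ** 2 / expected


def p_value_1dof(chi2: float) -> float:
    """Probability of a chi-square value this large by chance (1 degree of freedom)."""
    return math.erfc(math.sqrt(chi2 / 2))


for a, b in [(32, 19), (62, 40)]:
    chi2 = chi_square_ab(a, b)
    print(a, b, round(chi2, 3), round(p_value_1dof(chi2), 4))
```

The statistic computed this way equals 4D²/n, and the second AdWords example lands below the 0.05 level while the first does not, in line with the rule of thumb.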