So you’ve got your AdWords test all set up: Will people go for the headline “Code Review Tools” or “Tools for Code Review”?

Gee they’re both so exciting! Who could choose! I know, I know, settle down. This is how these things go.

Anyway, the next day you have 32 clicks on variant A (“Code Review Tools”) and 19 clicks on B (“Tools for Code Review”). **Is that conclusive?** Has A won? Or should you let the test run longer? Or should you try completely different text?

**The answer matters.** If you wait too long between tests, you’re wasting time. If you don’t wait long enough for *statistically conclusive* results, you might *think* a variant is better and use that false assumption to create a new variant, and so forth, all on a wild goose chase! That’s not just a waste of time, it also prevents you from doing the *correct* thing, which is to come up with *completely new* text to test against.

**Normally a formal statistical treatment would be too difficult, but I’m here to rescue you** with a statistically sound yet incredibly simple formula that will tell you whether or not your A/B test results really are indicating a difference.

I’ll get to it in a minute, but I can’t help but include a more entertaining example than AdWords. Meet Hammy the Hamster, the probably-biased-but-incredibly-lovable tester of organic produce (click to watch 1m30s movie):

In the movie, Hammy chooses the organic produce 8 times and the conventional 4 times. This is an A/B test, just like with AdWords… but healthier.

If you’re like me, you probably think “organic” is the clear-cut winner — after all Hammy chose it *twice as often* as conventional veggies. But, as so often happens with probability and statistics, **you’d be wrong**.

That’s because human beings are notoriously bad at guessing these things from gut feel. For example, most people are more afraid of dying in a plane crash than a car crash, even though the latter is *sixty times* more likely. On the other hand, we’re amazed when CNN “calls the election” for a governor with a mere 1% of the state ballots reporting in.

Okay okay, we suck at math. So what’s the answer? Here’s the bit you’ve been waiting for:

*The way you determine whether an A/B test shows a statistically significant difference is:*

1. **Define N as “the number of trials.”** For Hammy this is 8+4 = **12**. For the AdWords example this is 32+19 = **51**.

2. **Define D as “half the difference between the ‘winner’ and the ‘loser’.”** For Hammy this is (8−4) ÷ 2 = **2**. For AdWords this is (32−19) ÷ 2 = **6.5**.

3. **The test result is statistically significant if D² is bigger than N.** For Hammy, D² is 4, which is not bigger than 12, so it is *not significant*. For AdWords, D² is 42.25, which is not bigger than 51, so it is *not significant*.

*(For the mathematical justification, see the end of the post.)*

So your AdWords test isn’t statistically significant yet. But what if you let the test continue to run? The next day you find 30 more clicks for variant A for a total of 62, and 19 more clicks for B for a total of 40. Running the formula: N = 62+40 = 102; D = (62−40) ÷ 2 = 11; D² = 121, which is bigger than 102, so now the measured difference *is significant*.
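If you’d rather compute than eyeball, the rule of thumb fits in a few lines of Python (a minimal sketch; the function name is mine):

```python
def is_significant(clicks_a, clicks_b):
    """Rule of thumb: the A/B difference is statistically significant
    (at roughly 95% confidence) when D squared exceeds N, where N is
    the total number of clicks and D is half the difference."""
    n = clicks_a + clicks_b
    d = abs(clicks_a - clicks_b) / 2
    return d * d > n

print(is_significant(8, 4))    # Hammy: D^2 = 4, N = 12 -> False
print(is_significant(32, 19))  # AdWords day 1: D^2 = 42.25, N = 51 -> False
print(is_significant(62, 40))  # AdWords day 2: D^2 = 121, N = 102 -> True
```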

A lot of times, though, you keep running the test and it’s still not significant. That’s when you realize you’re not learning anything new; the variants you picked are not meaningfully different for your readers. That means it’s time to come up with something *new*.

When you start applying the formula to real-world examples, you’ll notice that **when N is small it’s hard — or even impossible — to be statistically significant**. For example, say you’ve got one ad with 6 clicks and the other with 1. That’s N = 7, D = 2.5, and D² is 6.25, so the test is still inconclusive, even though A is beating B six-to-one. Trust the math here — with only a few data points, you really don’t know anything yet.

But what about the vast majority of people who don’t click either ad? That’s the “ad impressions” that didn’t lead to a click. Shouldn’t those count somehow in the statistics?

No, they shouldn’t; those are “mistrials.” To see why, consider Hammy again. That video was edited (of course), and a lot of the time Hammy didn’t pick either vegetable, opting instead to groom himself or sleep. (For the “outtakes” video and more statistics, see Hammy’s Homepage.) If Hammy doesn’t pick a vegetable during a particular trial run, it doesn’t mean anything — doesn’t mean he likes or dislikes either. It just tells us nothing at all.

Because the AdWords “click-through rate” is dependent both on the number of clicks and the number of impressions, **you must not use “click-through rate” to determine statistical significance**. Only the *raw number of clicks* can be used in the formula.

I hope this formula will help you make the right choices when running A/B tests. It’s simple enough that you have no excuse not to apply it! Human intuition sucks when it comes to these things, so let the math help you draw the right conclusions.


**For the mathematically inclined: The derivation**

The null hypothesis is that the results of the A/B test are due to chance alone. The statistical test we need is Pearson’s chi-square. The definition of the general statistic follows (where *m* = number of possible outcomes; *Oi* = observed number of results in outcome #*i*; *Ei* = expected number of results in outcome #*i*):

χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ

In the simple case of the A/B test, *m* = 2. From a 50/50 random process, the expected values are *Ei* = *n*/2 where *n* = *O1* + *O2*. Taking *A* = *O1* to be the larger of the two observed values and *B* = *O2* to be the smaller, the (unsimplified) formula is:

χ² = (A − n/2)² / (n/2) + (B − n/2)² / (n/2)

The squared difference between *A* and *n*/2 is the same as between *B* and *n*/2 (because *A*+*B* = *n*), so we can replace those squared-difference terms by a new variable *D*². The definition of *D* in the text above as (*A*−*B*)/2 comes by substituting *n* = *A*+*B* into *D* = *A* − *n*/2. Rewriting in terms of *D* and simplifying yields:

χ² = 4D² / n

Now we have a simple way of computing the chi-square statistic, but we have to refer to the chi-square distribution to determine statistical significance. Specifically: What is the probability this result could have happened by chance alone?

Looking at the distribution with 1 degree of freedom (*B* depends on *A* so there’s just one degree of freedom), we need to exceed 3.84 for 95% confidence and 6.63 for 99% confidence. For my simplified rule-of-thumb purposes, I selected 4 as the critical threshold. Solving for *D*² completes the derivation:

4D² / n > 4, which simplifies to D² > n

QED, suckkas!

P.S. Useful side-note: If *D*² is more than double *n*, you’re well past the 99% confidence level.
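For readers who want to verify the algebra numerically, the unsimplified chi-square statistic and the simplified 4D²/n form should agree exactly; a quick sketch (function names are mine):

```python
def chi_square(a, b):
    """Pearson's chi-square for a 50/50 A/B test: two outcomes,
    each with expected count n/2."""
    n = a + b
    e = n / 2
    return (a - e) ** 2 / e + (b - e) ** 2 / e

def chi_square_simplified(a, b):
    """The simplified form 4 * D^2 / n, with D = (a - b) / 2."""
    d = (a - b) / 2
    return 4 * d * d / (a + b)

# Both forms give the same statistic, so "D^2 > n" is exactly
# "chi-square > 4", just past the 3.84 needed for 95% confidence.
print(chi_square(62, 40))             # ~4.75
print(chi_square_simplified(62, 40))  # ~4.75
```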

## 121 responses to “Easy statistics for AdWords A/B testing, and hamsters”

Jason, great post. For those readers that enjoy this sort of pragmatic look at probability, I’d highly recommend this book: The Drunkard’s Walk: How Randomness Rules Our Lives. It’s a really fun read.

@Louis — Yes, anything that opens our eyes to the power of randomness and coincidence is useful in combating our *horrible* instincts.

Jason,

Nice post. So if I have n (>2) test variants, can I use Pearson’s chi-square to sort/rank them based on significance of difference? Do any other search engines use similar approaches?

Thanks

-James

I really enjoyed this post, thanks

@James — Two things:

(1) On “n>2 you can use chi-square”: no, you cannot. Here’s what you can do. Given *m* different possible “result buckets” and *n* trials, and with a known “expected number of results” per bucket (for an even distribution you expect *n/m* per bucket), you can then use chi-square *not to order* the buckets but rather to determine whether the *entire* distribution in the buckets is significantly different from what you’d expect from a random process.

That is, it does *not* order the buckets. It says whether the overall process “looks random” or not. Even if it “does not look random,” you still don’t have an ordering. You would need to do additional pair-wise chi-square tests to determine whether there’s any ordering among the various buckets.

*However, there be dragons* as well, because once you start “hunting” for statistical significance among various pair-wise groups where you don’t necessarily expect any relationship, it’s well-known that *you will find “significant” results which are in fact not significant*. See the literature on the F-test for more information.

So the bottom line is: Give up trying to “order,” and instead take the group in aggregate. Or, if you’re going to do pair-wise stuff, you need a *much higher* confidence factor than I gave here. 99.99 would not be out of the question for a duck-hunt like that.

(2) Other search engines: This technique works for *anything* that is a 50/50 A/B test. Even Hammy! So, “yes.”

Got it, thanks — “…instead take the group in aggregate.”

James

Your article has created a very interesting over-the-desk conversation about statistics and statistical significance in my office. And that’s always a good thing :)

I also like the way you’ve written the article: very readable and funny in the beginning, and then rational and explanatory. Impressed a little bit.

@Ina — Thanks! Yes it’s too easy to ignore stats and go with our gut; at the same time it’s hard to apply the rules properly because the assumptions are often hard to verify, and how much you can still trust the numbers after bending the assumptions is an art best left to the statisticians.

This is an excellent post. Yay numbers!

I’d like to make a PSA for the next step in assessing significance, "Effect Size". From Wikipedia:

The details are explained here and here, though neither as simply as Jason’s above explanation of significance. This page has a practical example.

The D^2 test above tells you whether you have made enough measurements to confidently assert you’ve teased out the answer. The Effect Size tells you whether that difference /matters/. A Google-sized sample can measure a statistically significant difference in click-through among 41 shades of blue. But I’ll eat my copy of _How to think about Weird Things_ if there’s an actionable effect-size difference among the top 10.

Anyway, what’s important isn’t the mathematical machinery of finding effect size. If the picture looks like this or better separated you’ve found an actionable difference. If it looks like this you should ignore the result or start reading the chapters at the end of your statistics textbook.

@Flip — Great information, especially turning everyone onto that Wikipedia page — I didn’t know they had a nice collection of "effect size related" things. Nice.

In the case of the A/B test, the given math *does* consider effect size because I am indeed doing the significance testing you’re talking about. However, as you say this is *not* the end of the story. Statistics don’t tell you answers, they just *characterize data*.

In particular, my test *really* merely suggests that “results can/cannot be just as easily explained by a random process as by a systematic process.” This is subtly different from “is definitely systematic.” Of course it’s much better than just guessing. :-)

A terrific example to support what you’re saying is Anscombe’s Quartet. Four data sets with *identical* stats — not just average and variance but even linear correlation. But clearly the data sets are radically different.

In short, stats are nice for guiding us and for generally characterizing data, but we still have to use our brains to really know what we have.

Which is a problem ’cause we’re all (all of us!) bad at that last part. :-)

One of the most useful articles I’ve read in some time. Thank you.

BTW, "QED, suckkas!" made my day. Cheers!

Jason, nice explanation and a great rule of thumb for marketers waiting for the A/B "pot to boil" when testing ads and landing pages with low CTRs.

Why do you use a chi-square test on the number of conversions instead of a two-sample t-test on the conversion rate?

@George, @Dale — Thanks, glad it was helpful.

@Albert — The t-test is the wrong test because it answers the question: What is the chance that these two independent sample distributions are drawn from the same population?

In this case we have *one* sample — the selection process — which has two discrete outcomes. Not two separate samples, and certainly not independent.

The two-sample t-test is used e.g. if we took a sampling of heights from men and another sampling of heights from women and wanted to know whether there was a significant difference between the means.

It might be clearer if you think of the A/B test as a series of coin-flips. We do a bunch of flips and measure heads and tails. The question is: Is the coin biased or not? There’s just one sample (i.e. the heads and tails) and one degree of freedom (because "heads" implies "not tails"). If the "coin" is fair — in the A/B test — it means there’s no difference between the results we’ve seen and a coin-flip, therefore there’s no reason to believe "heads" or "tails" is "favored." If the coin is "biased," there’s a systematic rule here that is significant.

See also my own comment above where I qualify this even more.

Good question though, thanks for asking!

Jason,

I thought the two-sample t-test was used to test if two population means were equal?

That’s what they say here: http://www.itl.nist.gov/div898/handbook/eda/section3/eda353.htm

Quoting: "A common application of this is to test if a new process or treatment is superior to a current process or treatment."

Isn’t that exactly what an A/B test is? One of your tests is the control treatment, and the other is your experimental treatment?

Jason,

I guess this is the way I think of it. You have two treatments, A and B. Let’s call A the control. You want to measure the conversion rate for some action (CTR on an ad, % of people purchasing, etc.).

This leads to two random variables: X_A and X_B. Your null hypothesis is that the means of these two random variables are equal.

Is this the wrong way of thinking about it?

@Albert — I think you’re right if you use rates rather than number of hits.

Usually you use the t-test only when the sample distributions are normally distributed, which in this case they are not. That is, if you’re sampling “click-through rate,” the possible result of each trial is a discrete “0” or “1” depending on whether there was a click. Clearly this is a two-valued discrete distribution — nothing resembling normal.

However, it’s also true that the t-test can also work even when the samples are not normally distributed when N is large (because we’re talking about standard error of the mean which is always normally distributed with large N). And in the case of AdWords and click-through rates, N is indeed typically large, like in the thousands.

As a (non-rigorous!) test of your method, I ran a t-test and the chi-square test using the following example: AdWords shows option A 2000 times and B 2000 times. A is picked 20 times (for a CTR of 1.0%) and B is picked 32 times (for a CTR of 1.6%).

Here the X^2 value is 2.77 which is significant at 90%. The t-test value is 1.68 which is also significant (two-tailed) at 90%.

So in this little but realistic example, the two tests come out identically. Probably not coincidence, but I’m not qualified to say! I’ll run this by some real statisticians and I’ll post what they say about it.

Under the assumption that they are identical, my rule-of-thumb is of course easier to compute. But I agree it’s important to know whether they are indeed identical.

Thanks again for this great discussion!
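One way to see why the two tests agreed here: for a 2×2 table (click / no-click for each variant), the Pearson chi-square statistic is exactly the square of the pooled two-proportion z statistic, which is what the t-test approaches at large N. A sketch (function names are mine; note this full-table chi-square gives about 2.81 rather than the 2.77 quoted above, which was computed from clicks only):

```python
import math

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """Pooled two-proportion z statistic for comparing two click-through rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p = (clicks_a + clicks_b) / (n_a + n_b)  # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

def chi_square_2x2(clicks_a, n_a, clicks_b, n_b):
    """Pearson chi-square over the full 2x2 table: click / no-click per variant."""
    total, clicks = n_a + n_b, clicks_a + clicks_b
    cells = [
        (clicks_a, n_a, clicks),
        (n_a - clicks_a, n_a, total - clicks),
        (clicks_b, n_b, clicks),
        (n_b - clicks_b, n_b, total - clicks),
    ]
    return sum((obs - row * col / total) ** 2 / (row * col / total)
               for obs, row, col in cells)

z = two_proportion_z(20, 2000, 32, 2000)
x2 = chi_square_2x2(20, 2000, 32, 2000)
print(abs(z * z - x2) < 1e-9)  # True: the two statistics are identical
```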

Good article, with one presentational grumble: Giving a simple yes/no “statistically significant” answer is too simplistic. Even for “a statistically sound yet incredibly simple formula” you need to state explicitly (as you do in the mathematical appendix) that this is statistical significance at slightly more than the 95% confidence level. This is a widespread problem in the non-technical discussion of sampling. Generally when journalists use statistical significance, they speak as though there is some magic threshold between statistical certainty and complete irrelevance, whereas (as usual) it’s actually a continuum with no single silver bullet.

@Flash — Good criticism; I agree.

That’s one reason I included the math at the end — there I was precise about what I mean. But you’re right that a mention of confidence level might have been better in the main text.

The balance is trying to get people to use *better* techniques than they currently do without turning them off with details.

Is it sad that people can’t be bothered to understand the details? Yes. But it’s reality, so in the interest of helping I did what I could. At least precision does exist in the bottom part.

Jason, good discussion with Albert.

I was thinking about the same issue over Easter weekend, and did the same thing as you — compared results for t-test vs. chi-square, and for the couple of examples I tried, I also got the same answer. So…I’m waiting to hear what some ‘real statisticians’ say!

@Dale — I asked informally and was told that it *ought* to come out to the same thing, roughly, but that if you had substantially different variances between the samples it would be different.

The theory is that the variances shouldn’t be different of course, but it’s not clear that it *must* be the same.

It does seem in the end that both work in this *particular* case, and that therefore the chi-square method is more practical just because it’s easier.

Great article. I must admit that the hamster drew me in and the great article was a nice bonus :)

@Kris — Thanks! Hamsters are awesome. Cute overload made me a convert. :-)

Could you provide a little more explanation regarding:

"Because the AdWords "click-through rate" is dependent both on the number of clicks and the number of impressions, you must not use "click-through rate" to determine statistical significance. Only the raw number of clicks can be used in the formula."

I don’t use AdWords, but I’m interested in A/B testing in general. Is this specific to AdWords, or a general comment about click-through rate?

Thanks.

@leon — That’s an AdWords-specific comment. For general A/B testing just ignore that and use the rule of thumb.

Hello Jason, first time on your site. This is a fabulous article and I think it’s very intuitive. Thank you for sharing it with us. We run campaigns from time to time on the 3 big ones and this will come in useful.

This seems to have a problem if you start a new test in the middle of the day. Since AdWords only reports daily clicks, the old ad would have click data and the new ad would have 0. Wouldn’t this almost immediately generate a confidence that the old ad with data was better, since the new ad would appear to have 0 clicks?

Rob

@Rob — If one ad runs for 24 hours and another runs for less time, of course you cannot compare the two numbers! That’s clear from an experimental point of view, but thanks for pointing out that AdWords doesn’t make that condition clear, so you have to be extra careful not to fall into that trap.

Also if one ad really does have zero clicks and the other has a large number, that’s a sign that your test (or data) might be completely broken.

P.S. The mathematics of having one test be zero is fine — you don’t get a divide-by-zero or any such thing — but unless it’s *really true* that there are zero clicks it probably means there’s a problem with the data.

Jason,

Any thoughts on when it is appropriate to run this test? As an example:

Start of test:

Old Ad 1000 impressions 100 clicks

New Ad (started at noon) 0 impressions 0 clicks

Midnight of day 1

Old Ad 2000 impressions 200 clicks

New Ad 100 impressions 50 clicks

So somewhere between noon and midnight I might start to want to see if we have a winner, but if I check too soon I may think I have a winner when I don’t, because of the small number of clicks from the new ad. What is a good way of telling when I should start checking?

Rob


@Rob — When you have an unequal number of impressions you *cannot use the test*. This only works when there’s a 50/50 chance the person sees either ad.

Generally having N as 100 or more is good.

@Jason: I’m afraid you’re making the typical rookie mistake in trying to interpret classical hypothesis testing as probabilities of the null hypothesis being true when you say (in a comment) "The t-test is the wrong test because it answers the question: What is the chance that these two independent sample distributions are drawn from the same population?"

The Wikipedia’s entry for statistical hypothesis testing provides a proper definition: "Assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the value that was actually observed?". The probability in this definition is the p value. Classical statisticians will "reject the null hypothesis" if the p value is low (typically p < 0.05 or p < 0.01), but note that they will never "accept the null hypothesis". The closest they get is not rejecting the null hypothesis.

The difference is subtle, but note that the test isn’t phrased as "what is the chance the null hypothesis is true" (or "what is the chance it’s false", from which we could easily calculate the probability of it being true). Classical statisticians get very upset if you try to interpret their tests as probabilities about hypotheses.

It’s easy to see that t-tests aren’t estimating the probability that the null hypothesis is true (or that it’s false), because the null hypothesis is a point, and the probability of a given point in a continuous density is always 0. You need to integrate over a continuous range of more than one point (that is, a set with non-zero measure), to get non-zero probabilities.

Most advice on classical chi-square tests also requires at least five entries per cell; that’s because they’re assuming independent normal variables which aren’t well approximated by arbitrary low-count data.

@Bob — Technically you are correct that the null-hypothesis (H0) is an exact number, i.e. "the means are identical." However in science that restriction is often lifted or blurred. For example, if your H0 says only that means are identical and doesn’t mention variances (a typical situation!), then in fact this is not a single point.

Also of course it’s true that the t-test is asking the probability that a result "as extreme" would have been observed at random.

In all this text I’ve tried to avoid preciseness when it interrupts comprehension by someone who isn’t schooled in the terminology and detail of statistics.

It *is* accurate to say that, in layman’s terms, one use of the t-test is to gauge whether two samples appear to come from the same population, and specifically the probability that a particular two samples could have been randomly drawn from one population.

Your rule of thumb of 5 items per cell is good; I would probably argue for even more items than that, mostly because people tend to put too much weight into results with small N, so I like to err on the side of letting the test run “too long” and getting significant results.

Thanks for your exact definitions though — it’s good for everyone who wants to get exactly the right treatment of the problem.

It sounds good for a small number of clicks. But let’s say clicks are in the hundreds or thousands, where D×D can easily exceed N even though the difference is relatively small. Does this formula still hold good?

A = 500

B = 400

n = 900

D = 50

DxD = 2500 > 900.

The actual difference in clicks is just 100, or ±25% (not enough to discard as statistically insignificant?).

Sorry, I am poor at maths; I couldn’t get those formulas.

@SEOIndia — Don’t apologize! This isn’t easy stuff.

Yes, the formula is still correct, and yes you’re right that with large numbers it’s easy to get significant results.

You *cannot* use things like ±25% to decide whether something is “significant.” If the numbers were A=5, B=4, this differs by a massive 20% (or 25% depending on how you count it) and yet it’s completely insignificant, because with so few trials it could easily be like that based on chance.

I totally agree with that.

If you are working with a very small data set, establishing statistical significance is generally not possible (or a good idea).

When you collect more data you will be able to conduct a much more accurate test.


@Rob — When you have an unequal number of impressions you cannot use the test. This only works when there’s a 50/50 chance the person sees either ad.

Generally having N as 100 or more is good.

Jason, what test would you use if the impressions are not of the same size? Or is it impossible to come up with any statistics as to effectiveness when the impression sizes are different?

@Matt — You can certainly do it when the number of impressions is unequal.

You use the same root statistical test — chi-square — you just can’t use my simplification. So take the original equation in the math section, and determine the "number of hits" and "expected number of hits."

So for example, if A is presented 70 times and B is presented 30 times, E1 would be 70, E2 would be 30, and then O1 and O2 would be the observed numbers for each of those. You grind out the value of χ², then use the chi-square distribution table I linked to near the bottom to determine the confidence level.
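As a sketch of how that grinding-out might look in code (the function is mine; note I compute each ad’s expected clicks from its share of impressions, which is the standard chi-square setup for unequal exposure):

```python
def chi_square_unequal(clicks_a, clicks_b, impressions_a, impressions_b):
    """Chi-square when the two ads received different numbers of impressions.
    Each ad's expected clicks are proportional to its impression share."""
    total_clicks = clicks_a + clicks_b
    total_impressions = impressions_a + impressions_b
    e_a = total_clicks * impressions_a / total_impressions
    e_b = total_clicks * impressions_b / total_impressions
    return (clicks_a - e_a) ** 2 / e_a + (clicks_b - e_b) ** 2 / e_b

# A shown 700 times with 20 clicks, B shown 300 times with 15 clicks:
x2 = chi_square_unequal(20, 15, 700, 300)
print(x2 > 3.84)  # False: below the 95% threshold (1 degree of freedom)
```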

This has got to be one of the best demonstrations of what can be such a complicated subject I have ever read. Good job Jason.


Regarding statistics of plane crashes versus car crashes, what are the survival rates? I don’t base my fear of flying on success rate, I base it on what % chance do I have of surviving if an accident happens. Usually the calculus is not in favor of the airplane.

My statistic of “60x more likely” is based on *death rate*, not accident rate. 60x more people *die* in cars than in planes. Of course, *assuming an accident has occurred*, you’re more likely to survive in a car, but that wasn’t the idea.

I thought a Student’s t was the standard mean-difference test, rather than Pearson. (For example, that’s what Mathematica uses in its function MeanDifferenceTest.) Does anyone have any information on the benefits of one over the other?


See comments above where this was already discussed.

No comment from a Bayesian yet?

Looks like you took a chapter from Super Crunchers and applied it to A/B testing. Excellent post, excellent book, too.

Nope, haven’t read the book, but now I’m intrigued! Thanks for the link.

Jason, I think you should point out that there is a minimum sample size you need before you can apply the chi-square test. As a rule of thumb, at least 5-10 samples are needed in each cell before the test should be considered for testing statistical significance. Any less than that, and you have a serious chance of committing a type II error (accepting the null hypothesis when you should be rejecting it).

See this for details.

Agreed, thanks for the pointer.

There’s a problem with your stats.

If you repeatedly check, “is it significant yet?”, and stop as soon as you think it is, you’re biasing the maths.

Consider a fixed-size sample, tested once at the end, which shows an A:B test is not significant. Now process the same data, but check after every new data point is added. “A”s and “B”s arrive randomly, but some of the random possible sequences of “A”s and “B”s have more “A”s at the beginning, only to be balanced by more “B”s later. This adds up to show a lack of significance at the end, but sometimes the checks *in the middle* will show “A” is significant and incorrectly stop the test.

This is comparable to the different distributions of a “drunkard’s walk” and a “drunkard’s walk near the edge of a cliff”. In the first case the drunk is equally likely to wander generally left, as right. But in the second case, he’s more likely to end up wandering towards – and probably over – the cliff, because once he falls he has no chance to wander back towards safety.
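The inflation Pete describes is easy to demonstrate with a small simulation (a sketch using a perfectly fair 50/50 process, so every “significant” result is by definition a false positive):

```python
import random

def peeking_vs_fixed(num_runs=2000, num_trials=500, seed=0):
    """Simulate fair 50/50 A/B tests. Count how often D^2 > N is crossed
    when checking after every trial ("peeking") versus checking only once
    at the end. The fixed check is a false positive roughly 5% of the time;
    peeking inflates that rate considerably."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(num_runs):
        a = 0
        ever_significant = False
        for t in range(1, num_trials + 1):
            if rng.random() < 0.5:
                a += 1
            d = abs(a - (t - a)) / 2
            if d * d > t:
                ever_significant = True
        peeking_hits += ever_significant
        d_final = abs(a - (num_trials - a)) / 2
        fixed_hits += d_final * d_final > num_trials
    return peeking_hits / num_runs, fixed_hits / num_runs

peek_rate, fixed_rate = peeking_vs_fixed()
print(peek_rate, fixed_rate)  # peeking flags "significance" far more often
```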

Fantastic point! This is indeed a fallacy that we need to be careful about.

However, A/B tests typically have to be done *in situ* rather than in a controlled environment. There are many variables that can also cause incorrect results, such as time of year, coincidence with holidays, or a big piece of press by you or a competitor.

So in practice you do just have to pick a point and measure. Your point is that you should pick a pre-defined N and go exactly that long, and I agree!

P.S. Although I agree with you generally, the drunkard’s walk is not IMO the right analogy, especially since with a cliff the probability of him falling is 100% as N tends to infinity.

Nice to see articles where you get into the “how-to” and not just the why. What did you use to generate your equations?

Microsoft Equation Editor. Comes with Word.

Jason, I quote: “The test result is statistically significant if D² is bigger than N.” I ask, what is the confidence level for this? 95%? 98%?

Read the bottom of the post where I state the confidence level among other details.

Here’s a dumb question, 6 months later:

Can I use this same approach to test conversion on two different landing pages? (Or is this test only valid for testing click-through on an impression?)

Let’s say that I’m buying media with an ad, and I send half the clickers to page A and the other half to page B. I care about the number of people who make a purchase.

If I send 1000 people to each page, and 10 make a purchase on A but 20 make a purchase on B, then N = 30, D = 5, and the test is NOT statistically significant (at 95% confidence interval).

Is this correct? If so, it’s a bummer, because most people that I know are not testing their landing pages sufficiently.

Yes that is correct, and no the test isn’t significant.

It’s an extremely common fallacy to stop testing even with small N. We say “Ooo look, one number is DOUBLE the other number, it must MEAN something,” but it’s just not true.

Here’s a good way to understand WHY a difference of 10 isn’t significant in the example you gave. The total number of people in the “study” is 1000 per side, but we expect only a small sliver of those people to act (e.g. between 1% and 2%). Because that number is so small, just based on random variation it could easily be that the same side that got “10” this time might get “15” or “5” next time, and similarly the “20” might easily be “25” or “15.” Notice that in both cases “15” is quite possible, hence there’s not enough of a difference.

Yet another way to see this is to run the same test again with a new 1000 people. Do you still get EXACTLY 10 and 20? If yes, THEN it’s significant (but of course just running with 2000/side would show that too), but I’m guessing the results wouldn’t be so similar.

I like this! It’s a simple tool that anyone can do in their head to get a measure on the strength of evidence in an A/B test. While it’s definitely a step in the right direction, I have two concerns.

First, as Pete Austin already pointed out, with this technique if you don’t pick an N initially you’re certain to misuse the tool. Since there is always a non-zero (and possibly large, even with chisq > 4 as a cutoff) chance of a false positive, as you check more and more you’re almost certain to get significance even if A is pretty much the same as B.

(I like the drunkard walk metaphor, though. If you check after every click, the chance of you finding significance goes to 1 as N goes to infinity even if H0 is true.)

Second, kind of related to Jason’s comment, your method disregards total impressions. Since the point of an A/B decision is to influence the visitor’s behavior, it’s important to have a context for how many times someone didn’t do something. 10 clicks for A and 7 for B definitely isn’t significant if there’ve been 1000 impressions on each page, but it might be if there were only 10.

In the name of simplicity, neither is really deadly. Pick an N, let it be pretty high, everything will go smoothly. I’m curious about wonderer’s comment though, so I did a little math/simulation toward a Bayesian technique. I’m going to do a double-check before commenting more.

I’m very much looking forward to the results of your simulation! Thanks in advance for sharing that with us.

However your comment about the drunken walk is completely wrong. What you’re thinking of is: The probability that the drunkard reaches any particular point X goes to 1 as time goes to infinity. That’s in no way relevant to an experiment where each trial is independent! Each of the drunkard’s steps is dependent on the steps that came before.

This experiment is more like flipping heads and tails. In that case it is NOT true that as N -> infinity significance will automatically appear.

But we do agree on a basic error with my formulation — that it suggests you “just keep going” rather than pre-select an N.

Jason – this is awesome stuff, but surely most of the time you wouldn’t have an equal number of impressions? So you couldn’t use your simplification? Usually with Adwords or landing page testing, you have a pre-existing version which has been running for a while and you want to test if a new version is better or not, right?

It’s not really like Hammy’s experience, because he always sees both carrots and has to choose one (or, as you say, do nothing, have a wash, whatever). So each carrot is, by definition, presented an equal number of times.

But with Adwords, you don’t present both ads at the same time and have the user choose between them. Some users see one ad and some users see another, and you probably won’t have an equal number of impressions for each ad. Especially if Adwords is following its default behaviour of ‘optimising’ your ad variants, i.e. showing more impressions of the better-performing variant, giving your new test variant less exposure.

So in most Adwords situations you’d have to churn through the full formula, I think, which is a lot harder: after a good hour or so, even with the help of Excel’s CHITEST formula, I still haven’t quite got my head around how to do the arithmetic from scratch.
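For what it’s worth, the arithmetic behind Excel’s CHITEST for this case is just the standard 2×2 chi-square: build a clicked/didn’t-click table per ad, compute expected counts from the row and column totals, and sum (observed − expected)²/expected. A sketch (function and variable names are mine; 3.84 is the 95% critical value for one degree of freedom):

```python
def chi_square_2x2(clicks_a, imps_a, clicks_b, imps_b):
    """Chi-square statistic for a 2x2 table of clicks vs. non-clicks.
    Works with unequal impression counts for A and B."""
    table = [
        [clicks_a, imps_a - clicks_a],  # ad A: clicked, didn't click
        [clicks_b, imps_b - clicks_b],  # ad B: clicked, didn't click
    ]
    total = imps_a + imps_b
    col_totals = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    row_totals = [imps_a, imps_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# e.g. 30 clicks from 1200 impressions vs. 18 clicks from 800
stat = chi_square_2x2(30, 1200, 18, 800)
print(f"chi-square = {stat:.3f}, significant = {stat > 3.84}")
```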

(I know you have addressed this in comment 37, just thought it maybe needed highlighting.)

These are great points, and you’re right to be wary. Answers:

RE: “Won’t be equal number of impressions.” With standard A/B testing the algorithm typically toggles back and forth with essentially equal impressions. For large N it should be fine.

RE: “AdWords stops showing worse ads.” This is a very real issue and absolutely negates everything said here. To use this technique with AdWords you must disable the “heat seeking” display algorithm.

RE: “Not one person choosing.” Correct again. However in this case we’re considering every person to be “identical enough” to every other person, so that trials across people are the same as multiple trials with one person. Which, of course, is an assumption in which you could poke lots of holes! Probably the validity of that assumption depends on the specific scenario and market segment you’re hitting. It’s safe to say that this implies that small, statistically-insignificant differences are therefore even less likely to be interesting.

Jason – firstly thanks very much for a brilliant post. Shows how good it is when people are still discussing it a year after the original contribution!

My colleague and I have used this as a pointer to develop what we should be assessing our current site A/B tests with, given a known A/B split as a decimal (i.e. 70/30 would be 0.7 and 0.3, etc). It’s a little more complicated than the above rule of thumb, but still surprisingly elegant. :)

However (and I know there has been some discussion of this above), my problem is deciding on a large enough N to aim for.

For example, if we are A/B testing two different versions of a page and want to count the number of clicks on commercial links on each version, I’m finding it hard to figure out how many impressions or indeed total clicks will make the test valid depending on the split. Is there a statistically sound way of deriving this?

Or is the only way to continually monitor the ‘yes/no’ validity of the results and keep a note of when the frequency of ‘yes’ exceeds a certain limit?

Many thanks,

Tim.

Actually neither! You have to pick an N ahead of time. See comment 56 above and my response to it.

You’re right that the further you get away from 50/50 the more N you’ll need in general. To see why, consider a 1/1000 split. Since you’re showing one side just 1/1000th of the time, small changes in actual “hits” there will make massive changes in the results.

Unfortunately I don’t have a good answer for you as to how much N is enough. You might check out the resources listed by Flip in this comment. Also Flip is a nice guy and loves data and math, so you might even reach out to him directly!

This is a great article and a great debate. However, I haven’t found an answer to my question (maybe because there are so many comments :):

I have an A/B test of an ecommerce product detail page, and the number of visits seeing version A is not exactly the same as the number seeing version B. We label each visit with a unique session id, but for some reason this sometimes skews in favor of A.

This is why I instead used a contingency table to calculate the true correlation between A/B version and goal completions – and calculated chi-square manually. I thought of your formula as a simplification of cross-tabs anyway, when you assume A and B have a 50% chance each.

But is my calculation really correct? If I have, for instance, 186000 visits seeing A with 2900 conversions, and 182000 visits seeing B with 2800 conversions, then chi-square is 0.2368, i.e. the difference is NOT significant?

So … the conversion rate for B differs from A’s by about 0.02 percentage points, but this is due to chance alone and we cannot draw any conclusions from that…?
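The commenter’s table can also be checked with the standard shortcut formula for a 2×2 chi-square, χ² = n(ad − bc)² / ((a+b)(c+d)(a+c)(b+d)). A sketch with the counts quoted above (I get roughly 0.26 rather than 0.2368, presumably a rounding difference, but either way it is far below the 3.84 cutoff):

```python
def chi_square_shortcut(a, b, c, d):
    """Shortcut chi-square for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )

# A: 2900 conversions, 183100 non-conversions (186000 visits)
# B: 2800 conversions, 179200 non-conversions (182000 visits)
stat = chi_square_shortcut(2900, 183100, 2800, 179200)
print(f"chi-square = {stat:.3f}")  # well under 3.84: not significant
```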

Good question. You are right that you can’t use my simplified formula when the trials aren’t distributed evenly between A and B.

Your resulting statement is correct — it’s not significant.

Also you should never say something like “B conversions are X% higher than A.” The reason: The percentage by which one is higher than the other is not enough to know whether it’s significant — what matters is the amount different and the total number of trials. This is what the stats are doing for you; the proportion bigger isn’t relevant.

Hi Jason,

I am no statistical expert but I want to challenge your notion that impressions do not matter in this case, because I think the experiment you set up with Hammy is not analogous to a PPC situation. The reason you can take CTR out of the equation in the Hammy experiment is because there is an equal number of impressions for either carrot, since they are shown simultaneously. In PPC, only one ad is served at a time. Let’s say that the serving rate is not 50/50 for whatever reason and Google serves one ad more often than the other. How would only the raw number of clicks matter?

My bad, I’m sorry: someone already asked this question, and I was under the impression that the newest posts were at the top, not the bottom!