Topic: Research/Metrics

Margin Of Error In Split Testing

Posted by Anonymous on 500 Points
We're running some split tests on our web sites where half of the audience sees Version A of a page and the other half sees Version B. We count how many clicks each version generates and the one with the most clicks wins. For example:

Version     Views   Clicks   CTR
Version A   4,550   2,010    44%
Version B   4,510   1,750    39%

We'd like to stop running the test as soon as we're reasonably sure that we've got a winner but I don't really know when that is. How do we go about determining if the test results are conclusive and meaningful?

Thanks for any assistance!


  • Posted by Frank Hurtte on Member
    You forgot to say how many days of data this is...
    assuming it is over a period of a couple weeks, i would say you have a winner.
  • Posted by wnelson on Accepted

    You have a two-part question - but you didn't realize it. The first part, which you asked, is: "What should my sample size be for this test?"

    If you go to the Sample Size Calculator website, you can find this:

    The parameters you are setting are:
    1) Population - the number of people in the world who will be seeing your website. Let's assume that your population is "everyone in the world." So, if we use a very large number, say, 1,000,000, we will calculate the maximum sample size needed.
    2) Confidence - this is how sure you are going to be that the results of your sample reflect the true population. The higher the number, the larger the sample size. This is the "certainty" of the results. Customarily for most marketing work, 95% confidence is ample. The default in the website is set at 0.95.
    3) Margin - This is how much error you are willing to allow. If you allow 5% error, that means that in a sample size of 100, if the results are 50 clicks, the true number of clicks could be between 45 and 55 clicks. This is the "precision" of your test. The smaller this number, the larger the sample size.
    4) Probability - this is the value of the result you expect to get. For instance, if you expect to get 50 clicks out of 100 views, this value is set at 0.5. Of course, most of us don't know this value. But the good news is that setting it at 0.5 yields the largest sample size. If the number of clicks is 20 or 80, the confidence increases for the same margin or the margin decreases for the same confidence. And this is a good thing.

    Sample size, of course, determines cost of the test. In your case, this is time. If we use the parameters of a population of 1,000,000 with 95% confidence for 5% margin of error with a probability of 50%, then the sample size is 385. For a 1% margin, it's 9517. With about 4,500 views per version, you are at about 1.45% now. That means that you are within about 65 clicks of the true population result. Additionally, assuming the numbers of views are pretty close, this means that your test can tell the difference between the two versions as long as the results differ by more than 65 clicks.
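
    As a rough sketch of the arithmetic behind the sample size and margin figures above, the usual normal-approximation formulas can be coded as follows. The function names are made up for illustration, the 4,530 views per version is an assumed average of the two view counts, and the formula is an assumption about what the calculator does, so its output need not match the calculator's figures exactly.

```python
import math
from statistics import NormalDist


def z_for(confidence):
    """Two-sided critical z value (roughly 1.96 for 95% confidence)."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def sample_size(population, confidence, margin, probability=0.5):
    """Approximate sample size with a finite-population correction.

    Assumed formula: n0 = z^2 * p * (1 - p) / margin^2, then
    n = n0 / (1 + (n0 - 1) / population).
    """
    z = z_for(confidence)
    n0 = z ** 2 * probability * (1 - probability) / margin ** 2
    return n0 / (1 + (n0 - 1) / population)


def margin_of_error(n, confidence, probability=0.5):
    """Margin of error (as a proportion) for a sample of size n."""
    return z_for(confidence) * math.sqrt(probability * (1 - probability) / n)


# Parameters from the post: population 1,000,000, 95% confidence, 5% margin.
n_5pct = sample_size(1_000_000, 0.95, 0.05)

# Assumption: about 4,530 views per version (average of 4,550 and 4,510).
views_per_version = 4530
moe = margin_of_error(views_per_version, 0.95)
clicks_band = moe * views_per_version  # margin expressed in clicks

print(f"sample size for 5% margin: about {n_5pct:.0f}")
print(f"margin at {views_per_version} views: {moe:.2%}")
print(f"that is roughly +/- {clicks_band:.0f} clicks")
```

    Setting probability to 0.5 is the conservative choice described in item 4: any other value shrinks p * (1 - p) and so shrinks the required sample.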

    The second part of the question - that you didn't ask - is how do I determine if the two results are really different. For this, you do another test - a Chi-squared test. You have a hypothesis that the two results are the same versus the alternative that they are different. For the test, we look at the observed values versus "expected values." Expected values are what we'd get if we added the total clicks for both tests and divided by total views and then multiplied by the views for each:

    3760/9060 = 0.415
    Expected values
    4550 * .415 = 1888
    4510 * .415 = 1872

    For each version, square the difference between the observed and expected clicks, divide by the expected value, and add the two results:

    (2010 - 1888)^2 / 1888 = 7.883
    (1750 - 1872)^2 / 1872 = 7.951

    Sum is 15.834. Using a Chi-square table like at:

    If we set our confidence at 95%, we use 1 - 0.95 = 0.05. Our degrees of freedom are calculated by subtracting 1 from the number of proportions in our test: 2-1=1. So, for 95% confidence, we test the value of 15.834 against 3.841. Since 15.834 is larger, we reject our hypothesis that the two results are the same and accept the alternative hypothesis that they are different - with a much greater than 95% confidence.
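
    The steps above can be sketched in a few lines of Python. This is only an illustration: the variable names are invented, and the second statistic is a swapped-in textbook variant, the Pearson chi-square on the full 2x2 table, which also counts the viewers who did not click. The walkthrough above sums only the click cells.

```python
# Observed data from the question.
views = {"A": 4550, "B": 4510}
clicks = {"A": 2010, "B": 1750}

# Pooled click-through rate: total clicks / total views (about 0.415).
pooled = sum(clicks.values()) / sum(views.values())

clicks_only = 0.0  # sum over the click cells only, as in the walkthrough
full_2x2 = 0.0     # textbook variant: click and no-click cells together

for version in views:
    exp_click = views[version] * pooled
    exp_no_click = views[version] - exp_click
    obs_click = clicks[version]
    obs_no_click = views[version] - obs_click

    click_term = (obs_click - exp_click) ** 2 / exp_click
    no_click_term = (obs_no_click - exp_no_click) ** 2 / exp_no_click

    clicks_only += click_term
    full_2x2 += click_term + no_click_term

# Table value for 95% confidence and 1 degree of freedom, as quoted above.
CRITICAL_95_DF1 = 3.841
significant = full_2x2 > CRITICAL_95_DF1

print(f"clicks-only sum:     {clicks_only:.3f}")
print(f"full 2x2 chi-square: {full_2x2:.3f}")
print(f"different at 95%?    {significant}")
```

    The clicks-only sum comes out a little under 15.834 because the expected values are not rounded here. The full-table statistic is larger still, so either way the result clears 3.841 comfortably.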

    So, as Nelson (nelsonm) stated in about 20% of the space, you're done. Looks like version A is the best with a greater than 95% confidence.

    I hope this helps.

  • Posted on Member
    A statistical answer based on margin of error alone won’t be enough. Statistically, you appear to have enough numbers in your sample size for good confidence to claim superiority for Version A.

    But without knowing further the context of your study and your business, you may or may not need to continue. It may be possible for you to gather that much data from one afternoon of posting the study.

    For all studies, you should also consider:
    - Duration (how long you’ve had the study going).
    - Period and cycles (is the study conducted at a time of high, medium, or low activity based on time of day, season, budget cycles). In some businesses, this is a major factor as different population sizes and demos are present during different times: education, government, retail.

    Extend your study to cover longer durations and different periods in order to increase your external validity (the ability of your study to be generalized to other populations).

    Happy marketing.
  • Posted by wnelson on Member
    Actually, if Version A and Version B are being tested at the same time, and I assume they are from your description "we're running some split tests on our web sites where half of the audience sees Version A of a page and the other half sees Version B," then the periodicity and duration don't matter. They are the same for both versions and hence you can conclude you have a winner between the two. To extend that to say something about what will happen over a year period, yes, you have to take into account the timeframe and periodicity of the website. But, that's not what you were asking. The statistics presented above are indeed enough to distinguish if the versions are different and which one is better. :)

  • Posted by wnelson on Member
    For large sample sizes, the z-test is OK as an approximation to the Chi-square test. In fact, the z value squared will be nearly equal to the Chi square value. It's either/or. Whichever you prefer. The complexity of the math for the calculations is about the same. That being the case, I usually don't bother with the approximation.

    z = ((x1/n1) - (x2/n2)) / sqrt(p*(1-p)*(1/n1 + 1/n2)), where p = (x1+x2)/(n1+n2)

    Test that against the z statistic from the table:
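
    For anyone who would rather compute than look up a table, here is a minimal sketch of the z calculation using the numbers from the question. The function name is made up for illustration, x is clicks and n is views, and the critical value is taken from Python's standard normal distribution instead of a printed table.

```python
import math
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic.

    x1, x2 are click counts; n1, n2 are view counts.
    """
    p = (x1 + x2) / (n1 + n2)  # pooled click-through rate
    standard_error = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / standard_error


# Version A: 2,010 clicks from 4,550 views; Version B: 1,750 from 4,510.
z = two_proportion_z(2010, 4550, 1750, 4510)

# Two-sided critical value for 95% confidence (about 1.96).
z_critical = NormalDist().inv_cdf(0.975)

# Two-sided p-value: chance of a gap at least this large if A and B were equal.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, critical value = {z_critical:.2f}")
print(f"two-sided p-value = {p_value:.2e}")
print("different at 95% confidence" if abs(z) > z_critical else "not conclusive")
```

    A z value well beyond 1.96 in either direction points the same way as the Chi-square result: the two versions differ at better than 95% confidence.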

    When I gave the answer, I converted clicks to click-through rate. A step beyond click-through is "conversion rate" where you measure the result in the proportion of viewers who take some action - like registering or buying. Statistics are the same here - just a different measurement. With respect to this test, if you are measuring the pages for how effective they are at driving a person to click to go to another page, and then take an action, if you can assume that the two pages have no effect on whether the viewer takes action on the next page - i.e. the page under test and the next page with the action are independent, then click-through is adequate. I would be skeptical of this assumption, however. Whatever information is prompting someone to click probably affects their likelihood to take action. However, it would be very odd and rare for a page to yield a higher click-through and then a lower conversion rate, since you'd think the same thing that would cause them to click through would probably support an action. So, if the click-through rate is higher for version A, I'd believe the conversion rate for version A would be greater than that of version B.

    So, in conclusion, I still believe you are done. Version A is better than version B with a confidence higher than 95% and I believe it's pretty safe to assume the conversion rate will be better with this version.

  • Posted by steven.alker on Member
    That’s three brilliant and spot on answers from Wayde with the statistical analysis.

    That is assuming that your statement that half the audience sees site A and half sees site B is 100% true and that the viewing is mutually exclusive. Have you indeed ensured that your audience is partitioned and that neither has seen both sites?

    If they have, especially if there was a mechanism for respondents to see site A before Site B without your knowledge (a lot of ifs here, but the figures are close on a confidence rating which is high) then the only thing which could skew your results would be the possibility that your audience responses would gain a bias by viewing both sites. That would introduce an error due to the subjective appreciation of the questions and willingness to click through after viewing the second site.

    If you carried out a simple poll based on the same question, presented in two different ways by two pollsters separated such that their polled samples were unaware of the other pollster and unable to be asked by the second pollster then the same analysis of the results would hold true.

    If some of the polled sample were able to be questioned by both pollsters, but only in the order of pollster A followed by pollster B (say it was a one-way street!) then there would be reluctance from that % of the sample who had already been polled by pollster A to answer the question put by pollster B.

    If through your test methodology you are able to guarantee that all respondents to site B had no knowledge of or access to site A, then your answers would not suffer from any subjective bias and your result stands as Wade has proven.

    Nice piece of work – I wish that more Emarketers would utilise valid mathematical techniques rather than Dan Brown style, symbolic gibberish!

    Steve Alker
    Unimax Solutions
  • Posted on Author
    Thanks to all for your responses. This is exactly the kind of information we were looking for.

    Jeff James
  • Posted by wnelson on Member
    If this were a search engine and keyword question, that may be pertinent. However, this question involves two identical pages and the counting of a desired action - a click. One take on this is that the "click" is the desired action in and of itself. The other is that perhaps there is another page that has some sort of desired action - registration, purchase, whatever. If the situation is the first, then conversion rates are not of interest, since the act of the "click" is the conversion. If it is the second, then there are two pages that route the viewer to the same page, on which the viewer is faced with a conversion decision.

    The SE argument for the relationship to keywords and clicks is that keywords influence rank and higher rank means more clicks. In some instances, lower position keywords generate higher conversion rates and the thinking is that perhaps this is because lower position keywords are more specific. If the searcher inputs the lower position keywords, they are serious about the search and are a higher potential buyer. In the following article, the author uses an example of "running shoes" and "low cost running shoes". While the latter will rank lower, the searcher is specifically looking to buy, whereas there are many more reasons to search on "running shoes" than a purchase decision.

    This article also shows some of this:

    This article concludes that high rank means high CTR, but the conversion rates go up as the rank declines. HOWEVER: The actual amount of business goes down (product of clicks times conversion rate).

    Both articles conclude that in general, SE optimizers should strive for keywords that yield high click through and also content that is good and promotes conversion.

    While this is interesting and a big concern in search engine design, its relevance to this question is not clear. Jeff's experiment appears to be reviewing content, not keywords. And most experts see a direct correlation between content and conversion rates - as in this article:

