Question

Topic: Strategy

Email: To Split Test Or Not To Split Test?

Posted by JESmith on 500 Points
Do you think split testing of email marketing campaigns is worthwhile for small lists? On a recent webinar, the speaker said they did not consider split testing statistically relevant unless their client had a list size of 50,000, and that instead they encouraged their clients to focus on best practices.

Thoughts?

RESPONSES

  • Posted by arthursc on Accepted
    That speaker seems to have been an idiot. Of course you always want to use best practices, testing or not. And, btw, testing is a best practice.

    It all comes down to what is a statistically valid sample, and the answer for me is not that complex.

    In direct mail, the rule of thumb was always 5M per split or panel initially (M being the direct-mail shorthand for thousand), and then for re-testing (which in any channel one must almost always do), if the results were iffy, stick at 5M, or if the results looked promising, up the count to 10M, 15M, or the like. Now, that assumes a certain minimal total list quantity, but nonetheless it's common practice. That said, if I had a list of 5-6M total, I would still test! I'd split to 2.5M per panel. Of course the smaller sample means the data is somewhat less reliable, depending on your response rates. Bottom line: a test panel needs to garner enough responses to be actionable.


    In email, conceptually it's really the same, although the best practice in email testing, simplistically, is to take some small samples (see below) as test panels, email 'em, and roll out the remainder with the winning panel. Many companies for various reasons cannot do that, and so resort to standard A/B or A/B/C testing, depending on quantity.

    But the problem in email is that responses are almost always lower than in direct mail (ROI may be higher, of course). So using smaller test panel quantities could be problematic. But not necessarily.

    Say you've got a total list of only 5M, but you know you need to find out what day of week, or subject line, or offer works best for the audience. You intuitively sense that an A/B split of 2.5M per panel won't get you enough responses to guide you.

    But wait! Maybe you'd do well and get 1% total, or 50 responses or orders. What if one panel got 40 and the other 10? I'd say that's telling me something! Should you roll out? Not necessarily. It's still a small sample, and it could be a fluke. You'd be better served repeating the test if at all possible. Get similar results, then roll out.

    But what if the test panel results are so close that within a margin of error they are a tie? (That could happen, of course, in a huge emailing as well.) Perhaps it means that whatever you're testing didn't really matter to the response. That's actionable. But perhaps it actually means that the sample was too small to be meaningful.
    How do you know which of those conclusions to draw? You have to test again!

    So the bottom line is that even large test panels often require a re-test, and really small samples almost always require a re-test, especially if the results are close.

    Don't skip testing because the quantity is small; just be aware of how to review and massage the data, know in what circumstances the data is actionable, and always be prepared to retest.

    But finally, testing below 2.5M per panel is not likely to be useful.




  • Posted by ajanzer on Accepted
    There's an online tool for looking at whether split testing of online ads is statistically significant, which I imagine you could apply to email campaigns as well.

    https://www.splittester.com/

    For ad impressions, I'd substitute number of emails, and for click-through rate, substitute response rate. Disclaimer: I'm not a statistician, nor have I played one on TV.
  • Posted by SteveByrneMarketing on Accepted
    In agreement with the above posters - test, test, test.

    I usually test the subject line (since in the scientific method you can only test one element at a time), and I base the test on offer structuring: either an (a) offer and a similar (b) offer, or the same offer stated in two different ways.

    For example, "buy one, get one free" will always out perform a "50% off" offer.

    If you would like to discuss offer structuring, see my profile for contact info.

    Good luck,

    Steve
  • Posted by wnelson on Accepted
    Jono,

    Oh, for gosh sakes, there's no need for any "rule of thumb" answers here! Let's use math instead of gut feel. Statistical validity is a precise matter that involves four things: universe size, sample size, confidence level, and margin of error. For a lower margin of error and/or a higher confidence level, you need more samples. The acceptable confidence level and margin of error is a business decision, pure and simple. The trade-off is the value of accuracy versus the cost of sampling (cost in dollars and time). Additionally, the use of the information plays into this decision. If you use the data to make a decision and that decision can tolerate +/-25%, then you don't need a confidence level of more than 75% for the sampling.

    Commonly, a confidence level of 95% is used with a margin of error of 5%. How you interpret that: if you had a test where 60% of the people answered "Yes," then the sample results indicate that we are 95% confident that the true proportion of the population that would answer "Yes" is between 55% and 65%.

    If you had 150 respondents for each split and the difference in taking the desired action was greater than 10% (say for email A, 55% clicked and for email B, 45%), you'd be 95% sure that A is better than B. 150 is a lot different than 50,000, huh? Nor does it take millions.
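
    If you'd rather script that check than eyeball it, here's a minimal Python sketch of the two-proportion comparison (standard library only; the 55%-vs-45%-with-150-each example above is assumed, and results are approximate):

    ```python
    # Two-proportion z-test: how confident are we that email A's rate beats email B's?
    # Minimal sketch using only the standard library; numbers are the hypothetical
    # 55% vs. 45% example with 150 recipients per split from the post above.
    from math import sqrt, erf

    def confidence_a_beats_b(p_a, n_a, p_b, n_b):
        """One-tailed confidence that proportion A is genuinely higher than B."""
        p_pool = (p_a * n_a + p_b * n_b) / (n_a + n_b)          # pooled proportion
        se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error of the difference
        z = (p_a - p_b) / se
        return 0.5 * (1 + erf(z / sqrt(2)))                     # standard normal CDF at z

    print(f"{confidence_a_beats_b(0.55, 150, 0.45, 150):.1%}")  # roughly 96%
    ```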

    You can use a calculator like the one at: https://www.dimensionresearch.com/resources/calculators/ztest.html.

    As results come in, you can recalculate and see what the confidence level is. When it's high enough, you have your results. Sample sizes per split don't have to be the same.

    I hope this helps.

    Wayde
  • Posted by Linda Whitehead on Accepted
    You have received some really great, detailed responses to your question. In my opinion, without question, you should always test regardless of your list size. As Wayde says, you will always get useful information that will help you make decisions. We wrote an article about our experience with email testing at my former company that you might want to check out: https://blog.adbase.com/2008/10/email-testing-your-guide-to-better-response....
    Also, there is something else I thought you might want to be aware of. There is one ESP that has split-testing capabilities built in: MailChimp (www.mailchimp.com).
    "MailChimp can run an A/B test on a tiny, random sample of your list, analyze your results, then automatically optimize your campaign and send it to the remainder of your list for you. Patent pending."
    I have not used their service, but it looks good and, as far as I have heard, has a good reputation.

    Linda Whitehead
    Zuz Marketing
  • Posted by Jay Hamilton-Roth on Accepted
    Mathematically speaking, you do need a large sample size to perform a statistically valid split test. Using a much smaller sample will produce results, but they might not accurately scale up. Wayde and Arthur have given you great explanations why.

    However, if you don't have a large sample size, then you still get data that matters to your business: what works better (A vs. B). The data won't necessarily scale up (from a sample size of 500 to 50000), but that's not your real-world problem. You're just trying to optimize the data/actions you have.
  • Posted by wnelson on Member
    Jono,

    One thing I didn't point out is that the samples should reflect the mix of the population. If your samples are drawn randomly and closely reflect the make-up of your entire population, then you can indeed scale up a sample as small as 150 for each split and be 95% certain that the results would reflect those of the population within +/-5%. That's the way the statistics work out. Of course, if +/-5% gives you too much error (for example, if the two samples' results overlap once you add the +/-5%, so they look the same), then more sampling is needed.

    But if you take a sample of 150 each and the results are 20% responding for email A and 40% responding for email B, then you can be 95% sure that email B is better. However, if you want to make an inference about what the results of email B would be if you sent it out to your whole population (say, 1,000,000), then you could only say that with 95% confidence the population response rate would be between 32% and 48% (an 8% margin of error). If you want to remedy this so that your inference is within +/-5%, then you'd want 384 samples in each. If you want +/-1% (at a 95% confidence level), you'd need 9,513 in each. If you want 99% confidence with no more than a 1% error margin, you need 16,317 in each.
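
    If you want to reproduce those sample sizes yourself rather than trust a calculator, here's a rough Python sketch of the standard formula with a finite population correction (worst-case 50% proportion assumed; small differences from any given calculator come down to rounding):

    ```python
    # Required sample size to estimate a proportion within a given margin of error,
    # with a finite population correction. Worst-case proportion p = 0.5 assumed,
    # which is what most online sample-size calculators default to.
    from math import ceil

    Z = {0.95: 1.959964, 0.99: 2.575829}   # two-sided z values for common confidence levels

    def sample_size(margin, confidence=0.95, population=1_000_000, p=0.5):
        n0 = Z[confidence] ** 2 * p * (1 - p) / margin ** 2   # infinite-population size
        return ceil(n0 / (1 + (n0 - 1) / population))         # finite population correction

    print(sample_size(0.05))           # ~384 per split for +/-5% at 95% confidence
    print(sample_size(0.01))           # ~9,513 for +/-1% at 95%
    print(sample_size(0.01, 0.99))     # ~16,317 for +/-1% at 99%
    ```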

    However, within the boundaries of the error margin and confidence level, you certainly can scale up results from samples as small as 150 to a population of a million, OR to the entire group of respondents in the entire universe... IF the sample is random and reflects the population of the universe (make sure you include enough ETs).

    Wayde
  • Posted by JESmith on Author
    This is a fascinating thread, and I've learned a ton. Thanks to everyone for chiming in.

    Wayde: Thanks for introducing margin of error into the discussion. If I'm understanding you correctly, it sounds like I can use a calculator like this to determine the margin of error for running tests on my email list?

    https://www.americanresearchgroup.com/sams.html

    So for example, if I have an email list of 500 people, and send an email campaign with Subject Line A to a random sample of 250 people, and Subject Line B to a random sample of the other 250 people, and the difference in the open rates is greater than the 5% margin of error, I've got a valid test?

  • Posted by wnelson on Member
    Jono,

    Not quite. Your test would indeed be valid with 250 samples in each. In fact, with a 95% confidence, you could detect differences of +/-4% with 250 in each sample. However, the calculator you linked above will allow you to test one sample and give you the sample size that you need in order to ensure your sample results reflect the population (as long as the members in the sample are reflective of the population, drawn randomly, etc).

    The one I thought I sent (but it didn't work, I discovered) will let you measure your margin of error and confidence level for two samples as results come in; when you achieve the desired margin of error, you can stop. I think this link will work for you:

    https://www.dimensionresearch.com/resources/calculators/ztest.html

    But let me help you think ahead! After you discover that the two samples are different - one having a higher open rate - you will want an estimate of that open rate. If you have 128,063 or more in your population, you will need 384 in each group for the sample results to reflect the population results within +/-5% at a confidence of 95%. 250 would give you a 6.19% margin of error. If this is OK, then go with it; otherwise, use 384. And for single-sample tests, use this calculator instead of the one you posted:

    https://www.raosoft.com/samplesize.html
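
    Or, if you'd rather script the margin-of-error check than click through a calculator, a quick sketch (normal approximation with the worst-case 50% proportion assumed):

    ```python
    # Margin of error for a proportion at 95% confidence, worst case p = 0.5.
    from math import sqrt

    def margin_of_error(n, p=0.5, z=1.96):
        return z * sqrt(p * (1 - p) / n)

    print(f"{margin_of_error(250):.2%}")   # ~6.2%, as noted above
    print(f"{margin_of_error(384):.2%}")   # ~5.0%
    ```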

    Does that make sense?


    Wayde
  • Posted by matthewmnex on Accepted
    Dear All, This thread has really stirred things up :)

    I hope you don't mind if I chime in :)

    I tend to agree with everyone here :)

    Split testing is an essential and powerful tool but when the sample size is too small, the results can be anecdotal at best.

    The problem is that the behaviour of users can be so fickle, depending upon the weather, what is in the news, or whether it is Thursday :))

    What I now do in order to get a more accurate read is to tag everyone in the DB based on the date the contact was acquired and add this 'batch number' to my tracking.

    This way I can see open, click, and conversion rates compared against the age of the contact.

    Fresher contacts tend to be more active than older ones but even contacts 2 years old can still perform well.

    If I send a mail to a small set one day and then send the exact same mail again the next day, the results can be markedly different if the size of the testing set is too small. That is why I agree to a certain extent with the person who gave the seminar. A DB of less than 50K is very small.

    If you hit a 20% open rate and a 20% click-through rate, you still have only 2,000 clicks, and the same 50K base will perform very, very differently from day to day, depending of course on the content and purpose of the mail.

    I would tend to focus more on comparing acquisition cost (per record) against revenue generated per record and let that be my guide with respect to the frequency and type of mails I send.

    Split testing - yes, I would probably do some, but I would add a large pinch of salt to the results and not waste too much time trying to be scientific with such a small DB. Use your gut feeling and experience as a marketer to decide what type of message you want to send to them, and good luck :)
  • Posted by wnelson on Member
    Matthew talks about fickleness. By his own definition of fickle - "users can be so fickle depending upon the weather, what is in the news, whether it is Thursday" - the behavior termed fickle is random. I assure you that statistics fully accounts for randomness when we calculate sample sizes and determine the inferences that can be made. Statistics is based on the random nature of processes and in fact won't work if the processes are not random.

    Let's look at it this way: If you have four kinds of responses:
    1) People who find the subject line interesting and open
    2) People who find the subject line uninteresting and don't open
    3) People who accidentally open the email but don't find the subject line interesting
    4) People who find the subject line interesting but don't open

    3 and 4 are errors. The errors are random. If we take a sample of 150 people, and the chance of error for 3 and 4 is the same across the entire population, then some people will open today because they find the subject interesting and tomorrow will not open it because of random occurrences that stop them. The same goes for the people who don't open today because the subject line is uninteresting and then tomorrow accidentally open it. So a random sample of 150 people today will reflect the same results tomorrow, even if a given person behaves differently today versus tomorrow; another person will behave differently and make up for the random behavior.

    The statistics that I explained in the above postings work - provided the sample is drawn randomly and the behaviors causing error in the sample are random. In this case, I believe this to be true. Our gut reaction is that it takes a lot of information to understand the population. Indeed, more data means the sample results more closely reflect the population (e.g., 95% sure you are within +/-5% of the population results). But the statistics tell us how far off we are. Whether we invest more in sampling to be more precise depends on the value of the data and how we use it versus the cost of more sampling (time and money). If you have forever and an unlimited budget, sample forever. If you don't, then sample until you reach a diminishing return in the cost of sampling versus precision.

    Wayde


  • Posted by Gary Bloomer on Accepted
    Dear Prodmktguy,

    Regardless of the number of people on your list, the statistical relevance of any group of people and their likelihood to respond to an offer comes down to how the copy (what you're saying) of that offer speaks to the recipient and his or her problem or need.

    My considered opinion here is that the speaker you listened to on your webinar is a half-wit. Harsh? Yes. But an honest opinion nevertheless.

    You don't say how large your list is, but let's assume it's 5,000 people. This isn't a list of faceless beings, it's a group of friends, neighbors, and people you pass on the street. They walk, they talk, they all have a heartbeat, and, presuming you pulled this list together yourself and that it's not a bought list, these people are on your list for a reason: to find a solution to a felt need or pain point.

    Let's say this pain point is a deeply felt desire to rid their backyards of deer ticks, but that your data also shows that everyone on your list is an animal rights supporter, and that they all raise koi carp and keep bees.

    If you split your list in two and send half your list a great solution that insists on use of a harsh but effective chemical to rid their gardens of ticks, a chemical that's also highly toxic to fish and honeybees, and you send the other half of your list a brilliantly crafted offer to stamp out ticks with a wee box of tricks that gets plugged into the mains fifty feet away, what will your response rates be from both mailings?

    Had you NOT tested, and sent out the first mailing, your marketing money would have scored a big fat zero and your PR department would have some explaining to do. Not good.

    The whole point of testing is distillation and refinement, so that you can see what's working and what's not, and so that you can make the finest use of your marketing budget. The other reason for testing, even if it's for a small list, is so that you don't waste money telling an unreceptive audience about something they are simply not interested in, or that they violently object to.

    The point is to build relationships that you can then serve, and one of the best ways of doing this is to test, review, rework, refine, and retest.

    Don't forget, you are not sending a mass mailing to thousands of people. You are having one conversation with one person about their need. Get inside that person's head. Talk to that person about the things that are important to them. Cater to that need and offer astonishing value and you'll gain trust, credibility, and people's confidence. But you'll only gain these things if you approach the person's pain point from their perspective, and the best way to do this is to refine the elements of what you're saying until they cut straight to the solution of that need.

    I hope this helps.
  • Posted by JESmith on Author
    Needless to say, I didn't take a stats course in college, so I want to make sure I have this right.

    Let's say I have 15,818 people on my email list. I'm emailing a fundraising appeal to my lapsed donors. Version A is an appeal with no images, version B is an appeal with images.

    I send version A to a random mix of 7902 people and 692 people click on my donate button.
    I send version B to a random mix of 7916 people and 582 people click on my donate button.

    Assuming I'm using the calculator (https://www.dimensionresearch.com/resources/calculators/ztest.html) correctly, it looks like I can have a 99.9% confidence rating that Version A is going to outperform version B every time.

    Did I get that right?

    Thanks :)
  • Posted by wnelson on Member
    Good job! You did well! Yes, that's right. Now if you want to infer how well option A would do if you sent out an email to 128,063 or more people, you could use:

    https://www.raosoft.com/samplesize.html

    and you'd be able to say, with 95% confidence, that the population open rate for Option A would be 8.76% +/- 1.1%.
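
    If you ever want to double-check that 99.9% figure without the calculator, here's a rough Python sketch of the underlying two-proportion z-test using the counts you reported (standard library only):

    ```python
    # Two-proportion z-test on the reported click counts: 692/7902 vs. 582/7916.
    from math import sqrt, erf

    sent_a, clicks_a = 7902, 692
    sent_b, clicks_b = 7916, 582
    p_a, p_b = clicks_a / sent_a, clicks_b / sent_b
    p_pool = (clicks_a + clicks_b) / (sent_a + sent_b)            # pooled proportion
    se = sqrt(p_pool * (1 - p_pool) * (1 / sent_a + 1 / sent_b))  # SE of the difference
    z = (p_a - p_b) / se
    confidence = 0.5 * (1 + erf(z / sqrt(2)))                     # one-tailed confidence
    print(f"z = {z:.2f}, confidence that A beats B = {confidence:.2%}")   # ~99.9%
    ```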

    Wayde
  • Posted by JESmith on Author
    Wayde - One step forward, one step back.

    I'm not sure I understand how you used www.raosoft.com/samplesize.html to determine, with 95% confidence, that the population open rate for Option A would be 8.76% +/- 1.1%.

    Please advise?
  • Posted by wnelson on Member
    Not a problem. In the third box down, enter the population size as 1,000,000 and leave everything else the same. At the bottom of the page, under "Alternate Scenarios," change "100" to 7902. Underneath, you'll see 1.1%. This is the margin of error at 95% confidence. Your proportion for option A is 8.76% opened (692/7902). So from this sample, you can say with 95% confidence that the population average for "opens" would be 8.76% +/- 1.1%, for populations of 1,000,000 or higher.
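
    For what it's worth, that 1.1% appears to be the conservative worst-case margin the calculator reports (it assumes a 50% response distribution by default); a quick sketch to reproduce it:

    ```python
    # Margin of error at 95% confidence for a sample of 7,902, assuming the
    # calculator's conservative default of a 50% response distribution.
    from math import sqrt

    n, p, z = 7902, 0.5, 1.96
    print(f"{z * sqrt(p * (1 - p) / n):.2%}")   # ~1.10%
    ```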

    Wayde
  • Posted by Gary Bloomer on Member
    So, 15,818 people on your list.

    A fundraising appeal to lapsed donors. Version "A" has no images, version "B" has images.

    Based on your findings and your calculations, you have a 99.9% confidence rating that Version A (NO IMAGES) is going to outperform version B every time, right?

    Er? I say, is that based on just ONE mailing? If so, then alas dear chap. No. I think not. In fact. Wrong. Here's why:

    Many people will say that figures never lie. But what these people forget is that figures can be used to skew results and fudge assumptions and presumptions. In non-profit funding, assume nothing until the cheques are cashed! THEN count your success.

    In a rock solid belief in figures there is great danger.

    If you're sending a mailing to people who have attended a shindig of some kind at your non-profit (black tie event, opening, dance, etc.), they're more inclined to respond positively to your mailing WITH images when those images show people like them - meaning people in their social group, economic background, and so on - all having a jolly time.

    It's human nature, old boy: birds of a feather and all that.

    Likewise, if you're sending a mailing related to a holiday or a recognized day (Halloween, July 4th, The Queen's Birthday, whatever), and that mailing shows people at previous Halloween events, again, the recipient will see those images and then see themselves doing that thing and taking that action, because now, they have a frame of reference from the images you've included.

    Mix your message and image ratio, relate those images to an event, a theme, or a previous experience (where the recipient had a good time, or where someone they know who is pictured had a good time), and send smaller mailings to fewer people, and you'll get different results.

    When you send your mailing matters just as much as what's in it and what the mailing shows, or what pictures the copy paints in people's heads.

    I hope this opinion doesn't throw things out too much, but it's worth bearing in mind that one ought never to presume that one's figures do not lie.

    To quote Benjamin Disraeli (in a line later popularized by Mark Twain): "There are lies, damned lies, and statistics."

    Good luck to you.

    Gary Bloomer
    WILMINGTON, DE, USA
    mr.garybloomer


  • Posted by wnelson on Member
    There's no doubt about it, statistics are the root of all evil. In the earliest Christian, Islamic, Jewish, and Buddhist texts, when evil is mentioned, percentages are given - the writers were 95% confident that more than 85% of all statisticians were evil.

    I believe there was a discussion posted on RockProfs by the cavemen that fire was bad, too. After all, if used carelessly, it could hurt you. If used recklessly, it could destroy entire civilizations. And it could be used maliciously. But while the wary cavemen huddled in the cold, those who learned how to use fire correctly and understood all the necessary cautions were inside, toasty warm.

    Statistics has been used as a tool for decision making and risk reduction in just about every scientific advancement. If it weren't for statistics, we probably wouldn't have made it to the moon - at least not in nine years, and not without losing a few dozen men in the process. But then, there are those cavemen out there who don't believe we made it to the moon anyway. The PC on which you are reading this probably couldn't have been made without the use of statistics - or at least it wouldn't be as small and inexpensive as it is. And the internet over which you receive this signal: telecommunications is based on statistics.

    Like fire, however, statistics is only a tool. The definition of statistics is: the mathematics of the collection, organization, and interpretation of numerical data. Statistics does not replace science, reason, or common sense. We've all heard the quip: if I put one hand in a bucket of ice and the other in a bucket of boiling water, then on average my body is at 50 degrees C. There are a few laws of physics busted by that statement.

    Taking a "survey" by drafting a bunch of questions, sending them out to a large number of prospective respondents, and using the resulting statistics is folly. There are design guidelines for survey questions, and only after a multi-step process of focus testing and piloting would any knowledgeable and respectable marketing researcher release such a survey. Data from the survey would then be carefully analyzed for many parameters, including cross-correlation, bias, etc.

    And such a survey would rarely be the only instrument for research. Marketing research has two branches: quantitative and qualitative. We've been discussing the quantitative branch. Within quantitative research, instruments like surveys and the associated statistics are generally based on attribute data. Attribute data is Yes/No, or "pick a rating out of seven choices." While less expensive to administer per data point, it carries less information per response, so more samples are required for statistically significant results. The other kind of quantitative data is variable data. Variable data is measured - like speed or length. It's hard to come up with anything meaningful to measure in an email marketing campaign that would be variable data, but if we could, we'd need a much smaller sample size than with attribute data - somewhere around 70 at most.

    As for the other side of marketing research, the qualitative side, this branch relies on one-on-one testing. Think focus groups. Using many different methods, direct feedback and observation result in a more complete understanding. As mr.garybloomer suggested, a conversation with just a few customers can be worth more than the results from a million survey respondents, because you can ask what you want to ask and then, if you need clarification - the "why" - you can ask that too. Your answers result in a more complete picture. Qualitative data is especially important early on. Conducting a series of focus groups can help formulate the email "rules" based on the results.

    Two trade-offs to qualitative research, however: first, it's expensive; second, because of this, it's difficult to cover enough people to be sure that the individuals in the test are representative of the population. Generally, qualitative research is used to develop hypotheses and quantitative research is used to prove them. So we'd do focus group testing initially, develop some ideas, and then prove those ideas with a survey across a broad sample. In the case of an email where there are content questions - proportion of pictures and kinds of images, wording of calls to action, etc. - the test may be split into many more than two panels so that you can look at these aspects. Another area of statistics, experimental design, would be employed, and the sample sizes of each "split" would be determined that way.

    However, not all things have to be tested statistically again and again. Using "best practices" is usually the best route, because presumably those best practices have already been proved, and testing them again is a waste of resources. So start by knowing and using best practices, and then if there are practices you don't know about, test. mr.garybloomer and others have discussed many best practices.

    If the question here was really "How can I improve my open rate," the answer is to conform more closely to best practices. As it was, the question was whether split testing is statistically relevant if the population is less than 50,000. Statistical relevance is "testable" and means something very precise, and provided that all the conditions are met, the split test can be statistically relevant. As for the other part of the webinar presenter's statement - "use best practices" - as someone else has said, why wouldn't you use best practices regardless of split testing?

    Note that when I discussed statistical inferences, I used precise language to state the results: "With 99.9% confidence, sample A has a higher open rate than sample B." Further, with 95% confidence, the true population open rate will be 8.76% +/- 1.1%. I was also clear about conditions: the sample has to be drawn randomly and be representative of the overall population. "Randomly" has specifiers - think "putting 15,000 names in a hat and drawing them." As for how well the sample represents the population, there are statistical tests for this.

    However, it all comes back to how valuable the data is. In general, you will statistically test until the cost of the data is equal to its worth. In economic-speak, when the marginal value of data is zero, you stop testing. Also, if your entire population is 500 people (or maybe even 15,000), look at the possibility of having a conversation with each of them versus "testing." If the population is 100,000 or 1,000,000, then this is not so possible.

    So this should be more than enough to completely polarize the world into the "good" and the "evil" side and maybe even convert a few to the evil side in preparation for the big battle. Or at the very least, help some insomniacs.

    Wayde
