When discussing multivariate testing, the most common question is, “What sample size is needed for credible test results?” The question often comes from business-to-business marketers who deal with smaller customer sets. The surprising answer is that the sample size can be a lot smaller than you might think.
The reason is that there is no single number, no standard “ideal size” for credibility. The best answer to the question is, “It depends.” It depends on whether you are using multiple test groups of equal size or two groups of unequal size, for example a larger test group and a smaller control group. It depends on your tolerance for errors. It depends on the amount of unexplained variation (or “noise”) in your test. In fact, it depends on several parameters in your experimental test.
Depending on the type of test you’re conducting, there are up to four types of numbers you need to know to calculate the necessary sample size: 1) power, or how much you can rely on the test to detect differences between groups if they really exist; 2) the size of the difference that you need to detect; 3) significance, the level of error you can tolerate if you make the wrong conclusion about the effectiveness of the treatment you are testing; and (in some cases) 4) standard deviation, the amount of noise or variance in the results within a group. These measures are explained in the Definitions section below. Given those numbers, you can calculate the sample size.
Rather than do the calculation, many marketers conducting a test use a large sample set, hoping that it is big enough to get reliable results, yet not so big that the cost will be excessive. If you want to be more precise and understand the relationship of these numbers, study the examples at the end. If you just want to calculate a number for your experiment, skip to the Calculation section of this paper.
Power is analogous to the resolution of a microscope. The higher the power, the smaller the features that can be distinguished, so power is a helpful measure of how much you can rely on the test to detect differences in the results from different groups. Power is one minus the probability of concluding that something doesn’t matter when it actually does. Sometimes, when sample size is fixed, you calculate power to see the capability to differentiate. More commonly, sample size is adjusted to achieve the desired power level. Ideally, you want the power of your test to be as high as possible, and an acceptable rate for most studies is somewhere in the neighborhood of 90%. However, when dealing with small sample sizes and marketing campaigns, there is still some information to be gained from tests with lower power.
The results difference that you wish to detect will usually be the difference between the response results from your test groups. Suppose the average order sizes from four test groups were $250, $250, $250, and $255. If you need to tell whether that five-dollar difference in the fourth group is meaningful, then the results difference you want to detect is five. This is an extreme example; campaign results typically will have a wider range of differences and so it will be easier to detect them. You need to have sufficient power to handle the size of the difference among your test groups. Of course, unless you have some historical experience with previous tests, you will need to estimate the range of outcomes before you actually run the test.
Standard deviation is a measure of the noise, the dispersion of the results around the mean within a group. It measures how widely the values in a data set are spread. Using the same example, if the average order is $250 and the standard deviation is $25, then for bell-curve results you can expect approximately two-thirds of the orders to fall between $225 and $275.
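The "two-thirds" figure can be checked with a few lines of Python. This is a minimal sketch that assumes the order values follow a normal (bell-curve) distribution; the variable names are ours, chosen for illustration.

```python
from statistics import NormalDist

# Figures from the running example: mean order $250, standard deviation $25.
orders = NormalDist(mu=250, sigma=25)

# Share of orders expected within one standard deviation of the mean.
share_within_one_sd = orders.cdf(275) - orders.cdf(225)
print(f"Share between $225 and $275: {share_within_one_sd:.1%}")
```

Real order data is often skewed, so treat the result as a rule of thumb rather than a guarantee.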
The significance level is the probability of saying something matters or works when it actually doesn’t. For marketing tests, we generally set the significance level at 5% (0.05) and forget about it. The significance level you can accept will be another key in determining how much you can rely on the results of your tests.
There are two types of errors that are inherent in any test you conduct. The first type of error is when there is no difference between the main group and the control group, but your test results seem to indicate that a difference exists. This first type of error is indicated by the significance level.
The second type of error is when a difference is actually present, but your test concludes that there is no difference. This second type of error is controlled by designing a test with sufficient power to detect a difference in the groups. Often you don't know what kinds of results you will get and so will have difficulty saying what difference you are trying to detect. In that situation you must either estimate or rely on any previous runs.
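Both types of error can be made visible with a small simulation. The sketch below is an illustration under our own assumptions, not a recommended test procedure: it draws normally distributed order values, compares two groups of 100 customers with a simple two-sided z-test, and repeats the experiment 1,000 times. The function names, group size, and seed are hypothetical choices.

```python
import random
from math import sqrt
from statistics import NormalDist

def mean_and_variance(values):
    """Sample mean and (n - 1) sample variance of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance

def detects_difference(group_a, group_b, significance=0.05):
    """Two-sided z-test: True if the group means differ significantly."""
    mean_a, var_a = mean_and_variance(group_a)
    mean_b, var_b = mean_and_variance(group_b)
    std_error = sqrt(var_a / len(group_a) + var_b / len(group_b))
    critical = NormalDist().inv_cdf(1 - significance / 2)
    return abs(mean_a - mean_b) / std_error > critical

def share_of_tests_flagged(true_difference, trials=1000, group_size=100,
                           std_dev=25, seed=1):
    """Fraction of simulated tests that report a difference."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(trials):
        control = [rng.gauss(250, std_dev) for _ in range(group_size)]
        test = [rng.gauss(250 + true_difference, std_dev) for _ in range(group_size)]
        flagged += detects_difference(control, test)
    return flagged / trials

# First type of error: no real difference, but the test reports one.
type_1_rate = share_of_tests_flagged(true_difference=0)
# Second type of error: a real $10 difference that the test fails to report.
type_2_rate = 1 - share_of_tests_flagged(true_difference=10)
print(f"Type 1 rate: {type_1_rate:.3f}  Type 2 rate: {type_2_rate:.3f}")
```

The first rate should hover near the 5% significance level, while the second depends on the group size, the noise, and the true difference, which is exactly what power measures.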
Statisticians talk about some errors being more “expensive” or “dangerous” or “costly” than other types of errors. By that they mean the consequences of such an error type are greater or more painful than errors of the other type. In some cases, it is the first type that is more costly; in other cases it is the second type that must be prevented. In an ideal world, both types of errors would be minimized, but that might lead to impractical sample sizes. The benefit of having multiple parameters to adjust when determining sample size for an experiment is that the experiment designer can use the relevant parameter to control the most costly type of error within the constraints of available sample sizes. In the Examples section at the end of this paper there is a more detailed description of a practical marketing test where both types of errors are considered.
Knowing the values for the four measures, it is a straightforward (albeit somewhat mathematically complicated) task to calculate the necessary sample size. The online calculator we supply does this math for you. Here are some useful guidelines.
A larger sample size is needed for tests requiring higher power levels or tests with smaller differences to detect. A smaller sample size will suffice with a smaller standard deviation. Also, a more relaxed (larger) significance level will result in higher power for the same sample size.
Actually calculating the sample size required for your experiments is beyond the scope of this short paper. However, Loyalty Builders has posted a free online calculator to do the job for you.
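These guidelines can be seen in a rough approximation for the simplest case: two equal-sized groups whose average results are compared with a two-sided test. The sketch below uses the textbook normal-approximation formula; it is not necessarily the method behind any particular calculator, and the function name and the input values are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(difference, std_dev, power=0.90, significance=0.05):
    """Approximate size of each of two equal groups for comparing means.

    Normal approximation for a two-sided test:
        n = 2 * (z_alpha + z_power)^2 * (std_dev / difference)^2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - significance / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 * (std_dev / difference) ** 2)

# Illustrative inputs: detect a $10 difference with a $50 standard deviation.
baseline = sample_size_per_group(difference=10, std_dev=50)
higher_power = sample_size_per_group(difference=10, std_dev=50, power=0.95)
smaller_difference = sample_size_per_group(difference=5, std_dev=50)
less_noise = sample_size_per_group(difference=10, std_dev=25)
relaxed_significance = sample_size_per_group(difference=10, std_dev=50,
                                             significance=0.10)

print("baseline:", baseline)
print("higher power:", higher_power)
print("smaller difference:", smaller_difference)
print("less noise:", less_noise)
print("relaxed significance:", relaxed_significance)
```

Higher power and a smaller difference push the required size up; less noise and a more relaxed significance level bring it down.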
When running a marketing test, your conclusions are not necessarily meaningful unless there is sufficient power in your test. Sample size is the key ingredient in getting enough power, but your required significance level and the amount of noise inherent in your test will also play a role in determining whether you can rely on the observable differences. You can either set the power and calculate the needed sample size, or you can calculate the power associated with a given sample size. A sample size calculator is available to do the math for you.
For those readers who want to understand the underlying concepts more deeply, we offer several examples showing the relationship of power, difference to detect, and standard deviation to sample size. Feel free to skip over this material if the mathematics is not your primary interest. If you just want to dip your toes into this water, read Example 3 about controlling the most costly errors.
To make this discussion more concrete, we have calculated the power for several different sample sizes and tests. The examples are divided into two classes. One class of examples uses groups of equal sample size, such as you might have with the different recipes of a factorial design for a multivariate test. The second class uses groups of unequal size, such as would be found in experiments using test and control groups. The mathematics is different for these two classes of tests. Below are tables showing the relationship of the relevant parameters to sample sizes required for several different situations.
Table A reflects tests where, for example, the average order is about $250. A standard deviation of $25 means that for a bell curve response, about 68% of the orders are between $225 and $275. There might be three groups where the average order is $250, and a fourth group where the average order is $255. The question is whether or not that $5 difference is meaningful, given the sample sizes used and the other parameters.
Table A. Three tests of four groups with equal sample sizes (for example four recipes in a factorial design)
| Parameter | Test A1 | Test A2 | Test A3 |
| Difference to detect | 5* | 5* | 5* |
| Standard deviation ## | 50 | 25 | 25 |
| Sample Size | 4000** | 4000** | 2000# |
| Power | 0.622 | 0.998 | 0.916 |
*--three groups with equal results, one group with five units higher result
**--four equal groups of size 1000 each
#--four equal groups of size 500 each
##--as a first approximation, use the standard deviation from your existing experience
Test A2 differs from Test A1 only in the standard deviation. With the same sample size, narrowing the dispersion increases the power to an excellent 0.998, and the $5 difference is meaningful. In Test A3, the sample size is cut in half, but there is still reliable power. However, the larger standard deviation for Test A1 (and the corresponding spread of order values) yields a power that is insufficient to reliably detect the $5 difference.
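For readers who want to experiment with the Table A parameters, here is a sketch of one way to approximate the power of an equal-groups comparison. It is our assumption about a reasonable method, not necessarily the one used to build the table: it treats the F statistic of a one-way analysis of variance as a chi-square statistic, which is reasonable only because each group holds hundreds of customers. All function names are hypothetical.

```python
from math import exp, lgamma, log

def chi2_cdf(x, df):
    """CDF of a central chi-square distribution (incomplete gamma series)."""
    if x <= 0:
        return 0.0
    a, half_x = df / 2, x / 2
    term = total = 1.0 / a
    k = 0
    while term > 1e-15 * total:
        k += 1
        term *= half_x / (a + k)
        total += term
    return total * exp(a * log(half_x) - half_x - lgamma(a))

def chi2_critical_value(df, significance=0.05):
    """Bisection search for the cutoff with P(X > cutoff) = significance."""
    lo, hi = 0.0, 100.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < 1 - significance:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def anova_power(group_means, std_dev, group_size, significance=0.05):
    """Approximate power for comparing the means of equal-sized groups."""
    df = len(group_means) - 1
    grand_mean = sum(group_means) / len(group_means)
    # Noncentrality: spread of the group means relative to the noise.
    spread = sum((m - grand_mean) ** 2 for m in group_means)
    noncentrality = group_size * spread / std_dev ** 2
    cutoff = chi2_critical_value(df, significance)
    # Noncentral chi-square CDF as a Poisson-weighted mix of central ones.
    cdf = 0.0
    weight = exp(-noncentrality / 2)
    for j in range(200):
        cdf += weight * chi2_cdf(cutoff, df + 2 * j)
        weight *= (noncentrality / 2) / (j + 1)
    return 1 - cdf

means = [250, 250, 250, 255]
powers = {
    "A1": anova_power(means, std_dev=50, group_size=1000),
    "A2": anova_power(means, std_dev=25, group_size=1000),
    "A3": anova_power(means, std_dev=25, group_size=500),
}
for name, value in powers.items():
    print(name, round(value, 3))
```

Under these assumptions the results should land close to the power values in Table A, and changing the standard deviation or group size shows how quickly power rises or falls.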
Case 2a. Average dollars per contact
Suppose a marketer pulls a list and holds out a 10% control group to test the effect of a new e-mail. The marketer has sent the old version of the e-mail for the previous five mailing cycles, and the control group will receive this version in the test. The marketer wants to examine the difference in average dollars per order and difference in response rate between the test group and the control group, where the test group gets a new e-mail. Below is a table describing the elements of the test:
Table B. Average Dollars per Contact
| Parameter | Test B1 | Test B2 | Test B3 |
| Difference to detect | $25 | $10 | $10 |
| Standard deviation | $75 | $75 | $50 |
| Sample Size | 1,100 | 5,500 | 5,500 |
| Power | 0.888 | 0.812 | 0.989 |
Test B1 is looking to detect a difference of $25 between the test group and the control group. Historically, the standard deviation for the Business As Usual (BAU) e-mail is $75. With a sample size of 1,100 customers (1,000 in the test group, 100 in the control group), the power of the test is a hearty 88.79%.
Test B2 is looking to detect a smaller difference of $10 between the two groups. With the same standard deviation, it takes a much larger sample to get a credible power. With a sample of 5,500 customers (5,000 in the test group, 500 in the control group), the power is 81.17%. (A power of 80% is credible for most cases in customer response scenarios. A power of 90% or greater is preferred for quality control processes.)
Test B3 is looking to detect the same $10 difference with the same sample size, but applies to a different historical situation where the standard deviation in the previous mailing cycles has been $50. As we can see, the power of the test is greatly improved at 98.94%. Clearly, a smaller historical standard deviation will lead to a more powerful test.
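The pattern in Table B can be approximated with a short calculation. The sketch below assumes a two-sided test at the 5% significance level and a normal approximation with the same standard deviation in both groups; these are our assumptions, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_for_means(difference, std_dev, n_test, n_control, significance=0.05):
    """Approximate power to detect a difference in means between two groups."""
    z = NormalDist()
    # Standard error of the difference between the two group averages.
    std_error = std_dev * sqrt(1 / n_test + 1 / n_control)
    shift = difference / std_error
    critical = z.inv_cdf(1 - significance / 2)
    # Probability that the test statistic lands beyond either cutoff.
    return z.cdf(shift - critical) + z.cdf(-shift - critical)

power_b1 = power_for_means(25, 75, n_test=1000, n_control=100)
power_b2 = power_for_means(10, 75, n_test=5000, n_control=500)
power_b3 = power_for_means(10, 50, n_test=5000, n_control=500)
print(round(power_b1, 3), round(power_b2, 3), round(power_b3, 3))
```

Under these assumptions the three values come out close to the powers listed in Table B. Note how the small control group dominates the standard error: adding customers to the control group raises power faster than adding the same number to the test group.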
Case 2b. Response rate
With unequal sample sizes, you can’t compare the absolute number of responses if the test group has 1000 members and the control group has 100; you need to look at ratios. This makes the math somewhat different for response rates.
Table C. Response rates
| Parameter | Test C1 | Test C2 | Test C3 | Test C4 |
| Percent response, Test | 17% | 4% | 4% | 2.1% |
| Percent response, Control | 10% | 2% | 2% | 1.5% |
| Sample size | 2,750 | 2,750 | 8,250 | 22,000 |
| Power | 0.875 | 0.434 | 0.874 | 0.629 |
Test C1 is looking to detect a difference between the 17% response from the test group and the 10% response from the control group. With a sample as small as 2,750 customers (2,500 in the test group, 250 in the control group), the power of the test is 87.53%.
Test C2 illustrates that with the same sample size and a much smaller difference to detect, the power of the test is much weaker. To detect a difference between a 4% response rate and a 2% response rate with credible power, a much larger sample is needed.
Test C3 shows that with 8,250 customers, the power to detect the same difference as in Test C2 rises to 87.38%.
Finally, in Test C4, we see that the difference between the test and the control group is small (0.6 percentage points) and the response rates in general are low. (These are the results from a cross-sell campaign.) Even a much larger sample of 22,000 customers yields a power of only 62.9%. A larger sample still would be required to show any evidence that the difference between these two results is due to anything other than random noise.
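Response-rate comparisons can be sketched in the same way. The version below is one common approximation, chosen by us for illustration: it applies the arcsine transformation to each response rate and then uses a two-sided normal test at the 5% level. The function name is hypothetical, and the group splits for Test C3 (7,500 and 750) are our assumption based on the same ten-to-one ratio used in Test C1.

```python
from math import asin, sqrt
from statistics import NormalDist

def power_for_response_rates(rate_test, rate_control, n_test, n_control,
                             significance=0.05):
    """Approximate power to detect a difference between two response rates."""
    z = NormalDist()
    # Arcsine transformation puts both rates on a scale with stable variance.
    effect = 2 * asin(sqrt(rate_test)) - 2 * asin(sqrt(rate_control))
    shift = abs(effect) / sqrt(1 / n_test + 1 / n_control)
    critical = z.inv_cdf(1 - significance / 2)
    return z.cdf(shift - critical) + z.cdf(-shift - critical)

power_c1 = power_for_response_rates(0.17, 0.10, n_test=2500, n_control=250)
power_c2 = power_for_response_rates(0.04, 0.02, n_test=2500, n_control=250)
power_c3 = power_for_response_rates(0.04, 0.02, n_test=7500, n_control=750)
print(round(power_c1, 3), round(power_c2, 3), round(power_c3, 3))
```

With these assumptions the results come close to the powers shown for Tests C1 through C3. Other approximations, or a one-sided test, give somewhat different values, so use the sketch to explore trends rather than to reproduce a specific calculator.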
Suppose we want to compare the marketing effectiveness of a new mailing B with the old version A that we have been using. We suppose that mailing B is more expensive to produce and mail than A so we definitely want to use it only if the response it generates pays off. For simplicity, let’s suppose that we send A to a group of 1000 customers and B to a similar group of 1000 customers. After the campaign we compare the response rate between the two groups. Our goal here is to decide whether new mailing B is more effective, of equal effectiveness, or less effective than old mailing A.
Clearly, there are two possible errors we can make: 1) we can decide that B is better than A, and thereby increase campaign costs, when it isn’t really more effective (a Type 1 error); or 2) we can decide that B isn’t more effective than A when it really is (a Type 2 error), and consequently lose the additional revenue we would receive from using the more effective version, mailing B. Obviously, choosing between mailing A and mailing B involves a gamble with possible errors no matter what we do. This is where “error tolerance” comes into the picture. How much are you willing to bet that you’ve made the right decision?
Fortunately, as they do in Las Vegas, statistics can help us to change the odds on the “A or B” bet. Statistical methods can alter and even control the likelihood of the possible errors; choosing the sample size is the most effective tool for that purpose. In general, the larger the sample size, the smaller the likelihood of making either of the two errors. But since testing costs money, we need to know how small the sample size can be while still guarding us against errors we cannot tolerate.
Suppose, for this example, that in order to justify using the B campaign, the response rate must be at least 2% higher to cover the higher cost of the B mailing. That means we set the results difference we need to detect to 2%. We specify a significance level of 0.05 and a power of 0.90 to detect a true difference of 2%, which are commonly used values for significance and power. Thus, we have a 5% control on Type 1 errors and a 10% control on Type 2 errors. That is, if the more expensive B campaign is really no better than A (less than 2% improvement), we have only a 5% chance of incorrectly choosing it and making a costly mistake. In contrast, if the more costly B mailing is in fact at least 2% more effective than the A mailing, we have a 90% chance of detecting that 2% difference and choosing B over the A campaign.
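To turn these settings into a sample size, one more input is needed: the response rate of mailing A. The example does not state it, so the sketch below tries several hypothetical baseline rates. It uses the standard normal-approximation formula for two equal-sized groups and a two-sided test; the function name and the baseline values are our own illustrative choices.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_rates(rate_a, rate_b, power=0.90, significance=0.05):
    """Approximate size of each of two equal groups for comparing response rates."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - significance / 2)
    z_power = z.inv_cdf(power)
    # Combined binomial variance of the two response rates.
    variance = rate_a * (1 - rate_a) + rate_b * (1 - rate_b)
    return ceil((z_alpha + z_power) ** 2 * variance / (rate_a - rate_b) ** 2)

# Hypothetical baseline response rates for mailing A; B must beat A by 2 points.
sizes = {}
for baseline in (0.02, 0.05, 0.10):
    sizes[baseline] = sample_size_for_rates(baseline, baseline + 0.02)
    print(f"A = {baseline:.0%}, B = {baseline + 0.02:.0%}: "
          f"{sizes[baseline]} customers per group")
```

The required group size grows as the baseline rate moves toward 50%, because response counts become noisier. That is why the same 2% difference can need very different samples in different campaigns.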