Chapter 10 What are A/B tests?
We have briefly mentioned A/B tests in this book, and although you most likely know something about them, we will discuss them in more detail in this section, starting from their very definition.
A/B tests are a tool which allows us to compare two versions of a given solution and to evaluate which one brings better results. Testing can be applied to emails sent to your customers.
You can also use A/B testing for the layout of sub-pages of the store, the text of the “checkout” button and any other element you want to perfect, but in the context of this book it matters most for emails.
Virtually everything can be tested using A/B testing, provided that the following conditions are met:
- You have influence and control over what you are testing (this is obvious: it makes no sense to test something if you will not be able to introduce the changes suggested by the test)
- You have prepared two versions of a given solution for your users
- You test one selected aspect of the solution (do not try to test everything at the same time, as combined aspects influence the user differently than each aspect on its own)
- You can present each of the versions to comparable groups of users (critical: if the two groups aren’t comparable, the results will not be accurate)
- You can measure how many users became familiar with the individual variants (if you are testing a web site, it is the number of views of the given site variant; in the case of mailings, we usually measure email opens)
- You can measure how many users reacted in the desired way (this could include clicking a link, liking, adding a product to the shopping cart, making a purchase or rating a product after the purchase)
- You accept that the A/B test is only a tool which helps in your work, but does not release you from making the final decision
In A/B testing, people tend to expect one variation to clearly dominate the other. However, it’s important to recognise when both variations yield nearly identical results, so that we do not act on differences that are merely noise. Let’s take a look at the example below, which we’ll analyse in depth shortly.
Looking only at the numbers in this example, we could think that creative A is beating creative B. However, when we enter the results into the calculator (http://www.evanmiller.org/ab-testing/chi-squared.html), it turns out that the difference is not statistically significant (see: Image 5).
Source: Evanmiller.org, Evan’s Awesome A/B Tools, 7 Nov. 2016, http://www.evanmiller.org/ab-testing/chi-squared.html
This result should not be surprising once we reveal that, in the example discussed, the same version of the email message was sent to both groups!
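The kind of check such a calculator performs can be sketched in a few lines. The snippet below is a minimal illustration of a Pearson chi-squared test on a 2x2 table (without continuity correction); it is not necessarily the exact method the calculator uses. The function name `chi_squared_2x2` and the counts are hypothetical and are not the figures from the example above.

```python
import math


def chi_squared_2x2(conv_a, total_a, conv_b, total_b):
    """Pearson chi-squared test for two variants (no continuity correction).

    conv_* is the number of users who reacted, total_* the number who saw
    the variant. Returns (statistic, p_value). Assumes at least one reaction
    and at least one non-reaction overall, otherwise an expected count is 0.
    """
    observed = [
        [conv_a, total_a - conv_a],
        [conv_b, total_b - conv_b],
    ]
    row_totals = [total_a, total_b]
    grand_total = total_a + total_b
    col_totals = [conv_a + conv_b, grand_total - conv_a - conv_b]

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom the chi-squared tail probability
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: 120 vs. 100 reactions out of 1,000 opens each.
stat, p = chi_squared_2x2(conv_a=120, total_a=1000, conv_b=100, total_b=1000)
print(f"chi2 = {stat:.3f}, p = {p:.3f}")
print("significant at 5%" if p < 0.05 else "not significant at 5%")
```

With these made-up counts, variant A looks 20% better than variant B, yet the p-value stays well above 0.05, which is the situation described above.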
10.1 Size matters
Now, we will analyse this example in greater detail.
The open rate of an average newsletter (the percentage of recipients who open it) is about 10%. In addition, edrone’s experience shows that 75% of the users who responded to an email (opened the message and/or clicked on a link in it) did so in the first 24 hours after dispatch. The remaining 25% opened the email up to several days after the dispatch, as the image below shows.
We can clearly see the greatest number of newsletter opens on the dispatch date and on the second day. The welcome message is opened immediately or not opened at all. From the above facts we can draw two very important conclusions: 1) the test should last one day (24 h), as it is not worth waiting any longer; 2) 10% * 75% = 7.5%, which is the percentage of recipients we can expect to open the newsletter within the first 24 hours.
When edrone performs a test, it specifies, as a percentage, what part of the database the test will be sent to. To estimate how many of the emails sent in the test will be opened by the time the test is completed, the assumed 7.5% of opens obtained within 24 hours should be multiplied by the percentage of the database directed to the test and by the size of the database.
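The multiplication described above can be written down as a short sketch. The function name `expected_opens_24h`, the database size and the test share are assumptions made for illustration; the 10% open rate and the 75% share of opens within 24 hours are the figures quoted in the text.

```python
def expected_opens_24h(database_size, test_share,
                       open_rate=0.10, share_within_24h=0.75):
    """Estimate how many test emails will be opened within 24 hours.

    database_size     number of contacts in the database
    test_share        fraction of the database that receives the test (0..1)
    open_rate         overall open rate of a newsletter (about 10% in the text)
    share_within_24h  fraction of opens that happen in the first 24 hours
    """
    emails_in_test = database_size * test_share
    return emails_in_test * open_rate * share_within_24h


# Hypothetical database of 100,000 contacts, 20% of it used for the test.
total_opens = expected_opens_24h(100_000, 0.20)
print(round(total_opens))
```

The result covers the whole test; if the test emails are split evenly between variants A and B, each variant can expect half of that number.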
Will the calculated number of emails that produce a specific reaction from recipients be enough to determine the result of the test?
Just a bit of statistics is enough to get lost! Below, we’ll discuss statistics in depth, giving some examples and methods for obtaining reliable results in A/B testing. This section goes into a great deal of detail, so if you’re more interested in the overall view of A/B testing, skip to the part about the critical points of A/B testing you should not ignore (page XX).
There is no risk of getting lost in the edrone system. The system aims to answer the question of how many emails should be sent in the test phase for the test to determine which of the variants, A or B, is better. It is worth first learning a few concepts from the field of statistics. Don’t worry, it is not difficult.
Using the example of a coin toss, we will discuss the concept of a statistical test and the size of the sample needed to carry it out.
How big does the sample have to be (how many times do we have to toss the coin) in order to determine whether the coin is biased or not? This all depends on how accurate we want to be. Detecting a biased coin which gives tails in 99% of attempts and heads in 1% of attempts is a different thing from detecting a coin which gives tails in 60% of attempts and heads in 40% of attempts. After some calculations we can see that in the first case we only need 16 attempts to obtain a statistically significant result (that is, one that lets us reject the null hypothesis that the coin is fair), while in the second case we need as many as 369 attempts.
Before you say RUN A/B TEST
Let’s go back to the A/B tests. First, we want to determine how many emails we should plan to dispatch in order to establish which variant of our email campaign is better.
Let’s use the calculator from the site http://www.evanmiller.org/ab-testing/sample-size.html (see: Image 3).
Source: Evanmiller.org, Evan’s Awesome A/B Tools, 7 Nov. 2016, http://www.evanmiller.org/ab-testing/sample-size.html
Here’s how to use it:
- In this case, we are studying the conversion rate (views of products, adding products to the shopping cart, or whatever else we want to measure). In the “Baseline conversion rate” field we enter the expected conversion rate, to which our variants will be compared. Where does this value come from? It is best to base it on previous newsletters or on a study or report published in the industry press. We suggest setting it at 10%, which is a plausible average conversion rate.
- In the “Minimum Detectable Effect” field we enter the precision with which our test will be able to indicate that the A/B test versions (variations) differ from the baseline level. If we selected a baseline level of 10% at the previous stage and set the “Minimum Detectable Effect” at 2%, a variation needs a conversion rate below 8% or above 12% to be considered different. With that in mind, if the actual conversion rate of variant A is 10%, the following occurs:
  - If the actual conversion rate in variant B is lower than 8% or higher than 12%, we will be able to detect this difference with our test in 80% of cases (80% is the power of the test). In other words, our test will detect differences larger than 2 percentage points in 80% of cases, assuming that the average conversion rate is about 10%.
  - If the actual conversion rate in variant B is in the range of 8-12%, then our test is unlikely to detect a difference and will most often show that the variant in question is not different from the baseline level. In practice this means that the new variant has not delivered a measurable improvement: you get the same results as with the current variant.
- The “Absolute” / “Relative” switch determines whether the minimum detectable effect is expressed in percentage points (“Absolute”) or as a percentage of the baseline rate (“Relative”). It is best to leave “Absolute”.
- “Statistical power 1-β” is the power of the test (usually 80%) - it is also best to leave the default value.
- “Significance level α” is usually set at 5% - it is best to leave the default value.
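To make the roles of these fields more tangible, here is a minimal sketch of a textbook sample-size approximation for comparing two proportions with a two-sided test. The function name `sample_size_per_variant` is hypothetical, and the formula is the common normal-approximation one. The online calculator may use a different variant of the formula, so the numbers produced here need not match what the calculator displays.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate sample size per variant (two-sided, two-proportion z-test).

    baseline  expected conversion rate, e.g. 0.10
    mde_abs   minimum detectable effect in absolute terms, e.g. 0.02
    alpha     significance level
    power     statistical power (1 - beta)
    """
    p1 = baseline
    p2 = baseline + mde_abs
    p_bar = (p1 + p2) / 2

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 5%
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for power = 80%

    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / mde_abs ** 2)


# The smaller the effect we want to detect, the larger the sample we need.
for mde in (0.01, 0.02, 0.03, 0.05):
    print(f"MDE {mde:.0%}: {sample_size_per_variant(0.10, mde)} per variant")
```

The loop shows the main trade-off: halving the minimum detectable effect roughly quadruples the required sample, which is why very small improvements are expensive to prove.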
From the calculator we read a sample size of 1629 per test variant. In the example presented at the beginning of this analysis, we sent 10 000 emails per variant and calculated that we expect 750 email opens per variant after 24 hours. If our test concerned the conversion from an email open to, for example, a purchase, which occurs with an average conversion rate of 10%, then we would need to send about twice as many emails in order to reach the figure of 1629 opens.
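The arithmetic behind “about twice as many” can be checked with a short sketch. The function name `emails_needed_per_variant` is hypothetical; the 7.5% rate and the figures 1629 and 10 000 come from the text. Exact fractions are used only to avoid floating-point rounding surprises when rounding up.

```python
import math
from fractions import Fraction


def emails_needed_per_variant(required_opens, open_rate_24h=Fraction(75, 1000)):
    """Emails to send per variant so that the expected number of opens
    within 24 hours reaches the required sample size."""
    return math.ceil(Fraction(required_opens) / open_rate_24h)


needed = emails_needed_per_variant(1629)
print(needed)                      # 21720 emails per variant
print(round(needed / 10_000, 2))   # 2.17 times the 10 000 originally sent
```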
At this point we could ask why we are comparing the conversion rate with the baseline level rather than comparing the two variants with each other. The answer is that, for the time being, we are not yet checking the test result. The calculator described above is used to calculate the sample size, and the “Baseline conversion rate” value should be treated as the average conversion rate of variants A and B (which is not known when the test is planned, so we rely on our own expertise or on the knowledge of the experts from edrone).
All winners are not equal!
We are now getting to the crux of the matter. After we have planned with edrone what percentage of the database we will send the newsletter to, it is time for the dispatch and the results. The results are collected in the A/B summary tab. Below are the results of an actual dispatch, which we’ve carried out from our system:
10.2 Critical points of A/B testing you should not ignore
Some marketing systems used for A/B tests check which option is better not once, after an established time period has expired (e.g. 24 hours), but multiple times, for example every half hour. If a check shows that the difference between A and B is statistically significant, the A/B test is stopped. This is a mistake! Each run of the test carries a specific risk of error resulting from the chosen significance level and the power of the test. If we run the test many times, until the moment we get a statistically significant result, every check adds its own chance of a false positive, and these chances accumulate.
This could be likened to a situation where we would like to prove that we have a coin which falls heads up 30% of the time and tails up 70% of the time. We would perform the test by tossing the coin and writing down the number of heads and tails obtained up to that point. We would continue until the average of all the tosses was actually 30% heads and 70% tails. Of course, this does not make any sense. Even if the coin is not biased (it has a 50% chance of falling heads up and a 50% chance of falling tails up), after a certain number of attempts it could happen that the tosses so far show 30% heads and 70% tails.
Checking the significance during the test, without using the result to decide whether to stop or continue, is not incorrect, but it is a waste of time, because we cannot act on what we see without introducing an error.
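The effect of repeated checking can be illustrated with a small simulation. The sketch below runs many “A/A tests”, in which both variants have the same true conversion rate, so every significant result is a false alarm. It compares a tester who looks after every batch with one who only looks once, at the end. All names and parameters (`simulate_aa_tests`, 10 looks, batches of 100, a 10% rate) are assumptions chosen for illustration, and a simple two-proportion z-test stands in for whatever test a given system uses.

```python
import math
import random


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        return 1.0  # no variation at all, nothing to detect
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_a / n_a - conv_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_tests(trials=500, looks=10, batch=100,
                      rate=0.10, alpha=0.05, seed=1):
    """Return (false-alarm rate with peeking, false-alarm rate at the end)."""
    rng = random.Random(seed)
    alarms_peeking = 0
    alarms_at_end = 0
    for _ in range(trials):
        conv_a = conv_b = n = 0
        would_have_stopped = False
        for _ in range(looks):
            # Both variants share the same true rate: any "winner" is noise.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            p = two_proportion_p_value(conv_a, n, conv_b, n)
            if p < alpha:
                would_have_stopped = True  # a peeking tester stops here
        alarms_peeking += would_have_stopped
        alarms_at_end += p < alpha  # p of the final look only
    return alarms_peeking / trials, alarms_at_end / trials


peeking, at_end = simulate_aa_tests()
print(f"false alarms when checking after every batch: {peeking:.1%}")
print(f"false alarms when checking once at the end:   {at_end:.1%}")
```

Because the final look is also one of the interim looks, the peeking tester can never raise fewer false alarms than the one who waits. In practice the rate with peeking comes out several times higher than the nominal 5%.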
In the case of a mailing, the audience is limited. The situation is slightly different when we are testing, for example, a landing page, where new users are constantly coming in. We cannot add new recipients once we have sent the mailing. This means that before sending the message we need to do a calculation: how many users will open our email, and is this number enough to evaluate the tested versions?
The marketing plan typically provides for the dispatch of multiple newsletters in a week. Repeating a test which did not end with a clear conclusion can be a waste of time, and it should not hold up our further marketing activities. It is better to end the test after 24 hours and carry out the dispatch regardless of the result. If the test did not end with statistically significant results, let’s decide on the basis of experience or intuition, which are often the best sources of advice!
This matters all the more because newsletters are usually designed to be sent on a specific date (e.g. on holidays). It is better to send a newsletter that is a bit worse but is dispatched on time.
On the other hand, we should resist the temptation to stop the test too early: if we set a test duration of 24 hours (the duration suggested by edrone), then let’s assess the results after that time.
How not to get lost in A/B testing and statistics?
As we’ve seen, A/B testing is a broad topic. The good news is that you don’t need to memorize all of it, because most tools which support A/B testing will guide you through the process. Moreover, the edrone algorithm will always indicate the winner of the test and, in addition, will check each time whether versions A and B differ statistically and inform the user. If there is no difference, it is better to rely on one’s own experience or on the recommendations of a proven and effective system.