Does this sound familiar? Your team spends months coming up with the perfect test idea, weeks developing the test, and days getting multiple sign-offs from QA, upper management, and the brand team. When you finally launch, you get one of two scenarios:
Scenario 1: After three days, the test reaches statistical significance, everyone jumps for joy, and you stop the test. A few months later, you notice the change has produced zero visible lift in your bottom line.
Scenario 2: The test runs for ten weeks with a somewhat consistent trend but the results remain inconclusive. Frustrated, you stop the test and make changes based on an insignificant data set. Months on, you see no visible lift.
Cue the deep sigh. Here is how we handle these issues.
Diagnose the Problem
When we encounter one of these two scenarios, we evaluate the results to determine which of the following is to blame:
- False statistically significant results
- Underpowered test
For the most part, each is caused by something we can control, adjust, or prevent. First up, false results are caused by one of three things:
- A too-short run time, which prevents the data from normalizing and accounting for outliers.
- Low traffic and/or an artificially high lift, which causes premature statistical significance.
- A mathematically sound error. That’s right. At the standard 95% confidence level, there is always a 5% chance that statistically significant results are produced in error. Hence the “for the most part” statement from earlier.
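Premature significance is easy to reproduce with a quick calculation. The sketch below runs a standard two-proportion z-test (stdlib only) on two hypothetical snapshots of the same test: an early one where an artificially high lift looks significant, and a later one where additional data has pulled the conversion rates back together. The visitor and conversion counts are invented for illustration.

```python
from math import sqrt, erfc

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a standard two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return erfc(abs(z) / sqrt(2))  # two-sided tail probability

# Day 3: 15,000 visitors, 200 total conversions, lift looks huge (~35%).
early = two_proportion_p_value(85, 7500, 115, 7500)

# Week 6: same test with more data; the gap has normalized.
late = two_proportion_p_value(780, 60000, 810, 60000)

# early p ≈ 0.03 ("significant"), late p ≈ 0.45 (nowhere close)
print(f"early p = {early:.3f}, late p = {late:.3f}")
```

The early snapshot clears the 0.05 threshold on tiny conversion counts; the same test with eight times the data does not, which is exactly the false-winner pattern described above.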
Second, we have underpowered tests, which are caused by a combination of low traffic and a low percentage change between the variation and the original.
Having low traffic, heavily segmenting your visitors for the test, testing minor changes (think button color), or testing low-traffic areas of your site can all cause an underpowered test.
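You can see how traffic and effect size interact by running the standard sample-size formula for comparing two proportions. The sketch below assumes a two-sided alpha of 0.05 and 80% power; the 2% baseline conversion rate and the two lift levels are made-up numbers for illustration.

```python
from math import ceil

Z_ALPHA = 1.96   # two-sided alpha = 0.05
Z_BETA = 0.8416  # power = 0.80

def visitors_per_variation(baseline, relative_lift):
    """Approximate visitors needed per arm to detect a relative lift
    in a baseline conversion rate (two-proportion z-test)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((Z_ALPHA + Z_BETA) ** 2 * variance / (p1 - p2) ** 2)

# Minor change (5% relative lift) on a 2% baseline: ~315,000 visitors PER ARM.
print(visitors_per_variation(0.02, 0.05))
# Bolder change (20% relative lift): ~21,000 per arm.
print(visitors_per_variation(0.02, 0.20))
```

A subtle button-color tweak demands roughly fifteen times the traffic of a bolder change, which is why low-traffic sites testing minor changes end up underpowered.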
Need examples? We’ve got ‘em. Below we’ll show you a false result and an underpowered test.
Understand the Problem
Scenario 1: False Results
A test that quickly reaches statistical significance should be met with caution, not celebration. Given the limited time the test has had to collect data, account for outliers, and settle into a normal distribution, a false winner is highly probable. You wouldn’t turn off a movie just because the bad guy is vanquished halfway through. Movie-watching experience tells you to keep watching ‘cause that guy is coming back before the final scene. Similarly, you need to train yourself to keep watching a test when a win shows up too early.
The graph below depicts a test that reached statistical significance within twenty-four hours of deployment. Immediately afterward, the test dropped back to insignificance, regained significance two days later, held there for two more days, and then fell to the edge of insignificance.
The green arrows show each time the variation became statistically significant, while the orange arrow shows the variation descending toward the significance threshold.
This up-and-down behavior is directly related to an artificially high level of lift, causing a false positive. By the time the test had reached statistical significance the first time, fifteen thousand visitors had been exposed to the test, but only two hundred total conversions had been attained. The difference in the conversion rate between the original and the variation was large enough to trigger significance but, due to the short run time, the data had not yet normalized.
This is a great example of the normalization process and the importance of providing the test with enough time to collect a significant amount of data.
All tests that have a significant amount of traffic should run for a minimum of two weeks. Sites with lower traffic should increase the minimum runtime to account for the full data normalization process. Additionally, you should run your tests for complete purchase cycles to overcome skewed data caused by stopping the test on a particularly high or low converting day of the week.
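Those two rules, a two-week floor and whole purchase cycles, can be wrapped into a small helper. This is a sketch that assumes a weekly purchase cycle; the visitor figures in the examples are hypothetical.

```python
from math import ceil

def min_run_time_weeks(required_visitors_total, daily_visitors, min_weeks=2):
    """Round the required run time up to whole weeks (so the test starts
    and stops at the same point in a weekly purchase cycle), and never
    run shorter than the two-week minimum."""
    days = required_visitors_total / daily_visitors
    return max(min_weeks, ceil(days / 7))

# 42,000 required visitors at 1,500/day -> 28 days -> 4 full weeks.
print(min_run_time_weeks(42000, 1500))
# A high-traffic site may hit its sample size in a day, but still runs 2 weeks.
print(min_run_time_weeks(6000, 6000))
```

Rounding up to whole weeks keeps the start and stop on the same day of the week, which avoids the skew from ending on an unusually high- or low-converting day.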
Scenario 2: Underpowered Test
When a test runs for multiple months with a consistent trend but fails to reach statistical significance, this indicates an underpowered test.
The graph below shows a test that ran for four months and failed to reach statistical significance. With this client, it was apparent that the amount of traffic to the site was too low to power subtle tests. Minor changes would nearly always produce insignificant results and require a lengthy run time.
Although a commonly proposed solution to this type of test is “Keep waiting,” few people have such patient managers. Even those of us with the most unflappable superiors know that we can’t run a test for a year and have nothing to show but a measly increase.
We know that an underpowered test is caused by the test failing to produce a distinguishable change, by low traffic, or by a combination of the two. Solving for those two causes is hardly a challenge for a smart tester, especially since we are going to provide some suggestions on how to limit your chances of an inconclusive test. At the very least, you’ll prevent future issues. At the most, you could find a way to tweak your test to be successful.
If you have a low level of lift, try…
- Testing larger ideas that more drastically impact the user’s experience.
- Using testing as a way to prove business-level decisions instead of UX-based interactions.
- Focusing on pages lower in the shopping funnel and closer to key decision points like add-to-cart button clicks and the checkout.
If you have low traffic, try…
- Avoiding segmenting data to ensure the maximum amount of traffic is being exposed to the test.
- Setting expectations with management that the minimum run time for a test will be longer than industry norms.
- Focusing on higher traffic pages that are at the top of the funnel.
- Using micro KPIs like email signups or CTA clicks.
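The micro KPI suggestion works because higher base rates are easier to test against. The sketch below uses the standard minimum-detectable-effect approximation (two-sided alpha = 0.05, 80% power) to compare what relative lift 10,000 visitors per arm can detect on a 2% purchase rate versus a 20% CTA click rate; all figures are illustrative.

```python
from math import sqrt

Z_ALPHA, Z_BETA = 1.96, 0.8416  # two-sided alpha = 0.05, power = 0.80

def min_detectable_relative_lift(base_rate, visitors_per_arm):
    """Smallest relative lift detectable at the given traffic level,
    using the normal approximation for two proportions."""
    absolute = (Z_ALPHA + Z_BETA) * sqrt(
        2 * base_rate * (1 - base_rate) / visitors_per_arm
    )
    return absolute / base_rate

# 2% purchase rate: only a ~28% relative lift is detectable.
print(min_detectable_relative_lift(0.02, 10000))
# 20% CTA click rate: a ~8% relative lift is detectable with the same traffic.
print(min_detectable_relative_lift(0.20, 10000))
```

Switching to a micro KPI with a tenfold higher base rate lets the same traffic detect a lift roughly three and a half times smaller, which is often the difference between a conclusive and an inconclusive test.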
Last Tips Before You Test
It’s a lot less costly to test the right thing the first time around. As our last bit of advice, here’s a checklist of questions to ask yourself before you launch a test:
- Sample Size – Is the sample size large enough to show a true representation of the data?
- Traffic – Is there an adequate amount of traffic being exposed to the test?
- Level of Lift – Are the tests changing the on-site experience enough to influence user behavior?
- Buying Cycles – Are the tests being stopped at the same point of the buying cycle as when they were started?
- Run Time – Was the test given enough time to normalize regardless of statistical significance?