Validating A/B Test Results: Answers
In this lesson we'll cover:
- Preparation and prioritizing
- Validating the results

Preparation and prioritizing
A/B tests can alter user behavior in many ways, and sometimes these changes are unexpected. Before digging into test data, it's important to hypothesize how a feature might change user behavior, and why. If you identify changes in the data first, it's easy to rationalize them as obvious in hindsight, even if you never would have thought of them before the experiment.
It's similarly important to develop hypotheses for explaining test results before looking further into the data. These hypotheses focus your thinking, provide specific conclusions to validate, and keep you from always concluding that the first potential answer you find is the right one.
For this problem, a number of factors could explain the anomalous test. Here are a few examples:
- This metric is incorrect or irrelevant: Posting rates may not be the correct metric for measuring overall success. It describes how Yammer's customers use the tool, but not necessarily whether they're getting value out of it. For example, while a giant "Post New Message" button would probably increase posting rates, it's likely not a great feature for Yammer. You may want to make sure the test results hold up for other metrics as well.
- The test was calculated incorrectly: A/B tests are statistical tests. People calculate results using different methods—sometimes that method is incorrect, and sometimes the arithmetic is done poorly. Recomputing the result independently is a quick sanity check (see the sketch after this list).
- The users were treated incorrectly: Users are supposed to be assigned to test treatments randomly, but sometimes bugs interfere with this process. If users are treated incorrectly, the experiment may not actually be random. Comparing group sizes against the intended split is a simple way to catch this (also included in the sketch after this list).
- There is a confounding factor or interaction effect: These are the trickiest to identify. The experiment treatment could affect the product in some other way—for example, it could make another feature harder to find or create incongruous mobile and desktop experiences. These changes might affect user behavior in unexpected ways, or amplify changes beyond what you would typically expect.
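To make the calculation and randomization checks concrete, here is a minimal sketch of recomputing the test result and checking the assignment split, assuming a hypothetical pandas DataFrame with one row per user. The DataFrame, column names, group labels, and 50/50 split are illustrative assumptions, not Yammer's actual schema or method.

```python
# A minimal sketch, not Yammer's actual analysis: re-checking an A/B test
# result from a hypothetical DataFrame `users` with one row per user and
# columns:
#   treatment      -- 'control' or 'treatment' (assumed 50/50 random split)
#   messages_sent  -- messages posted by the user during the test window
import pandas as pd
from scipy import stats


def recheck_experiment(users: pd.DataFrame) -> None:
    control = users.loc[users["treatment"] == "control", "messages_sent"]
    treated = users.loc[users["treatment"] == "treatment", "messages_sent"]

    # 1. Recompute the result independently. Welch's t-test compares mean
    #    messages per user without assuming equal variances between groups.
    t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)
    lift = treated.mean() / control.mean() - 1
    print(f"lift: {lift:+.1%}  t = {t_stat:.2f}  p = {p_value:.4f}")

    # 2. Check for sample ratio mismatch. With a true 50/50 random
    #    assignment, group sizes should differ only by chance; a very small
    #    p-value here suggests assignment or logging is broken.
    counts = [len(control), len(treated)]
    srm_p = stats.chisquare(counts).pvalue  # null hypothesis: equal split
    print(f"group sizes: {counts}  SRM p-value = {srm_p:.4f}")
```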
Validating the results
1. The number of messages sent shouldn't be the only determinant of this test's success, so dig into a few other metrics to make sure that their outcomes were also positive. In particular, we're interested in metrics that determine if a user is getting value out of Yammer. (Yammer typically uses login frequency as a core value metric.)
First, the average number of logins per user is up. This suggests that not only are users sending more messages, but they're also signing in to Yammer more.
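As a rough illustration of this kind of check, the sketch below compares logins per user between groups. The events table, column names, and group labels are assumptions for illustration rather than Yammer's actual data model.

```python
# A rough sketch, assuming a hypothetical `events` DataFrame with one row
# per event during the test window and columns:
#   user_id, treatment ('control'/'treatment'), event_type (e.g. 'login')
import pandas as pd
from scipy import stats


def login_metric_check(events: pd.DataFrame) -> None:
    logins = events[events["event_type"] == "login"]

    # Count logins per user within each treatment group. (Users with zero
    # logins during the window are omitted here for brevity; a fuller check
    # would join against the full user list and fill them in as zeros.)
    per_user = logins.groupby(["treatment", "user_id"]).size()
    print(per_user.groupby("treatment").mean())  # average logins per user

    # Welch's t-test on logins per user: a small p-value suggests the
    # engagement lift is unlikely to be noise.
    control = per_user.loc["control"]
    treated = per_user.loc["treatment"]
    print(stats.ttest_ind(treated, control, equal_var=False))
```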
View Mode Analysis