1. Including unaffected users in your experiment
The first common mistake in A/B testing is including users in your experiment who aren’t actually affected by the change you’re testing. This dilutes your results, making it harder to determine the impact of your change.
Say you’re testing a new feature in your app that rewards users for completing a certain action. You mistakenly include users who have already completed the action in the experiment. Since they are not affected by the change, any metrics related to this action do not change, and thus the results for this experiment may not show a statistically significant change.
To avoid this mistake, filter out ineligible users in your code before they reach the feature flag check that adds them to your experiment. Below is an example of how to do this:
// Incorrect. Will include unaffected users
function showNewChanges(user) {
  // The feature flag is evaluated before checking eligibility, so users
  // who have already completed the action are still counted in the experiment.
  if (posthog.getFeatureFlag('experiment-key') === 'control') {
    return false;
  }
  if (user.hasCompletedAction) {
    return false;
  }
  // other checks
  return true;
}
// Correct. Will exclude unaffected users
function showNewChanges(user) {
  // Filter out ineligible users first, so the feature flag is never
  // evaluated for them and they stay out of the experiment.
  if (user.hasCompletedAction) {
    return false;
  }
  // other checks
  if (posthog.getFeatureFlag('experiment-key') === 'control') {
    return false;
  }
  return true;
}
2. Only viewing results in aggregate (aka Simpson’s paradox)
An experiment can show one outcome when analyzed at an aggregate level, but a different one when the same data is broken down into subgroups.
For example, suppose you are testing a change to your sign-up and onboarding flow. The change affects both desktop and mobile users. Your experiment results show the following:
| Variant | Visitors | Conversions | Conversion Rate |
|---|---|---|---|
| Control | 5,000 | 500 | ✖ 10% |
| Test | 5,000 | 1,000 | ✔ 20% |
At first glance, the test variant seems to be the clear winner. However, breaking down the results into the desktop and mobile subgroups shows:
| Device | Variant | Visitors | Conversions | Conversion Rate |
|---|---|---|---|---|
| 💻 Desktop | Control | 2,000 | 400 | ✔ 20% |
| 💻 Desktop | Test | 2,000 | 100 | ✖ 5% |
| 📱 Mobile | Control | 3,000 | 100 | ✖ 3.3% |
| 📱 Mobile | Test | 3,000 | 900 | ✔ 30% |
It’s now clear the test variant performed better for mobile users, but it decreased desktop conversions – an insight we missed when we combined these metrics! This phenomenon is known as Simpson’s paradox.
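To make the breakdown concrete, here’s a minimal JavaScript sketch that recomputes the conversion rates above from the per-device numbers (the data structure and helper function are illustrative, not part of any analytics API):
// Per-device results from the tables above
const results = [
  { device: 'desktop', variant: 'control', visitors: 2000, conversions: 400 },
  { device: 'desktop', variant: 'test', visitors: 2000, conversions: 100 },
  { device: 'mobile', variant: 'control', visitors: 3000, conversions: 100 },
  { device: 'mobile', variant: 'test', visitors: 3000, conversions: 900 },
];

// Conversion rate (in %) for all rows matching a filter
function conversionRate(rows, filter) {
  const matching = rows.filter(filter);
  const visitors = matching.reduce((sum, row) => sum + row.visitors, 0);
  const conversions = matching.reduce((sum, row) => sum + row.conversions, 0);
  return (100 * conversions) / visitors;
}

// Aggregate: test looks like the clear winner
conversionRate(results, (row) => row.variant === 'control'); // 10
conversionRate(results, (row) => row.variant === 'test'); // 20

// Broken down by device: test actually hurts desktop conversions
conversionRate(results, (row) => row.device === 'desktop' && row.variant === 'control'); // 20
conversionRate(results, (row) => row.device === 'desktop' && row.variant === 'test'); // 5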
Depending on your app and experiment, here’s a list of properties you may want to break your results down by (see the sketch after this list for capturing them as event properties):
- User tenure
- Geographic location
- Subscription or pricing tier
- Business size, e.g., small, medium, or large
- Device type, e.g., desktop or mobile, iOS or Android
- Acquisition channel, e.g., organic search, paid ads, or referrals
- User role or job function, e.g., manager, executive, or individual contributor
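Breaking results down by these properties is only possible if they’re captured in the first place. Here’s a minimal sketch using posthog-js’s capture call; the event and property names are illustrative, not a required schema:
// Attach the properties you may later want to break results down by.
// The event and property names here are only examples.
posthog.capture('signed_up', {
  device_type: 'mobile', // desktop or mobile, iOS or Android
  acquisition_channel: 'organic_search', // paid ads, referrals, etc.
  pricing_tier: 'free', // subscription or pricing tier
});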
3. Conducting an experiment without a predetermined duration
Starting an experiment without deciding how long it should last can cause issues. You may fall victim to the “peeking problem”: when you check the intermediate results for statistical significance, make decisions based on them, and end your experiment too early. Without determining how long your experiment should run, you cannot differentiate between intermediate and final results.
Alternatively, if you don’t have enough statistical power (i.e., not enough users to obtain a significant result), you’ll potentially waste weeks waiting for results. This is especially common in group-targeted experiments.
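For intuition about what “enough statistical power” requires, here’s a minimal sketch of a standard sample size estimate for a conversion rate experiment, assuming a two-sided test of two proportions at 5% significance and 80% power (the traffic numbers below are illustrative):
// Rough sample size per variant for a conversion rate experiment,
// using the normal approximation for comparing two proportions.
function requiredVisitorsPerVariant(baselineRate, minimumDetectableEffect) {
  const zAlpha = 1.96; // two-sided 5% significance level
  const zBeta = 0.84; // 80% power
  const p1 = baselineRate;
  const p2 = baselineRate + minimumDetectableEffect;
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / minimumDetectableEffect ** 2);
}

// Example: 10% baseline conversion rate, detecting a 2 percentage point lift
const perVariant = requiredVisitorsPerVariant(0.1, 0.02); // ≈ 3,834 visitors per variant

// With ~500 eligible visitors per week split across two variants,
// that's roughly (2 * 3834) / 500 ≈ 15 weeks of runtime.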
The solution is to use an A/B test running time calculator to determine whether you have the required statistical power and how long your experiment should run. This is b