1. Including unaffected users in your experiment
The first common mistake in A/B testing is including users in your experiment who aren’t actually affected by the change you’re testing. This dilutes your results, making it harder to determine the impact of your change.
Say you’re testing a new feature in your app that rewards users for completing a certain action. You mistakenly include users who have already completed the action in the experiment. Since they are not affected by the change, any metrics related to this action do not change, and thus the results for this experiment may not show a statistically significant change.
To avoid this mistake, filter out ineligible users in your code before they reach the feature flag check that adds them to your experiment. Below is an example of how to do this:
// Incorrect. Will include unaffected users
function showNewChanges(user) {
  // The feature flag is evaluated before checking eligibility, so users
  // who have already completed the action are still counted in the experiment.
  if (posthog.getFeatureFlag('experiment-key') === 'control') {
    return false;
  }
  if (user.hasCompletedAction) {
    return false;
  }
  // other checks
  return true;
}
// Correct. Will exclude unaffected users
function showNewChanges(user) {
  // Filter out ineligible users first, so the feature flag is never
  // evaluated for them and they stay out of the experiment.
  if (user.hasCompletedAction) {
    return false;
  }
  // other checks
  if (posthog.getFeatureFlag('experiment-key') === 'control') {
    return false;
  }
  return true;
}
2. Only viewing results in aggregate (aka Simpson’s paradox)
An experiment can show one outcome when analyzed at an aggregate level, but a different one when the same data is broken down into subgroups.
For example, suppose you are testing a change to your sign-up and onboarding flow. The change affects both desktop and mobile users. Your experiment results show the following:
| Variant | Visitors | Conversions | Conversion Rate |
|---|---|---|---|
| Control | 5,000 | 500 | ✖ 10% |
| Test | 5,000 | 1,000 | ✔ 20% |
At first glance, the test variant seems to be the clear winner. However, breaking down the results into the desktop and mobile subgroups shows:
| Device | Variant | Visitors | Conversions | Conversion Rate |
|---|---|---|---|---|
| 💻 Desktop | Control | 2,000 | 400 | ✔ 20% |
| 💻 Desktop | Test | 2,000 | 100 | ✖ 5% |
| 📱 Mobile | Control | 3,000 | 100 | ✖ 3.3% |
| 📱 Mobile | Test | 3,000 | 900 | ✔ 30% |
It’s now clear the test variant performed better for mobile users, but it decreased desktop conversions – an insight we missed when we combined these metrics! This phenomenon is known as Simpson’s paradox.
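To make the breakdown concrete, here’s a minimal JavaScript sketch that recomputes the conversion rates above from the per-device numbers (the data structure and helper function are illustrative, not part of any analytics API):
// Per-device results from the tables above
const results = [
  { device: 'desktop', variant: 'control', visitors: 2000, conversions: 400 },
  { device: 'desktop', variant: 'test', visitors: 2000, conversions: 100 },
  { device: 'mobile', variant: 'control', visitors: 3000, conversions: 100 },
  { device: 'mobile', variant: 'test', visitors: 3000, conversions: 900 },
];

// Conversion rate (in %) for all rows matching a filter
function conversionRate(rows, filter) {
  const matching = rows.filter(filter);
  const visitors = matching.reduce((sum, row) => sum + row.visitors, 0);
  const conversions = matching.reduce((sum, row) => sum + row.conversions, 0);
  return (100 * conversions) / visitors;
}

// Aggregate: test looks like the clear winner
conversionRate(results, (row) => row.variant === 'control'); // 10
conversionRate(results, (row) => row.variant === 'test'); // 20

// Broken down by device: test actually hurts desktop conversions
conversionRate(results, (row) => row.device === 'desktop' && row.variant === 'control'); // 20
conversionRate(results, (row) => row.device === 'desktop' && row.variant === 'test'); // 5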
Depending on your app and experiment, here’s a list of properties you may want to break your results down by (see the sketch after this list for capturing them as event properties):
- User tenure
- Geographic location
- Subscription or pricing tier
- Business size, e.g., small, medium, or large
- Device type, e.g., desktop or mobile, iOS or Android
- Acquisition channel, e.g., organic search, paid ads, or referrals
- User role or job function, e.g., manager, executive, or individual contributor
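Breaking results down by these properties is only possible if they’re captured in the first place. Here’s a minimal sketch using posthog-js’s capture call; the event and property names are illustrative, not a required schema:
// Attach the properties you may later want to break results down by.
// The event and property names here are only examples.
posthog.capture('signed_up', {
  device_type: 'mobile', // desktop or mobile, iOS or Android
  acquisition_channel: 'organic_search', // paid ads, referrals, etc.
  pricing_tier: 'free', // subscription or pricing tier
});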
3. Conducting an experiment without a predetermined duration
Starting an experiment without deciding how long it should last can cause issues. You may fall victim to the “peeking problem”: when you check the intermediate results for statistical significance, make decisions based on them, and end your experiment too early. Without determining how long your experiment should run, you cannot differentiate between intermediate and final results.
Alternatively, if you don’t have enough statistical power (i.e., not enough users to obtain a significant result), you’ll potentially waste weeks waiting for results. This is especially common in group-targeted experiments.
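For intuition about what “enough statistical power” requires, here’s a minimal sketch of a standard sample size estimate for a conversion rate experiment, assuming a two-sided test of two proportions at 5% significance and 80% power (the traffic numbers below are illustrative):
// Rough sample size per variant for a conversion rate experiment,
// using the normal approximation for comparing two proportions.
function requiredVisitorsPerVariant(baselineRate, minimumDetectableEffect) {
  const zAlpha = 1.96; // two-sided 5% significance level
  const zBeta = 0.84; // 80% power
  const p1 = baselineRate;
  const p2 = baselineRate + minimumDetectableEffect;
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / minimumDetectableEffect ** 2);
}

// Example: 10% baseline conversion rate, detecting a 2 percentage point lift
const perVariant = requiredVisitorsPerVariant(0.1, 0.02); // ≈ 3,834 visitors per variant

// With ~500 eligible visitors per week split across two variants,
// that's roughly (2 * 3834) / 500 ≈ 15 weeks of runtime.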
The solution is to use an A/B test running time calculator to determine whether you have the required statistical power and how long your experiment should run. This is b