The inequality of A/B testing

You need to change how you do A/B testing and CRO

Aug 09, 2023

I planned on writing a legacies post this week and responding to some of your concerns around innovation and work from home, and it will happen one of these days. But in the meantime this is very marketing focused and worth your attention.

Microsoft recently published a paper on the distribution of results from their A/B tests that has significant implications for most marketers. Here is the abstract (bold is my editorial):

Large and statistically powerful A/B tests are increasingly popular to screen new business and policy ideas. We study how to use scarce experimental resources to screen multiple potential innovations by proposing a new framework for optimal experimentation, that we term the A/B testing problem. The main departure from the literature is that the model allows for fat tails. The key insight is that the optimal experimentation strategy depends on whether most gains accrue from typical innovations or from rare and unpredictable large successes that can be detected using tests with small samples. We show that, if the tails of the unobserved distribution of innovation quality are not too fat, the standard approach of using a few high-powered “big” experiments is optimal. However, when this distribution is very fat tailed, a “lean” experimentation strategy consisting of trying more ideas, each with possibly smaller sample sizes, is preferred. We measure the relevant tail parameter using experiments from Microsoft Bing’s EXP platform and find extremely fat tails. Our theoretical results and empirical analysis suggest that even simple changes to business practices within Bing could increase innovation productivity.

A few things going on here:

If A/B test results are “normally distributed” then the current methods of running tests should work fine
But, if A/B tests have “fat tails” — if there are more results than expected that are far from the control group (in both directions) — then we should be running tests differently
When the team looked at Bing’s tests they found VERY fat tails, therefore they should be running tests differently

How fat were the tails? “…the top 2% of ideas are responsible for 74.8% of the historical gains”. As the paper says, “This is an extreme version of the usual 80-20 Pareto rule”.
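To get a feel for what that kind of concentration means, here is a minimal simulation sketch. It compares a thin-tailed (normal) world with a fat-tailed one and asks what share of the total gains comes from the top 2% of ideas. The distributions and the tail index of 1.3 are my illustrative assumptions, not the paper's estimates, and `top_share` is a made-up helper name.

```python
import random

def top_share(gains, top_fraction=0.02):
    """Share of all positive gains that comes from the top `top_fraction` of ideas."""
    positive = sorted((g for g in gains if g > 0), reverse=True)
    k = max(1, int(len(gains) * top_fraction))
    return sum(positive[:k]) / sum(positive)

rng = random.Random(0)
n_ideas = 10_000

# Thin tails: true effects drawn from a normal distribution.
thin = [rng.gauss(0, 1) for _ in range(n_ideas)]

# Fat tails: symmetric Pareto-type effects. The tail index 1.3 is an
# illustrative assumption, not the paper's estimate.
fat = [rng.choice((-1, 1)) * (rng.paretovariate(1.3) - 1) for _ in range(n_ideas)]

thin_share = top_share(thin)
fat_share = top_share(fat)
print(f"normal: top 2% of ideas produce {thin_share:.0%} of gains")
print(f"fat-tailed: top 2% of ideas produce {fat_share:.0%} of gains")
```

In the normal world the top 2% of ideas contribute a modest slice of the gains. In the fat-tailed world a handful of ideas dominate, which is the pattern the paper reports for Bing.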

The conclusions:

“Ideas with small t-statistics should be shrunk aggressively, because they are likely to be lucky draws”
“…the marginal value of data for experimentation is an order of magnitude lower than the average value, but is not negligible.”
“…there are large gains from moving towards a lean experimentation strategy”
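The first conclusion, shrinking small t-statistics, can be illustrated with a rough sketch. Assume true effects follow a fat-tailed Student-t prior (most ideas do almost nothing, a few are huge) and the test measures them with normal noise. The posterior mean then tells you how much of the measured effect to believe. The prior shape, its scale of 0.1 and its 1.3 degrees of freedom are assumptions I picked for illustration; the paper's own estimator may differ.

```python
import math

def posterior_mean(observed, se, prior_density, grid_points=20_001):
    """E[true effect | observed estimate], by numerical integration on a grid."""
    half_width = abs(observed) + 10 * se
    step = 2 * half_width / (grid_points - 1)
    num = den = 0.0
    for i in range(grid_points):
        theta = -half_width + i * step
        # Normal likelihood of the observed estimate given a true effect theta.
        likelihood = math.exp(-0.5 * ((observed - theta) / se) ** 2)
        weight = prior_density(theta) * likelihood
        num += theta * weight
        den += weight
    return num / den

def fat_tailed_prior(theta, scale=0.1, df=1.3):
    """Unnormalized Student-t density (assumed prior): mostly near zero, fat tails."""
    return (1 + (theta / scale) ** 2 / df) ** (-(df + 1) / 2)

se = 1.0
# Shrink factor = believable effect / measured effect, for a few t-statistics.
shrink = {t: posterior_mean(t * se, se, fat_tailed_prior) / (t * se) for t in (1, 2, 10)}
for t, factor in shrink.items():
    print(f"t = {t}: keep about {factor:.0%} of the measured effect")
```

Under these assumptions a result with a t-statistic of 1 or 2 keeps only a small fraction of its measured lift, while a result with a t-statistic of 10 is taken almost at face value.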

Every conversion (CRO) team I have encountered has a version of this story:

CRO Team: “We ran 100 experiments, and rolled out 23 new changes. The total impact from our team was a +28% conversion improvement site-wide.”
Executive: “But our conversion is down slightly from last year. How is that possible?”
CRO Team: “There must be other things going on, like market conditions, marketing channel changes or competition. We would be down even more if it wasn’t for our tests. We have the test and control results to prove it.”
Executive: “I guess…”

When you get a positive result on an A/B test, the reason is some combination of actual improvement and luck. But since most tests are unlikely to show any real improvement, the ones that come back “statistically significant” usually aren’t real wins; they are just noise. And when the improvement is real, in theory the error bars go both ways. In practice they never go to the right: you never hear “our test showed a 10% improvement in conversion, +/- 9% with 95% confidence, but when we rolled it out we had a +15% conversion rate improvement”.
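Here is a small simulation sketch of that effect. It assumes only 5% of ideas have any real effect (a 1% lift), every test has a standard error of 1%, and a test "wins" when its z-score is above 1.96. All of those numbers are assumptions chosen for illustration, and `winners_curse` is a hypothetical helper.

```python
import random

def winners_curse(n_tests=20_000, share_real=0.05, real_lift=0.01,
                  se=0.01, z_crit=1.96, seed=1):
    """Simulate many A/B tests and summarize the ones that look like wins."""
    rng = random.Random(seed)
    winners = []
    for _ in range(n_tests):
        # Most ideas do nothing; a small share has a genuine lift.
        true_lift = real_lift if rng.random() < share_real else 0.0
        # The measured lift is the true lift plus sampling noise.
        observed = rng.gauss(true_lift, se)
        if observed / se > z_crit:
            winners.append((true_lift, observed))
    n = len(winners)
    return {
        "winners": n,
        "share_noise": sum(1 for t, _ in winners if t == 0.0) / n,
        "mean_observed": sum(o for _, o in winners) / n,
        "mean_true": sum(t for t, _ in winners) / n,
    }

result = winners_curse()
print(f"significant winners: {result['winners']}")
print(f"winners with no real effect: {result['share_noise']:.0%}")
print(f"average measured lift: {result['mean_observed']:.2%}")
print(f"average true lift: {result['mean_true']:.2%}")
```

With these inputs most of the "winners" have no real effect at all, and the average measured lift among winners is far above the average true lift. Add up the measured lifts and you get the CRO team's +28%; add up the true lifts and you get the executive's flat conversion rate.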

While these facts become obvious once you spend time running A/B tests for a business, understanding the math for WHY this happens is a lot harder (which is why a lot of marketers insist that everything works according to basic theory and that the team really DID drive a +28% conversion improvement). This paper is an attempt to show that more complicated math. Dive in if that is your thing, but for the rest of us, we can focus on the implications. Here they are again in my words:

If the results are “barely significant” or “marginally significant”, assume they are not significant at all
When you need an A/B test to tell you if something is better or worse, it likely doesn’t matter
Overall conversion rate improvements will come from a few huge wins, not from dozens of incremental improvements

You should keep running your tests, but when the results are “inconclusive”, just assume it’s a null result that will stay a null result. Rather than extending the test to see what you can find, end the test and try something new. This is the VC model of A/B testing. You want to test a lot, not so you can layer gains on top of gains, but so that you can get lucky one in a hundred tests, and that one lucky result can make your year.

There is one more reason to test. Not to figure out if a new idea is better, but to verify a new idea is not worse. Often a company will want to re-brand the website, or launch a new CMS. Most of the time the new version will result in a LOWER conversion rate. So you can use a test to make sure the new design is not significantly worse than the old design. If it’s about the same, then you can use your business judgement to decide whether to replace the old one with the new one.
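Statisticians call this kind of "make sure it is not worse" check a non-inferiority test. A minimal sketch, assuming a simple two-proportion normal approximation and a margin you pick yourself (here one percentage point of conversion rate); the function name and the example counts are made up for illustration:

```python
from math import sqrt
from statistics import NormalDist

def is_non_inferior(conv_old, n_old, conv_new, n_new, margin=0.01, confidence=0.95):
    """One-sided check that the new rate is not worse than the old by more than `margin`."""
    p_old = conv_old / n_old
    p_new = conv_new / n_new
    diff = p_new - p_old
    # Standard error of the difference between two independent proportions.
    se = sqrt(p_old * (1 - p_old) / n_old + p_new * (1 - p_new) / n_new)
    z = NormalDist().inv_cdf(confidence)
    lower_bound = diff - z * se
    # Non-inferior if even the pessimistic bound stays above -margin.
    return lower_bound > -margin, lower_bound

# Old design: 500 conversions from 10,000 visitors. New design: 495 from 10,000.
ok, bound = is_non_inferior(500, 10_000, 495, 10_000)
print(ok, round(bound, 4))

# A new design that converts a full point lower fails the check.
bad, _ = is_non_inferior(500, 10_000, 400, 10_000)
print(bad)
```

The margin is a business decision, not a statistical one: it is how much conversion you are willing to give up for the new brand or CMS.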

So in practice the CRO team should operate like this:

If other people in the org want to make a site change, they need to go through the CRO team to make sure it is not value destructive (if it creates immediate value, great, but that is not the point)
Meanwhile the CRO team needs to be searching for lots and lots of new ideas and testing them quickly. Tests do NOT need to be run to ensure statistical significance. If the tests do not show dramatic improvement fairly early, they should just end the test and try something new. Better to end a bunch of tests that MIGHT get a 1% improvement, and use that time to keep searching for the test that gets a 20% improvement

I saw this first hand with a client recently. They re-did their website and had to launch it without testing (another story). I expected the new site to be significantly worse at conversion and that they would need to roll back to the old site and test into the new site later (which is what normally happens). Not this time. The new site converts 50% better than the old site (the leads are worse quality, but not 50% worse, so it is a huge win). We didn’t need to see the A/B test results to notice a +50% CR improvement!

One last recommendation from the paper on how Bing itself should be operating:

We consider a counterfactual where Bing experiments on 20% more ideas, with the marginal ideas having the same quality distribution, while keeping the same number of users. We find that productivity would increase by 17.05%. Naturally, whether these gains can be attained depends on the costs of running additional experiments. We perform a back-of-the envelope calculation using Bing’s monetary valuation for quality improvements. We find that moving towards lean experimentation would be profitable even if the fixed costs of one experiment were of the order of hundreds of thousands of dollars per year.

Basically: end tests early, but run 20% more tests. The paper estimates that this would raise productivity (the total gains the experimentation program delivers) by 17%. It stays profitable even if each additional experiment carries fixed costs on the order of hundreds of thousands of dollars per year. Your business is unlikely to have the scale of Bing.com, but the conclusion is likely to be very similar.

Keep it simple,

Edward

p.s., I have come across this study in a number of places in the last week, but first saw it in TheDiff.

Marketing BS with Edward Nevraumont
