Performance benchmarks for A/B testing algorithms: which ones work best


August 2023

This post is a more technical review of which algorithms perform best when analyzing A/B testing results. We see this as an essential part of establishing a solid foundation for A/B testing analysis: we need to be confident that we can make sound business decisions about the effectiveness of marketing programs tested in real life.

Why benchmark A/B testing algorithms?

It’s easy to understand why A/B tests get biased in real life: many variables can differ between the control group and the treated group. Selection bias is often at work, and the two groups are often not comparable without proper A/B testing analysis and a properly adjusted control group.

But how do you know your adjusted control group is right? How do you know it will give you a reliable estimate of the true impact of the program or campaign you are trying to evaluate? This is where benchmarking is really useful.

We start with synthetic data for which the true answer is known, and we run it through different libraries and different algorithms to see which ones recover the correct answer. The results are eye-opening. Let’s get started!
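As a rough illustration of the idea, the sketch below builds a synthetic dataset in which the true treatment effect is known by construction and assignment to treatment depends on a confounder. The effect size of 1.0, the variable names, and the data-generating process are all assumptions made for illustration; they are not the datasets used in this benchmark.

```python
import math
import random

TRUE_EFFECT = 1.0  # assumed value; known because we generate the data ourselves


def make_synthetic(n=5000, seed=0):
    """Return (x, treated, outcome) rows with a known treatment effect."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)               # confounder (hypothetical, e.g. prior engagement)
        p_treat = 1.0 / (1.0 + math.exp(-x))  # selection bias: higher x means more likely treated
        treated = 1 if rng.random() < p_treat else 0
        outcome = 10.0 + 2.0 * x + TRUE_EFFECT * treated + rng.gauss(0.0, 1.0)
        rows.append((x, treated, outcome))
    return rows


def naive_difference(rows):
    """Unadjusted difference in mean outcome, treated minus control."""
    treated = [y for _, d, y in rows if d == 1]
    control = [y for _, d, y in rows if d == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


rows = make_synthetic()
naive = naive_difference(rows)
print(f"true effect: {TRUE_EFFECT:.2f}, naive estimate: {naive:.2f}")
```

Because the confounder drives both treatment and outcome, the unadjusted comparison overstates the effect. Any algorithm under test can then be scored against TRUE_EFFECT.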

Which A/B testing libraries did we benchmark?

After a preliminary evaluation of a range of open-source causal analysis libraries, our final list of contenders boiled down to two widely used libraries:

CausalInference: Originally developed by Laurence Wong, causalinference is one of the oldest causal analysis libraries around. The library implements methods that are well documented in Imbens, G. & Rubin, D. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. It tends to be fast, and works well with most tabular datasets.
DoWhy: Originally developed by a team of scientists at Microsoft Research, dowhy is one of the most robust causal analysis libraries. It has in many ways become a standard and is now also endorsed by AWS data scientists. The trade-off is speed: it tends to be meaningfully slower than causalinference.

Other libraries we evaluated that did not make it to our final list include psmpy, causallib, and causalimpact. We felt these libraries were either not mature enough or too narrowly specialized to be effective as our main analytical workhorse.

A/B testing benchmarking results

We used three synthetic datasets with different true impacts, so we could test what happens when the treatment effect is small, medium, or large. Results for the small-effect case, which is the most challenging, are graphed below. The chart shows the error of the estimate vs. the true treatment effect:

The results are pretty clear: propensity score matching works very well with both causalinference and dowhy. In both cases the algorithm estimated the treatment effect with near-perfect accuracy, though dowhy was about 10x slower than causalinference.
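For readers who want to see what propensity score matching does mechanically, here is a minimal standard-library sketch. It is not the causalinference or dowhy implementation: it assumes a single confounder, fits a one-covariate logistic regression with Newton's method, and matches each treated unit to its nearest control on the estimated score, with replacement. The synthetic data, the true effect of 1.0, and all function names are hypothetical.

```python
import bisect
import math
import random

TRUE_EFFECT = 1.0  # assumed value; known by construction


def make_synthetic(n=4000, seed=1):
    """Generate confounder x, treatment d, and outcome y with a known effect."""
    rng = random.Random(seed)
    xs, ds, ys = [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        d = 1 if rng.random() < 1.0 / (1.0 + math.exp(-x)) else 0
        xs.append(x)
        ds.append(d)
        ys.append(10.0 + 2.0 * x + TRUE_EFFECT * d + rng.gauss(0.0, 1.0))
    return xs, ds, ys


def fit_propensity(xs, ds, iters=25):
    """One-covariate logistic regression fitted with Newton's method."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, d in zip(xs, ds):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += d - p          # gradient, intercept
            g1 += (d - p) * x    # gradient, slope
            h00 += w             # entries of X'WX
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]


def psm_att(scores, ds, ys):
    """Effect on the treated via 1-nearest-neighbour matching on the score."""
    controls = sorted((s, y) for s, d, y in zip(scores, ds, ys) if d == 0)
    control_scores = [s for s, _ in controls]
    diffs = []
    for s, d, y in zip(scores, ds, ys):
        if d != 1:
            continue
        i = bisect.bisect_left(control_scores, s)
        # The nearest control is one of the two neighbours of the insertion point.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(controls)]
        j = min(candidates, key=lambda k: abs(control_scores[k] - s))
        diffs.append(y - controls[j][1])
    return sum(diffs) / len(diffs)


xs, ds, ys = make_synthetic()
scores = fit_propensity(xs, ds)
att = psm_att(scores, ds, ys)
print(f"true effect: {TRUE_EFFECT:.2f}, matched estimate: {att:.2f}")
```

Matching with replacement keeps the sketch short. Production libraries add refinements such as calipers, multiple matches per unit, and bias adjustment.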

Propensity score weighting did worst across the board: dowhy generated a heavily biased result (223% error in the small-effect case), and causalinference did not produce a working result at all. Distance matching performed much better than we initially expected.
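For reference, this is roughly what propensity score weighting computes. The sketch below is an assumption-laden illustration, not the dowhy or causalinference code, and it does not reproduce or explain the benchmark failures above. It uses a hypothetical three-segment confounder, estimates the propensity as the treated share within each segment, and applies normalized inverse-propensity weights.

```python
import random

TRUE_EFFECT = 1.0                      # assumed value; known by construction
P_TREAT = {0: 0.1, 1: 0.5, 2: 0.9}     # hypothetical treatment rates per segment


def make_synthetic(n=6000, seed=2):
    """Rows of (segment, treated, outcome); segment confounds treatment and outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        seg = rng.randrange(3)
        d = 1 if rng.random() < P_TREAT[seg] else 0
        y = 10.0 + 3.0 * seg + TRUE_EFFECT * d + rng.gauss(0.0, 1.0)
        rows.append((seg, d, y))
    return rows


def estimate_propensity(rows):
    """Propensity per segment = observed share of treated units in that segment."""
    counts, treated = {}, {}
    for seg, d, _ in rows:
        counts[seg] = counts.get(seg, 0) + 1
        treated[seg] = treated.get(seg, 0) + d
    return {seg: treated[seg] / counts[seg] for seg in counts}


def ipw_ate(rows, propensity):
    """Normalized inverse-propensity-weighted difference in mean outcomes."""
    num_t = den_t = num_c = den_c = 0.0
    for seg, d, y in rows:
        e = propensity[seg]
        if d == 1:
            w = 1.0 / e            # treated units are weighted by 1 / e
            num_t += w * y
            den_t += w
        else:
            w = 1.0 / (1.0 - e)    # control units are weighted by 1 / (1 - e)
            num_c += w * y
            den_c += w
    return num_t / den_t - num_c / den_c


rows = make_synthetic()
propensity = estimate_propensity(rows)
ate = ipw_ate(rows, propensity)
print(f"true effect: {TRUE_EFFECT:.2f}, weighted estimate: {ate:.2f}")
```

The weights grow quickly as the propensity approaches 0 or 1, so weighting estimators are generally sensitive to how well the propensity model is specified.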

Full results are shown below for different effect sizes; the percentage is the error of the estimate vs. the true treatment effect:

Library	Algorithm	Small effect	Medium effect	Large effect
causalinference	propensity score matching	0.0%	0.0%	0.0%
causalinference	propensity score blocking	17%	3.4%	0.3%
causalinference	propensity score weighting	N/A	N/A	N/A
causalinference	propensity score OLS	35%	7.0%	0.7%
dowhy	propensity score matching	0.0%	0.0%	0.0%
dowhy	propensity score stratification	6.1%	1.2%	0.1%
dowhy	propensity score weighting	223%	45%	4.5%
dowhy	distance matching	0.0%	0.0%	0.0%
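The percentages in the table can be read as relative error against the known true effect. A minimal sketch of that metric follows. The post does not state the actual effect sizes, so the true effects (1, 5, 50) and the fixed bias of 0.35 are hypothetical numbers chosen for illustration.

```python
def pct_error(estimate, true_effect):
    """Absolute error of the estimate as a percentage of the true effect."""
    return abs(estimate - true_effect) / abs(true_effect) * 100.0


# Hypothetical values: the same absolute bias applied to three effect sizes.
bias = 0.35
for label, true_effect in [("small", 1.0), ("medium", 5.0), ("large", 50.0)]:
    estimate = true_effect + bias
    print(f"{label:>6}: {pct_error(estimate, true_effect):.1f}% error")
```

With this definition, a roughly constant absolute bias produces a relative error that shrinks as the true effect grows. That is consistent with each row of the table, and it is why the small-effect case is the hardest one.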

How can we help?

Feel free to check us out and start your free trial at https://app.g2m.ai or contact us below!


Garrett Sznip has over 23 years of consulting experience building and growing companies. His practice expertise resides in go-to-market strategy, pricing, financial modeling and operations. Garrett is a trusted advisor to many C-Level executives and drives significant shareholder value for his clients.
