When your winning variant is quietly losing: Simpson's paradox in A/B tests

You shipped a test. The dashboard said variant B converted better than the control, so you rolled it out to everyone. A month later revenue is flat, maybe down, and nobody can explain it.

There is a good chance you ran into Simpson's paradox.

What Simpson's paradox actually is

Simpson's paradox is when a pattern holds inside every subgroup of your data and then flips the moment you add the subgroups together. The winner at the segment level becomes the loser in the totals, or the reverse.

It is not a tracking bug or a broken test. Both views are arithmetically correct. The aggregate number is simply answering a different question than the one you think you are asking.

A conversion example you will recognise

Say you test a new pricing page against the control and break the results out by device.

Control

Desktop: 270 conversions from 900 visitors, a 30% rate
Mobile: 5 conversions from 100 visitors, a 5% rate
Combined: 275 from 1,000, a 27.5% rate

Variant

Desktop: 35 conversions from 100 visitors, a 35% rate
Mobile: 90 conversions from 900 visitors, a 10% rate
Combined: 125 from 1,000, a 12.5% rate

Read it by device and the variant wins everywhere. It beats the control 35% to 30% on desktop and 10% to 5% on mobile. Read the totals and the control wins in a landslide, 27.5% against 12.5%.

Both statements are true at the same time. The variant converts better on every device and still looks worse in the combined report.
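If you want to check the arithmetic yourself, here is a minimal Python sketch that recomputes the rates from the counts above. The dictionary layout and the function names are illustrative choices, not part of any analytics tool.

```python
# Counts copied from the example above: (conversions, visitors) per page and device.
results = {
    "control": {"desktop": (270, 900), "mobile": (5, 100)},
    "variant": {"desktop": (35, 100), "mobile": (90, 900)},
}


def segment_rates(arm):
    """Per-device conversion rate for one page, as fractions between 0 and 1."""
    return {device: c / n for device, (c, n) in results[arm].items()}


def combined_rate(arm):
    """Pooled rate: total conversions over total visitors for one page."""
    conversions = sum(c for c, _ in results[arm].values())
    visitors = sum(n for _, n in results[arm].values())
    return conversions / visitors


for arm in results:
    per_device = ", ".join(f"{d} {r:.1%}" for d, r in segment_rates(arm).items())
    print(f"{arm}: {per_device}, combined {combined_rate(arm):.1%}")
```

Running it reproduces the figures in the breakdown: the variant is ahead on both devices and behind on the combined line.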

Why it happens

Look at where the traffic landed. The control was shown mostly to desktop visitors, the variant mostly to mobile. Device is a lurking variable, also called a confounder: it drives conversion on its own, and the two pages saw wildly different mixes of it. The combined rate is just a traffic-weighted blend of the segment rates, so it gets dragged toward whichever segment each page happened to receive more of. The control's total is 90% desktop traffic converting at 30%; the variant's is 90% mobile traffic converting at 10%.
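The weighted blend is easy to see in code. The sketch below rebuilds each combined rate from traffic shares and segment rates, then asks what each page would have shown under the same traffic mix. The 50/50 mix is a hypothetical chosen only for illustration, and the function names are made up for this example.

```python
# (traffic share, conversion rate) per device, derived from the example counts.
control = {"desktop": (0.9, 0.30), "mobile": (0.1, 0.05)}
variant = {"desktop": (0.1, 0.35), "mobile": (0.9, 0.10)}


def blended_rate(arm):
    """Combined rate as the traffic-weighted average of the segment rates."""
    return sum(share * rate for share, rate in arm.values())


def reweighted_rate(arm, mix):
    """Rate the page would show under another traffic mix, keeping its segment rates."""
    return sum(mix[device] * rate for device, (_, rate) in arm.items())


# Hypothetical common mix, used only to put both pages on equal footing.
even_mix = {"desktop": 0.5, "mobile": 0.5}

print(f"control: blended {blended_rate(control):.3f}, "
      f"even mix {reweighted_rate(control, even_mix):.3f}")
print(f"variant: blended {blended_rate(variant):.3f}, "
      f"even mix {reweighted_rate(variant, even_mix):.3f}")
```

With the real mixes, the blended rates match the combined figures of 27.5% and 12.5%. With the same mix on both sides, the control comes out at 17.5% and the variant at 22.5%, so the ordering agrees with the per-device view. Any common mix gives the same ordering here, because the variant is ahead in both segments.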

In a live test this usually traces back to one of a few things. Allocation drifted instead of staying a clean random split. The variants ran in different time windows with different traffic. Or a campaign dumped a wave of low-intent visitors onto one side.

How to stop trusting the wrong number

Read results by segment, never just the headline rate. Device, traffic source, new versus returning, geography, and campaign are the usual culprits.
Check that each variant saw a comparable traffic mix. Lopsided splits mean the aggregate is not safe to act on.
Keep randomization concurrent and honest: same window, same allocation rule, same audience pool.
Distrust any result that hangs on a single oversized segment.
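Two of these checks can be automated with very little code. The sketch below compares the traffic mix of the two pages and flags a result where every segment points one way and the pooled rate points the other. It assumes both pages report the same segments with non-zero traffic. The names and the 5-point threshold are hypothetical and should be tuned to your own data; this is a rough screen, not a statistical test.

```python
# Hypothetical threshold: how far a segment's traffic share may differ between
# pages before the pooled comparison is treated as unsafe.
MAX_MIX_GAP = 0.05


def traffic_mix(arm):
    """Share of each segment in one page's traffic."""
    total = sum(visitors for _, visitors in arm.values())
    return {segment: visitors / total for segment, (_, visitors) in arm.items()}


def mix_gap(control, variant):
    """Largest absolute difference in segment share between the two pages."""
    control_mix, variant_mix = traffic_mix(control), traffic_mix(variant)
    return max(abs(control_mix[s] - variant_mix[s]) for s in control_mix)


def pooled_rate(arm):
    """Total conversions over total visitors."""
    conversions = sum(c for c, _ in arm.values())
    visitors = sum(n for _, n in arm.values())
    return conversions / visitors


def is_reversal(control, variant):
    """True when every segment favours one page but the pooled rate favours the other."""
    segment_lifts = [
        variant[s][0] / variant[s][1] - control[s][0] / control[s][1]
        for s in control
    ]
    pooled_lift = pooled_rate(variant) - pooled_rate(control)
    wins_everywhere = all(lift > 0 for lift in segment_lifts)
    loses_everywhere = all(lift < 0 for lift in segment_lifts)
    return (wins_everywhere and pooled_lift < 0) or (loses_everywhere and pooled_lift > 0)


# (conversions, visitors) per device, from the pricing page example.
control = {"desktop": (270, 900), "mobile": (5, 100)}
variant = {"desktop": (35, 100), "mobile": (90, 900)}

gap = mix_gap(control, variant)
print(f"largest mix gap: {gap:.0%}, safe to pool: {gap <= MAX_MIX_GAP}")
print(f"segments and pooled result disagree: {is_reversal(control, variant)}")
```

On the pricing page example the mix gap is 80 percentage points and the reversal flag is set, so the headline rate would be rejected before anyone acted on it.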

The deeper lesson: one winner is the wrong goal

Simpson's paradox is a symptom of a bigger habit. Crowning a single global winner assumes one best page exists for everyone. It almost never does. A desktop price-shopper and a first-time mobile visitor arrive with different questions, and the page that converts one can stall the other.

The fix is not a cleaner aggregate. It is to stop aggregating decisions you should be making per segment.