You shipped a test. The dashboard said variant B converted better than the control, so you rolled it out to everyone. A month later revenue is flat, maybe down, and nobody can explain it.
There is a good chance you ran into Simpson's paradox.
What Simpson's paradox actually is
Simpson's paradox is when a pattern holds inside every subgroup of your data and then flips the moment you add the subgroups together. The winner at the segment level becomes the loser in the totals, or the reverse.
It is not a tracking bug or a broken test. Both views are arithmetically correct. The aggregate number is simply answering a different question than the one you think you are asking.
A conversion example you will recognise
Say you test a new pricing page against the control and break the results out by device.
Control
- Desktop: 270 conversions from 900 visitors, a 30% rate
- Mobile: 5 conversions from 100 visitors, a 5% rate
- Combined: 275 from 1,000, a 27.5% rate
Variant
- Desktop: 35 conversions from 100 visitors, a 35% rate
- Mobile: 90 conversions from 900 visitors, a 10% rate
- Combined: 125 from 1,000, a 12.5% rate
Read it by device and the variant wins everywhere. It beats the control 35% to 30% on desktop and 10% to 5% on mobile. Read the totals and the control wins in a landslide, 27.5% against 12.5%.
Both statements are true at the same time. The variant is better within every device segment and worse on the combined report.
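If you want to check the reversal yourself, the arithmetic is short enough to reproduce. The sketch below uses only the counts from the example above; the dictionary layout and the helper names (`rate`, `combined_rate`) are illustrative choices, not part of any particular testing tool.

```python
# Counts copied from the example: (conversions, visitors) per device.
results = {
    "control": {"desktop": (270, 900), "mobile": (5, 100)},
    "variant": {"desktop": (35, 100), "mobile": (90, 900)},
}

def rate(conversions, visitors):
    """Conversion rate for a single segment."""
    return conversions / visitors

def combined_rate(segments):
    """Conversion rate after pooling every segment of one page."""
    conversions = sum(c for c, _ in segments.values())
    visitors = sum(v for _, v in segments.values())
    return conversions / visitors

# Winner inside each device segment.
segment_winners = {
    device: max(results, key=lambda arm: rate(*results[arm][device]))
    for device in ("desktop", "mobile")
}

# Winner once the segments are pooled.
overall_winner = max(results, key=lambda arm: combined_rate(results[arm]))

print(segment_winners)  # {'desktop': 'variant', 'mobile': 'variant'}
print(overall_winner)   # control
```

The same data produces two different winners depending only on whether you pool before or after comparing.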
Why it happens
Look at where the traffic landed. The control was shown mostly to desktop visitors, the variant mostly to mobile. Device is a lurking variable: it drives conversion on its own, and the two pages saw wildly different mixes of it. The combined rate is just a weighted blend, and it gets dragged toward whichever segment each page happened to receive more of.
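To make the "weighted blend" concrete, here is a minimal sketch that rebuilds each combined rate as traffic share times segment rate, summed over devices. The numbers are the ones from the example; the function name `blended_rate` is made up for illustration.

```python
def blended_rate(segments):
    """Combined rate written as sum(traffic share * segment rate)."""
    total = sum(visitors for _, visitors in segments.values())
    return sum(
        (visitors / total) * (conversions / visitors)
        for conversions, visitors in segments.values()
    )

# (conversions, visitors) per device, from the example.
control = {"desktop": (270, 900), "mobile": (5, 100)}
variant = {"desktop": (35, 100), "mobile": (90, 900)}

# control: 0.9 * 0.30 + 0.1 * 0.05 = 0.275
# variant: 0.1 * 0.35 + 0.9 * 0.10 = 0.125
print(round(blended_rate(control), 3))  # 0.275
print(round(blended_rate(variant), 3))  # 0.125
```

The control's total is 90% desktop rate, the variant's is 90% mobile rate. The gap between the two totals mostly reflects that difference in weights, not a difference between the pages.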
In a live test this usually traces back to one of a few things. Allocation drifted instead of staying a clean random split. Two variants ran in different time windows with different traffic. Or a campaign dumped a wave of low-intent visitors onto one side.
How to stop trusting the wrong number
- Read results by segment, never just the headline rate. Device, traffic source, new versus returning, geography, and campaign are the usual culprits.
- Check that each variant saw a comparable traffic mix. If the share of desktop, mobile, or any other segment differs sharply between variants, the aggregate is not safe to act on.
- Keep randomization concurrent and honest: same window, same allocation rule, same audience pool.
- Distrust any result that hangs on a single oversized segment.
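Two of the checks above can be sketched in a few lines: measuring how far apart the traffic mixes are, and comparing the pages on a common mix instead of the mix each one happened to receive. This is one simple approach (direct standardization against the pooled traffic mix), not the only way to adjust for a confounder. The 0.05 threshold and all function names are assumptions chosen for illustration; pick a threshold that suits your own traffic volumes.

```python
# (conversions, visitors) per device, from the example.
control = {"desktop": (270, 900), "mobile": (5, 100)}
variant = {"desktop": (35, 100), "mobile": (90, 900)}

# Hypothetical tolerance for how much a segment's share may differ between pages.
MIX_GAP_THRESHOLD = 0.05

def traffic_shares(segments):
    """Fraction of a page's visitors that came from each segment."""
    total = sum(v for _, v in segments.values())
    return {name: v / total for name, (_, v) in segments.items()}

def max_share_gap(a, b):
    """Largest difference in segment share between two pages."""
    shares_a, shares_b = traffic_shares(a), traffic_shares(b)
    return max(abs(shares_a[s] - shares_b[s]) for s in shares_a)

def standardized_rate(segments, reference_shares):
    """Rate the page would show if it had received the reference traffic mix."""
    return sum(reference_shares[s] * c / v for s, (c, v) in segments.items())

gap = max_share_gap(control, variant)
mix_is_comparable = gap <= MIX_GAP_THRESHOLD

# Reference mix: each segment's share of all visitors across both pages.
total_visitors = sum(v for arm in (control, variant) for _, v in arm.values())
reference = {s: (control[s][1] + variant[s][1]) / total_visitors for s in control}

print(round(gap, 2), mix_is_comparable)                # 0.8 False
print(round(standardized_rate(control, reference), 3)) # 0.175
print(round(standardized_rate(variant, reference), 3)) # 0.225
```

On an equal mix the variant comes out ahead, 22.5% against 17.5%, which agrees with the per-device reading. Small segments still produce noisy rates, so treat a standardized figure as a sanity check, not a replacement for a properly randomized split.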
The deeper lesson: one winner is the wrong goal
Simpson's paradox is a symptom of a bigger habit. Crowning a single global winner assumes one best page exists for everyone. It almost never does. A desktop price-shopper and a first-time mobile visitor arrive with different questions, and the page that converts one can stall the other.
The fix is not a cleaner aggregate. It is to stop aggregating decisions you should be making per segment.
This is the principle Optimeleon is built on. Instead of picking one winner and serving it to your whole audience, it learns the best variant for each segment and routes every visitor to it, continuously. The paradox cannot bite, because the system never flattens your audiences into a single average that hides what is really happening. If you want the mechanics, we wrote up how Optimeleon works in plain language.
