The math behind Optimeleon: how a bandit serves the right page to every visitor

Most testing tools ask one question and pause your growth until the answer arrives: which page is better, A or B? Optimeleon asks a sharper one and answers it nonstop: which page is most likely to convert the person looking right now?

That shift is mathematical, not cosmetic. Here is what is actually running.

The cost of a 50/50 split

Classic A/B testing sends half your traffic to each variant until a significance test clears. While you wait, half of everyone keeps seeing the weaker page. Statisticians have a name for that lost ground, "regret": the gap between the expected payoff of the best choice you could have made and that of the choice you actually made, summed over every visitor.

The longer the test runs and the bigger the true difference, the more conversions you burn proving something you could have started acting on far earlier. A fixed split is a deliberate decision to keep paying that toll until a p-value gives you permission to stop.
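To put a number on that toll, here is a minimal sketch of the expected regret of an even split between two pages. The function name and the 4 percent and 5 percent conversion rates are illustrative assumptions, not figures from any real campaign.

```python
def fixed_split_regret(rate_a: float, rate_b: float, visitors: int) -> float:
    """Expected conversions lost by a 50/50 split versus always serving the better page."""
    gap = abs(rate_a - rate_b)
    # Half of the visitors see the weaker page; each one costs the gap in expectation.
    return 0.5 * visitors * gap


# Hypothetical rates: page A converts at 4%, page B at 5%.
for n in (1_000, 10_000, 100_000):
    print(n, round(fixed_split_regret(0.04, 0.05, n), 2))
```

The cost scales with both the number of visitors and the size of the gap, which is why long tests with large true differences are the most expensive ones.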

Variants as slot machines

Picture each variant as the arm of a multi-armed bandit. Every arm has an unknown payout, its real conversion rate, and you only learn about an arm by pulling it. The goal is to win as much as possible while you figure out which arm pays best.

That sets up the central tension:

Exploration: spend traffic learning how each variant performs.
Exploitation: spend traffic on the variant that looks best so far.

A 50/50 split is pure exploration: it keeps learning but never acts on what it has learned until the test ends. Shipping a hunch is pure exploitation with no learning at all. A good policy blends the two and keeps re-balancing as evidence piles up.

Thompson sampling, the engine

Optimeleon runs on Thompson sampling. The idea: describe each variant's conversion rate as a probability distribution instead of a single number, then let those distributions compete.

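For a yes-or-no event like converting, the natural model treats each visit as a Bernoulli trial and places a Beta distribution over its unknown success rate. The Beta is the conjugate prior for Bernoulli outcomes, which is a precise way of saying the bookkeeping stays exact and cheap as data arrives: after every visit the updated curve is still a Beta, and updating it means incrementing one of two counters. Every variant starts at Beta(1, 1), a flat curve that says every conversion rate between 0 and 100 percent is equally plausible. Then, for each visitor: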

Draw one random sample from each variant's current Beta curve.
Serve the variant whose sample came out highest.
Record the result. A conversion adds 1 to that variant's alpha; a miss adds 1 to its beta.
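The three steps above can be sketched in a few lines of Python. This is a minimal illustration of Beta-Bernoulli Thompson sampling using only the standard library, not Optimeleon's production code; the names BetaArm, choose_variant and simulate, and the true conversion rates passed to the simulation, are assumptions made for the example.

```python
import random


class BetaArm:
    """Beta posterior over one variant's conversion rate, starting at Beta(1, 1)."""

    def __init__(self) -> None:
        self.alpha = 1  # 1 + conversions observed
        self.beta = 1   # 1 + misses observed

    def sample(self, rng: random.Random) -> float:
        # Step 1: draw one random sample from the current Beta curve.
        return rng.betavariate(self.alpha, self.beta)

    def update(self, converted: bool) -> None:
        # Step 3: a conversion bumps alpha, a miss bumps beta.
        if converted:
            self.alpha += 1
        else:
            self.beta += 1


def choose_variant(arms: dict, rng: random.Random) -> str:
    # Step 2: serve the variant whose sample came out highest.
    return max(arms, key=lambda name: arms[name].sample(rng))


def simulate(true_rates: dict, visitors: int, seed: int = 0):
    """Run the loop against simulated visitors with known (hypothetical) rates."""
    rng = random.Random(seed)
    arms = {name: BetaArm() for name in true_rates}
    served = {name: 0 for name in true_rates}
    for _ in range(visitors):
        name = choose_variant(arms, rng)
        converted = rng.random() < true_rates[name]  # simulated visitor behaviour
        arms[name].update(converted)
        served[name] += 1
    return arms, served


arms, served = simulate({"A": 0.04, "B": 0.06}, visitors=20_000)
print(served)
```

In a live system the simulated coin flip would be replaced by the visitor's real outcome; everything else in the loop stays the same.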

That loop is the whole method. A variant with little data has a wide curve, so its random draws land high often enough to keep earning impressions. That is exploration, and it happens on its own. A variant with plenty of data has a narrow curve sitting near its true rate, so it only wins the draw when it genuinely should. That is exploitation, also on its own. No threshold to set, no test to babysit.
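The wide-versus-narrow behaviour follows directly from the standard formulas for the mean and standard deviation of a Beta distribution. The sketch below uses invented counts (a 5 percent conversion rate observed at three sample sizes) to show how the curve tightens as data accumulates; the helper name beta_mean_std is an assumption for the example.

```python
import math


def beta_mean_std(alpha: float, beta: float):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1))
    return mean, math.sqrt(variance)


# Hypothetical history: (conversions, misses) on top of the Beta(1, 1) prior.
for conversions, misses in [(0, 0), (5, 95), (500, 9_500)]:
    mean, std = beta_mean_std(1 + conversions, 1 + misses)
    print(f"visits={conversions + misses:>6}  mean={mean:.4f}  std={std:.4f}")
```

The standard deviation shrinks as visits accumulate, so a well-measured variant's draws cluster tightly around its observed rate while a new variant's draws range widely.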

Why this wins: regret grows slowly

The payoff shows up in how regret accumulates. A fixed split racks it up in a straight line, because every visitor routed to the loser costs you the full gap, start to finish. Thompson sampling racks it up far more slowly: its expected regret grows roughly with the logarithm of the visitor count, because as confidence sharpens, mistakes quickly become rare. Across a real campaign that can be the difference between leaking conversions for the whole length of a test and capturing most of the upside early.
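A small simulation makes the contrast visible. This is a sketch under stated assumptions: two variants with made-up true rates of 4 and 5 percent, a fixed random seed, and hypothetical function names. It tallies expected regret, meaning the gap to the best rate for every visitor served a weaker variant.

```python
import random


def thompson_regret(true_rates, visitors, seed=0):
    """Cumulative expected regret of Beta-Bernoulli Thompson sampling."""
    rng = random.Random(seed)
    best = max(true_rates)
    params = [[1, 1] for _ in true_rates]  # [alpha, beta] per variant
    regret = 0.0
    for _ in range(visitors):
        draws = [rng.betavariate(a, b) for a, b in params]
        arm = draws.index(max(draws))
        regret += best - true_rates[arm]  # zero when the best variant is served
        if rng.random() < true_rates[arm]:
            params[arm][0] += 1
        else:
            params[arm][1] += 1
    return regret


def even_split_regret(true_rates, visitors):
    """Expected regret when every variant gets an equal share of traffic."""
    best = max(true_rates)
    mean_gap = sum(best - r for r in true_rates) / len(true_rates)
    return mean_gap * visitors


rates = [0.04, 0.05]  # hypothetical true conversion rates
for n in (1_000, 10_000, 50_000):
    print(n, round(even_split_regret(rates, n), 1), round(thompson_regret(rates, n), 1))
```

The even-split column grows in direct proportion to traffic. The Thompson column depends on the seed and on how close the rates are, but in runs like this it typically flattens as the posterior settles on the better variant.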

Context is the real unlock

A plain bandit still hunts for one best arm. Real audiences do not share one best arm.

So Optimeleon conditions on context: device, traffic source, geography, new or returning, and more. Each context carries its own set of Beta curves, which means the mobile visitor from paid social and the desktop visitor from organic search can converge on different winners at the same moment, with no manual segmentation rules.
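One simple way to picture "its own set of Beta curves" is a lookup table keyed by context. The sketch below is an illustrative assumption about the data structure, not a description of Optimeleon's internals; the class name, the variant names and the context tuples are all hypothetical.

```python
import random
from collections import defaultdict

VARIANTS = ("A", "B")  # hypothetical variant names


def new_arms():
    # Every variant starts at Beta(1, 1), stored as [alpha, beta].
    return {variant: [1, 1] for variant in VARIANTS}


class PerContextBandit:
    """Independent Beta-Bernoulli Thompson sampling for each context key."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.arms = defaultdict(new_arms)  # context -> variant -> [alpha, beta]

    def choose(self, context):
        arms = self.arms[context]
        return max(arms, key=lambda v: self.rng.betavariate(*arms[v]))

    def record(self, context, variant, converted):
        self.arms[context][variant][0 if converted else 1] += 1


bandit = PerContextBandit()
mobile_social = ("mobile", "paid_social")
desktop_search = ("desktop", "organic_search")

shown = bandit.choose(mobile_social)
bandit.record(mobile_social, shown, converted=True)
print(bandit.arms[mobile_social], bandit.arms[desktop_search])
```

Because each context key owns separate counters, evidence from mobile paid-social visitors never moves the curves used for desktop organic visitors. The trade-off in a table like this is that every extra context dimension splits the data further, so each cell learns more slowly.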

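This is also why a single aggregate winner can quietly mislead you. A global average can mask a variant that wins some segments and loses others. In the extreme case, known as Simpson's paradox, a variant can lose inside every segment and still win the pooled total, simply because of how traffic was distributed across segments. Per-context bandits never average those decisions together in the first place.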
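To make the aggregate-versus-segment trap concrete, here is a small worked example with invented counts in which variant B has the higher conversion rate inside each segment, yet variant A has the higher rate once the segments are pooled. The segment names, counts and helper functions are all made up for illustration.

```python
# (conversions, visitors) per segment and variant; all counts are invented.
data = {
    "mobile": {"A": (4, 100), "B": (45, 900)},
    "desktop": {"A": (90, 900), "B": (11, 100)},
}


def rate(conversions, visitors):
    return conversions / visitors


def segment_winners(data):
    """Variant with the highest conversion rate inside each segment."""
    return {
        segment: max(arms, key=lambda v: rate(*arms[v]))
        for segment, arms in data.items()
    }


def aggregate_winner(data):
    """Variant with the highest conversion rate after pooling all segments."""
    totals = {}
    for arms in data.values():
        for variant, (conversions, visitors) in arms.items():
            c, n = totals.get(variant, (0, 0))
            totals[variant] = (c + conversions, n + visitors)
    return max(totals, key=lambda v: rate(*totals[v]))


print(segment_winners(data))   # {'mobile': 'B', 'desktop': 'B'}
print(aggregate_winner(data))  # A
```

B converts at 5 percent versus 4 percent on mobile and 11 percent versus 10 percent on desktop, but pooled it shows 5.6 percent against A's 9.4 percent, because most of A's traffic came from the higher-converting desktop segment.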

Is it still rigorous?

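Yes, with one thing to keep in mind. Allocation stays randomized, because the choice comes out of random draws rather than a hand-picked rule, so every visitor is still assigned by chance and the lift you measure is still causal. There is no p-hacking, because nothing hinges on stopping at a flattering moment. What changes is how you read the numbers: because traffic shifts toward the leader, raw conversion averages for rarely served variants are noisy and tend to skew low, so lift is better read from the Beta curves themselves than from a significance test designed for a fixed split. The system simply spends your traffic on the answer instead of only on the question.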

The outcome is easy to state even though the math is doing real work underneath: every visitor gets the page most likely to convert them, and the model gets sharper with each one.