The name comes from slot machines, the old "one-armed bandits." Picture a row of them, each paying out at a different unknown rate, and a finite stack of coins in your pocket. Which arms do you pull?
That puzzle is the whole idea, and it maps cleanly onto optimizing a website.
The core dilemma: explore versus exploit
- Explore: try arms you are unsure about, to learn how they pay.
- Exploit: keep pulling the arm that has paid best so far.
Pull only the current best and you might be stuck on a mediocre arm you judged too soon. Spread your coins evenly forever and you waste them on known losers. The entire game is balancing the two.
How it maps to your site
Each variant of a page is an arm. Each visitor is a coin. A conversion is a payout. A 50/50 A/B split is the "explore evenly forever" strategy: fair, but it keeps sending half your traffic to the loser for the whole test. A bandit shifts traffic toward winners as evidence builds, so you lose fewer conversions while you learn.
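To put a rough number on that cost, here is a minimal sketch with made-up figures: two variants with assumed true conversion rates of 5% and 10%, and 10,000 visitors. The function name and every number are illustrative assumptions, not results from a real test. It computes expected conversions only, with no randomness.

```python
def expected_conversions(rates, traffic_shares, visitors):
    """Expected conversions when each arm gets a fixed share of the visitors."""
    return sum(rate * share * visitors
               for rate, share in zip(rates, traffic_shares))

# Hypothetical true conversion rates (unknown to you in practice).
rates = [0.05, 0.10]
visitors = 10_000

# A 50/50 split sends half the visitors to each variant for the whole test.
even_split = expected_conversions(rates, [0.5, 0.5], visitors)

# The unreachable ideal: every visitor sees the better variant.
best_only = expected_conversions(rates, [0.0, 1.0], visitors)

# The gap is the expected price of splitting evenly to the end.
cost_of_even_split = best_only - even_split
```

With these numbers the even split expects about 750 conversions against 1,000 for the best arm alone, so roughly 250 conversions are the price of the test. A bandit cannot reach the full 1,000, because it does not know the best arm up front, but it aims to land somewhere between the two.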
The common strategies, in plain terms
- Epsilon-greedy: mostly exploit the leader, occasionally pick a random arm to keep exploring. Simple, a little blunt.
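- Upper confidence bound (UCB): score each arm by its observed rate plus a bonus that shrinks as the arm collects data, then play the highest score. That favors arms that are either doing well or still uncertain, on the logic that uncertainty might be hiding a winner.
- Thompson sampling: model each arm's rate as a probability distribution, draw one random sample from each, and play the arm with the highest draw. Elegant and effective. (How that works in detail.)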
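The three strategies above can each be sketched as a small function that picks which arm to show the next visitor. This is a minimal illustration, not a production implementation: the function names, the `successes`/`pulls` bookkeeping lists, the default `epsilon=0.1`, the UCB1 form of the confidence bonus, and the uniform Beta(1, 1) prior are all assumptions chosen for simplicity.

```python
import math
import random

def epsilon_greedy(successes, pulls, epsilon=0.1, rng=random):
    """With probability epsilon pick a random arm, else the best observed rate."""
    if rng.random() < epsilon:
        return rng.randrange(len(pulls))  # explore
    rates = [s / n if n else 0.0 for s, n in zip(successes, pulls)]
    return max(range(len(rates)), key=rates.__getitem__)  # exploit

def ucb1(successes, pulls):
    """Pick the arm with the highest observed rate plus uncertainty bonus (UCB1)."""
    for arm, n in enumerate(pulls):
        if n == 0:
            return arm  # every arm gets tried once before scoring
    total = sum(pulls)
    scores = [s / n + math.sqrt(2 * math.log(total) / n)
              for s, n in zip(successes, pulls)]
    return max(range(len(scores)), key=scores.__getitem__)

def thompson(successes, pulls, rng=random):
    """Draw from Beta(1 + successes, 1 + failures) per arm; play the top draw."""
    draws = [rng.betavariate(1 + s, 1 + (n - s))
             for s, n in zip(successes, pulls)]
    return max(range(len(draws)), key=draws.__getitem__)
```

In use, you would call one of these for each visitor, show the chosen variant, then add 1 to `pulls[arm]` and, if the visitor converts, 1 to `successes[arm]`. Note how a barely tested arm gets a large bonus under `ucb1` and a wide spread of draws under `thompson`, which is what keeps exploration alive without a fixed split.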
When a bandit wins
- Many variants, where a fixed split would starve each one of data.
- Ongoing optimization with no clean end date.
- Costly traffic, where wasting visitors on a known loser is genuinely expensive.
When a plain A/B test is fine
- You truly need one clean, defensible verdict for a one-time decision.
- You have low traffic and only two options.
For the head-to-head, see A/B testing vs multi-armed bandits.
