False positives, false negatives: Type I and Type II errors in A/B testing

Every A/B test makes a yes-or-no call: did the variant beat the control? There are two ways that call goes wrong, and most teams only worry about one of them.

Type I error: the false positive

A Type I error is declaring a winner that is not real. The variant looked better by chance, you shipped it, and the lift quietly evaporates. Your significance level is the dial for this. At the common 95% confidence level, the significance level (alpha) is 5%: when there is no real difference between the pages, you accept a 5% chance that any single test calls a winner anyway.
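To see that 5% show up, you can simulate A/A tests, where both arms share the same true conversion rate, and count how often a standard test still calls a difference. This is a minimal sketch, not a production tool: the 10% conversion rate, the sample sizes, and the function names are all assumptions chosen for illustration, and the test used is a pooled two-proportion z-test.

```python
import random
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(rate=0.10, n_per_arm=1000, alpha=0.05,
                        n_tests=2000, seed=1):
    """Share of A/A tests (no real difference) that still look significant."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_tests):
        # Both arms draw from the SAME conversion rate (illustrative value).
        conv_a = sum(rng.random() < rate for _ in range(n_per_arm))
        conv_b = sum(rng.random() < rate for _ in range(n_per_arm))
        p = two_proportion_p_value(conv_a, n_per_arm, conv_b, n_per_arm)
        if p < alpha:
            false_positives += 1
    return false_positives / n_tests

print(f"False positive rate: {false_positive_rate():.3f}")
```

The printed rate should land near alpha. Lower alpha and it drops with it, which is exactly the dial described above.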

Type II error: the false negative

A Type II error is missing a winner that is real. The variant genuinely converts better, but your test could not see it, so you kept the worse page. This is the expensive mistake nobody notices, because nothing visibly broke. You just never banked the lift.
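A quick simulation makes the cost visible. The sketch below gives the variant a genuine lift and counts how often the test fails to flag it. The numbers are assumptions for illustration (10% control rate, 11% variant rate, 2,000 visitors per arm), and `miss_rate` is a hypothetical helper, not part of any library.

```python
import random
from statistics import NormalDist

def miss_rate(p_control=0.10, p_variant=0.11, n_per_arm=2000,
              alpha=0.05, n_tests=1000, seed=2):
    """Share of simulated tests that fail to flag a variant that really is better."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    misses = 0
    for _ in range(n_tests):
        conv_a = sum(rng.random() < p_control for _ in range(n_per_arm))
        conv_b = sum(rng.random() < p_variant for _ in range(n_per_arm))
        pooled = (conv_a + conv_b) / (2 * n_per_arm)
        se = (pooled * (1 - pooled) * 2 / n_per_arm) ** 0.5
        z = (conv_b - conv_a) / n_per_arm / se if se > 0 else 0.0
        # A "miss" is any test where the variant is not declared the winner.
        if z <= z_crit:
            misses += 1
    return misses / n_tests

print(f"Real winner missed in {miss_rate():.0%} of tests")
```

Under these assumed numbers the large majority of tests miss the winner, even though the variant is better in every single simulated run.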

The tradeoff nobody mentions

At a fixed sample size, tighten one and you loosen the other. Demand more certainty to cut false positives, and the same data detects fewer true effects, which raises false negatives unless you run the test longer. The lever that helps both at once is sample size: more data raises statistical power, the chance of detecting an effect that is really there, which also depends on how large that effect is.
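The tradeoff can be put in numbers with the usual normal approximation for a two-proportion test. The sketch below is an approximation under assumed inputs (10% control rate, 11% variant rate); `approx_power` is a hypothetical helper name.

```python
from statistics import NormalDist

def approx_power(p_control, p_variant, n_per_arm, alpha):
    """Approximate power of a two-sided two-proportion z-test."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    se = (p_control * (1 - p_control) / n_per_arm
          + p_variant * (1 - p_variant) / n_per_arm) ** 0.5
    shift = abs(p_variant - p_control) / se
    # Probability the test statistic lands beyond either critical value.
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)

for n in (5000, 20000):
    for alpha in (0.10, 0.05, 0.01):
        power = approx_power(0.10, 0.11, n, alpha)
        print(f"n={n:>6}  alpha={alpha:.2f}  "
              f"false negative rate={1 - power:.2f}")
```

Reading down each group, a stricter alpha raises the false negative rate at the same sample size. Comparing the two groups, more traffic lowers it at every alpha.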

Why this bites in practice

Peeking. Checking the test daily and stopping the moment it looks significant gives chance many tries to cross the line, which pushes the false positive rate well above the nominal 5%.
Underpowered tests. Too little traffic means real winners hide in the noise.
Tiny effects. A true 1% lift needs a lot of traffic to confirm, so many teams give up and bank a false negative without realizing it.
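The peeking problem is easy to reproduce. The sketch below runs A/A tests with no real difference, checks significance once a day for 14 days, and compares "stop at the first significant look" against "look once at the end". Daily traffic, conversion rate, and function names are assumptions for illustration.

```python
import random
from statistics import NormalDist

def is_significant(conv_a, conv_b, n, z_crit):
    """Pooled two-proportion z-test with equal arm sizes n."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = (pooled * (1 - pooled) * 2 / n) ** 0.5
    return se > 0 and abs(conv_b - conv_a) / n / se > z_crit

def compare_peeking(rate=0.10, daily_per_arm=100, days=14,
                    alpha=0.05, n_tests=1000, seed=7):
    """Return (false positive rate with daily peeking, rate with one final look)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        crossed = False
        for _day in range(days):
            conv_a += sum(rng.random() < rate for _ in range(daily_per_arm))
            conv_b += sum(rng.random() < rate for _ in range(daily_per_arm))
            n += daily_per_arm
            # A peeker would have stopped on the first day this is true.
            if is_significant(conv_a, conv_b, n, z_crit):
                crossed = True
        peeking_hits += crossed
        final_hits += is_significant(conv_a, conv_b, n, z_crit)
    return peeking_hits / n_tests, final_hits / n_tests

peeking, single_look = compare_peeking()
print(f"Daily peeking: {peeking:.1%}   Single look at the end: {single_look:.1%}")
```

The single look should stay near 5%, while daily peeking comes out several times higher, with the same data and the same test.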

How to keep both in check

Decide the sample size before you start, and do not stop early.
Choose a confidence level that matches the stakes, not a reflex 95%.
Prioritize tests on pages with enough traffic to actually resolve.
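Planning the sample size up front can be sketched with the standard normal-approximation formula for comparing two proportions. Treat the result as a rough planning number, not an exact requirement. The function names, the 80% power default, and the example inputs are assumptions for illustration.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm for a two-sided two-proportion test."""
    nd = NormalDist()
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_beta = nd.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

def days_to_resolve(daily_visitors, baseline, relative_lift, **kwargs):
    """Days of traffic needed when visitors are split evenly across two arms."""
    n = sample_size_per_arm(baseline, relative_lift, **kwargs)
    return math.ceil(2 * n / daily_visitors)

# Illustrative inputs: 10% baseline conversion, 1,000 visitors per day.
for lift in (0.20, 0.10, 0.01):
    n = sample_size_per_arm(0.10, lift)
    days = days_to_resolve(1000, 0.10, lift)
    print(f"relative lift {lift:.0%}: {n:,} per arm, about {days:,} days")
```

Running it for a few candidate pages before launch shows which tests can resolve in a sensible window and which ones, especially those chasing tiny lifts, never will on the traffic they have.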

A different way out

Both errors come from forcing one binary decision at one moment. Methods that keep allocating traffic by probability instead of a single verdict sidestep the trap: they shift toward what is likely better and keep updating, so a wrong early read self-corrects instead of getting shipped. That is the logic behind multi-armed bandits.
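One common bandit method is Thompson sampling, sketched below as an illustration of the idea; the text above does not name a specific algorithm. Each arm keeps a Beta distribution over its conversion rate, every visitor goes to the arm whose sampled rate is highest, and the result updates that arm. The true rates, visitor count, and function name are assumptions for the simulation.

```python
import random

def thompson_sampling(true_rates, n_visitors=20000, seed=3):
    """Simulate Thompson sampling; return how many visitors each arm received."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(n_visitors):
        # Draw a plausible conversion rate per arm from its Beta posterior
        # (uniform Beta(1, 1) prior), then send the visitor to the best draw.
        samples = [rng.betavariate(s + 1, f + 1)
                   for s, f in zip(successes, failures)]
        arm = samples.index(max(samples))
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]

# Illustrative rates: control converts at 10%, variant at 13%.
control_visitors, variant_visitors = thompson_sampling([0.10, 0.13])
print(f"control: {control_visitors}, variant: {variant_visitors}")
```

Early on the split is close to even. As evidence builds, most traffic drifts to the better arm, and an arm that got lucky early loses traffic once its results regress.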