Hypothesis Testing: p-Values, Statistical Significance & Common Misconceptions
Hypothesis testing is how we make decisions from data while controlling the risk of being wrong. It's the foundation of A/B testing, clinical trials, and data-driven product decisions — yet it's one of the most misunderstood concepts in statistics.
The Core Idea
You have a question: "Does this new feature increase conversion?" You observe data. But randomness exists — maybe conversion went up by pure chance. Hypothesis testing gives you a framework to decide: is this result real, or noise?
Step 1 — Define Two Hypotheses
| Hypothesis | What it says | Example |
|---|---|---|
| H0 (null hypothesis) | Nothing happened. No effect. Status quo. | "The new button does NOT change conversion rate" |
| H1 (alternative hypothesis) | Something happened. There IS an effect. | "The new button DOES change conversion rate" |
H0 is the default assumption. You assume nothing is happening until the data convinces you otherwise. This is like "innocent until proven guilty" — H0 is innocence, the data is evidence.
Step 2 — Collect Data & Compute a Test Statistic
You run your experiment and compute a test statistic — a number that summarizes how far your observed result is from what H0 would predict.
Example: You test a new checkout button on 1,000 users per group.
| Group | Users | Conversions | Conversion rate |
|---|---|---|---|
| Control (old button) | 1,000 | 100 | 10.0 % |
| Treatment (new button) | 1,000 | 125 | 12.5 % |
The observed difference is +2.5 percentage points. But could this happen by random chance if the button has no real effect?
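For two proportions, a common choice of test statistic is the pooled two-proportion z-score. A minimal stdlib sketch for the table above:

```python
from math import sqrt

# Observed data from the checkout example above
n_control, conv_control = 1000, 100    # 10.0 %
n_treat, conv_treat = 1000, 125        # 12.5 %

p_c = conv_control / n_control
p_t = conv_treat / n_treat

# Pooled conversion rate: the best single estimate if H0 is true
p_pool = (conv_control + conv_treat) / (n_control + n_treat)

# Standard error of the difference under H0
se = sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treat))

# Test statistic: how many standard errors the observed difference
# is from "no difference at all"
z = (p_t - p_c) / se
print(round(z, 2))  # 1.77
```

A z-score around 1.77 means the observed gap is about 1.8 standard errors away from what H0 predicts; the next step converts that into a p-value.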
Step 3 — The p-Value
p-value = the probability of observing a result at least as extreme as yours, assuming H0 is true.
In our example: "If the new button truly has NO effect on conversion (H0 is true), what's the probability of seeing a +2.5 pp difference (or more) just by chance?"
p-value answers: "How surprising is this data IF nothing changed?"
```
H0 is true (no effect)
        ↓
What would random data look like?
        ↓
How extreme is MY data compared to that?
        ↓
p-value = probability of being this extreme or more
```
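Continuing the checkout example, the two-sided p-value follows directly from the z-score, using only the standard library:

```python
from math import sqrt
from statistics import NormalDist

n, conv_c, conv_t = 1000, 100, 125
p_pool = (conv_c + conv_t) / (2 * n)
se = sqrt(p_pool * (1 - p_pool) * 2 / n)
z = (conv_t / n - conv_c / n) / se    # ≈ 1.77

# Two-sided p-value: probability of a |z| at least this large under H0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(round(p_value, 3))  # 0.077
```

Note that despite the +2.5 pp lift, p ≈ 0.077 with 1,000 users per group: above the conventional 0.05 threshold. This is exactly why sample-size planning matters.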
What p-value IS
- The probability of the data (or more extreme) given H0
- A measure of how incompatible your data is with the null hypothesis
- Written as: P(data | H0)
What p-value IS NOT
| Common misconception | Why it's wrong |
|---|---|
| "p = 0.03 means 3 % chance H0 is true" | p-value is about the DATA, not about H0. It's P(data \| H0), not P(H0 \| data) |
| "p = 0.03 means 97 % chance H1 is true" | Same error in reverse. The p-value says nothing about the probability of H1 |
| "p < 0.05 means the effect is real" | It means "unlikely under H0" — not "H1 is true". False positives happen |
| "p = 0.05 is significant, p = 0.06 is not" | There's no magic threshold. 0.049 and 0.051 are practically identical |
| "Small p-value = large effect" | A tiny effect can give p < 0.001 with enough data. p-value ≠ effect size |
Think of it like a fire alarm. The p-value tells you: "if there were no fire, an alarm at least this loud would go off by accident only 3 % of the time." It does NOT tell you "there's a 97 % chance there's a fire."
Step 4 — Decision Rule (α threshold)
Before looking at the data, you choose a significance level α (typically 0.05):
| If... | Decision | What it means |
|---|---|---|
| p ≤ α | Reject H0 | The data is unlikely enough under H0 to conclude something is happening |
| p > α | Fail to reject H0 | Not enough evidence to conclude an effect exists |
"Fail to reject H0" is NOT "H0 is true." It means: we don't have enough evidence to say otherwise.
Types of Errors
| | H0 is actually true | H0 is actually false |
|---|---|---|
| Reject H0 | Type I error (false positive) — probability = α | Correct! (true positive) — probability = power |
| Fail to reject H0 | Correct! (true negative) | Type II error (false negative) — probability = β |
| Error type | In A/B testing terms | How to control it |
|---|---|---|
| Type I (α) | "We shipped a feature that does nothing" | Lower α (e.g., 0.01 instead of 0.05) |
| Type II (β) | "We killed a feature that actually works" | Increase sample size or run longer |
| Power (1 - β) | "Ability to detect a real effect" | Target 80 % power minimum |
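The α row can be checked by simulation: if H0 is true by construction (both groups share the same true rate), a well-calibrated test should reject about α of the time. A minimal Monte Carlo sketch (pure stdlib; small samples chosen to keep it fast):

```python
import random
from math import sqrt
from statistics import NormalDist

def two_prop_p_value(conv_a, conv_b, n):
    """Two-sided pooled z-test p-value for two equal-size groups."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0
    z = (conv_b / n - conv_a / n) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(42)
n, true_rate, sims, alpha = 300, 0.10, 1_000, 0.05

# Both groups have the same true conversion rate, so H0 is true
# and every rejection is a false positive (Type I error).
false_positives = sum(
    two_prop_p_value(
        sum(random.random() < true_rate for _ in range(n)),
        sum(random.random() < true_rate for _ in range(n)),
        n,
    ) <= alpha
    for _ in range(sims)
)
print(false_positives / sims)  # should land close to alpha = 0.05
```

Giving one group a genuinely higher true rate turns the same loop into a power estimate (1 − β).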
Statistical Power & Sample Size
Power = probability of correctly detecting a real effect.
| Factor | Effect on required sample size |
|---|---|
| Smaller effect you want to detect | ↑ Need more data |
| Higher power (lower β) | ↑ Need more data |
| Lower α (stricter significance) | ↑ Need more data |
| Higher baseline variance | ↑ Need more data |
Rule of thumb for A/B tests:
| Minimum detectable effect | Baseline conversion | Approx. sample per group (two-sided α = 0.05, 80 % power) |
|---|---|---|
| +2 pp (10 % → 12 %) | 10 % | ~3,800 |
| +1 pp (10 % → 11 %) | 10 % | ~14,700 |
| +0.5 pp (10 % → 10.5 %) | 10 % | ~57,800 |
| +0.1 pp (10 % → 10.1 %) | 10 % | ~1,400,000 |
Detecting small effects requires enormous samples. This is why you should always define the minimum effect that matters before running an experiment.
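A quick way to compute such figures for your own baseline and minimum detectable effect is the standard normal-approximation sample-size formula for a two-sided two-proportion test; a stdlib sketch:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_base, p_target, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided
    two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ≈ 1.96
    z_power = NormalDist().inv_cdf(power)           # ≈ 0.84
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    return ceil((z_alpha + z_power) ** 2 * variance
                / (p_target - p_base) ** 2)

print(sample_size_per_group(0.10, 0.12))  # 3839  (+2 pp)
print(sample_size_per_group(0.10, 0.11))  # 14749 (+1 pp)
```

Halving the detectable effect roughly quadruples the required sample, since the sample size scales with 1/Δ².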
Confidence Intervals > p-Values
A 95 % confidence interval tells you more than a p-value:
| What it gives you | Example |
|---|---|
| Estimated effect | Conversion increased by +2.5 pp |
| Precision | 95 % CI: [+0.5 pp, +4.5 pp] |
| Significance | If CI doesn't contain 0 → significant at α = 0.05 |
| Practical significance | Is the lower bound (+0.5 pp) big enough to matter? |
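For the checkout example (1,000 users per group), a Wald 95 % confidence interval for the difference in conversion rates can be computed with the stdlib. Note that with those exact numbers the interval is wider than the illustrative one in the table and actually crosses zero, so at n = 1,000 per group the +2.5 pp lift is not yet significant:

```python
from math import sqrt
from statistics import NormalDist

n = 1000
p_c, p_t = 100 / n, 125 / n
diff = p_t - p_c                      # +2.5 pp

# Unpooled standard error (the usual choice for confidence intervals)
se = sqrt(p_c * (1 - p_c) / n + p_t * (1 - p_t) / n)
z = NormalDist().inv_cdf(0.975)       # ≈ 1.96

lo, hi = diff - z * se, diff + z * se
print(f"95% CI: [{lo:+.4f}, {hi:+.4f}]")  # [-0.0027, +0.0527]
```

The interval makes the trade-off visible: the true effect could plausibly be anywhere from slightly negative to over +5 pp, which a bare p-value hides.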
Bayesian Alternative
Bayesian hypothesis testing gives you what most people actually want: the probability that the treatment is better.
| Frequentist | Bayesian |
|---|---|
| P(data \| H0) | P(H1 \| data) ← what you actually care about |
| "p = 0.03" | "95 % probability B is better than A" |
| Typically requires a fixed, pre-committed sample size | Can stop early and update beliefs continuously |
| Significance threshold (α) | Prior + posterior distribution |
Bayesian A/B testing is increasingly adopted (Statsig, GrowthBook, VWO) because the results are easier to interpret and communicate.
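A minimal Bayesian sketch for the same checkout data: with a uniform Beta(1, 1) prior, Beta-Binomial conjugacy gives each group a Beta posterior, and P(B better than A) can be estimated by Monte Carlo sampling (stdlib only):

```python
import random

random.seed(0)

# Posteriors after observing 100/1000 (A) and 125/1000 (B) conversions,
# starting from a uniform Beta(1, 1) prior: Beta(1 + hits, 1 + misses)
post_a = (1 + 100, 1 + 900)
post_b = (1 + 125, 1 + 875)

samples = 20_000
b_wins = sum(
    random.betavariate(*post_b) > random.betavariate(*post_a)
    for _ in range(samples)
)
print(f"P(B better than A) ≈ {b_wins / samples:.2f}")
```

The estimate lands around 0.96: a direct, interpretable probability that the new button is better, rather than a statement about hypothetical repeated data.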
Common Pitfalls Cheat Sheet
| Pitfall | Problem | Fix |
|---|---|---|
| Peeking | Checking results repeatedly inflates false positives | Use sequential testing or commit to fixed analysis |
| Multiple comparisons | Testing 20 metrics → one will be "significant" by chance | Bonferroni or Benjamini-Hochberg correction |
| Underpowered tests | Can't detect real effects → everything looks "not significant" | Calculate sample size before starting |
| p-hacking | Tweaking analysis until p < 0.05 | Pre-register hypotheses and analysis plan |
| Ignoring effect size | Statistically significant ≠ practically important | Always report confidence intervals |
| Confusing correlation & causation | Observational data can't prove causation | Use randomized experiments |
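The multiple-comparisons row can be made concrete: with 20 independent metrics each tested at α = 0.05, the chance of at least one false positive is large, and a Bonferroni correction brings it back under control (a sketch assuming independent tests):

```python
alpha, m = 0.05, 20

# Familywise error rate: P(at least one false positive across m tests)
fwer = 1 - (1 - alpha) ** m
print(f"Uncorrected: {fwer:.0%} chance of >= 1 false positive")  # 64%

# Bonferroni: test each metric at alpha / m instead
alpha_bonf = alpha / m
fwer_bonf = 1 - (1 - alpha_bonf) ** m
print(f"Bonferroni (alpha = {alpha_bonf}): {fwer_bonf:.1%}")     # 4.9%
```

Benjamini-Hochberg is less conservative: it controls the false discovery rate rather than the chance of any false positive at all.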
Decision Framework
```
Is this a randomized experiment?
          /                \
        Yes                 No
         |                   |
Use hypothesis testing   Use causal inference
(t-test, chi-square,     (diff-in-diff, IV, etc.)
 etc.)
         |
Is your sample large enough? (power ≥ 80 %)
          /                \
        Yes                 No
         |                   |
Run the test and         Increase the sample,
interpret with a CI      or accept lower power
```
Resources
- Seeing Theory — Interactive visualization of probability and statistics
- Statistics Done Wrong — Common statistical mistakes (free online book)
- Trustworthy Online Controlled Experiments — Kohavi, Tang & Xu (the A/B testing bible)
- Bayesian Methods for Hackers — Practical Bayesian stats
- Khan Academy — Hypothesis Testing — Step-by-step fundamentals