
Hypothesis Testing: p-Values, Statistical Significance & Common Misconceptions

#statistics#data-strategy#experimentation#data-science

Hypothesis testing is how we make decisions from data while controlling the risk of being wrong. It's the foundation of A/B testing, clinical trials, and data-driven product decisions — yet it's one of the most misunderstood concepts in statistics.

The Core Idea

You have a question: "Does this new feature increase conversion?" You observe data. But randomness exists — maybe conversion went up by pure chance. Hypothesis testing gives you a framework to decide: is this result real, or noise?

Step 1 — Define Two Hypotheses

| Hypothesis | What it says | Example |
|---|---|---|
| H0 (null hypothesis) | Nothing happened. No effect. Status quo. | "The new button does NOT change conversion rate" |
| H1 (alternative hypothesis) | Something happened. There IS an effect. | "The new button DOES change conversion rate" |

H0 is the default assumption. You assume nothing is happening until the data convinces you otherwise. This is like "innocent until proven guilty" — H0 is innocence, the data is evidence.

Step 2 — Collect Data & Compute a Test Statistic

You run your experiment and compute a test statistic — a number that summarizes how far your observed result is from what H0 would predict.

Example: You test a new checkout button on 1,000 users per group.

| Group | Users | Conversions | Conversion rate |
|---|---|---|---|
| Control (old button) | 1,000 | 100 | 10.0 % |
| Treatment (new button) | 1,000 | 125 | 12.5 % |

The observed difference is +2.5 percentage points. But could this happen by random chance if the button has no real effect?
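The usual test statistic for comparing two conversion rates is a two-proportion z score. A minimal sketch for the table above, using only the standard library (the function name is illustrative):

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Z statistic for a difference in conversion rates, using the
    pooled rate that H0 (no difference) implies."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # shared rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

print(round(two_proportion_z(100, 1000, 125, 1000), 2))  # 1.77
```

A z around 1.77 looks promising, but as the next steps show, it is not automatically "significant."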

Step 3 — The p-Value

p-value = the probability of observing a result at least as extreme as yours, assuming H0 is true.

In our example: "If the new button truly has NO effect on conversion (H0 is true), what's the probability of seeing a +2.5 pp difference (or more) just by chance?"

p-value answers: "How surprising is this data IF nothing changed?"

     H0 is true (no effect)
     ↓
     What would random data look like?
     ↓
     How extreme is MY data compared to that?
     ↓
     p-value = probability of being this extreme or more
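That logic can be simulated directly. The sketch below (illustrative names, standard library only) generates many fake experiments under H0 at the pooled conversion rate and counts how often the random difference is at least as extreme as the observed one:

```python
import random

random.seed(42)  # reproducible illustration

def simulated_p_value(conv_a, n_a, conv_b, n_b, n_sims=2_000):
    """Monte Carlo two-sided p-value: how often does random data
    generated under H0 look at least as extreme as what we saw?"""
    observed = abs(conv_b / n_b - conv_a / n_a)
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # shared rate under H0
    extreme = 0
    for _ in range(n_sims):
        sim_a = sum(random.random() < p_pool for _ in range(n_a))
        sim_b = sum(random.random() < p_pool for _ in range(n_b))
        if abs(sim_b / n_b - sim_a / n_a) >= observed:
            extreme += 1
    return extreme / n_sims

print(simulated_p_value(100, 1000, 125, 1000))
```

For the checkout data this lands near the analytic two-sided p-value of about 0.077: roughly 1 in 13 no-effect experiments would show a difference this large by chance.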

What p-value IS

  • The probability of the data (or more extreme) given H0
  • A measure of how incompatible your data is with the null hypothesis
  • Written as: P(data | H0)

What p-value IS NOT

| Common misconception | Why it's wrong |
|---|---|
| "p = 0.03 means 3 % chance H0 is true" | p-value is about the DATA, not about H0. It's P(data\|H0), not P(H0\|data) |
| "p = 0.03 means 97 % chance H1 is true" | Same error in reverse. The p-value says nothing about the probability of H1 |
| "p < 0.05 means the effect is real" | It means "unlikely under H0" — not "H1 is true". False positives happen |
| "p = 0.05 is significant, p = 0.06 is not" | There's no magic threshold. 0.049 and 0.051 are practically identical |
| "Small p-value = large effect" | A tiny effect can give p < 0.001 with enough data. p-value ≠ effect size |

Think of it like a fire alarm. The p-value tells you "this alarm would ring by accident only 3 % of the time." It does NOT tell you "there's a 97 % chance there's a fire."

Step 4 — Decision Rule (α threshold)

Before looking at the data, you choose a significance level α (typically 0.05):

| If... | Decision | What it means |
|---|---|---|
| p ≤ α | Reject H0 | The data is unlikely enough under H0 to conclude something is happening |
| p > α | Fail to reject H0 | Not enough evidence to conclude an effect exists |

"Fail to reject H0" is NOT "H0 is true." It means: we don't have enough evidence to say otherwise.
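Putting the pieces together for the running example: a sketch that converts a z statistic into a two-sided p-value via the standard normal CDF (using math.erfc, so no external libraries) and applies the α rule. Function names are illustrative:

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

def decide(p_value, alpha=0.05):
    return "reject H0" if p_value <= alpha else "fail to reject H0"

p = two_sided_p_from_z(1.77)         # z from the checkout example
print(round(p, 3), "->", decide(p))  # 0.077 -> fail to reject H0
```

Despite a +2.5 pp observed lift, 1,000 users per group is not enough here: p ≈ 0.077 > 0.05, so at α = 0.05 we fail to reject H0.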

Types of Errors

|  | H0 is actually true | H0 is actually false |
|---|---|---|
| Reject H0 | Type I error (false positive) — probability = α | Correct! (true positive) — probability = power |
| Fail to reject H0 | Correct! (true negative) | Type II error (false negative) — probability = β |

| Error type | In A/B testing terms | How to control it |
|---|---|---|
| Type I (α) | "We shipped a feature that does nothing" | Lower α (e.g., 0.01 instead of 0.05) |
| Type II (β) | "We killed a feature that actually works" | Increase sample size or run longer |
| Power (1 - β) | "Ability to detect a real effect" | Target 80 % power minimum |

Statistical Power & Sample Size

Power = probability of correctly detecting a real effect.

| Factor | Effect on required sample size |
|---|---|
| Smaller effect you want to detect | ↑ Need more data |
| Higher power (lower β) | ↑ Need more data |
| Lower α (stricter significance) | ↑ Need more data |
| Higher baseline variance | ↑ Need more data |

Rule of thumb for A/B tests (two-sided α = 0.05, 80 % power):

| Minimum detectable effect | Baseline conversion | Approx. sample per group |
|---|---|---|
| +2 pp (10 % → 12 %) | 10 % | ~3,800 |
| +1 pp (10 % → 11 %) | 10 % | ~14,700 |
| +0.5 pp (10 % → 10.5 %) | 10 % | ~57,800 |
| +0.1 pp (10 % → 10.1 %) | 10 % | ~1,400,000 |

Detecting small effects requires enormous samples. This is why you should always define the minimum effect that matters before running an experiment.
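Sample sizes like those in the table come from the standard normal-approximation formula for two proportions. A sketch (function name illustrative; assumes a two-sided test):

```python
import math
from statistics import NormalDist

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate users per group needed to detect a move from
    rate p1 to rate p2 (two-sided test, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

print(sample_size_per_group(0.10, 0.12))  # 3839
```

Note the quadratic scaling in the denominator: halving the minimum detectable effect roughly quadruples the required sample.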

Confidence Intervals > p-Values

A 95 % confidence interval tells you more than a p-value:

| What it gives you | Example |
|---|---|
| Estimated effect | Conversion increased by +2.5 pp |
| Precision | 95 % CI: [+0.5 pp, +4.5 pp] |
| Significance | If CI doesn't contain 0 → significant at α = 0.05 |
| Practical significance | Is the lower bound (+0.5 pp) big enough to matter? |
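For the checkout data from Step 2, a Wald-style interval for the difference (normal approximation with unpooled variance; a common textbook choice, names illustrative):

```python
import math

def diff_ci_95(conv_a, n_a, conv_b, n_b, z=1.96):
    """95% confidence interval for (treatment - control)
    conversion rate, via the normal approximation."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

lo, hi = diff_ci_95(100, 1000, 125, 1000)
print(f"[{lo:+.3f}, {hi:+.3f}]")  # [-0.003, +0.053]
```

The interval crosses zero, which is the CI view of p ≈ 0.077 for this data: a +2.5 pp lift is plausible, but so is no effect at all.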

Bayesian Alternative

Bayesian hypothesis testing gives you what most people actually want: the probability that the treatment is better.

| Frequentist | Bayesian |
|---|---|
| P(data \| H0) | P(H1 \| data) ← what you actually care about |
| "p = 0.03" | "95 % probability B is better than A" |
| Requires fixed sample size | Can stop early, update beliefs continuously |
| Significance threshold (α) | Prior + posterior distribution |

Bayesian A/B testing is increasingly adopted (Statsig, GrowthBook, VWO) because the results are easier to interpret and communicate.
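A minimal Bayesian version of the checkout example, assuming uniform Beta(1, 1) priors (a common default; names illustrative): each group's conversion rate gets a Beta posterior, and Monte Carlo draws estimate P(B > A).

```python
import random

random.seed(0)  # reproducible illustration

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, n_draws=100_000):
    """Estimate P(rate_B > rate_A) from Beta posteriors with
    uniform Beta(1, 1) priors on each conversion rate."""
    wins = 0
    for _ in range(n_draws):
        rate_a = random.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = random.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / n_draws

print(prob_b_beats_a(100, 1000, 125, 1000))  # roughly 0.96
```

"About a 96 % chance the new button is better" is a much easier sentence to put in front of stakeholders than "p = 0.077, fail to reject H0."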

Common Pitfalls Cheat Sheet

| Pitfall | Problem | Fix |
|---|---|---|
| Peeking | Checking results repeatedly inflates false positives | Use sequential testing or commit to fixed analysis |
| Multiple comparisons | Testing 20 metrics → one will be "significant" by chance | Bonferroni or Benjamini-Hochberg correction |
| Underpowered tests | Can't detect real effects → everything looks "not significant" | Calculate sample size before starting |
| p-hacking | Tweaking analysis until p < 0.05 | Pre-register hypotheses and analysis plan |
| Ignoring effect size | Statistically significant ≠ practically important | Always report confidence intervals |
| Confusing correlation & causation | Observational data can't prove causation | Use randomized experiments |
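As a concrete sketch of the multiple-comparisons fix, here is the Benjamini-Hochberg step-up procedure (the p-values below are made-up examples):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of hypotheses rejected while controlling the false
    discovery rate at alpha (BH step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank  # largest rank passing its threshold
    return sorted(order[:cutoff])

print(benjamini_hochberg([0.001, 0.012, 0.039, 0.041, 0.042, 0.6]))  # [0, 1]
```

A plain Bonferroni correction (compare every p-value to α/m ≈ 0.0083) would reject only the first hypothesis here; BH is less conservative while still controlling the expected share of false discoveries.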

Decision Framework

         Is this a randomized experiment?
         /                              \
       Yes                               No
        |                                 |
  Use hypothesis testing          Use causal inference
  (t-test, chi-square, etc.)     (diff-in-diff, IV, etc.)
        |
  Is your sample large enough?
  (power ≥ 80%)
  /              \
Yes               No
 |                 |
Run test,      Increase sample
interpret        or accept
with CI          lower power
