A/B testing is the gold standard for data-driven product decisions, yet most organizations either underinvest in rigor or over-rely on p-values without understanding their limitations. This post covers the statistical foundations, tooling landscape, and the maturity journey from ad-hoc experiments to a culture of continuous experimentation.
Statistical Concepts Cheat Sheet
| Concept | Definition | Practical Implication |
|---|---|---|
| p-value | Probability of data at least this extreme if H0 is true | Not the probability H1 is true; commonly misinterpreted |
| Significance level (alpha) | Threshold for rejecting H0 (usually 0.05) | 5% false-positive rate per test when there is no real effect |
| Statistical power | Probability of detecting a real effect (usually 0.80) | 80% chance of catching a true winner |
| Minimum Detectable Effect (MDE) | Smallest effect size you want to detect | Smaller MDE = larger sample = longer runtime |
| Confidence interval | Range likely containing the true effect | More useful than p-value for decision-making |
| Multiple comparisons | Testing many variants inflates false positives | Apply Bonferroni or Benjamini-Hochberg corrections |
| Sequential testing | Analyzing results repeatedly as data accrues | Naive peeking inflates false positives; use always-valid p-values or group-sequential designs |
| CUPED | Variance reduction using pre-experiment data | Can cut required sample size by 20-50% |
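The CUPED adjustment in the last row fits in a few lines: subtract from each user's metric the portion predicted by a pre-experiment covariate. A minimal sketch on simulated data (the metric names and correlation strength are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated users: pre-experiment metric x correlates with the
# in-experiment metric y (both hypothetical).
n = 10_000
x = rng.normal(100, 20, n)          # pre-period activity per user
y = 0.8 * x + rng.normal(0, 10, n)  # in-period metric, correlated with x

# CUPED: y' = y - theta * (x - mean(x)), theta = cov(x, y) / var(x).
theta = np.cov(x, y)[0, 1] / np.var(x)
y_cuped = y - theta * (x - x.mean())

reduction = 1 - np.var(y_cuped) / np.var(y)
print(f"variance reduction: {reduction:.0%}")
```

With a covariate this strongly correlated the reduction exceeds the 20-50% typical in practice; the gain scales with the squared correlation between x and y, so it depends entirely on how predictive the pre-period data is.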
Sample Size Reference Table
Assumptions: baseline conversion rate 5%, power 80%, significance 5% (two-sided), two variants splitting traffic evenly
| Minimum Detectable Effect | Sample Size per Variant | At 10K daily users |
|---|---|---|
| 1% relative (5.0% → 5.05%) | ~3,000,000 | ~1.6 years |
| 5% relative (5.0% → 5.25%) | ~122,000 | ~25 days |
| 10% relative (5.0% → 5.5%) | ~31,000 | ~6 days |
| 20% relative (5.0% → 6.0%) | ~8,200 | ~2 days |
| 50% relative (5.0% → 7.5%) | ~1,500 | <1 day |
Key insight: detecting small effects requires enormous sample sizes. This is why most teams should focus on testing bold changes, not micro-optimizations.
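These figures fall out of the standard two-proportion sample-size formula. A small pure-standard-library sketch (the function name is mine):

```python
from math import sqrt
from statistics import NormalDist

def sample_size_per_variant(p_base, rel_mde, alpha=0.05, power=0.80):
    """Sample size per variant for a two-sided two-proportion z-test
    with equal allocation."""
    p_alt = p_base * (1 + rel_mde)
    p_bar = (p_base + p_alt) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_base * (1 - p_base)
                                 + p_alt * (1 - p_alt))) ** 2
    return numerator / (p_alt - p_base) ** 2

# ~31,000 per variant for a 10% relative MDE at a 5% baseline
print(round(sample_size_per_variant(0.05, 0.10)))
```

Because the required sample grows with the inverse square of the effect size, halving the MDE roughly quadruples the runtime.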
Experimentation Platform Comparison
| Capability | Statsig | LaunchDarkly | Eppo | Optimizely | GrowthBook |
|---|---|---|---|---|---|
| Primary Focus | Full experimentation | Feature flags + experiments | Warehouse-native experiments | Digital optimization | OSS experimentation |
| Stats Engine | Bayesian + Frequentist | Frequentist | Warehouse-computed | Frequentist (Stats Accelerator) | Bayesian + Frequentist |
| Warehouse Integration | Good | Limited | Excellent (native) | Limited | Good |
| Feature Flags | Yes | Best-in-class | Basic | Yes | Yes |
| CUPED Support | Yes | No | Yes | Yes | Yes |
| Pricing | Usage-based | Per-seat, premium | Usage-based | Enterprise | Free (OSS) + Cloud |
| Best For | Product teams at scale | Engineering-led orgs | Data team-led orgs | Marketing optimization | Budget-conscious teams |
Experimentation Maturity Model
| Level | Stage | Characteristics | Metrics |
|---|---|---|---|
| 0 | Ad-hoc | No formal testing; decisions by HiPPO | 0 tests/quarter |
| 1 | Reactive | Occasional tests on major features | 1-5 tests/quarter |
| 2 | Systematic | Dedicated platform, basic process | 10-25 tests/quarter |
| 3 | Cultural | Most features tested, power analysis standard | 50+ tests/quarter |
| 4 | Advanced | Sequential testing, CUPED, interaction detection | 100+ concurrent tests |
| 5 | Optimized | Automated experimentation, bandit algorithms, causal ML | Tests inform strategy |
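The bandit algorithms at level 5 need not be exotic: Beta-Bernoulli Thompson sampling draws a conversion rate from each variant's posterior and serves the winner of the draw. A sketch with hypothetical variant names and counts:

```python
import random

random.seed(7)

def thompson_pick(stats):
    """Sample a conversion rate from each arm's Beta posterior
    (uniform prior) and return the arm with the highest draw."""
    draws = {arm: random.betavariate(wins + 1, losses + 1)
             for arm, (wins, losses) in stats.items()}
    return max(draws, key=draws.get)

# observed conversions / non-conversions per variant so far
stats = {"A": (50, 950), "B": (65, 935)}

picks = [thompson_pick(stats) for _ in range(1_000)]
print(f"B served {picks.count('B') / 10:.0f}% of the time")
```

Traffic shifts toward the stronger variant automatically while still exploring the weaker one, which is why bandits suit continuous optimization better than fixed-horizon tests.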
Common Pitfalls Taxonomy
| Category | Pitfall | Consequence | Mitigation |
|---|---|---|---|
| Design | No power analysis | Inconclusive tests, wasted time | Always compute sample size upfront |
| Design | Testing too many variants | Diluted traffic, inflated false positives | Limit to 2-3 variants max |
| Execution | Peeking at results | Inflated significance, premature stops | Use sequential testing methods |
| Execution | Sample ratio mismatch (unequal assignment) | Simpson's paradox, biased results | Validate randomization quality |
| Analysis | Ignoring novelty effects | Short-term lift that fades | Run tests for full business cycles |
| Analysis | Cherry-picking metrics | Confirmation bias | Pre-register primary metrics |
| Org | No decision framework | Tests complete but nobody acts | Define decision criteria upfront |
| Org | Testing everything | Resource exhaustion, slow shipping | Prioritize by expected impact |
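The randomization-quality check in the execution row is worth automating. A sample-ratio-mismatch (SRM) test is just a two-sided z-test that the observed split matches the assignment ratio (function name and counts below are illustrative):

```python
from math import erf, sqrt

def srm_check(n_control: int, n_treatment: int, expected: float = 0.5) -> float:
    """Two-sided p-value that the observed split matches the expected
    assignment ratio (normal approximation to the binomial)."""
    n = n_control + n_treatment
    z = (n_control / n - expected) / sqrt(expected * (1 - expected) / n)
    # two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# A 50.4% / 49.6% split looks harmless, but over a million users it is
# a near-certain sign of broken randomization:
print(f"{srm_check(504_000, 496_000):.1e}")
```

A common practice is to invalidate any experiment where this p-value drops below 0.001 and debug the assignment pipeline rather than trust the results.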
Building an Experimentation Culture
The biggest barrier to effective A/B testing is not statistical knowledge — it is organizational willingness to accept that intuition can be wrong. The most impactful investment is not a better platform but executive sponsorship that rewards learning from failed experiments as much as confirming winning ones.
Resources