A/B Test Sample Size Calculator
Plan an A/B test before you launch. Enter your baseline conversion rate, the minimum detectable effect (MDE), significance level (alpha) and power (1 minus beta) to get the required sample size per variant, total sample size, and how long the test will take given your daily traffic.
About A/B Test Sample Size Calculator
The A/B Test Sample Size Calculator plans an A/B test before you launch it. Enter the baseline conversion rate, the minimum detectable effect (MDE) you care about, the significance level (alpha) and the statistical power you want, and the calculator returns the required per-arm and total sample size — plus an automatic test-duration estimate from your daily traffic and traffic share, a power curve showing how power grows with sample size, a sensitivity table that compares the cost of different MDE choices, a traffic-allocation visualization, and a plain-language feasibility verdict. Built specifically for conversion-rate A/B tests (two-proportion z-test, Cohen formulation), with optional Bonferroni correction for multivariate tests.
How to Use
- Enter the baseline conversion rate of the current variant (A), measured over a recent representative window.
- Set the minimum detectable effect (MDE) — the smallest lift that would actually change your decision. Toggle between relative percent and absolute percentage points.
- Pick a significance level (alpha) — 5% (95% confidence) is the industry default.
- Pick a statistical power — 80% is the industry default; raise to 90% for high-impact roll-outs.
- Choose two-tailed (B different from A in either direction, default) or one-tailed (only credit B beating A).
- If you are running a multivariate test, set the number of variants — the calculator applies a Bonferroni correction automatically.
- Enter daily visitors to the page and the traffic share routed into the experiment.
- Click Calculate Sample Size to read the per-arm and total sample size, expected test duration, power curve, sensitivity table, and step-by-step math.
Formula Used (Two-Proportion Power Formula)
p₂ = p₁ × (1 + MDE_relative) or p₂ = p₁ + MDE_absolute
p̄ = (p₁ + p₂) / 2 (pooled rate under H₀)
SD₀ = √[ 2 × p̄ × (1 − p̄) ] (standard deviation under the null)
SD₁ = √[ p₁(1 − p₁) + p₂(1 − p₂) ] (standard deviation under the alternative)
n / arm = (zα/2 × SD₀ + zβ × SD₁)² / (p₂ − p₁)²
For one-tailed tests, replace zα/2 with zα. For K variants vs one control, replace α with α / (K − 1) (Bonferroni correction).
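The formula above can be sketched as a short Python function. This is a minimal reimplementation for checking the calculator's numbers, not its actual source; the function name and parameters are illustrative:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_arm(p1, mde_rel, alpha=0.05, power=0.80,
                        two_tailed=True, variants=2):
    """Per-arm sample size for a two-proportion z-test (Cohen formulation)."""
    a = alpha / (variants - 1)              # Bonferroni: K variants vs one control
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - a / 2) if two_tailed else nd.inv_cdf(1 - a)
    z_beta = nd.inv_cdf(power)
    p2 = p1 * (1 + mde_rel)                 # target rate under the alternative
    p_bar = (p1 + p2) / 2
    sd0 = sqrt(2 * p_bar * (1 - p_bar))     # pooled SD under the null
    sd1 = sqrt(p1 * (1 - p1) + p2 * (1 - p2))  # unpooled SD under the alternative
    return ceil((z_alpha * sd0 + z_beta * sd1) ** 2 / (p2 - p1) ** 2)

# 5% baseline, 10% relative MDE, defaults: the "roughly 31,000 per variant" figure
print(sample_size_per_arm(0.05, 0.10))  # 31234
```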
What Makes This Sample-Size Calculator Different
- Live preview before you submit — every keystroke updates the per-arm sample size, total visitors, target conversion rate, and duration estimate.
- Test duration in real time — turns the abstract "you need 31,000 visitors" into the concrete "your test will run for 8 days at 4,000 visitors/day in the test."
- Animated power curve — see exactly where your target sample size lands on the power curve and how much more power an extra week of traffic would buy.
- MDE sensitivity table — compare the sample-size cost of detecting 2%, 5%, 10%, 15%, 20%, and 25% lifts side-by-side, so you can pick the smallest lift that is still feasible.
- Relative or absolute MDE — one-click toggle between the two most common ways product teams specify lift targets.
- Multivariate support with Bonferroni — handles A/B/C and A/B/C/D tests with automatic correction; many calculators silently use simple A/B math for multivariate inputs.
- Traffic-allocation visualization — a stacked bar showing exactly how the test traffic divides between control and each variant.
- Plain-language feasibility verdict — green/amber/red banner that flags slow tests before you launch.
- Quick scenarios — one-click presets for typical e-commerce, SaaS, email, and mobile-install baselines.
Reading the Feasibility Verdict
- Green — Feasible. Test completes within two weeks. You have ample traffic to detect the chosen lift at the chosen confidence.
- Amber — Doable. Test takes two to six weeks. Plan around at least one full business cycle and resist the urge to peek.
- Red — Slow. Test takes longer than six weeks (or cannot complete). Long tests are exposed to seasonality and shifting user behavior — either raise the MDE you care about or increase the traffic share routed into the experiment.
Why Sample Size Scales So Quickly
Two relationships matter most. First, required sample size scales with one over the square of the MDE — halving the lift you want to detect quadruples the required sample. Second, low-baseline tests cost more — for a fixed relative lift, sample size scales roughly with (1 − p) / p, so at a 1% baseline you need about 5 times more visitors than at a 5% baseline. Together these two effects explain why even high-traffic sites struggle to detect small lifts on low-rate flows.
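Both effects can be checked with Lehr's well-known rule of thumb, n per arm ≈ 16 · p̄(1 − p̄) / Δ² for alpha = 0.05 and 80% power (a rough approximation, used here only to show the scaling):

```python
def approx_n_per_arm(p1, mde_rel):
    """Lehr's rule of thumb: n per arm ~ 16 * p_bar * (1 - p_bar) / delta^2."""
    p2 = p1 * (1 + mde_rel)
    p_bar = (p1 + p2) / 2
    return 16 * p_bar * (1 - p_bar) / (p2 - p1) ** 2

# Halving the MDE roughly quadruples the sample (1 / MDE^2 scaling):
print(approx_n_per_arm(0.05, 0.10) / approx_n_per_arm(0.05, 0.20))  # ~3.8

# A 1% baseline costs ~5x more than a 5% baseline for the same relative lift:
print(approx_n_per_arm(0.01, 0.10) / approx_n_per_arm(0.05, 0.10))  # ~5.2
```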
Common Pitfalls in A/B Test Planning
- Setting MDE too small. Inflates the sample size to numbers you cannot collect in a reasonable time. Pick the smallest lift that would actually change your roll-out decision — not a hopeful guess.
- Power below 80%. A test with 60% power has a 40% chance of missing a real effect. The standard for product decisions is 80%; do not lower it just to make the test "fit."
- Stopping early on a low p-value. Peeking at interim results and stopping the moment p < 0.05 inflates the false-positive rate dramatically. Commit to the planned sample size before launch.
- Ignoring the multivariate cost. An A/B/C/D test needs the Bonferroni-corrected alpha (0.05 / 3), which adds roughly a third to each arm's sample; with four arms instead of two, the total traffic cost is roughly 2-3× that of a simple A/B test.
- Forgetting weekend effects. Run the test for at least 7 full days so the day-of-week traffic mix averages out; very short tests can be skewed by weekday/weekend differences.
- Underestimating allocation overhead. If you route only 50% of traffic into the test, the per-arm accrual rate halves and the calendar duration doubles.
Choosing Alpha and Power
Alpha is the false-positive rate — the probability of declaring B a winner when it truly is not. Power is one minus the false-negative rate — the probability of detecting a real winner of the MDE size. Industry defaults are alpha = 0.05 and power = 0.80. Use alpha = 0.01 and power = 0.90 for high-stakes roll-outs where a wrong call is expensive. Both choices tighten the test and inflate the required sample size: lowering alpha from 0.05 to 0.01 adds roughly 50% to the sample; raising power from 0.80 to 0.90 adds roughly 30%; doing both nearly doubles it.
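For a fixed effect size, sample size scales with (zα/2 + zβ)², so the cost of stricter settings can be read directly off that multiplier. A quick stdlib check:

```python
from statistics import NormalDist

def z_multiplier(alpha, power):
    """(z_{alpha/2} + z_beta)^2 -- sample size scales with this for a fixed effect."""
    nd = NormalDist()
    return (nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)) ** 2

base = z_multiplier(0.05, 0.80)
print(z_multiplier(0.01, 0.80) / base)  # ~1.49: alpha 0.05 -> 0.01 adds ~50%
print(z_multiplier(0.05, 0.90) / base)  # ~1.34: power 0.80 -> 0.90 adds ~30%
print(z_multiplier(0.01, 0.90) / base)  # ~1.90: both together nearly double the sample
```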
Relative vs Absolute MDE
Relative MDE (% of baseline) is the most common framing: "I want to detect a 10% lift on my current 5% conversion rate," meaning p₂ = 5.5%. Absolute MDE (percentage points) is the right framing when business impact is expressed in points: "I want to detect a +0.5 pp lift on my 5% baseline," meaning p₂ = 5.5%. The two are equivalent — pick whichever matches how your stakeholders think about the metric.
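The equivalence is just arithmetic, as a two-line check shows:

```python
p1 = 0.05                      # 5% baseline conversion rate

p2_rel = p1 * (1 + 0.10)       # relative framing: a 10% lift -> 0.055
p2_abs = p1 + 0.005            # absolute framing: +0.5 pp      -> 0.055

print(abs(p2_rel - p2_abs) < 1e-12)  # True: both framings describe the same test
```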
Multivariate Tests and Bonferroni Correction
If you compare K variants against one control, you are running K − 1 simultaneous tests. The naïve false-positive rate inflates with each extra comparison — three independent tests at alpha = 0.05 have a combined false-positive probability of roughly 14%, not 5%. The standard fix is the Bonferroni correction: divide your nominal alpha by the number of comparisons before computing the critical z-value. This calculator applies the correction automatically when you set the number of variants above 2. The result is a larger required per-arm sample size — multivariate tests cost more traffic per arm than simple A/B tests.
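The "roughly 14%" figure follows from the familywise error rate for independent comparisons, 1 − (1 − α)^m, and a quick check shows how the Bonferroni correction restores the nominal level:

```python
def familywise_alpha(alpha, m):
    """False-positive probability across m independent comparisons."""
    return 1 - (1 - alpha) ** m

print(familywise_alpha(0.05, 3))      # ~0.143: three tests at 0.05 -> ~14% overall
print(familywise_alpha(0.05 / 3, 3))  # ~0.049: Bonferroni brings it back under 5%
```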
FAQ
What sample size do I need for an A/B test?
It depends on four numbers: baseline conversion rate, minimum detectable effect (MDE), significance level (alpha), and statistical power. For a typical e-commerce test with a 5% baseline, a 10% relative lift target, alpha 0.05 and 80% power, you need roughly 31,000 visitors per variant. Lower baselines and smaller MDEs both inflate the required sample size dramatically.
What is the minimum detectable effect (MDE) and how do I pick one?
MDE is the smallest lift you want the test to reliably detect. Pick it based on business impact — the smallest improvement that would change your roll-out decision. Common starting points: 5 to 10% relative for high-traffic checkout and signup flows, 15 to 25% relative for lower-traffic features. Smaller MDE means a much larger sample size, so do not under-set it.
What significance level and power should I use?
Alpha 0.05 (95% confidence) and 80% power are the industry defaults for product and marketing tests. Use alpha 0.01 and 90% power for high-impact roll-outs. Lowering either alpha or beta requires a larger sample size — the trade-off is between false positives (alpha), false negatives (beta), and how long the test takes.
Why does my test need so many visitors per variant?
Two factors dominate. First, lower baseline conversion rates inflate the required sample size — detecting the same relative lift on a 1% baseline takes about 5× more visitors than on a 5% baseline. Second, the required sample size scales with one over the square of the MDE — halving the MDE quadruples the required sample. Increase the MDE you care about or accept a longer test.
How is the formula derived?
It is the standard two-proportion power formula based on the normal approximation. The per-arm sample size equals the square of (zα times the pooled standard deviation under the null plus zβ times the standard deviation under the alternative), divided by the squared rate difference. The calculator uses pooled variance for the null term and unpooled variance for the alternative term — the most common textbook formulation (Cohen 1988, Fleiss et al. 1980).
How do I handle multivariate tests with more than one variant?
When you compare K variants against one control, the calculator applies a Bonferroni correction by dividing alpha by (K − 1) before computing the critical z value. This protects against the inflated false-positive rate that comes from running multiple comparisons. The result is a larger required per-arm sample size — multivariate tests cost more traffic per arm than simple A/B tests.
Should I run the test for the recommended number of days or stop when it hits significance?
Run it for the recommended duration and only evaluate significance at the end. Stopping the moment a p-value drops below 0.05 (peeking) inflates the false-positive rate well above the nominal alpha. The sample size shown by this calculator is the planned target — commit to it before launch and resist the urge to call the winner early. After the test ends, plug your results into the companion A/B Test Significance Calculator to read the p-value and confidence interval.
What if my conversion rate is very low (under 1%)?
The normal approximation can be slightly inaccurate when np or n(1 − p) is small. For very low-rate tests (e.g. a 0.1% baseline), the calculator still gives a reasonable planning estimate, but consider a small extra buffer (10-15%) on top of the recommended sample size. For very small samples per arm, Fisher's exact test is a more conservative alternative for the analysis stage.
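A common rule of thumb for when the normal approximation is adequate is that both expected counts, np and n(1 − p), are comfortably large (at least 10 is a widely used bar). A minimal sketch of that check, with an illustrative function name:

```python
def normal_approx_ok(n, p, threshold=10):
    """True when both expected counts n*p and n*(1-p) meet the threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold

print(normal_approx_ok(31_000, 0.05))  # True: 1,550 expected conversions per arm
print(normal_approx_ok(500, 0.001))    # False: only 0.5 expected conversions
```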
Reference this content, page, or tool as:
"A/B Test Sample Size Calculator" at https://MiniWebtool.com// from MiniWebtool, https://MiniWebtool.com/
by miniwebtool team. Updated: 2026-05-17
You can also try our AI Math Solver GPT to solve your math problems through natural-language questions and answers.