A/B Test Significance Calculator
Determine whether the difference between A/B test variants is statistically significant. Enter visitors and conversions for control and variant groups to get p-value, confidence level, uplift, and a clear win/lose verdict.
About A/B Test Significance Calculator
The A/B Test Significance Calculator applies a two-proportion z-test to your experiment data and reports whether the observed difference between the control variant (A) and the challenger variant (B) is statistically significant. Enter visitors and conversions for both groups and the tool returns the p-value, the confidence interval for the rate difference, the absolute and relative lift, the statistical power for the observed effect, the per-arm sample size you would need to confirm the lift at 80% power, and a plain-language win / lose / inconclusive verdict — backed by an animated visualisation of where your z-score lands on the standard normal distribution.
How to Use
- Enter the number of visitors and conversions for the control variant (A).
- Enter the same two numbers for the variant being tested (B), measured over the same time window.
- Pick a confidence level — 95% is standard, 99% is stricter, 90% is for early exploration.
- Choose two-tailed (B different from A in either direction) or one-tailed (only credit B if it beats A).
- Click Calculate Significance to read the verdict, p-value, confidence intervals, power, and the step-by-step math.
Formula Used (Two-Proportion Z-Test)
p₁ = c₁ / n₁ and p₂ = c₂ / n₂ (observed conversion rates)
p̂ = (c₁ + c₂) / (n₁ + n₂) (pooled rate under H₀)
SE = √[ p̂ × (1 − p̂) × (1/n₁ + 1/n₂) ]
z = (p₂ − p₁) / SE
p-value (two-tailed) = 2 × (1 − Φ(|z|))
CI for (p₂ − p₁) at level (1 − α) = (p₂ − p₁) ± zα/2 × √[ p₁(1−p₁)/n₁ + p₂(1−p₂)/n₂ ]
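The formulas above translate directly into a few lines of Python. This is a minimal sketch of the same computation (function names are illustrative, not the tool's internals), using only the standard library:

```python
from math import sqrt, erf

def phi(x):
    """Standard normal CDF Φ(x)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def two_proportion_z_test(c1, n1, c2, n2):
    """Return (z, two-tailed p-value) for control c1/n1 vs variant c2/n2."""
    p1, p2 = c1 / n1, c2 / n2
    pooled = (c1 + c2) / (n1 + n2)                 # p-hat under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = 2.0 * (1.0 - phi(abs(z)))
    return z, p_value
```

For example, 200 conversions from 10,000 control visitors against 250 from 10,000 variant visitors yields z ≈ 2.38 and p ≈ 0.017, a significant result at the 95% level.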
What Makes This A/B Test Calculator Different
- Live preview before you submit — type any of the four counts and watch the rates, lift, z-score, p-value, and verdict update in real time.
- Visual significance check — an animated standard normal curve shows exactly where your z-statistic falls relative to the rejection regions.
- Confidence interval forest plot — see the 95% intervals for both rates side-by-side. Non-overlapping bars are the visual signal of a winner.
- Plain-language verdict — green/amber/red banner instead of a bare p-value. Saying "Variant B wins" beats saying "p = 0.028" to most stakeholders.
- Statistical power readout — flags when the test is underpowered and recommends the per-arm sample size needed to reach 80% power.
- Bayesian-style "P(B > A)" — a complementary view to the frequentist p-value that many product teams find more intuitive.
- Quick example presets — load a clear-win, tight-call, no-signal, or loss scenario in one click and explore how the numbers move.
Reading the Verdict
- Green — Significant win. p-value ≤ α and variant rate > control rate. The lift is unlikely to be due to chance; you can roll out B.
- Red — Significant loss. p-value ≤ α but variant rate < control rate. B is genuinely worse; keep A and investigate.
- Amber — Close to threshold. p-value is near α. Collect more traffic before deciding.
- Grey — No signal yet. The data is consistent with no real difference. Either keep running or stop and try a bigger change.
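The four-colour verdict can be sketched as a simple decision rule. The amber margin around α below is an illustrative choice for "close to threshold", not the tool's documented cutoff:

```python
def verdict(p_value, lift, alpha=0.05, amber_band=0.25):
    """Map a p-value and observed lift (p2 - p1) to a banner colour.

    amber_band widens alpha by 25% to define "close to threshold";
    this margin is an assumption for illustration.
    """
    if p_value <= alpha:
        return "green" if lift > 0 else "red"   # significant win / loss
    if p_value <= alpha * (1 + amber_band):
        return "amber"                          # near the threshold
    return "grey"                               # no signal yet
```

So `verdict(0.028, 0.005)` returns "green", while `verdict(0.055, 0.005)` lands in the amber zone.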
Why You Should Not Stop Early on a Significant P-Value
Repeatedly checking a test and stopping the moment p-value < 0.05 (often called "peeking") inflates the false-positive rate dramatically — sometimes to 30% or higher for a nominal 5% test. Decide the sample size in advance with a power calculation, run the experiment to that target, and only then evaluate significance. The required per-arm sample size shown by this calculator is a good target when planning future tests.
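A quick A/A simulation makes the inflation concrete: both arms share the same true rate, yet stopping at the first "significant" peek flags far more than 5% of experiments. The experiment count, traffic level, and peek interval below are arbitrary illustration values:

```python
import random
from math import sqrt

def peeking_fpr(n_experiments=500, n_per_arm=2000, peek_every=200,
                base_rate=0.05, seed=1):
    """Share of A/A experiments (identical arms, no true difference)
    declared significant at *any* interim look."""
    rng = random.Random(seed)
    z_crit = 1.96  # two-tailed critical value for a nominal 5% test
    false_positives = 0
    for _ in range(n_experiments):
        c1 = c2 = 0
        for n in range(1, n_per_arm + 1):
            c1 += rng.random() < base_rate
            c2 += rng.random() < base_rate
            if n % peek_every == 0:            # a "peek" every 200 visitors
                pooled = (c1 + c2) / (2 * n)
                se = sqrt(pooled * (1 - pooled) * 2 / n)
                if se > 0 and abs(c2 - c1) / n / se > z_crit:
                    false_positives += 1       # stopped early on a fluke
                    break
    return false_positives / n_experiments
```

With ten looks per experiment, the simulated false-positive rate comes out well above the nominal 5%, in line with the inflation described above.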
Sample Size Planning
If your test is underpowered, the calculator recommends a per-arm sample size using the standard two-proportion power formula:
n / arm ≈ (zα/2 · √[2p̄(1−p̄)] + zβ · √[p₁(1−p₁) + p₂(1−p₂)])² / (p₂ − p₁)²
where p̄ is the average of p₁ and p₂ and zβ is the standard normal quantile for the target power (0.84 for 80%).
Plug your historical baseline rate and the smallest lift you would care about into the formula — that is the sample size to target before launching a new test.
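The planning formula above is straightforward to evaluate. A minimal sketch, defaulting to 95% confidence and 80% power (z-values 1.96 and 0.84 as stated above):

```python
from math import sqrt, ceil

def sample_size_per_arm(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Per-arm n to detect a lift from p1 to p2 (two-sided, 95%/80%)."""
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Detecting a lift from a 2.0% baseline to 2.5%:
n = sample_size_per_arm(0.02, 0.025)   # about 13,800 visitors per arm
```

Note how sharply the requirement grows as the minimum detectable lift shrinks: halving the lift roughly quadruples the required sample size, because the effect appears squared in the denominator.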
Common Pitfalls in A/B Testing
- Peeking — checking results daily and stopping at the first significant p-value inflates false positives. Use sequential testing or wait for the planned sample size.
- Tiny samples — at fewer than a few hundred conversions per arm the normal approximation breaks down. Consider Fisher's exact test instead.
- Multiple comparisons — running ten tests and reporting only the winner inflates the false-positive rate. Apply a Bonferroni correction or run pre-registered confirmatory tests.
- Novelty effects — variant B may look great in the first week purely because users notice the change. Let the test run long enough for the effect to stabilise.
- Survivorship bias — filtering visitors after randomisation breaks the test. Always compute the test on the full randomised population.
- Misaligned measurement window — collect data for both arms over identical time windows. Weekend and weekday traffic mix shifts the baseline rate.
One-Tailed vs Two-Tailed Tests
A two-tailed test asks whether B differs from A in any direction. It is the right default when you genuinely could roll out either variant. A one-tailed test only credits a result in the pre-specified direction (typically: B beats A) and roughly halves the p-value when the data points that way — but you must commit to the direction before looking at the data. Switching to one-tailed after seeing the result is a common form of p-hacking.
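The "roughly halves" relationship is exact at the level of the tail areas. A small sketch showing both p-values derived from the same z-statistic:

```python
from math import sqrt, erf

def p_values(z):
    """One-tailed (credits only B > A) and two-tailed p-values for a z-stat."""
    phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))
    one_tailed = 1.0 - phi(z)            # upper tail only
    two_tailed = 2.0 * (1.0 - phi(abs(z)))
    return one_tailed, two_tailed
```

At z = 1.96 the two-tailed p-value is almost exactly 0.05 while the one-tailed value is 0.025, which is why a pre-registered directional test reaches significance sooner when the data point the predicted way.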
Reading the Confidence Interval
The 95% confidence interval for the difference in rates tells you the plausible range of true lifts. If the interval is entirely above zero, B is a winner; entirely below zero, B is a loser; crossing zero, the data is consistent with no real difference. The width of the interval is a measure of how precise your estimate is — narrower means more data.
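The interval for the rate difference uses the unpooled standard error from the formula section. A minimal sketch:

```python
from math import sqrt

def diff_ci(c1, n1, c2, n2, z=1.96):
    """Unpooled 95% confidence interval for the rate difference p2 - p1."""
    p1, p2 = c1 / n1, c2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p2 - p1
    return d - z * se, d + z * se
```

For 200/10,000 vs 250/10,000 the interval is roughly (0.09%, 0.91%): entirely above zero, so the data support a winner, but the wide range shows the true lift is still loosely pinned down.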
FAQ
What does the A/B test significance calculator do?
It applies a two-proportion z-test to your control and variant conversion data and tells you whether the observed difference in conversion rates is unlikely to be explained by random chance. It reports the p-value, a confidence interval for the difference, the statistical power for the observed effect, the lift, and a plain-language verdict.
What confidence level should I use for an A/B test?
95% confidence (α = 0.05) is the industry standard for product and marketing tests. Use 99% for high-impact rollouts where a false positive is costly, and 90% only for early exploration where you accept a higher false-positive risk.
Should I run a one-tailed or two-tailed test?
Use two-tailed when you care whether B differs from A in either direction. Use one-tailed when you have a directional hypothesis decided in advance, such as B is expected to beat A, and you are willing to ignore any opposite-direction signal. Most product teams should default to two-tailed.
How is the p-value calculated?
The pooled rate p̂ is computed from the combined conversions and visitors. The standard error is √[p̂(1−p̂)(1/n₁ + 1/n₂)]. The z-statistic is the rate difference divided by that standard error. The two-tailed p-value is 2 × (1 − Φ(|z|)) where Φ is the standard normal cumulative distribution function.
What is statistical power and why does it matter?
Power is the probability that the test detects a real effect of the observed size given the current sample size. Power below 80% means the test is likely too small to confirm the lift even if it is real. The calculator reports power and the per-arm sample size you would need to reach 80%.
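One common approximation for observed power treats the measured effect as the true effect: power ≈ Φ(|p₂ − p₁|/SE − zα/2), with the unpooled standard error. A sketch under that assumption:

```python
from math import sqrt, erf

def observed_power(c1, n1, c2, n2, z_alpha=1.96):
    """Approximate power to detect the observed effect at the current n,
    using the normal approximation (assumption: observed effect = true effect)."""
    p1, p2 = c1 / n1, c2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_effect = abs(p2 - p1) / se
    phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))
    return phi(z_effect - z_alpha)
```

For the 200/10,000 vs 250/10,000 example this gives roughly 66% power: significant, but below the 80% bar, so a confirmatory test would want a larger sample.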
Can I stop the test as soon as the p-value drops below 0.05?
No. Peeking and stopping early inflates the false-positive rate well above the nominal α. Decide the sample size in advance using a power calculation, run the test to completion, and only then evaluate significance. The required sample size shown by this calculator is a good target.
What if my conversion rate is very low (e.g. under 1%)?
The normal approximation can be inaccurate when np or n(1−p) is small. As a rule of thumb, you want at least 30 conversions in each arm, ideally 100+. For very low-rate tests, consider Fisher's exact test as a more conservative alternative.
What does P(B > A) mean?
It is the posterior probability, computed under a non-informative (uniform) prior on each rate, that variant B's true conversion rate exceeds variant A's. It is a Bayesian companion to the frequentist p-value and is often easier to communicate to non-statisticians ("85% confident B is better" beats "p = 0.03").
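With uniform Beta(1, 1) priors, each arm's posterior is a Beta distribution, and P(B > A) can be estimated by Monte Carlo. A minimal sketch of that calculation (the exact method behind the tool's figure is not documented here):

```python
import random

def prob_b_beats_a(c1, n1, c2, n2, draws=100_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1,1) priors.

    Posterior for each arm: Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        a = rng.betavariate(1 + c1, 1 + n1 - c1)
        b = rng.betavariate(1 + c2, 1 + n2 - c2)
        wins += b > a
    return wins / draws
```

For the 200/10,000 vs 250/10,000 example this lands around 99%, agreeing in spirit with the frequentist p ≈ 0.017.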
Reference this content, page, or tool as:
"A/B Test Significance Calculator" at https://MiniWebtool.com// from MiniWebtool, https://MiniWebtool.com/
by miniwebtool team. Updated: 2026-05-17
You can also try our AI Math Solver GPT to solve your math problems through natural-language question and answer.