
Test

  • This hypothetical experiment tests two landing page CTAs for a class booking platform:
  • Control: "Schedule" vs. Treatment: "Book Now"
  • The sample size is ~14K users per group
  • I will use a two-proportion z-test, which is well suited to large sample sizes
  • I am testing whether the treatment performs better than the control (a directional hypothesis), because the team is interested in moving forward with the treatment

Result

  • The test established that the treatment performed significantly better than the control (at alpha = 0.05): its conversion rate is higher by at least 0.23 percentage points. The practical significance is very small by Cohen's h standards (h = 0.03), BUT the cost of implementing the treatment is low, the volume is high, and there will be meaningful business impact for the studio customers (significantly improved booking rate, less confusion, increased revenue). The result had the full desired statistical power (>80%).

Recommendation

  • Due to the significance, power, and business impact, I recommend moving forward with implementing the treatment.

TL;DR for results

  • Skip to "Results Summary" at the end


In [ ]:
install.packages('pwr')
install.packages('glue')
In [ ]:
library(pwr)   # Power analysis for proportions (pwr.2p.test, ES.h)
library(glue)  # String interpolation

Test Design¶

In [ ]:
alpha <- 0.05            # Significance level
power <- 0.80            # Statistical power (Probability of detecting an effect when it exists; 0.8 is standard)
control <- 0.04          # Baseline rate
effect <- 0.15           # Desired relative effect (a 15% lift over baseline)
mde <- control * effect  # Minimum Detectable Effect (MDE): the difference you want to detect, in absolute terms
treatment <- control + mde  # Treatment rate (baseline + MDE)
print(paste('Control:',control))
print(paste('Treatment:',treatment))
[1] "Control: 0.04"
[1] "Treatment: 0.046"
In [ ]:
p_1 = treatment
p_2 = control
p1_label = "Treatment"
p2_label = "Control"

alternative = "greater"
# In reference to p1:
  # "greater":   p1 is greater than p2
  # "less":      p1 is less than p2
  # "two.sided": p1 is different from p2

hypothesis <- switch(alternative,
  greater = sprintf("%s (%.4f) is greater than %s (%.4f)", p1_label, p_1, p2_label, p_2),
  less = sprintf("%s (%.4f) is less than %s (%.4f)", p1_label, p_1, p2_label, p_2),
  two.sided = sprintf("%s (%.4f) is different from %s (%.4f)", p1_label, p_1, p2_label, p_2)
)

cat("Hypothesis:",hypothesis)
Hypothesis: Treatment (0.0460) is greater than Control (0.0400)


In [ ]:
# Cohen's h (standardized effect size for proportions)

effect_size = ES.h(treatment, control)

cat(sprintf("Minimum Detectable Effect (MDE): %.3f\n", mde))
cat(sprintf("Effect Size (Cohen's h): %.3f\n", effect_size))
Minimum Detectable Effect (MDE): 0.006
Effect Size (Cohen's h): 0.030

Cohen's h benchmarks:

  • 0.2 = small effect
  • 0.5 = medium effect
  • 0.8 = large effect

If the effect is tiny, it will require a very large sample size to detect.
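
As a quick illustration (an aside reusing the alpha and power values set above), here's how the required sample size per group grows as Cohen's h shrinks:

In [ ]:
# Illustrative aside: required n per group balloons as Cohen's h shrinks
# (reuses alpha, power, and the one-sided alternative from above)
for (h_val in c(0.2, 0.1, 0.05, 0.03)) {
  n_req <- pwr.2p.test(h = h_val, sig.level = alpha, power = power,
                       alternative = "greater")$n
  cat(sprintf("h = %.2f -> n per group = %.0f\n", h_val, ceiling(n_req)))
}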


Calculate the minimum sample size for each group (cell) for a one-sided test:

  • A one-sided test is used when you want to test if one group performs specifically better or worse than the other (a directional hypothesis).
In [ ]:
# Determine the minimum number of samples for each group

# pwr.2p.test requires the standardized effect size (Cohen's h) as input
result1 <- pwr.2p.test(h = effect_size, sig.level = alpha, power = power, alternative = alternative)
In [ ]:
# Using the effect size as input
cat("Using the effect size as input:\n")
cat(paste("(alternative)", alternative, ": n =", round(result1$n)), "\n")
Using the effect size as input:
(alternative) greater : n = 14118 
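
For comparison (a hypothetical aside, not part of this test's design), the same calculation with a two-sided alternative shows the sample-size cost of a non-directional hypothesis:

In [ ]:
# Hypothetical aside: a two-sided test needs more samples per group
# for the same effect size, alpha, and power
result2 <- pwr.2p.test(h = effect_size, sig.level = alpha, power = power,
                       alternative = "two.sided")
cat(paste("(alternative) two.sided : n =", round(result2$n)), "\n")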


In [ ]:
n_observations_control <- 14500
n_observations_treatment <- 14550

conversions_control <- 582
conversions_treatment <- 675

n1 <- n_observations_treatment
n2 <- n_observations_control
In [ ]:
print(p1_label) # Set above in test design
print(p2_label)
[1] "Treatment"
[1] "Control"


In [ ]:
conv_rate_control = (conversions_control / n_observations_control)
conv_rate_treatment = (conversions_treatment / n_observations_treatment)

p1=conv_rate_treatment # Assign p1 vs. p2, test alternative references p1
p2=conv_rate_control

c1=conversions_treatment
c2=conversions_control

n1=n_observations_treatment
n2=n_observations_control
In [ ]:
print(glue("Control Conversion Rate: {round(conv_rate_control * 100, 2)}%"))
print(glue("Treatment Conversion Rate: {round(conv_rate_treatment * 100, 2)}%"))
Control Conversion Rate: 4.01%
Treatment Conversion Rate: 4.64%
In [ ]:
result_hypothesis <- switch(alternative,
  greater = sprintf("%s (%.4f) is greater than %s (%.4f)", p1_label, p1, p2_label, p2),
  less = sprintf("%s (%.4f) is less than %s (%.4f)", p1_label, p1, p2_label, p2),
  two.sided = sprintf("%s (%.4f) is different from %s (%.4f)", p1_label, p1, p2_label, p2)
)

cat("Result Hypothesis:",result_hypothesis)
Result Hypothesis: Treatment (0.0464) is greater than Control (0.0401)


In [ ]:
# Uplift
uplift = (p1 - p2) / p2

# Absolute Difference
abs_diff = abs(p1 - p2)

# Cohen's h (same arcsine-difference formula as pwr::ES.h)
proportion_effectsize <- function(p1, p2) {
  2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))
}

h <- proportion_effectsize(p1, p2)

# Interpret effect size
interpret_h <- function(h) {
  if (abs(h) < 0.2) return("negligible")
  if (abs(h) < 0.5) return("small")
  if (abs(h) < 0.8) return("medium")
  return("large")
}
cat(sprintf("Absolute difference: %.3f (%.1f%%)\n", abs_diff, abs_diff * 100))
print(glue("Uplift: {round(uplift * 100, 2)}%"))
cat(sprintf("Cohen's h: %.3f\n", h))
cat(sprintf("Effect size interpretation: %s\n", interpret_h(h)))
Absolute difference: 0.006 (0.6%)
Uplift: 15.58%
Cohen's h: 0.031
Effect size interpretation: negligible


Run test:¶

prop.test is a common way to run a two-proportion z-test in R. By default, it performs a chi-squared test with Yates continuity correction, but when you set correct = FALSE, it becomes mathematically equivalent to the two-proportion z-test:

In [ ]:
# Vectorize successes and totals for statistical test
x <- c(c1, c2)  # successes
n <- c(n1, n2)  # totals

# Run two-proportion test
# Continuity correction not needed at this sample size
test_result <- prop.test(x = x, n = n, alternative = alternative, correct = FALSE)

print(test_result)
	2-sample test for equality of proportions without continuity correction

data:  x out of n
X-squared = 6.8612, df = 1, p-value = 0.004404
alternative hypothesis: greater
95 percent confidence interval:
 0.002327633 1.000000000
sample estimates:
    prop 1     prop 2 
0.04639175 0.04013793 

Confirming with a manual version:¶

In [ ]:
p_pool <- (c1 + c2) / (n1 + n2)

se_pool <- sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))

z_stat <- (p1 - p2) / se_pool

# One-tailed test (p1 > p2): the p-value is the upper-tail probability
  # 1 - (cumulative probability up to the test statistic z under the standard normal: P(Z <= z_stat))
  # If the z statistic is large and positive, the p-value will be small
p_value_one_tailed <- 1 - pnorm(z_stat)

cat("Manual One-tailed Z test:\n")
cat("Z =", z_stat, "\n")
cat("P-value =", p_value_one_tailed, "\n")
Manual One-tailed Z test:
Z = 2.619381 
P-value = 0.004404473 
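
As one more cross-check (an aside): without continuity correction, the X-squared statistic reported by prop.test is the square of this z statistic.

In [ ]:
# Cross-check: sqrt of prop.test's X-squared should equal the manual z
# (sqrt(6.8612) ≈ 2.6194)
sqrt(unname(test_result$statistic))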


In [ ]:
# Extract p-value from result
p_value <- test_result$p.value

print(sprintf("p-value: %.4f", p_value))
[1] "p-value: 0.0044"

Because the p-value (0.0044) is less than alpha (0.050), this result is statistically significant at the 95% confidence level.


Confidence interval for difference in proportions (unpooled):¶

In [ ]:
# Difference in proportions
diff <- p1 - p2

# Critical z value for a one-sided 95% confidence interval
z <- qnorm(0.95)  # 95% quantile for one-sided CI

# Standard error of difference (unpooled)
se_diff <- sqrt((p1 * (1 - p1) / n1) + (p2 * (1 - p2) / n2))

# Margin of error
moe <- z * se_diff

# One-tailed confidence interval (lower bound only, since testing p1 > p2)
lower <- diff - moe
upper <- Inf  # Upper bound unbounded (infinity) in one-tailed CI for p1 > p2

cat("One-tailed 95% Confidence Interval (unpooled):", lower, "to", upper, "\n")
One-tailed 95% Confidence Interval (unpooled): 0.002327633 to Inf 

We are 95% confident that the true difference in conversion rates (p₁ - p₂) is at least 0.23 percentage points. Because the interval does not include 0, this result is statistically significant at the 95% confidence level.

Confidence interval for each conversion rate:¶

In [ ]:
# Calculate confidence interval for conversion rate:
se_p1 <- sqrt(p1 * (1 - p1) / n1)
lower_ci_p1 <- p1 - 1.96 * se_p1
upper_ci_p1 <- p1 + 1.96 * se_p1

cat(sprintf("%s 95%% CI: %.4f to %.4f\n",p1_label, lower_ci_p1, upper_ci_p1))

se_p2 <- sqrt(p2 * (1 - p2) / n2)
lower_ci_p2 <- p2 - 1.96 * se_p2
upper_ci_p2 <- p2 + 1.96 * se_p2

cat(sprintf("%s 95%% CI: %.4f to %.4f\n",p2_label, lower_ci_p2, upper_ci_p2))
Treatment 95% CI: 0.0430 to 0.0498
Control 95% CI: 0.0369 to 0.0433

If you repeated your experiment or data collection many times under the same conditions, then 95% of those calculated confidence intervals would contain the true population conversion rate.
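
To illustrate that statement, here's a small simulation sketch (an aside; the 4% true rate, per-sample size, and seed are hypothetical): draw many samples, build a Wald 95% CI from each, and count how often the interval covers the truth.

In [ ]:
# Illustrative simulation: empirical coverage of the Wald 95% CI
set.seed(42)    # hypothetical seed for reproducibility
true_p <- 0.04  # hypothetical true conversion rate
n_obs <- 14500  # per-sample size (mirrors the control group)
covered <- replicate(10000, {
  p_hat <- rbinom(1, n_obs, true_p) / n_obs
  se <- sqrt(p_hat * (1 - p_hat) / n_obs)
  (true_p >= p_hat - 1.96 * se) && (true_p <= p_hat + 1.96 * se)
})
cat(sprintf("Empirical coverage: %.1f%% (should be close to 95%%)\n",
            mean(covered) * 100))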


In [ ]:
# Effective sample size (harmonic mean for unequal n)
n_effective <- (2 * n1 * n2) / (n1 + n2)

# Calculate power
power_result <- pwr.2p.test(h = h, n = n_effective, sig.level = alpha, alternative = alternative)
print(power_result)
     Difference of proportion power calculation for binomial distribution (arcsine transformation) 

              h = 0.03075801
              n = 14524.96
      sig.level = 0.05
          power = 0.8355543
    alternative = greater

NOTE: same sample sizes

In [ ]:
# Extract the power
power_pct <- round(power_result$power * 100, 1)

cat("Result Power:", power_pct, "%\n\n")
Result Power: 83.6 %

Our test was adequately powered (~83% actual vs. 80% desired), meaning we had a strong chance of detecting a true difference if one existed.

Results Summary¶

In [ ]:
print(glue("Control Conversion Rate: {round(conv_rate_control * 100, 2)}%"))
print(glue("Treatment Conversion Rate: {round(conv_rate_treatment * 100, 2)}%"))
cat("\n")
print(paste("Result Hypothesis:",result_hypothesis))
cat("\n")
cat(sprintf("Absolute difference: %.3f (%.1f%%)\n", abs_diff, abs_diff * 100))
print(glue("Uplift: {round(uplift * 100, 2)}%"))
cat(sprintf("Cohen's h: %.3f\n", h))
cat(sprintf("Effect size interpretation: %s\n", interpret_h(h)))
print(test_result)
print(sprintf("p-value: %.4f", p_value))
cat("\n")
cat(sprintf("%s 95%% CI: %.4f to %.4f\n",p1_label, lower_ci_p1, upper_ci_p1))
cat(sprintf("%s 95%% CI: %.4f to %.4f\n",p2_label, lower_ci_p2, upper_ci_p2))
cat("\n")
cat("One-tailed 95% Confidence Interval for Diff (unpooled):", lower, "to", upper, "\n")
cat("\n")
cat("Result Power:", power_pct, "%\n\n")
Control Conversion Rate: 4.01%
Treatment Conversion Rate: 4.64%

[1] "Result Hypothesis: Treatment (0.0464) is greater than Control (0.0401)"

Absolute difference: 0.006 (0.6%)
Uplift: 15.58%
Cohen's h: 0.031
Effect size interpretation: negligible

	2-sample test for equality of proportions without continuity correction

data:  x out of n
X-squared = 6.8612, df = 1, p-value = 0.004404
alternative hypothesis: greater
95 percent confidence interval:
 0.002327633 1.000000000
sample estimates:
    prop 1     prop 2 
0.04639175 0.04013793 

[1] "p-value: 0.0044"

Treatment 95% CI: 0.0430 to 0.0498
Control 95% CI: 0.0369 to 0.0433

One-tailed 95% Confidence Interval for Diff (unpooled): 0.002327633 to Inf 

Result Power: 83.6 %

Performance: With 95% confidence, the treatment has a higher conversion rate than the control by at least 0.23 percentage points (based on the CI lower bound of 0.0023). This supports the hypothesis that the treatment is better than the control.

Significance: Because the p-value (0.0044) is less than alpha (0.050), and the 95% confidence interval for the difference does not contain 0, this result is statistically significant at the 95% confidence level. The practical significance is low (Cohen's h = 0.03), but the business impact is meaningful: the treatment should significantly improve the booking rate, reduce user confusion, and generate additional revenue for studio customers.

Power: Our test was adequately powered (~83% actual vs. 80% desired), meaning we had a strong chance of detecting a true difference if one existed.

Recommendation¶

Due to the significance, power, and meaningful business impact, I recommend moving forward with implementing the treatment.