
test
- This hypothetical experiment tests two landing pages on the Sweet Treats dessert website:
- Control: popsicle image vs. Treatment: ice cream image
- Ice cream is the new top product, and we want to align the landing page accordingly
- I am testing whether the treatment performs better than the control (a one-sided hypothesis), because the team is interested in moving forward with the treatment
- The planned sample size is 8,000 users per group, but the test gets cut short (due to the constraints below) and the sample shrinks to 919 users. Because a contingency-table cell has fewer than 10 users, I use Fisher's Exact Test, which is suited to small samples
constraints
- There is interest in switching the landing page over sooner because we are running out of orange popsicles, so these results are as much information as we will get about how the treatment performed
result
- The test established that the treatment performed significantly better than the control (at alpha = 0.05). Practical significance is low in terms of Cohen's h (h = 0.04), but the change is important for accurate representation of the products. The test did not reach the full desired statistical power (60% vs. 80%)
recommendation
- Given the constraints, I am comfortable that the treatment performed at least somewhat better than the control with statistical significance (and not vice versa), so I recommend moving forward with implementing the treatment
tl;dr
- Skip to "Results Summary" at the end


install.packages("exact2x2")
install.packages("statmod")
library(exact2x2)
library(statmod)
library(glue)

alpha <- 0.05 # Significance level
power <- 0.80 # Statistical power (Probability of detecting an effect when it exists; 0.8 is standard)
control <- 0.02 # Baseline rate
effect <- 0.30 # Desired relative effect (30% lift over baseline)
mde <- control * effect # Minimum Detectable Effect (MDE) - difference you want to detect in absolute terms
treatment <- control + mde # Treatment rate (includes effect)
print(paste('Control:',control))
print(paste('Treatment:',treatment))
[1] "Control: 0.02"
[1] "Treatment: 0.026"
p_1 <- treatment
p_2 <- control
p1_label <- "Treatment"
p2_label <- "Control"
alternative <- "greater" # in reference to p1:
# "greater": p1 is greater than p2
# "less": p1 is less than p2
# "two.sided": p1 is different from p2
hypothesis <- switch(alternative,
  greater = sprintf("%s (%.4f) is greater than %s (%.4f)", p1_label, p_1, p2_label, p_2),
  less = sprintf("%s (%.4f) is less than %s (%.4f)", p1_label, p_1, p2_label, p_2),
  two.sided = sprintf("%s (%.4f) is different from %s (%.4f)", p1_label, p_1, p2_label, p_2)
)
cat("Hypothesis:",hypothesis)
Hypothesis: Treatment (0.0260) is greater than Control (0.0200)

# Cohen's h (standardized effect size for proportions)
proportion_effectsize <- function(treatment, control) {
  2 * asin(sqrt(treatment)) - 2 * asin(sqrt(control))
}
effect_size <- proportion_effectsize(treatment, control)
cat(sprintf("Minimum Detectable Effect (MDE): %.3f\n", mde))
cat(sprintf("Effect Size (Cohen's h): %.3f\n", effect_size))
Minimum Detectable Effect (MDE): 0.006
Effect Size (Cohen's h): 0.040
Cohen's h benchmarks:
- 0.2 = small effect
- 0.5 = medium effect
- 0.8 = large effect
If the effect is tiny, it will require a very large sample size to detect.
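As a sanity check (not in the original script), the normal-approximation sample-size formula for Cohen's h (the same one the pwr package's pwr.2p.test uses) shows how the required n scales with 1/h²:

```r
# Per-group n for a one-sided two-proportion test via Cohen's h
# (arcsine approximation): n = 2 * ((z_alpha + z_beta) / h)^2
required_n <- function(h, alpha = 0.05, power = 0.80) {
  ceiling(2 * ((qnorm(1 - alpha) + qnorm(power)) / h)^2)
}
required_n(0.04) # ~7,700 per group for our tiny h, consistent with the simulation below
required_n(0.20) # a "small" benchmark effect needs only a few hundred per group
```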

Calculate the minimum sample size for each group for a one-sided test:
- A one-sided test is used when you want to test if one group performs specifically better or worse than the other (a directional hypothesis).
simulate_fisher_power <- function(p1, p2, n1, n2, alpha, reps = 1000, alternative, seed = 100) {
  set.seed(seed)
  rejects <- replicate(reps, {
    x1 <- rbinom(1, n1, p1)
    x2 <- rbinom(1, n2, p2)
    # Create the contingency table; x1 (row 1) is the reference for the hypothesis
    tbl <- matrix(c(x1, n1 - x1, x2, n2 - x2), nrow = 2, byrow = TRUE)
    fisher.test(tbl, alternative = alternative)$p.value < alpha
  })
  mean(rejects)
}
Try different sample sizes manually to find the smallest that reaches the desired power:
n_1 <- 8000 # input sample size for group 1
n_2 <- 8000 # input sample size for group 2
# Estimate Power at a Fixed Sample Size:
estimated_power <- simulate_fisher_power(p1 = p_1, p2 = p_2, n1 = n_1, n2 = n_2, alpha = alpha, alternative = alternative)
cat(sprintf("Power for manual estimate: %.3f\n", estimated_power))
Power for manual estimate: 0.809

control_conversions <- 9
treatment_conversions <- 20
control_no_conversions <- 450
treatment_no_conversions <- 440
print(p1_label) # set above in test design
print(p2_label)
[1] "Treatment"
[1] "Control"

table <- matrix(c(control_conversions, control_no_conversions, treatment_conversions, treatment_no_conversions), nrow = 2, byrow = TRUE)
colnames(table) <- c("Converted", "Not_Converted")
rownames(table) <- c("Control", "Treatment")
print(table)
          Converted Not_Converted
Control           9           450
Treatment        20           440
# Flip rows if p1_label is not in the first row, so row 1 is always the reference group
if (rownames(table)[1] != p1_label) {
  table_to_use <- table[c(2, 1), ]
} else {
  table_to_use <- table
}
print(table_to_use)
          Converted Not_Converted
Treatment        20           440
Control           9           450

n1 <- sum(table_to_use[1, ]) # Reference group (row 1)
n2 <- sum(table_to_use[2, ]) # (row 2)
p1 <- table_to_use[1, "Converted"] / n1 # Reference group (row 1) conversion rate
p2 <- table_to_use[2, "Converted"] / n2 # (row 2) conversion rate
groups <- c(p1 = p1_label, p2 = p2_label)
print(glue("p1: ","{groups['p1']} Conversion Rate: {round(p1 * 100, 2)}%"))
print(glue("p2: ","{groups['p2']} Conversion Rate: {round(p2 * 100, 2)}%"))
p1: Treatment Conversion Rate: 4.35%
p2: Control Conversion Rate: 1.96%
result_hypothesis <- switch(alternative,
  greater = sprintf("%s (%.4f) is greater than %s (%.4f)", p1_label, p1, p2_label, p2),
  less = sprintf("%s (%.4f) is less than %s (%.4f)", p1_label, p1, p2_label, p2),
  two.sided = sprintf("%s (%.4f) is different from %s (%.4f)", p1_label, p1, p2_label, p2)
)
cat("Result Hypothesis:",result_hypothesis)
Result Hypothesis: Treatment (0.0435) is greater than Control (0.0196)
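For context (not in the original script), the observed relative lift follows directly from the p1 and p2 defined above:

```r
# Relative lift of treatment over control, from the observed conversion rates
relative_lift <- (p1 - p2) / p2
cat(sprintf("Relative lift: %.1f%%\n", relative_lift * 100)) # roughly a 122% relative lift
```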

# Absolute Difference
abs_diff <- abs(p1 - p2)
# Cohen's h, reusing the function defined in the test design above
# (note: computed from the planned control/treatment rates, i.e. the design-stage effect size)
h <- proportion_effectsize(treatment, control)
cat(sprintf("Absolute difference: %.3f (%.1f%%)\n", abs_diff, abs_diff * 100))
cat(sprintf("Cohen's h: %.3f\n", h))
# Interpret effect size
interpret_h <- function(h) {
  if (abs(h) < 0.2) return("negligible")
  if (abs(h) < 0.5) return("small")
  if (abs(h) < 0.8) return("medium")
  return("large")
}
cat(sprintf("Effect size interpretation: %s\n", interpret_h(h)))
Absolute difference: 0.024 (2.4%)
Cohen's h: 0.040
Effect size interpretation: negligible

Use Fisher's Exact Test, since a cell (control conversions) has fewer than 10 users:
print(table_to_use)
          Converted Not_Converted
Treatment        20           440
Control           9           450
Run test:
result <- exact2x2(table_to_use, alternative = alternative, conf.level = 1 - alpha, tsmethod="central")
print(result)
One-sided Fisher's Exact Test
data: table_to_use
p-value = 0.02908
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
1.098979 Inf
sample estimates:
odds ratio
2.270797

p_value <- result$p.value
print(paste("p-value: ",round(p_value,3)))
[1] "p-value: 0.029"
Because the p-value (0.029) is less than alpha (0.050), this result is statistically significant at the 95% confidence level.

lower_ci <- result$conf.int[1]
print(lower_ci)
[1] 1.098979
95% CI Lower Bound for Odds Ratio: 1.099.
With 95% confidence, the treatment group has higher odds of conversion than the control group. The odds of conversion in the treatment group are at least 9.9% higher than in the control group. This supports the hypothesis that treatment is better than control.
Because the interval does not include 1, this result is statistically significant at the 95% confidence level.
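For reference (not in the original script), the simple sample odds ratio can be computed directly from the counts; exact2x2, like fisher.test, reports the conditional maximum-likelihood estimate, which differs slightly:

```r
# Unconditional (sample) odds ratio from the raw counts
sample_or <- (treatment_conversions / treatment_no_conversions) /
  (control_conversions / control_no_conversions)
round(sample_or, 3) # about 2.273, vs. the conditional MLE of 2.271 from exact2x2
```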

set.seed(100)
result_power <- power.fisher.test(n1 = n1, n2 = n2, p1 = p1, p2 = p2,
alpha = alpha,
alternative = alternative,
nsim = 10000)
print(paste("Result Power:",round(result_power*100,1),"%"))
[1] "Result Power: 60.6 %"
Our test was underpowered (~60% power vs. 80% desired), meaning there was a higher chance we failed to detect a true difference due to limited sample size. As a result, while the effect appears meaningful, we cannot be statistically confident in it without further data and cannot give a confident estimate of incremental revenue from the test.

Results Summary
cat("\n")
print(table_to_use)
cat("\n")
print(glue("p1: ","{groups['p1']} Conversion Rate: {round(p1 * 100, 2)}%"))
print(glue("p2: ","{groups['p2']} Conversion Rate: {round(p2 * 100, 2)}%"))
cat("\n")
cat("Result Hypothesis:",result_hypothesis)
cat("\n")
cat("\n")
cat(sprintf("Absolute difference: %.3f (%.1f%%)\n", abs_diff, abs_diff * 100))
cat(sprintf("Cohen's h: %.3f\n", h))
cat(sprintf("Effect size interpretation: %s\n", interpret_h(h)))
cat("\n")
print(result)
print(paste("Result Power:",round(result_power*100,1),"%"))
Converted Not_Converted
Treatment 20 440
Control 9 450
p1: Treatment Conversion Rate: 4.35%
p2: Control Conversion Rate: 1.96%
Result Hypothesis: Treatment (0.0435) is greater than Control (0.0196)
Absolute difference: 0.024 (2.4%)
Cohen's h: 0.040
Effect size interpretation: negligible
One-sided Fisher's Exact Test
data: table_to_use
p-value = 0.02908
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
1.098979 Inf
sample estimates:
odds ratio
2.270797
[1] "Result Power: 60.6 %"
95% CI Lower Bound for Odds Ratio: 1.099.
performance:
With 95% confidence, the treatment group has higher odds of conversion than the control group. The odds of conversion in the treatment group are at least 9.9% higher than in the control group based on the 95% CI lower bound odds ratio of 1.099. This supports the hypothesis that treatment is better than control.
significance:
Because the interval does not include 1 and the p-value (0.029) is less than alpha (0.050), this result is statistically significant at the 95% confidence level. The practical significance is low (Cohen's h = 0.04), but the business impact is high based on domain knowledge.
power:
Our test was somewhat underpowered (~60% power vs. 80% desired), meaning there was a higher chance we failed to detect a true difference due to limited sample size. As a result, while the effect appears meaningful, we cannot be statistically confident in it without further data and cannot give a confident estimate of incremental revenue from the test.

- Given the constraint of needing to switch prematurely to a landing page that features current products, I am comfortable that the treatment performed at least somewhat better than the control with statistical significance (and not vice versa), so I recommend moving forward with implementing the treatment