
test

  • This hypothetical experiment tests two landing pages on the Sweet Treats desserts website:
  • Control: popsicle image vs. Treatment: ice cream image
  • Ice cream is now the top product, and we want the landing page to reflect that
  • The goal is to show that the treatment outperforms the control, since the team is interested in moving forward with it
  • The planned sample size is 8,000 users per group, but the test is cut short (due to the constraints below), leaving roughly 460 users per group; because one cell of the resulting table has fewer than 10 users, I use Fisher's Exact Test

constraints

  • There is interest in switching the landing page over sooner because we are running out of orange popsicles, so these results are as much information as we will get about how the treatment performed

result

  • The test established that the treatment performed significantly better than the control (at alpha = 0.05). Practical significance is low in terms of Cohen's h (h = 0.04), but the change matters for accurately representing the products. The test did not reach the desired statistical power (60% vs. 80%)

recommendation

  • In this case, given the constraints, I am comfortable enough that the treatment performs at least somewhat better than the control (with significance, and not vice versa) to recommend moving forward with implementing the treatment

tl;dr

  • Skip to "Results Summary" at the end


In [ ]:
install.packages("exact2x2")
install.packages("statmod")
In [ ]:
library(exact2x2)
library(statmod)
library(glue)


In [ ]:
alpha <- 0.05            # Significance level
power <- 0.80            # Statistical power (Probability of detecting an effect when it exists; 0.8 is standard)
control <- 0.02          # Baseline rate
effect <- 0.30           # Desired relative effect (30% lift over baseline)
mde <- control * effect  # Minimum Detectable Effect (MDE) - difference you want to detect in absolute terms
treatment <- control + mde # Treatment rate (includes effect)
print(paste('Control:',control))
print(paste('Treatment:',treatment))
[1] "Control: 0.02"
[1] "Treatment: 0.026"
In [ ]:
p_1 <- treatment
p_2 <- control
p1_label <- "Treatment"
p2_label <- "Control"

alternative <- "greater" # in reference to p1:
# "greater":   p1 is greater than p2
# "less":      p1 is less than p2
# "two.sided": p1 is different from p2

hypothesis <- switch(alternative,
  greater = sprintf("%s (%.4f) is greater than %s (%.4f)", p1_label, p_1, p2_label, p_2),
  less = sprintf("%s (%.4f) is less than %s (%.4f)", p1_label, p_1, p2_label, p_2),
  two.sided = sprintf("%s (%.4f) is different from %s (%.4f)", p1_label, p_1, p2_label, p_2)
)

cat("Hypothesis:",hypothesis)
Hypothesis: Treatment (0.0260) is greater than Control (0.0200)


In [ ]:
# Cohen's h (standardized effect size for proportions)
proportion_effectsize <- function(treatment, control) {
  2 * asin(sqrt(treatment)) - 2 * asin(sqrt(control))
}
effect_size <- proportion_effectsize(treatment, control)

cat(sprintf("Minimum Detectable Effect (MDE): %.3f\n", mde))
cat(sprintf("Effect Size (Cohen's h): %.3f\n", effect_size))
Minimum Detectable Effect (MDE): 0.006
Effect Size (Cohen's h): 0.040

Cohen's h benchmarks:

  • 0.2 = small effect
  • 0.5 = medium effect
  • 0.8 = large effect

If the effect is tiny, it will require a very large sample size to detect.
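As a quick analytic cross-check before simulating, base R's `power.prop.test` (a normal approximation) estimates the per-group sample size for these planned rates; Fisher's exact test is slightly conservative, so expect the simulated requirement to run a touch higher. The values below restate the design parameters above so the cell runs on its own.

```r
# Analytic sample-size estimate (normal approximation) for the planned rates.
# Parameters restate the test design above.
alpha     <- 0.05
power     <- 0.80
control   <- 0.02   # baseline rate
treatment <- 0.026  # baseline + MDE

ss <- power.prop.test(p1 = control, p2 = treatment,
                      sig.level = alpha, power = power,
                      alternative = "one.sided")
cat(sprintf("Approximate n per group: %.0f\n", ceiling(ss$n)))
```

This should land just under 8,000 per group, consistent with the simulated estimate below.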


Calculate minimum sample size for each group (cell) for one-sided test:

  • A one-sided test is used when you want to test if one group performs specifically better or worse than the other (a directional hypothesis).
In [ ]:
simulate_fisher_power <- function(p1, p2, n1, n2, alpha, reps = 1000, alternative, seed=100) {
  set.seed(seed)
  rejects <- replicate(reps, {
    x1 <- rbinom(1, n1, p1)
    x2 <- rbinom(1, n2, p2)

    # Creates contingency table, x1 is reference for hypothesis
    tbl <- matrix(c(x1, n1 - x1, x2, n2 - x2), nrow = 2, byrow = TRUE)

    fisher.test(tbl, alternative = alternative)$p.value < alpha
  })

  mean(rejects)
}

Try different sample sizes manually to find the smallest that reaches the target power:

In [ ]:
n_1 <- 8000   # input sample size for group 1
n_2 <- 8000   # input sample size for group 2

# Estimate Power at a Fixed Sample Size:
estimated_power <- simulate_fisher_power(p1=p_1, p2=p_2, n_1, n_2, alpha=alpha, alternative = alternative)
cat(sprintf("Power for manual estimate: %.3f\n", estimated_power))
Power for manual estimate: 0.809
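Rather than probing sample sizes by hand, the same simulation can drive a coarse grid search. This sketch restates the simulator so the cell runs standalone, then steps n upward (the 1,000-user step, 1,000 reps, and 12,000 cap are arbitrary choices here) until the target power is cleared.

```r
# Coarse search for the smallest per-group n whose simulated Fisher power
# reaches the target. Restates simulate_fisher_power() so the cell is standalone.
simulate_fisher_power <- function(p1, p2, n1, n2, alpha, reps = 1000,
                                  alternative = "greater", seed = 100) {
  set.seed(seed)
  mean(replicate(reps, {
    x1 <- rbinom(1, n1, p1)
    x2 <- rbinom(1, n2, p2)
    tbl <- matrix(c(x1, n1 - x1, x2, n2 - x2), nrow = 2, byrow = TRUE)
    fisher.test(tbl, alternative = alternative)$p.value < alpha
  }))
}

find_n <- function(p1, p2, target_power, alpha = 0.05,
                   step = 1000, n_max = 12000) {
  for (n in seq(step, n_max, by = step)) {
    if (simulate_fisher_power(p1, p2, n, n, alpha) >= target_power) return(n)
  }
  NA  # target not reached within n_max
}

n_needed <- find_n(p1 = 0.026, p2 = 0.02, target_power = 0.80)
cat(sprintf("Smallest n per group (1,000-user steps) reaching 80%% power: %s\n", n_needed))
```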


In [ ]:
control_conversions      <- 9
treatment_conversions    <- 20
control_no_conversions   <- 450
treatment_no_conversions <- 440
In [ ]:
print(p1_label) # set above in test design
print(p2_label)
[1] "Treatment"
[1] "Control"


In [ ]:
table <- matrix(c(control_conversions, control_no_conversions, treatment_conversions, treatment_no_conversions), nrow = 2, byrow = TRUE)
colnames(table) <- c("Converted", "Not_Converted")
rownames(table) <- c("Control", "Treatment")
print(table)
          Converted Not_Converted
Control           9           450
Treatment        20           440
In [ ]:
# Flip rows if p1_label is not in the first row
if (rownames(table)[1] != p1_label) {
  table_to_use <- table[c(2, 1), ]
} else {
  table_to_use <- table
}

print(table_to_use)
          Converted Not_Converted
Treatment        20           440
Control           9           450


In [ ]:
n1 <- sum(table_to_use[1, ])           # Reference group (row 1)
n2 <- sum(table_to_use[2, ])           # (row 2)

p1 <- table_to_use[1, "Converted"] / n1   # Reference group (row 1) conversion rate
p2 <- table_to_use[2, "Converted"] / n2   # (row 2) conversion rate

groups <- c(p1 = p1_label, p2 = p2_label)

print(glue("p1: ","{groups['p1']} Conversion Rate: {round(p1 * 100, 2)}%"))
print(glue("p2: ","{groups['p2']} Conversion Rate: {round(p2 * 100, 2)}%"))
p1: Treatment Conversion Rate: 4.35%
p2: Control Conversion Rate: 1.96%
In [ ]:
result_hypothesis <- switch(alternative,
  greater = sprintf("%s (%.4f) is greater than %s (%.4f)", p1_label, p1, p2_label, p2),
  less = sprintf("%s (%.4f) is less than %s (%.4f)", p1_label, p1, p2_label, p2),
  two.sided = sprintf("%s (%.4f) is different from %s (%.4f)", p1_label, p1, p2_label, p2)
)

cat("Result Hypothesis:",result_hypothesis)
Result Hypothesis: Treatment (0.0435) is greater than Control (0.0196)


In [ ]:
# Absolute Difference
abs_diff <- abs(p1 - p2)

# Cohen's h function
proportion_effectsize <- function(control, treatment) {
  2 * asin(sqrt(treatment)) - 2 * asin(sqrt(control))
}

# Note: h is computed at the planned design rates (0.020 vs. 0.026), not the
# observed rates, so it reflects the effect size the test was designed to detect
h <- proportion_effectsize(control, treatment)

cat(sprintf("Absolute difference: %.3f (%.1f%%)\n", abs_diff, abs_diff * 100))
cat(sprintf("Cohen's h: %.3f\n", h))

# Interpret effect size
interpret_h <- function(h) {
  if (abs(h) < 0.2) return("negligible")
  if (abs(h) < 0.5) return("small")
  if (abs(h) < 0.8) return("medium")
  return("large")
}
cat(sprintf("Effect size interpretation: %s\n", interpret_h(h)))
Absolute difference: 0.024 (2.4%)
Cohen's h: 0.040
Effect size interpretation: negligible
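The h above is computed from the planned design rates (0.020 vs. 0.026). As a sanity check, this sketch computes Cohen's h at the observed rates instead; it comes out larger, though still under the 0.2 "small" benchmark:

```r
# Cohen's h at the observed conversion rates
cohens_h <- function(pa, pb) 2 * asin(sqrt(pa)) - 2 * asin(sqrt(pb))

p_treat   <- 20 / 460  # observed treatment rate (~0.0435)
p_control <- 9 / 459   # observed control rate (~0.0196)

h_obs <- cohens_h(p_treat, p_control)
cat(sprintf("Observed Cohen's h: %.3f\n", h_obs))  # ~0.14, still below 0.2
```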


Use Fisher's Exact Test since a cell (control converted) has < 10 users:

In [ ]:
print(table_to_use)
          Converted Not_Converted
Treatment        20           440
Control           9           450

Run test:

In [ ]:
result <-  exact2x2(table_to_use, alternative = alternative, conf.level = 1 - alpha, tsmethod="central")
print(result)
	One-sided Fisher's Exact Test

data:  table_to_use
p-value = 0.02908
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 1.098979      Inf
sample estimates:
odds ratio 
  2.270797 


In [ ]:
p_value <- result$p.value

print(paste("p-value: ",round(p_value,3)))
[1] "p-value:  0.029"

Because the p-value (0.029) is less than alpha (0.050), this result is statistically significant at the 95% confidence level.
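The one-sided p-value can also be cross-checked against base R's `fisher.test` (one-sided exact p-values do not depend on the two-sided method choice, so it should agree with `exact2x2` here):

```r
# Cross-check: base R fisher.test on the same 2x2 table
tbl <- matrix(c(20, 440, 9, 450), nrow = 2, byrow = TRUE,
              dimnames = list(c("Treatment", "Control"),
                              c("Converted", "Not_Converted")))
base_p <- fisher.test(tbl, alternative = "greater")$p.value
cat(sprintf("Base R fisher.test p-value: %.5f\n", base_p))  # ~0.029
```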


In [ ]:
lower_ci <- result$conf.int[1]
print(lower_ci)
[1] 1.098979

95% CI Lower Bound for Odds Ratio: 1.099.

With 95% confidence, the treatment group has higher odds of conversion than the control group. The odds of conversion in the treatment group are at least 9.9% higher than in the control group. This supports the hypothesis that treatment is better than control.

Because the interval does not include 1, this result is statistically significant at the 95% confidence level.
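Because conversion is rare here, the odds ratio tracks the relative risk fairly closely. A quick sketch of both from the observed counts (note the test above reports the conditional MLE odds ratio, which differs slightly from the simple sample odds ratio computed here):

```r
# Sample risk ratio and odds ratio from the observed counts
p_treat   <- 20 / 460
p_control <- 9 / 459

risk_ratio <- p_treat / p_control
odds_ratio <- (p_treat / (1 - p_treat)) / (p_control / (1 - p_control))

cat(sprintf("Risk ratio: %.2f\n", risk_ratio))  # ~2.22
cat(sprintf("Odds ratio: %.2f\n", odds_ratio))  # ~2.27
```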


In [ ]:
set.seed(100)
result_power <- power.fisher.test(n1 = n1, n2 = n2, p1 = p1, p2 = p2,
                           alpha = alpha,
                           alternative = alternative,
                           nsim = 10000)
print(paste("Result Power:",round(result_power*100,1),"%"))
[1] "Result Power: 60.6 %"


Our test was underpowered (~60% power vs. the 80% target), meaning there was a meaningful chance of failing to detect a true difference given the limited sample size. As a result, while the effect appears real, we cannot be fully confident in its size without further data, and we cannot give a confident estimate of the incremental revenue from the test.
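For context on the shortfall, a rough normal-approximation estimate (base R's `power.prop.test`) of how many users per group would have been needed for 80% power at the observed rates; Fisher's exact test is conservative, so the true requirement would be somewhat higher:

```r
# Approximate n per group for 80% power at the observed rates
req <- power.prop.test(p1 = 9 / 459, p2 = 20 / 460,
                       sig.level = 0.05, power = 0.80,
                       alternative = "one.sided")
cat(sprintf("Approximate n per group: %.0f\n", ceiling(req$n)))
```

With roughly 460 users per group actually collected, this helps explain why the achieved power came in around 60%.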

Results Summary

In [ ]:
cat("\n")
print(table_to_use)
cat("\n")
print(glue("p1: ","{groups['p1']} Conversion Rate: {round(p1 * 100, 2)}%"))
print(glue("p2: ","{groups['p2']} Conversion Rate: {round(p2 * 100, 2)}%"))
cat("\n")
cat("Result Hypothesis:",result_hypothesis)
cat("\n")
cat("\n")
cat(sprintf("Absolute difference: %.3f (%.1f%%)\n", abs_diff, abs_diff * 100))
cat(sprintf("Cohen's h: %.3f\n", h))
cat(sprintf("Effect size interpretation: %s\n", interpret_h(h)))
cat("\n")
print(result)
print(paste("Result Power:",round(result_power*100,1),"%"))
          Converted Not_Converted
Treatment        20           440
Control           9           450

p1: Treatment Conversion Rate: 4.35%
p2: Control Conversion Rate: 1.96%

Result Hypothesis: Treatment (0.0435) is greater than Control (0.0196)

Absolute difference: 0.024 (2.4%)
Cohen's h: 0.040
Effect size interpretation: negligible


	One-sided Fisher's Exact Test

data:  table_to_use
p-value = 0.02908
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 1.098979      Inf
sample estimates:
odds ratio 
  2.270797 

[1] "Result Power: 60.6 %"

95% CI Lower Bound for Odds Ratio: 1.099.

performance:
With 95% confidence, the treatment group has higher odds of conversion than the control group. The odds of conversion in the treatment group are at least 9.9% higher than in the control group, based on the 95% CI lower bound for the odds ratio (1.099). This supports the hypothesis that the treatment is better than the control.

significance:
Because the interval does not include 1 and the p-value (0.029) is less than alpha (0.050), this result is statistically significant at the 95% confidence level. The practical significance is low (Cohen's h = 0.04), but the business impact is high based on domain knowledge.

power:
Our test was underpowered (~60% power vs. the 80% target), meaning there was a meaningful chance of failing to detect a true difference given the limited sample size. As a result, while the effect appears real, we cannot be fully confident in its size without further data, and we cannot give a confident estimate of the incremental revenue from the test.

recommendation

  • Given the constraint of needing to switch prematurely to a landing page that features current products, I am comfortable enough that the treatment performs at least somewhat better than the control (with significance, and not vice versa) to recommend moving forward with implementing the treatment