sampleSize_parallel_2A3E.Rmd
In the SimTOST R package, which is specifically designed for sample size estimation for bioequivalence studies, hypothesis testing is based on the Two One-Sided Tests (TOST) procedure (Sozu et al. 2015). In TOST, the equivalence test is framed as a comparison between the null hypothesis of ‘new product is worse by a clinically relevant quantity’ and the alternative hypothesis of ‘difference between products is too small to be clinically relevant’. This vignette focuses on a parallel design with 2 arms/treatments and 3 primary endpoints.
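As a rough illustration of the TOST logic (not SimTOST's internal implementation), the sketch below applies two one-sided t-tests to simulated log-scale data using base R only. The group size, means, and standard deviation are hypothetical values chosen for the example; the margins log(0.80) and log(1.25) and the 5% level mirror those used later in this vignette.

```r
set.seed(42)

# Hypothetical log-scale observations for a test and a reference product
log_test <- rnorm(40, mean = log(100), sd = 0.25)
log_ref  <- rnorm(40, mean = log(100), sd = 0.25)

alpha <- 0.05
lower <- log(0.80)  # lower equivalence margin on the log scale
upper <- log(1.25)  # upper equivalence margin on the log scale

# One-sided test 1: H0 "difference <= lower" vs H1 "difference > lower"
p_lower <- t.test(log_test, log_ref, mu = lower,
                  alternative = "greater", var.equal = TRUE)$p.value

# One-sided test 2: H0 "difference >= upper" vs H1 "difference < upper"
p_upper <- t.test(log_test, log_ref, mu = upper,
                  alternative = "less", var.equal = TRUE)$p.value

# Equivalence is concluded only if both one-sided nulls are rejected
equivalent <- max(p_lower, p_upper) < alpha

# Equivalent formulation: the (1 - 2 * alpha) confidence interval
# for the difference lies entirely within the margins
ci <- t.test(log_test, log_ref, conf.level = 1 - 2 * alpha,
             var.equal = TRUE)$conf.int
ci_inside <- ci[1] > lower && ci[2] < upper
```

Rejecting both one-sided null hypotheses at level alpha corresponds to the 90% confidence interval falling inside the margins, so `equivalent` and `ci_inside` always agree.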
In many studies, the aim is to evaluate equivalence across multiple primary endpoints. The European Medicines Agency (EMA) recommends demonstrating bioequivalence for both Area Under the Curve (AUC) and maximum concentration (Cmax) when assessing pharmacokinetic properties. This vignette presents advanced techniques for calculating sample size in parallel trial designs involving two treatment arms and three endpoints.
As an illustrative example, we consider published data from the phase-1 trial NCT01922336. This trial measured the pharmacokinetics (PK) of SB2 compared to its EU-sourced reference product (EU_Remicade). The following PK measures (mean ± standard deviation) were reported after a single dose of SB2 or its EU reference product Remicade (Shin et al. 2015):
PK measure | SB2 | Remicade (EU) |
---|---|---|
AUCinf (μg*h/mL) | 38,703 ± 11,114 | 39,360 ± 12,332 |
AUClast (μg*h/mL) | 36,862 ± 9,133 | 37,022 ± 9,398 |
Cmax (μg/mL) | 127.0 ± 16.9 | 126.2 ± 17.9 |
The following sections describe strategies for determining the sample size required for a parallel-group trial aimed at establishing equivalence across three co-primary endpoints. The Ratio of Means (ROM) approach will be used to assess equivalence.
A critical step in this process is defining the lower and upper equivalence boundaries for each endpoint. These boundaries set the acceptable ROM range within which equivalence is established. For simplicity, a consistent equivalence range of 0.8 to 1.25 is applied to all endpoints.
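To make the equivalence range concrete, the following base R sketch computes the ratio of the published arithmetic means for each endpoint and checks it against the 0.8 to 1.25 range. This is only a plausibility check on point estimates, not an equivalence test; the variable names are chosen for illustration.

```r
# Published arithmetic means (Shin et al. 2015), as listed in the table above
mu_test <- c(AUCinf = 38703, AUClast = 36862, Cmax = 127.0)  # SB2
mu_ref  <- c(AUCinf = 39360, AUClast = 37022, Cmax = 126.2)  # EU Remicade

lower <- 0.80
upper <- 1.25

# Ratio of means (ROM) per endpoint
rom <- mu_test / mu_ref
round(rom, 3)
#> AUCinf AUClast    Cmax
#>  0.983   0.996   1.006

# Are the point estimates inside the equivalence range?
within_margins <- rom > lower & rom < upper

# The margins are symmetric on the log scale: log(0.80) = -log(1.25)
symmetric_on_log_scale <- isTRUE(all.equal(log(lower), -log(upper)))
```

All three ratios are close to 1, which is why moderate sample sizes can suffice despite the sizeable standard deviations.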
A conservative approach to sample size calculation involves testing each pharmacokinetic (PK) measure independently. This approach assumes that endpoints are uncorrelated and that equivalence is to be demonstrated for each endpoint separately. Consequently, the overall sample size required for the trial is the sum of the sample sizes calculated for each PK measure separately.
library(SimTOST)
# Sample size calculation for AUCinf
(sim_AUCinf <- sampleSize(
power = 0.9, # Target power
alpha = 0.05, # Significance level
arm_names = c("SB2", "EU_Remicade"), # Names of trial arms
list_comparator = list("EMA" = c("SB2","EU_Remicade")), # Comparator configuration
mu_list = list("SB2" = 38703, "EU_Remicade" = 39360), # Mean values
sigma_list = list("SB2" = 11114, "EU_Remicade" = 12332), # Standard deviation values
list_lequi.tol = list("EMA" = 0.80), # Lower equivalence margin
list_uequi.tol = list("EMA" = 1.25), # Upper equivalence margin
nsim = 1000 # Number of stochastic simulations
))
#> Sample Size Calculation Results
#> -------------------------------------------------------------
#> Study Design: parallel trial targeting 90% power with a 5% type-I error.
#>
#> Comparisons:
#> SB2 vs. EU_Remicade
#> - Endpoints Tested: y1
#> -------------------------------------------------------------
#> Parameter Value
#> Total Sample Size 78
#> Achieved Power 90.5
#> Power Confidence Interval 88.5 - 92.2
#> -------------------------------------------------------------
# Sample size calculation for AUClast
(sim_AUClast <- sampleSize(
power = 0.9, # Target power
alpha = 0.05, # Significance level
arm_names = c("SB2", "EU_Remicade"), # Names of trial arms
list_comparator = list("EMA" = c("SB2", "EU_Remicade")), # Comparator configuration
mu_list = list("SB2" = 36862, "EU_Remicade" = 37022), # Mean values
sigma_list = list("SB2" = 9133, "EU_Remicade" = 9398), # Standard deviation values
list_lequi.tol = list("EMA" = 0.80), # Lower equivalence margin
list_uequi.tol = list("EMA" = 1.25), # Upper equivalence margin
nsim = 1000 # Number of stochastic simulations
))
#> Sample Size Calculation Results
#> -------------------------------------------------------------
#> Study Design: parallel trial targeting 90% power with a 5% type-I error.
#>
#> Comparisons:
#> SB2 vs. EU_Remicade
#> - Endpoints Tested: y1
#> -------------------------------------------------------------
#> Parameter Value
#> Total Sample Size 54
#> Achieved Power 90.4
#> Power Confidence Interval 88.4 - 92.1
#> -------------------------------------------------------------
# Sample size calculation for Cmax
(sim_Cmax <- sampleSize(
power = 0.9, # Target power
alpha = 0.05, # Significance level
arm_names = c("SB2", "EU_Remicade"), # Names of trial arms
list_comparator = list("EMA" = c("SB2", "EU_Remicade")), # Comparator configuration
mu_list = list("SB2" = 127.0, "EU_Remicade" = 126.2), # Mean values
sigma_list = list("SB2" = 16.9, "EU_Remicade" = 17.9), # Standard deviation values
list_lequi.tol = list("EMA" = 0.80), # Lower equivalence margin
list_uequi.tol = list("EMA" = 1.25), # Upper equivalence margin
nsim = 1000 # Number of stochastic simulations
))
#> Sample Size Calculation Results
#> -------------------------------------------------------------
#> Study Design: parallel trial targeting 90% power with a 5% type-I error.
#>
#> Comparisons:
#> SB2 vs. EU_Remicade
#> - Endpoints Tested: y1
#> -------------------------------------------------------------
#> Parameter Value
#> Total Sample Size 18
#> Achieved Power 91.2
#> Power Confidence Interval 89.2 - 92.8
#> -------------------------------------------------------------
When each PK measure is tested independently, the total sample size is 78 for AUCinf, 54 for AUClast, and 18 for Cmax. This means that we would have to enroll 78 + 54 + 18 = 150 patients in order to reject the null hypothesis. Note that the significance level of this combined test is then . For context, the original trial was a randomized, single-blind, three-arm, parallel-group study conducted in 159 healthy subjects, slightly more than the 150 patients estimated to be necessary.
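The arithmetic behind this total can be reproduced in a few lines of base R. The numbers are copied from the outputs above; the variable names are illustrative only.

```r
# Total sample sizes reported by the three separate sampleSize() runs above
n_by_endpoint <- c(AUCinf = 78, AUClast = 54, Cmax = 18)

# Sum across endpoints, as described in the text
n_total <- sum(n_by_endpoint)

# Size of the original trial, for comparison
n_original_trial <- 159
n_difference <- n_original_trial - n_total

c(total = n_total, difference = n_difference)
#>      total difference
#>        150          9
```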
This approach focuses on simultaneous testing of PK measures while assuming independence between endpoints. Unlike the previous approach, which tested each PK measure independently, this approach integrates comparisons across multiple endpoints while directly controlling the overall Type I error rate at a pre-specified level.
The arithmetic means and standard deviations for each endpoint and treatment arm are defined as follows:
mu_list <- list(
SB2 = c(AUCinf = 38703, AUClast = 36862, Cmax = 127.0),
EUREF = c(AUCinf = 39360, AUClast = 37022, Cmax = 126.2)
)
sigma_list <- list(
SB2 = c(AUCinf = 11114, AUClast = 9133, Cmax = 16.9),
EUREF = c(AUCinf = 12332, AUClast = 9398, Cmax = 17.9)
)
Subsequently, we define the equivalence boundaries:
list_comparator <- list("EMA" = c("SB2", "EUREF"))
list_lequi.tol <- list("EMA" = c(AUCinf = 0.8, AUClast = 0.8, Cmax = 0.8))
list_uequi.tol <- list("EMA" = c(AUCinf = 1.25, AUClast = 1.25, Cmax = 1.25))
Sample size calculation can then be implemented as follows:
(N_ss <- sampleSize(power = 0.9, # target power
alpha = 0.05,
mu_list = mu_list,
sigma_list = sigma_list,
list_comparator = list_comparator,
list_lequi.tol = list_lequi.tol,
list_uequi.tol = list_uequi.tol,
dtype = "parallel",
ctype = "ROM",
vareq = TRUE,
lognorm = TRUE,
nsim = 1000,
seed = 1234))
#> Sample Size Calculation Results
#> -------------------------------------------------------------
#> Study Design: parallel trial targeting 90% power with a 5% type-I error.
#>
#> Comparisons:
#> SB2 vs. EUREF
#> - Endpoints Tested: AUCinf, AUClast, Cmax
#> (multiple co-primary endpoints, m = 3 )
#> -------------------------------------------------------------
#> Parameter Value
#> Total Sample Size 84
#> Achieved Power 90
#> Power Confidence Interval 87.9 - 91.8
#> -------------------------------------------------------------
We can inspect the sample size requirements in more detail as follows:
N_ss$response
#> n_iter n_drop n_SB2 n_EUREF n_total power power_LCI power_UCI
#> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 42 0 42 42 84 0.9 0.8793091 0.9175476
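The call above sets lognorm = TRUE, so the endpoints are treated as log-normally distributed while the inputs are arithmetic means and standard deviations. A standard moment-matching conversion between the two scales is sketched below. Whether SimTOST applies exactly this conversion internally is an assumption, and the helper name `to_log_scale` is hypothetical.

```r
# Convert an arithmetic mean and SD to log-normal parameters by moment matching:
#   sdlog^2 = log(1 + CV^2),  meanlog = log(mean) - sdlog^2 / 2
to_log_scale <- function(mu, sigma) {
  cv2    <- (sigma / mu)^2
  sdlog2 <- log(1 + cv2)
  list(meanlog = log(mu) - sdlog2 / 2, sdlog = sqrt(sdlog2))
}

# Published values for the SB2 arm
mu_sb2    <- c(AUCinf = 38703, AUClast = 36862, Cmax = 127.0)
sigma_sb2 <- c(AUCinf = 11114, AUClast = 9133,  Cmax = 16.9)

sb2_log <- to_log_scale(mu_sb2, sigma_sb2)

# Round trip: the log-normal mean exp(meanlog + sdlog^2 / 2) recovers the input
mu_back <- exp(sb2_log$meanlog + sb2_log$sdlog^2 / 2)
```

The log-scale standard deviation depends only on the coefficient of variation, which is why Cmax (the endpoint with the smallest relative variability) needed the fewest subjects when tested alone.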
Incorporating the correlations between endpoints in sample size calculations for continuous-valued co-primary endpoints offers significant advantages (Sozu et al. 2015). Adding more endpoints typically reduces power if such correlations are not accounted for. However, by including positive correlations in the calculations, power can be increased, and the required sample sizes may consequently be reduced.
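The effect of positive correlation on joint power can be illustrated with a small base R simulation. This is a deliberately simplified sketch: each endpoint's test is reduced to a standard normal statistic that "passes" when it falls below a cutoff, and the marginal success probability is chosen so that three independent endpoints give 90% joint power. It does not reproduce SimTOST's simulation engine.

```r
set.seed(1)

m     <- 3      # number of co-primary endpoints
n_sim <- 1e5    # number of simulated trials

# Per-endpoint success probability giving 90% joint power under independence
p_marginal <- 0.9^(1 / m)
cutoff     <- qnorm(p_marginal)

# Probability that all m endpoints pass, for a common correlation rho
joint_pass <- function(rho) {
  R <- matrix(rho, m, m)
  diag(R) <- 1
  # Correlated standard normals via the Cholesky factor of R
  z <- matrix(rnorm(n_sim * m), n_sim, m) %*% chol(R)
  mean(rowSums(z < cutoff) == m)
}

p_indep <- joint_pass(0)    # close to 0.90 by construction
p_corr  <- joint_pass(0.6)  # higher, because failures tend to coincide
```

With the same marginal power per endpoint, positive correlation raises the probability that all endpoints succeed together, which is the reason a smaller sample size can reach the same overall power.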
For this scenario, we proceed with the same values used previously but now assume that a correlation exists between endpoints. Specifically, we set ρ = 0.6, assuming a common correlation across all endpoints.
If correlations differ between endpoints, they can be specified individually using a correlation matrix (cor_mat), allowing for greater flexibility in the analysis.
(N_mult_corr <- sampleSize(power = 0.9, # target power
alpha = 0.05,
mu_list = mu_list,
sigma_list = sigma_list,
list_comparator = list_comparator,
list_lequi.tol = list_lequi.tol,
list_uequi.tol = list_uequi.tol,
rho = 0.6,
dtype = "parallel",
ctype = "ROM",
vareq = TRUE,
lognorm = TRUE,
nsim = 1000,
seed = 1234))
#> Sample Size Calculation Results
#> -------------------------------------------------------------
#> Study Design: parallel trial targeting 90% power with a 5% type-I error.
#>
#> Comparisons:
#> SB2 vs. EUREF
#> - Endpoints Tested: AUCinf, AUClast, Cmax
#> (multiple co-primary endpoints, m = 3 )
#> -------------------------------------------------------------
#> Parameter Value
#> Total Sample Size 82
#> Achieved Power 90.4
#> Power Confidence Interval 88.4 - 92.1
#> -------------------------------------------------------------
The required total sample size for this example is 82. This is 2 fewer patients than the scenario in which endpoints are assumed to be uncorrelated.
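A common correlation of 0.6 corresponds to a compound-symmetric correlation matrix. The base R sketch below builds that matrix and a variant with endpoint-specific correlations, then checks that the variant is a valid correlation matrix. The value 0.9 for the AUCinf/AUClast pair is purely hypothetical, and the exact format expected by the cor_mat argument is not shown in this vignette, so it should be confirmed in the package documentation before use.

```r
endpoints <- c("AUCinf", "AUClast", "Cmax")
rho <- 0.6

# Compound-symmetric matrix: the same correlation for every pair of endpoints
cor_common <- matrix(rho, nrow = 3, ncol = 3,
                     dimnames = list(endpoints, endpoints))
diag(cor_common) <- 1

# Variant with a stronger (hypothetical) correlation between the two AUC measures
cor_custom <- cor_common
cor_custom["AUCinf", "AUClast"] <- 0.9
cor_custom["AUClast", "AUCinf"] <- 0.9

# A valid correlation matrix is symmetric and positive definite
is_valid <- isSymmetric(cor_custom) &&
  all(eigen(cor_custom, symmetric = TRUE)$values > 0)
```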
Imagine that we are interested in demonstrating equivalence for at least k primary endpoints. Unlike the previous scenarios, in which equivalence was required for all endpoints, this scenario requires an adjustment for multiplicity to control the family-wise error rate. For example, when k = 1, we can use the Bonferroni correction:
(N_mp_bon <- sampleSize(
power = 0.9, # Target power
alpha = 0.05, # Significance level
mu_list = mu_list, # List of means
sigma_list = sigma_list, # List of standard deviations
list_comparator = list_comparator, # Comparator configurations
list_lequi.tol = list_lequi.tol, # Lower equivalence boundaries
list_uequi.tol = list_uequi.tol, # Upper equivalence boundaries
rho = 0.6, # Correlation between endpoints
dtype = "parallel", # Trial design type
ctype = "ROM", # Test type (Ratio of Means)
vareq = TRUE, # Assume equal variances
lognorm = TRUE, # Log-normal distribution assumption
k = c("EMA" = 1), # Demonstrate equivalence for at least 1 endpoint
adjust = "bon", # Bonferroni adjustment method
nsim = 1000, # Number of stochastic simulations
seed = 1234 # Random seed for reproducibility
))
#> Sample Size Calculation Results
#> -------------------------------------------------------------
#> Study Design: parallel trial targeting 90% power with a 5% type-I error.
#>
#> Comparisons:
#> SB2 vs. EUREF
#> - Endpoints Tested: AUCinf, AUClast, Cmax
#> (multiple primary endpoints, k = 1 )
#> - Multiplicity Correction: Bonferroni
#> Adjusted Significance Levels: alpha = 0.0167
#>
#> -------------------------------------------------------------
#> Parameter Value
#> Total Sample Size 24
#> Achieved Power 90.6
#> Power Confidence Interval 88.6 - 92.3
#> -------------------------------------------------------------
As mentioned in the Introduction, Bonferroni adjustment is often overly conservative, especially in scenarios with correlated tests. A less restrictive alternative is the Sidak correction, which accounts for the joint probability of all tests being non-significant, making it mathematically less conservative than the Bonferroni method.
(N_mp_sid <- sampleSize(
power = 0.9, # Target power
alpha = 0.05, # Significance level
mu_list = mu_list, # List of means
sigma_list = sigma_list, # List of standard deviations
list_comparator = list_comparator, # Comparator configurations
list_lequi.tol = list_lequi.tol, # Lower equivalence boundaries
list_uequi.tol = list_uequi.tol, # Upper equivalence boundaries
rho = 0.6, # Correlation between endpoints
dtype = "parallel", # Trial design type
ctype = "ROM", # Test type (Ratio of Means)
vareq = TRUE, # Assume equal variances
lognorm = TRUE, # Log-normal distribution assumption
k = c("EMA" = 1), # Demonstrate equivalence for at least 1 endpoint
adjust = "sid", # Sidak adjustment method
nsim = 1000, # Number of stochastic simulations
seed = 1234 # Random seed for reproducibility
))
#> Sample Size Calculation Results
#> -------------------------------------------------------------
#> Study Design: parallel trial targeting 90% power with a 5% type-I error.
#>
#> Comparisons:
#> SB2 vs. EUREF
#> - Endpoints Tested: AUCinf, AUClast, Cmax
#> (multiple primary endpoints, k = 1 )
#> - Multiplicity Correction: Sidak
#> Adjusted Significance Levels: alpha = 0.017
#>
#> -------------------------------------------------------------
#> Parameter Value
#> Total Sample Size 24
#> Achieved Power 90.6
#> Power Confidence Interval 88.6 - 92.3
#> -------------------------------------------------------------
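The adjusted significance levels reported in the two outputs above can be reproduced by hand with the textbook Bonferroni and Sidak formulas for m = 3 tests. This is a minimal check in base R and makes no claim about how SimTOST computes the values internally.

```r
alpha <- 0.05
m     <- 3   # number of endpoints tested

# Bonferroni: split alpha evenly across the m tests
alpha_bon <- alpha / m

# Sidak: solve 1 - (1 - a)^m = alpha for the per-test level a
alpha_sid <- 1 - (1 - alpha)^(1 / m)

round(c(bonferroni = alpha_bon, sidak = alpha_sid), 4)
#> bonferroni      sidak
#>     0.0167     0.0170
```

The two levels differ only in the fourth decimal place, which explains why both corrections lead to the same total sample size of 24 in this example.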
When k > 1, Bonferroni and Sidak correction methods become increasingly conservative. A more flexible approach is the k-adjustment, which specifically accounts for the number of tests and the number of endpoints required for equivalence.
(N_mp_k <- sampleSize(
power = 0.9, # Target power
alpha = 0.05, # Significance level
mu_list = mu_list, # List of means
sigma_list = sigma_list, # List of standard deviations
list_comparator = list_comparator, # Comparator configurations
list_lequi.tol = list_lequi.tol, # Lower equivalence boundaries
list_uequi.tol = list_uequi.tol, # Upper equivalence boundaries
rho = 0.6, # Correlation between endpoints
dtype = "parallel", # Trial design type
ctype = "ROM", # Test type (Ratio of Means)
vareq = TRUE, # Assume equal variances
lognorm = TRUE, # Log-normal distribution assumption
k = c("EMA" = 2), # Demonstrate equivalence for at least 2 endpoints
adjust = "k", # Adjustment method
nsim = 1000, # Number of stochastic simulations
seed = 1234 # Random seed for reproducibility
))
#> Sample Size Calculation Results
#> -------------------------------------------------------------
#> Study Design: parallel trial targeting 90% power with a 5% type-I error.
#>
#> Comparisons:
#> SB2 vs. EUREF
#> - Endpoints Tested: AUCinf, AUClast, Cmax
#> (multiple primary endpoints, k = 2 )
#> - Multiplicity Correction: k-adjustment
#> Adjusted Significance Levels: alpha = 0.0333
#>
#> -------------------------------------------------------------
#> Parameter Value
#> Total Sample Size 54
#> Achieved Power 91.6
#> Power Confidence Interval 89.7 - 93.2
#> -------------------------------------------------------------
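The adjusted level of 0.0333 in the output above is consistent with scaling the significance level by k / m. The sketch below reproduces that number in base R; the formula is inferred from the reported output and should be read as an assumption about the k-adjustment rather than a statement of the package's implementation.

```r
alpha <- 0.05
m     <- 3   # number of endpoints tested
k     <- 2   # number of endpoints that must demonstrate equivalence

# Assumed k-adjustment: per-test level scaled by k / m
alpha_k <- k * alpha / m

# For comparison, the Bonferroni level ignores k
alpha_bon <- alpha / m

round(c(k_adjustment = alpha_k, bonferroni = alpha_bon), 4)
#> k_adjustment   bonferroni
#>       0.0333       0.0167
```

Under this assumed formula, setting k = m returns the unadjusted level, matching the co-primary setting in which no multiplicity correction is applied.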
Hierarchical testing is a structured approach that allows for a more nuanced evaluation of endpoints. Unlike a simple setup in which at least k endpoints must pass, hierarchical testing treats some endpoints as more critical: these must always pass before testing proceeds to the secondary endpoints. This ensures that primary endpoints receive higher priority, while still allowing flexibility in the evaluation of secondary endpoints.
In this example, the trial follows a hierarchical testing strategy, with Cmax as the primary endpoint. Equivalence testing begins with Cmax; if established, the analysis proceeds to the secondary endpoints AUCinf and AUClast. The trial is considered successful if equivalence holds for Cmax and at least one (k = 1) of the secondary endpoints.
To implement this advanced hierarchical testing approach in SimTOST, we:
- Set adjust = "seq".
- Define the endpoint hierarchy with the type_y argument:
  - Cmax (coded as 1) is the most critical endpoint and must always pass.
  - AUCinf and AUClast (coded as 2) are less critical and only evaluated if Cmax passes.
- Set k = 1, ensuring that at least one of the two secondary endpoints (AUCinf or AUClast) must pass for the trial to be considered successful.

The following code demonstrates how to apply hierarchical testing in SimTOST:
(N_mp_seq <- sampleSize(
power = 0.9, # Target power
alpha = 0.05, # Significance level
mu_list = mu_list, # List of means
sigma_list = sigma_list, # List of standard deviations
list_comparator = list_comparator, # Comparator configurations
list_lequi.tol = list_lequi.tol, # Lower equivalence boundaries
list_uequi.tol = list_uequi.tol, # Upper equivalence boundaries
rho = 0.6, # Correlation between endpoints
dtype = "parallel", # Trial design type
ctype = "ROM", # Test type (Ratio of Means)
vareq = TRUE, # Assume equal variances
lognorm = TRUE, # Log-normal distribution assumption
adjust = "seq", # Sequential adjustment method
type_y = c("AUCinf" = 2, "AUClast" = 2, "Cmax" = 1), # Endpoint types
k = c("EMA" = 1), # Demonstrate equivalence for at least 1 secondary endpoint
nsim = 1000, # Number of stochastic simulations
seed = 1234 # Random seed for reproducibility
))
#> Sample Size Calculation Results
#> -------------------------------------------------------------
#> Study Design: parallel trial targeting 90% power with a 5% type-I error.
#>
#> Comparisons:
#> SB2 vs. EUREF
#> - Endpoints Tested: AUCinf, AUClast, Cmax
#> (multiple primary endpoints, k = 1 )
#> - Multiplicity Correction: Sequential
#> Adjusted Significance Levels: alpha = 0.025; 0.025; 0.050
#>
#> -------------------------------------------------------------
#> Parameter Value
#> Total Sample Size 58
#> Achieved Power 91.3
#> Power Confidence Interval 89.3 - 92.9
#> -------------------------------------------------------------
The hierarchical testing strategy ensured that Cmax, the primary endpoint, had to pass before testing proceeded to the secondary endpoints AUCinf and AUClast. If Cmax failed, the trial was considered unsuccessful without evaluating the secondary endpoints. However, if Cmax passed, at least one of the two secondary endpoints had to demonstrate equivalence for the trial to be considered successful.
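The decision rule described above can be written as a short base R function. The helper `trial_success` is hypothetical and exists only to make the logic explicit; it takes per-endpoint test outcomes as given and does not perform any equivalence testing itself.

```r
# Hypothetical helper: hierarchical success rule
#   pass   - named logical vector, TRUE if equivalence was shown for the endpoint
#   type_y - named numeric vector, 1 = primary endpoint, 2 = secondary endpoint
#   k      - minimum number of secondary endpoints that must pass
trial_success <- function(pass, type_y, k) {
  type_y <- type_y[names(pass)]          # align codes with outcomes by name
  primary_ok <- all(pass[type_y == 1])
  if (!primary_ok) {
    return(FALSE)                        # stop: secondaries are not evaluated
  }
  sum(pass[type_y == 2]) >= k
}

type_y <- c(AUCinf = 2, AUClast = 2, Cmax = 1)

# Cmax passes and one secondary endpoint passes: success
res_one_secondary <- trial_success(
  c(AUCinf = TRUE, AUClast = FALSE, Cmax = TRUE), type_y, k = 1)

# Cmax fails: unsuccessful regardless of the secondary endpoints
res_primary_fails <- trial_success(
  c(AUCinf = TRUE, AUClast = TRUE, Cmax = FALSE), type_y, k = 1)

# Cmax passes but no secondary endpoint passes: unsuccessful
res_no_secondary <- trial_success(
  c(AUCinf = FALSE, AUClast = FALSE, Cmax = TRUE), type_y, k = 1)
```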
In this particular study design, a total of 58 patients were required to achieve an overall power of 90%. Previously, 54 patients were sufficient to demonstrate equivalence for at least two endpoints without enforcing a hierarchical structure. The increase in sample size by 4 additional patients was necessary to ensure equivalence for Cmax, the designated primary endpoint. This highlights the impact of hierarchical testing, where primary endpoints must be adequately powered before secondary endpoints are considered.