Bioequivalence Tests for Parallel Trial Designs: 2 Arms, 3 Endpoints

In the SimTOST R package, which is designed for sample size estimation in bioequivalence studies, hypothesis testing is based on the Two One-Sided Tests (TOST) procedure (Sozu et al. 2015). In TOST, the equivalence test is framed as a comparison between the null hypothesis of ‘new product is worse by a clinically relevant quantity’ and the alternative hypothesis of ‘difference between products is too small to be clinically relevant’. This vignette focuses on a parallel design with 2 arms/treatments and 3 primary endpoints.
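As a sketch of the two one-sided hypotheses described above, write $\theta$ for the chosen measure of the treatment effect and $\theta_L < \theta_U$ for the lower and upper equivalence margins. These symbols are introduced here for illustration and are not taken from the package documentation.

```latex
\begin{aligned}
H_{0}^{(L)} &: \theta \le \theta_L \quad \text{vs.} \quad H_{1}^{(L)} : \theta > \theta_L, \\
H_{0}^{(U)} &: \theta \ge \theta_U \quad \text{vs.} \quad H_{1}^{(U)} : \theta < \theta_U .
\end{aligned}
```

Equivalence is concluded only when both one-sided null hypotheses are rejected, each at significance level $\alpha$, which is the same as concluding $\theta_L < \theta < \theta_U$.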

Introduction

In many studies, the aim is to evaluate equivalence across multiple primary endpoints. The European Medicines Agency (EMA) recommends demonstrating bioequivalence for both Area Under the Curve (AUC) and maximum concentration (Cmax) when assessing pharmacokinetic properties. This vignette presents advanced techniques for calculating sample size in parallel trial designs involving two treatment arms and three endpoints.

As an illustrative example, we consider published data from the phase-1 trial NCT01922336. This trial measured the pharmacokinetics (PK) of SB2 compared to its EU-sourced reference product (EU_Remicade). The following PK measures were reported following a single dose of SB2 or its EU reference product Remicade (Shin et al. 2015):

Primary PK measures between test and reference product. Data represent arithmetic mean ± standard deviation.
PK measure	SB2	Remicade (EU)
AUCinf ($\mu$g·h/mL)	38,703 $\pm$ 11,114	39,360 $\pm$ 12,332
AUClast ($\mu$g·h/mL)	36,862 $\pm$ 9,133	37,022 $\pm$ 9,398
Cmax ($\mu$g/mL)	127.0 $\pm$ 16.9	126.2 $\pm$ 17.9
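To get a feel for these numbers before any sample size calculation, the short base R sketch below computes the observed ratio of means (SB2 over Remicade) and the coefficient of variation in each arm. The data frame `pk` and its column names are chosen here for illustration only; the values are those in the table.

```r
# Arithmetic means and standard deviations as reported in the table
pk <- data.frame(
  measure  = c("AUCinf", "AUClast", "Cmax"),
  mean_sb2 = c(38703, 36862, 127.0),
  sd_sb2   = c(11114, 9133, 16.9),
  mean_ref = c(39360, 37022, 126.2),
  sd_ref   = c(12332, 9398, 17.9)
)

# Observed ratio of means (test / reference)
pk$rom <- pk$mean_sb2 / pk$mean_ref

# Coefficient of variation (SD / mean) in each arm
pk$cv_sb2 <- pk$sd_sb2 / pk$mean_sb2
pk$cv_ref <- pk$sd_ref / pk$mean_ref

print(cbind(pk["measure"], round(pk[, c("rom", "cv_sb2", "cv_ref")], 3)))
```

All three ratios are close to 1 (roughly 0.983, 0.996 and 1.006), and Cmax shows the smallest relative variability of the three measures.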

Testing multiple co-primary endpoints

The following sections describe strategies for determining the sample size required for a parallel-group trial aimed at establishing equivalence across three co-primary endpoints. The Ratio of Means (ROM) approach will be used to assess equivalence.

A critical step in this process is defining the lower and upper equivalence boundaries for each endpoint. These boundaries set the acceptable ROM range within which equivalence is established. For simplicity, a consistent equivalence range of 0.8 to 1.25 is applied to all endpoints.
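To make the role of the boundaries concrete, here is a minimal base R sketch of a TOST for a single endpoint on the log scale. The helper `tost_rom` is hypothetical and is not part of SimTOST; the simulated data are illustrative values only. Note that a comparison of log-transformed data targets the ratio of geometric means, which coincides with the ratio of arithmetic means only when both arms share the same log-scale variance, as assumed in this simulation.

```r
# Minimal TOST sketch on the log scale for one endpoint.
# `tost_rom` is a hypothetical helper, not a SimTOST function.
tost_rom <- function(test, ref, lower = 0.80, upper = 1.25, alpha = 0.05) {
  x <- log(test)
  y <- log(ref)
  # One-sided test of H0: ratio <= lower against ratio > lower
  p_lower <- t.test(x, y, mu = log(lower), alternative = "greater",
                    var.equal = TRUE)$p.value
  # One-sided test of H0: ratio >= upper against ratio < upper
  p_upper <- t.test(x, y, mu = log(upper), alternative = "less",
                    var.equal = TRUE)$p.value
  list(p_lower = p_lower, p_upper = p_upper,
       equivalent = max(p_lower, p_upper) < alpha)
}

set.seed(1)
# Simulated log-normal data with a true ratio of 1 (illustrative values only)
test <- rlnorm(200, meanlog = log(100), sdlog = 0.3)
ref  <- rlnorm(200, meanlog = log(100), sdlog = 0.3)

res <- tost_rom(test, ref)
res$equivalent
```

Equivalence is declared only if both one-sided p-values fall below `alpha`, which is why the helper compares the larger of the two with `alpha`.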

Independent Testing of Co-Primary Endpoints

A conservative approach to sample size calculation involves testing each pharmacokinetic (PK) measure independently. This approach assumes that endpoints are uncorrelated and that equivalence is to be demonstrated for each endpoint separately. Consequently, the overall sample size required for the trial is the sum of the sample sizes calculated for each PK measure separately.

library(SimTOST)

# Sample size calculation for AUCinf
(sim_AUCinf <- sampleSize(
  power = 0.9,                                # Target power
  alpha = 0.05,                               # Significance level
  arm_names = c("SB2", "EU_Remicade"),        # Names of trial arms
  list_comparator = list("EMA" = c("SB2","EU_Remicade")),   # Comparator configuration
  mu_list = list("SB2" = 38703, "EU_Remicade" = 39360),     # Mean values
  sigma_list = list("SB2" = 11114, "EU_Remicade" = 12332),  # Standard deviation values
  list_lequi.tol = list("EMA" = 0.80),        # Lower equivalence margin
  list_uequi.tol = list("EMA" = 1.25),        # Upper equivalence margin
  nsim = 1000                                 # Number of stochastic simulations
))
#> Sample Size Calculation Results
#> -------------------------------------------------------------
#> Study Design: parallel trial targeting 90% power with a 5% type-I error.
#> 
#> Comparisons:
#>    SB2 vs. EU_Remicade 
#>     - Endpoints Tested: y1 
#> -------------------------------------------------------------
#>                  Parameter       Value
#>          Total Sample Size          78
#>             Achieved Power        90.5
#>  Power Confidence Interval 88.5 - 92.2
#> -------------------------------------------------------------

# Sample size calculation for AUClast
(sim_AUClast <- sampleSize(
  power = 0.9,                                # Target power
  alpha = 0.05,                               # Significance level
  arm_names = c("SB2", "EU_Remicade"),        # Names of trial arms
  list_comparator = list("EMA" = c("SB2", "EU_Remicade")),  # Comparator configuration
  mu_list = list("SB2" = 36862, "EU_Remicade" = 37022),     # Mean values
  sigma_list = list("SB2" = 9133, "EU_Remicade" = 9398),    # Standard deviation values
  list_lequi.tol = list("EMA" = 0.80),        # Lower equivalence margin
  list_uequi.tol = list("EMA" = 1.25),        # Upper equivalence margin
  nsim = 1000                                 # Number of stochastic simulations
))
#> Sample Size Calculation Results
#> -------------------------------------------------------------
#> Study Design: parallel trial targeting 90% power with a 5% type-I error.
#> 
#> Comparisons:
#>    SB2 vs. EU_Remicade 
#>     - Endpoints Tested: y1 
#> -------------------------------------------------------------
#>                  Parameter       Value
#>          Total Sample Size          54
#>             Achieved Power        90.4
#>  Power Confidence Interval 88.4 - 92.1
#> -------------------------------------------------------------


# Sample size calculation for Cmax
(sim_Cmax <- sampleSize(
  power = 0.9,                                # Target power
  alpha = 0.05,                               # Significance level
  arm_names = c("SB2", "EU_Remicade"),        # Names of trial arms
  list_comparator = list("EMA" = c("SB2", "EU_Remicade")),  # Comparator configuration
  mu_list = list("SB2" = 127.0, "EU_Remicade" = 126.2),     # Mean values
  sigma_list = list("SB2" = 16.9, "EU_Remicade" = 17.9),    # Standard deviation values
  list_lequi.tol = list("EMA" = 0.80),        # Lower equivalence margin
  list_uequi.tol = list("EMA" = 1.25),        # Upper equivalence margin
  nsim = 1000                                 # Number of stochastic simulations
))
#> Sample Size Calculation Results
#> -------------------------------------------------------------
#> Study Design: parallel trial targeting 90% power with a 5% type-I error.
#> 
#> Comparisons:
#>    SB2 vs. EU_Remicade 
#>     - Endpoints Tested: y1 
#> -------------------------------------------------------------
#>                  Parameter       Value
#>          Total Sample Size          18
#>             Achieved Power        91.2
#>  Power Confidence Interval 89.2 - 92.8
#> -------------------------------------------------------------

When testing each PK measure independently, the total sample size is 78 for AUCinf, 54 for AUClast, and 18 for Cmax. This means that we would have to enroll 78 + 54 + 18 = 150 patients in order to reject $H_0$. Note that the significance level of this combined test is then $0.05^3$. For context, the original trial was a randomized, single-blind, three-arm, parallel-group study conducted in 159 healthy subjects, slightly more than the 150 patients estimated to be necessary.
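The arithmetic in this paragraph can be reproduced in a few lines of base R. The sample sizes are copied from the outputs above rather than recomputed, and the variable names are illustrative. The product $0.05^3$ is the probability that three independent level-0.05 tests all reject when every null hypothesis holds.

```r
# Total sample sizes reported above for each separately tested PK measure
n_separate <- c(AUCinf = 78, AUClast = 54, Cmax = 18)

# Overall enrollment under the "sum of separate sample sizes" approach
n_total <- sum(n_separate)

# Chance that three independent level-0.05 tests all reject under their nulls
alpha_combined <- 0.05^length(n_separate)

n_total
alpha_combined
```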

Simultaneous Testing of Independent Co-Primary Endpoints

This approach focuses on simultaneous testing of PK measures while assuming independence between endpoints. Unlike the previous approach, which tested each PK measure independently, this approach integrates comparisons across multiple endpoints while directly controlling the overall Type I error rate at a pre-specified level.

The arithmetic means and standard deviations for each endpoint and treatment arm are defined as follows:

mu_list <- list(
  SB2 = c(AUCinf = 38703, AUClast = 36862, Cmax = 127.0),
  EUREF = c(AUCinf = 39360, AUClast = 37022, Cmax = 126.2)
)

sigma_list <- list(
  SB2 = c(AUCinf = 11114, AUClast = 9133, Cmax = 16.9),
  EUREF = c(AUCinf = 12332, AUClast = 9398, Cmax = 17.9)
)

Subsequently, we define the equivalence boundaries:

list_comparator <- list("EMA" = c("SB2", "EUREF"))
list_lequi.tol <- list("EMA" = c(AUCinf = 0.8, AUClast = 0.8, Cmax = 0.8))
list_uequi.tol <- list("EMA" = c(AUCinf = 1.25, AUClast = 1.25, Cmax = 1.25))

Sample size calculation can then be implemented as follows:

(N_ss <- sampleSize(power = 0.9, # target power
                    alpha = 0.05,
                    mu_list = mu_list,
                    sigma_list = sigma_list,
                    list_comparator = list_comparator,
                    list_lequi.tol = list_lequi.tol,
                    list_uequi.tol = list_uequi.tol,
                    dtype = "parallel",
                    ctype = "ROM",
                    vareq = TRUE,
                    lognorm = TRUE,
                    nsim = 1000,
                    seed = 1234))
#> Sample Size Calculation Results
#> -------------------------------------------------------------
#> Study Design: parallel trial targeting 90% power with a 5% type-I error.
#> 
#> Comparisons:
#>    SB2 vs. EUREF 
#>     - Endpoints Tested: AUCinf, AUClast, Cmax 
#>       (multiple co-primary endpoints, m =  3 )
#> -------------------------------------------------------------
#>                  Parameter       Value
#>          Total Sample Size          84
#>             Achieved Power          90
#>  Power Confidence Interval 87.9 - 91.8
#> -------------------------------------------------------------

We can inspect the sample size requirements in more detail as follows:

N_ss$response
#>    n_iter n_drop n_SB2 n_EUREF n_total power power_LCI power_UCI
#>     <num>  <num> <num>   <num>   <num> <num>     <num>     <num>
#> 1:     42      0    42      42      84   0.9 0.8793091 0.9175476
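One way to see why the simultaneous test needs somewhat more subjects than the most demanding single endpoint (84 versus 78 for AUCinf alone) is that, under independence, the probability of passing all endpoints is the product of the per-endpoint powers. The sketch below uses base R only; `joint_power` is a hypothetical helper and the 90% marginal powers are illustrative.

```r
# Joint power for independent co-primary endpoints: product of marginal powers.
# `joint_power` is a hypothetical helper, not a SimTOST function.
joint_power <- function(marginal_power) prod(marginal_power)

# Three endpoints each powered at 90% give a joint power well below 90%
joint_power(c(0.9, 0.9, 0.9))

# Marginal power each of three equally powered endpoints would need
# for a joint power of 90%
required_marginal <- 0.9^(1 / 3)
round(required_marginal, 4)
```

In the example above the endpoints are not equally powered: Cmax and AUClast are already far above 90% at 42 subjects per arm, so most of the burden falls on AUCinf.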

Simultaneous Testing of Correlated Co-Primary Endpoints

Incorporating the correlations between endpoints in sample size calculations for continuous-valued co-primary endpoints offers significant advantages (Sozu et al. 2015). Adding more endpoints typically reduces power if such correlations are not accounted for. However, by including positive correlations in the calculations, power can be increased, and the required sample sizes may consequently be reduced.

For this scenario, we proceed with the same values used previously but now assume that a correlation exists between endpoints. Specifically, we set $\rho = 0.6$, assuming a common correlation across all endpoints.

If correlations differ between endpoints, they can be specified individually using a correlation matrix (cor_mat), allowing for greater flexibility in the analysis.

(N_mult_corr <- sampleSize(power = 0.9, # target power
                           alpha = 0.05,
                           mu_list = mu_list,
                           sigma_list = sigma_list,
                           list_comparator = list_comparator,
                           list_lequi.tol = list_lequi.tol,
                           list_uequi.tol = list_uequi.tol,
                           rho = 0.6,
                           dtype = "parallel",
                           ctype = "ROM",
                           vareq = TRUE,
                           lognorm = TRUE,
                           nsim = 1000,
                           seed = 1234))
#> Sample Size Calculation Results
#> -------------------------------------------------------------
#> Study Design: parallel trial targeting 90% power with a 5% type-I error.
#> 
#> Comparisons:
#>    SB2 vs. EUREF 
#>     - Endpoints Tested: AUCinf, AUClast, Cmax 
#>       (multiple co-primary endpoints, m =  3 )
#> -------------------------------------------------------------
#>                  Parameter       Value
#>          Total Sample Size          82
#>             Achieved Power        90.4
#>  Power Confidence Interval 88.4 - 92.1
#> -------------------------------------------------------------

The required total sample size for this example is 82. This is 2 fewer patients than the scenario in which endpoints are assumed to be uncorrelated.
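For readers who want to move from a common correlation to endpoint-specific correlations via the correlation matrix mentioned above, the following base R sketch builds both kinds of matrix and checks that they are valid. The unequal correlation values are hypothetical, and `is_valid_cor` is an illustrative helper; how `sampleSize` expects the matrix to be dimensioned and named should be confirmed in the package documentation.

```r
endpoints <- c("AUCinf", "AUClast", "Cmax")

# Common correlation of 0.6 between all pairs of endpoints
cs <- matrix(0.6, nrow = 3, ncol = 3, dimnames = list(endpoints, endpoints))
diag(cs) <- 1

# Endpoint-specific correlations (hypothetical values for illustration)
cor_mat <- matrix(c(1.0, 0.8, 0.5,
                    0.8, 1.0, 0.4,
                    0.5, 0.4, 1.0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(endpoints, endpoints))

# A valid correlation matrix is symmetric, has unit diagonal,
# and is positive definite (all eigenvalues > 0)
is_valid_cor <- function(m) {
  isSymmetric(m) &&
    all(diag(m) == 1) &&
    all(eigen(m, symmetric = TRUE, only.values = TRUE)$values > 0)
}

is_valid_cor(cs)
is_valid_cor(cor_mat)
```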

Testing multiple primary endpoints

Simultaneous Testing of Primary Endpoints

Imagine that we are interested in demonstrating equivalence for at least $k$ primary endpoints. Unlike the previous scenarios, in which equivalence was required for all endpoints, this scenario requires an adjustment for multiplicity to control the family-wise error rate. For example, when $k=1$, we can use the Bonferroni correction:

(N_mp_bon <- sampleSize(
  power = 0.9,               # Target power
  alpha = 0.05,              # Significance level
  mu_list = mu_list,         # List of means
  sigma_list = sigma_list,   # List of standard deviations
  list_comparator = list_comparator,  # Comparator configurations
  list_lequi.tol = list_lequi.tol,    # Lower equivalence boundaries
  list_uequi.tol = list_uequi.tol,    # Upper equivalence boundaries
  rho = 0.6,                 # Correlation between endpoints
  dtype = "parallel",        # Trial design type
  ctype = "ROM",             # Test type (Ratio of Means)
  vareq = TRUE,              # Assume equal variances
  lognorm = TRUE,            # Log-normal distribution assumption
  k = c("EMA" = 1),          # Demonstrate equivalence for at least 1 endpoint
  adjust = "bon",            # Bonferroni adjustment method
  nsim = 1000,               # Number of stochastic simulations
  seed = 1234                # Random seed for reproducibility
))
#> Sample Size Calculation Results
#> -------------------------------------------------------------
#> Study Design: parallel trial targeting 90% power with a 5% type-I error.
#> 
#> Comparisons:
#>    SB2 vs. EUREF 
#>     - Endpoints Tested: AUCinf, AUClast, Cmax 
#>       (multiple primary endpoints, k =  1 )
#>     - Multiplicity Correction: Bonferroni 
#>       Adjusted Significance Levels: alpha = 0.0167 
#> 
#> -------------------------------------------------------------
#>                  Parameter       Value
#>          Total Sample Size          24
#>             Achieved Power        90.6
#>  Power Confidence Interval 88.6 - 92.3
#> -------------------------------------------------------------

The Bonferroni adjustment is often overly conservative, especially in scenarios with correlated tests. A less restrictive alternative is the Sidak correction, which is based on the joint probability of all tests being non-significant and therefore yields a slightly larger adjusted significance level than the Bonferroni method.

(N_mp_sid <- sampleSize(
  power = 0.9,               # Target power
  alpha = 0.05,              # Significance level
  mu_list = mu_list,         # List of means
  sigma_list = sigma_list,   # List of standard deviations
  list_comparator = list_comparator,  # Comparator configurations
  list_lequi.tol = list_lequi.tol,    # Lower equivalence boundaries
  list_uequi.tol = list_uequi.tol,    # Upper equivalence boundaries
  rho = 0.6,                 # Correlation between endpoints
  dtype = "parallel",        # Trial design type
  ctype = "ROM",             # Test type (Ratio of Means)
  vareq = TRUE,              # Assume equal variances
  lognorm = TRUE,            # Log-normal distribution assumption
  k = c("EMA" = 1),          # Demonstrate equivalence for at least 1 endpoint
  adjust = "sid",            # Sidak adjustment method
  nsim = 1000,               # Number of stochastic simulations
  seed = 1234                # Random seed for reproducibility
))
#> Sample Size Calculation Results
#> -------------------------------------------------------------
#> Study Design: parallel trial targeting 90% power with a 5% type-I error.
#> 
#> Comparisons:
#>    SB2 vs. EUREF 
#>     - Endpoints Tested: AUCinf, AUClast, Cmax 
#>       (multiple primary endpoints, k =  1 )
#>     - Multiplicity Correction: Sidak 
#>       Adjusted Significance Levels: alpha = 0.017 
#> 
#> -------------------------------------------------------------
#>                  Parameter       Value
#>          Total Sample Size          24
#>             Achieved Power        90.6
#>  Power Confidence Interval 88.6 - 92.3
#> -------------------------------------------------------------
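The two adjusted significance levels reported above can be checked by hand with the standard Bonferroni and Sidak formulas for $m = 3$ endpoints. This is a base R sketch with illustrative variable names.

```r
alpha <- 0.05  # overall significance level
m     <- 3     # number of endpoints tested

# Bonferroni: split alpha evenly across the m tests
alpha_bon <- alpha / m

# Sidak: solve 1 - (1 - a)^m = alpha for the per-test level a
alpha_sid <- 1 - (1 - alpha)^(1 / m)

round(c(bonferroni = alpha_bon, sidak = alpha_sid), 4)
```

The two levels differ only in the fourth decimal place (about 0.0167 versus 0.0170), which is consistent with both corrections leading to the same total sample size of 24 in this example.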

When $k>1$, Bonferroni and Sidak correction methods become increasingly conservative. A more flexible approach is the k-adjustment, which specifically accounts for the number of tests and the number of endpoints required for equivalence.

(N_mp_k <- sampleSize(
  power = 0.9,               # Target power
  alpha = 0.05,              # Significance level
  mu_list = mu_list,         # List of means
  sigma_list = sigma_list,   # List of standard deviations
  list_comparator = list_comparator,  # Comparator configurations
  list_lequi.tol = list_lequi.tol,    # Lower equivalence boundaries
  list_uequi.tol = list_uequi.tol,    # Upper equivalence boundaries
  rho = 0.6,                 # Correlation between endpoints
  dtype = "parallel",        # Trial design type
  ctype = "ROM",             # Test type (Ratio of Means)
  vareq = TRUE,              # Assume equal variances
  lognorm = TRUE,            # Log-normal distribution assumption
  k = c("EMA" = 2),          # Demonstrate equivalence for at least 2 endpoints
  adjust = "k",              # Adjustment method
  nsim = 1000,               # Number of stochastic simulations
  seed = 1234                # Random seed for reproducibility
))
#> Sample Size Calculation Results
#> -------------------------------------------------------------
#> Study Design: parallel trial targeting 90% power with a 5% type-I error.
#> 
#> Comparisons:
#>    SB2 vs. EUREF 
#>     - Endpoints Tested: AUCinf, AUClast, Cmax 
#>       (multiple primary endpoints, k =  2 )
#>     - Multiplicity Correction: k-adjustment 
#>       Adjusted Significance Levels: alpha = 0.0333 
#> 
#> -------------------------------------------------------------
#>                  Parameter       Value
#>          Total Sample Size          54
#>             Achieved Power        91.6
#>  Power Confidence Interval 89.7 - 93.2
#> -------------------------------------------------------------
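The adjusted significance level of 0.0333 reported above is consistent with scaling $\alpha$ by the fraction of endpoints that must pass. The formula below is inferred from that output with $m = 3$ endpoints and $k = 2$, and should be treated as an assumption rather than as the package's documented definition.

```latex
\alpha_k = \frac{k\,\alpha}{m} = \frac{2 \times 0.05}{3} \approx 0.0333
```

Under this reading, $k = 1$ recovers the Bonferroni level $\alpha / m$, and $k = m$ leaves $\alpha$ unadjusted, matching the co-primary setting in which no multiplicity correction is applied.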

Hierarchical Testing of Endpoints

Hierarchical testing is a structured approach that allows for a more nuanced evaluation of endpoints. Unlike a simple setup where at least $k$ endpoints must pass, hierarchical testing enforces that some endpoints are more critical and must always pass before proceeding to secondary endpoints. This ensures that primary endpoints receive higher priority, while still allowing flexibility in the evaluation of secondary endpoints.

In this example, the trial follows a hierarchical testing strategy, with Cmax as the primary endpoint. Equivalence testing begins with Cmax; if established, the analysis proceeds to the secondary endpoints AUCinf and AUClast. The trial is considered successful if equivalence holds for Cmax and at least one ($k \geq 1$) of the secondary endpoints.

To implement this advanced hierarchical testing approach in SimTOST, we:

1. Use hierarchical testing by setting adjust = "seq".
2. Define the endpoint hierarchy using the type_y argument:
   - Primary endpoint: Cmax (coded as 1) is the most critical endpoint and must always pass.
   - Secondary endpoints: AUCinf and AUClast (coded as 2) are less critical and only evaluated if Cmax passes.
3. Set k = 1, so that at least one of the two secondary endpoints (AUCinf or AUClast) must pass for the trial to be considered successful.

The following code demonstrates how to apply hierarchical testing in SimTOST:

(N_mp_seq <- sampleSize(
  power           = 0.9,                              # Target power
  alpha           = 0.05,                             # Significance level
  mu_list         = mu_list,                          # List of means
  sigma_list      = sigma_list,                       # List of standard deviations
  list_comparator = list_comparator,                  # Comparator configurations
  list_lequi.tol  = list_lequi.tol,                   # Lower equivalence boundaries
  list_uequi.tol  = list_uequi.tol,                   # Upper equivalence boundaries
  rho             = 0.6,                              # Correlation between endpoints
  dtype           = "parallel",                       # Trial design type
  ctype           = "ROM",                            # Test type (Ratio of Means)
  vareq           = TRUE,                             # Assume equal variances
  lognorm         = TRUE,                             # Log-normal distribution assumption
  adjust          = "seq",                            # Sequential adjustment method
  type_y          = c("AUCinf" = 2, "AUClast" = 2, "Cmax" = 1), # Endpoint types
  k               = c("EMA" = 1),                     # At least 1 of the 2 secondary endpoints must pass
  nsim            = 1000,                             # Number of stochastic simulations
  seed            = 1234                              # Random seed for reproducibility
))
#> Sample Size Calculation Results
#> -------------------------------------------------------------
#> Study Design: parallel trial targeting 90% power with a 5% type-I error.
#> 
#> Comparisons:
#>    SB2 vs. EUREF 
#>     - Endpoints Tested: AUCinf, AUClast, Cmax 
#>       (multiple primary endpoints, k =  1 )
#>     - Multiplicity Correction: Sequential 
#>       Adjusted Significance Levels: alpha = 0.025; 0.025; 0.050 
#> 
#> -------------------------------------------------------------
#>                  Parameter       Value
#>          Total Sample Size          58
#>             Achieved Power        91.3
#>  Power Confidence Interval 89.3 - 92.9
#> -------------------------------------------------------------

The hierarchical testing strategy ensured that Cmax, the primary endpoint, had to pass before testing proceeded to the secondary endpoints AUCinf and AUClast. If Cmax failed, the trial was considered unsuccessful without evaluating the secondary endpoints. However, if Cmax passed, at least one of the two secondary endpoints had to demonstrate equivalence for the trial to be considered successful.
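The success rule described above can be written as a small base R function. This is only an illustration of the decision logic for a single trial; `hier_success` is a hypothetical helper, it does not reproduce SimTOST internals, and it ignores how the significance level is allocated across endpoints.

```r
# Decision rule for hierarchical testing:
# all primary endpoints (type 1) must pass, and at least k secondary
# endpoints (type 2) must pass. `hier_success` is a hypothetical helper.
hier_success <- function(pass, type_y, k) {
  stopifnot(identical(names(pass), names(type_y)))
  primary_ok   <- all(pass[type_y == 1])
  secondary_ok <- sum(pass[type_y == 2]) >= k
  primary_ok && secondary_ok
}

type_y <- c(AUCinf = 2, AUClast = 2, Cmax = 1)

# Cmax passes and one secondary endpoint passes: success
hier_success(c(AUCinf = TRUE, AUClast = FALSE, Cmax = TRUE), type_y, k = 1)

# Cmax fails: unsuccessful regardless of the secondary endpoints
hier_success(c(AUCinf = TRUE, AUClast = TRUE, Cmax = FALSE), type_y, k = 1)

# Cmax passes but no secondary endpoint passes: unsuccessful
hier_success(c(AUCinf = FALSE, AUClast = FALSE, Cmax = TRUE), type_y, k = 1)
```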

In this particular study design, a total of 58 patients were required to achieve an overall power of 90%. Previously, 54 patients were sufficient to demonstrate equivalence for at least two endpoints without enforcing a hierarchical structure. The increase in sample size by 4 additional patients was necessary to ensure equivalence for Cmax, the designated primary endpoint. This highlights the impact of hierarchical testing, where primary endpoints must be adequately powered before secondary endpoints are considered.

References

Shin, Donghoon, Youngdoe Kim, Yoo Seok Kim, Thomas Körnicke, and Rainard Fuhr. 2015. “A Randomized, Phase I Pharmacokinetic Study Comparing SB2 and Infliximab Reference Product (Remicade) in Healthy Subjects.” BioDrugs 29 (6): 381–88. https://doi.org/10.1007/s40259-015-0150-5.

Sozu, Takashi, Tomoyuki Sugimoto, Toshimitsu Hamasaki, and Scott R. Evans. 2015. Sample Size Determination in Clinical Trials with Multiple Endpoints. SpringerBriefs in Statistics. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-22005-5.