Finite Sample Corrections for Average Equivalence Testing

Average (bio)equivalence tests are used to assess whether a parameter, such as the mean difference in treatment response between two conditions, lies within a given equivalence interval, so that the conditions can be declared to have ‘equivalent’ means. The Two One-Sided Tests (TOST) procedure, which tests whether the target parameter is significantly greater than a pre-defined lower equivalence limit and significantly lower than a pre-defined upper equivalence limit, is typically used in this context, usually by checking whether the confidence interval for the target parameter lies within these limits. This intuitive and visual procedure is, however, known to be conservative, especially in the case of highly variable drugs, where it shows a rapid power loss, often reaching zero, hence making it impossible to conclude equivalence even when it holds. Here, we propose a finite sample correction of the TOST procedure, the α-TOST, which consists in a correction of the significance level of the TOST that guarantees a test size (or type-I error rate) of α. This new procedure essentially corresponds to a finite sample and variability correction of the TOST procedure. We show that this procedure is uniformly more powerful than the TOST, easy to compute, and that its operating characteristics outperform those of its competitors. A case study about econazole nitrate deposition in porcine skin is used to illustrate the benefits of the proposed method and its advantages compared to other available procedures.


Introduction
Equivalence tests, also known as similarity or parity tests, have gained significant attention during the past two decades.
They originated from the field of pharmacokinetics, 1,2 where they are called bioequivalence tests and have numerous applications in both research and production. 3 They find their most common application in the manufacturing of generic medicinal drugs, where, by proving that the generic version has a similar bioavailability to its well-studied brand-name counterpart, the manufacturer can considerably shorten the approval process for the generic drug. 4 Moreover, equivalence tests have attracted growing interest in other domains and for other types of purposes, such as in production when, for example, the mode of administration is altered or when the production site is changed, 5 or in the social and behavioral sciences for the evaluation of replication results and corroborating risky predictions. 6 Very recent literature reflects the expanding use of equivalence tests across a growing range of domains. Examples include the investigation of the equivalence of virtual reality imaging measurements by feature, 7 of cardiovascular responses to stimuli by sex, 8 of children's neurodevelopment, 9 of chemotherapy efficacy and safety by treatment, 10 of post-stroke functional connectivity patterns by patient group, 11 of risk-taking choices by moral type, 12 and of 2020 US presidential election turnout by political advertising condition. 13 Review articles have also appeared, for example, in food sciences, 14 in psychology, 15 in sport sciences, 16 and in pharmaceutical sciences. 17
Equivalence testing implies defining an equivalence region within which the parameter of interest, such as the difference between outcome means measured under two conditions, would lie for these conditions to be considered equivalent. Indeed, when comparing two treatments, for example, differences in therapeutic effects that belong to the equivalence region would typically be considered negligible or irrelevant. This is different from standard equality-of-means hypothesis tests, in which the null and the alternative hypotheses are interchanged and the null hypothesis states that both means are equal rather than equivalent.
Formally, a canonical form for the average equivalence problem consists of two independent random variables $\hat{\theta}$ and $\hat{\sigma}_\nu$ having the distributions

$$\hat{\theta} \sim \mathcal{N}\left(\theta, \sigma_\nu^2\right) \quad \text{and} \quad \nu \, \frac{\hat{\sigma}_\nu^2}{\sigma_\nu^2} \sim \chi^2_\nu, \tag{1}$$

where $\theta$ and $\sigma_\nu^2$ respectively denote the target equivalence parameter and the variance of its estimator, depending on the number of degrees of freedom $\nu$, which is a function of the sample size and total number of parameters. This setting is very general. It covers cases where, for example, the bioequivalence parameter corresponds to the difference in means of the (logarithm of pharmacokinetic) responses between two experimental conditions, or to an element of the parameter vector of a (generalised) linear mixed effect model, like the difference between the slopes of two conditions in a longitudinal study. The hypotheses of interest are given by

$$H_0: \theta \notin \Theta_1 := (\delta_L, \delta_U) \quad \text{versus} \quad H_1: \theta \in \Theta_1, \tag{2}$$

where $\delta_L$ and $\delta_U$ are known constants. Without loss of generality, it can be assumed that the equivalence limits are symmetrical around zero. In this case, $c := \delta_U = -\delta_L$ so that $\Theta_1 = (-c, c)$. Equivalence is typically investigated via the Two One-Sided Tests (TOST) procedure, 18 which consists in testing whether the target parameter is significantly greater than $-c$ and significantly lower than $c$, with the test size, or type-I error rate, controlled at the significance (or nominal) level $\alpha$, usually chosen as 5%. 19 More precisely, the TOST is level-$\alpha$, meaning that its size is smaller than or equal to $\alpha$ (see (4) in Section 2.1 for its formal definition). The most common way of assessing equivalence is to use the Interval Inclusion Principle (IIP) and check whether the $100(1 - 2\alpha)\%$ Confidence Interval (CI) for the target parameter falls within the equivalence margins $(-c, c)$. 3,20 This strategy has been shown to lead to the same test decision as the TOST procedure if the CI is equi-tailed. 
21,22 However, it is well known that this procedure can be conservative, as the size of the TOST can be considerably lower than the (specified) significance level $\alpha$. This induces a drop in power and therefore a lower probability of detecting truly equivalent mean effects as such. This problem is particularly noticeable in cases where $\sigma_\nu$ is relatively large. Such situations may occur, for example, when the sample size is determined using an underestimated standard deviation value obtained from a prior experiment, or with studies involving highly variable drugs in which the sample size that would be needed to achieve reasonable values for $\sigma_\nu$ is unrealistic. For that purpose, Anderson and Hauck 23 proposed a test that has greater power than the TOST for situations where $\sigma_\nu$ is relatively large. This test, referred to here as the AH-test, can be liberal and therefore does not control the size. 22 In some cases, it can also lead to the declaration of equivalence (i.e., acceptance of equivalence through the rejection of the null hypothesis in (2) at the $\alpha$ level) when $\hat{\theta}$, the estimate of the target parameter of interest, falls outside the equivalence interval. 18 Brown et al. 24 constructed an unbiased test that is uniformly more powerful than the TOST; however, it is computationally intensive and its rejection region may exhibit rather irregular shapes in some cases. 22 Berger et al. 22 therefore proposed a smoothed version. These tests cannot be assessed using the IIP and the last two are difficult to interpret due to the use of polar coordinates. 
25 In the specific context of average bioequivalence testing in replicated crossover designs 26 for highly variable drugs, i.e., for cases with relatively large $\sigma_\nu$, regulatory authorities have recommended an alternative approach based on the linear scaling of the bioequivalence limits according to the value of the standard deviation within the reference group, called Scaled Average BioEquivalence (SABE), 27 also referred to as Average BioEquivalence with Expanding limits (ABEL) in some references, 26,28,29 with the constraint that $\hat{\theta}$ lies within the bioequivalence margins $(-c, c)$. These recommendations were issued by the European Medicines Agency (EMA) and the US Food and Drug Administration (FDA). 20,30 The amount of expansion is limited by the authorities, and several recent publications have shown that the size of the SABE can be larger than the significance level $\alpha$ 20,26,31,32,33 and have therefore proposed different ways to correct for it. 28,34,35,36 These corrections ensure that the size is smaller than or equal to $\alpha$ and lead to acceptance regions that change more smoothly with $\sigma_\nu$.
In this paper, as an alternative to previous methods, we propose a finite sample correction of the TOST procedure that simply consists in a correction of the TOST's significance level to guarantee a size-$\alpha$ test when $\sigma_\nu$ is known. This correction is design-agnostic and can be used with parallel or (replicated) crossover designs, for example. The corrected significance level $\alpha^*$ is straightforward to compute and yields a $100(1 - 2\alpha^*)\%$ CI that is then used exactly as in the classical TOST.
Hence, the α-TOST essentially corresponds to a finite sample continuous variability correction of the TOST procedure that leads to an increased probability of declaring equivalence when it is true for large values of $\sigma_\nu$, while maintaining a size of exactly $\alpha$ when $\sigma_\nu$ is known. Indeed, the α-TOST is shown to be uniformly more powerful than the TOST and, for small to moderate values of $\sigma_\nu$, to be nearly equivalent to the TOST with a comparable power, as $\alpha^* \approx \alpha$ in such cases. Since, in practice, $\sigma_\nu$ needs to be estimated from the data, a straightforward estimator for $\alpha^*$ is also proposed.
It is shown, through an extensive simulation study considering the canonical form defined in (1) and therefore valid in a wide range of settings, that the α-TOST based on this estimator remains level-$\alpha$ and that its size stays close to $\alpha$. Our simulation study also considers a version of the TOST that adjusts the equivalence limits $\delta_L$ and $\delta_U$, instead of the level, to guarantee a size-$\alpha$ test, and which is therefore referred to as the δ-TOST. Our results show that the α-TOST is both more powerful and more accurate than the standard TOST and the δ-TOST, indicating that, when looking for a design-agnostic correction valid in general settings, a correction on the level (α-TOST) leads to better operating characteristics. A comparison of the performance of these methods to the corrected SABE, which consists in an adjustment of both the equivalence bounds and the level, is presented in Appendix E in a simple paired setting. More adequate and extensive comparisons, considering the different adjustments proposed by regulatory agencies and other authors, including variants such as the corrected SABE, are needed in the specific case of average equivalence testing with replicated crossover designs and are left for further research.
The paper is organized as follows. The α-TOST is presented in Section 2 from a suitable formulation of the TOST. Its statistical properties as well as a simple algorithm to compute $\alpha^*$ are also provided. In Section 3, an extensive simulation study is used to compare the empirical performances of the α-TOST, δ-TOST and standard TOST. In Section 4, we consider a case study to which we apply the TOST and the α-TOST, as well as other available methods, in order to showcase the advantages of our proposed design-agnostic approach. Finally, Section 5 discusses some potential extensions.

Equivalence Testing
In this section, we present the methodology for deriving a corrected statistical equivalence test. We first present the TOST and its properties. We then define the α-TOST procedure through a natural correction of the TOST, derive its statistical properties, propose an iterative procedure to compute the corrected level $\alpha^*$ and show that this procedure is exponentially fast. We also show that the α-TOST is uniformly more powerful than the TOST.

The TOST Procedure
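For testing the hypotheses in (2), the TOST uses the two following test statistics:

$$Z_L := \frac{\hat{\theta} + c}{\hat{\sigma}_\nu} \quad \text{and} \quad Z_U := \frac{\hat{\theta} - c}{\hat{\sigma}_\nu}, \tag{3}$$

where $Z_L$ tests $\theta \le -c$ against $\theta > -c$ and $Z_U$ tests $\theta \ge c$ against $\theta < c$. At the significance level $\alpha$, the TOST declares equivalence if both one-sided tests reject, that is, if $Z_L > t_{\alpha,\nu}$ and $Z_U < -t_{\alpha,\nu}$, where $t_{\alpha,\nu}$ denotes the upper $\alpha$ quantile of a t-distribution with $\nu$ degrees of freedom. This event can equivalently be written as $|\hat{\theta}| < c - t_{\alpha,\nu}\hat{\sigma}_\nu$, and its probability, viewed as a function of $\alpha$, $\theta$, $\sigma_\nu$, $\nu$ and $c$, is

$$p(\alpha, \theta, \sigma_\nu, \nu, c) := \Pr\left(Z_L > t_{\alpha,\nu} \ \text{and} \ Z_U < -t_{\alpha,\nu}\right). \tag{4}$$

Consequently, equivalence cannot be declared with the TOST for all $\hat{\sigma}_\nu > \hat{\sigma}_{\max} := c / t_{\alpha,\nu}$, even for $\hat{\theta} = 0$ (see also Figure 6 of Section 4).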
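The decision rule of the TOST and its interval-inclusion reading can be sketched in a few lines. This is a minimal illustration, assuming the usual statistics $(\hat{\theta} \pm c)/\hat{\sigma}_\nu$ and symmetric limits $(-c, c)$; the function names are hypothetical and the numerical inputs in the usage note are only illustrative.

```python
import math
from scipy import stats


def tost_declares_equivalence(theta_hat, sigma_hat, nu, c, alpha=0.05):
    """Return True if the TOST declares equivalence at level alpha."""
    t_crit = stats.t.ppf(1.0 - alpha, df=nu)  # upper alpha quantile of t_nu
    z_lower = (theta_hat + c) / sigma_hat     # one-sided test against theta <= -c
    z_upper = (theta_hat - c) / sigma_hat     # one-sided test against theta >= c
    return bool(z_lower > t_crit and z_upper < -t_crit)


def equi_tailed_ci(theta_hat, sigma_hat, nu, alpha=0.05):
    """Return the 100(1 - 2*alpha)% equi-tailed CI used by the IIP."""
    t_crit = stats.t.ppf(1.0 - alpha, df=nu)
    return theta_hat - t_crit * sigma_hat, theta_hat + t_crit * sigma_hat


def sigma_max(nu, c, alpha=0.05):
    """Largest standard error for which the TOST can still declare equivalence."""
    return c / stats.t.ppf(1.0 - alpha, df=nu)
```

For instance, with `c = math.log(1.25)` and `nu = 16`, calling `tost_declares_equivalence(0.023, 0.134, 16, c)` and checking whether `equi_tailed_ci(0.023, 0.134, 16)` lies inside `(-c, c)` give the same answer, and `sigma_max(16, c)` gives the threshold above which the TOST has no power whatever the value of $\hat{\theta}$.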
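Then, for given values of $\alpha$, $\sigma_\nu$ and $\nu$, the TOST's size is defined as the supremum of (4) over the null hypothesis, 39 and is given by

$$\omega(\alpha, c, \sigma_\nu, \nu) := \sup_{\theta \notin \Theta_1} \Pr\left(|\hat{\theta}| < c - t_{\alpha,\nu}\hat{\sigma}_\nu \mid \theta\right) = \Pr\left(|\hat{\theta}| < c - t_{\alpha,\nu}\hat{\sigma}_\nu \mid \theta = c\right). \tag{5}$$

We can then deduce that the TOST is level-$\alpha$ by noting that, for $\sigma_\nu > 0$, we have

$$\omega(\alpha, c, \sigma_\nu, \nu) < \Pr\left(T_\nu < -t_{\alpha,\nu}\right) = \alpha,$$

where $T_\nu$ denotes a random variable following a t-distribution with $\nu$ degrees of freedom, so that

$$\lim_{\sigma_\nu \to 0^+} \omega(\alpha, c, \sigma_\nu, \nu) = \alpha.$$

Thus, while the TOST is indeed level-$\alpha$, it actually never achieves a size of $\alpha$, except in the theoretical case of $\sigma_\nu = 0$, as already highlighted by several authors. 36 When $\hat{\sigma}_\nu$ is small, the difference between the size and $\alpha$ is marginal, but as $\hat{\sigma}_\nu$ approaches or exceeds $c / t_{\alpha,\nu}$, this difference increases, leading to a high probability of the TOST failing to detect equivalence when it exists. As a solution to this issue, we suggest an alternative approach, the α-TOST, that corrects the size of the TOST for a large range of values of $\hat{\sigma}_\nu$ and still allows equivalence to be assessed by means of confidence intervals, as depicted in Figure 5 of Section 4.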
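To see where this conservativeness comes from, it can help to condition on the estimated standard error. The following is a worked sketch under the canonical form, not the closed-form expression used by the authors: $\Phi$ denotes the standard normal distribution function, $s$ a realized value of $\hat{\sigma}_\nu$, $Z$ a standard normal variable independent of $\hat{\sigma}_\nu$, and $[x]_+ := \max(x, 0)$, all notation introduced here for illustration only.

```latex
% Given \hat\sigma_\nu = s, the TOST declares equivalence iff |\hat\theta| < c - t_{\alpha,\nu} s.
% At the boundary \theta = c we have \hat\theta = c + \sigma_\nu Z, hence
\Pr\left(|\hat\theta| < c - t_{\alpha,\nu} s \,\middle|\, \hat\sigma_\nu = s,\ \theta = c\right)
  = \left[\Phi\!\left(-\frac{t_{\alpha,\nu}\, s}{\sigma_\nu}\right)
        - \Phi\!\left(\frac{t_{\alpha,\nu}\, s - 2c}{\sigma_\nu}\right)\right]_+ .

% Averaging over \hat\sigma_\nu gives the size; dropping the subtracted term
% and the truncation can only increase the expectation:
\text{size}
  = E\!\left\{\left[\Phi\!\left(-\frac{t_{\alpha,\nu}\hat\sigma_\nu}{\sigma_\nu}\right)
        - \Phi\!\left(\frac{t_{\alpha,\nu}\hat\sigma_\nu - 2c}{\sigma_\nu}\right)\right]_+\right\}
  < E\!\left\{\Phi\!\left(-\frac{t_{\alpha,\nu}\hat\sigma_\nu}{\sigma_\nu}\right)\right\}
  = \Pr\!\left(\frac{Z}{\hat\sigma_\nu/\sigma_\nu} \le -t_{\alpha,\nu}\right)
  = \Pr\left(T_\nu \le -t_{\alpha,\nu}\right) = \alpha .
```

The subtracted term and the truncation at zero both become more influential as $\sigma_\nu$ grows relative to $c$, which is why the gap between the size and $\alpha$ widens for large standard errors and vanishes as $\sigma_\nu \to 0$.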

The α-TOST
A corrected version of the TOST can theoretically be constructed by adjusting the significance level and using $\alpha^*$ instead of $\alpha$ in the standard TOST procedure, where

$$\alpha^* := \alpha^*(\sigma_\nu) := \operatorname*{argzero}_{\gamma \in [\alpha, 0.5)} \left[\omega(\gamma, c, \sigma_\nu, \nu) - \alpha\right], \tag{7}$$

with $\omega(\gamma, c, \sigma_\nu, \nu)$ defined in (5). The dependence of $\alpha^*$ on $\alpha$ and $\nu$ is omitted from the notation as these quantities are known. A similar type of correction was also used to amend the significance level of the SABE procedure by Labes and Schütz 28 and Ocaña and Muñoz 35 (see also Palmes et al. 40 for power adjustment). However, in these cases, the corrected significance level was reduced (instead of increased as in (7)) so that the size does not exceed the significance level $\alpha$. The aim of these corrections is therefore not the same as the one proposed here. Furthermore, the size of the α-TOST is guaranteed to be exactly $\alpha$ when $\sigma_\nu$ is known, which is not the case for these competing methods.
In Appendix A, we demonstrate that the existence of $\alpha^*$ relies on a simple condition that is satisfied in most settings of practical importance. In particular, this requirement can be translated into a maximal value for the estimated standard error $\hat{\sigma}_\nu$, namely $\hat{\sigma}_\nu < 2c / \Phi^{-1}(\alpha + 0.5)$. Moreover, since $\alpha^*(\sigma_\nu)$ is a population-level quantity, as it depends on the unknown quantity $\sigma_\nu$, a natural estimator is given by its sample counterpart

$$\hat{\alpha}^* := \alpha^*(\hat{\sigma}_\nu) = \operatorname*{argzero}_{\gamma \in [\alpha, 0.5)} \left[\omega(\gamma, c, \hat{\sigma}_\nu, \nu) - \alpha\right].$$

Hence, in practice, based on the (estimated) corrected significance level $\hat{\alpha}^*$, the α-TOST procedure rejects the non-equivalence null hypothesis in favour of the equivalence one at the significance level $\alpha$ if $Z_L > t_{\hat{\alpha}^*,\nu}$ and $Z_U < -t_{\hat{\alpha}^*,\nu}$. In Appendix B, we study the asymptotic properties of $\hat{\alpha}^*$ and show that $\hat{\alpha}^* = \alpha^* + o_p(\nu^{-1})$. Informally, this result implies that the uncertainty associated with $\hat{\alpha}^*$ is (asymptotically) negligible compared to the uncertainty associated with $\hat{\theta}$ and $\hat{\sigma}_\nu$, as these terms have slower convergence rates in that $\hat{\theta} = \theta + O_p(\nu^{-1/2})$ and $\hat{\sigma}_\nu = \sigma_\nu + O_p(\nu^{-1})$.
This result also suggests that the α-TOST procedures based on $\alpha^*$ or on $\hat{\alpha}^*$ are expected to provide very similar finite sample performances.
In Section 3, we consider an extensive Monte Carlo simulation study to compare the empirical performances of different methods when $\sigma_\nu$ needs to be estimated. For the $10^4$ simulation settings we considered (i.e., 100 values for $\sigma_\nu$ and 100 for $\nu$ covering most combinations of interest, see Simulation 2 in Table 1), we find that the empirical size of the α-TOST is generally closer to the nominal level $\alpha$ than that of the other methods (see Figure 2 in Section 3). We also find that, in fewer than 1% of the settings, the α-TOST procedure can be slightly liberal, with a maximal empirical size of 0.05311 (see Figure 9 in Appendix D). However, this behaviour can mostly be explained by the randomness associated with our large-scale simulation.
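The corrected significance level $\hat{\alpha}^*$ can easily be computed using the following iterative approach. At iteration $k$, with $k \in \mathbb{N}$, we define

$$\hat{\alpha}^{*(k+1)} = \alpha + \hat{\alpha}^{*(k)} - \omega\left(\hat{\alpha}^{*(k)}, c, \hat{\sigma}_\nu, \nu\right),$$

with $\omega(\alpha, c, \sigma_\nu, \nu)$ given in (5) and where the procedure is initialized at $\hat{\alpha}^{*(0)} = \alpha$. This simple iterative approach converges exponentially fast to $\hat{\alpha}^*$, as it can be shown that

$$\left|\hat{\alpha}^{*(k+1)} - \hat{\alpha}^*\right| < \frac{1}{2}\exp(-bk),$$

for some positive constant $b$ (see Appendix C for more details).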
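The fixed-point iteration can be sketched as follows. This is a minimal illustration and not the authors' implementation: the size $\omega$ is approximated here by a midpoint rule over the distribution of $\hat{\sigma}_\nu$ rather than by a closed-form expression, the function names `tost_size` and `alpha_star` are hypothetical, and convergence is only expected when the existence condition on the standard error holds.

```python
import math
import numpy as np
from scipy import stats


def tost_size(gamma, c, sigma, nu, n_grid=20_000):
    """Approximate Pr(TOST at level gamma declares equivalence | theta = c)."""
    t_crit = stats.t.ppf(1.0 - gamma, df=nu)  # upper gamma quantile of t_nu
    # Midpoint rule in probability space for sigma_hat = sigma * sqrt(chi2_nu / nu).
    u = (np.arange(n_grid) + 0.5) / n_grid
    s = sigma * np.sqrt(stats.chi2.ppf(u, df=nu) / nu)
    # Rejection probability given sigma_hat = s, with theta_hat ~ N(c, sigma^2);
    # it is zero whenever c - t_crit * s is negative, hence the clipping.
    p = stats.norm.cdf(-t_crit * s / sigma) - stats.norm.cdf((t_crit * s - 2.0 * c) / sigma)
    return float(np.mean(np.clip(p, 0.0, None)))


def alpha_star(alpha, c, sigma, nu, tol=1e-10, max_iter=500):
    """Iterate gamma <- alpha + gamma - omega(gamma), starting from gamma = alpha."""
    gamma = alpha
    for _ in range(max_iter):
        new_gamma = alpha + gamma - tost_size(gamma, c, sigma, nu)
        if abs(new_gamma - gamma) < tol:
            return new_gamma
        gamma = new_gamma
    raise RuntimeError("fixed-point iteration did not converge")
```

In practice one would call `alpha_star(0.05, math.log(1.25), sigma_hat, nu)` with the estimated standard error plugged in for `sigma`; at the returned level, `tost_size` should be numerically equal to the nominal $\alpha$.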
Finally, since the α-TOST uses the critical value $t_{\hat{\alpha}^*,\nu}$, which is smaller than the value $t_{\alpha,\nu}$ used by the TOST because $\hat{\alpha}^* > \alpha$, its confidence interval is necessarily shorter and its rejection region necessarily larger than those of the TOST. This implies that the α-TOST is uniformly more powerful than the TOST, and explains cases like the one encountered in the porcine skin case study presented in Section 4, in which equivalence is declared using the α-TOST but not with the TOST (which has an empirical power of zero given $\hat{\sigma}_\nu$).

Simulation Study
In this section, we conduct an extensive Monte Carlo simulation study, with the parameter settings of each simulation reported in Table 1. Simulations 1 to 3, performed under the canonical form defined in (1) and therefore valid in a wide range of settings, assess the empirical performances of the α-TOST and compare them to those of the standard TOST and δ-TOST methods, where the latter, defined below, considers a correction of the equivalence limits rather than of the level to reach a size of $\alpha$. Simulation 4, presented in Appendix E, compares the empirical performances of these methods with those of the design-specific SABE and corrected SABE, where the latter consists in an adjustment of both the level and the equivalence limits. In that simulation, we consider a paired design setting that is closely related to the example considered in our case study and that allows us to estimate the within-subject variability of the reference treatment required by SABE-like methods. All simulations consider a target significance level of 5%, a value of $c$ equal to $\log(1.25)$ and $10^5$ Monte Carlo samples per configuration.
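Formally, the δ-TOST is defined as follows: it applies the standard TOST at level $\alpha$ with the equivalence limits $(-c, c)$ replaced by $(-\delta^*, \delta^*)$, that is, it declares equivalence if $|\hat{\theta}| < \delta^* - t_{\alpha,\nu}\hat{\sigma}_\nu$, where the corrected limit $\delta^* \ge c$ is chosen such that the size, evaluated at $\theta = c$, equals $\alpha$:

$$\delta^* := \operatorname*{argzero}_{\delta \ge c} \left[\Pr\left(|\hat{\theta}| < \delta - t_{\alpha,\nu}\hat{\sigma}_\nu \mid \theta = c\right) - \alpha\right].$$

Using the same arguments as in Appendix A, we can easily demonstrate that a unique solution always exists for the δ-TOST, regardless of the value of the standard error. However, an exponentially fast iterative algorithm cannot be used to find the solution for this method. This highlights an important practical advantage of the α-TOST over the δ-TOST and, to a larger extent, over the corrected SABE, which relies on both Monte Carlo integration and numerical optimisation procedures to define its correction (see Appendix E).

Table 1: Parameter values used in each simulation, where $c$ denotes the tolerance limit, $\nu$ the number of degrees of freedom, $\theta$ the target parameter and $\sigma_\nu$ its standard deviation, $\alpha$ the target significance level and $B$ the number of Monte Carlo samples per simulation.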
In our simulations, the empirical performances of the TOST, α-TOST and δ-TOST are defined using the following steps:
1. Simulation: for a given Monte Carlo sample $b = 1, \ldots, B$:
(a) simulate a value for $\hat{\theta}_b \sim \mathcal{N}(\theta, \sigma_\nu^2)$ given the values $\theta$ and $\sigma_\nu$ of interest,
(b) simulate a value for $\hat{\sigma}_{\nu,b}$ from its distribution in (1).

Simulation 1 investigates the probability of declaring equivalence for varying values of $\theta$, which allows us to study both the power and the size of each method for combinations of selected values of $\nu$ and $\sigma_\nu$. Simulation results are presented in Figure 1, which shows, for each method of interest, the empirical probability of declaring equivalence as a function of $\theta$ for different combinations of values of $\nu$ (rows) and $\sigma_\nu$ (columns). For small values of $\sigma_\nu$, the empirical performances of all methods are similar. For moderate to large values of $\sigma_\nu$, we can note that the TOST is conservative, with an empirical size far smaller than the nominal level $\alpha = 5\%$ when $\theta = c$, and that it quickly reaches an empirical power of 0 for large values of $\sigma_\nu$. On the other hand, the α-TOST and δ-TOST have a higher power throughout and are generally size-$\alpha$, but are a bit conservative for large values of $\sigma_\nu$ and relatively small values of $\nu$. This confirms that applying an adjustment to the TOST, either on the level or on the equivalence bounds, considerably improves both the size and power in finite samples, especially for larger $\sigma_\nu$, where it prevents the power from dropping to 0. Moreover, this simulation suggests that the α-TOST outperforms the δ-TOST, indicating that an adjustment on the level provides both a more accurate and a more powerful test than an adjustment on the equivalence bounds.
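The simulation step can be sketched for the standard TOST as follows. This is a minimal illustration under the canonical form, assuming that $\hat{\theta}$ is drawn from $\mathcal{N}(\theta, \sigma_\nu^2)$ and $\hat{\sigma}_\nu$ from $\sigma_\nu\sqrt{\chi^2_\nu/\nu}$; the function name, the seed and the default number of replicates are hypothetical choices and not those of the authors' code.

```python
import math
import numpy as np
from scipy import stats


def empirical_tost_rate(theta, sigma, nu, c, alpha=0.05, n_mc=100_000, seed=0):
    """Monte Carlo estimate of the probability that the TOST declares equivalence."""
    rng = np.random.default_rng(seed)
    # (a) draw the estimator of the target parameter
    theta_hat = rng.normal(loc=theta, scale=sigma, size=n_mc)
    # (b) draw the estimated standard error, independently of theta_hat
    sigma_hat = sigma * np.sqrt(rng.chisquare(df=nu, size=n_mc) / nu)
    # TOST decision: both one-sided tests reject, i.e. |theta_hat| < c - t * sigma_hat
    t_crit = stats.t.ppf(1.0 - alpha, df=nu)
    declared = np.abs(theta_hat) < c - t_crit * sigma_hat
    return float(declared.mean())
```

Setting `theta = c` gives an estimate of the empirical size and `theta = 0` an estimate of the empirical power; the same skeleton applies to a corrected procedure once the critical value or the limit is replaced by its corrected counterpart, computed for each replicate from `sigma_hat`.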
Simulations 2 and 3, respectively performed with $\theta = c$ and $\theta = 0$, investigate the empirical size and power when replacing both $\sigma_\nu$ and $\theta$ by realizations of their random variables in (1), to reproduce the parameter estimation process, for all combinations of values of $\nu$ and $\sigma_\nu$ of interest. Figure 7 shows that the TOST is size-$\alpha$ only for relatively small values of $\sigma_\nu$ (below 0.09), and that its size decreases abruptly to 0 as $\sigma_\nu$ increases. We can note that the value of $\nu$ does not seem to have an important effect on the size of the TOST. In comparison, Figures 8 and 9 show that both the δ-TOST and the α-TOST are size-$\alpha$ for a larger number of combinations of values of $\sigma_\nu$ and $\nu$, and that the probability of being size-$\alpha$ increases with $\nu$ for a given value of $\sigma_\nu$. A comparison of Figures 8 and 9 shows that the α-TOST is both more powerful and more accurate than the δ-TOST overall, a conclusion in agreement with the results of Simulation 1. A look at the proportion of configurations with an empirical size significantly greater than $\alpha$, as assessed by a two-sided binomial exact test at the 1% level performed on the results obtained on the $10^5$ Monte Carlo samples per setting, shows that the α-TOST procedure is slightly liberal in 0.9% of the configurations considered in Simulation 2, with a maximal empirical size of 0.05311, compared to 0.0528 for the TOST and δ-TOST. This behavior can largely be attributed to the randomness inherent to our large-scale simulation. Figure 3 summarises the results of Simulation 3 by displaying, for each pair of methods, the histogram of their differences in power for all configurations. The results show that the α-TOST is overall the most powerful, followed by the δ-TOST and then by the TOST.
In summary, the simulation studies considered here suggest that a correction of the TOST provides more power and better accuracy in finite samples, with considerably large improvements when $\sigma_\nu$ is large. Moreover, the α-TOST appears to provide a better performance than the δ-TOST, indicating that an adjustment on the level rather than on the equivalence bounds is preferable to enhance the finite sample properties of equivalence tests. Results of Simulation 4, considering a paired study and additional correction methods, also suggest that adjusting the level of the TOST leads to better operating characteristics than competing methods, including the corrected SABE.

Evaluation of Bioequivalence for Econazole Nitrate Deposition in Porcine Skin
Quartier et al. 42 studied the cutaneous bioequivalence of two topical cream products: a Reference Medicinal Product (RMP) and an approved generic containing econazole nitrate (ECZ), an antifungal medication used to treat skin infections. The evaluation of the putative bioequivalence is based on the determination of the cutaneous biodistribution profile of ECZ observed after application of the RMP and the generic product. The dataset we analyse in this section consists of 17 pairs of comparable porcine skin samples on which measurements of ECZ deposition were collected using both creams. Figure 4 presents the data, collected via a simple paired design in which each pig delivered two skin samples, each treated with one of the two products of interest. Such designs, possibly attractive for studies not involving regulators, are stricto sensu incompatible with the use of design-specific SABE-like corrections 26 and are therefore interesting to showcase the advantages of our design-agnostic method. In order to assess the bioequivalence of both topical treatments, the TOST and α-TOST procedures, based on a paired t-test statistic considering ECZ levels on the logarithmic scale, are conducted using $c = \delta_U = -\delta_L = \log(1.25) \approx 0.223$. Although the way to define bioequivalence limits for topical products is still being discussed, 43 we believe the chosen limits to be reasonable. 42
Figure 5 shows the CIs corresponding to both approaches. The $100(1 - 2\alpha)\%$ TOST confidence interval for the mean of the paired differences in ECZ levels equals $[-0.204, 0.250]$, given that $\hat{\theta} = 0.023$, $\hat{\sigma}_\nu = 0.134$, $\nu = 16$ and $\alpha = 5\%$. As its upper bound exceeds the upper bioequivalence limit, the classical TOST procedure does not allow us to conclude that the topical products are (on average) equivalent. To reach a size of 5%, the α-TOST procedure uses in this case a significance level of $\hat{\alpha}^* = 7.48\%$, leading to a confidence interval of $[-0.166, 0.211]$. Since this CI is strictly embedded within the $(-c, c)$ bioequivalence limits, the α-TOST procedure allows us to declare bioequivalence, hence illustrating the increase in power induced by the increased significance level considered to reach a size of 5%. Note that in this case, given $\hat{\sigma}_\nu$ and $\nu$, the empirical power of the TOST is zero (regardless of $\hat{\theta}$) as $t_{0.05,16}\,\hat{\sigma}_\nu > c$, where $t_{\alpha,\nu}$ denotes the upper $\alpha$ quantile of a t-distribution with $\nu$ degrees of freedom; see Appendix E. Since the α-TOST guarantees a size of $\alpha$ (for all sample sizes), the conclusion reached with the α-TOST is more trustworthy.
To gain additional insight into the benefits conferred by our approach, we also compare the characteristics and conclusion of the α-TOST to those of other available methods in Table 2, as well as their rejection regions as a function of $\hat{\theta}$ and $\hat{\sigma}_\nu$ in Figure 6. We considered here the AH-test, the TOST, the α-TOST and the δ-TOST. The AH-test does not satisfy the IIP, but represents a good proxy for the other tests without this property and is relatively easy to implement. Among the level-$\alpha$ tests, the α-TOST is the only one leading to a declaration of bioequivalence.
Figure 6 shows the combinations of values of $\hat{\theta}$ and $\hat{\sigma}_\nu$ leading to a declaration of bioequivalence in the setting of the porcine skin dataset, i.e., with $c = \log(1.25)$ and $\nu = 16$. The rejection regions of the different methods almost perfectly overlap for values of $\hat{\sigma}_\nu$ below 0.09 and differ for larger values. Regardless of $\hat{\theta}$, the TOST cannot declare bioequivalence for large values of $\hat{\sigma}_\nu$ (greater than approximately 0.12 here), nor can the δ-TOST for approximately $\hat{\sigma}_\nu > 0.17$, while the α-TOST and AH-test can, with the rejection region of the α-TOST embedded in the overly liberal one of the AH-test.
Figure 5: $100(1 - 2\alpha)\%$ and $100(1 - 2\hat{\alpha}^*)\%$ confidence intervals of the TOST and α-TOST procedures for the mean of the paired log differences in ECZ levels obtained with the reference and generic creams, with $\alpha = 5\%$ and $\hat{\alpha}^* = 7.48\%$. The dashed vertical lines correspond to the lower and upper bioequivalence limits used, with $c = \log(1.25)$. Comparison of the CI of each approach to the bioequivalence limits leads to the declaration of bioequivalence for the α-TOST procedure and not for the classic TOST approach, due to its CI upper limit exceeding $c$ (hatched area).

Table 2: Bioequivalence declaration (yes/no) for the econazole nitrate deposition in porcine skin data using the AH-test, TOST, α-TOST and δ-TOST. The estimated parameter values are $\hat{\sigma}_\nu = 0.134$, $\nu = 16$, $\hat{\theta} = 0.023$ and $\alpha = 5\%$. The columns IIP, Level-α and Size-α respectively indicate whether each method satisfies the IIP, whether its size is bounded by $\alpha$ and whether its size is exactly $\alpha$. The symbol * specifies that the property is valid when the standard error $\sigma_\nu$ is known.

Discussion
The canonical framework treated in this paper is given in (1) and therefore concerns differences that can be assumed to be normally distributed (in finite samples), with a known finite sample distribution for $\hat{\sigma}_\nu$. This framework covers a fairly broad spectrum of data settings, such as the standard two-period crossover experimental design, 44 and could be extended to include covariates to possibly reduce residual variance. Extensions to non-linear cases, such as binary outcomes, 45,46,47,48,49 would follow the same logic, but would require a specific treatment due to the nature of the responses and to the use of link functions. Such extensions also deserve some attention but are left for further research.
For sample size calculations, we could, in principle, proceed with the α-TOST, for given values of $c$, $\theta$ and $\sigma_\nu$. However, when considering high levels of power, the correction is negligible and we have $\alpha^* \approx \alpha$, as shown in Section 3, so that the sample size can be computed using the TOST, as implemented in standard packages. The α-TOST approach would then be used to assess equivalence and show its benefits when the observed value of $\sigma_\nu$ is unexpectedly large compared to the one considered in the sample size calculation, either due to (lack of) chance or to an underestimated value obtained from a prior experiment.
Then, using Kirszbraun's theorem, 52 we can extend the function $T(\gamma)$ with respect to $\gamma \in A$ to a contraction map from $\mathbb{R}$ to $\mathbb{R}$. Thus, the Banach fixed point theorem ensures that $T\left(\alpha^{*(k)}\right)$ converges as $k \to \infty$. We then define the limit of the sequence $\left\{\alpha^{*(k+1)}\right\}_{k \in \mathbb{N}}$ as $\alpha^*$, which is the unique fixed point of the function $T(\gamma)$. Indeed, we have $\alpha^* = T(\alpha^*) = \alpha + \alpha^* - \omega(\alpha^*)$. By

D. Empirical Size and Power Comparisons for the TOST, α-TOST and δ-TOST
In this section, we perform an extensive simulation study for the evaluation of the empirical size and power of the α-TOST, compared to the TOST and δ-TOST, by varying the values of $\nu$ and $\sigma_\nu$ over a large grid. The power (i.e., when $\theta = 0$) and the size (i.e., when $\theta = c$) are computed by replacing both $\sigma_\nu$ and $\theta$ by realizations of their corresponding random variables in (1), i.e., reproducing the case of parameter estimation. The simulation settings we consider are given in Simulations 2 and 3 of Table 1 for the size and power, respectively.

Figure 8: Heatmap representing the empirical size in % (color gradient) for the δ-TOST, computed using the setting of Simulation 2 in Table 1, as a function of $\sigma_\nu$ (x-axis) and $\nu$ (y-axis). The lighter colors highlighted in the top legend correspond to the $\alpha = 5\%$ nominal level, up to a simulation error assessed by a two-sided binomial exact test at the 1% level performed on the results obtained on the $10^5$ Monte Carlo samples per setting.

Figure 9: Heatmap representing the empirical size in % (color gradient) for the α-TOST, computed using the setting of Simulation 2 in Table 1, as a function of $\sigma_\nu$ (x-axis) and $\nu$ (y-axis). The lighter colors highlighted in the top legend correspond to the $\alpha = 5\%$ nominal level, up to a simulation error assessed by a two-sided binomial exact test at the 1% level performed on the results obtained on the $10^5$ Monte Carlo samples per setting.

Figure 12: Heatmap representing the empirical power in % (color gradient) for the α-TOST, computed using the setting of Simulation 3 in Table 1, as a function of $\sigma_\nu$ (x-axis) and $\nu$ (y-axis).
In terms of power, the α-TOST uniformly dominates the other two methods and the δ-TOST uniformly dominates the cSABE. This again suggests that an adjustment on the level of the TOST is the most effective way to improve the finite sample properties of equivalence testing.

First row: empirical probability of declaring bioequivalence (y-axis) computed using the setting of Simulation 4, as a function of $\theta$ (x-axis) and $\sigma_\nu$ (columns), with $\nu = 45$, for the TOST (pink circles), the α-TOST (green triangles), the δ-TOST (blue squares), the SABE (red crosses) and the corrected SABE (light-green diamonds). The tight gray area stands for a 99% simulation error tolerance interval of $(4.84, 5.16)$ corresponding to $\alpha = 5\%$ and $B = 10^5$ Monte Carlo samples. Second row: the difference of the empirical probabilities between the α-TOST and the cSABE (green triangles), and between the δ-TOST and the cSABE (blue triangles). Empirically, the TOST is quite conservative while the SABE is very liberal. In terms of power, the α-TOST uniformly dominates the other two methods and the δ-TOST uniformly dominates the cSABE.

Figure 1: Empirical probability of declaring equivalence (y-axis) as a function of $\theta$ (x-axis), $\nu$ (columns) and $\sigma_\nu$ (rows), for the TOST (pink circles), the α-TOST (green triangles), and the δ-TOST (blue squares). Refer to the settings of Simulation 1 in Table 1 for details. In all configurations considered here, the α-TOST shows a similar or greater power than the TOST and δ-TOST while remaining more accurate in terms of empirical size.

Figure 2: Histograms of the empirical size (%) of the TOST (first line), the δ-TOST (second line) and the α-TOST (third line), computed from the results displayed in Figures 7, 8 and 9 in Appendix D, respectively. Overall, the α-TOST maintains a size of α for a larger proportion of parameter values in comparison to the other methods.
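Whether an empirical size is compatible with the nominal level is judged in the heatmaps of Appendix D by a two-sided exact binomial test at the 1% level. The sketch below shows one way such a check could be computed. The two-sided convention used here (summing the probabilities of all outcomes that are no more likely than the observed count) is an assumption, since the source does not say which convention was applied, and the function names and example counts are hypothetical.

```python
import math


def binom_log_pmf(k, n, p):
    """Log of the Binomial(n, p) probability mass at k."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def exact_binom_pvalue(k, n, p0):
    """Two-sided exact p-value for k rejections out of n under H0: p = p0.

    Assumed convention: total probability of outcomes no more likely than k.
    """
    log_obs = binom_log_pmf(k, n, p0)
    slack = 1e-7  # small slack so that ties are counted despite rounding
    total = 0.0
    for j in range(n + 1):
        log_pj = binom_log_pmf(j, n, p0)
        if log_pj <= log_obs + slack:
            total += math.exp(log_pj)
    return min(1.0, total)


n_samples = 10 ** 5   # Monte Carlo samples per setting
alpha = 0.05          # nominal level
for rejections in (5000, 5100, 5200):  # hypothetical rejection counts
    p_value = exact_binom_pvalue(rejections, n_samples, alpha)
    compatible = p_value >= 0.01
    print(rejections, compatible)
```

A setting would then be coloured as matching the nominal level when the p-value is at least 1%, i.e., when the deviation of the empirical size from α can be attributed to simulation error.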

Figure 2 summarises the results of Simulation 2 by displaying, for each method of interest, a histogram of the empirical sizes obtained over the $10^4$ configurations considered in the simulation. A comparison of these histograms clearly shows that the α-TOST has an empirical size closer to the nominal level α for a larger number of settings than the δ-TOST and the standard TOST, the latter showing a large clump at zero (31.5%) corresponding to configurations with a power of 0. The heatmaps in Figures 10, 11 and 12 in Appendix D show the results of Simulation 3 by displaying the power of each method for the same configurations as considered in Simulation 2. As expected, the results show that, over our $10^4$ configurations of interest, the α-TOST method is the most powerful, followed by the δ-TOST method and then [...]

[...] simulations are needed though to compare these methods in the design-specific context required by regulatory agencies when assessing bioequivalence. Such simulations should consider the different adjustments proposed by regulatory agencies and are left for further research.

Finally, note that the idea of improving the size of the TOST is not new, as Cao and Mathew 41 have proposed a correction based on an adjustment of the critical values, defined as a non-increasing continuous function of the sample standard deviation, to reduce the conservatism of the TOST. More particularly, they defined adjustment constants for specific values of $\hat{\sigma}_\nu$ and used linear interpolation for the adjacent values. The lower panel of Figure 14 in Appendix F compares the critical values obtained with the method of Cao and Mathew to the ones obtained by the α-TOST for different values of $\hat{\sigma}_\nu$ and $\nu$. We can note that, for all values of $\nu$ considered here and for values of $\hat{\sigma}_\nu$ above 0.1, the corrected critical values 41 correspond to a piecewise version of the critical values obtained with the α-TOST when $\nu$ is large. Therefore, their correction appears to be an approximation of the α-TOST, evaluated asymptotically, i.e., as $\nu \to \infty$.

Figure 6: Bioequivalence test rejection regions as a function of $\hat{\theta}$ (x-axis) and $\hat{\sigma}_\nu$ (y-axis) per method considered in Table 2 (coloured areas), showing the combinations of values of $\hat{\theta}$ and $\hat{\sigma}_\nu$ leading to an equivalence declaration in the setting of the porcine skin dataset, i.e., with $c = \log(1.25)$ and $\nu = 16$. The rejection regions of the different methods almost perfectly overlap for values of $\hat{\sigma}_\nu$ below 0.09 and differ for larger values. Regardless of $\hat{\theta}$, the TOST cannot declare bioequivalence for large values of $\hat{\sigma}_\nu$ (greater than approximately 0.12 here) and the δ-TOST for approximately $\hat{\sigma}_\nu > 0.17$, while the α-TOST and AH-test can, with the rejection region of the α-TOST embedded in the too liberal one of the AH-test. The symbol × represents the analysed data set in the acceptance/rejection regions, where $\hat{\theta} = 0.023$ and $\hat{\sigma}_\nu = 0.134$.
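The upper bound on $\hat{\sigma}_\nu$ beyond which the TOST can no longer declare equivalence can be checked numerically. The sketch below assumes the interval-inclusion form of the TOST, in which equivalence is declared when the interval with half-width $t_{1-\alpha,\nu}\,\hat{\sigma}_\nu$ around the estimate lies inside $(-c, c)$. Under that assumption the rejection region is empty as soon as $t_{1-\alpha,\nu}\,\hat{\sigma}_\nu \geq c$. The helper functions are hypothetical and compute the Student's t quantile with the standard library only (Simpson's rule plus bisection), so they are an approximation rather than a reference implementation.

```python
import math


def t_pdf(x, nu):
    """Density of Student's t distribution with nu degrees of freedom."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(x * x / nu))


def t_cdf(x, nu, n=2000):
    """CDF at x >= 0 using composite Simpson's rule on [0, x] (n is even)."""
    h = x / n
    total = t_pdf(0.0, nu) + t_pdf(x, nu)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    return 0.5 + total * h / 3


def t_quantile(p, nu):
    """Quantile for p > 0.5, found by bisection on the CDF."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, nu) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def tost_declares(theta_hat, sigma_hat, nu, c, alpha=0.05):
    """Assumed interval-inclusion form of the TOST decision rule."""
    half_width = t_quantile(1 - alpha, nu) * sigma_hat
    return theta_hat - half_width > -c and theta_hat + half_width < c


c = math.log(1.25)
nu = 16
t_crit = t_quantile(0.95, nu)
max_sigma_hat = c / t_crit  # above this value the TOST region is empty
print(round(t_crit, 4), round(max_sigma_hat, 4))
print(tost_declares(0.023, 0.134, nu, c))
```

Under these assumptions the bound is close to 0.128, in line with the approximate value quoted in the caption, and the analysed data set (with $\hat{\sigma}_\nu = 0.134$) falls outside the TOST rejection region.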

Figure 13: First row: empirical probability of declaring bioequivalence (y-axis) computed using the setting of Simulation 4, as a function of $\theta$ (x-axis) and $\sigma_\nu$ (columns), with $\nu = 45$, for the TOST (pink circles), the α-TOST (green triangles), the δ-TOST (blue squares), the SABE (red crosses) and the corrected SABE (light-green diamonds). The tight gray area stands for a 99% simulation error tolerance interval of $(4.84, 5.16)$ corresponding to $\alpha = 5\%$ and $B = 10^5$ Monte Carlo samples. Second row: the difference of the empirical probabilities between the α-TOST and the cSABE (green triangles), and between the δ-TOST and the cSABE (blue triangles). Empirically, the TOST is quite conservative while the SABE is very liberal. In terms of power, the α-TOST uniformly dominates the other two methods and the δ-TOST uniformly dominates the cSABE.

Figure 10: Heatmap representing the empirical power in % (color gradient) for the TOST, computed using the setting of Simulation 3 in Table 1, as a function of $\sigma_\nu$ (x-axis) and $\nu$ (y-axis).

Figure 11: Heatmap representing the empirical power in % (color gradient) for the δ-TOST, computed using the setting of Simulation 3 in Table 1, as a function of $\sigma_\nu$ (x-axis) and $\nu$ (y-axis).