Bayesian sample size determination using commensurate priors to leverage preexperimental data

Abstract This paper develops Bayesian sample size formulae for experiments comparing two groups, where relevant preexperimental information from multiple sources can be incorporated in a robust prior to support both the design and analysis. We use commensurate predictive priors for borrowing of information and further place Gamma mixture priors on the precisions to account for preliminary belief about the pairwise (in)commensurability between parameters that underpin the historical and new experiments. Averaged over the probability space of the new experimental data, appropriate sample sizes are found according to criteria that control certain aspects of the posterior distribution, such as the coverage probability or length of a defined density region. Our Bayesian methodology can be applied to circumstances that compare two normal means, proportions, or event times. When nuisance parameters (such as variance) in the new experiment are unknown, a prior distribution can further be specified based on preexperimental data. Exact solutions are available based on most of the criteria considered for Bayesian sample size determination, while a search procedure is described in cases for which there are no closed‐form expressions. We illustrate the application of our sample size formulae in the design of clinical trials, where pretrial information is available to be leveraged. Hypothetical data examples, motivated by a rare‐disease trial with an elicited expert prior opinion, and a comprehensive performance evaluation of the proposed methodology are presented.


Introduction
Suppose that random variables X_A and X_B have probability density functions (pdfs) of some known form, denoted by f(x_j; µ_j), j = A, B, where µ_j are the group-specific parameters, for example, the population means, of inferential interest. We further suppose that the observations are independent and identically distributed (i.i.d.) with pdfs f(x_j; µ_j) conditional on the unknown µ_j. Consider the problem of comparing µ_A and µ_B, both supposed to be one-dimensional parameters for simplicity, based on two samples. A classical statistical problem is how to choose the sample size for the experiment to allow good-quality inference. Many sample size determination (SSD) methods exist in the literature; they vary mainly in whether: (a) the inferential goal is parameter estimation or hypothesis testing, (b) additional parameters relating to the experimental design, e.g., the variance of a normal distribution, are assumed known or unknown, and (c) the analysis approach is frequentist or Bayesian.
Conventional SSD (Desu and Raghavarao, 1990) for such experiments has often been carried out to control certain aspects of the sampling distribution of a test statistic suitable for comparing two group means. This is typically considered from a frequentist perspective, in which operating characteristics, e.g., the type I error rate and power, are specified for falsely and correctly detecting a meaningful magnitude of the difference in means. For data that are assumed to be i.i.d. normal, SSD may also be a function of nuisance parameters such as unknown variances, denoted by σ_j, j = A, B. It is not uncommon for investigators to specify a fixed value for the unknown σ_j, which could deviate far from the true value.
Consequently, this may leave the SSD inaccurate, or only locally optimal, due to the dependence on arbitrary guesses. Formulating the problem in the Bayesian framework has been argued to be more advantageous, since it allows uncertainty to be described in a prior for the nuisance parameters (O'Hagan and Forster, 2004). Moreover, it brings about the possibility of incorporating pre-experimental information, if available, in a prior for the nuisance parameters and/or the parameter(s) of inferential interest. For these reasons, considerable attention has been given to Bayesian SSD; see, for example, Adcock (1997). Of particular relevance are fully Bayesian formulations of the SSD problem in exploratory trials, in which the effect size µ_∆ = µ_A − µ_B is tested based on posterior interval probabilities that mimic the frequentist error rates. These fully Bayesian approaches shed light on the option of incorporating pre-experimental data into a prior for both design and analysis consistently, without undermining the data validity and integrity of the new trial (EMA, 2002). The robust Bayesian SSD approach that we propose in this paper falls within this category.
Our research is partly motivated by the efficient design and analysis of clinical trials that evaluate a new treatment for rare diseases (EMA, 2006). It is often infeasible to require a sample size achieving the desired frequentist power, or to draw an inference solely from the scant trial data. Pre-trial information, collected from historical studies conducted under similar circumstances, or elicited from expert opinion, could play an essential role. Most existing proper Bayesian SSD approaches, such as those mentioned above, employ a conjugate prior for algebraic convenience. Under such a framework, however, it is not straightforward to leverage pre-trial information, especially if it is available from multiple sources. In this paper, we propose basing the SSD on a robust Bayesian model with commensurate priors: we specify predictive priors with respect to the sources of pre-trial information, and parameterise the commensurability between the historical and new trial parameters explicitly (Hobbs et al., 2011, 2012; Zheng and Wason, 2020).
By placing a Gamma mixture prior on each commensurate parameter, our approach can overcome the notorious problem of prior-data conflict and maintain appropriate borrowing from each relevant source. The proposed methodology is generic: it can be applied in any area where there is a need to use pre-experimental data formally through the mechanism of specifying priors. For instance, the sample size for environmental water quality evaluation is often limited, for which borrowing strength from historical water monitoring data has been considered helpful (Duan et al., 2006). Other examples include quality control in management and engineering (Kleyner et al., 1997; Stamey et al., 2006; Khorate and Shirke, 2018).
The remainder of this paper is structured as follows. In Section 2, we develop a robust Bayesian model for leveraging pre-experimental information from multiple sources, where customised borrowing by source of information is possible. Focusing on i.i.d. normal data, Bayesian SSD formulae are obtained according to various criteria. In Section 3, we illustrate the application of the derived Bayesian SSD formulae in rare-disease trials for cases of known and unknown variances, respectively. This is followed by a performance evaluation of the proposed methodology in Section 4. We conclude in Section 5 with a discussion of potential extensions of our methodology for a pragmatic yet flexible application in modern clinical trials.

Methods
Suppose that in the new experiment the difference µ_∆ = µ_A − µ_B > 0 indicates that A is superior to B. In the context of clinical trials, for example, A and B can be treatments or interventions. Let X_ij be the observation from experimental unit i assigned to group j = A, B. These observations are assumed to be i.i.d. normally distributed with mean µ_j and common variance σ_0². That is, for i = 1, ..., n_j and j = A, B,

X_ij ∼ N(µ_j, σ_0²).

We further suppose that a total of K relevant historical datasets, denoted by y_1, ..., y_K, are available, each having the same data structure as that which would accrue in the new experiment. Let θ_k denote the counterpart of µ_∆, i.e., the difference in the normal means specific to each historical experiment, k = 1, ..., K. In Section 2.1, we explain how these θ_1, ..., θ_K, specific to the historical experiments, can be linked with the parameter µ_∆ = µ_A − µ_B of inferential interest in the new experiment. After a brief review of commonly used criteria in Section 2.2, we develop the Bayesian SSD approach in Section 2.3 for the new experiment, when leveraging pre-experimental information is an option.
2.1 Borrowing of historical information from multiple sources

Zheng and Wason (2020) propose a robust Bayesian model using commensurate priors (Hobbs et al., 2011, 2012) to leverage complementary data from multiple sources for analysing a complex type of modern clinical trial. Specifically, data of complementary (sub)trials would be generated concurrently with those of a contemporary (sub)trial. Predictive priors for the contemporary parameter are specified to represent the complementary data, with a commensurate parameter introduced to determine the degree of borrowing from a specific source, i.e., a specific complementary (sub)study.
Following Zheng and Wason (2020), we stipulate the commensurability, denoted by ν_k, as the precision (i.e., inverse of variance) of a normal predictive distribution which is centred at the pre-experimental parameter θ_k, k = 1, ..., K. This leads to K commensurate predictive distributions of the form

θ̃_k | θ_k, ν_k ∼ N(θ_k, 1/ν_k),  k = 1, ..., K,

where each θ̃_k is regarded as equivalent to µ_∆ in terms of the parameter space. More precisely, this means that the parameter space for θ̃_k, as projected from a pre-experimental parameter θ_k, would be defined with the same or a comparable set of parameter values to that of µ_∆. In the original proposal, a spike-and-slab prior, which is a discrete mixture distribution, is placed on each ν_k for robust borrowing of information. For analytic tractability in exact Bayesian inference, we consider using conjugate priors instead, i.e., a two-component Gamma mixture prior, for the predictive precision:

ν_k ∼ w_k Gamma(a_01, b_01) + (1 − w_k) Gamma(a_02, b_02).  (2)

Stipulating 0 < w_k < 1 in (2) produces a compromise between the two extreme cases.
The strength of this Gamma mixture prior is then tuned by w k , which can be interpreted as the prior probability of incommensurability.
The joint pdf of θ̃_k and ν_k, given information on θ_k that reveals the average difference between groups A and B in a historical experiment k, will then be

f(θ̃_k, ν_k | θ_k) = N(θ̃_k; θ_k, 1/ν_k) [ w_k Gamma(ν_k; a_01, b_01) + (1 − w_k) Gamma(ν_k; a_02, b_02) ],

pointing to a Normal-Gamma mixture distribution. We marginalise this mixture distribution for θ̃_k by integrating out the nuisance parameter ν_k, and obtain

θ̃_k | θ_k ∼ w_k t_{2a_01}(θ_k, b_01/a_01) + (1 − w_k) t_{2a_02}(θ_k, b_02/a_02),  (4)

which is a two-component mixture of non-standardised (shifted and scaled) t distributions.
In particular, the component t distributions have their location parameters identically equal to θ_k, yet scale parameters b_01/a_01 and b_02/a_02, respectively. A detailed derivation of (4) and the demonstration that it is a non-standardised t mixture distribution are given in Section A of the Web-based Supplementary Materials. To ease the synthesis of the K predictive priors later on, we approximate this unimodal t mixture distribution by a normal distribution,

θ̃_k | θ_k ∼ N(θ_k, σ̃_k²),  where σ̃_k² = w_k b_01/(a_01 − 1) + (1 − w_k) b_02/(a_02 − 1).  (5)

This approximation is based on the first two moments of the non-standardised t mixture distribution, which are analytically available; see Section B of the Supplementary Materials for details. The variance σ̃_k² of the normal approximation takes account of the dispersion of both t mixture components. The goodness of this normal approximation to the original t mixture distribution depends on the degrees of freedom, 2a_01 and 2a_02, and the scale parameters, b_01/a_01 and b_02/a_02, which are of the investigators' choice. We show the numerical accuracy of this approximation in Section C of the Supplementary Materials. Meanwhile, we note that the motivation is rather practical: it brings considerable convenience for deriving a normal collective prior for µ_∆, which represents information on θ_1, ..., θ_K, by the convolution operator (Grinstead and Snell, 1997).
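The moment matching behind the normal approximation can be checked numerically. The short Python sketch below computes E[1/ν_k] under the Gamma mixture prior on the precision, which equals the variance of the t mixture because both components share the same location θ_k; the function name and the illustrative values (w_k = 0.15 with Gamma(2, 2) and Gamma(18, 3) components, as used later in the application) are our own choices.

```python
def approx_normal_variance(w, a1, b1, a2, b2):
    """Variance of the normal approximation to the two-component t mixture.

    Each t component with 2a degrees of freedom and squared scale b/a has
    variance (b/a) * 2a/(2a - 2) = b/(a - 1); since both components are
    centred at theta_k, the mixture variance is the weighted average.
    Requires a1 > 1 and a2 > 1 for the component variances to exist.
    """
    return w * b1 / (a1 - 1) + (1 - w) * b2 / (a2 - 1)

# Illustrative values: w_k = 0.15 with Gamma(2, 2) and Gamma(18, 3) components
var_k = approx_normal_variance(0.15, 2.0, 2.0, 18.0, 3.0)
print(var_k)  # 0.15*2/1 + 0.85*3/17 = 0.45
```

Note that the closed form requires a_01, a_02 > 1; heavier-tailed choices (shape at or below 1) leave the mixture variance undefined, in which case the normal approximation is not applicable.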
With the normal approximation given by (5), we stipulate µ_∆ as a linear combination of K ≥ 2 hypothetical random variables, θ̃_k, projected from the pre-experimental parameters.
That is, µ_∆ = Σ_k p_k θ̃_k, for k = 1, ..., K. In particular, the weights p_1, ..., p_K sum to 1, with each reflecting the relative importance of a corresponding pre-experimental dataset in constituting the collective predictive prior for µ_∆. We note these weights could be associated with the prior mixture weights 1 − w_k, which describe our prior belief in the commensurability between a pre-experimental dataset and that to be collected from the new experiment. Applying the convolution operator for the sum of normal random variables, µ_∆ has a normal prior distribution. Suppose that for k = 1, ..., K, each pre-experimental dataset leads to an estimate θ_k | y_k ∼ N(m_k, s_k²). The normal collective prior for µ_∆ thus has the form

µ_∆ | y_1, ..., y_K ∼ N( Σ_k p_k m_k, Σ_k p_k² ξ_k² ),  (6)

with p_k m_k and p_k² ξ_k², where ξ_k² = s_k² + w_k b_01/(a_01 − 1) + (1 − w_k) b_02/(a_02 − 1), being the marginal prior means and variances. It accounts for both the variability in a pre-experimental dataset y_k and the postulated level of incommensurability, w_k, through the Gamma mixture prior placed on the predictive precision, ν_k. We give more details on this derivation in Section D of the Supplementary Materials. Using Bayes' Theorem, this collective prior will be updated by the new experimental data, denoted by y_{K+1}, to a robust posterior f_p(µ_∆ | y_1, ..., y_K, y_{K+1}). Robustness is achieved through the use of a mixture of informative and diffuse priors in the construction of the model.
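A minimal sketch of the collective prior computation, assuming each marginal prior variance combines the summary variance and the predictive variance as ξ_k² = s_k² + w_k b_01/(a_01 − 1) + (1 − w_k) b_02/(a_02 − 1). The numerical inputs are the five expert-opinion summaries and weights used later in the application of Section 3; under this reading, the collective prior variance reproduces the 0.154 quoted there.

```python
def collective_prior(m, s2, w, p, a1=2.0, b1=2.0, a2=18.0, b2=3.0):
    """Mean and variance of the normal collective prior for mu_Delta.

    m, s2 : prior means m_k and variances s_k^2 of theta_k | y_k
    w     : prior probabilities of incommensurability w_k
    p     : weights p_k (summing to 1)
    Assumes xi_k^2 = s_k^2 + w_k*b1/(a1-1) + (1-w_k)*b2/(a2-1).
    """
    xi2 = [s + wk * b1 / (a1 - 1) + (1 - wk) * b2 / (a2 - 1)
           for s, wk in zip(s2, w)]
    mean = sum(pk * mk for pk, mk in zip(p, m))
    var = sum(pk ** 2 * x for pk, x in zip(p, xi2))
    return mean, var

# Values from the illustration in Section 3
m = [-0.26, -0.24, -0.37, -0.34, -0.32]
s2 = [0.25, 0.23, 0.22, 0.36, 0.26]
w = [0.15, 0.20, 0.17, 0.13, 0.20]
p = [0.23, 0.16, 0.20, 0.25, 0.16]
mean, var = collective_prior(m, s2, w, p)
print(round(var, 3))  # 0.154
```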

Criteria for the Bayesian sample size determination
Most Bayesian SSD criteria aim to control a certain property of the posterior f_p(µ_∆ | y_1, ..., y_K, y_{K+1}), wherein the new experimental data y_{K+1} are unobserved at the design stage. It is important to state at the outset that the uncertainty of sampling a set of data, y_{K+1}, from the entire probability space needs to be accounted for. In other words, there is no unique outcome or result of the new experiment. Thus, strictly, the Bayesian SSD criteria can only maintain average properties of the posterior. In what follows, we review some widely used Bayesian SSD criteria, from which we derive our formulae. Joseph and Bélisle (1997) propose specifying a density region, R(y_{K+1}), set to be bounded by r and r + ℓ_0, to contain possible parameter values. Here, ℓ_0 is the desired interval length and r is chosen so that R(y_{K+1}) is the highest posterior density (HPD) interval; so called because this interval includes the mode of the posterior distribution.
In particular, their specification ensures that the coverage probability of R(y_{K+1}) is at least 1 − α when averaged over all possible samples. Formally, it requires that

∫_Y { ∫_{R(y_{K+1})} f_p(µ_∆ | y_1, ..., y_K, y_{K+1}) dµ_∆ } f_d(y_{K+1}) dy_{K+1} ≥ 1 − α,  (7)

where Y denotes the probability space and f_d(y_{K+1}) the marginal distribution of the sample, i.e., the new experimental data. This Bayesian SSD criterion is generally applicable to both symmetric and asymmetric posterior distributions. For its property of controlling the coverage probability, it is often referred to as the average coverage criterion (ACC).
The posterior distribution in our context would be unimodal and symmetric about the posterior mean, as we can envisage from the collective prior given by (6), which leverages pre-experimental information. We would then simply stipulate the HPD interval as the interval of length ℓ_0 centred at the posterior mean,

R(y_{K+1}) = ( E[µ_∆ | y_1, ..., y_K, y_{K+1}] − ℓ_0/2,  E[µ_∆ | y_1, ..., y_K, y_{K+1}] + ℓ_0/2 ),

which coincides with the alpha-expectation tolerance region of Fraser and Guttman (1956).
As an alternative to the ACC, one may want to fix the coverage of a posterior interval at 1 − α_0 for the SSD, while limiting the interval length to at most ℓ_0. This proposal is commonly known as the average length criterion (ALC), with its first formal description given by Joseph and Bélisle (1997). Let ℓ(y_{K+1}) be the random length of the posterior credible interval, dependent on the unobserved new experimental data. Targeting a fixed coverage probability at level 1 − α_0, one may solve for ℓ(y_{K+1}) to meet

∫_r^{r + ℓ(y_{K+1})} f_p(µ_∆ | y_1, ..., y_K, y_{K+1}) dµ_∆ = 1 − α_0,

where r would be specified to give the HPD interval, as for the ACC above. Averaged over all possible samples, the ALC requires that

∫_Y ℓ(y_{K+1}) f_d(y_{K+1}) dy_{K+1} ≤ ℓ_0.  (8)

The ALC may be favoured over the ACC, since most Bayesian practitioners are keen to report the average interval length for a fixed coverage probability, say 95%, in their data analyses.
As we can see, sample sizes chosen to meet the ACC by (7) or the ALC by (8) rely on the marginal distribution of the new experimental data y_{K+1}; that is,

f_d(y_{K+1}) = ∫ f(y_{K+1} | µ_∆) π(µ_∆) dµ_∆.

When this predictive distribution of the data also depends on nuisance parameters, say the variance σ_0² being unknown, we may remove the dependence on the variance by integrating it out; then

f_d(y_{K+1}) = ∫∫ f(y_{K+1} | µ_∆, σ_0²) π(µ_∆) g(σ_0²) dσ_0² dµ_∆.

In our context, the prior distribution(s) for the unknown parameter µ_∆, as well as the possibly unknown nuisance parameter σ_0², would be specified using pre-experimental information. The predictive distribution of the data y_{K+1} would thus formally be f_d(y_{K+1} | y_1, ..., y_K), given our π(µ_∆ | y_1, ..., y_K) and g(σ_0² | y_1, ..., y_K). Both the ACC and ALC are based on defining a probability measure of the posterior distribution. We consider one additional criterion that relates to the moments of the posterior distribution. For practicality, we focus on the precision of the posterior mean estimate of µ_∆, i.e., the second central moment of the posterior probability distribution.
This would be referred to as the average posterior variance criterion (APVC) hereafter.
Given a fixed level of dispersion ε_0, a suitable sample size is chosen to ensure that

∫_Y Var(µ_∆ | y_1, ..., y_K, y_{K+1}) f_d(y_{K+1}) dy_{K+1} ≤ ε_0.

As Adcock (1997) commented, this criterion is equivalent to using the L_2-norm loss function. It is also worth noting that the literature has documented many other Bayesian approaches to SSD, e.g., based on the use of utility theory (Lindley, 1997) and Bayes factors (Weiss, 1997). The reader may be interested in working out further Bayesian SSD formulae based on our posterior distribution while applying alternative criteria.
To complete this brief review of our selection of Bayesian SSD criteria, we note that the fixed values ℓ_0, α_0 and ε_0 are all positive real numbers. How these values affect the sample size chosen for planning a new experiment will be investigated through numerical evaluation in Section 4.

Sample size required for comparing two normal means
Denote the groupwise sample sizes in the new experiment by n A and n B , respectively.
For most settings, the experimental data y_{K+1} contain two independent random vectors (X_1A, ..., X_{n_A,A}) and (X_1B, ..., X_{n_B,B}), sampled from populations that are assumed to have a common variance, denoted by σ_0².

For cases of known variance
If the common variance σ_0² is known exactly, the difference in sample means, x̄_∆ = x̄_A − x̄_B, satisfies x̄_∆ | µ_∆ ∼ N(µ_∆, σ_x²), where σ_x² = σ_0²(1/n_A + 1/n_B), and µ_∆ has a normal prior based on the pre-experimental datasets y_1, ..., y_K, as given in (6). Since the joint likelihood of the n_A + n_B measurements in the new experiment satisfies L(y_{K+1} | µ_A, µ_B) ∝ L(x̄_∆ | µ_A, µ_B), we formulate the data likelihood in terms of x̄_∆, which is regarded as a random variable. We further derive the posterior distribution as

µ_∆ | y_1, ..., y_K, y_{K+1} ∼ N(η, δ²),  (9)

where δ² = { 1/(Σ_k p_k² ξ_k²) + 1/σ_x² }⁻¹ and η = δ² { (Σ_k p_k m_k)/(Σ_k p_k² ξ_k²) + x̄_∆/σ_x² }. The marginal distribution (unconditional on µ_∆) for the difference in sample means is given by

x̄_∆ | y_1, ..., y_K ∼ N( Σ_k p_k m_k, Σ_k p_k² ξ_k² + σ_x² ).

As Joseph and Bélisle (1997) noted, the ACC and ALC give the same result when the variance is known. Hence, we illustrate using the ACC in the following.
Letting the HPD interval (r, r + ℓ_0) stretch symmetrically around the posterior mean η, the coverage can be computed as 2Φ(ℓ_0/(2δ)) − 1, where Φ(·) is the cumulative distribution function of the standard normal distribution. Since δ does not depend on the data x̄_∆, averaging over the data space leaves this coverage unchanged, and the ACC requires

n_A n_B / (n_A + n_B) ≥ σ_0² { (2 z_{α/2}/ℓ_0)² − 1/(Σ_k p_k² ξ_k²) },  (10)

where z_{α/2} is the upper (α/2)-th quantile of the standard normal distribution, i.e., Φ⁻¹(1 − α/2). Similarly, averaging over the entire data space, the APVC gives

n_A n_B / (n_A + n_B) ≥ σ_0² { 1/ε_0 − 1/(Σ_k p_k² ξ_k²) }.

In theory, when the prior variance Σ_k p_k² ξ_k² for µ_∆ | y_1, ..., y_K is so small that the right-hand side of the inequalities above becomes zero or negative, there is no need to conduct a new experiment to collect more information.
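The ACC inequality (10) translates directly into a sample-size routine. Below is a Python sketch under equal allocation n_A = n_B (so that n_A n_B/(n_A + n_B) = n_A/2); the inputs σ_0² = 0.35, Σ_k p_k² ξ_k² = 0.154 and ℓ_0 = 0.65 are the illustrative values of the application in Section 3, and reproduce the total of 41.8 quoted there.

```python
from statistics import NormalDist

def acc_total_known_var(sigma2, prior_var, ell0, alpha):
    """Exact (non-integer) total sample size n_A + n_B meeting the ACC
    inequality for known variance, under equal allocation n_A = n_B.

    sigma2    : known common variance sigma_0^2
    prior_var : collective prior variance sum_k p_k^2 xi_k^2
    ell0      : desired HPD interval length
    alpha     : 1 - alpha is the target average coverage
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    rhs = (2 * z / ell0) ** 2 - 1 / prior_var
    if rhs <= 0:
        return 0.0  # the prior alone already achieves the target
    per_group = 2 * sigma2 * rhs  # from n_A/2 >= sigma2 * rhs
    return 2 * per_group

total = acc_total_known_var(0.35, 0.154, 0.65, 0.05)
print(round(total, 1))  # 41.8
```

In practice the result would be rounded up to an integer (even, for equal allocation); the non-integer value is reported here to match the exact solutions quoted in Section 3.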

For cases of unknown variance
When the common variance σ_0² is unknown, we assume that the quantity c (Σ_k p_k² ξ_k²)/σ_0² ∼ χ²(c), where χ²(c) refers to a Chi-square distribution with c degrees of freedom (Gelman et al., 2013). This is equivalent to the prior specification

σ_0² ∼ Inv-Gamma( c/2, c Σ_k p_k² ξ_k² / 2 );

hence, the larger the value c takes, the more the unknown σ_0² concentrates around the prior variance for µ_∆ | y_1, ..., y_K. The marginal posterior for µ_∆ will then be obtained by integrating out the nuisance parameter σ_0²: that is, the posterior in (12) is proportional to the product of a normal and a non-standardised t kernel (Ahsanullah et al., 2014). Detailed steps for deriving (12) are given in Section E of the Supplementary Materials. In particular, the t density kernel (with location parameter x̄_∆ and a scale parameter determined by c, Σ_k p_k² ξ_k² and the group sizes) can be related to a normal kernel with the same location parameter and a matched variance. The posterior (12) can thus be further developed as a normal distribution,

µ_∆ | y_1, ..., y_K, y_{K+1} ∼ N(µ_N, σ_N²),  (13)

which is consistent with (9) but here with unknown σ_0². We can also find the distribution for x̄_∆ unconditional on µ_∆; see Section E of the Supplementary Materials for the derivation. Apparently, this marginal distribution for x̄_∆ relies on the prior distribution for the unknown σ_0², which may yield different solutions for n_A and n_B across the Bayesian SSD criteria considered in this paper.
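The chi-square specification maps directly onto Inv-Gamma hyperparameters: if cV/σ_0² ∼ χ²(c) with V = Σ_k p_k² ξ_k², then σ_0² ∼ Inv-Gamma(c/2, cV/2). A small check in Python, using c = 5 and V = 0.154 from the illustration in Section 3, which recover the Inv-Gamma(2.500, 0.385) prior used there; the function name is ours.

```python
def inv_gamma_hyperparams(c, V):
    """Shape and scale of the Inv-Gamma prior implied by
    c*V/sigma0^2 ~ ChiSq(c), plus the prior mean (finite only for c > 2)."""
    shape, scale = c / 2.0, c * V / 2.0
    mean = scale / (shape - 1.0) if shape > 1.0 else float("inf")
    return shape, scale, mean

shape, scale, mean = inv_gamma_hyperparams(5, 0.154)
print(shape, round(scale, 3))  # 2.5 0.385

# As c grows, the prior mean c*V/(c - 2) approaches V, i.e. sigma_0^2
# concentrates around the collective prior variance.
_, _, mean_big = inv_gamma_hyperparams(1000, 0.154)
```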
Let the interval (a, a + ℓ_0) be symmetric about µ_N given the marginal posterior for µ_∆ in (13). When applying the ACC, the sample size is found by requiring the average coverage of this interval to be at least 1 − α, where z_{α/2} denotes the upper (α/2)-th quantile of the standard normal distribution. Rewriting the expression, we obtain

∫ { 2Φ( ℓ_0/(2σ_N) ) − 1 } g(σ_0²) dσ_0² ≥ 1 − α,

where g(σ_0²) is the pdf of an Inv-Gamma(c/2, c Σ_k p_k² ξ_k²/2) distribution. The reader may compare this inequality with what was obtained for cases where σ_0² is known in (10). For computing the ALC sample size, we would have to average the random credible interval length ℓ(x̄_∆) = 2 z_{α_0/2} σ_N over the marginal distribution for x̄_∆, which varies with σ_0². According to the definition of the ALC, the resulting requirement does not have a closed-form solution. This requires a search over the integers for n_A and n_B to find the smallest sum satisfying the inequality. With the use of the APVC, we would likewise remove the dependence on σ_0² by integration; the formula thus becomes

∫ σ_N² g(σ_0²) dσ_0² ≤ ε_0.

Finally, we note that to compute an exact sample size for the new experiment, the allocation ratio n_A : n_B has to be specified; for example, an equal allocation sets n_A : n_B = 1 : 1.
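When no closed form exists, as for the ALC here, the integer search can be organised as below. This Python sketch is ours and makes a simplifying plug-in assumption, not the paper's exact formula: for each draw of σ_0² from its Inv-Gamma prior, the credible-interval length is computed from the known-variance posterior variance, then averaged by Monte Carlo; `avg_alc_length`, `smallest_per_group_n` and all numeric defaults are illustrative.

```python
import math
import random
from statistics import NormalDist

def avg_alc_length(m, prior_var=0.154, c=5, alpha0=0.05,
                   n_draws=20000, seed=1):
    """Monte Carlo estimate of the average credible-interval length for
    per-group size m (n_A = n_B = m), plugging each Inv-Gamma draw of
    sigma_0^2 into the known-variance posterior variance (a simplification)."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha0 / 2)
    shape, scale = c / 2.0, c * prior_var / 2.0
    total = 0.0
    for _ in range(n_draws):
        sigma2 = scale / rng.gammavariate(shape, 1.0)  # Inv-Gamma(shape, scale)
        post_var = 1.0 / (1.0 / prior_var + m / (2.0 * sigma2))
        total += 2.0 * z * math.sqrt(post_var)
    return total / n_draws

def smallest_per_group_n(ell0=0.65, n_max=1000):
    """Smallest integer per-group size with average length at most ell0."""
    for m in range(1, n_max + 1):
        if avg_alc_length(m) <= ell0:
            return m
    raise ValueError("no m <= n_max satisfies the ALC constraint")

m_star = smallest_per_group_n()
```

Because the same seed is reused for every candidate m, the estimated average length decreases strictly in m, so the returned value is the exact change point of this Monte Carlo approximation; the same skeleton applies to any criterion that is monotone in the sample size.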

Application
In this section, we illustrate the application of our Bayesian SSD formulae to rare-disease trials, for which it is generally infeasible to enrol a large sample size based on frequentist SSD formulae. Such rare-disease trials are unlikely to be conducted without any preceding investigation. Pre-trial information, collected from relevant studies (for example, historical clinical trials evaluating the same treatment in a similar patient population) or elicited from expert opinion, would often be available. Our proposed Bayesian SSD methodology provides a framework to formally utilise this prior information for planning a new trial. In line with the original assumptions of the MYPAN trial, we suppose the log-odds ratio of treatment benefit,

log[ ρ_Ak/(1 − ρ_Ak) ] − log[ ρ_Bk/(1 − ρ_Bk) ],

can be adequately modelled by a normal distribution. Here, ρ_jk denotes the probability of remission for a patient receiving treatment j = A, B, respectively. Furthermore, we assume some expert opinion had been expressed, summarised in the form of θ_k | y_k ∼ N(m_k, s_k²). Eliciting such expert opinion is a non-trivial problem; we refer the reader to the statistical literature on elicitation of prior distributions in Bayesian inference (Dias et al., 2017). For illustration, we assume there were five sets of expert opinion, summarised as N(−0.26, 0.25), N(−0.24, 0.23), N(−0.37, 0.22), N(−0.34, 0.36) and N(−0.32, 0.26). Opinion would also be sought on the experts' scepticism about the predictability of each pre-trial parameter θ_k for the parameter µ_∆, measured on the continuous scale of 0 to 1, to specify the prior probabilities of incommensurability, w_k, k = 1, ..., 5.
In this example, we suppose such pre-trial information may be valued about equally, so we stipulate w_1 = 0.15, w_2 = 0.20, w_3 = 0.17, w_4 = 0.13, w_5 = 0.20 for robust borrowing of information. Pragmatically, the trial statistician could look into the levels of pairwise commensurability between the N(m_k, s_k²) distributions based on a distributional discrepancy, such as the Hellinger distance (Dey and Birmiwal, 1994), to reconcile the choices of value for w_k.
To reach a collective prior for µ_∆ | y_1, ..., y_5, weights p_1, ..., p_5 need to be specified to reflect the relative commensurability of each set of pre-trial information with the data to be accrued in the new trial. Both w_k and p_k might be determined by some distance measure between the parameters θ_k and µ_∆ (Zheng and Wason, 2020), the rationale being that we wish to discount pre-experimental information to a larger extent when it is less similar to the new experimental data. For implementing the proposed Bayesian SSD approach in this paper, we compute the weights as

p_k = exp(−w_k²/s_0) / Σ_{j=1}^{K} exp(−w_j²/s_0),

where s_0 in this decreasing function of w_k determines how concentrated the weights p_1, ..., p_K would be around the average, 1/K. With a value of s_0 ≫ w_k, nearly all p_k converge to 1/K irrespective of the values of w_k; whereas, with s_0 → 0+, the smallest w_k would have p_k → 1, meaning that the corresponding θ_k | y_k tends to dominate the collective prior. For illustration, we let s_0 = 0.05. Thus, with the w_k specified above, we obtain p_1 = 0.23, p_2 = 0.16, p_3 = 0.20, p_4 = 0.25, p_5 = 0.16.
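The weighting scheme just described can be sketched in a few lines; the normalised-exponential form p_k ∝ exp(−w_k²/s_0) is written out as we read it, and with s_0 = 0.05 it reproduces the weights 0.23, 0.16, 0.20, 0.25, 0.16 quoted above.

```python
import math

def commensurability_weights(w, s0):
    """Weights p_k = exp(-w_k^2/s0) / sum_j exp(-w_j^2/s0).

    Large s0 pushes every p_k towards 1/K; s0 -> 0+ lets the source with
    the smallest w_k dominate the collective prior.
    """
    u = [math.exp(-wk ** 2 / s0) for wk in w]
    total = sum(u)
    return [uk / total for uk in u]

p = commensurability_weights([0.15, 0.20, 0.17, 0.13, 0.20], s0=0.05)
print([round(pk, 2) for pk in p])  # [0.23, 0.16, 0.2, 0.25, 0.16]
```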
We let the prior predictive precision ν_k ∼ w_k Gamma(2, 2) + (1 − w_k) Gamma(18, 3), where the means together with the 95% credible intervals of the component Gamma distributions are 1.000 (0.121, 2.786) and 6.000 (3.556, 9.073), respectively. This mixture prior for ν_k is specified to fully account for the two extreme cases of either no borrowing or strong borrowing of pre-experimental information, which correspond to setting w_k = 1 and w_k = 0, respectively.
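The component means and credible intervals quoted above can be checked by simulation. A Monte Carlo sketch using only the standard library (the helper name and tolerances are ours); note the rate parameterisation, so the means are shape/rate.

```python
import random

def gamma_summary(shape, rate, n=200000, seed=7):
    """Monte Carlo mean and central 95% interval of a Gamma(shape, rate)
    distribution (rate parameterisation, mean = shape/rate)."""
    rng = random.Random(seed)
    draws = sorted(rng.gammavariate(shape, 1.0 / rate) for _ in range(n))
    mean = sum(draws) / n
    return mean, draws[int(0.025 * n)], draws[int(0.975 * n)]

mean1, lo1, hi1 = gamma_summary(2, 2)   # approx. 1.000 (0.121, 2.786)
mean2, lo2, hi2 = gamma_summary(18, 3)  # approx. 6.000 (3.556, 9.073)
```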
We assume a known variance of σ_0² = 0.35 in the new trial under planning and suppose n_A = n_B. The total sample sizes (i.e., n_A + n_B) found based on the ACC and ALC criteria are then both 41.8 for a 95% posterior coverage probability and a credible interval length of 0.65 on average. For cases of unknown σ_0², we let σ_0² ∼ Inv-Gamma(2.500, 0.385) (i.e., setting c = 5). The ACC and ALC sample sizes become 30.7 and 24, respectively, for attaining the same posterior behaviour. Targeting ε_0 = 0.03, the APVC sample sizes are 32.2 and 27.6 for known and unknown σ_0² (where we set c = 5), respectively. Here, we have reported sample sizes by rounding an exact solution to one decimal place, or to the smallest integer when found by a search procedure.
In this illustrative example, sample sizes derived for cases of known σ_0² = 0.35 appear larger than those yielded by the same criterion for unknown σ_0². This is because, by setting c = 5, we in fact additionally permit a certain amount of borrowing to inform the variance in the new study; more specifically, σ_0² would be thought of as similar, to some extent, to the prior variance of µ_∆ | y_1, ..., y_5, namely 0.154, which is smaller than the fixed σ_0² = 0.35 in our illustration. We will examine the sensitivity of our formulae to the choice of c and provide a comprehensive interpretation in Section 4.

Performance evaluation
In this section, we provide a performance evaluation of the proposed formulae, with sample sizes computed and visualised under various configurations of parameters.

Basic settings
Motivated by the MYPAN trial, we generate four base scenarios of historical data, which are configured with different levels of pairwise (in)commensurability and informativeness.
Such pre-experimental information from K sources is supposed to have been summarised by N(m_k, s_k²) priors, as listed in Table 1. For each configuration of hypothetical historical data, two distinct sets of robust weights, I and II, are considered to implement the proposed approach for borrowing of information. These robust weights are chosen for illustrative purposes to (a) reflect high and low levels of prior confidence in the historical data when they are consistent between themselves, or (b) designate certain sources of historical data as more influential. Table 1 lists the four base scenarios of historical data along with their assigned robust weights. We compute the squared Hellinger distance between any two N(m_k, s_k²) distributions to describe their pairwise (in)commensurability; these are visualised in Figure S2 of the Web-based Supplementary Materials. Prior probabilities of incommensurability, w_k, could be chosen at similar levels to such distances for leveraging each N(m_k, s_k²) in the new experiment. The values of w_k in Table 1 for our numerical study are justified as no greater than 0.500, as the largest squared Hellinger distance visualised in Figure S2 is below 0.500. The weights, p_k, for prioritising certain prior information are determined following our stipulation in Section 3.
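For the squared Hellinger distance between two normal summaries, a closed form is available; a minimal sketch (inputs are variances, not standard deviations, and the function name is ours):

```python
import math

def squared_hellinger(m1, v1, m2, v2):
    """Squared Hellinger distance between N(m1, v1) and N(m2, v2),
    with v1, v2 the variances:
    H^2 = 1 - sqrt(2*s1*s2/(s1^2 + s2^2)) * exp(-(m1 - m2)^2 / (4*(s1^2 + s2^2))).
    """
    s1s2 = math.sqrt(v1 * v2)
    return 1.0 - math.sqrt(2.0 * s1s2 / (v1 + v2)) * \
        math.exp(-((m1 - m2) ** 2) / (4.0 * (v1 + v2)))

print(squared_hellinger(0.0, 1.0, 0.0, 1.0))  # 0.0 for identical normals
print(round(squared_hellinger(0.0, 1.0, 1.0, 1.0), 4))  # 0.1175
```

Applying this to each pair of N(m_k, s_k²) summaries in Table 1 gives the pairwise discrepancies visualised in Figure S2.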
In Table 1, the first two configurations of historical data represent situations of consistent pre-experimental information from different sources. Robust weights I and II for these configurations consistently down-weight all historical data to a small and a larger extent, respectively. Configurations 3 and 4 represent situations of divergent historical data; each has its own robust weights I and II to downplay some informative historical data to a small and a larger extent, respectively. The collective priors are derived with the Gamma mixture prior ν_k ∼ w_k Gamma(2, 2) + (1 − w_k) Gamma(18, 3). We note that the choice of the component Gamma distributions of the mixture prior can be essential, as it is highly impactful on the prior effective sample size. Thus, in this numerical evaluation, we also examine how the Bayesian SSD for the new trial would change given various Gamma mixture priors.
We compare the sample sizes computed using the proposed Bayesian SSD formulae with those computed (a) without robustification, i.e., setting each w_k = 0 for k = 1, ..., 5, (b) without leveraging historical information for µ_∆, i.e., setting each w_k = 1, (c) from the proper Bayesian SSD approach driven by a single prior, here specified as the most informative N(m_k, s_k²), for example, N(−0.37, 0.22) for configuration 1, and (d) from an optimal approach as the benchmark. Specifically, the optimal approach is coupled with a perfectly commensurate prior, equating σ_0² to the collective prior variance Σ_k p_k² ξ_k². In this way, the corresponding result serves as a benchmark for the scenario of perfect consistency between the collective prior and the new data, so that the largest saving in sample size could be attained by using the proposed methodology. For cases of unknown σ_0², the optimal sample sizes can be approached by setting c to a sufficiently large value.
Table 1: Configurations of hypothetical historical data, each accompanied by two sets of weights for robust borrowing of information. Pre-experimental information about θ_k | y_k is assumed to have been summarised by a N(m_k, s_k²) prior for k = 1, ..., 5.

Results
Figure 1 visualises a subset of the results, comparing the proposed Bayesian SSD formulae using robust weights I and II with the alternative approaches for cases of known and unknown σ_0², respectively. Here, we assume σ_0² = 0.35 and, if unknown, σ_0² ∼ Inv-Gamma(1.5, 1.5 × Σ_k p_k² ξ_k²) for illustration. We fix the posterior credible interval length ℓ_0 = 0.65 to find the ACC sample sizes, so that the average coverage probability is 95%, that is, targeting α = 0.05 in (10). Likewise, for computing the ALC sample sizes, we fix α_0 = 0.05 and constrain the average length of the posterior credible interval below 0.65. When applying the APVC, sample sizes are found with the average posterior variance retained at the level ε_0 = 0.03.

Figure 1: Comparison of the Bayesian SSD approaches in terms of the sample size obtained according to the ACC, ALC and APVC criteria for cases of (i) known σ_0² = 0.35 and (ii) unknown σ_0². Sample sizes in subfigure (ii) for unknown σ_0² are computed setting c = 3, i.e., assuming that σ_0² ∼ Inv-Gamma(1.5, 1.5 × Σ_k p_k² ξ_k²), for fairly limited use of pre-experimental information to inform the variance σ_0².
Across configurations 1–4, the sample sizes computed according to the same criterion are smaller when using robust weights I than when using robust weights II. This is because, under our setting, the collective prior produced by robust weights I has a smaller variance than its counterpart under robust weights II, for each configuration.
Moreover, sample sizes obtained using either robust weights I or II are always bounded by those using no robustification (w_k = 0) and no borrowing (w_k = 1). One may regard no robustification as giving the least conservative result from the proposed SSD formulae, since the given historical information is fully used. These results, however, are not necessarily identical to those in the optimal situations, where σ_0² is equated to the collective prior variance, or largely determined by the latter when unknown. In Figure 1, we omit the benchmark optimal sample sizes that may be obtained by using the proposed formulae with robust weights I and II for each configuration. We will, however, comment on the maximal saving the proposed SSD approach can achieve when discussing the figures that follow.
The height difference across bars of sample sizes, computed using our approach with robust weights I or II versus no borrowing (w_k = 1), quantifies the benefit from leveraging pre-experimental information for µ_∆. Looking across subfigures (i) and (ii), these height differences are far greater in the unknown-variance case than in the known-variance case. The comparison of SSD approaches with borrowing versus no borrowing, as visualised in subfigure (ii) of Figure 1, is therefore the more objective illustration of this benefit. As mentioned, choosing c = 3 means σ_0² is related to ∑ p_k² ξ_k² only to a very limited extent, as if a diffuse prior had been placed on σ_0². Thereby, implementing no borrowing by setting w_k = 1, pre-experimental information is leveraged neither through the robust prior for µ_∆ nor through the prior for the unknown σ_0². Consequently, larger sample sizes are required under no borrowing for the unknown σ_0² than for the known case assuming σ_0² = 0.35, to retain similar properties of the posterior distribution. Focusing on the bars for robust weights I and II against no borrowing within subfigure (ii), the saving in the ACC, ALC and APVC sample sizes can be as much as two-thirds for configurations 1 and 2. This saving is attenuated in configurations 3 and 4, where the historical information is divergent. In configuration 3, the ACC (ALC) sample size obtained from the no-borrowing approach is about twice that from the proposed approach with robust weights I, specifically, 232.2 versus 116.8 (136 versus 65), respectively. We observe a small increase in sample size when using robust weights II instead of I, because slightly higher prior probabilities of incommensurability are allocated to certain informative N(m_k, s_k²) priors for greater down-weighting. The trend is similar for configuration 4.
We then compare the proposed approach with an alternative strategy that restricts the use of pre-experimental information to a single source. When the historical data are consistent (divergent) between themselves, the proposed SSD formulae lead to smaller (larger) sample sizes, as is evident for configuration 1 (configurations 3 and 4) in both the known and unknown σ_0² cases. As one may expect, such selection of a single source can be less robust than averaging over all available pre-experimental information.
Another noteworthy finding concerns the comparison of the ACC and ALC sample sizes, particularly when σ_0² is unknown and a very weakly informative prior is placed on it (setting c = 3). As shown in Figure 1, the ALC sample size is universally smaller than the ACC sample size across all investigated configurations.
We move on to quantify how the sample sizes vary as c changes. Focusing on approaches using pre-experimental information from multiple sources, Figure 2 displays the sample sizes exclusively for cases of unknown σ_0². We set c = 3, 5, 10, 20, 30, 40 and keep the target level of each SSD criterion unchanged from what was used for Figure 1. As c gets larger, the sample sizes for all approaches investigated here decrease and tend to stabilise at their own lowest attainable levels. This can be explained from the perspective of the prior effective sample size (Neuenschwander et al., 2020), for which the variance is a key determining factor. Consider the prior placed on the inverse of the unknown variance, 1/σ_0² ∼ Gamma(c/2, (c/2) ∑ p_k² ξ_k²), of which the mean and variance are 1/∑ p_k² ξ_k² and 2/(c (∑ p_k² ξ_k²)²), respectively; the prior variance is thus large for small values of c, e.g., when c = 3, 5. We note that the so-called 'no borrowing' (setting w_k = 1) should be clarified as no borrowing in terms of the parameter µ_∆. When c gets larger, the unknown variance σ_0² is more closely tied to the prior variance based on the historical data. That is, borrowing is enabled through the variance, although not directly through the parameter of inferential interest. By fixing w_k = 1, historical data are not leveraged through the robust prior for µ_∆, but nevertheless can be used to inform the unknown σ_0², particularly when c is sufficiently large.

Figure 3 illustrates how the sample size varies, for cases of unknown σ_0², when targeting the average coverage probability, posterior credible interval length and posterior variance at different levels. As in Figure 1, these results are obtained by setting c = 3 for very limited use of pre-experimental information to inform σ_0². The optimal sample sizes are also plotted to show the maximal saving the proposed SSD formulae may achieve. Specifically, Optim I and II should be taken as the benchmarks for the formulae using robust weights I and II, respectively. As expected, sample sizes by robust weights I and II are always bounded by the
extremes of no robustification (all w_k = 0) and no borrowing (all w_k = 1). Given a fixed length ℓ_0 = 0.65 of the HPD interval, a larger ACC sample size is required if the desired average coverage probability, 1 − α, is increased. For example, the ACC sample size computed using robust weights I (II) rises from 78.7 to 156.5 (104.4 to 204.2) for configuration 3 when the level of 1 − α is lifted from 90% to 97.5%. The ALC sample sizes displayed in subfigure (ii) ensure a coverage probability of 95%; by relaxing the target average HPD interval length, smaller sample sizes are needed. Likewise, the APVC sample sizes in subfigure (iii) share this commonality of decreasing as the target posterior variance is relaxed. Generating such plots can be helpful in practice for balancing, on a case-by-case basis, an economical sample size against a posterior sufficiently informative for inference. For example, targeting an average HPD interval length of 0.60 with 95% coverage probability requires an ALC sample size of 28 for configuration 1 using robust weights I, which is not much different from the 23 yielded by the level 0.65.

Finally, we examine how sensitive the proposed Bayesian SSD formulae are to the Gamma mixture components. Since a suitable yet least informative Gamma(a_01, b_01) has been chosen for down-weighting, the other component of the mixture prior, Gamma(a_02, b_02), determines the maximum borrowing possible. Assuming unknown σ_0² and setting c = 3, Figure 4 shows the Bayesian SSD under different choices of the hyperparameters a_02 and b_02 for each criterion. As expected, a more informative Gamma(a_02, b_02) yields a smaller sample size given the same set of w_k, k = 1, …, K. The ALC sample sizes appear to decrease the least, compared with the ACC and APVC, in this sensitivity evaluation.
We also observe that the reduction in Bayesian sample sizes is not proportional to the increasing informativeness of Gamma(a_02, b_02): setting the informative component as Gamma(18, 3) makes little difference compared with Gamma(54, 3) in our illustrative examples.
For practical implementation, we recommend the component Gamma distributions be chosen to represent the two extremes of very limited borrowing and complete pooling of information, when given a full prior mixture weight of w_k = 1 and w_k = 0, respectively.

Discussion
Planning a new experiment with a sufficient sample size necessitates the use of relevant pre-experimental information. Choosing sensible values of w_k, k = 1, …, K, is crucial for practical implementation.
Following Zheng and Wason (2020), we recommend these be linked with measures of pairwise distributional discrepancy. In our illustration, the squared Hellinger distance between any two pre-experimental parameters, θ_k | y_k, was computed to inform the choices of w_k. The underlying logic is that the new experiment, at the planning stage, may be regarded as compatible with the historical experiments, and so may their data. The levels of pairwise (in)commensurability between a pre-experimental parameter and the new experimental parameter would thus be comparable to those between the pre-experimental parameters themselves. Nevertheless, we recognise that these prior mixture weights w_k cannot be correctly specified when the new experimental data are yet to be generated. From a pragmatic perspective, the new experiment could embed one or multiple interim analyses to enable mid-course modifications of w_k. Each update of w_k tends to better reflect the genuine incommensurability (Zheng and Hampson, 2020). This area deserves further investigation; similar topics concerning sample size re-estimation have been discussed by, among others, Cui et al. (1999), Wang (2007), Brutti et al. (2009) and Brakenhoff et al. (2019).
While this paper has focused on estimating the difference between two normal means, our proposed sample size formulae are straightforward to derive for a single normal mean.
For planning a new experiment that generates data on binomially distributed outcomes, we used a logit transformation to treat the log-odds ratio as a continuous variable that can be adequately modelled by a normal distribution. It would be interesting to extend the proposed methodology to binomial proportions in a more direct manner (Joseph et al., 1995; M'Lan et al., 2008; Joseph and Bélisle, 2019), as well as to other sampling distributions. We also consider developing sample size formulae in a generalised linear regression modelling framework, where potential confounding can be adjusted for.
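As a minimal sketch of this logit-based working model (in Python; the counts below are hypothetical and only loosely echo remission rates near 71% and 74%, not actual trial data), the log-odds ratio and its usual delta-method (Woolf) variance can be computed as:

```python
import math

def log_odds_ratio_normal(events_a, n_a, events_b, n_b):
    """Approximate N(mean, var) summary for the log-odds ratio of two
    binomial samples, using the standard delta-method (Woolf) variance."""
    a, b = events_a, n_a - events_a   # events / non-events in group A
    c, d = events_b, n_b - events_b   # events / non-events in group B
    mean = math.log((a * d) / (b * c))
    var = 1 / a + 1 / b + 1 / c + 1 / d
    return mean, var

# hypothetical counts for illustration only
m, v = log_odds_ratio_normal(71, 100, 74, 100)  # roughly N(-0.151, 0.101)
```

A summary of this form could then serve as the N(m_k, s_k²) prior opinion for one pre-experimental source.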
When illustrating the application, we supposed that pre-experimental information had been made available with regard to the parameter of inferential interest. In practical implementation, situations may be more complex. For instance, historical data may have been collected from experiments with very different design considerations (Zhang et al., 2019), or recorded on a different measurement scale (Zheng et al., 2020) from that of the new experiment under planning. This is an area towards which our future research will look. As a separate note, we have applied quite general criteria, such as the ACC and ALC, to control the average coverage probability or length of the HPD interval of the posterior distribution for the parameter of inferential interest throughout. We are aware of occasions where error rates need to be controlled, e.g., in most clinical trials for drug development undertaken in patients with non-rare diseases, as required by the regulatory agencies (EMA, 1998). We note that our sample size formulae according to the ACC can easily be extended to give a solution analogous to frequentist hypothesis testing: rejection of the null hypothesis would be defined based on posterior interval probabilities with respect to a certain magnitude of effect size (Whitehead et al., 2008).
which demonstrates that the predictive prior distribution for each θ̃_k, given a historical parameter θ_k, is a mixture of non-standardised t distributions. In particular, the location parameters are identical to θ_k, while the scale parameters are √(b_01/a_01) and √(b_02/a_02), respectively. In other words, the two mixture components correspond to (θ̃_k − θ_k)/√(b_01/a_01) and (θ̃_k − θ_k)/√(b_02/a_02) following classic Student's t distributions with 2a_01 and 2a_02 degrees of freedom, respectively.
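This marginalisation can be checked numerically. The sketch below (hyperparameter values are illustrative) draws θ̃_k | ν_k ∼ N(θ_k, 1/ν_k) with ν_k ∼ Gamma(a, rate b), and compares the resulting marginal against the non-standardised t with 2a degrees of freedom and scale √(b/a):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def predictive_prior_sample(theta_k, a, b, size):
    """Draw from the predictive prior by marginalising a Gamma(a, rate=b)
    precision: theta_tilde | nu ~ N(theta_k, 1/nu), nu ~ Gamma(a, b)."""
    nu = rng.gamma(shape=a, scale=1 / b, size=size)
    return rng.normal(loc=theta_k, scale=1 / np.sqrt(nu))

# the marginal should be a non-standardised t with 2a df and scale sqrt(b/a)
a, b, theta_k = 18, 3, 0.0
draws = predictive_prior_sample(theta_k, a, b, 200_000)
t_marginal = stats.t(df=2 * a, loc=theta_k, scale=np.sqrt(b / a))
# both variances equal b/(a - 1), matching Appendix B
print(draws.var(), t_marginal.var(), b / (a - 1))
```

The empirical variance of the draws agrees with the analytic value b/(a − 1) of the scaled t distribution.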

B. MOMENTS AND OVERALL VARIANCE OF THE t MIXTURE DISTRIBUTION
For a finite mixture distribution denoted by f(x) = ∑_i w_i h_i(x), its r-th moment about zero exists and satisfies µ^(r) = ∑_i w_i µ_i^(r), where µ_i^(r) is the r-th moment of the mixture component h_i(x). In our context, the constant prior mixture weights are w_1 = w_k and w_2 = 1 − w_k, while the components h_i(θ̃_k | θ_k), i = 1, 2, are the non-standardised Student's t distributions, both centred at θ_k, so that the first moment µ^(1) = θ_k. We further constrain a_01, a_02 > 1 so that the variances of h_i(θ̃_k | θ_k), denoted by σ_i², exist, as b_01/(a_01 − 1) and b_02/(a_02 − 1), respectively. Following Equation (S4), and since the components share a common mean, the overall variance of the mixture distribution can be computed as w_k b_01/(a_01 − 1) + (1 − w_k) b_02/(a_02 − 1).
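A small helper makes this overall variance concrete (a sketch following Equation (S4) and the constraint a_01, a_02 > 1; the example uses the Gamma(2, 2) and Gamma(18, 3) components of Appendix C):

```python
def t_mixture_variance(w_k, a01, b01, a02, b02):
    """Overall variance of the two-component t mixture; both components
    are centred at theta_k, so the cross-mean terms vanish."""
    assert a01 > 1 and a02 > 1, "component variances exist only for a > 1"
    return w_k * b01 / (a01 - 1) + (1 - w_k) * b02 / (a02 - 1)

# components Gamma(2, 2) and Gamma(18, 3), prior mixture weight 0.25
var_mix = t_mixture_variance(0.25, 2, 2, 18, 3)  # 0.25*2 + 0.75*(3/17)
```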

C. ON THE GOODNESS OF THE NORMAL APPROXIMATION
As illustrated, θ̃_k follows a non-standardised t mixture distribution whose two components are both centred at θ_k. We approximate it by a normal distribution, N(θ_k, w_k b_01/(a_01 − 1) + (1 − w_k) b_02/(a_02 − 1)), of which the variance is set equal to the overall variance of the unimodal t mixture prior.
This approximation is considered appropriate because of the comparable coverage probabilities of a given credible interval, defined as (θ_k − r, θ_k + r). Precisely, under the unimodal t mixture distribution, the coverage probability P(θ_k − r < θ̃_k < θ_k + r) is the weighted sum of the corresponding component t probabilities, as given in (S5). The coverage probability of (θ_k − r, θ_k + r) under the normal approximation has a closed-form expression, and we further let it be of size 1 − α. We may then solve for r = z_{α/2} √(w_k b_01/(a_01 − 1) + (1 − w_k) b_02/(a_02 − 1)), where z_{α/2} denotes the upper (α/2)th quantile of the standard normal distribution, i.e., Φ⁻¹(1 − α/2). Substituting this into (S5), we can derive the exact coverage probability of the interval under the non-standardised t mixture distribution, which is approximately 1 − α. For illustration, we specify ν_k ∼ w_k Gamma(2, 2) + (1 − w_k) Gamma(18, 3). Thus, θ̃_k − θ_k follows a non-standardised t mixture distribution centred at 0, which may be approximated by N(0, w_k b_01/(a_01 − 1) + (1 − w_k) b_02/(a_02 − 1)). Figure S1 displays some example density curves for θ̃_k − θ_k, with the component t distributions visualised in each subfigure. In subfigures (b)–(d), we overlay the density curve of the normal approximation, given prior mixture weights w_k = 0.25, 0.09, 0.01. As we observe, a smaller value of w_k pulls the approximating normal density curve further towards the more informative component t distribution.
Based on the normal approximation and setting w_k = 0.25, 0.09, 0.01, a 95% credible interval symmetric about θ_k is yielded by r = 1.559, 1.144, 0.865, respectively. Substituting these values of r into (S5), we compute the exact coverage probabilities based on the original t mixture distribution as 95.1%, 96.4% and 95.5%, all close to 95% as expected. Finally, we note that the component distribution Gamma(2, 2) is quite extreme; with larger a_01 and b_01, the approximation tends to be even better.
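The numbers above can be reproduced with a short script (a sketch of the Appendix C check, using the same Gamma(2, 2) and Gamma(18, 3) components; each component t has 2a degrees of freedom and scale √(b/a)):

```python
import numpy as np
from scipy import stats

def approx_interval_coverage(w_k, a01, b01, a02, b02, alpha=0.05):
    """Half-width r from the normal approximation, and the exact coverage
    of (theta_k - r, theta_k + r) under the original t mixture."""
    var_mix = w_k * b01 / (a01 - 1) + (1 - w_k) * b02 / (a02 - 1)
    r = stats.norm.ppf(1 - alpha / 2) * np.sqrt(var_mix)
    cover = 0.0
    for w, a, b in ((w_k, a01, b01), (1 - w_k, a02, b02)):
        scale = np.sqrt(b / a)  # component t scale, with df = 2a
        cover += w * (2 * stats.t.cdf(r / scale, df=2 * a) - 1)
    return r, cover

r, cover = approx_interval_coverage(0.25, 2, 2, 18, 3)  # r ~ 1.559, cover ~ 95.1%
```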

D. PREDICTIVE PRIOR THAT LEVERAGES MULTI-SOURCE PRE-EXPERIMENTAL DATA
We start from the normal approximation to the non-standardised t mixture distribution, summarising each source as θ̃_k | y_k ∼ N(λ_k, ξ_k²) for k = 1, …, K.

E. POSTERIOR FOR THE DIFFERENCE IN MEANS WHEN THE VARIANCE IS UNKNOWN
We suppose that the unknown mean µ_∆ has a normal prior distribution specified based on the K pre-experimental sources. The last step in (S6) follows, since the integrand is the kernel of the density for an Inv-Chi-square distribution with c degrees of freedom, up to a multiplicative term not involving σ_0². We then divide both terms in the square brackets by c ∑ p_k² ξ_k² and obtain an expression featuring the product of kernels of normal and non-standardised t distributions, the latter with c degrees of freedom, shifted by x̄_∆ and scaled by √((1/n_A + 1/n_B) ∑ p_k² ξ_k²). Therefore, the marginal posterior relies on the distribution of x̄_∆.
We can easily work out that x̄_∆ | µ_∆ has a non-standardised t distribution with c degrees of freedom, located at µ_∆ and scaled by √((1/n_A + 1/n_B) ∑ p_k² ξ_k²). Likewise, we relate it to the normal kernel conditional on the unknown σ_0², so that x̄_∆ | µ_∆, σ_0² ∼ N(µ_∆, (1/n_A + 1/n_B) σ_0²). Placing the normal prior distribution for µ_∆, specified based on pre-experimental data, we obtain the distribution for x̄_∆ unconditional on µ_∆ as x̄_∆ | σ_0², y_1, …, y_K ∼ N(∑ p_k λ_k, (1/n_A + 1/n_B) σ_0² + ∑ p_k² ξ_k²), which depends on the distribution for σ_0².

We compute the squared Hellinger distance [1] for all pairs of the individually specified prior opinions on the log-odds ratio, θ_k | y_k, each represented by a N(m_k, s_k²) distribution. The squared Hellinger distance between two normal distributions, say, N(m_k, s_k²) and N(m_k', s_k'²), has the closed-form expression d_H² = 1 − √(2 s_k s_k' / (s_k² + s_k'²)) exp(−(m_k − m_k')² / (4 (s_k² + s_k'²))).

F. PAIRWISE COMMENSURABILITY OF THE HYPOTHETICAL HISTORICAL DATA
Figure S2 visualises the pairwise distances for the hypothetical pre-experimental data. These can be useful for determining the prior probabilities of incommensurability, w_k, k = 1, …, K, when applying the proposed Bayesian sample size formulae.
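For computing such pairwise distances, a minimal sketch of the closed-form squared Hellinger distance between two normal priors follows (the example means and variances are hypothetical, not the Table 1 configurations):

```python
import numpy as np

def squared_hellinger_normal(m1, s1, m2, s2):
    """Closed-form squared Hellinger distance between N(m1, s1^2) and
    N(m2, s2^2); 0 for identical distributions, approaching 1 as they
    become disjoint."""
    ssq = s1 ** 2 + s2 ** 2
    return 1 - np.sqrt(2 * s1 * s2 / ssq) * np.exp(-((m1 - m2) ** 2) / (4 * ssq))

# hypothetical expert priors on the log-odds ratio
d2 = squared_hellinger_normal(-0.37, np.sqrt(0.22), 0.10, np.sqrt(0.30))
```

The distance is symmetric in its two arguments, so only the upper triangle of the pairwise matrix need be computed.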
where w_k denotes the prior mixture weight, on the scale [0, 1], representing preliminary scepticism about how commensurate θ_k and µ_∆ would be. The hyperparameters are chosen so that the first mixture component, with a_01, b_01, has its density concentrated on small values of ν_k, while the second, with a_02, b_02, has density covering larger values of ν_k. A large prior mixture weight allocated to the first or second component distribution thus results in sufficient down-weighting (with no borrowing at all as one extreme) or strong borrowing of historical information (with full pooling as the other extreme), respectively.

Figure 2: The ACC, ALC and APVC sample sizes for the new trial, where the unknown σ_0² can be related to the collective prior variance by assuming the quantity c ∑ p_k² ξ_k² / σ_0² ∼ χ²(c). The extent of borrowing for better knowledge about σ_0² depends on the number of degrees of freedom, c.
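Under this parameterisation, 1/σ_0² ∼ Gamma(c/2, rate (c/2) ∑ p_k² ξ_k²), so the prior mean of the precision is fixed while its variance shrinks as c grows. A brief sketch (with the collective prior variance set to 1 for illustration):

```python
def precision_prior_moments(c, v_collective):
    """Moments of 1/sigma_0^2 when c * v_collective / sigma_0^2 ~ chi^2(c),
    i.e. 1/sigma_0^2 ~ Gamma(shape=c/2, rate=(c/2) * v_collective)."""
    shape, rate = c / 2.0, (c / 2.0) * v_collective
    mean = shape / rate             # = 1 / v_collective, independent of c
    variance = shape / rate ** 2    # = 2 / (c * v_collective^2), shrinks in c
    return mean, variance

# illustrative: collective prior variance sum(p_k^2 * xi_k^2) set to 1
for c in (3, 5, 10, 20, 30, 40):
    mean, variance = precision_prior_moments(c, 1.0)
    print(f"c = {c:2d}: mean = {mean:.3f}, variance = {variance:.3f}")
```

At c = 3 this recovers σ_0² ∼ Inv-Gamma(1.5, 1.5 × ∑ p_k² ξ_k²), the diffuse setting used throughout the illustrations.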

Figure 3: Sample sizes required when σ_0² is unknown to retain the desired average property of the posterior distribution. The ACC and ALC sample sizes are computed by fixing the credible interval length ℓ_0 = 0.65 and the coverage probability 1 − α_0 = 95%, respectively.

Figure 4: The proposed Bayesian SSD dependent on the choice of the informative Gamma component distribution for strong borrowing. The labels on the x-axis are short for Optim I, Optim II, Robust weights I, Robust weights II, and No robustification, respectively.
Bayesian methods allow for the inherent uncertainty in the estimates of model parameters, as well as a formal incorporation of any expert opinion or historical data. In this paper, we have developed Bayesian sample size formulae that use commensurate priors to leverage pre-experimental data, available from multiple sources, for the model parameter(s) of interest. The proposed approach falls within the remit of the proper Bayesian approach to sample size determination, wherein the priors for design planning and data analysis remain the same. The level of down-weighting, in light of possible disagreement between any pre-experimental data and the new experimental data, relies on the user's choice of the prior probabilities of incommensurability, w_k, k = 1, …, K. As shown in the performance evaluation, smaller values of w_k imply less down-weighting, resulting in smaller sample sizes required for the new experiment. By contrast, a larger value of w_k implies greater down-weighting and thus results in a larger sample size.

Figure S1: The t mixture distribution for θ̃_k − θ_k, with the components depicted using black solid and blue dotted curves. Its normal approximations, given by θ̃_k − θ_k ∼ N(0, w_k b_01/(a_01 − 1) + (1 − w_k) b_02/(a_02 − 1)), are visualised using the red dashed curves in subfigures (b)–(d), under various choices of w_k.
Hampson et al. (2015) present a Bayesian approach for eliciting expert opinion on the parameters to enhance the design and analysis of rare-disease trials. A two-day elicitation meeting (Hampson et al., 2015) was held for the MYPAN trial, which compares the efficacy of a new treatment (labelled A) relative to the standard of care (labelled B) for polyarteritis nodosa, a rare and severe inflammatory blood vessel disease. Priors were elicited from the input of 15 experts individually. Specifically, opinion was sought on (i) the probability that a patient given B would achieve disease remission within 6 months (a dichotomous event), and (ii) the log-odds ratio of remission rates. Consensus distributions for the remission rates were obtained, with modes at 71% for A and 74% for B. In what follows, we regard expert opinion as a type of pre-trial information and generate hypothetical examples, reflecting characteristics of the MYPAN trial, to illustrate the application of the proposed Bayesian SSD approach.