Bayesian design and analysis of external pilot trials for complex interventions

External pilot trials of complex interventions are used to help determine if and how a confirmatory trial should be undertaken, providing estimates of parameters such as recruitment, retention, and adherence rates. The decision to progress to the confirmatory trial is typically made by comparing these estimates to pre-specified thresholds known as progression criteria, although the statistical properties of such decision rules are rarely assessed. Such assessment is complicated by several methodological challenges, including the simultaneous evaluation of multiple endpoints, complex multi-level models, small sample sizes, and uncertainty in nuisance parameters. In response to these challenges, we describe a Bayesian approach to the design and analysis of external pilot trials. We show how progression decisions can be made by minimizing the expected value of a loss function, defined over the whole parameter space to allow for preferences and trade-offs between multiple parameters to be articulated and used in the decision-making process. The assessment of preferences is kept feasible by using a piecewise constant parametrization of the loss function, the parameters of which are chosen at the design stage to lead to desirable operating characteristics. We describe a flexible, yet computationally intensive, nested Monte Carlo algorithm for estimating operating characteristics. The method is used to revisit the design of an external pilot trial of a complex intervention designed to increase the physical activity of care home residents.

a clear gap between the two trials. Pilot trials, which aim to inform the feasibility and optimal design of a subsequent definitive trial, 3 are distinct from phase II trials, which focus instead on assessing potential efficacy and safety.
The data generated by an external pilot trial are used to help decide if the main RCT should go ahead, and if so, whether the intervention or the trial design should be adjusted to ensure success. In the United Kingdom, the National Institute for Health Research asks that these progression criteria are pre-specified and included in the research plan, 4 and the recent CONSORT extension to randomized pilot trials requires their reporting. 5 A single pilot trial can collect data on several progression criteria, often focused on the aforementioned areas of recruitment, protocol adherence, and data collection. 6 Although they may take the form of single threshold values leading to binary stop/go decision rules, investigators are increasingly using two thresholds to accommodate an intermediate decision between stopping altogether and progressing straight to the main trial, which would allow progression but only after some adjustments have been made. 5 The need for appropriate progression criteria is clear when we consider the consequences of poor post-pilot progression decisions. If the criteria are too lax, there is a greater risk that the main trial will go ahead but found to be infeasible and thus a waste of resources; if the criteria are too strict, a promising intervention may be discarded under the mistaken belief that the main trial would be infeasible. Despite this, there is little published guidance about how they should be determined. 6,7 In addition to pre-specifying progression criteria, another key design decision is the choice of pilot sample size. Conventional methods of sample size determination, which focus on ensuring the trial will have sufficient power to detect a target difference in the primary outcome, are rarely used since they would lead to a pilot sample size comparable with the main trial sample size. Several methods for pilot sample size determination instead aim to provide a sufficiently precise estimate of the variance in the primary outcome measure to inform the sample size of the main trial. [8][9][10][11][12][13] Others have suggested a simple rule of thumb for when the goal is to identify unforeseen problems. 14 While some have noted that the low sample size in pilots may lead to a considerable probability that a certain progression criterion will be met (or missed) due to random sampling variation, 12,15 and despite the consequences of making the wrong progression decision, the statistical properties of pilot decision rules are rarely used to inform the choice of sample size. This may be due to the methodological challenges commonly found in pilot trials of complex interventions, including the simultaneous evaluation of multiple endpoints, complex multi-level models, small sample sizes, and prior uncertainty in nuisance parameters. 16 In this article, we will describe a method for designing and analyzing external pilot trials which addresses these challenges. We take a Bayesian view, allowing for complex models to be estimated in the typically small sample context of pilot trials and for external information to be leveraged. 17 We propose progression decisions should then be made to minimize the expected value of a loss function with respect to a posterior distribution on model parameters. This decision-theoretic approach allows for the various trade-offs between model parameters to be expressed and guide progression decisions. By implicitly defining a pre-specified decision rule, the use of a loss function also ensures operating characteristics can be calculated and used as a basis for pilot trial sample size determination.
We propose a loss function with three parameters whose values can be determined either through direct elicitation of preferences or by considering the pilot trial operating characteristics they lead to. The operating characteristics we propose are all unconditional probabilities (with respect to a prior distribution) of making incorrect decisions, also known as assurances. 18 Using assurances rather than the analogous frequentist error rates brings several benefits, including the ability to make use of existing knowledge whilst allowing for any uncertainty, and a more natural interpretation. 19 As we will show, assurances are also useful when our preferences for different end-of-trial decisions are based on several attributes in a complex way that involves trading off some against others.
The remainder of this article is organized as follows. In Section 2, we describe the general framework for pilot design and analysis, some operating characteristics used for evaluation, and a routine for optimizing the design. Two illustrative examples are then described in Sections 3 and 4. Finally, we discuss implications and limitations in Section 5.

Prior specification
Consider a pilot trial which will produce data x according to model p(x| ). We decompose the parameters into = ( , ), where denotes the parameters of substantive interest and the nuisance parameters. We follow Wang and Gelfand 20 and assume that two joint prior distributions of have been specified. First, the analysis prior p A ( ) is that which will be used when fitting the model once the pilot data is obtained. It has been argued that regulators are unlikely to accept the prior beliefs of the trial sponsor for analysis of the data, 18,21 and as such a weakly or non-informative prior should be used for p A ( ) in order to "let the data drive the inference." 20 The choice of such a prior will depend on the specific model being used, although methodological guidance for various specific cases such as logistic regression 22 and hierarchical models 23 is available. It should be emphasized, however, that the typically small sample size of a pilot trial can mean the effect of the analysis prior is non-negligible. As such, the analysis prior should provide a credible and justifiable representation of prior ignorance, avoiding extreme default choices which may place too much prior weight on infeasible regions of the parameter space. The design prior p D ( ) will be used when evaluating the statistical performance of a proposed pilot trial design. It may be considered as purely hypothetical in the spirit of a "what-if" analysis, 20 in which case several candidate design priors may be suggested and performance evaluated under each of these. Alternatively, and as we will assume in the remainder of this article, p D ( ) can be a completely subjective prior which fully expresses our knowledge and uncertainty in the parameters at the design stage. Although eliciting such a prior is potentially challenging, many examples describing successful practical applications of expert elicitation for clinical trial design are available, 19,21,24 as are tools for its conduct such as the Sheffield Elicitation Framework (SHELF). 25 From a strictly subjective Bayesian perspective, we can then view the weakly informative analysis prior as representing the beliefs of the person who will analyze the data and who is relatively uninformed with regards to the model parameters.

Analysis and progression decisions
After observing the pilot data x, we must decide whether or not to progress to the main RCT. We consider three possible actions following the aforementioned "traffic light" system commonly used in pilot trials: • red-discard the intervention and stop all future development or evaluation; • amber-proceed to the main RCT, but only after some modifications to the intervention, the planned trial design, or both; or • green-proceed immediately to the main RCT.
In what follows we will denote these decisions by r, a, and g, respectively. We assume that our preferences between the three possible decisions are influenced by but independent of , formalizing the separation of into substantive and nuisance components. We partition the substantive parameter space Φ into three disjoint subspaces Φ I , for I = R, A, G. Each subspace label corresponds to the decision we would make if we knew the true value of . For example, if ∈ Φ R then the optimal decision is r(ed)-halt development and do not proceed to a definitive trial. We will henceforth refer to these three subsets as hypotheses, and to conditioning on the event ∈ Φ I as "under hypothesis Φ I ." Throughout, we will distinguish hypothesis I from the corresponding optimal decision i by using upper and lower case letters, respectively.
When ∈ Φ I and we choose a decision j ≠ i, there will be negative consequences. In particular, we may make three kinds of mistakes: proceed to an infeasible main RCT; discard a promising intervention; or make unnecessary adjustments to the intervention or trial design. We denote these errors as E 1 , E 2 , E 3 , respectively. The occurrence of error j will be denoted by E j = 1, otherwise E j = 0. An error's occurrence will be a function of the decision made d and the true parameter value , that is, E j (d, ) ∶ {r, a, g} × Φ → {0, 1} for j = 1, 2, 3. We then use a loss function to express the preferences of the decision-maker(s) on the space of possible events E 1 × E 2 × E 3 under uncertainty, defined as Note that the additive form of the loss function implies that the our preferences for any one of the attributes E 1 , E 2 , E 3 are independent of the values taken by the others. 26 To determine appropriate values of the parameters c 1 , c 2 , c 3 , we first scale the loss function by setting c 1 + c 2 + c 3 = 1. Thus, a loss of 0 is obtained if no errors occur, and a loss of 1 is obtained if all errors occur (although note that this is not possible in this setting). We then follow the procedure described by French and Rios Insua (page 99), 26 eliciting some judgments from the decision-maker(s) and using these to determine the values of c 1 , c 2 , c 3 . One such judgment involves a simple gamble of obtaining the event (E 1 = 0, E 2 = 0, E 3 = 0) with probability 1 − p 1 and the event (E 1 = 1, E 2 = 0, E 3 = 1) with probability p 1 . The decision-maker is asked to compare this gamble against an alternative of obtaining the event (E 1 = 1, E 2 = 0, E 3 = 0) for certain, and to adjust the value of p 1 until they feel indifferent between the two options.

Hypothesis
TA B L E 1 Losses associated with each decision under each hypothesis Since this indifference implies the expected losses of the two options are equal, we will then have Similarly, we can ask the decision-maker(s) to consider a gamble between the event (E 1 = 0, E 2 = 0, E 3 = 0) with probability 1 − p 2 and the event (E 1 = 1, E 2 = 1, E 3 = 0) with probability p 2 , and compare this against the option of obtaining (E 1 = 1, E 2 = 0, E 3 = 0) for certain. Again, by determining the value of p 2 which corresponds to indifference and thus equal expected loss, we deduce that This gives three equations that can be solved to obtain Note that the two specific judgments suggested here are only two of many possible similar questions which could be posed to the decision-maker(s). It is recommended that more indifferences are elicited in order to seek out any inconsistencies and further clarify their true preferences.
The loss function will then take values as given in Table 1. For example, suppose we make a "green" decision under the "amber" hypothesis. The subsequent trial will be infeasible because the necessary adjustments will not have been made; but we have also discarded a promising intervention, since it would have been redeemed had the adjustments been made. The overall loss is therefore c 1 + c 2 .
Given a loss function with parameters c = (c 1 , c 2 , c 3 ), we follow the principle of maximizing expected utility (or in our case, minimizing the expected loss) when making a progression decision. We first use the pilot data in conjunction with the analysis prior p A ( ) to obtain a posterior p( | x), and then choose the decision i * such that i * = arg min i∈{r,a,g} E |x [L(i, )] (1) = arg min i∈{r,a,g} ∫ L(i, )p( |x)d .
We can simplify this expression by noting that, given the piecewise constant nature of the loss function, the expected loss of each decision depends only on the posterior probabilities For some simple models that admit a conjugate analysis, the posterior probabilities p I can be obtained exactly. Otherwise, Monte Carlo estimates can be computed based on the samples from the joint posterior distribution generated by an MCMC analysis of the pilot data. Specifically, given M samples (1) , (2) , … , (M) ∼ p( | x), where I(.) is the indicator function.

Operating characteristics
Defining a loss function and following the steps of the preceding section effectively prescribes a decision rule mapping the pilot data sample space  to the decision space {r, a, g}. To gain some insight at the design stage into the properties of this rule, we propose to calculate some trial operating characteristics. These take the form of unconditional probabilities of making an error when following the rule, calculated with respect to the design prior p D ( ). We consider the following: -probability of proceeding to an infeasible main RCT; -probability of discarding a promising intervention; -probability of making unnecessary adjustments to the intervention or the trial design.
These operating characteristics can be estimated using simulation. First, we draw N samples For each dataset, we then apply the analysis and decision-making procedure described in Section 2.2, using some vector c to parametrize the loss function. This results in N decisions i (k) which can be contrasted with the corresponding true parameter value (k) and in which hypothesis it resides, noting if any of the three types of errors have been made. MC estimates of the operating characteristics can then be calculated as the proportion of occurrences of each type of error in the N simulated cases. Assuming that N is large, the unbiased MC estimate of an operating characteristic with true probability p will be approximately normally distributed with variance p(1 − p)/N. *

Eliciting loss parameters through optimization
Elicitation of the loss function parameters c = (c 1 , c 2 , c 3 ) in the manner described in Section 2.2 may be challenging, particularly when multiple decision-makers are involved. 27 An alternative way to determine c is through examining the operating characteristics it leads to (for some fixed pilot design). As c is adjusted, the balance between the conflicting objectives of minimizing each OC will change, and the task is then to find the c which returns the best balance from the perspective of the decision-maker. Formally, and thinking of operating characteristics as functions of c, we wish to solve the multi-objective optimization problem where Since the three objectives are in conflict, there will be no single solution which simultaneously minimizes each one. We would instead like to find a set  * = {c (1) , c (2) , … , c (K) } such that each member provides a different balance between minimizing the three operating characteristics. If there exist c, In this case, because c leads to worse (or at least no better) values of all three operating characteristics when compared to c ′ , we have no reason to include it in our set  * . Because the search space  has only two dimensions, problem (7) can be approximately solved by generating a uniform random sample of c's and estimating the operating characteristics for each. Any parameters which are dominated in this set can then be discarded, and the operating characteristics of those which remain can be illustrated graphically. The decision-maker(s) can then view the range of available options, all providing different trade-offs among the three operating characteristics, and choose from among them.
To solve the problem in a timely manner, we must be able to estimate operating characteristics quickly. Noting from Equation (3) that the expected loss of each decision depends only on c and the posterior probabilities p R , p A and p G , we first generate N samples of these posterior probabilities and then use this same set of samples for every evaluation. * Note that in the case of complex models which do no admit a conjugate analysis, the posterior probabilities obtained using an MCMC analysis will themselves be approximate and as such the optimal decision will be subject to error, which may increase the variance of the operating characteristic estimates. However, this issue can be sidestepped by assuming that, for each dataset, the analysis that is simulated corresponds exactly to the analysis that would be carried out in practice. In particular, we assume that exactly M posterior samples will be generated by the same MCMC algorithm, using the same seed in the random number generator.
This approach not only ensures that optimization is computationally feasible, but also means that differences in operating characteristics are entirely due to differences in costs, as opposed to differences in the random posterior probability samples.

ILLUSTRATIVE EXAMPLE-CHILD PSYCHOTHERAPY (TIGA-CUB)
TIGA-CUB (Trial on Improving Inter-Generational Attachment for Children Undergoing Behavior problems) was a two-arm, individually-randomized, controlled pilot trial informing the feasibility and design of a confirmatory RCT comparing Child Psychotherapy (CP) to Treatment as Usual (TaU), for children with treatment resistant conduct disorders.
The trial aimed to recruit 60 primary carer-child dyads, to be randomized equally to each arm. This sample size was chosen to give desired levels of precision in the estimates of the common standard deviation of the primary outcome, the follow-up rate, and the adherence rate. Here, we focus on the latter two parameters and consider how our proposed method could have informed the design of TIGA-CUB. We model the number of participants successfully followed-up (denoted f ) using a binomial distribution with parameter p f , and similarly the number successfully adhering to the intervention (denoted a) with a binomial distribution with parameter p a . For a fixed pilot trial per-arm sample size n, the parameters of the model are = (p f , p a ), with no nuisance parameters. Assuming for simplicity that the numbers followed-up and adhering are independent, the likelihood is then At the design stage, the follow-up rate p f was thought to be somewhere in the range 62% to 92%, while the adherence rate p a was thought to lie between 40% and 95%. We reflect these ranges of uncertainty in our design priors by using beta distributions p f ∼ Beta(40, 10) (thus giving a prior mean of 0.8), and p a ∼ Beta(11.2, 4.8) (giving a prior mean of 0.7). We assume that a uniform "non-informative" prior Beta(1, 1) will be used for each parameter in the analysis. TIGA-CUB's progression criteria included only simple stop/go thresholds, with no intermediate "amber" decisions. As such, in this example, we partition the parameter space into two hypotheses, Φ G and Φ R . For the purposes of illustration, we define the hypothesis Φ G as the subset of the parameter space where p f > = 0.8 and p a > = 0.7, hypothesis Φ R being its complement. Thus, in this example, we do not consider there to be a trade-off between the two parameters of interest. For the main trial to be feasible, both must be above their respective thresholds. The prior distributions on parameters p f and p a imply an a priori probability of 0.28 that ∈ Φ G , that is, that both follow-up and adherence are sufficiently high.
In this special case, the loss function is and the expected losses of decisions g and r will be E |x [L(g, )] = c 1 p R and E |x [L(r, )] = c 2 p G , where p R + p G = 1 and c 1 + c 2 = 1. Decision g is therefore optimal whenever p G > c 1 . The posterior probability p G can be easily calculated given the pilot data due to the beta prior distributions being conjugate. Specifically, given a total sample size 2n and observing x f participants with follow-up and x a participants with adherence, the posterior probability Pr[ ∈ Φ G | x] is given by where F(y; , ) denotes the cumulative probability function of the beta distribution with parameters , . At the design stage, we can calculate the probability of an infeasible trial (OC 1 ), and similarly for the probability of discarding a promising intervention. As these calculations can be computationally expensive for moderate n due to the nested summation term, we use Monte Carlo approximations as described in Section 2. To examine the effect of adjusting the sample size, we evaluated the operating characteristics obtained for n = 10, 12, 14, … , 50 per arm whilst setting c 1 = 0.2, 0.36, 0.5. The results are shown in Figure 2. Each line includes a shaded area denoting the 95% Monte Carlo error intervals, although these are so small as to be illegible given the high number (N = 10 6 ) of MC samples used for each calculation. Although operating characteristics generally improve as the sample size is increased, we see that for c 1 = 0.36 and 0.5 the probability of an infeasible main trial, OC 1 , remains flat whilst OC 2 has a downward trend. As we would expect, the expected loss reduces smoothly as n increases in all cases. In contrast, there is some variability beyond that explained by MC error in the OCs. This can be explained by the discrete nature of simulated adherence and follow-up data. Our results show that, for the design priors and hypotheses used in this example, the chosen sample size in TIGA-CUB of n = 30 can provide error rates broadly in line with conventional type I and II error rates under the usual hypothesis testing framework.

ILLUSTRATIVE EXAMPLE-PHYSICAL ACTIVITY IN CARE HOMES (REACH)
The REACH (Research Exploring Physical Activity in Care Homes) trial aimed to inform the feasibility and design of a future definitive RCT assessing a complex intervention designed to increase the physical activity of care home residents. 28 The trial was cluster randomized at the care home level, with twelve care homes in total randomized equally between treatment as usual (TaU) and the intervention plus TaU.
Data on several feasibility outcomes were collected. Here, we focus on four: recruitment (measured in terms of the average number of residents in each care home who participate in the trial, or average cluster size); adherence (a binary TA B L E 2 Pre-specified progression criteria used in the original REACH design indicator at the care home level indicating if the intervention was fully implemented); data completion (a binary indicator for each resident of successful follow-up at the planned primary outcome time of 12 months); and potential efficacy (a continuous measure of physical activity at the resident level). Progression criteria using the traffic light system were pre-specified for all of these outcomes except potential efficacy, as detailed in Table 2.
Denoting the size of the jth cluster by m j and the number of care homes in each arm by k, we assume that cluster sizes are normally distributed, m j ∼ N( c , 2 ), j = 1, … , 2k. We further assume that the probability of a participant being followed-up is constant across clusters and arms, and that the total number follows a binomial distribution f ∼ Bin( ∑ 2k j=1 m j , p f ). The number of care homes which successfully adhere to the intervention is assumed to binomially distributed, a ∼ Bin(k, p a ).
The continuous measure of physical activity is expected to be correlated within care homes. We model this using a random intercept, where the outcome y ij of resident i in care home j is Here, X j is a binary indicator of care home j being randomized to the intervention arm, Y j is a binary indicator of care home j successfully adhering to the intervention, is the mean treatment effect, u j ∼  (0, 2 B ) is the random effect for care home j, and i ∼  (0, 2 W ) is the residual for resident i. We parametrize the model using the intracluster correlation coefficient, . The parameters describing average cluster size, follow-up, and adherence rates, and mean treatment effect are of substantive interest when making progression decisions, giving = ( c , p f , p a , ). The remainder are nuisance parameters, = ( 2 , , 2 W ).

Prior and hypothesis specification
To begin specifying a model for the REACH trial, we first note that the four substantive parameters can be divided into two pairs. First, mean cluster size and follow-up rate relate to the amount of information which a confirmatory trial will gather. Second, potential efficacy and adherence relate to the effectiveness of the intervention, where effectiveness is thought of as the effect which will be obtained in practice when the effect of non-adherence is accounted for. We expect that a degree of trade-off between adherence and potential efficacy will be acceptable, with a decrease in one being compensated by an increase in the other. Likewise, low mean cluster size could be compensated to some extent by higher follow-up rate, and vice versa. While there may be trade-offs within these pairs of parameters, we do not expect trade-offs between them. A trial with no effectiveness will be futile regardless of the amount of information collected, and so should not be conducted. Similarly, a confirmatory trial should not be conducted if it is highly unlikely to produce enough information for the research question to be adequately answered. We therefore consider the sub-spaces of Φ formed by these parameter pairs, partition these into hypotheses, and combine these together. Constructing hypotheses in these two-dimensional spaces is cognitively simpler than working in the original four-dimensional space, not least because they can be easily illustrated graphically.
Formally, let Φ i be the sub-space of mean cluster size and follow-up rate, and Φ e be that of adherence and potential efficacy. Having specified hypotheses Φ i I , Φ e I for I = R, A, G, we then have

Follow-up and cluster size
Recall that cluster sizes are assumed to be normally distributed with mean c and variance 2 . A normal-inverse-gamma prior is placed on the mean and variance to allow for prior uncertainty in both parameters. It was anticipated that an average of 8 to 12 residents would be recruited in each care home. To reflect this prior belief we set the hyper-parameters to 0 = 10, 0 = 6, 0 = 20, 0 = 39, giving a prior cluster size of 10 with mean variance 2.05. For the probability of successful follow-up, p f , we take a Beta distribution with hyper-parameters 0 = 22.4, 0 = 9.6 as the prior. This gives a prior with a mean of 0.7 and a standard deviation of 0.08.
To partition the parameter space into hypotheses, we first consider the case where follow-up is perfect, that is, p f = 1. Conditional on this, we reason that a mean cluster size of below 5 should lead to a red decision (stop development), whereas a size of above 7 should lead to a green decision (proceed to the main trial). As the probability of successful follow-up decreases, we suppose that this can be compensated by an increase in mean cluster size. We assume the nature of this trade-off is linear and decide that if p f were reduced to 0.8, we would want to have a mean cluster size of at least 8 to consider decisions a or g. We further decide that a follow-up rate of less than p f = 0.6 would be critically low, regardless of the mean cluster size, and should always lead to decision r. Similarly, a follow-up rate of 0.6 ≤ p f < 0.66 should lead to modification of the intervention or trial design. Together, these conditions lead to the following partitioning of the parameter space: The hypotheses are illustrated in Figure 3A. Having specified both the hypotheses and the prior distribution for these two parameters, we can obtain prior probabilities of each hypothesis by sampling from the prior and calculating the proportion of these samples falling into the regions Φ i R , Φ i A and Φ i G . We have plotted 1000 samples from the prior in Figure 3A, falling into hypotheses Φ i R , Φ i A , and Φ i G in proportions 0.354, 0.517, and 0.129, respectively. This demonstrates that there is significant prior uncertainty regarding the optimal decision, indicating the potential value of the pilot trial.

Adherence and potential efficacy
Having defined priors and hypotheses with respect to cluster size and follow-up, we now consider adherence and potential efficacy. Recall that the number of care homes which successfully adhere to the intervention delivery plan is assumed to be binomially distributed with probability p a . We assume that adherence is absolute in the sense that all residents in a care home which does not successfully deliver the intervention will not receive any of the treatment effect. We place a Beta prior on p a , with hyper-parameters = 28.8 and = 3.2 giving a prior mean of 0.9 and a standard deviation of 0.05. For the continuous measure of physical activity, we place priors on the mean effect , the intracluster correlation coefficient , and the within-cluster variance 2 W in the manner suggested by Spiegelhalter. 23 Specifically, we choose To reflect prior expectation of an ICC around 0.05 but possibly as large as 0.1, the hyperparameters give a prior mean of 0.05 for the ICC with a prior probability of 0.104 that it will exceed 0.1.
While there is potential for adherence to be improved after the pilot, we assume there will be little opportunity to improve the potential efficacy of the intervention. Moreover, we suppose an absolute improvement in adherence of up to around 0.1 is feasible. To define the hypotheses in this subspace, we first set a minimal level of potential efficacy to be 0.1, and decide that we would be happy to make decision g at this point if and only if adherence is perfect. As p a reduces from 1, a corresponding linear increase in potential efficacy is considered to maintain the overall effectiveness of the intervention. The rate of substitution for this trade-off is determined to be approximately 0.57 units of potential efficacy per unit of adherence probability. We consider an absolute lower limit in adherence of p a = 0.5, below which we will always consider decision r to be optimal. Taking these considerations together, the marginal hypotheses are defined as The hypotheses are illustrated in Figure 3B. Again, a sample of size 1000 from the joint marginal prior distribution p(p a , ) is also plotted, falling into hypotheses Φ e R , Φ e A , and Φ e G in proportions 0.234, 0.470, and 0.296, respectively. As before, this indicates substantial prior uncertainty regarding the optimal decision and thus supports the use of a pilot study.
The marginal hypotheses are combined together using Equation (12). Considering the same 1000 samples from the design prior plotted in Figure 3, these now fall into the regions Φ R , Φ A , and Φ G in proportions 0.507, 0.458, and 0.035, respectively. Note that the prior probabilities of these overall hypotheses are quite different to those of the marginal hypotheses. In particular, there is a considerable increase in the probability that decision r will be optimal, and a considerable decrease that decision g will be.

Weakly informative analysis
We applied the proposed method assuming that a weakly informative joint prior distribution will be used at the analysis stage. † We took the sample size of the trial to be k = 6 clusters per arm. For calculating operating characteristics we generated N = 10 4 samples from the joint distribution p( , x) = p(x| )p D ( ). We analyzed each simulated dataset using Stan via the R package rstan, 29 in each case generating 5000 samples in four chains and discarding the first 2500 samples We evaluated the operating characteristics for a sample of parameters (c 1 , c 2 , c 3 ) as described in Section 2.4. A total of 254 parameter vectors were evaluated, of which 62 led to operating characteristics which were worse in every respect than some other vector (ie, dominated) and were discarded. The operating characteristics of the non-dominated parameters are shown in Figure 4. The three operating characteristics are found to be highly correlated. In particular, changing the parameters to give a lower probability of discarding a promising intervention (OC 2 ) tends to lead to a reduction in the probability of making an unnecessary adjustment (OC 3 ). When selecting (c 1 , c 2 ), the key decision appears to be trading off the probability of an infeasible trial, (OC 1 ), against OC 2 . There is a very limited opportunity to minimize OC 3 at the expense of these. For example, compare points b and c in Figure 4, details of which are given in Table 3. We see that point c reduces OC 3 by 0.078 in comparison to point b, but only at the expense of increase in OC 1 and OC 2 of 0.13 and 0.145, respectively.
We would expect to see a clear relationship between the value of parameters c 1 , c 2 , c 3 and the operating characteristics they relate to. We explore this in Figure 5 with scatter plots of each parameter against each operating characteristic. The results show that there is indeed a strong relationship between the loss assigned to discarding a promising intervention, c 2 , and the probability that this event will occur, OC 2 (see center plot). Moreover, c 2 also seems to be the main determinant of operating characteristics OC 1 and OC 3 . The implication is that once the c 2 ∈ [0, 1] has been chosen, the operating characteristics of the trial depend only weakly on the way in which the remaining 1 − c 2 is allocated to c 1 and c 3 . This appears to be due to the fact that, regardless of how errors are weighted, the way we have defined our prior distributions and hypotheses means we are much more likely to make the error of discarding a promising intervention than the other types of error. The cost we assign to this error is therefore more influential on the overall operating characteristics than the other costs.
To illustrate the effect of varying sample size in the REACH trial, we set the loss function parameters to that of point a in Figure 4 and Table 3, (c 1 , c 2 , c 3 ) = (0.07, 0.9, 0.03). We then estimated the operating characteristics obtained for k = 6, 12, 18 clusters per arm. Note that we considered only three choices of sample size due to the significant computational burden of each evaluation. The results are plotted in Figure 6. Increasing the sample size appears to have little effect on OC 1 and OC 3 , while leading to a decrease in OC 2 , the probability of discarding a promising intervention. This behavior reflects the priorities encoded by the costs parameter, where c 2 = 0.9.

Incorporating subjective priors
Rather than use weakly or non-informative priors when analyzing the pilot data, we may instead want to make use of the (subjective) elicited knowledge of parameter values described in the design prior p D ( ). Anticipating criticisms of a fully subjective analysis, we can envisage two particular cases where this might be appropriate. First, using the components of the design prior which describe the nuisance parameters while maintaining weakly informative priors on substantive parameters . Second, when very little data on a specific substantive parameter is going to be collected in the pilot, using the informative design prior for that parameter could substantially improve operating characteristics. We replicated the above analysis for these two scenarios. For the second, we used informative priors for all nuisance parameters and for the probability of adherence, p a . Recall that this is informed by a binary indicator at the care home level and only in the intervention arm, and will therefore have very little pilot data bearing on it. For each case we used the same N samples of parameters and pilot data which were used in the weakly informative case, repeating the Bayesian analysis using the appropriate analysis prior and obtaining estimated posterior probabilities p R , p A , and p G as before. These were used in conjunction with the same set of loss parameter vectors  to obtain corresponding operating characteristics (Figure 7).
For brevity, we will refer to the three cases as weakly informative (WI), informative nuisance (IN), and informative nuisance and adherence (INA). Comparing the operating characteristics of cases WI and IN, we found very little difference (further details are provided in the supplementary material). When we contrast cases WI and INA, however, there is Weakly informative a clear distinction. Using the INA analysis prior will lead to larger probabilities of an infeasible trial (OC 1 ) and of unnecessary adjustment (OC 3 ), while reducing the probability of discarding a promising intervention (OC 2 ), for almost all loss parameters. The expected loss is always lower for the INA analysis than for WI, as we would expect.

DISCUSSION
When deciding if and how a definitive RCT of a complex intervention should be conducted, and basing this decision on an analysis of data from a small pilot trial, there is a risk we will inadvertently make the wrong choice. A Bayesian analysis of pilot data followed by decision-making based on a loss function can help ensure this risk is minimized. The expected results of such a pilot can be evaluated through simulation at the design stage, producing operating characteristics which help us understand the potential for the pilot to lead to better decision-making. These evaluations can in turn be used to find the loss function which leads to the most desirable operating characteristics, and to inform the choice of sample size. Our proposal has been motivated by some salient characteristics of complex intervention pilot trials, and offers several potential benefits over standard pilot trial design and analysis techniques. The Bayesian approach to analysis means that complex multi-level models can be used to describe the data, even when the sample size is small. In contrast to the usual application of independent progression criteria for several parameters of interest, we provide a way for preferential relationships between parameters to be articulated and used when making decisions. Using a subjective prior distribution on unknown parameters at the design stage allows both our knowledge and our uncertainty to be fully expressed, meaning we can leverage external information whilst also avoiding decisions which are highly sensitive to imprecise point estimates.
Our proposed design is related to the literature on assurance calculations for clinical trials, 18 applying the idea of using unconditional event probabilities as operating characteristics to the pilot trial setting. In doing so we have shown how assurances can be defined for multiple substantive parameters with trade-offs between them, and with respect to the "traffic light" red/amber/green decision structure commonly found in pilot trials. The multi-objective optimization framework we have used to inform trial design allows the decision-maker to explicitly consider the different trade-offs between operating characteristics which are available, and select that which best reflects their own preferences. A similar approach has been taken in the context of phase II trials using the statistical concept of admissible designs. 30,31 This can be contrasted with the conventional and much criticized approach common in the frequentist context, where arbitrary constraints are placed on type I and II error rates in order to define a single optimal design. 32 The benefits brought by the Bayesian approach must be set against the challenges it brings, particularly in terms of computation time and implementation. In terms of the latter, we are required to specify a joint prior distribution over the parameters and a partitioning of the parameter space into the three hypotheses. The specification of the prior distribution may be a challenging and time-consuming task. Although some relevant data relating to similar contexts may be available, for example, in systematic reviews or observational studies, expert opinion may still be required to articulate the relevance of such data to the problem at hand. When no data are available, which is not unlikely given the early phase nature of pilot studies, expert opinion will be the only source of information. Although potentially challenging, many examples describing successful practical applications of elicitation for clinical trial design are available, 19,21,24 as are tools for its conduct such as the Sheffield Elicitation Framework (SHELF). 25 Dividing the parameter space into three hypotheses may also prove challenging in practice, particularly when trade-offs between more than two parameters are to be elicited. There is a need for methodological research investigating how methods for multi-attribute preference elicitation, such as those set out by Keeney and Raiffa, 27 can be applied in this context.
The computational burden of the proposed method is significant, particularly when the model is too complex to allow a conjugate analysis to be used when sampling from the posterior distribution. We have used a nested Monte Carlo sampling scheme to estimate operating characteristics, as seen elsewhere. 18,20,33 One potential approach to improve efficiency is to use non-parametric regression to predict the expected losses of Equation (3) based on some simulated data, thus bypassing the need to undertake a full MCMC analysis for each of the N samples in the outer loop. This approach has been shown to be successful in the context of expected value of information calculations. 34,35 The computational difficulties will be particularly pertinent when using our approach to determine sample size, as several evaluations of different sample size choices will be required. If the choice of sample size can be framed as an optimization problem, methods for efficient global optimization of computationally expensive functions such as those described by Jones 36 and Roustant et al. 37 may be useful. 16 Alternatively, one of several rules-of-thumb for choosing pilot sample size 3,9,11,13 could be used, with the resulting operating characteristics evaluated using the proposed method. Volumes.
We have defined our procedure in terms of a loss function, where the decision-making following the pilot will minimize the expected loss. However, the piecewise constant loss function we have proposed may not adequately represent the preferences of the decision-maker. For example, we may object to the loss associated with discarding a promising intervention being independent of exactly how effective the intervention is. An alternative is to try to define a richer representation of the loss function through direct elicitation of the decision-makers preferences under uncertainty, 26 leading to a fully decision-theoretic approach to design and analysis. 38 However, as previously noted by others, [39][40][41] implementation of these approaches has been limited in practice and this may be indicative of their feasibility.
The proposed method could be extended in several ways. More operating characteristics could be defined and used in design optimization, more complicated trade-off relationships between multiple parameters could be addressed, or the hypotheses could be expanded to include nuisance parameters which would be used as part of the sample size calculation in the main RCT. A particularly interesting avenue for future research is to consider how to model post-pilot trial actions in more detail. For example, while we allow for the possibility of making an "amber" decision, indicating that modifications to the intervention or trial design should be made, we do not model what that decision will actually look like and how it should relate to the observed pilot data. Methodology for jointly modeling a pilot and subsequent main RCT in this manner could be informed by developments for designing phase II/III programs in the drug setting. [42][43][44][45]