Confidence distributions for treatment effects in clinical trials: Posteriors without priors

An attractive feature of using a Bayesian analysis for a clinical trial is that knowledge and uncertainty about the treatment effect are summarized in a posterior probability distribution. Researchers often find probability statements about treatment effects highly intuitive, and the fact that such statements are not accommodated in frequentist inference is a disadvantage. At the same time, the requirement to specify a prior distribution in order to obtain a posterior distribution is sometimes an artificial process that may introduce subjectivity or complexity into the analysis. This paper considers a compromise involving confidence distributions, which are probability distributions that summarize uncertainty about the treatment effect without the need for a prior distribution and in a way that is fully compatible with frequentist inference. The concept of a confidence distribution provides a posterior-like probability distribution that is distinct from, but exists in tandem with, the relative frequency interpretation of probability used in frequentist inference. Although they have been discussed for decades, confidence distributions are not well known among clinical trial statisticians, and the goal of this paper is to discuss their use in analyzing treatment effects from randomized trials. As well as providing an introduction to confidence distributions, some illustrative examples relevant to clinical trials are presented, along with various case studies based on real clinical trials. It is recommended that trial statisticians consider presenting confidence distributions for treatment effects when reporting analyses of clinical trials.


INTRODUCTION
Bayesian inference has long been advocated for clinical trials.1 It allows the incorporation of prior information such as historical controls or evidence from related populations,2 which is particularly applicable for early phase studies or confirmatory studies in rare diseases and small populations. The wider use of adaptive designs has also led to increased use of Bayesian inference in clinical trials.3,4 One of the most appealing features of a Bayesian analysis is that uncertainty about the treatment effect can be expressed using a probability distribution: the posterior distribution. This allows probability statements to be made about the treatment effect. Thus, for example, a Bayesian analysis of a recent clinical trial was able to summarize the strength of evidence for a beneficial treatment effect using the statement Pr(BENEFIT) = 0.93.5 In a frequentist analysis such probability statements are not possible. In particular, interpreting a confidence interval as a frequentist probability statement about the treatment effect is incorrect. Nonetheless, researchers often find it intuitive to make probability statements about the treatment effect, which is a strong motivation for using a Bayesian analysis.
A key ingredient of any Bayesian analysis is the prior distribution, which summarizes prior belief about the parameter of interest. When relevant prior information exists, such as evidence on adults that is relevant to a pediatric trial, the prior distribution provides the mechanism for incorporating this information into the analysis. However, many randomized trials using a Bayesian approach do not explicitly incorporate pre-existing evidence into the analysis. Instead, a uniform or diffuse prior is assumed, which is intended to represent no prior beliefs about the treatment effect. Such non-informative prior specification may add complexity and may not be invariant to the arbitrarily chosen parameter scale. Even when there are strong pre-existing beliefs about the treatment effect or some other parameter, such as a control response probability, there may be concern about sensitivity of the results to an informative prior that embodies those prior beliefs. In either case, the benefits of the Bayesian posterior distribution come with the requirement to first specify a prior distribution.
Confidence distributions are probability distributions summarizing uncertainty about the treatment effect, but they do not require specification of a prior distribution and they are fully compatible with frequentist inference. In this sense they provide a desirable posterior-like probability distribution for the treatment effect, without the subjectivity or complexity of an assumed prior distribution. The concept can be traced back to Fisher's fiducial inference.8,9 Indeed, like confidence distributions, Fisher's concept of fiducial probability provides a probability distribution over the parameter space, which comes from the statistical model of the sampling mechanism used to produce the observed data. Subsequently, many other prominent 20th century statisticians, Hall,10 Efron11 and Hampel12,13 to name a few, were attracted to confidence distributions or related concepts.
In the 21st century the concept has received renewed attention with applications in diverse fields as well as review papers,14,15 a book,16 tutorial papers17,18 and software packages.19,20 Schweder and Hjort16 coined the phrase "posterior distributions without priors" to describe confidence distributions, while Hampel13 had earlier characterized them as "the Bayesian omelette … without breaking the Bayesian eggs", paraphrasing Savage's classic quote on fiducial inference.21 The purpose of this paper is to provide a practical illustration of the application of confidence distributions to treatment effects from randomized clinical trials. Although confidence distributions have been widely used in various applied statistical contexts,16 they have rarely been used for clinical trials and the goal here is to encourage trial statisticians to consider the use of confidence distributions when analyzing clinical trials.
The next section provides a general introductory discussion on the interpretation of probability statements in frequentist and Bayesian contexts, with the goal of motivating confidence distributions as a tool to enhance our ability to quantify uncertainty. This is followed in Section 3 by a formal definition of confidence distributions and a discussion of how they can be constructed and interpreted in practice. Section 4 then reviews the practical mechanics of constructing and using confidence distributions for treatment effects in clinical trials, with specific focus on the range of effect measures typically used in clinical trials. Case studies based on analyses of data from real clinical trials are presented in Section 5, followed by a discussion of a broad range of related topics and future research directions in Section 6.

RELATIVE FREQUENCY AND CONFIDENCE
An understanding of the usefulness of confidence distributions in statistical analysis starts with an understanding of probability as a dual concept. By this it is meant that the single word probability is used to describe two quite distinct notions. The presentation will begin with a brief discussion of probability and its two interpretations, before moving on to a discussion of confidence distributions. This will ultimately lead us to interpret confidence as a type of probability, and confidence intervals as a type of probability statement, albeit not a frequentist probability statement. The fundamental differences between Bayesian and frequentist inference stem largely from the different interpretations of probability employed by the two paradigms. Bayesian probability is epistemic: it quantifies uncertainty due to incomplete knowledge. This allows Bayesian probability statements about a treatment effect, which are interpreted as an expression of degree of belief. Armed with this interpretation of probability statements, it becomes possible to use Bayes' theorem in ways that are unavailable in frequentist inference, including the incorporation of prior belief and posterior probability statements about the treatment effect.
Frequentist probability is aleatory: it quantifies uncertainty due to randomness in nature. Thus, unlike Bayesian probability, frequentist probability may only be applied to statements about the outcome of random phenomena, such as the tossing of a coin or the random sampling of an individual from a population. In this context, frequentist probability corresponds to the stable relative frequency of the outcome in repeated occurrences of the random phenomenon. Accordingly, since treatment effects are considered a fixed state of nature in frequentist inference, they are outside the scope of frequentist probability statements.
The inability to make epistemic probability statements is a hole in frequentist inference, because researchers intuitively want to make such statements. It is intuitively appealing to summarize the evidence from a clinical trial with a statement such as Pr(BENEFIT) = 0.93. Confidence distributions provide a mechanism to plug this hole in frequentist inference. This has been increasingly recognized in recent years, and a number of authors have advocated the use of confidence distributions by emphasizing the distinction between the aleatory concept of relative frequency and the epistemic concept of confidence; see Hampel,13 Schweder15 and Schweder and Hjort16 (section 1.9 and references therein).
The simultaneous use of different versions of probability is anathema to both the frequentist and Bayesian paradigms, but it actually has a long history from a philosophical perspective. Indeed, philosophers have long seen probability as a duality: two distinct concepts that confusingly have the same name. This was emphasized by Hacking22 in his seminal treatise on the emergence of the concept of probability: "Probability has two aspects. It is connected with the degree of belief warranted by evidence, and it is connected with the tendency, displayed by some chance devices, to produce stable relative frequencies." This duality was also central to the theories of earlier philosophers, notably Carnap,23 who used the terminology Probability 1 and Probability 2 to distinguish aleatory and epistemic probability. Even as early as Poisson,24 a distinction was made between the words chance and probability to capture the difference between aleatory and epistemic uncertainty. If the concept of a confidence distribution is added to standard frequentist inference, and is allowed to exist alongside but distinct from frequentist probability, then relative frequency and confidence become Carnap's Probability 1 and Probability 2 or Poisson's chance and probability. As will be discussed, this expansion of frequentist inference to allow epistemic probability statements provides the benefits of Bayesian posterior distributions without the limitation of having to specify a prior distribution.

CONFIDENCE DISTRIBUTIONS
We now come to the definition of a confidence distribution, and its relevance for quantifying information about treatment effects in clinical trials.As discussed in Section 1, confidence distributions have been presented and studied elsewhere in more general terms.The goal here is to present and advocate the use of confidence distributions in the specific context of treatment effects for clinical trials, and to demonstrate their application in some illustrative examples.

Definition
We consider a two-arm randomized trial with fixed total sample size n allocated randomly between a control group with sample size n0 and a treatment group with sample size n1, where the fixed allocation ratio is r = n0∕n1. Extensions of this basic fixed design are discussed in Section 6. For individual i = 1, …, n the outcome is Yi and the random group allocation is ti, which takes the value 0 for control and 1 for treatment. The treatment effect parameter of primary interest is a scalar θ ∈ Θ, with nuisance parameter vector β and combined parameter vector λ = (θ, β) ∈ Λ. The Yi are independent and identically distributed, and are modeled using a frequentist model f(yi | λ, ti) which could represent a density or probability function depending on whether Yi is discrete or continuous. The data, consisting of the outcome and random group allocation for each individual, will be denoted Y = {(Y1, t1), …, (Yn, tn)}. We will assume that the parameter spaces Θ and Λ are convex, which means that we are not dealing with discrete parameters and the corresponding confidence distributions are continuous. This is appropriate for typical treatment effect measures and models used in clinical trials. Consider C(θ | Y) for some non-decreasing right-continuous function C ∶ Θ → [0, 1] which depends on the data Y and is therefore a random variable. Following References 14-16, the function C(θ | Y) is defined to be the cumulative distribution function of the confidence distribution for the treatment effect θ if

C(θ | Y) ∼ Uniform[0, 1] when evaluated at the true value of θ.  (1)

Since Y has a distribution that depends on both θ and the nuisance parameter β, (1) must hold for whatever is the true value of λ = (θ, β) ∈ Λ.
We will refer to C(θ | Y) as the confidence distribution function, which is one of three key functions associated with the confidence distribution. The other two are the confidence density function

c(θ | Y) = dC(θ | Y)∕dθ

and the confidence curve

CC(θ | Y) = |1 − 2C(θ | Y)|.

Next we consider the motivation for these three key summaries of the confidence distribution.
The technical definition of a confidence distribution can be made more intuitive by examining the properties of functions that satisfy (1). Notice that (1) implies

{θ ∶ C(θ | Y) ≤ 1 − α}

is a one-sided confidence interval with confidence level 1 − α. Thus, the confidence distribution function is the function constructed from the set of all one-sided confidence intervals of level 1 − α ∈ [0, 1]. The confidence distribution function therefore generalizes the notion of a single confidence interval with a single confidence level, by providing an entire probability distribution that summarizes uncertainty in θ using confidence intervals of all possible levels.
The confidence density is the density function associated with the confidence distribution function. Thus, it provides the same information but may be considered a more natural scale on which to express uncertainty about θ. Likewise, the confidence curve re-expresses the one-sided information contained in the confidence distribution function, providing all two-sided confidence intervals with confidence level 1 − α ∈ [0, 1] using the intervals

{θ ∶ CC(θ | Y) ≤ 1 − α}.  (2)

Note that the function p(θ | Y) = 1 − CC(θ | Y) is often referred to as the p-value function because for any given θ0 ∈ Θ, p(θ0 | Y) provides the p-value for a test of the null hypothesis H0 ∶ θ = θ0 against a two-sided alternative.18,25 The two-sided confidence interval in (2) may be equivalently expressed as {θ ∶ p(θ | Y) ≥ α}, which can be interpreted as all values θ0 that would not lead to rejection of H0 at significance level α. In this paper we will use confidence curves rather than p-value functions, although the information contained in the two is equivalent.
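As a concrete sketch of these relationships, the following Python fragment builds a normal-approximation confidence distribution function, confidence curve and p-value function, and checks that the confidence curve recovers the familiar two-sided Wald interval. The estimate and standard error are hypothetical values chosen for illustration, not taken from any trial.

```python
from scipy.stats import norm

# Hypothetical estimate and standard error (illustrative only)
theta_hat, se = 0.3, 0.1

def C(theta):
    """Confidence distribution function under a normal approximation."""
    return norm.cdf((theta - theta_hat) / se)

def CC(theta):
    """Confidence curve: CC(theta) = |1 - 2 C(theta)|."""
    return abs(1 - 2 * C(theta))

def pvalue(theta0):
    """Two-sided p-value function: p(theta0) = 1 - CC(theta0)."""
    return 1 - CC(theta0)

# The set {theta : CC(theta) <= 1 - alpha} is the two-sided
# (1 - alpha) confidence interval; its endpoints are the alpha/2
# and 1 - alpha/2 quantiles of the confidence distribution.
alpha = 0.05
lo = theta_hat + se * norm.ppf(alpha / 2)
hi = theta_hat + se * norm.ppf(1 - alpha / 2)
print(round(lo, 3), round(hi, 3))  # the familiar 95% Wald interval
```

The confidence curve evaluates to exactly 1 − α at both interval endpoints, which is the sense in which it encodes two-sided intervals of every level at once.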

Construction
A confidence distribution for θ was defined in (1) but we have not yet considered how to actually construct a confidence distribution. Pivots play an important role in the construction of confidence distributions. In the current context, a pivot is a function P ∶ Θ → ℜ of the treatment effect θ that depends on the data Y and is written as P(θ | Y).
The defining property of a pivot is that its distribution is independent of the combined parameter vector λ. Whenever such a pivot is available then a confidence distribution may be constructed by using the standard uniform transformation to obtain the uniform distribution required by (1). This is particularly useful in the context of large samples where one can appeal to asymptotic normality, as we now consider. In large samples the treatment effect θ is often estimated using the observed value of a consistent estimator θ̂n which satisfies

√n (θ̂n − θ) →d N(0, V(λ))  (3)

for some function V of the treatment effect θ and possibly the nuisance parameters β. Various examples of (3) are provided in Section 4. Assuming we have available a consistent estimator β̂n of β and V is continuous, then (3) provides the large sample pivot

P(θ | Y) = √n (θ̂n − θ) ∕ √V(λ̂n)  (4)

where λ̂n = (θ̂n, β̂n). Defining Φ to be the standard normal cumulative distribution function, it follows that for large n

C(θ | Y) = Φ( √n (θ − θ̂n) ∕ √V(λ̂n) )  (5)

satisfies (1) for whatever is the true value of λ ∈ Λ. That is, C(θ | Y) in (5) is a non-decreasing right-continuous function satisfying the requirements of a confidence distribution function defined by (1), with the associated confidence density

c(θ | Y) = φ( √n (θ − θ̂n) ∕ √V(λ̂n) ) √n ∕ √V(λ̂n)  (6)

where φ is the standard normal density function.
The functions C(θ | Y) and c(θ | Y) in (5) and (6) are very straightforward to construct and illustrative examples are provided in Figure 1. For the purposes of illustration, these examples are based on a scenario in which negative values of the treatment effect θ correspond to benefit and the observed estimate is θ̂n = −1.5 with standard error ŝn = 1.75, where ŝn² = V(λ̂n)∕n. From Panel A of Figure 1 it can be seen how the confidence distribution function is constructed from all one-sided confidence intervals of level 1 − α ∈ [0, 1]. Panel B illustrates the confidence associated with treatment benefit as an area under the confidence density, which is discussed in detail in Section 3.3.
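The computations behind this illustration are elementary. A minimal sketch in Python, using the illustrative values θ̂n = −1.5 and ŝn = 1.75 from the text (the resulting confidence of benefit is our own calculation, not a figure reported in the paper):

```python
from scipy.stats import norm

# Illustrative values from the text; benefit corresponds to theta < 0
theta_hat, s_hat = -1.5, 1.75

# Confidence distribution function (5) and confidence density (6),
# written in terms of the standard error s_hat = sqrt(V/n)
C = lambda theta: norm.cdf((theta - theta_hat) / s_hat)
c = lambda theta: norm.pdf((theta - theta_hat) / s_hat) / s_hat

# Confidence attached to benefit = area under the confidence
# density to the left of 0, which is simply C(0)
conf_benefit = C(0.0)
print(round(conf_benefit, 3))  # prints 0.804
```

This is the quantity shaded in Panel B of Figure 1: the one-sided confidence interval (−∞, 0) has confidence level C(0).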
Large sample normality is not the only way to construct confidence distributions. For specific parametric models, pivots often exist which can be used to construct confidence distributions in an analogous fashion to (5) for any sample size. In Section 4 we will consider this approach using a linear normal model, which leads to a confidence distribution based on the t-distribution. Other small sample approaches using pivots or other methods for specific models also exist; see for example section 3.2 of Schweder and Hjort.16 When the large sample normal approximation specified by (3) is not a good approximation, then the defining property (1) will not hold and C(θ | Y) will not specify a valid confidence distribution. This is analogous to a confidence interval being invalid because it does not have the desired coverage probability. One remedy in this situation is to use second order corrections to provide a more accurate approximation to the sampling distribution and thereby improve the performance of the confidence distribution. Such methods are analogous to methods used in other inferential contexts and have been discussed in some detail for confidence distributions by Schweder and Hjort16 (see chapter 7). Alternatively, bootstrapping provides a flexible and more general approach for constructing confidence distributions, without the need to rely on asymptotic normality.
The use of bootstrapping to construct confidence distributions was first described by Hall10 under the name confidence pictures, and was also supported by Efron11 before being subsequently discussed in detail by various other authors.14,16 Indeed, Hall's confidence pictures represent an independent re-discovery of confidence distributions as a tool to construct posterior-like parameter distributions compatible with frequentist inference, which is a further level of support for the intuition behind confidence distributions. The rationale for using bootstrapping is that it is a tool for constructing confidence intervals, so it can be used to construct confidence intervals of all possible confidence levels. In principle this collection of confidence intervals can then be synthesized to construct a confidence distribution. However, since there are many different versions of bootstrapping, there are many possible ways in which bootstrapping can be used to construct a confidence distribution.
Following Hall10 and Xie and Singh14 we will consider bootstrapping applied to the standardized pivot used in (4), which is referred to as the bootstrap-t or percentile-t method. Sampling n observations with replacement from the data Y to produce a bootstrap sample leads to bootstrap replicates of θ̂n and λ̂n, which are denoted θ̂n* and λ̂n*, respectively. The bootstrap method of constructing confidence distributions involves estimating the distribution of P(θ | Y) in (4) using the bootstrap distribution

G(z) = Pr( √n (θ̂n* − θ̂n) ∕ √V(λ̂n*) ≤ z ).  (7)

Although G(z) is unknown in general, a bootstrap estimate Ĝ(z) can be obtained by generating B bootstrap samples and estimating the probability in (7) by the observed proportion of the B samples that satisfy the inequality. The confidence distribution function is then

C(θ | Y) = 1 − Ĝ( √n (θ̂n − θ) ∕ √V(λ̂n) )  (8)

which is the bootstrap analogue of (5). This confidence distribution function can be used to generate a confidence curve; however, since (8) is not differentiable it does not immediately yield a confidence density. Nonetheless, application of a kernel density smoother to (8) enables graphical display of the confidence density.
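A sketch of this bootstrap-t construction in Python, using simulated two-arm data with a mean-difference treatment effect. The data, sample sizes and number of bootstrap samples B are illustrative choices, the two arms are resampled separately (a design choice for a two-sample problem), and the standard error plays the role of √(V(λ̂n)∕n):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated two-arm trial (all values illustrative, not from the paper)
y0 = rng.normal(0.0, 1.0, size=100)   # control outcomes
y1 = rng.normal(-0.5, 1.0, size=100)  # treatment outcomes

def estimate(a, b):
    """Return (theta_hat, standard error) for the difference in means."""
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return b.mean() - a.mean(), se

theta_hat, s_hat = estimate(y0, y1)

# Bootstrap replicates of the studentized pivot, resampling each arm
B = 2000
pivots = np.empty(B)
for i in range(B):
    b0 = rng.choice(y0, size=len(y0), replace=True)
    b1 = rng.choice(y1, size=len(y1), replace=True)
    t_star, s_star = estimate(b0, b1)
    pivots[i] = (t_star - theta_hat) / s_star

# Bootstrap confidence distribution function, the analogue of (8):
# C(theta) = 1 - G_hat((theta_hat - theta) / s_hat)
def C(theta):
    return 1.0 - np.mean(pivots <= (theta_hat - theta) / s_hat)
```

The step function C can then be smoothed with a kernel density estimator applied to its increments if a confidence density is required for display.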

Interpretation
Like a posterior distribution, the confidence distribution is a probability distribution over the parameter space.Regions of the parameter space having higher probability density under this confidence distribution will have higher confidence attached to them.Thus, just as a confidence interval may be considered analogous to a Bayesian credible interval, a confidence distribution may be considered analogous to a Bayesian posterior distribution.Appealing to the familiar analogy between confidence intervals and credible intervals is a useful device for appreciating the similarities and differences between confidence distributions and posterior distributions.
Based on this interpretation, the confidence distribution for θ can be used to make probability statements about the treatment effect that are analogous to probability statements using a Bayesian posterior distribution. We will use the term confidence probability statement to refer to any probability statement based on the confidence distribution for θ. Since the confidence distribution is a probability distribution over the parameter space, a confidence probability statement is a probability statement about a region of the parameter space. It is important when making such statements that it is understood that these are not frequentist probability statements. Frequentist probability statements are aleatory in nature and apply only to the outcomes of random phenomena. Instead, confidence probability statements are epistemic, in that they use a probability distribution to quantify uncertainty about the treatment effect. Confidence probability statements about the treatment effect in clinical trials should be seen as distinct from, but existing in parallel to, frequentist probability statements about random phenomena. This dichotomy does not exist in Bayesian inference and it provides some advantages. In particular, it avoids interpreting treatment effects as random variables when we make probability statements about them, which is common in Bayesian inference but which is a misinterpretation that has been refuted by various authors; see for example Greenland.26
Since probability statements based on confidence distributions are distinct from standard frequentist probability statements, notation that emphasizes the distinction is desirable. For example, once we have observed the data Y = y, a probability statement quantifying the confidence that the treatment is beneficial could be written as

Conf(BENEFIT) = 0.93.

This would be analogous to the Bayesian posterior probability statement Pr(BENEFIT) = 0.93 referred to in Section 1 and would be expressed as: there is 93% confidence that the treatment is beneficial. Using the dual interpretations of probability, such a statement provides a probability statement about the treatment effect that is fully compatible with frequentist inference. That is, by adding confidence distributions to frequentist inference one obtains a distributional summary of the treatment effect while retaining all the standard features of a frequentist analysis. The desire for a distributional summary of the treatment effect is often a motivation for using a Bayesian analysis. However, an alternative is to use confidence distributions alongside a frequentist analysis, providing a distributional summary of the treatment effect while avoiding the process of specifying a Bayesian prior distribution.
For interpreting a confidence probability statement about a subset of the parameter space, it is useful to appeal to the familiar notion of a confidence interval.The confidence attached to any subset of the parameter space is the confidence level of a confidence interval that coincides with that subset.Thus, the quantity Conf(BENEFIT) discussed above is the confidence level of a confidence interval that coincides with the region of benefit within the parameter space.Likewise, in an equivalence trial, Conf(EQUIVALENCE) would be the confidence level of a confidence interval that coincides with the region of equivalence within the parameter space, while Conf(NONINFERIORITY) would have an analogous interpretation in a non-inferiority trial.Any other region of the parameter space can also have a confidence probability statement attached to it and this can be interpreted similarly.
Confidence probability statements also make the interpretation of confidence intervals and p-values more intuitive. Teachers of statistics go to great lengths to avoid students interpreting confidence intervals and p-values as probability statements about the parameter or the null hypothesis, despite the strong intuition to do so. Instead, a less intuitive repeated sampling interpretation is usually taught. Confidence distributions allow the underlying intuition to be better accommodated. Thus, a 95% confidence interval of (0.6, 0.9) for a hazard ratio is indeed a statement that the probability of the hazard ratio being between 0.6 and 0.9 is 0.95, albeit not a frequentist probability statement. It is a confidence probability statement. Likewise, 1 minus the p-value p for testing a composite null hypothesis H0 ∶ θ ≤ θ0 versus a composite alternative hypothesis H1 ∶ θ > θ0 can be interpreted as the probability that the null hypothesis is false, as long as one interprets it as a confidence probability statement. This is reflected in a statement such as

Conf(θ > θ0) = 1 − p.

Confidence probability statements such as these are likely to be more intuitive than the equivalent repeated sampling interpretations and they endow frequentist inference with an analogue of Bayesian posterior probability statements about hypotheses.

CLINICAL TRIAL MODELS
In this section, we consider a range of treatment effect measures that are commonly used in clinical trials, along with the corresponding parameters λ = (θ, β), outcomes Yi and models f(yi | λ, ti). In each case, key quantities will be identified that are needed to construct a confidence distribution for the treatment effect and we will consider the various ways in which the confidence distribution can be constructed. Then, in Section 5, the application of confidence distributions will be illustrated for analyzing data from real clinical trials.

Risk differences
Suppose the outcome Yi is binary and takes the value 1 for an event and 0 otherwise. Assuming the additive model

Pr(Yi = 1 | ti) = β + θ ti

then θ measures the treatment effect using a risk difference while the control risk β is a nuisance parameter, with λ = (θ, β). The numbers of events in each group are

Z0 = Σ{i ∶ ti = 0} Yi and Z1 = Σ{i ∶ ti = 1} Yi

with corresponding observed risks Ȳ0 = Z0∕n0 and Ȳ1 = Z1∕n1. Recall that the allocation ratio n0∕n1 is a fixed value r.
Then the estimated risk difference θ̂n = Ȳ1 − Ȳ0 satisfies (3) with

V(λ) = (1 + 1∕r) β(1 − β) + (1 + r)(β + θ)(1 − β − θ)

where β and β + θ are the control and treatment risks, respectively. The large sample confidence distribution function is then given by (5) or the bootstrap version can be constructed using (8).
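As a numerical sketch, the large sample confidence distribution for a risk difference can be computed directly from the group risks and their estimated binomial variances. The 2×2 counts below are hypothetical, not from any study:

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical trial counts (illustrative only): events / sample size
n0, z0 = 200, 50   # control
n1, z1 = 200, 35   # treatment

p0, p1 = z0 / n0, z1 / n1
theta_hat = p1 - p0  # estimated risk difference
# Estimated standard error, i.e. sqrt(V(lambda_hat)/n)
s_hat = sqrt(p0 * (1 - p0) / n0 + p1 * (1 - p1) / n1)

# Large sample confidence distribution function, as in (5)
def C(theta):
    return norm.cdf((theta - theta_hat) / s_hat)

# Confidence that the treatment reduces risk (theta < 0)
conf_benefit = C(0.0)
```

With these counts the confidence that the treatment reduces risk is about 0.97, which is simply the one-sided confidence interval (−∞, 0) read off the confidence distribution.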

Odds ratios
If the binary model from Section 4.1 is modified to

Pr(Yi = 1 | ti) = β e^(θ ti) ∕ (1 + β e^(θ ti))  (9)

then θ measures the treatment effect using a log odds ratio while the control odds β is a nuisance parameter. Letting

θ̂n = log[ Z1(n0 − Z0) ∕ {Z0(n1 − Z1)} ] with V(λ̂n)∕n = 1∕Z0 + 1∕(n0 − Z0) + 1∕Z1 + 1∕(n1 − Z1),  (10)

these can be used in (5) and (8) to obtain the desired confidence distribution function. Note that it is straightforward to transform the confidence distribution if it is preferred to have it on the odds ratio scale rather than the log odds ratio scale. An alternative approach to constructing the confidence distribution for a (log) odds ratio is to use a logistic regression fit to estimate θ and its standard error, with V(λ̂n)∕n equal to the squared standard error. As described in Section 4.6, this approach also enables the confidence distribution to be adjusted for baseline characteristics. Furthermore, this approach can be extended to accommodate an ordinal outcome in which Yi takes values 0, 1, …, K. Using the proportional odds model with k = 0, …, K − 1, (9) can be generalized to

Pr(Yi ≤ k | ti) = βk e^(θ ti) ∕ (1 + βk e^(θ ti))

in which case θ is again the log odds ratio and the nuisance parameter is a vector β = (β0, …, βK−1). This nuisance parameter vector is subject to the parameter constraints 0 < β0 < ⋯ < βK−1. A proportional odds regression fit can then be used to construct a confidence distribution for θ. In Section 5 the construction of confidence distributions based on proportional odds models is illustrated using data analysis examples.
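A sketch of the direct 2×2 construction for the log odds ratio, using hypothetical counts and the standard large-sample (Woolf) variance formula; the change of variables to the odds ratio scale is a one-line transformation of the confidence distribution function:

```python
from math import log, sqrt
from scipy.stats import norm

# Hypothetical 2x2 counts (illustrative only)
n0, z0 = 200, 50   # control: events / sample size
n1, z1 = 200, 35   # treatment

# Estimated log odds ratio and its standard error (Woolf's formula)
theta_hat = log((z1 * (n0 - z0)) / (z0 * (n1 - z1)))
s_hat = sqrt(1/z1 + 1/(n1 - z1) + 1/z0 + 1/(n0 - z0))

def C_log_or(theta):
    """Confidence distribution function on the log odds ratio scale."""
    return norm.cdf((theta - theta_hat) / s_hat)

def C_or(odds_ratio):
    """The same confidence distribution, re-expressed on the OR scale."""
    return C_log_or(log(odds_ratio))

# Confidence that the odds ratio is below 1
conf_or_below_1 = C_or(1.0)
```

Because the transformation is monotone, confidence statements such as Conf(OR < 1) are unchanged by moving between the two scales.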

Relative risks
The model for odds ratios may be modified straightforwardly to provide relative risks using

Pr(Yi = 1 | ti) = β e^(θ ti)

in which case θ is the log relative risk. A confidence distribution for the log relative risk may then be constructed from (5) or (8) after modifying (10) using the standard results for relative risks,

θ̂n = log(Ȳ1∕Ȳ0) with V(λ̂n)∕n = (n0 − Z0)∕(n0 Z0) + (n1 − Z1)∕(n1 Z1).
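The corresponding computation for the log relative risk, again with hypothetical counts and the standard large-sample variance of the log relative risk:

```python
from math import log, sqrt
from scipy.stats import norm

# Hypothetical 2x2 counts (illustrative only)
n0, z0 = 200, 50   # control: events / sample size
n1, z1 = 200, 35   # treatment

p0, p1 = z0 / n0, z1 / n1
theta_hat = log(p1 / p0)  # estimated log relative risk
# Standard large-sample variance of the log relative risk
s_hat = sqrt((1 - p1) / (n1 * p1) + (1 - p0) / (n0 * p0))

# Confidence distribution function on the log relative risk scale
def C(theta):
    return norm.cdf((theta - theta_hat) / s_hat)

# Confidence that the relative risk is below 1
conf = C(log(1.0))
```

As with the odds ratio, the distribution can be transported to the relative risk scale by evaluating C at the log of the desired relative risk.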

Mean differences
If the outcome Yi is continuous then a natural model is

Yi = β + θ ti + εi, with εi ∼ N(0, σ²)  (11)

where θ measures the treatment effect as a mean difference. Allowing for the two nuisance parameters β and σ², the overall parameter vector is λ = (θ, β, σ²). Generalizations allowing for heteroscedasticity are of course also possible.
If Ȳ0 and Ȳ1 are the sample means from the control and treatment groups, respectively, then the components of λ̂n are

θ̂n = Ȳ1 − Ȳ0, β̂n = Ȳ0 and the pooled variance estimate σ̂n².

Using the fact that Var(θ̂n) = σ²(1∕n0 + 1∕n1), a finite sample pivot is defined by

P(θ | Y) = (θ̂n − θ) ∕ ŝn ∼ t(n − 2), where ŝn² = σ̂n² (1∕n0 + 1∕n1).

Defining Tν to be the cumulative distribution function of the t-distribution with ν degrees of freedom, the finite sample version of (5) yields the confidence distribution function

C(θ | Y) = Tn−2( (θ − θ̂n) ∕ ŝn ).  (12)

The normality assumption in (11) is of course unnecessary with a large sample size, in which case Tn−2 is replaced by Φ in (12) to yield a large sample confidence distribution function.
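A sketch of this t-based construction on simulated continuous outcomes (the data are illustrative, generated only to show the mechanics):

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(1)

# Hypothetical continuous outcomes (illustrative only)
y0 = rng.normal(10.0, 2.0, size=30)  # control
y1 = rng.normal(8.5, 2.0, size=30)   # treatment
n0, n1 = len(y0), len(y1)

theta_hat = y1.mean() - y0.mean()
# Pooled variance estimate with n0 + n1 - 2 degrees of freedom
pooled = ((n0 - 1) * y0.var(ddof=1) + (n1 - 1) * y1.var(ddof=1)) / (n0 + n1 - 2)
s_hat = np.sqrt(pooled * (1 / n0 + 1 / n1))

# Finite sample confidence distribution function, as in (12)
def C(theta):
    return t_dist.cdf((theta - theta_hat) / s_hat, df=n0 + n1 - 2)
```

For any sample size under model (11) this is an exact confidence distribution, and it converges to the normal version as the degrees of freedom grow.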

Hazard ratios
When the outcome Y i is a right-censored time to event then it is common to measure the treatment effect using a hazard ratio.Direct estimation of the standard error of the log hazard ratio is possible using just the observed number of events, which could then be used in constructing the confidence distribution assuming large sample normality.However, in practice time to event endpoints are typically analyzed using a Cox regression model which yields a standard error estimate for the log hazard ratio.Together with the estimated hazard ratio, this standard error can be used in the same manner as was described in Section 4.2 for binary and ordinal endpoints.This approach has the added advantage that it can easily incorporate adjustment for baseline characteristics, as we now consider.
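Since a Cox model fit reports the estimated log hazard ratio and its standard error, a confidence distribution requires only those two numbers. The sketch below instead starts from a reported hazard ratio and 95% confidence interval (hypothetical values, not from any trial) and recovers the standard error from the interval width on the log scale:

```python
from math import log
from scipy.stats import norm

# Hypothetical reported summary from a Cox model fit:
# hazard ratio 0.75 with 95% CI (0.60, 0.94)
hr, lo, hi = 0.75, 0.60, 0.94

theta_hat = log(hr)
# Recover the standard error of the log hazard ratio from the CI width
s_hat = (log(hi) - log(lo)) / (2 * norm.ppf(0.975))

def C(theta):
    """Confidence distribution function on the log hazard ratio scale."""
    return norm.cdf((theta - theta_hat) / s_hat)

# Confidence that the hazard ratio is below 1 (treatment benefit)
conf_benefit = C(log(1.0))
```

This back-calculation trick also makes it possible to display a confidence distribution for a published trial from its reported estimate and interval alone.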

Adjustment for baseline characteristics
Suppose individual i has a covariate vector xi of p baseline characteristics, such as randomization stratification factors, which need to be adjusted for in the analysis. This can be incorporated into the models discussed in this section using a regression model with the basic linear predictor

β0 + θ ti + βA′ xi.  (13)

This leads to a vector of p + 1 nuisance parameters β = (β0, βA) and, depending on the model, there may be additional nuisance parameters, such as the error variance in a linear normal model. Using the basic structure (13), the models discussed in this section have a convenient regression formulation that can be used to estimate the treatment effect θ adjusting for the baseline characteristics. These regression models include logistic regression for adjusted odds ratios, log-binomial regression for adjusted relative risks,27,28 linear regression for adjusted mean differences and Cox regression for adjusted hazard ratios. Perhaps the most challenging context is adjusted risk differences, which requires a linear probability model; however, even in this context there are a variety of regression adjustments that are applicable for clinical trials.29 Using asymptotic properties of these models, the large sample normality required in (3) is still applicable after adjusting for covariates, with V(λ̂n)∕n corresponding to the squared standard error of θ̂n, and confidence distributions can be constructed in a similar fashion to that described in Section 3.2. The bootstrap method of constructing confidence distributions is also applicable.
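As a sketch of covariate adjustment for a continuous outcome, ordinary least squares with a linear predictor of the form (13) yields the adjusted estimate, its standard error and hence a confidence distribution. The data are simulated for illustration, and a real analysis would typically use a standard regression routine rather than hand-coded algebra:

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(2)

# Simulated trial with one baseline covariate x (illustrative only)
n = 200
t = rng.integers(0, 2, size=n)    # 1:1 randomized allocation
x = rng.normal(0.0, 1.0, size=n)  # baseline covariate
y = 5.0 - 1.0 * t + 0.8 * x + rng.normal(0.0, 1.0, size=n)

# Design matrix for the linear predictor beta0 + theta*t + betaA*x
X = np.column_stack([np.ones(n), t, x])
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef
df = n - X.shape[1]
sigma2 = resid @ resid / df
cov = sigma2 * np.linalg.inv(X.T @ X)

theta_hat = coef[1]          # adjusted mean difference
s_hat = np.sqrt(cov[1, 1])   # its standard error

# Confidence distribution for the adjusted treatment effect
def C(theta):
    return t_dist.cdf((theta - theta_hat) / s_hat, df=df)
```

The same recipe applies to the other adjusted effect measures: fit the relevant regression model, take the coefficient on ti and its squared standard error as θ̂n and V(λ̂n)∕n, and plug them into (5).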

HERO-2 trial
The HERO-2 trial was a study of anti-coagulation therapy in individuals with acute myocardial infarction (heart attack), comparing the thrombin-inhibitor bivalirudin with heparin. 30 The primary endpoint was the binary outcome of 30-day mortality (dead or alive). In addition, bleeding severity on a 5-point ordinal scale was also assessed as a secondary safety endpoint. The ordinal scale may be summarized as: no bleeding (1); minor bleeding (2); moderate bleeding (3); major bleeding (4); and intra-cranial haemorrhage (5). Here we provide illustrative analyses of these two endpoints, which yield confidence distributions for risk differences, relative risks and odds ratios. The HERO-2 trial was very large, with n = 17073 participants. It is therefore expected that Bayesian and frequentist approaches would yield highly consistent results. The purpose of this case study is to provide an illustration of the construction of confidence distributions in clinical trials for a range of effect measures, and the manner in which they can be used as a distributional summary of the treatment effect.
Of 8516 patients randomized to bivalirudin, 919 died by day 30. Of 8557 patients randomized to heparin, 931 died by day 30. Table 1 summarizes the risk difference and the (log) relative risk and odds ratio, with negative values corresponding to less mortality on bivalirudin. Also provided are various quantities required in the computation of confidence distributions using (5) and (6). This yields the confidence distribution functions and confidence densities for the treatment effect, which are plotted in Figure 2 for the risk difference and relative risk. The HERO-2 trial was designed as a superiority trial; however, given the almost equivalent mortality in the two arms, it is useful to illustrate how equivalence might be assessed using a confidence distribution. For example, suppose that equivalence had been specified as a risk difference of less than 1% in absolute value. Panel B of Figure 2 illustrates computation of the confidence of equivalence based on this equivalence margin. Likewise, for relative risks and odds ratios, suppose that equivalence had been specified as a treatment effect in the range from 0.9 to 1.1. Panel D of Figure 2 illustrates the confidence computation for relative risk, while Table 1 presents the corresponding numerical values for each effect measure, which all exceed 95% for the confidence of equivalence.

TABLE 1 Treatment effects for 30-day mortality in the HERO-2 trial, including quantities used to compute confidence distributions using (5) and (6) with n = 17073 and plotted in Figure 2. Note: For risk difference, Conf(EQUIV) is the confidence that there is no greater than a 1% absolute difference in risk, while for relative risk and odds ratio it is the confidence that the treatment effect is between 0.9 and 1.1.
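The equivalence computations just described can be reproduced directly from the mortality counts. The following minimal sketch (in Python, whereas the paper's analyses used R) builds the normal-approximation confidence distributions for the risk difference and log relative risk and evaluates the confidence of equivalence for the stated margins:

```python
from math import log, sqrt
from statistics import NormalDist

# HERO-2 30-day mortality: 919/8516 deaths on bivalirudin, 931/8557 on heparin
z1, n1 = 919, 8516   # bivalirudin
z0, n0 = 931, 8557   # heparin
p1, p0 = z1 / n1, z0 / n0

# Risk difference and its large-sample standard error
rd = p1 - p0
se_rd = sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
cd_rd = NormalDist(mu=rd, sigma=se_rd)

# Confidence of equivalence for a +/- 1% risk-difference margin
conf_equiv = cd_rd.cdf(0.01) - cd_rd.cdf(-0.01)

# Log relative risk analogue with the (0.9, 1.1) margin
log_rr = log(p1) - log(p0)
se_log_rr = sqrt((1 - p1) / z1 + (1 - p0) / z0)
cd_rr = NormalDist(mu=log_rr, sigma=se_log_rr)
conf_equiv_rr = cd_rr.cdf(log(1.1)) - cd_rr.cdf(log(0.9))
```

Both equivalence confidences exceed 95%, in line with the values reported in Table 1.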

TABLE 2
The bleeding severity data are summarized in Table 2, which provides the counts of each bleeding severity category. Individuals with multiple bleeding events are allocated to the category of their most severe bleed, which explains some minor discrepancies with counts displayed in the primary study publication, where counts of all bleeding events were presented. 30 The cumulative odds ratios are also displayed in Table 2, with values greater than 1 indicating more bleeding on the bivalirudin arm. The cumulative odds ratios in Table 2 are displayed so that the value for category k is the ratio between treatments of the odds of having a bleed severity less than or equal to k (hence no odds ratio for category 5). The similarity of the crude cumulative odds ratios suggests that the proportional odds model would be appropriate, as this assumes each of the cumulative odds ratios is identical, with the common value being the desired treatment effect parameter. Panel D of Figure 3 displays the crude cumulative odds ratios (squares) and 95% confidence intervals (lines) together with the combined odds ratio and 95% confidence interval from the proportional odds model (diamond). Odds ratios are plotted on the log scale.
The confidence distribution of the log odds ratio for comparing bivalirudin with heparin, based on the proportional odds model, was constructed using both an assumption of large sample normality and using bootstrapping, as outlined in Sections 3.2 and 4.2. The Bayesian posterior distribution was also constructed, using a flat prior over the constrained proportional odds parameter space described in Section 4.2. This analysis used Markov chain Monte Carlo (MCMC) sampling as implemented in the stan_polr function within the rstanarm package in R. 31 Frequentist proportional odds models were fitted in R using the polr function within the MASS package that is distributed with R. The results of both types of analyses are displayed in Figure 3, with code available in the Supporting Information.
It can be seen from Panel A of Figure 3 that the confidence distribution and Bayesian posterior distribution are very similar, which is not surprising given the very large sample size. It is also evident from Panel B that the confidence distributions based on normality and bootstrapping are very similar. The overall treatment effect estimate and 95% confidence interval are displayed as the diamond in Panel D, with an odds ratio of 1.48 and confidence interval of 1.36-1.62, indicating evidence of increased bleeding on bivalirudin. This overall treatment effect from the proportional odds model is a synthesis of the crude odds ratios displayed for each bleeding severity in Panel D, where it is seen that the proportional odds assumption is appropriate. The confidence curve in Panel C displays confidence intervals of all confidence levels, with the 95% interval from Panel D represented as the confidence curve width at a value of 0.95 on the vertical axis. The confidence distribution function and density function in Panels A and B supplement the frequentist estimate and confidence interval with a full distributional summary of the treatment effect. In this sense they are analogous to the Bayesian posterior distribution for the treatment effect, while being fully compatible with the usual frequentist summaries.
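The bootstrap construction of a confidence distribution can be sketched compactly. To keep the example short, the sketch below uses a hypothetical two-arm binary endpoint with the log odds ratio as the treatment effect, rather than the proportional odds model; the principle is the same: resample the data, recompute the estimate, and use the empirical distribution of the bootstrap estimates as the confidence distribution.

```python
import random
from math import log

random.seed(1)

def boot_cd_log_or(z1, n1, z0, n0, B=1000):
    """Bootstrap approximation to the confidence distribution of the log odds
    ratio: resample each arm, recompute the estimate each time, and return the
    sorted estimates; their empirical CDF approximates C(theta | data)."""
    draws = []
    for _ in range(B):
        z1b = sum(random.random() < z1 / n1 for _ in range(n1))
        z0b = sum(random.random() < z0 / n0 for _ in range(n0))
        # 0.5 continuity correction guards against zero cells in a resample
        draws.append(log(((z1b + 0.5) * (n0 - z0b + 0.5)) /
                         ((n1 - z1b + 0.5) * (z0b + 0.5))))
    return sorted(draws)

def conf(draws, theta):
    """Empirical confidence C(theta): proportion of bootstrap draws <= theta."""
    return sum(d <= theta for d in draws) / len(draws)

# Hypothetical trial: 40/300 events on treatment versus 60/300 on control
draws = boot_cd_log_or(40, 300, 60, 300)
conf_benefit = conf(draws, 0.0)  # confidence that the log odds ratio is < 0
```

In practice the resampling and refitting would be applied to whatever model yields the treatment effect, such as the proportional odds fit used above.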

REMAP-CAP trial
The REMAP-CAP trial is an ongoing multi-domain platform trial in individuals with severe community acquired pneumonia requiring hospitalization in intensive care. The trial has many treatment domains with flexible sample sizes and an adaptive design.

(Note to Table 3: The assumed prior parameters specify a 22-dimensional Dirichlet distribution which partly defines the prior distribution for the Bayesian proportional odds model.)

Here we consider an illustrative re-analysis of the corticosteroid domain using published data on hydrocortisone treatment for Covid-19 infection. 5 The illustrative analysis will be restricted to the two-treatment comparison of fixed duration hydrocortisone and no hydrocortisone. The primary endpoint is organ support-free days (OSFD) in the first 3 weeks, considered as an ordinal endpoint. OSFD ranged from 0 to 20 days and individuals who died prior to discharge were assigned an OSFD value of −1. This leads to a 22-point ordinal scale ranging from −1 through 20. The pre-specified primary analysis used a Bayesian proportional odds model so it is of interest to see how an analysis based on confidence distributions compares to the Bayesian analysis.
The prior distribution assumptions used in the Bayesian proportional odds model for REMAP-CAP are very complex. These assumptions are available in a separate public domain document detailing the full statistical model. 32 Part of the complexity stems from the large number of categories in the ordinal scale, which necessitates a 22-dimensional prior distribution, specified using a Dirichlet model that incorporates the required parameter constraints of the proportional odds model. Table 3 details the pre-specified prior assumptions used in the REMAP-CAP primary analysis, together with the ordinal outcome data, which have been digitally reconstructed from the published graphs. 33 Values greater than 1 for odds ratios are in the direction of benefit for hydrocortisone treatment.
In addition to the prior distribution parameters in Table 3 specifying the Dirichlet model, additional complexity in the prior distribution model comes from adjustment of the primary analysis for various factors.These include: age (6 levels); sex at birth (2 levels); time era (6 levels); and site.The treatment effect also requires a prior distribution.There are consequently dozens of parameters governing the prior distribution model, each of which must have a specific numerical value assumed.It is clearly of great interest to consider ways in which the advantages of posterior probability statements could be retained while circumventing such complex prior distribution modeling.
FIGURE 4 Confidence distribution for the treatment effect in the REMAP-CAP trial data from Table 3. Panel A provides the confidence distribution function, with Conf(BENEFIT) being the confidence that fixed duration hydrocortisone is beneficial relative to no hydrocortisone and Pr(BENEFIT) being the corresponding Bayesian posterior probability of benefit. Panel B displays the confidence density and Panel C displays the confidence curve, together with the 95% confidence interval (CI) and Bayesian credible interval (CrI). Panel D displays the crude cumulative odds ratios. Odds ratios (OR) are plotted on the log scale.
The main summary of the published Bayesian primary analysis is a 93% posterior probability that fixed duration hydrocortisone treatment is beneficial compared to no hydrocortisone treatment. 5 A probability statement such as this is not possible using a standard frequentist analysis, but it is often considered by researchers to be an intuitive way to summarize the evidence for a treatment effect. By using a confidence distribution it is possible to make an analogous probability statement in a way that is compatible with a frequentist analysis but does not require the very complex prior distribution specification that was needed for the Bayesian analysis.
Figure 4 provides the confidence distribution for the treatment effect based on the data in Table 3. The maximum likelihood estimate (MLE) of the treatment effect odds ratio is virtually identical to the median posterior estimate from the Bayesian analysis: 1.44 and 1.43, respectively. Likewise, the 95% confidence interval and 95% credible interval are almost identical (Panel C, Figure 4). Panels A and B show that the primary Bayesian summary of a 93% posterior probability of superiority of hydrocortisone treatment is almost identical to the analogous statement using the confidence distribution: there is 94% confidence that fixed duration hydrocortisone treatment is superior to control. Thus, there is a very close correspondence between the results obtained from the confidence distribution and the originally published primary analysis results based on the Bayesian model.
It is important not to over-interpret the similarity of the confidence distribution and the Bayesian posterior distribution, since the published Bayesian analysis includes adjustment for many factors not included in the confidence distribution analysis. These adjustments could have been accommodated in the confidence distribution if individual participant data (IPD) were used, as discussed in Section 4.6. A further consideration is that the proportional odds model may even be questionable for these data, since Panel D of Figure 4 suggests there could be an increasing trend in the cumulative odds ratios as OSFD increases. Thus, these confidence distribution analyses are intended to be illustrative, not definitive. The main message from this illustrative analysis is that probability statements about the treatment effect are not restricted to Bayesian analyses; they are also available using confidence distributions and will typically involve less modeling complexity.

DISCUSSION
This paper has presented confidence distributions from the point of view of clinical trial statisticians. As well as covering the definitions and mechanics of confidence distributions in the context of trials, the presentation is based on the notion that confidence is an alternative type of probability that exists alongside the usual relative frequency interpretation. This provides the basis for probability statements about treatment effects while remaining compatible with frequentist inference principles. Confidence distributions are likely to add most value when the primary analysis of a clinical trial is based on frequentist principles or when there is no relevant prior information to include in a Bayesian analysis.
In practice, the asymptotic normality approach is likely to be the best starting point for trial statisticians, and it should be applicable to most well-powered clinical trials using one of the approaches described in Section 4. Nonetheless, the bootstrap approach provides an all-purpose back-up which does not rely on large sample normality and is applicable in any situation in which the bootstrapping is valid. In a sense this can be seen as the computationally intensive analogue of using a Bayesian computational strategy, such as MCMC sampling for posterior distributions. The two approaches based on asymptotic normality and bootstrapping should together provide a broadly applicable methodology for obtaining confidence distributions for treatment effects in clinical trials. That said, there are a number of further issues worthy of discussion.

Multi-arm studies
The presentation was restricted to a scalar treatment effect parameter θ involving the comparison of two treatment arms. When clinical trials have more than two arms, often called multi-arm trials, the treatment effect is a vector and the confidence distribution is a multivariate distribution. In the general case, multivariate confidence distributions have some theoretical challenges that are not fully resolved. 16 Nonetheless, for the large sample base case involving a multivariate normal distribution for the multivariate treatment effect, there is a natural multivariate pivot that generalizes (4) and leads straightforwardly to a multivariate normal confidence distribution. 16 This distribution will often be applicable to well-powered studies and can be used to make simultaneous probability statements about multiple treatment effects. In the non-normal case the theory is less well-developed; however, we should not be dissuaded by the fact that multivariate confidence distribution theory is still in development. This is to be expected given the relative infancy of the field and provides motivation for ongoing research. 34
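For the multivariate normal base case, simultaneous confidence statements can be approximated by Monte Carlo sampling from the multivariate normal confidence distribution. The sketch below uses hypothetical numbers for a three-arm trial: two treatment-versus-control effect estimates whose correlation is induced by the shared control arm.

```python
import random
from math import sqrt

random.seed(2)

# Hypothetical three-arm trial: two treatment-vs-control effect estimates,
# each with standard error 0.2 and correlation 0.5 from the shared control
theta1, theta2, se, rho = 0.5, 0.3, 0.2, 0.5

# Monte Carlo draws from the bivariate normal confidence distribution,
# generating correlated normals via a Cholesky-style construction
B = 100_000
hits = 0
for _ in range(B):
    z1 = random.gauss(0.0, 1.0)
    z2 = rho * z1 + sqrt(1 - rho ** 2) * random.gauss(0.0, 1.0)
    if theta1 + se * z1 > 0 and theta2 + se * z2 > 0:
        hits += 1

# Simultaneous confidence that both treatments are beneficial
conf_both = hits / B
```

The simultaneous confidence is necessarily no larger than either marginal confidence, and the positive correlation from the shared control pulls it close to the smaller of the two.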

Hypothesis testing
Although confidence distributions allow probability statements about hypotheses and are compatible with null hypothesis significance testing, the discussion in this paper is not intended to provide an argument either way on the desirability of hypothesis testing. Frequentist inference provides a framework for making inferential statements based on the relative frequency interpretation of probability, which provides a repeated sampling interpretation of such inferences. Frequentist inference is not inherently linked to the process of comparing a test statistic to a critical value, or a p-value to 0.05. The arguments for or against hypothesis testing are therefore not addressed here and are essentially irrelevant to the discussion. Some authors have favored a departure from the dichotomous inferential approach of hypothesis testing. 35 For those who support this view, it should be pleasing that the augmentation of frequentist inference with confidence distributions provides a tool for facilitating such a departure. In particular, confidence distributions provide a more holistic inferential summary of the treatment effect than typical frequentist summaries such as confidence intervals and p-values. In this sense, as has been emphasized, they are analogous to Bayesian posteriors without the specification of a prior.

Confidence-adaptive trials
It was assumed that design features such as the sample size n and allocation ratio r are fixed. It is now common for clinical trials to have adaptive designs. Adaptive trials include sequential trials, which have decision rules for early stopping due to efficacy, harm or futility, as well as trials with design characteristics that change in response to the accumulating data, such as response-adaptive randomization. Bayesian adaptive trials have become increasingly common and make use of the posterior distribution of the treatment effect to modify design characteristics. Such adaptations include (but are not limited to) early stopping if the posterior probability of superiority exceeds a threshold, or altering the randomization probabilities to favor the treatment with greatest posterior probability of superiority. Given the analogy between confidence and Bayesian posterior probability, it is natural to define a confidence-adaptive trial as a trial that uses the confidence distribution of the treatment effect to modify design characteristics. Confidence could be used to adapt the design in the same way as posterior probability is used in Bayesian adaptive trials. This could include response-adaptive randomization, which would lead to the allocation ratio r becoming a random variable, or sequential stopping rules for benefit, futility or harm based on confidence thresholds, which would lead to the sample size n becoming a random variable. Under standard assumptions about how the adaptations are implemented, if θn is the MLE in a fixed design trial then it would also be the MLE in a confidence-adaptive trial. 36 Many other approaches to treatment effect estimation in adaptive trials have also been suggested and could potentially be modified to the confidence-adaptive context. 37

Confidence-adaptive trials have an important advantage over Bayesian adaptive trials when it comes to sequential decision rules. Decision rules based on confidence fit seamlessly into standard frequentist sequential theory, such as alpha-spending functions. 38 An alpha-spending stopping rule can equivalently be expressed based on a confidence threshold. The design of trials using sequential decision rules based on confidence can therefore take advantage of the rigorous underlying sequential theory to compute stopping boundaries, power and multiple testing adjustments. Sequential decision rules based on Bayesian posterior probabilities do not have an analogous underlying theory, which is why simulation is used so extensively to understand the frequentist operating characteristics of Bayesian adaptive designs. While simulation is also useful in the design of confidence-adaptive trials, the underlying sequential theory can be used more effectively to guide the design and understand its operating characteristics. Given the growing importance of platform trials and other adaptive designs, confidence-adaptive trials are a promising area of future research.
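The equivalence between z-statistic boundaries and confidence thresholds can be sketched directly. The boundary values below are approximate O'Brien-Fleming-type values for three equally spaced looks at an overall two-sided alpha of 0.05, used here purely for illustration:

```python
from statistics import NormalDist

Phi = NormalDist().cdf

# Approximate O'Brien-Fleming z-statistic boundaries for three equally
# spaced interim looks (illustrative values, two-sided alpha of 0.05)
z_bounds = [3.471, 2.454, 2.004]

# Each z boundary translates into an equivalent confidence threshold:
# stop for benefit at look k when Conf(BENEFIT) = Phi(z_k) is exceeded
conf_thresholds = [Phi(z) for z in z_bounds]

def stop_for_benefit(theta_hat, se, look):
    """Stop when the confidence of benefit from the normal-approximation
    confidence distribution exceeds the boundary for this look."""
    conf_benefit = Phi(theta_hat / se)
    return conf_benefit > conf_thresholds[look]
```

The stopping rule is therefore stated entirely in terms of the confidence of benefit, while inheriting the frequentist operating characteristics of the underlying group-sequential boundaries.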

Software and computation
The computations required to construct confidence distributions based on large sample normality are typically straightforward and could be coded directly using the various examples discussed in Section 4. Nonetheless, there are a few R packages that compute confidence distributions and produce graphical displays of the results. The R package pvaluefunctions is available on CRAN 19 and another R package, concurv, is available on GitHub. 20 The bootstrap approach is not implemented in these packages and is more computationally challenging, but can be undertaken using standard bootstrapping code or software. The analyses presented in Section 5 were coded directly in R, with the bootstrap analyses making use of the boot package that is distributed with R.

Prior information
Confidence distributions allow one to circumvent the step of assuming prior information in order to construct a distributional summary of information about the treatment effect. However, this does not mean that relevant prior information must be ignored. Relevant prior information can instead be incorporated by combining the confidence distribution from the current trial with confidence distributions summarizing the prior evidence, using established methods for combining confidence distributions. 39-41 However, this would be a two-step process in which information from the current trial is summarized and interpreted, followed by synthesis with any prior information that may be relevant, such as the results of other randomized trials addressing the same question. As with any process of meta-analysis, this allows the contributions of different information sources to be clearly understood, rather than requiring the prior information to be embedded within the analysis, as would be the case for a Bayesian posterior distribution. The use of confidence distributions to quantify prior information as an alternative to a Bayesian prior distribution has also been proposed and has been demonstrated to have some desirable properties. 42 When there is no prior information available, a Bayesian analysis would use a non-informative prior distribution and it would be natural to expect concordance between the Bayesian posterior distribution and the confidence distribution. A standard objective approach to incorporating a lack of prior information into a Bayesian analysis is to use a Jeffreys reference prior distribution. It is therefore particularly noteworthy that in some contexts the Bayesian posterior distribution obtained from assuming a Jeffreys reference prior is identical to the confidence distribution. 16 This relationship strengthens the case for confidence distributions being seen as analogous to an objective Bayesian inferential summary, while also being compatible with frequentist inferential summaries.
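In the normal-approximation case, the two-step synthesis reduces to a standard fixed-effect meta-analysis: inverse-variance weighting of the individual estimates yields the combined confidence distribution. The sketch below uses hypothetical estimates and standard errors for a current trial and one prior trial:

```python
from math import sqrt
from statistics import NormalDist

def combine_normal_cds(estimates, ses):
    """Combine normal confidence distributions from independent studies by
    inverse-variance weighting (fixed-effect meta-analysis); returns the
    combined confidence distribution as a NormalDist."""
    weights = [1 / s ** 2 for s in ses]
    theta = sum(w * t for w, t in zip(weights, estimates)) / sum(weights)
    se = sqrt(1 / sum(weights))
    return NormalDist(mu=theta, sigma=se)

# Hypothetical log odds ratios: current trial 0.36 (SE 0.23),
# prior trial addressing the same question 0.20 (SE 0.30)
combined = combine_normal_cds([0.36, 0.20], [0.23, 0.30])
conf_benefit = 1 - combined.cdf(0.0)  # confidence that the log OR exceeds 0
```

The combined estimate sits between the two inputs and its standard error is smaller than either, so each source's contribution to the synthesis remains transparent.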

Probability interpretations
Although we have focused on the practical aspects of applying confidence distributions in the context of clinical trials, their use as distributional summaries of treatment effects invokes underlying philosophical concepts connected to the nature of probability. Fundamentally, our discussion has been based on confidence as a version of probability that is different from frequentist probability and Bayesian probability. As Schweder 15 discussed, a variety of versions of probability exist and there is no universal agreement on how to interpret these different versions. However, a useful framework for summarizing the difference between confidence, frequentist and Bayesian probability is provided by the duality summarized by Hacking 22 and quoted in Section 2. Frequentist probability is what Hacking called a stable relative frequency produced by chance devices. In this sense, it is an absolute state of nature that we can attempt to understand through observation and experimentation. In contrast, both confidence and Bayesian probability express what Hacking called a "degree of belief warranted by evidence" which, rather than being a state of nature, is a quantification of incomplete knowledge. However, there is a crucial distinction in that Bayesian probability is a personal degree of belief that may differ between individuals, whereas confidence is an objective degree of belief that follows in a formulaic way from the observed data and the assumed model. It is for this reason that confidence distributions may be interpreted as an objective prior-free analogue of a Bayesian posterior distribution. While confidence is not frequentist probability, it is defined in terms of frequentist probability. Furthermore, the motivation for using it as an objective degree of belief comes from its definition in terms of frequentist probability. For example, consider the conclusion from our REMAP-CAP case study that there is 94% confidence of a beneficial treatment effect. In 94% of repeated studies, the interval above the 6th percentile of the confidence distribution will include the true treatment effect. Thus, the fact that this interval coincides with the region of benefit in the REMAP-CAP study gives us 94% confidence that the treatment effect is within this region of benefit. This is expressed with the statement Conf(BENEFIT) = 0.94, which is an epistemic probability statement that stands alongside and enhances the other aspects of a frequentist analysis. This motivation is essentially a generalization of the familiar motivation for confidence intervals in terms of coverage probabilities.

Conclusion
Confidence distributions are closely related to Fisher's fiducial probability 9 and indeed they are equivalent when there is only one parameter. 16 Fiducial probability is usually viewed as Fisher's one great failure, 43 so it may seem curious that the concept is undergoing a resurgence. The failure of fiducial probability can at least partly be explained by the fact that it does not fully replace the frequentist and Bayesian interpretations of probability. However, when confidence distributions are used in tandem with the relative frequency interpretation of probability, rather than to the exclusion of it, the elements of frequentist inference are retained at the same time as being endowed with an objective capacity to make probability statements about parameters and hypotheses. Over the last decade there has been renewed interest in and support for confidence distributions, a trend that was foreseen by Efron in a 1998 paper looking forward to the 21st century: "Here is a safe bet for the 21st century: statisticians will be asked to solve bigger and more complicated problems. I believe that there is a good chance that objective Bayes methods will be developed for such problems, and that something like fiducial inference will play an important role in this development. Maybe Fisher's biggest blunder will become a big hit in the 21st century!" 11 The role of confidence distributions as a way to equip frequentist theory with an objective Bayes interpretation was re-affirmed by Efron in 2013. 44 These and the many other references discussed here document a groundswell of support for confidence distributions as an objective prior-free analogue of the Bayesian posterior distribution, while being fully compatible with frequentist inference. I hope that this paper will encourage trial statisticians to report confidence distributions for treatment effects in clinical trials.

FIGURE 1 Confidence distribution function C(θ|Y) in (5) (Panel A) and confidence density function c(θ|Y) in (6) (Panel B) for a treatment effect θ with estimate θn = −1.5 and standard error ŝn = 1.75. Negative values of θ correspond to treatment benefit and red lines are one-sided confidence intervals.
Note: If a = Z0, b = Z1, c = n0 − Z0 and d = n1 − Z1, then θn and V(λn) are based on the standard log odds ratio estimates θn = log(ad/(bc)) and V(λn)/n = 1/a + 1/b + 1/c + 1/d.

FIGURE 2 Confidence distributions for risk difference (Panels A and B) and relative risk (Panels C and D) of 30-day mortality in the HERO-2 study. For risk difference, Conf(EQUIV) is the confidence that there is no greater than a 1% absolute difference in risk, while for relative risk it is the confidence that the relative risk is between 0.9 and 1.1.

FIGURE 3 Confidence distributions for the effect of treatment on bleed severity in the HERO-2 trial data from Table 2. Treatment effect odds ratios are based on proportional odds models. Panel A compares the bootstrap confidence distribution function with the Bayesian posterior distribution function based on a flat prior with MCMC sampling. Panel B compares the confidence distributions based on bootstrapping and normality. Panel C displays the confidence curve with 95% confidence interval (CI) and Bayesian credible interval (CrI).
TABLE 2 Data on bleed severity in the HERO-2 trial.
TABLE 3 Data from the REMAP-CAP trial comparing organ support-free days (OSFD) in individuals with severe Covid-19 infection receiving either fixed duration hydrocortisone or no hydrocortisone (control).