Avoiding prior–data conflict in regression models via mixture priors

The Bayesian model consists of the prior–likelihood pair. A prior–data conflict arises whenever the prior allocates most of its mass to regions of the parameter space where the likelihood is relatively low. Once a prior–data conflict is diagnosed, deciding what to do next is a hard question. We propose an automatic prior elicitation that involves a two‐component mixture of a diffuse and an informative prior distribution and that favours the first component if a conflict emerges. Using various examples, we show that these mixture priors can be useful in regression models as a device for regularizing the estimates and retrieving useful inferential conclusions.


INTRODUCTION
The Bayesian model consists of the prior-likelihood pair (Gelman & Shalizi, 2013), and checking the model components is an essential statistical task. Recently, starting from the approaches of Box (1980), Rubin (1984) and Gelman, Meng & Stern (1996), simultaneous and joint checks of both components have been developed. However, along with these posterior predictive checks, the need for a statistical model to be generative has lately emerged. In particular, the ability to generate realistic hypothetical data is validated through prior-predictive checking (Gabry et al., 2019), and the plausibility of the prior is then somehow checked in a preliminary and separate way using a simulation perspective.
In this approach, a prior could be checked against the data and subsequently changed depending on the results of this check. Rather than merely representing the belief of the statistician before observing the data (Garthwaite, Kadane & O'Hagan, 2005), the analyst considers the prior distribution as a device that can convey information, regularize and suitably restrict the space of the unknown parameters (Gelman, 2017). As suggested by Gelman, Simpson & Betancourt (2017), to ensure a robust analysis we need to go beyond the standard Bayesian workflow where the prior is meant to be chosen without any reference to the data.
However, if we admit that the prior can be (judged to be) grossly inappropriate, these two model components (the prior and the sampling distribution) could contradict each other: when the prior concentrates most of its mass in low-density areas of the likelihood, we incur a prior-data conflict (Evans & Moshonov, 2006; Bousquet, 2008; Evans & Jang, 2011a; Al Labadi & Evans, 2017). Although this issue is recognized among Bayesian practitioners, remarkably few tools have been proposed for resolving such a conflict once it has been detected. The Kullback-Leibler divergence criterion (Bousquet, 2008) and Bayesian P-values (Evans & Moshonov, 2006; Nott et al., 2016) have been developed for assessing the extent of prior-data conflict, but neither approach is used explicitly to elicit an improved prior distribution. In this article, we outline a procedure for including Bayesian P-values directly in the prior formulation to prevent a prior-data conflict. Following the logic outlined in Gelman et al. (2008), the usefulness of our procedure will be revealed in a regression context, where prior predictive checks highlight the importance of a careful elicitation. The main thread concerning use of a prior-data conflict to derive an improved prior beginning from an informative prior can be traced to Evans & Jang (2011a).
For the rest of this article, our working assumption will be that the sampling distribution is correct. Given a pair of priors p, q, where the first one is informative (and so could conflict with the data) and the second one is noninformative (so that it should not conflict with the data), our strategy is to combine them in a new mixture prior π_ω = ωq + (1 − ω)p, and to develop an automatic procedure for estimating the mixture weight ω in such a way that any conflict between p and the data no longer occurs. The resulting prior is then a robust alternative that lies between the informative prior p and the noninformative prior q. It is then reasonable to choose the weight such that as ω approaches 1 (meaning that a substantial prior-data conflict occurs) q is favoured; conversely, as ω approaches 0 (meaning that no conflict occurs) p is then a suitable prior.
The choice of the priors p and q is of primary importance in our approach. However, assigning a mixture prior weighted with an estimated ω to a regression parameter does not by itself guarantee robustness. Moreover, the impact of a rather informative prior, say a standard normal N(0, 1), may change dramatically depending on the sampling model. In such extreme cases, we suggest and implement the use of a predictive informative prior. This is a data-driven prior distribution that depends on the sufficient statistic for the regression model, and it is likely able to regularize the inferences even when the informative prior and our proposed mixture fail. In practice, we do not believe that the dependence of our prior distribution on the data is a major concern. We feel our proposed approach may be naturally located within the so-called falsificationist Bayesian philosophy (Gelman & Hennig, 2017), which openly deviates from subjective and objective Bayesian practices and in which the prior is open to falsification.
In a sense, it is immediate to see how the family of mixture priors {ω q(θ) + (1 − ω) p(θ); ω ≥ 0} represents a natural hierarchy of priors before viewing the data; distinct priors can in fact be identified as ω varies. Thus, our mixture prior is a device whose aim is to incorporate any possibility of prior-data conflict, allowing the absence of any such conflict as a particular case. It therefore works as a sort of built-in prior, with no need to routinely check many weakly informative priors and possibly change them.
The rest of the article is organized as follows. Section 2 provides a quick overview of the prior-data conflict measures proposed by Evans & Moshonov (2006). In Section 3, we present our methodology, together with some theoretical results that quantify the extent of any prior-data conflict we can expect a priori when using rather than p. We also justify the use of predictive informative priors in extreme cases. Section 4 explores several regression applications. Some concluding remarks and observations are provided in Section 5.

PRIOR-DATA CONFLICT: SOME BACKGROUND
We incur a prior-data conflict when we elicit a prior whose density mass concentrates on values of the parameter that are not supported by the data. In other words, such a conflict happens when the prior places its mass primarily on distributions in the sampling model for which the observed data are surprising (typically when only a few data points are observed). As mentioned by Evans & Moshonov (2006), Evans & Jang (2011a) and Nott et al. (2016), checking for prior-data conflicts takes a distinct perspective from verifying the appropriateness of the likelihood components.
We denote the sampling model for the data y in the sample space 𝒴 by {p(y|θ) : θ ∈ Θ ⊆ ℝ^d, d ≥ 1}, where each p(y|θ) is a probability density on 𝒴 with respect to some support measure ν. The prior distribution p(θ) then leads to a prior predictive probability measure M, whose density with respect to ν, m(y) = ∫_Θ p(y|θ) p(θ) ν(dθ), is known as the prior predictive distribution for the sample y. For a function T : 𝒴 → 𝒯, we may define the marginal prior predictive density for T,

m_T(t) = ∫_Θ p(t|θ) p(θ) ν(dθ), (2)

where p(t|θ) is the marginal density for T. If T is a minimal sufficient statistic for the sampling model p(y|θ), it is well known that the posterior is the same whether we observe y or T(y). In Evans & Moshonov (2006) and Evans & Jang (2011a), a prior-data conflict arises when the observed value T(y_0) = t_0 turns out to be surprising when compared with the distribution M_T:

P(t_0) = M_T(m_T(t) ≤ m_T(t_0)). (3)

The measure of surprise in Equation (3) is a prior predictive P-value according to the probability distribution M_T, and its purpose is to locate t_0 in the distribution M_T. Evans & Jang (2011b) stated a consistency result for Equation (3), proving that the limiting value of this tail probability as the data grow measures the extent to which the true value of the parameter is a surprising value with respect to the choice of the prior. If m_T is unimodal, Equation (3) represents the probability P(t_0) that the value t_0 falls in a low-density area of the distribution. It is really only when very small values of (3) are obtained that problems arise, since then the data contradict the prior. If P(t_0) approaches zero, t_0 lies in a region to which M_T assigns very little probability, and a prior-data conflict is likely to occur. Although there may be many concerns about the use of these P-values to detect prior-data conflicts, in the rest of the article we will use P(t_0) to detect a possible prior-data conflict. We will return to this point, which plays a key role in our approach, in Section 3.
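For intuition, the P-value in Equation (3) can be approximated by straightforward simulation. Below is a Python sketch (the article's supplementary code is in R; the function name is ours) for the location-normal case y_i ∼ N(μ, 1) with prior μ ∼ N(μ_0, σ²), where T = ȳ and m_T is the N(μ_0, σ² + 1/n) density:

```python
import numpy as np
from scipy.stats import norm

def prior_pvalue(t0, mu0, sigma2, n, n_rep=100_000, seed=1):
    """P(t0) = M_T(m_T(t) <= m_T(t0)) for y_i ~ N(mu, 1), mu ~ N(mu0, sigma2);
    here T = ybar and m_T is the N(mu0, sigma2 + 1/n) density."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(sigma2 + 1.0 / n)
    t_rep = rng.normal(mu0, sd, size=n_rep)   # draws from M_T
    # proportion of draws whose density does not exceed that of t0
    return np.mean(norm.pdf(t_rep, mu0, sd) <= norm.pdf(t0, mu0, sd))
```

Because this m_T is symmetric and unimodal, the Monte Carlo estimate agrees with the closed chi-squared form that appears in Example 1 below.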
The goal is to elicit an informative prior p(θ), which we will henceforth call the reference informative prior for a problem of interest. Moreover, we want to avoid the possibility that this prior is in conflict with the observed data according to the P-value described in Equation (3), or that it dominates the inference when the data are not fully informative. To dilute the effect of this choice, we may combine p(θ) with a noninformative prior q(θ), θ ∈ ℝ^1 for simplicity, in a mixture prior π_ω(θ) using the weight ω, as follows:

π_ω(θ) = ω q(θ) + (1 − ω) p(θ), ω ∈ [0, 1]. (4)

The idea of using mixture priors to overcome a prior-data conflict was previously proposed, in the context of clinical trials, by Schmidli et al. (2014) and Mutsvari, Tytgat & Walley (2016). However, these authors were vague about the choice of the mixture weights, suggesting that this specification can be based on the degree of confidence of the clinical trial team in the relevance of the historical data, or, more simply, that the larger weight, say 1 − ω = 0.7 or 1 − ω = 0.9, should be assigned to the informative prior. In our opinion, this is a subjective choice designed to correct a subjective source such as an informative prior. To use mixture priors wisely in applied statistics, we believe the choice of the weight should be automatic rather than subjective.

Choice of the Mixture Weight
We propose a strategy for choosing ω that depends on the P-value in Equation (3) evaluated for π_ω; in particular, ω is chosen such that the mixture prior does not imply a conflict. According to Equation (2), the marginal prior predictive density for the minimal sufficient statistic T under the prior π_ω, m_T^{π_ω}(t) = ∫_Θ p(t|θ) π_ω(θ) ν(dθ), yields the prior predictive probability measure M_T^{π_ω}(A) = ∫_A m_T^{π_ω}(t) dt. Using the mixture prior identified in Equation (4), it follows at once that

m_T^{π_ω}(t) = ω m_T^q(t) + (1 − ω) m_T^p(t). (5)

Given the observed statistic value T(y_0) = t_0, we propose to choose the smallest value of ω such that the P-value P_{π_ω}(t_0) = M_T^{π_ω}(m_T^{π_ω}(t) ≤ m_T^{π_ω}(t_0)) exceeds the threshold γ and, consequently, a prior-data conflict no longer exists, i.e.,

ω = min{ω′ ∈ [0, 1] : P_{π_{ω′}}(t_0) > γ}. (6)

In this context, γ acts as a tuning parameter; its choice is connected to the degree of flexibility assumed by the experimenter. Evans & Moshonov (2006) do not address this issue; they only recognize a conflict in examples where the P-value is at most 0.05. Evans & Jang (2011a) allude to the fact that γ is usually some cut-off value that depends on the application. Other authors are even more vague, suggesting that a discrepancy in a statistic is found when its observed value falls in one of the tails of its replicated distribution. The goal is not a decision about the existence of a conflict (the method is applied when there are clues that a conflict might occur); rather, it is vital for us to define a prior that incorporates the absence of a conflict as a particular case, and under which the degree of this conflict (the choice of γ) may vary based on the individual cases and the investigator's judgement.
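A minimal sketch of this rule, assuming a normal sampling model with normal components p and q so that m_T^p and m_T^q are available in closed form; the grid search and all names are our illustrative choices (Python rather than the article's R):

```python
import numpy as np
from scipy.stats import norm

def mixture_pvalue(t0, omega, mu0, s2p, s2q, n, n_rep=100_000, seed=1):
    """P-value (3) under the mixture prior omega*q + (1 - omega)*p for the
    model y_i ~ N(mu, 1), with p = N(mu0, s2p), q = N(mu0, s2q) and T = ybar,
    so that m_T^p and m_T^q are N(mu0, s2p + 1/n) and N(mu0, s2q + 1/n)."""
    rng = np.random.default_rng(seed)
    sd_p, sd_q = np.sqrt(s2p + 1.0 / n), np.sqrt(s2q + 1.0 / n)
    from_q = rng.random(n_rep) < omega                 # component indicators
    t_rep = np.where(from_q,
                     rng.normal(mu0, sd_q, n_rep),
                     rng.normal(mu0, sd_p, n_rep))     # draws from M_T under the mixture
    m = lambda t: omega * norm.pdf(t, mu0, sd_q) + (1 - omega) * norm.pdf(t, mu0, sd_p)
    return np.mean(m(t_rep) <= m(t0))

def smallest_weight(t0, mu0, s2p, s2q, n, gamma=0.25, grid=np.linspace(0.0, 1.0, 101)):
    """Smallest omega on a grid such that the mixture P-value exceeds gamma,
    mimicking Equation (6)."""
    for omega in grid:
        if mixture_pvalue(t0, omega, mu0, s2p, s2q, n) > gamma:
            return omega
    return 1.0
```

When t_0 is close to the prior centre the search stops at ω = 0 (the informative prior is kept), whereas a surprising t_0 forces a strictly positive weight on q.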

Specifying the Priors
The final degree of freedom provided to users of our approach is the choice of q, the noninformative prior. Although many definitions of noninformative priors have been proposed in the past (Kass & Wasserman, 1996; Consonni et al., 2018), it is sufficient for our purposes to consider the absence of the possibility of any prior-data conflict as a necessary characteristic of any noninformative prior, as suggested by Evans & Moshonov (2006). In the examples in Section 4, we use the weakly informative priors proposed by Gelman et al. (2008), which, as shown in Evans & Jang (2011a), are unlikely to cause any prior-data conflict. However, we need to characterize when the mixture prior identified in Equation (4) is weakly informative with respect to p; we do so by following the procedure proposed by Evans & Jang (2011a). To assess whether a base prior q is weakly informative with respect to an elicited prior p, these authors suggest evaluating

M_T^p(P_q(t_0) ≤ x_γ), (7)

where P_q(t_0) is the P-value used to check whether or not there is prior-data conflict with respect to q, and x_γ ∈ [0, 1] is the γ-quantile of the distribution of P_p(t_0). As Evans and Jang remark, if m_T^p(t_0) has a continuous distribution when t_0 ∼ M_T^p, then x_γ = γ. We say that q is weakly informative relative to p at level γ if the value identified in Equation (7) is at most x_γ. Moreover, the degree of weak informativity of a prior q relative to a prior p can be quantified via the ratio

1 − M_T^p(P_q(t_0) ≤ x_γ)/x_γ = 1 − M_T^p(P_q(t_0) ≤ x_γ)/M_T^p(P_p(t_0) ≤ x_γ), (8)

where the final equality holds under the assumption that the P-value P_p(t_0) is uniformly distributed when t_0 ∼ M_T^p, as suggested by Evans & Jang (2011a). The ratio specified in Equation (8) represents, as a proportion, the reduction in any prior-data conflict that we can expect, a priori, when using q rather than p. In the mixture prior context, we want to check when π_ω is weakly informative relative to p, and how much less informative than p it is, via a measure analogous to the ratio identified in Equation (8) for the two priors p and π_ω.
The following result characterizes the notion of weak informativity for the mixture prior identified in Equation (4). The proof of the theorem may be found in the Appendix.
Theorem 1. Suppose p and q are proper prior distributions for θ, the parameter of a statistical model p(y|θ), and π_ω(θ) = ω q(θ) + (1 − ω) p(θ) is the mixture prior, for any ω ≥ 0. Then the ratio Δ_pq/x_γ, which represents, as a proportion, the reduction in any prior-data conflict that we can expect, a priori, when using π_ω rather than p, equals ω(P_q(t_0) − P_p(t_0))/x_γ, with the numerator Δ_pq ≡ ω(P_q(t_0) − P_p(t_0)) bounded between 0 and 1. Moreover, π_ω is weakly informative relative to p at level γ if and only if Δ_pq > 0, i.e., whenever ω > 0 and P_q(t_0) > P_p(t_0).
This result underlines the role played by ω in the prior-data conflict context and quantifies the extent of the reduction in any prior-data conflict we can expect when using the mixture prior π_ω rather than the informative prior p: Δ_pq ≈ 0 (more or less the same amount of prior-data conflict as raised by p) when the informative prior p is entirely weighted in the mixture prior (ω ≈ 0, P_p(t_0) ≈ P_q(t_0)); conversely, Δ_pq ≈ 1 when the noninformative prior q is entirely weighted in the mixture prior (ω ≈ 1, P_p(t_0) ≈ 0). In the following example we retrieve this result, and the typical meaning of noninformative priors, by comparing the mixture prior with a normal prior in the case of a normal likelihood for the observed data.
The Canadian Journal of Statistics / La revue canadienne de statistique

Example 1. Comparing the mixture with a normal prior.
Suppose we collect a sample y = (y_1, …, y_n) from a N(μ, 1) distribution. The minimal sufficient statistic is T(y) = ȳ ∼ N(μ, 1/n). Suppose that the prior p on μ is N(μ_0, σ_1²), whereas the prior q on μ is N(μ_0, σ_2²), with μ_0, σ_1², σ_2² known. We can combine p and q in a mixture prior ω q(μ) + (1 − ω) p(μ), with ω estimated and γ known. Then we can compute

P_p(t_0) = 1 − G_1((t_0 − μ_0)²/(σ_1² + 1/n)),  P_q(t_0) = 1 − G_1((t_0 − μ_0)²/(σ_2² + 1/n)),

where G_k denotes the cumulative distribution function of the chi-squared random variable with k degrees of freedom. It follows that Δ_pq = ω(P_q(t_0) − P_p(t_0)). This quantity is positive iff ω > 0 and P_q(t_0) − P_p(t_0) > 0, where the latter condition holds only when σ_2² > σ_1², as is customary for the case of two prior distributions p and q; see Evans & Jang (2011a).
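The chi-squared form of these P-values follows from the symmetry and unimodality of the normal prior predictive; here is a short derivation for P_p(t_0), our reconstruction in the notation above:

```latex
% Under p, the prior predictive of T = \bar{y} is N(\mu_0, \sigma_1^2 + 1/n), so the
% event \{m_T^p(t) \le m_T^p(t_0)\} coincides with \{(t-\mu_0)^2 \ge (t_0-\mu_0)^2\}:
\begin{align*}
P_p(t_0) &= M_T^p\bigl\{ (T-\mu_0)^2 \ge (t_0-\mu_0)^2 \bigr\} \\
         &= \Pr\!\left\{ \chi^2_1 \ge \frac{(t_0-\mu_0)^2}{\sigma_1^2 + 1/n} \right\}
          = 1 - G_1\!\left( \frac{(t_0-\mu_0)^2}{\sigma_1^2 + 1/n} \right),
\end{align*}
% since (T-\mu_0)^2/(\sigma_1^2 + 1/n) \sim \chi^2_1 when T \sim N(\mu_0, \sigma_1^2 + 1/n).
```

The expression for P_q(t_0) follows by replacing σ_1² with σ_2².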

Predictive Informative Priors
The choice of the statistic T(y) and the two priors p and q plays a key role in our method. Unfortunately, when a serious prior-data conflict exists, we will need to replace the prior, and that would seem to suggest some dependence on the data; after all, we are replacing the prior because of the observed data. In many applied problems, it may then be useful to elicit a prior that does not cause any prior-data conflict. From this perspective, the prior should act as a device, and its choice should guarantee robust inferential conclusions. If the mixture prior that we proposed in Section 3.1 is revealed to be in conflict with the data after a first check (thus, ω approaches one), the analyst could be tempted to reinforce the prior assumptions by defining a predictive informative prior p_T(θ), whose characteristic is to be centred at the sufficient statistic. Consider the location normal model (Evans & Moshonov, 2006), where y_i ∼ N(μ, σ²) and μ ∼ N(0, 1); then the predictive informative prior is p_T(μ) = N(t_0, 1), where t_0 ≡ T(y_0) = n⁻¹ Σ_i y_i is the observed value of the minimal sufficient statistic. In general, we propose using a predictive informative prior centred at t_0, where t_0 might be the maximum likelihood estimate or any other reasonable estimate for the parameter θ. The rationale here is that by choosing this prior, centred on the maximum likelihood estimate, we may have a readily available correction in the direction of the data and thus obtain a more robust prior. Some possible benefits associated with this particular choice will be revealed by the examples that we consider in Section 4. Nevertheless, we stress that this is a very strong data-dependent prior, and its use should be restricted to extreme cases.
For this reason, in terms of a natural hierarchy of priors, users are strongly encouraged to elicit a preliminary reference informative prior p, and possibly to replace it with a predictive informative prior in the mixture prior identified in Equation (4) after checking for the possibility of prior-data conflict using P(t_0) as specified in Equation (3).

The Multi-Parameter Case
When θ ∈ ℝ^p, we have a parameter vector θ = (θ_1, θ_2, …, θ_p). To implement our procedure, we can assign a mixture prior to each of the p components of θ or, alternatively, we can define an approximation of m(y) for only the component of interest and obtain a pseudo-prior predictive distribution. For example, if only θ_1, the first element of θ, were of interest, we could use

m(y | θ̂_2, θ̂_3, …, θ̂_p) = ∫ p(y | θ_1, θ̂_2, θ̂_3, …, θ̂_p) p(θ_1) dθ_1, (11)

where θ_2, θ_3, …, θ_p are replaced by consistent estimates θ̂_2, θ̂_3, …, θ̂_p. We may then define the analogous pseudo-distribution for the marginal prior predictive density for T. We will rely on these pseudo quantities in Section 4 for regression models that involve more than one parameter. The prior predictive distribution specified in Equation (11) uses consistent estimates for θ_2, θ_3, …, θ_p, and this may result in using the observed data twice. However, the alternative, i.e., evaluating the multi-dimensional integral of dimension p, would be computationally costly and would require an approximation. Evans & Moshonov (2007) proposed checking individual prior components in hierarchical priors, but their approach is based upon the existence of a set of ancillary statistics in cases where, moreover, the decomposition of the prior conforms to a certain structure. In fact, they focus on exponential models and group statistical models, and, as they report, even in those contexts only certain decompositions are amenable to their methodology. As an alternative, in their Example 2 they chose to use hierarchical checking without sufficient ancillaries: first they checked the marginal prior on the variance for a location-scale normal model, and the conditional prior for the mean was checked subsequently.
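As an illustration of this plug-in idea, consider a location-scale normal model in which only the prior on the mean is checked and the scale is replaced by a consistent estimate; a Python sketch under these assumptions (the article's code is in R, and the function name is ours):

```python
import numpy as np
from scipy.stats import norm

def pseudo_prior_pred_pvalue(y, n_rep=50_000, seed=1):
    """Pseudo prior predictive P-value for mu in the model y_i ~ N(mu, sigma^2),
    with reference prior p(mu) = N(0, 1) and sigma replaced by a plug-in
    estimate, in the spirit of Equation (11); T = ybar."""
    n, t0 = len(y), float(np.mean(y))
    sig_hat = float(np.std(y, ddof=1))             # consistent estimate of sigma
    rng = np.random.default_rng(seed)
    mu = rng.normal(0.0, 1.0, n_rep)               # draws from the prior on mu
    t_rep = rng.normal(mu, sig_hat / np.sqrt(n))   # draws from the pseudo m_T
    sd = np.sqrt(1.0 + sig_hat ** 2 / n)           # sd of the pseudo m_T (normal-normal)
    return float(np.mean(norm.pdf(t_rep, 0.0, sd) <= norm.pdf(t0, 0.0, sd)))
```

A sample centred near the prior mean yields a large P-value, while a shifted sample exposes the conflict; only the single integral over μ is ever approximated.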
The advantage that our proposed approach affords is that Equation (11) may be adopted for almost any prior structure, with no distinction between hierarchical, independent and other forms of dependent priors. Furthermore, there is no need to require the existence of any ancillary statistics at this stage. The main drawbacks of our procedure are the possible double use of the data and the sensitivity of the prior predictive distribution to different estimators.

DOI: 10.1002/cjs.11637

Computational Issues and Simulations
To compute the values P(t_0) specified in Equation (3), we often need to use numerical or simulation methods, since the distributions M_T^p and M_T^{π_ω} are often not available in analytical form. For this reason, we usually approximate Equation (2) by drawing hypothetical replications y_1^{rep}, …, y_n^{rep} from m(y) and obtain simulated versions of m_T^p(t) and m_T^{π_ω}(t). We are then able to compute the P-values and the mixture weight ω as outlined in Equation (6).
In the Supplementary Materials that accompany this article, we provide the R source code required to simulate hypothetical data replications and compute the mixture weights for the examples discussed in Section 4.

Summary
The procedure we propose may be summarized as follows: (a) Choose a reference informative prior p(θ) and a noninformative prior q(θ). (b) Given the minimal sufficient statistic T with observed value t_0, compute the P-value P(t_0) specified in Equation (3). (c) If a conflict emerges, estimate the smallest mixture weight ω satisfying Equation (6). (d) Fit the model under the resulting mixture prior π_ω(θ) identified in Equation (4).

A Simple Example
We refer to the location normal model which Evans & Moshonov (2006) discussed as their Example 1. Suppose y = (y_1, …, y_n) is sampled from a N(μ, 1) distribution, μ ∈ ℝ^1, and the two possible priors p and q are μ ∼ N(μ_0, τ²) and μ ∼ N(μ_0, c²τ²), c ≫ 0, respectively. The sample mean T(y) = ȳ ∼ N(μ, 1/n) is the minimal sufficient statistic. As Evans & Jang (2011a) show, the N(μ_0, c²τ²) prior is weakly informative with respect to the N(μ_0, τ²) prior when c > 1, and the prior predictive distribution of ȳ with respect to the prior p is N(μ_0, τ² + 1/n). Given the observed value T(y_0) = t_0, we want to assess whether or not this value lies in one of the tails of the prior predictive distribution via the P-value

P_p(t_0) = 1 − G_1((t_0 − μ_0)²/(τ² + 1/n)).

We combine p and q in the mixture prior identified in Equation (4) and obtain the P-value P_{π_ω}(t_0) = M_T^{π_ω}(m_T^{π_ω}(t) ≤ m_T^{π_ω}(t_0)). We may also adjust our reference informative prior and define the predictive informative prior p_T(μ) = N(ȳ, τ²). To illustrate the benefits that derive from using our procedure, we simulate two distinct samples for y, from N(0, 1) and from N(10, 1), respectively, with the following parameter settings: n = 100, μ_0 = 0, τ = 4, c = 100; in addition, we fix γ = 0.25. The first sample is simulated under the assumption that there is no prior-data conflict between our informative prior p and the data y, whereas the second sample is generated with the explicit intention of creating a conflict with the informative prior centred at 0. For the first sample we obtain P_p(t_0) = 0.985 and ω = 0; thus, as we were expecting, a prior-data conflict does not occur. For the second sample we obtain P_p(t_0) = 0.01 for the informative prior, which indicates a prior-data conflict.
For the mixture prior μ ∼ ω N(μ_0, c²τ²) + (1 − ω) N(μ_0, τ²), we obtain P_{π_ω}(t_0) = 0.25 with the weight estimated at ω = 0.25; for the mixture with the predictive prior in place of the reference p, μ ∼ ω N(μ_0, c²τ²) + (1 − ω) N(ȳ, τ²), we obtain P_{π_ω}(t_0) = 1 and ω = 0.
Even in the second sample, where a prior-data conflict occurs, adjusting the reference informative prior is not strictly required: the starting informative prior p, once weighted and combined with a diffuse prior, already prevents any prior-data conflict up to the fixed threshold γ.

APPLICATIONS
In this section we consider three distinct examples of regression models for which the mixture priors introduced in the previous section prove to be beneficial. Although complete or quasi-complete separation in logistic regression is not strictly the cause of a prior-data conflict, it may nevertheless be viewed as an implicit cause in a broader context in which default priors are often not able to regularize the inferences and yield poor answers.
For simplicity and technical convenience, in each of the following examples we elicit the standard normal N(0, 1) as the reference informative prior: rather than eliciting an actual informative prior as in Al Labadi, Baskurt & Evans (2018), we prefer to use a single reference prior and to check when it causes prior-data conflicts. Of course, the impact of such a prior varies in relation to the likelihood and the sufficient statistic (Gelman, Simpson & Betancourt, 2017).
To implement our procedure, we set γ = 0.05 and fixed the number of hypothetical data replications required to compute the P-values identified in Equation (3) at 10³ in each example. These values were chosen following some sensitivity tests. Computational steps and other details may be found in the Supplementary Material, i.e., the R source code that accompanies this article.

Logistic Regression and Separation
Experiments such as clinical trials may contain much historical information, and the analyst may rely on this source of past information, considering it as a sort of baseline for similar and future studies. This is the underlying mechanism in Bayesian inference: given sequential observation of the data points y_1 and y_2 collected at two distinct times t_1 and t_2, each with sampling distribution p(y|θ), the posterior p(θ|y_1) usually acts as the prior for the new data point y_2, and the update is then p(θ|y_1, y_2) ∝ p(θ|y_1) p(y_2|θ). Now consider the following imaginary, but realistic, situation typical of causal inference, where the quantity of interest is the probability of contracting a particular disease. Suppose we want to collect the binary response y_ij, our dependent variable, where y_ij = 1 if the ith subject in the jth sample has the disease, and y_ij = 0 otherwise. We suppose that for each selected subject we also collect some individual predictors, x_ij and z_ij, say the plasma level of a protein of interest and the sex of the subject, respectively (z_ij = 1 if the ith subject in the jth sample is male, 0 otherwise). We, the analysts, collect a total of five samples, each of length n = 100, during the year 2019 in the same hospital, assuming that no subject appears in more than one sample: each sample y_1, …, y_5 is associated with the predictors (x_1, z_1), …, (x_5, z_5). However, we immediately detect quasi-complete separation (Zorn, 2005; Gelman et al., 2008; Sauter & Held, 2016) in the fifth sample with respect to the sex z_5. Thus, one of the model's covariates almost perfectly predicts the outcome variable y_5, i.e., y_i5 = 0 if z_i5 = 1, whereas y_i5 can be 0 or 1 if z_i5 = 0. We are asked to perform a Bayesian logistic regression at the end of 2019, assuming that the response probability p_ij ≡ Pr(Y_ij = 1) associated with the ith patient in the jth sample is

logit(p_ij) = α + β_1 x_ij + β_2 z_ij,

where logit(x) = log(x/(1 − x)), x ∈ (0, 1).
As a preliminary attempt, we perform five logistic regressions, treating each experiment as if it were independent of the others. Let β_11, β_12, …, β_15 and β_21, β_22, …, β_25 denote the regression parameters β_1 and β_2 associated with plasma level and sex, respectively, which correspond to the five samples. Figure 1 shows the resulting posterior intervals (50% credibility areas are coloured in light blue) for β_1 and β_2 obtained with the R package rstanarm (Goodrich et al., 2018) using the default weakly informative priors: β_1, displayed in the left panel, is rather similar across the five samples, whereas in the right panel of the same figure β_25 has the opposite sign to β_21, …, β_24, due to separation. Now consider fitting the Bayesian logistic regression for the fifth experiment conditional on what we observed in the previous four experiments. Initially we carry out our analysis without worrying about separation in the data. If we are fully informative Bayesian analysts, we should use the posterior derived from the four experiments as the new prior for the fifth experiment, and then update the results. Otherwise, we could use a standard weakly informative prior as suggested by Gelman et al. (2008) for logistic regression, say N(0, 10²) for the intercept and N(0, 2.5²) for the parameters β_1 and β_2. Posterior 50% intervals and marginal posterior distributions from the two analyses are reported in the top row of Figure 2. The posterior distributions for the parameter β_2 are completely different in the two frameworks. It appears that the weakly informative prior favours the probability of no disease in males too strongly.
For a male subject (z_i5 = 1), the estimated odds ratio that we obtain using the posterior median is exp(−5.5) = 0.004; as we will see, we feel that this posterior estimate could dramatically underestimate the probability of disease for a male subject, especially if this result is used as prior information for future studies. Conversely, the informative prior for β_2 estimated from the first four experiments, β_2 ∼ N(0.5, 0.33²), is likely to conflict with the data, implying an odds ratio of about 1.64 when z_i5 = 1. We need something in between. To implement our mixture prior: (a) First we choose the sufficient statistic with respect to β_2: T(y, z) = Σ_{i=1}^n y_i5 z_i5, which has observed value T(y_0, z_0) = 0 for the fifth experiment. (b) We choose p(β_2) = N(0, 1) and q(β_2) = N(0, 2.5²). (c) We run our source code to (i) draw hypothetical values from m(y|α̂, β̂_1, β_2), with α̂ and β̂_1 consistent estimates for α and β_1, and (ii) compute the weights. We obtain ω = 0, meaning no weight is associated with the prior q. (d) We run the Bayesian logistic regression with the following mixture prior for β_2: β_2 ∼ ω N(0, 2.5²) + (1 − ω) N(0, 1). (e) In this case, since T(y_0, z_0) = 0, the predictive informative prior p_T(β_2) coincides with the reference informative prior p(β_2).
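The pseudo prior predictive simulation for T(y, z) = Σ_i y_i5 z_i5 can be sketched as follows; this is a Python translation of the R workflow, with illustrative plug-in values of our own and a simple tail check in place of the density-based P-value of Equation (3):

```python
import numpy as np

def prior_pred_T(x, z, alpha_hat, beta1_hat, prior_sd=1.0, n_rep=10_000, seed=1):
    """Pseudo prior predictive draws of T(y, z) = sum_i y_i * z_i for the
    logistic model logit(p_i) = alpha + beta_1 x_i + beta_2 z_i, with alpha and
    beta_1 fixed at plug-in estimates and beta_2 ~ N(0, prior_sd^2)."""
    rng = np.random.default_rng(seed)
    t = np.empty(n_rep)
    for r in range(n_rep):
        beta2 = rng.normal(0.0, prior_sd)
        p = 1.0 / (1.0 + np.exp(-(alpha_hat + beta1_hat * x + beta2 * z)))
        y = rng.random(len(x)) < p          # replicated binary outcomes
        t[r] = np.sum(y * z)                # replicated sufficient statistic
    return t
```

Under quasi-complete separation the observed value t_0 = 0 sits far in the left tail of these replications, which is what triggers the weight estimation step.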
The 50% posterior intervals and marginal posterior distributions from our procedure are displayed in the lower panel of Figure 2. The posterior interval for β_2 is considerably narrower than the corresponding interval under the weakly informative prior; moreover, the posterior median, about −3.5, makes more sense since it represents a compromise between the median for β_2 under the strongly informative analysis (about 0.5) and that under the weakly informative analysis (about −5.5). As a further confirmation, the odds ratio for a male is about 0.03 under the mixture prior, lying between the unrealistic value 0.004 (weakly informative prior) and 1.64 (informative prior). As a final comment, we feel that our procedure is even more robust than the weakly informative prior in the presence of separation in logistic regression. In this case, the standard normal prior absorbs the information required to obtain meaningful posterior estimates, and thus there is no need to use the adjusted predictive prior.

A Bioassay Experiment
We now consider a well-known small-sample experiment previously analyzed by Racine et al. (1986) and Gelman et al. (2008), in which the choice of a prior may strongly affect the final inference. Table 1 summarizes the data collected from 20 animals that were exposed to four different doses of a toxin, where x_i represents the ith of k dose levels, measured on a logarithmic scale, given to n_i animals, of which y_i died. We assume the typical binomial model y_i ∼ Binomial(n_i, p_i), where p_i represents the probability of death for animals given dose x_i, and logit(p_i) = α + β x_i.
As suggested by Racine et al. (1986), prior information may be available either in the form of the results of a previous experiment using the same substance or in the form of assessments elicited from one or more expert toxicologists.
When the sample size is small, the role of the prior may be particularly relevant. In this application, we want to combine two aspects of the prior specification. On the one hand, the prior should not be in conflict with the observed data. On the other hand, a Bayesian model should always be generative, and simulations from the prior predictive distribution should be reasonable.
Consider first the scenario where substantial information may not be incorporated in the prior, leading to the elicitation of weakly informative priors, namely α ∼ N(0, 10²) and β ∼ N(0, 2.5²). The log(dose) is rescaled to have mean 0 and standard deviation 0.5. As is evident from Figure 3a, where we have displayed four predictive intervals, one for each of the four values of x, fake data generated under these priors are meaningless in this small-sample application: the intervals range from 0 to 5, covering the entire support of the observed data, and posterior medians are constant with respect to x. Thus, even in the absence of substantial prior information, weakly informative priors are too vague in this context and do not yield useful replications under the assumed prior marginal distribution. The same result is obtained for the reference informative prior β ∼ N(0, 1) (Figure 3b) for the same bioassay experiment. Evans & Jang (2011a) (see Figure 4a in their paper) have suggested that this reference informative prior is no more informative than N(0, 2.5²) and could instead be considered the standard noninformative prior for this problem when one focuses on the probabilities instead of the regression coefficients (Al Labadi, Baskurt & Evans, 2018). Moreover, even if P_q(t₀) = 0.1073 when q(α, β) = N(0, 10²) × N(0, 2.5²) according to Evans & Jang (2011a), the degree of prior-data conflict raised by the weakly informative q(α, β) and the informative prior q(α)p(β) = N(0, 10²) × N(0, 1) amounts in our case to 0.021 and 0.023, respectively. This misalignment with their result is justified by the fact that they use m_T to compute the P-value P_q(t₀) identified in Equation (3), whereas we use the pseudo-prior predictive distribution m(t | α̂, β) specified in Equation (11), with α̂ estimated from the data.
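The prior-predictive behaviour described for Figure 3a can be reproduced with a short simulation. The sketch below uses the standard bioassay log doses as tabulated in Gelman et al. (2008), with n_i = 5 animals per group, and draws replications under the weakly informative priors; the 90% predictive intervals essentially cover the whole 0-5 range, as the text describes.

```python
import numpy as np

rng = np.random.default_rng(1)

# log doses for the four groups, as tabulated in Gelman et al. (2008);
# each dose group has n_i = 5 animals
x = np.array([-0.86, -0.30, -0.05, 0.73])
n = np.array([5, 5, 5, 5])

# rescale log(dose) to mean 0 and standard deviation 0.5, as in the text
x_std = (x - x.mean()) / x.std() * 0.5

# weakly informative priors: alpha ~ N(0, 10^2), beta ~ N(0, 2.5^2)
S = 4000
alpha = rng.normal(0.0, 10.0, S)
beta = rng.normal(0.0, 2.5, S)
p = 1.0 / (1.0 + np.exp(-(alpha[:, None] + beta[:, None] * x_std)))
y_rep = rng.binomial(n, p)            # S x 4 prior-predictive replications

# 90% predictive intervals per dose: they span essentially the whole
# support 0..5, mirroring the diffuse intervals in Figure 3a
lo, hi = np.percentile(y_rep, [5, 95], axis=0)
```

Because alpha is drawn with standard deviation 10, most simulated success probabilities land near 0 or 1, so the replications are uninformative about the dose-response pattern.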
As Figure 3a,b shows, both these priors are far from the data: they are centred at zero, and this choice means the binomial model in this small-sample scenario cannot fulfill its generative function when these priors are chosen. We need a prior able to regularize the inferences and to provide plausible replications from the prior predictive distribution in the context of small datasets. To implement the mixture prior that we advocated in Section 3, we need a sufficient statistic for the model, such as T(x, y) = ∑_{i=1}^4 x_i y_i. We then need to estimate the weight λ such that the mixture prior does not lead to a prior-data conflict. Using the reference mixture prior β ∼ λ q(β) + (1 − λ) p(β) (see Figure 3c), where q(β) = N(0, 2.5²) and p(β) = N(0, 1), the estimated weight is λ = 0.5, which does not improve the situation. Thus, our reference informative prior p(β) is not generative, and hence is not very informative with respect to the parameter β, and is therefore likely to lead to a prior-data conflict. In such a situation, we definitely need to use the predictive informative prior, β ∼ λ N(0, 2.5²) + (1 − λ) N(∑_{i=1}^4 x_i y_i, 1). The predictive intervals displayed in Figure 3d, where λ = 0.3, are now narrower, and the posterior medians clearly vary with the different dose levels, replicating the pattern observed in the original data.
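The predictive informative component is centred at the observed value of the sufficient statistic T(x, y) = ∑ x_i y_i. The sketch below computes T from the bioassay data as reported in Racine et al. (1986) (deaths 0, 1, 3, 5 out of 5 per group) and draws from the predictive mixture prior with the weight λ = 0.3 quoted in the text; the sampling scheme is an illustrative two-component mixture, not the authors' source code.

```python
import numpy as np

rng = np.random.default_rng(2)

# bioassay data as reported in Racine et al. (1986): deaths out of
# n_i = 5 animals at each of the four log-dose levels
x = np.array([-0.86, -0.30, -0.05, 0.73])
y = np.array([0, 1, 3, 5])
x_std = (x - x.mean()) / x.std() * 0.5   # mean 0, sd 0.5, as in the text

# sufficient statistic with respect to beta
T = float(x_std @ y)

# predictive mixture prior: beta ~ lam * N(0, 2.5^2) + (1 - lam) * N(T, 1),
# with the weight lam = 0.3 reported in the text
lam = 0.3
S = 10_000
from_diffuse = rng.random(S) < lam
beta = np.where(from_diffuse,
                rng.normal(0.0, 2.5, S),   # diffuse component q
                rng.normal(T, 1.0, S))     # predictive informative component
```

Since T is positive for these data, the informative component pulls the prior mass toward positive slopes, which is what makes the predictive intervals in Figure 3d track the dose-response pattern.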
Marginal posterior distributions and posterior 50% intervals for β that were obtained using weakly informative, reference mixture, and predictive mixture priors are displayed in Figure 4; the distributions obtained under the weakly informative and reference mixture priors coincide, whereas the marginal posterior under the predictive mixture prior yields narrower posterior intervals. In such a case, the reference prior N(0, 1) is not able to regularize the estimates, and using the predictive mixture prior is clearly preferable.

Linear Regression with Multiple Predictors
The dataset Prostate in the R package lasso2 was used by Stamey et al. (1989) and Tibshirani (1996) to investigate the correlation between the level of prostate-specific antigen and other covariates for men who were about to undergo a radical prostatectomy. See Table 2 for a full list of the covariates. We assume a simple linear model for the response measurement y_i representing the amount of prostate-specific antigen as the dependent variable: y_i = β₁ + ∑_{ℓ=2}^9 β_ℓ x_{iℓ} + ε_i, with ε_i ∼ N(0, σ²). The variate x_{iℓ} denotes the value of the ℓth covariate for the ith unit, and each β_ℓ is an unknown regression coefficient. A natural first approach here is to use LASSO (least absolute shrinkage and selection operator) regression, as developed in Tibshirani (1996), to shrink toward zero a subset of coefficients that are not associated with influential predictors. LASSO estimates ± standard errors are displayed in Figure 5; the coefficients β₄, β₅, β₇, β₈ and β₉, associated with age, lbph, lcp, gleason and pgg45, respectively, are shrunk toward zero. The plot reveals a possible identifiability problem with the intercept β₁, which has a standard error that is large when compared with the standard errors of the other regression coefficients. A rough solution could be to drop the intercept term from the model, but this could yield undesirable effects in the global model and, in general, a lack of interpretability for the remaining coefficients.

Weakly informative priors
Following Gelman et al. (2008), we assigned weakly informative priors to each of the coefficients in the regression: the intercept β₁ ∼ N(0, 10²), whereas β_ℓ ∼ N(0, 2.5²) for ℓ = 2, …, 9, and σ ∼ Exponential(1). We fit the model with the rstanarm package, specifying 2000 Hamiltonian Monte Carlo simulations and checking the convergence of the Markov chains using the Gelman-Rubin statistic R̂ (R̂ ≤ 1.1 for all the parameters). Figure 6a displays posterior intervals for the components of the β vector; the Bayesian model with underlying weakly informative priors for the regression coefficients results in posterior interval estimates that are rather similar to the corresponding LASSO-based intervals. The regression coefficients β₄, β₅, β₇, β₈ and β₉, associated with age, lbph, lcp, gleason and pgg45, respectively, are all shrunk toward zero, whereas β₂, β₃ and β₆, associated with lcavol, lweight and svi, respectively, are greater than zero. The estimate of the intercept β₁ reveals a problem: the parameter is not identifiable. The suspicion here is that a prior-data conflict arose with respect to the parameter β₁, and the data are not fully informative. To fix the conflict and properly estimate the intercept, we need a suitable remedy.

Mixture priors
We now must choose the informative and the diffuse prior. For the latter, we select the same weakly informative prior N(0, 10²) used in the previous analysis; for the former, we start with a reference standard normal prior N(0, 1) and then, if needed, update it using the procedure we described in Section 3.7.
To implement the mixture prior, we refer to Equation (10) and consider the pseudo-prior predictive distribution m(y | β̂₂, β̂₃, …, β̂₉). The main steps are the following:
(a) Choose the sufficient statistic, T(x, y) = X^T y, where X is the n × (p + 1) predictor matrix.
(b) Run our source code to sample hypothetical replications from m(y | β̂₂, β̂₃, …, β̂₉) under the pseudo-prior predictive distribution and estimate λ. We obtain λ = 0.
(c) Run the linear regression with the reference mixture prior β₁ ∼ λ N(0, 10²) + (1 − λ) N(0, 1).
(d) Consider the predictive prior distribution p_T(β₁) = N(ȳ/σ̂², 1), where σ̂² denotes an estimate for σ². From the LASSO model, we obtained σ̂² = 0.52.
(e) Carry out the linear regression with β₁ ∼ λ N(0, 10²) + (1 − λ) N(ȳ/σ̂², 1); the estimated value of λ was equal to 0.
The resulting posterior interval for the intercept is narrower than the corresponding interval estimate that we obtained using the weakly informative prior. Clearly, the estimate of β₁ is more stable. Figure 6c displays the posterior intervals for the parameters under the predictive mixture prior β₁ ∼ λ N(0, 10²) + (1 − λ) N(ȳ/σ̂², 1), with λ = 0. Evidently, the posterior estimates for the regression coefficients remain similar, and the 95% posterior interval for β₁ does not contain zero. In effect, we added the essential information required to estimate β₁, and we checked that this information was actually relevant by simulating hypothetical data from m(y | β̂₂, β̂₃, …, β̂₉). As a final remark, we feel that choosing to assign weakly informative priors may not prevent poor estimation when there is a (partial) lack of information in the observed data.
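Steps (d)-(e) above can be sketched numerically. The snippet below uses the LASSO-based estimate σ̂² = 0.52 quoted in the text; the sample mean y_bar of the response is a hypothetical placeholder (the paper does not report it), and the sampler is an illustrative two-component mixture rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(3)

sigma2_hat = 0.52   # LASSO-based estimate of sigma^2 quoted in the text
y_bar = 2.5         # hypothetical sample mean of the response (placeholder)

# centre of the predictive prior p_T(beta_1) = N(y_bar / sigma2_hat, 1)
mu_T = y_bar / sigma2_hat

# mixture prior beta_1 ~ lam * N(0, 10^2) + (1 - lam) * N(mu_T, 1);
# the estimated weight was lam = 0, so all prior mass sits on the
# predictive informative component
lam = 0.0
S = 5_000
from_diffuse = rng.random(S) < lam
beta1 = np.where(from_diffuse,
                 rng.normal(0.0, 10.0, S),   # diffuse component
                 rng.normal(mu_T, 1.0, S))   # predictive informative component
```

With λ = 0 every draw comes from N(μ_T, 1), which is what stabilizes the intercept estimate relative to the N(0, 10²) analysis.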

DISCUSSION
How to proceed once a prior-data conflict is detected is a tricky question. Moreover, there are no automatic procedures for dealing with a possible lack of robustness of the posterior estimates, or with a lack of parameter identifiability that emerges from the marginal posterior distributions. However, the prior is an essential tool for regularizing inferences. To achieve these goals, we proposed a two-component mixture model that combines an informative and a noninformative prior in such a way that a prior-data conflict between the data and the informative prior is avoided. Our approach is based on the prior-data conflict measures developed by Evans & Moshonov (2006) and offers new insight into a reasoned elicitation. If the mixture prior is not capable of fixing the issue, we can extend the method and consider a predictive prior in place of the reference informative prior chosen before the experiment. We justify our proposed priors by providing theoretical tools that measure the degree of informativeness with respect to a reference informative prior. In terms of a broader interpretation, the family of mixture priors {λ q(θ) + (1 − λ) p(θ); 0 ≤ λ ≤ 1} represents a natural hierarchy of priors before seeing the data; distinct priors can be identified as λ varies.
As motivated by the applications, this class of priors could be beneficial for regression models where prior-data conflicts may arise with respect to a subset of parameters, and the resulting inference may be misleading if the priors are even slightly misspecified. Generally speaking, use of our proposed mixture of priors seems to regularize the inference in a broad sense, not just in the case when a prior-data conflict arises.
One major concern about our method is that we advocate choosing data-dependent priors. However, data-dependent priors are widely used in applied statistics with convincing motivations (Wasserman, 2000; Gelman et al., 2008; Goodrich et al., 2018), and our mixture prior λ q(θ) + (1 − λ) p(θ) depends on the observed data only in a marginal sense: the mixture weight λ is chosen using a prior-predictive check that is carried out before the model is fitted (Box, 1980; Gabry et al., 2019). In addition, this prior-predictive check is the same tool adopted by Evans & Moshonov (2006) and Evans & Jang (2011a) to assess whether or not a prior-data conflict has arisen and, if so, to replace the prior. Thus, as argued by Gelman et al. (2008), we do not believe that the dependence of our prior distribution on the observed data represents a major concern, because the inferential conclusions are derived from proper posteriors.
Further research is warranted to implement our proposed approach in more complex settings, such as hierarchical models, and to provide appropriate computational software that is easy to use with such a choice of prior. Some tools for checking the robustness of our proposed approach would also be desirable.