Identifying groups of determinants in Bayesian model averaging using Dirichlet process clustering

Model uncertainty is a pervasive problem in regression applications. Bayesian model averaging (BMA) takes model uncertainty into account and identifies robust determinants. However, it requires the specification of suitable model priors. Mixture model priors are appealing because they explicitly account for different groups of covariates as robust determinants. Specific Dirichlet process clustering (DPC) model priors are proposed, their correspondence to the binomial model prior is derived, and methods to perform the BMA analysis, including a DPC postprocessing procedure to identify groups of determinants, are outlined. The application of these model priors is demonstrated in a simulation exercise and in an empirical analysis of cross-country economic growth data. The BMA analysis is performed using the Markov chain Monte Carlo model composition sampler to obtain samples from the posterior of the model specifications. Results are compared with those obtained under a beta-binomial and a collinearity-adjusted dilution model prior.


INTRODUCTION
Inference under model uncertainty is a pervasive problem in many regression applications. Bayesian model averaging (henceforth BMA) in combination with suitable computational tools has become a standard method to account for model uncertainty (Hoeting et al., 1999; Hofmarcher & Grün, 2020; Raftery et al., 1997). The use of BMA provides better predictive performance (Madigan & Raftery, 1994) and identifies robust determinants (Fernández et al., 2001b). The method has been widely adopted for a range of different application areas including cross-country long-term economic growth (Fernández et al., 2001a; Ley & Steel, 2009; Masanjala & Papageorgiou, 2008) as well as other economic and finance applications (Moral-Benito, 2015; Ouysse & Kohn, 2010), wind speed prediction in weather forecasting (Baran, 2014; Sloughter et al., 2010), and the detection of differentially expressed genes in observational gene expression studies (Zhou et al., 2012).
A key aspect of any BMA application, as for any Bayesian estimation, is the specification of suitable priors. The priors in BMA consist of two parts: (1) the prior for covariate inclusion, so-called model priors, and (2) the prior for the regression coefficients and error variance (henceforth regression parameter priors). A standard default approach for the prior on the regression coefficients and the error variance is the use of (hyper-)g-priors for the regression coefficients of the covariates together with noninformative priors for the intercept and the error variance (see, e.g., Liang et al., 2008; Ley & Steel, 2009; Hofmarcher et al., 2015).
More controversy exists with respect to the specification of suitable model priors. The basic model prior is the binomial model prior which assigns an equal a priori inclusion probability to all potential covariates. Sala-i-Martin et al. (2004) used π = k∕K, with k denoting the prior expected model size and K the total number of potential covariates. The prior expected model size k is fixed by the researcher a priori. To increase flexibility, and to reduce the dependence on the prior expected model size k, Ley and Steel (2009) proposed to put a beta hyperprior on the inclusion probability π. This results in the beta-binomial (BB) model prior, which has become the standard workhorse model prior for most empirical BMA exercises. Indeed, the BB model prior is appealing if the model prior should cover the uncertainty regarding the true model size, that is, if researchers want to add one additional layer of uncertainty over the prior expected model size. One drawback of the BB model prior is that it does not take the correlation structure between covariates into account. To compensate for redundancy within the model space, Chipman (1996) and George (1999) introduced the concept of dilution priors. The aim of dilution priors is to account for similarities among the models (see also George, 2010) by, for example, down-weighting the prior probability of models which include highly correlated covariates and thus cover a similar model space. Moser and Hofmarcher (2014) studied the effect of dilution priors on BMA growth applications, in particular when interaction terms are present, and concluded that the different priors lead to similar predictive performance. Results presented in this article indicate that the BB prior should also be questioned as the default model prior if groups of covariates are suspected to be relevant and the aim of the BMA analysis is to identify these groups.
In the BMA framework, the model prior captures two potential veins of uncertainty: the model size and the inclusion of specific covariates or groups of covariates. To focus on model size uncertainty, that is, the number of included covariates, a vague prior on model size, which is, for example, induced by the BB model prior, needs to be imposed. The second vein of uncertainty might either only focus on the inclusion of single covariates or aim at assessing the joint inclusion of several covariates. The latter is of particular importance if, for example, due to different proposed theories, competing groups of covariates are expected to equally well constitute robust determinants for the dependent variable.
To assess the joint inclusion of several covariates, Durlauf et al. (2008) proposed a model prior which explicitly accounts for the presence of competing theories and different groups of covariates being associated with each of the theories. Within the group of covariates for a given theory, Durlauf et al. (2008) employed dilution priors. This approach assesses the importance of a theory conditional on the associated covariates being selected. However, a drawback of this approach is that it requires the theories, as well as the set of covariates associated with a specific theory, to be known a priori.
To assess the uncertainty with respect to the covariate inclusion of single as well as pairs or groups of covariates, different postprocessing tools for the results of a BMA analysis were also investigated. In particular bivariate jointness measures were proposed to identify pairs of covariates which tend to appear together in the models (complements), or which tend to exclude each other (substitutes) (see Doppelhofer & Weeks, 2009; Ley & Steel, 2007, 2009; Strachan, 2009; Hofmarcher et al., 2018; as well as an empirical comparison by Man, 2018). Ley and Steel (2007) point out that these bivariate measures can easily be extended to more than two regressors so that multivariate jointness measures are obtained. In addition to these methods, Crespo Cuaresma et al. (2016) proposed an alternative postprocessing approach to account for the dependency in covariate inclusion. They analyzed the full multivariate posterior distribution of covariate inclusion using a Dirichlet process clustering (DPC) analysis (see, e.g., Kim et al., 2006; Molitor et al., 2010). This accounts not only for pairwise jointness, but for the full multivariate jointness between all covariates.
In this article we focus on a suitable BMA analysis for assessing model uncertainty with respect to the inclusion of groups of covariates where the specific group structure is not a priori known. We argue in favor of a DPC prior as model prior to facilitate the detection of a grouping structure in the covariates with varying group-specific joint inclusion probabilities. The reasons why DPC priors are suggested as model priors in BMA applications are twofold. First, mixture priors are a natural choice to a priori take into account that different groups of covariates may be relevant for the outcome variable. This is in line with the model formulation in Durlauf et al. (2008), where competing groups of explanatory variables emerging from different theories are assumed to be relevant for determining long-term per capita GDP growth. Second, mixture priors do not require these groups of covariates to be specified a priori; instead, the groups are determined in a data-driven way. In contrast to finite mixtures, Dirichlet process clustering priors have the advantage that they do not require the number of groups to be specified a priori. Compared with Durlauf et al. (2008), the proposed DPC model priors thus constitute a data-driven approach to identify theories and their associated covariates. Conditional on suitable model and regression parameter priors being specified, we outline how to proceed with the BMA analysis. We derive the correspondence between DPC model priors and the binomial model prior and describe how to obtain suitable posterior estimates given the suggested prior specifications. In particular we use the Markov chain Monte Carlo (MCMC) model composition (MC³) sampler to obtain samples from the posterior of the model specifications. We propose to use a DPC postprocessing approach to identify the groups of covariates and enable the assessment of joint inclusion and the identification of complements and substitutes based on these posterior samples. Overall the BMA approach presented is tailored for a situation where the focus is on model uncertainty with respect to specific groups of covariates being included and accounts for jointness of inclusion to provide insights on robust determinants for the dependent variable. This article is organized as follows. Section 2 gives a short overview of the BMA methodology and describes the standard model priors used in the literature hitherto. The hierarchical specification of the DPC model prior and estimation and inference, including a suitable Dirichlet process clustering postprocessing procedure to identify groups of determinants under this model prior, are derived in Section 3. Section 4 presents a simulation exercise. The results based on a BMA analysis using the MCMC model composition sampler imply that the DPC model prior may lead to slightly better results than the BB prior and to considerably better results than the dilution prior when aiming at detecting the cluster structure present in the covariates for predicting the dependent variable, provided the size of the covariate groups is suitably specified and the state-of-the-art hyper-g-prior for the regression parameters is used. Section 5 presents the results of applying the DPC model prior to a cross-country growth data set and compares them to those obtained using the standard BB model prior as well as the dilution prior. Finally, Section 6 summarizes the results and relates them to previous work on post hoc jointness analysis of covariates in BMA.

BMA AND PRIOR SPECIFICATIONS
The standard BMA setting assumes that there is a linear relationship between the n observations y and a group of covariates X_l chosen from the complete set of potential covariates X. The linear model assumes additive noise and that the dependent variable follows a normal distribution with the mean given by the linear predictor:

y | β_0, β_l, σ², γ_l ∼ N(1β_0 + X_l β_l, σ²I),

where N(μ, Σ) is the multivariate normal distribution with mean μ and variance-covariance matrix Σ, 1 is a vector of ones of length n and X ∈ R^{n×K}. X_l is obtained from X by selecting a subset of covariates. This subset selection can also be indicated by a 0-1 vector γ_l of length K. Thus the vector γ_l characterizes the set of covariates included in a specific regression model. The vector γ_l can take 2^K different values, corresponds to the covariate inclusion vector for a given model and represents a particular model specification.
Taking model uncertainty into account, the posterior distribution of the parameters given the data corresponds to

p(β_0, β, σ² | y) = ∑_{l=1}^{2^K} p(β_0, β, σ² | y, γ_l) p(γ_l | y),    (1)

with p(γ_l | y) ∝ p(y | γ_l) p(γ_l). A Bayesian approach requires specifying the regression parameter prior p(β_0, β, σ² | γ_l) and the model prior p(γ_l). In general both priors jointly impact the posterior of the regression parameters and the posterior of the model specifications.

Regression parameter priors
A usual choice for the regression parameter prior, conditional on a specific model specification γ_l, is a constant flat prior for β_0, an uninformative prior p(σ²) ∝ 1∕σ² for the error variance, and the so-called Zellner's g-prior for the regression coefficients (Zellner, 1986):

β_l | σ², γ_l, g ∼ N(0, σ²g(X_l′X_l)⁻¹).

The g-prior has the advantage that only one parameter, namely g, needs to be specified and that the marginal likelihood can be evaluated analytically. The influence of g has also been investigated in combination with different model priors (see, e.g., Fernández et al., 2001a; Ley & Steel, 2009). In particular Eicher et al. (2011) give an explicit formula for a specific model prior setting to trade off the specification of g in the regression parameter prior with the model prior. Specific choices for g are discussed and compared in Fernández et al. (2001a). To alleviate the influence of g, hyperpriors for g were proposed, with a benchmark prior given by Ley and Steel (2012). In the following the hyper-g-prior is considered using a beta distribution for g∕(1 + g) such that

g∕(1 + g) ∼ Beta(1, 1∕n).

This implies that g∕(1 + g) has mean n∕(n + 1). This corresponds to the shrinkage imposed on the regression coefficients by the unit information prior where g = n (Zeugner & Feldkircher, 2015). Alternative regression parameter priors for BMA include, among others, independent Gaussian priors, lasso and elastic net priors, and robust g-priors. In the BMA context Malsiner-Walli et al. (2019) compare independent Gaussian priors to the g-prior and Hofmarcher et al. (2015) compare lasso and elastic net priors to other parameter priors. Bayarri et al. (2012) provide explicit results on the impact of the robust g-prior on the posterior of the model specifications. The horseshoe prior has also been proposed as regression parameter prior (Carvalho et al., 2010). However, this prior already constitutes an alternative to the spike-and-slab prior used in BMA to perform variable selection and thus is in general not used within BMA applications (Piironen & Vehtari, 2017).
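The Beta(1, 1∕n) parameterization shown above is reconstructed here from the stated prior mean of the shrinkage factor and should be read as an assumption; a minimal R sketch checking that it indeed implies a prior mean of n∕(n + 1) for g∕(1 + g) is:

# Prior mean of the shrinkage factor g/(1+g) under g/(1+g) ~ Beta(1, 1/n)
n <- 40                                  # illustrative sample size (as in the simulation study below)
shrinkage <- rbeta(1e6, shape1 = 1, shape2 = 1 / n)
mean(shrinkage)                          # approximately n/(n+1)
n / (n + 1)                              # theoretical prior mean of the shrinkage factor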

Model priors
Several priors were proposed hitherto in the literature for the covariate inclusion vectors γ_l. In the following we present the binomial model prior proposed by Sala-i-Martin et al. (2004) and the extension to the BB model prior put forward by Ley and Steel (2009). These two model priors treat the covariates exchangeably. In fact, for these priors, the prior weight of a specific covariate inclusion vector only depends on the number of covariates being included and not on which specific covariates are included. The priors differ in the prior weight assigned to models of different size, with the binomial prior having more weight concentrated close to the prior expected model size and the BB prior distributing the prior weight in a more dispersed way. Finally, the collinearity-adjusted dilution (DIL) model prior proposed by George (2010) is presented. This prior takes the correlation between covariates into account and reduces the joint inclusion probability of highly correlated covariates, thus taking into account more than just the number of covariates included.

Binomial model prior
Sala-i-Martin et al. (2004) specified their prior model probabilities by choosing a prior expected model size k, with each covariate having a prior inclusion probability of π = k∕K, and K denoting the number of potential covariates. The marginal probability of the model induced by γ_l is a priori given by

p(γ_l) = π^{k_l} (1 − π)^{K − k_l},

where γ_l contains k_l covariates, that is, 1′γ_l = k_l. The prior probability of a specific model γ_l thus is the same for all models with the same number of covariates included.

BB model prior
To increase the flexibility of the model prior and to reduce the influence of the prior expected model size k, Ley and Steel (2009) proposed a hyperprior on the inclusion probability π. This changes the prior for γ_l to the BB prior, that is, in the hierarchical specification:

p(γ_l | π) = π^{k_l} (1 − π)^{K − k_l},
π ∼ Beta(a, b).

In the BB model prior, the probability of a specific model γ_l with k_l included covariates conditional on a specific value of π equals the probability under the binomial model prior with this value of π in place of k∕K. Integrating over π gives that the marginal probability of a specific model γ_l is proportional to the product of two Gamma functions,

p(γ_l) ∝ Γ(a + k_l) Γ(b + K − k_l),

with arguments depending on the choice of the beta parameters a, b and the model size k_l induced by γ_l. Again the prior probability of a specific model depends only on the number of covariates included, k_l, but not on the specific covariates. A detailed discussion of this model prior can be found in Ley and Steel (2009). They proposed to set a = 1 and b = (K − k)∕k to induce a prior expected model size of k.
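As an illustration of how the two priors distribute mass over model sizes, the following R sketch computes the implied prior distribution of the model size under the binomial and the BB prior for illustrative values K = 20 and prior expected size 3; both have the same prior mean but the BB prior is much more dispersed:

# Prior distribution of the model size under the binomial and the BB model prior
K <- 20; kbar <- 3                       # illustrative values: 20 covariates, prior expected size 3
pi <- kbar / K
a <- 1; b <- (K - kbar) / kbar
sizes <- 0:K
p_binom <- dbinom(sizes, size = K, prob = pi)
# beta-binomial: binomial probability integrated over pi ~ Beta(a, b)
p_bb <- choose(K, sizes) * beta(a + sizes, b + K - sizes) / beta(a, b)
round(cbind(size = sizes, binomial = p_binom, beta_binomial = p_bb), 3)
sum(sizes * p_binom); sum(sizes * p_bb)  # both priors have expected model size 3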

Collinearity-adjusted dilution model prior
The model priors presented in the previous sections do not take the collinearity between the potential covariates into account, but assume two covariates are a priori equally likely to be included regardless of whether they are uncorrelated or highly correlated. George (2010) criticized this approach as putting disproportionate mass on correlated covariates and proposed to adjust the prior model weight of a model γ_l by taking the value of the determinant of the correlation matrix C_l of X_l into account. Note that |C_l| = 1, with | ⋅ | denoting the determinant, if the included covariates are uncorrelated and |C_l| = 0 if they are perfectly collinear. In the DIL model prior proposed by George (2010) the prior probability of a specific model γ_l is given by

p(γ_l) ∝ f(|C_l|) π^{k_l} (1 − π)^{K − k_l},

for some monotonic function f satisfying f(1) = 1 and f(0) = 0. In the following we use the identity function for f. The function f controls the down-weighting applied to groups of covariates depending on their correlation structure compared with the binomial model prior with π = k∕K. If a set of covariates is collinear, the a priori probability of including that particular set is zero for this prior. Each noncollinear subset has a positive a priori probability of being included. If covariates are orthogonal, π = k∕K also induces a prior expected model size of k, as for the binomial prior. The prior proposed by George (2010) is based on capturing the global dependency structure between the covariates in a model by only considering the determinant. Alternatively a prior could be considered which takes a more local (e.g., pairwise) correlation structure into account to adjust the model prior for redundancy in the covariates. This would result in an alternative dilution prior in addition to the DIL model prior.
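A minimal R sketch of the (unnormalized) DIL prior weight with the identity function for f follows; the function name and the illustrative design matrix are not from the original and serve only to show the down-weighting of highly correlated covariate sets:

# Unnormalized DIL model prior weight for a 0-1 inclusion vector gamma (identity f)
dil_weight <- function(gamma, X, pi) {
  k_l <- sum(gamma)
  X_l <- X[, gamma == 1, drop = FALSE]
  detC <- if (k_l > 1) det(cor(X_l)) else 1   # |C_l| = 1 for at most one included covariate
  detC * pi^k_l * (1 - pi)^(ncol(X) - k_l)
}

set.seed(1)
X <- matrix(rnorm(40 * 20), nrow = 40)        # illustrative design matrix
X[, 2] <- X[, 1] + rnorm(40, sd = 0.05)       # make covariates 1 and 2 highly correlated
gamma_corr  <- as.integer(seq_len(20) %in% c(1, 2))
gamma_indep <- as.integer(seq_len(20) %in% c(1, 3))
dil_weight(gamma_corr, X, pi = 3 / 20)        # strongly down-weighted
dil_weight(gamma_indep, X, pi = 3 / 20)       # close to the binomial prior weight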

THE DPC MODEL PRIOR
BMA is in general used to identify covariates which are robust determinants regardless of other covariates being included. If, for a specific regression problem, different groups of covariates are assumed to be present which are associated with the outcome variable, BMA may also be employed to identify robust determinants as part of a group of covariates. To be in line with such an assumption, using a mixture prior based on latent class analysis as a model prior explicitly accounts for different groups of covariates being relevant for the outcome. Lazarsfeld (1950) developed latent class analysis to account for dependencies in categorical multivariate data. These dependencies are assumed to be caused by latent groups where the categorical variables are independent within groups. If the categorical multivariate data consist of binary variables only, the latent class analysis model corresponds to a mixture of independent Bernoulli distributions. Finite or infinite mixture models could be used as mixture priors. However, infinite mixtures, such as Dirichlet process clustering models, have the advantage that they do not require the number of groups to be fixed a priori and thus facilitate the determination of a suitable number of groups for a given data set (see, e.g., McAuliffe et al., 2006). We thus propose to use a DPC model prior, which also corresponds to the limiting case of a finite mixture prior where the number of components grows to infinity and the parameter used for the Dirichlet weights times the number of components converges to a fixed positive finite value (Green & Richardson, 2001).

Hierarchical specification
The DPC model has been developed in Bayesian nonparametrics (Antoniak, 1974; Ferguson, 1973) and has various applications in mixture modeling (see, e.g., Kim et al., 2006). The hierarchical relationships of the DPC model imply the following model prior specification on γ:

γ_{lk} | θ_l ∼ Bernoulli(θ_{lk}), independently for k = 1, … , K,
θ_l | G ∼ G,
G ∼ DP(⋅ | G_0, α).

The Dirichlet process is denoted by DP(⋅ | G_0, α), where G_0 is the base measure and α is the scaling parameter.
In the case of latent class analysis with a product of Bernoulli distributions as class distributions, the base measure G_0 corresponds to a product of beta distributions over the K covariates, that is,

G_0 = ∏_{k=1}^{K} Beta(a, b).

The set of hyperparameters for the DPC model prior thus consists of (α, a, b), the scaling parameter α for the Dirichlet process and the parameters a and b for the base measure.
Alternatively the DPC model prior can be specified using the stick-breaking representation proposed by Sethuraman (1994) (see also Liverani et al., 2015). The probability distribution for the covariate inclusion parameters θ_l is then given by:

G = ∑_{c=1}^{∞} w_c δ_{θ_c}(⋅),

where δ_θ(⋅) denotes the point mass at θ. This distribution is determined by first drawing the sticks

v_c ∼ Beta(1, α), c = 1, 2, … ,

which induce the weights

w_c = v_c ∏_{d<c} (1 − v_d).

The cluster parameters θ_c are drawn from the base measure:

θ_c ∼ G_0.

The scaling parameter α influences how many different values for the covariate inclusion parameters are a priori expected. If α is large, v_c tends to be close to zero and therefore small cluster sizes and many clusters can be expected. On the other hand small values of α imply large cluster sizes and only few clusters, that is, few different values for θ. To attenuate the influence of a specific value of α, a Gamma hyperprior is added, that is, α ∼ Gamma(2, 1).
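The stick-breaking construction can be simulated directly; the following R sketch draws covariate inclusion vectors from the DPC model prior using a finite truncation (the truncation level and the fixed value of α are illustrative assumptions, since the prior itself has infinitely many components and a Gamma(2, 1) hyperprior on α):

# Draw M covariate inclusion vectors from the DPC model prior via truncated stick breaking
set.seed(1)
K <- 20; a <- 1; b <- (K - 3) / 3        # base measure Beta(a, b) per covariate, prior mean size 3
alpha <- 1                               # scaling parameter, fixed here for illustration
C <- 50                                  # truncation level, assumption for illustration
v <- rbeta(C, 1, alpha)
w <- v * cumprod(c(1, 1 - v[-C]))        # stick-breaking weights
theta <- matrix(rbeta(C * K, a, b), nrow = C)   # cluster-specific inclusion probabilities
M <- 10
cluster <- sample(C, M, replace = TRUE, prob = w)
gamma <- matrix(rbinom(M * K, 1, theta[cluster, ]), nrow = M)
rowSums(gamma)                           # model sizes; the marginal prior expected size is about 3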

Marginal prior
The marginal prior p(γ_l | α, a, b), obtained by integrating out the Dirichlet process clustering parameters, is independent of α and is the same as that obtained for the binomial model prior for a given prior expected model size. The proposed DPC model prior is developed to be suitable for BMA applications where a group structure of covariate inclusion is suspected. It also has the advantage of marginally corresponding to a model prior already previously used in the literature, thus also providing guidance for choosing a suitable model prior among already known model priors if the group structure in covariate inclusion is to be analyzed a posteriori.
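A sketch of this correspondence, using the notation introduced above: a single draw θ_l from G with G ∼ DP(G_0, α) marginally follows the base measure G_0, so that

p(γ_l | α, a, b) = ∫ ∏_{k=1}^{K} θ_{lk}^{γ_{lk}} (1 − θ_{lk})^{1−γ_{lk}} dG_0(θ_l) = ∏_{k=1}^{K} (a∕(a + b))^{γ_{lk}} (b∕(a + b))^{1−γ_{lk}} = π^{k_l} (1 − π)^{K−k_l},

with π = a∕(a + b), which equals k∕K for a = 1 and b = (K − k)∕k, that is, the binomial model prior with prior expected model size k.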

Inference
In a BMA analysis using a DPC model prior the posterior of interest does not only consist of the regression parameters (β_0, β, σ²), but also of the group-specific parameters θ and the group sizes induced by V, which characterize the posterior distribution of γ. The posterior of all parameters can be decomposed in the following way:

p(β_0, β, σ², θ, V, γ_l | y, α, a, b) = p(β_0, β, σ² | y, γ_l) p(θ, V | γ_l, α, a, b) p(γ_l | y, a, b).

For Bayesian inference the marginal posterior of the regression parameters consisting of β_0, β and σ² can be determined separately from the marginal posterior of the Dirichlet process clustering parameters:

p(β_0, β, σ² | y) = ∑_l p(β_0, β, σ² | y, γ_l) p(γ_l | y, a, b),
p(θ, V | y, α, a, b) = ∑_l p(θ, V | γ_l, α, a, b) p(γ_l | y, a, b).

If the usual choice for the regression parameter prior is made, the posterior of the regression parameters is in general approximated using samples from p(γ | y, a, b) as the conditional probabilities of the regression parameters given a particular model specification are available in closed form. Similarly, the posterior of the DPC parameters could also be approximated using these samples. However, this is complicated because the conditional probabilities are not directly available. We will propose an alternative approach to determine point estimates for the Dirichlet process clustering parameters.

Covariate inclusion and regression parameters
The results in Section 3.2 indicate that the binomial and DPC model priors are marginally identical. This implies that the posteriors p(β_0, β, σ² | y, γ) and p(γ | y, a, b) are also identical for both model priors. Conditional on γ, the posterior for the regression parameters is independent of the model prior. The posterior for γ is proportional to the marginal likelihood times the marginal prior and thus also identical for both model priors. Thus the same techniques can be used to approximate them.
To determine the posterior p(γ | y, a, b), all possible models can be evaluated for a small number of covariates K. This is not computationally feasible for larger values of K and the posterior is approximated using MCMC methods to obtain a sample from p(γ | y, a, b). The MCMC methods allow the model space to be explored in an efficient way and switch between models based on the ratio of the posterior model weights of the current and the proposed model. If the usual regression parameter prior is used, the Bayes factors for two models can be determined in closed form (Liang et al., 2008) and thus the posterior p(γ | y, a, b) can be directly approximated with the regression parameters integrated out. This allows the use of the MCMC model composition (MC³) algorithm developed for BMA by Madigan et al. (1995) and available in the R (R Core Team, 2020) package BMS (Zeugner & Feldkircher, 2015). The MC³ algorithm implemented uses the marginal likelihood and the prior probabilities of the model prior to switch between models using birth and death as well as swap moves, that is, single covariates are either added or dropped or a pair of covariates change their roles from being included to dropped. Using this sampling scheme one can either use the empirical frequencies of the visited models or the posterior probabilities implied by the marginal likelihoods times the prior probabilities to approximate the posterior probabilities.
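Because the DPC model prior is marginally equivalent to the binomial prior (Section 3.2), the MC³ sampling step can be carried out with a fixed-inclusion-probability prior. The following R sketch shows such a call to the BMS package; the argument names reflect our reading of BMS::bms and should be checked against the package documentation, and dat is an assumed data frame with the dependent variable in the first column:

# Sketch: MC3 sampling of model specifications with the BMS package
library(BMS)
res <- bms(dat,
           burn = 2e6, iter = 3e6,   # burn-in and recorded iterations as used later in the text
           mprior = "fixed",         # binomial model prior; marginally equivalent to the DPC prior
           mprior.size = 3,          # prior expected model size
           g = "hyper=UIP",          # hyper-g-prior with prior mean shrinkage n/(n+1)
           mcmc = "bd",              # birth/death moves ("rev.jump" adds swap-type moves)
           user.int = FALSE)
coef(res)                            # aggregate posterior inclusion probabilities and posterior means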
If mixing of the MC³ algorithm is poor under the specified model prior, one could alternatively consider a different sampling scheme which ensures that the model space is explored well, for example, by using suitably adapted prior values for the different models instead of the model prior values. The posterior model weights for the visited models may then only be determined based on the marginal likelihood and the prior probabilities of the specified model prior, while the empirical frequencies of the visited models are not suitable measures. Further alternative sampling schemes with better mixing have also been proposed, see, for example, Clyde et al. (2011) and Hubin and Storvik (2018). In the following empirical analysis, only the MC³ algorithm is employed.
In the following it is assumed that M model vectors are available which are drawn using M MCMC iterations to approximate p(γ | y, a, b) and that these are combined in Γ, that is, Γ = (γ_1, … , γ_M). Given Γ, the posterior of the regression parameters in Equation (1) can be approximated using

p(β_0, β, σ² | y) ≈ (1∕M) ∑_{m=1}^{M} p(β_0, β, σ² | y, γ_m).

Alternatively the sum can be taken over the unique set of covariate inclusion vectors in Γ using as weights the theoretical weights based on the marginal likelihoods, rescaled to sum to one over the set of considered covariate inclusion vectors (for the derivation of the marginal likelihoods see, e.g., Clyde et al., 2011).

DPC parameters
The posterior p(θ, V | y, α, a, b) could also be approximated using

p(θ, V | y, α, a, b) ≈ (1∕M) ∑_{m=1}^{M} p(θ, V | γ_m, α, a, b),

where the γ_m are drawn from p(γ | y, a, b). However, the conditional probabilities of the Dirichlet process clustering parameters given the particular model specification are not readily available. In the following we thus proceed in a different way to obtain point estimates for θ and V. These point estimates of θ and V help to understand any group structure present in covariate inclusion. The distribution p(γ_l | y, a, b), for l = 1, … , 2^K, is approximated by the empirical distribution of Γ. Assuming a mixture distribution for γ, a suitable partition into different groups is determined and point estimates for θ and V are derived based on the partition. The approach follows the procedure suggested in Molitor et al. (2010) and Liverani et al. (2015) for employing the DPC model and consists of the following three steps:
1. Use MCMC sampling to obtain a sample of partitions which group covariate inclusion vectors together in the Dirichlet process clustering model.
2. Postprocess the sample of partitions to obtain a final grouping.
3. Determine point estimates for the group-specific parameters conditional on the final grouping.

Regarding Step 1, different MCMC samplers have been proposed to obtain the sampled partitions. In the following the slice sampler proposed by Kalli et al. (2011) is used as implemented in the R package PReMiuM (Liverani et al., 2015). If J MCMC sampling iterations are recorded after discarding the burn-in iterations, Step 1 results in J sampled partitions 𝒫_1, … , 𝒫_J. Each partition 𝒫_j groups the M covariate inclusion vectors γ_m, m = 1, … , M, contained in Γ into nonempty subsets such that every vector γ_m is included in one and only one subset. For Step 2, Molitor et al. (2010) proposed the following procedure to determine a consensus partition. First, an estimate of the posterior coassignment probabilities to the same cluster for pairs of observations using the sampled partitions is determined. This matrix of posterior coassignment probabilities represents a similarity matrix between observations and thus can be converted to a dissimilarity matrix and used as input to a standard clustering method for grouping observations based on a dissimilarity matrix. Each entry of the dissimilarity matrix contains the proportion of partitions in which the two corresponding covariate inclusion vectors are assigned to different groups of the partition, that is, the dissimilarity between γ_k and γ_l is determined by

d(γ_k, γ_l) = (1∕J) ∑_{j=1}^{J} [1 − ∑_{c=1}^{C(j)} 1(k ∈ P_c^j) 1(l ∈ P_c^j)],

where 1(⋅) is the indicator function and the partition 𝒫_j is given by the set {P_c^j}_{c=1,…,C(j)}, with C(j) the number of groups in the jth partition. Each P_c^j contains the indices of the covariate inclusion vectors assigned to the cth group in the partition. The complete dissimilarity matrix is then given by D = (d(γ_k, γ_l))_{k,l=1,…,M}. Given a dissimilarity matrix, a standard clustering method which may be employed is partitioning around medoids (PAM; Kaufman & Rousseeuw, 1990). PAM determines an optimal partition based on a dissimilarity matrix for a given number of subsets. The PAM algorithm is similar to the k-means algorithm and groups observations into k groups, with k being prespecified. The standard k-means algorithm uses the squared Euclidean distance and employs an iterative scheme where (1) cluster centers are determined which have minimal distance to the observations currently assigned to the cluster and (2) observations are assigned to their closest cluster centers. This iterative scheme is varied in PAM by enforcing that cluster centers correspond to actual observations in the data set and are thus referred to as "medoids" and by allowing other distance and dissimilarity measures, that is, the dissimilarity measure used to determine the dissimilarity matrix. In contrast to k-means, PAM does not require the data matrix, but can also be applied if only the dissimilarity matrix between observations is available. Kaufman and Rousseeuw (1990) proposed to use the average silhouette width as criterion to select the optimal number of clusters. The average silhouette width is determined as the mean of the silhouette values of all observations. Rousseeuw (1987) defines the silhouette value for an observation as the difference between the average dissimilarity of this observation with all observations in its next closest, that is, neighboring, cluster and with the observations in its own cluster, divided by the maximum of the two values to obtain a score in [−1, 1]. Negative values indicate that the average dissimilarity of this observation is smaller for the neighboring cluster than for its own cluster. The higher the average silhouette width, the better the clustering solution.
Using the average silhouette width as criterion, the selected number of clusters is always at least two. In order to also consider one cluster as a suitable clustering solution, the C index as implemented in Charrad et al. (2015) is used as an additional criterion to assess whether forming only one cluster would be preferable (see also Charrad et al., 2014). In Charrad et al. (2015), the C index is determined by subtracting from the average within-cluster dissimilarities the minimum within-cluster dissimilarity and dividing by the difference between the maximum and minimum within-cluster dissimilarity. If the C index does not suggest preferring the one-cluster solution, the best clustering solution (with at least two clusters) is selected according to the average silhouette width.
Step 2 results in a consensus partition of the covariate inclusion vectors in Γ. In Step 3 this consensus partition is the basis for determining cluster sizes and cluster-specific parameters. In the following only point estimates based on the best consensus partition are considered. Point estimates for V are determined based on the relative sizes of the groups in the best consensus partition. The best consensus partition splits the model vectors in Γ into different groups. The point estimates for θ are obtained by calculating the mean covariate inclusion separately for each group of model vectors.
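A compact R sketch of Steps 2 and 3 is given below. It assumes that a J x M matrix partitions of sampled cluster labels (one row per recorded sweep of the Step 1 sampler, e.g., obtained with PReMiuM) and the corresponding M x K 0-1 matrix gamma_sample of inclusion vectors in Γ are available; the range of candidate cluster numbers is illustrative and the C-index check for a single cluster is omitted:

# Step 2: dissimilarity = proportion of partitions assigning two vectors to different groups
library(cluster)
J <- nrow(partitions); M <- ncol(partitions)
dmat <- matrix(0, M, M)
for (j in seq_len(J)) {
  same <- outer(partitions[j, ], partitions[j, ], "==")
  dmat <- dmat + (1 - same)
}
dmat <- dmat / J

# Consensus partition via PAM; number of clusters chosen by the average silhouette width
ks <- 2:5                                        # illustrative range of cluster numbers
avg_sil <- sapply(ks, function(k) pam(as.dist(dmat), k = k)$silinfo$avg.width)
best_k <- ks[which.max(avg_sil)]
consensus <- pam(as.dist(dmat), k = best_k)$clustering

# Step 3: cluster sizes (related to V) and cluster-specific inclusion probabilities (theta)
cluster_sizes <- prop.table(table(consensus))
cluster_pips  <- apply(gamma_sample, 2, function(x) tapply(x, consensus, mean))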
The DPC postprocessing procedure might be applied regardless of the model prior used in the BMA analysis to obtain the posterior sample Γ of model vectors. However, if the DPC model prior is not specified, this corresponds to a post hoc analysis which aims at identifying a structure that has not been a priori supported by the specified model prior. In the following empirical analysis, we also combine the Dirichlet process clustering postprocessing step with BMA results obtained with other model priors to assess the impact of a suitable model prior specification.

SIMULATION STUDY
The simulation study focuses on investigating the differences in results obtained for a BMA analysis with different model priors and using the DPC postprocessing procedure to identify groups of determinants with artificial data. The model priors compared are the DPC, the BB, and the DIL model prior. The data generation process assumed for the artificial data is in line with a Dirichlet process clustering model prior. Two competing groups of covariates predict the dependent variable equally well and these two competing groups of covariates are of equal size. Of primary interest are the results obtained with model priors where the prior expected model size corresponds to the size of the true competing groups of covariates and using the hyper-g-prior for the regression parameters. In addition the differences in results obtained are discussed if the prior expected model size is increased and different regression parameter priors are considered. The artificial data are generated using a set of 20 potential covariates, x_k, k = 1, … , 20. The dependent variable y is generated such that either the first three covariates (k = 1, 2, 3) or the next three covariates (k = 4, 5, 6) are suitable regressors in the linear model. This means

y = ∑_{k=1}^{3} x_k + ε ≈ − ∑_{k=4}^{6} x_k + ε,

where ε ∼ N(0, σ²). To ensure that the dependent variable can be predicted using either group of covariates, the covariates are generated in the following way: First, 20 independent, standard normally distributed covariates x_k are generated and the expected value of y is assumed to be equal to ∑_{k=1}^{3} x_k. Then, x_6 is redefined to ensure that −∑_{k=4}^{6} x_k ≈ ∑_{k=1}^{3} x_k, that is,

x_6 = −∑_{k=1}^{3} x_k − x_4 − x_5 + η,

where η ∼ N(0, 0.001²). The normally distributed noise η is added in order to avoid multicollinearity problems. This implies a negative correlation between x_6 and the covariates x_j, j = 1, … , 5, while all other covariates are uncorrelated. The first three covariates as well as x_6 are directly correlated with y, whereas the covariates x_4 and x_5 have substantial partial correlation with y conditional on x_6. Predictions of y based on the column space spanned by the first three covariates perform equally well as those based on the column space spanned by the next three covariates.
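The data-generating process can be written down compactly; the following R sketch generates one artificial data set along these lines (the seed and the chosen value of σ are illustrative):

# Generate one artificial data set following the described data-generating process
set.seed(42)
n <- 40; K <- 20
X <- matrix(rnorm(n * K), nrow = n, ncol = K)
X[, 6] <- -rowSums(X[, 1:3]) - X[, 4] - X[, 5] + rnorm(n, sd = 0.001)
sigma <- 0.5                              # one of the considered values {0.01, 0.1, 0.5, 1, 2}
y <- rowSums(X[, 1:3]) + rnorm(n, sd = sigma)
# Either x1-x3 or x4-x6 predict y (approximately) equally well
summary(lm(y ~ X[, 1:3]))$r.squared
summary(lm(y ~ X[, 4:6]))$r.squared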
In the simulation study the number of observations is set to 40. This corresponds to a usual scenario where BMA methods are employed to account for model uncertainty in the context of long-term economic growth. The standard deviation σ of the error term is varied to assess how results differ depending on the strength of the linear relationship. The influence of the specified model prior on the results is assumed to be smaller if σ is smaller and the information contained in the data is stronger. The five different values considered for σ are {0.01, 0.1, 0.5, 1, 2}. Note that the number of observations and σ are related and a similar signal-to-noise ratio can be achieved by increasing both, the number of observations and σ. For each setting 100 artificial data sets are generated.
The analysis is performed for three different model priors: (1) the DPC, (2) the BB, and (3) the DIL model prior. Standard BMA analysis is performed for each model prior to obtain a sample Γ from the posterior distribution of covariate inclusion vectors. In the following detailed results are provided where the prior expected model size for all three model priors is set to 3 (under the assumption of orthogonal covariates for the DIL prior) and the hyper-g-prior for the regression coefficients is used. The sample is drawn using three million iterations after two million burn-in iterations of the MC³ algorithm implemented in the R package BMS (Zeugner & Feldkircher, 2015).
The convergence of the MC³ algorithm is assessed using the correlation between the posterior model probabilities determined using the marginal likelihoods as well as using the empirical frequencies of how often the models were visited. On average across all artificial data sets and the different specifications, these correlations are high, indicating a strong congruence between those posterior model probability estimates and suggesting convergence of the chains. For the three different model priors the average correlations vary between 0.980 for the DPC model prior, 0.971 for the BB model prior, and 0.973 for the DIL model prior. As an additional check to investigate the mixing of the chains, two chains are initialized using either of the true models and combined. Results obtained with this initialization scheme are almost the same for the Dirichlet process clustering and the BB model prior. Slightly better clustering performance is obtained in this case for the DIL model prior. This indicates that mixing of the chains is worse when using the DIL model prior and more care is needed when using this model prior. This also suggests that the performance of the DIL model prior observed might not only be influenced by the characteristics of the model prior imposed, but also by the difficulties of the sampling scheme employed to obtain suitable draws from the posterior of the model specifications.
To increase the computational efficiency in the Dirichlet process clustering postprocessing step the sample is reduced to only 10,000 sampled covariate inclusion vectors γ_m such that these vectors have the highest posterior weights determined based on the marginal likelihoods and that their inclusion frequencies are proportional to these posterior weights. This reduced sample is analyzed using DPC as implemented in package PReMiuM (Liverani et al., 2015). For this analysis a Gamma(2, 1) prior for α and an expected model size corresponding to the prior expected model size of the model priors are used, that is, a = 1 and b = (K − k)∕k with k = 3 and K = 20. Table 1 summarizes the model specifications used as input for the postprocessing procedure which are obtained with MC³ for the three model priors consisting of (1) the Dirichlet process clustering (DPC) model prior, (2) the BB model prior, and (3) the DIL model prior. The table contains for each of the specifications the average performance over the 100 artificial data sets with the standard deviation across data sets in parentheses.

TABLE 1: Correct estimation, that is, fraction of "true" models for the different model priors and specifications including either the first three (k = 1, 2, 3) or the next three (k = 4, 5, 6) covariates (top part, columns 2-4); overestimation, that is, fraction of models which contain either the first three or the next three covariates plus additional covariates (top part, columns 5-7); missingness, that is, fraction of models which include neither x_1 to x_3 nor x_4 to x_6 (bottom part, columns 2-4); mean model size (bottom part, columns 5-7).

The results are summarized by reporting the correct estimation, that is, the fraction of model specifications containing either the first three (k = 1, 2, 3) or the next three (k = 4, 5, 6) covariates and no other covariates, the overestimation, that is, the fraction of model specifications containing either the first three or the next three covariates plus additional covariates, and the missingness, that is, the fraction of model specifications which contain neither the full set of the first three nor the full set of the second three covariates. The values for correct estimation, overestimation, and missingness sum to one. In addition also the mean model size is reported. Columns 2-4 at the top part of Table 1 display the correct estimation of the different model priors. For small values of σ all model priors have high values which decrease with increasing σ, that is, when the signal-to-noise ratio deteriorates. For small to medium values of σ the fraction of model specifications which also contain additional, superfluous covariates increases. For σ = 1 the correct estimation values are small for all model priors because of overestimation as well as missingness, whereas if σ is further increased overestimation is reduced and missingness is further increased. This is also reflected in the mean model size results which increase with increasing σ up to σ = 1 for all model priors, but have the smallest values for σ = 2 for all model priors. A comparison of the three different model priors indicates that for this simulation setting the DPC and DIL prior perform similarly based on these performance criteria, while the BB prior has a slightly worse performance because of higher rates of overestimation for medium values of σ.
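The reduction of the posterior sample to 10,000 vectors with frequencies proportional to the posterior model weights can be implemented, for instance, as a simple rounded weighted expansion; the objects unique_models (matrix of unique inclusion vectors) and post_weights (their marginal-likelihood-based posterior weights, summing to one) are assumed to be available from the MC³ output:

# Reduce the posterior sample to about 10,000 inclusion vectors with frequencies
# proportional to the posterior model weights (unique_models and post_weights assumed given)
n_keep <- 10000
counts <- round(n_keep * post_weights)            # expected frequency of each unique model
keep <- counts > 0                                 # retains the models with the highest weights
gamma_sample <- unique_models[rep(which(keep), counts[keep]), , drop = FALSE]
dim(gamma_sample)                                  # approximately 10,000 x K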

The proposed DPC postprocessing approach is then applied to these sampled model specifications regardless of the model prior specified. First, the results for the automatic procedure to select a suitable number of clusters based on the average silhouette width in combination with the variant of the C index are presented. Table 2 contains the number of clusters selected based on this criterion for each of the 100 artificial data sets for the different values of σ as well as the three different model priors. If σ is at most 0.5, all three model priors correctly identify the two clusters for almost all data sets, with a slightly worse performance for the DIL prior. The two clusters are less frequently correctly identified if σ is increased beyond a value of 0.5. Higher values of σ lead to more clusters being identified for a large proportion of data sets regardless of the model prior employed. For small values of σ the number of clusters is underestimated in rare cases for any of the model priors. This might be due to poor mixing of the Markov chain for small σ values where switching between the two groups of covariates which predict the dependent variable equally well is less likely. In fact an improved clustering performance is observed if a different sampling scheme is employed which consists of combining the sampled model specifications using two different initializations where each of the initializations corresponds to one of the two true solutions. In the following the DPC postprocessing procedure is applied assuming that it is known that there are two clusters in the data. This means that not the average silhouette width in combination with the variant of the C index is used to select the number of clusters, but the number is fixed to two. Tables 3, 4, and 5 contain the marginal posterior inclusion probabilities (PIPs) on the aggregate level for all three model priors as well as the cluster-specific PIPs determined based on the results from the DPC analysis applied to Γ after fixing the number of clusters to two. Only the tables for the results for σ ∈ {0.01, 0.5, 2} are reported. Mixing of the MC³ algorithm deteriorates for small values of σ, for example, σ = 0.01. For σ = 0.5 or larger, the correlation between the posterior model probabilities induced by the marginal likelihoods as well as by the empirical frequencies is at least 0.98 regardless of the model prior employed, indicating good convergence behavior. Average results over the 100 artificial data sets are shown. In addition only the results for the first 10 covariates are included because the results for covariates 11-20 are similar to those obtained for covariates seven to 10 and these rather redundant results were thus omitted.

TABLE 3: Simulation results averaged over 100 data sets for σ = 0.01.

Table 3 contains the results for σ = 0.01 and clearly shows that results are very similar for the DPC and BB model priors. The aggregate results indicate that the first six covariates have a PIP of about 0.5 with a posterior mean (PM) of also about 0.5 or −0.5. The two clusters of covariates which both equally well predict the dependent variable are correctly identified when applying the DPC postprocessing procedure to Γ for both model priors. This implies that if the information in the data is very strong, results are similar regardless of whether a BB or DPC model prior is employed. However, the situation is different for the DIL model prior which fails to identify the group structure in the covariate inclusion vectors.
In fact, for both identified clusters the inclusion probabilities are spread over the six data generating covariates x_1, … , x_6. The DIL model prior fails because the covariates are not only correlated between groups of covariates, but also within one group of covariates in this simulation setup. Table 4 indicates how the performance of the different model priors changes if σ is increased from 0.01 to 0.5. In this case the aggregate PIP and PM deteriorate due to the worse signal-to-noise ratio of this setting with increased error variance. However, the cluster-specific estimates for all three model priors suggest that these two groups of covariates are well identified and thus primarily the group sizes vary in dependence of the model priors. We can infer from Table 5 that for σ = 2 the performance of estimating the aggregate PIPs deteriorates further compared with σ = 0.5 for all three model priors, with the DPC model prior having slightly better performance. This observation also applies to the cluster-specific estimates where the estimated PIPs for the DPC model prior relate better to the true groups of covariate inclusion than those estimated for the other two model priors.
TABLE 4: Simulation results averaged over 100 data sets for σ = 0.5.

TABLE 5: Simulation results averaged over 100 data sets for σ = 2.

To assess the compactness of the identified clusters the average deviation from the true cluster-specific covariate inclusion vector is determined for each cluster separately. The true cluster-specific covariate inclusion vectors c_1 and c_2 correspond to the vectors c_1 = (1, 1, 1, 0, … , 0) and c_2 = (0, 0, 0, 1, 1, 1, 0, … , 0), respectively. The deviation for cluster j = 1, 2 is given by D_j = ∑_{k=1}^{K} |PIP_{jk} − c_{jk}|, with K denoting the total number of covariates, and PIP_{jk} the PIP of variable k for cluster j. The clusters are labeled to minimize the deviation. Results are shown in Table 6 for the three considered model priors and the different values of σ. The values are averaged over 100 artificial data sets and the clusters are identified assuming that two clusters are present. Again increasing σ leads to less compact clusters regardless of whether a DPC or BB model prior is used. However, for the same value of σ the DPC model prior in general gives better results than the BB model prior. Although according to Table 1 the DIL model prior slightly outperforms the DPC prior in terms of correct estimation, the results of Table 6 show for the DIL model prior that this prior gives particularly bad results in case σ is small. This may be caused by the fact that the DIL model prior down-weights highly correlated models. This deteriorates the mixing of the MCMC sampler and the sampler is more likely to be stuck in one "good" region of the model space. (For MCMC sampling the correct prior model weights are used. Alternatively, one could consider using different weights for sampling and then reweighting the sampled models after the MCMC procedure. As a further alternative one could reduce the penalty term of the DIL prior, for example, by using √|C_l| instead of |C_l|. This may also improve the mixing of the MCMC chain for the DIL prior in this case.) Table 6 also contains the average absolute differences between the two cluster sizes for the simulation exercise. Cluster sizes correspond to the proportion of model specifications in Γ being assigned to either cluster 1 or 2. For the DPC prior with σ = 0.01 we can infer that the mean absolute difference between the cluster sizes is (approximately) 0.00. This implies that roughly 50% of the sampled models are assigned to cluster 1 and 50% to cluster 2, which is in line with our data generating process. On the other hand, we do not observe this pattern for the DIL prior, where the average difference in size is 0.72 for σ = 0.01. This result is also in line with the worse mixing of the MC³ algorithm for the DIL model prior. This effect vanishes if σ is increased, leading to all model priors having a similar performance with respect to the cluster sizes with an absolute difference of about 0.36-0.39 in cluster sizes. The simulation study was repeated using a different prior expected model size. The results reported so far correspond to the case where the prior expected model size corresponds to the correct model size. Investigating the results for a prior expected model size of 8 indicates the performance if a wrong prior expected model size is imposed. In this case the BB prior performs best regarding the sampled models and in fact comparable to the case where the prior expected model size corresponds to the "correct" number of 3. The correct estimation decreases for the DPC prior while the overestimation increases.
The performance of the DIL prior with respect to the sampled models is in-between the DPC and BB model priors, while clustering results with respect to the number of clusters selected are rather stable regardless of the prior expected model size specified.

TABLE 6: Within-cluster deviation from the true cluster-specific covariate inclusion vectors and the absolute difference in cluster size (Δ) averaged over 100 artificial data sets.

If two clusters are specified, the cluster profiles determined for the smallest σ value and the DPC model prior reflect the true grouping structure well even though also irrelevant covariates have higher inclusion probabilities due to the larger a posteriori expected model sizes. However, these inclusion probabilities are evenly distributed over all irrelevant covariates, thus not negatively impacting the main interpretations drawn. But these profiles obviously lead to worse within-cluster deviations.
In a further extension of the simulation study, we varied the regression parameter priors using priors where g is fixed, in particular the unit information prior (UIP) with g = n and the risk inflation criterion prior (RIC) with g = K², in addition to the hyper-g-prior. Results indicate that changing the regression parameter priors from a hyper-g-prior to a UIP prior or a RIC prior has a similar effect as increasing the prior expected model size. For the RIC prior, and even more pronounced for the UIP prior, the posterior mean model size is increased compared with the hyper-g-prior and is higher than 3 even if the prior expected model size is set to 3. While this leads to a decrease in correct models being sampled, the correct number of clusters is selected similarly well and the interpretation of the identified cluster profiles is unchanged as the increased PIPs are equally spread across the irrelevant covariates, thus only slightly increasing the individual PIPs of the irrelevant covariates. However, the within-cluster deviation criterion deteriorates. Changing the regression parameter priors changes the performance of the model priors, but their relative performance remains comparable, with the DPC model prior performing best. These results indicate that clear advantages of the hyper-g-prior for the regression parameters compared with the UIP or RIC prior are discernible and the additional flexibility of the hyper-g-prior allows the correct model size to be better captured a posteriori.
The results of the simulation study indicate that the DPC model prior performs particularly well if not only different groups of covariates are relevant, but also reliable guidance on the size of these groups is available.

EMPIRICAL APPLICATION
The economic growth data set presented in Sala-i-Martin et al. (2004) has previously been analyzed using jointness analysis for postprocessing BMA analysis results to identify complements and substitutes in the covariates and thus provide a richer picture on robust determinants for long-term economic growth. The data set contains income per capita growth rates for 88 countries over the period 1960-1996 as well as 67 covariates which have been proposed as potential determinants of income growth in the literature. In line with Sala-i-Martin et al. (2004) the BMA analysis is performed with a prior expected model size of 7 (under the assumption of orthogonal covariates for the collinearity-adjusted dilution prior) and the hyper-g-prior for the regression coefficients. We base our inference concerning the PIPs of covariates on three million MCMC iterations, after discarding the first two million draws as burn-in. The MCMC sampler uses the MC³ algorithm with birth and death as well as swap moves. The 20 covariates with the highest PIP values are the same regardless of which model prior is specified for the BMA analysis. For all three model priors the same seven covariates have ranks 1-7 even though the ordering is not exactly the same. These covariates have a PIP of at least 0.24 which is considerably higher than the prior inclusion probability of 7∕67 ≈ 0.1. The covariate with the highest PIP, EAST, is the same for all three model priors with a PIP of around 90%. Considering the other covariates with high PIPs, their inclusion is in general higher for the DPC model prior than for the BB or the DIL model prior. The only exception is MALFAL66 where the PIP for the BB and the DIL model prior is substantially higher than the PIP for the DPC model prior. Covariate LIFE060 has rank 8 for the DPC and the BB model prior with a similar PIP of 0.14 and 0.13, while its PIP is much lower for the DIL model prior. For the DIL model prior LIFE060 has a PIP of 0.05 which corresponds to rank 18. On the other hand, covariates measuring religious fractions (BUDDHA, MUSLIM00) increase their importance using the DIL model prior and they also obtain better ranks in comparison.
Depending on the model prior specifications, the BMA analysis gives different samples from the posteriors of the model specifications. The DPC postprocessing procedure is applied to the samples from the posterior of the model specifications to identify groups of determinants. For computational efficiency only at most 10,000 covariate inclusion vectors are retained from the sampled vectors. The retained covariate inclusion vectors correspond to the models with the highest posterior weights (based on the marginal likelihoods) and the frequencies of these covariate inclusion vectors are proportional to the posterior weights. The priors used in the DPC postprocessing procedure are a Gamma(2, 1) prior for α and a = 1 and b = (K − k)∕k with k = 7 and K = 67 for the beta prior.
In a first step in the postprocessing procedure the number of clusters needs to be determined. Table 8 contains the average silhouette width values obtained for clustering into different numbers of clusters for the different model priors, varying the number of clusters from 2 to 5. The average silhouette width values are highest for the DPC model prior, followed by the BB prior and with the smallest values for the DIL prior. This indicates that for the DPC prior a better cluster solution is obtained. The number of clusters is selected using the maximum value of the average silhouette widths. This criterion clearly suggests to select four clusters for the DPC and the BB model prior. For the DIL model prior the average silhouette width value indicates five clusters.
The cluster-specific PIP estimates obtained using the selected number of clusters, that is, four for the DPC and the BB and five for the DIL model prior, are displayed in Figure 1. The results are visualized by cluster and model prior with clusters arranged by column and the model priors by row. For the 20 most important covariates (according to the aggregate PIP values), the bar plots display the cluster-specific PIP values together with the aggregate PIP values. Figure 1 shows that overall the cluster profiles indicated by the cluster-specific PIP values are very similar for the DPC and the BB model prior. The correspondence between the cluster profiles even seems to be stronger than the similarity between the aggregate PIP values for the two model priors. Essentially the same clusters, that is, groups of covariates which occur together, are detected and only the sizes of the clusters differ between the model priors (see Table 9).

TABLE 9: Cluster sizes, cluster-specific mean model sizes, and within-cluster mean absolute deviations obtained when applying Dirichlet process clustering postprocessing to the samples Γ obtained to approximate the posterior distribution of model specifications using as model priors the Dirichlet process clustering, the beta-binomial, and the collinearity-adjusted dilution priors.

The interpretation of the clusters is essentially the same for the DPC and BB model priors. For example, covariate EAST, which is a dummy covariate, has a high PIP on the aggregate level (black line), as well as for the first two clusters, but loses importance in the third and fourth cluster. While EAST loses importance in the third cluster, covariates like LAAM, SAFRICA, or CONFUC are gaining importance. This implies that the groups of covariates where either EAST or LAAM, SAFRICA, and CONFUC are contained constitute robust determinants of economic growth. Furthermore, inspecting Figure 1, we find a complementary relationship between primary school enrollment P60 and the price of investment goods IPRICE. For each cluster, both PIPs are either above or below the aggregate level. While Ley and Steel (2007) only found very weak evidence of jointness in the data at hand, our results are perfectly in line with Doppelhofer and Weeks (2009).
The cluster profiles of the DIL model prior are harder to compare to the clustering solutions of the other two model priors because the solution contains five instead of four clusters. Still, some congruence between the solutions can be discerned. Cluster 1 of the other two model priors appears to be split into two clusters, cluster 1 and cluster 3, for the DIL model prior. This is also in line with the cluster sizes reported in Table 9, where the sum of these two cluster sizes lies between the size of cluster 1 for the DPC model prior and that for the BB model prior. Cluster 2 has the same profile for all three model priors and also a similar size. Cluster 3 of the DPC and the BB model priors seems to re-emerge as cluster 5 in the DIL model prior results, again with a comparable cluster size. Cluster 4 then corresponds to a similar group across all three model prior results. The results for the DIL model prior also indicate that covariate EAST has a high PIP for some clusters, in this case clusters 1-3, but a negligible PIP for the other clusters, in this case clusters 4 and 5, whereas covariates like LAAM, SAFRICA, or CONFUC again gain importance for the latter two clusters.
The clustering solutions obtained are further characterized by the cluster sizes, the cluster-specific mean model sizes, and the within-cluster mean absolute deviations reported in Table 9. The cluster sizes correspond to the fractions of model specifications assigned to each of the clusters by the DPC postprocessing procedure. For the DPC model prior cluster 1 is larger than for the BB prior; the reverse is true for cluster 2. The mean model sizes of the clusters indicate that the model size of cluster 1 is closer to the specified prior expected model size than that of cluster 2. These results are thus in line with the fact that the BB model prior distributes the prior mass over a larger range of model sizes. The compactness of the clusters is assessed using the within-cluster deviation from the cluster centroids, where each cluster centroid is simply the mean of the covariate inclusion vectors assigned to the cluster. Similar compactness is observed for all considered model priors.
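A minimal sketch of the summaries reported in Table 9 is given below, under the same assumptions as before; the exact aggregation of the within-cluster mean absolute deviation used for the table may differ from this simple version.

```python
import numpy as np

def cluster_summaries(gamma_retained, labels):
    """Cluster sizes (fractions of retained model specifications),
    cluster-specific mean model sizes, and within-cluster mean absolute
    deviations from the cluster centroids (cf. Table 9)."""
    out = {}
    n = len(labels)
    for c in np.unique(labels):
        g = gamma_retained[labels == c]
        centroid = g.mean(axis=0)                    # mean inclusion vector of the cluster
        out[c] = {
            "size": len(g) / n,                      # fraction of retained specifications
            "mean_model_size": g.sum(axis=1).mean(), # average number of included covariates
            "mad": np.abs(g - centroid).mean(),      # within-cluster mean absolute deviation
        }
    return out
```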

DISCUSSION
BMA identifies robust determinants in regression applications where model uncertainty arises from the availability of a large set of potential covariates. In the absence of theory pinpointing the relevant covariates, data-driven methods are used to assess their relevance. In general, interest lies not only in the importance of single covariates, irrespective of which other covariates are included, but also in the joint inclusion or exclusion of covariates. Such joint patterns indicate the importance of theories (Durlauf et al., 2008) and reveal whether covariates act as complements or substitutes, which can be assessed using bivariate jointness measures (Doppelhofer & Weeks, 2009; Ley & Steel, 2007; Strachan, 2009) or Dirichlet process clustering postprocessing (Crespo Cuaresma et al., 2016). The DPC postprocessing approach proposed by Crespo Cuaresma et al. (2016) assumes that there exist groups of covariates which interchangeably predict the dependent variable. These groups correspond to different theories, but neither the number of theories nor the covariates associated with each theory are known a priori. The approach builds on the results of a standard BMA analysis and decomposes the dependency structure in the posterior distribution of covariate inclusion by splitting the inclusion patterns into clusters such that there is no dependency of covariate inclusion within clusters. Covariate importance, measured by PIPs, is then assessed at the cluster level.
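To make the notion of bivariate jointness concrete, the following sketch computes a posterior log odds ratio of joint inclusion from the retained inclusion vectors. This is only one simple variant in the spirit of the cited measures; the published definitions in Ley and Steel (2007) and Doppelhofer and Weeks (2009) differ in detail.

```python
import numpy as np

def log_odds_jointness(gamma_retained, i, j, eps=1e-12):
    """Posterior log odds ratio of joint inclusion for covariates i and j.
    Positive values suggest complements, negative values substitutes."""
    gi, gj = gamma_retained[:, i], gamma_retained[:, j]
    p11 = np.mean(gi * gj)               # both included
    p10 = np.mean(gi * (1 - gj))         # only i included
    p01 = np.mean((1 - gi) * gj)         # only j included
    p00 = np.mean((1 - gi) * (1 - gj))   # neither included
    return np.log((p11 * p00 + eps) / (p10 * p01 + eps))
```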
Using a standard BMA analysis, their approach does not impose a model prior corresponding to the structure imposed a posteriori when clustering the covariate inclusion patterns. To obtain this congruence between the structures imposed a priori and a posteriori, we propose to use a model prior in the BMA analysis which is based on Dirichlet process clustering. We show that the marginal distribution of this model prior is identical to that of the binomial model prior. This implies that collapsed sampling can be used to approximate the posterior distribution of covariate inclusion, with a sampling scheme identical to previously used sampling procedures. In a postprocessing step the visited models are clustered to detect latent class structures within the covariate inclusion patterns and to characterize these latent groups through group-specific point estimates. This step is equivalent to the postprocessing analysis already proposed in Crespo Cuaresma et al. (2016).
This article provides a framework for a BMA analysis in which the a priori belief of competing groups of covariates is explicitly included and taken into account. A DPC model prior is proposed in which the clusters correspond to these competing groups of covariates. Interestingly, this approach can be implemented by re-using already available tools for sampling from the posterior of covariate inclusion and by clustering the visited models in a post hoc analysis. The posterior model weights in BMA are obtained from the marginal likelihoods, which are available in closed form, so the MCMC sampling only needs to employ a model prior which allows the relevant model space to be suitably explored. The simulation study implies that, for the regression parameter prior, the hyper-g-prior outperforms priors using fixed values for g, while for the model prior the hierarchical specification using a BB model prior does not dominate the DPC prior if groups of covariates are to be identified in a postprocessing step to take the jointness of covariates into account when assessing the robustness of determinants. The empirical illustration indicates that similar clustering results are obtained regardless of the model prior imposed, even though slightly worse results are obtained for the DIL model prior. The simulation study suggests that this might be due to the data being informative about the clustering structure, reducing the sensitivity to the model prior and leading to similar results for the DPC and the BB model priors.
In this article we only considered the MC3 algorithm to perform the BMA analysis. This algorithm, however, might be prone to poor mixing. If different groups of covariates are relevant, the sampler needs to visit the different modes corresponding to these groups for the proposed approach to correctly identify them in the postprocessing step. Alternative sampling approaches for BMA with better mixing behavior, such as those proposed by Clyde et al. (2011) and Hubin and Storvik (2018), might thus be preferable when groups of relevant covariates are suspected. The impact of their use should be investigated in future work.
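For reference, a minimal sketch of an MC3-type sampler with a single add/drop proposal is given below; the callables log_marglik and log_model_prior are hypothetical interfaces, and the sketch illustrates why such a single-flip sampler may struggle to move between well-separated modes corresponding to different groups of covariates.

```python
import numpy as np

def mc3(log_marglik, log_model_prior, K, n_iter=100_000, seed=0):
    """Minimal MC3 sketch: propose flipping the inclusion of one randomly
    chosen covariate and accept with the Metropolis-Hastings ratio.
    `log_marglik(gamma)` and `log_model_prior(gamma)` are user-supplied
    callables evaluating a binary inclusion vector gamma of length K."""
    rng = np.random.default_rng(seed)
    gamma = np.zeros(K, dtype=int)                # start from the null model
    cur = log_marglik(gamma) + log_model_prior(gamma)
    draws = []
    for _ in range(n_iter):
        prop = gamma.copy()
        j = rng.integers(K)
        prop[j] = 1 - prop[j]                     # add or drop one covariate
        new = log_marglik(prop) + log_model_prior(prop)
        if np.log(rng.uniform()) < new - cur:     # symmetric proposal
            gamma, cur = prop, new
        draws.append(gamma.copy())
    return np.array(draws)
```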