Bayesian model selection for generalized linear mixed models

We propose a Bayesian model selection approach for generalized linear mixed models (GLMMs). We consider covariance structures for the random effects that are widely used in areas such as longitudinal studies, genome‐wide association studies, and spatial statistics. Since the random effects cannot be integrated out of GLMMs analytically, we approximate the integrated likelihood function using a pseudo‐likelihood approach. Our Bayesian approach assumes a flat prior for the fixed effects and includes both approximate reference prior and half‐Cauchy prior choices for the variances of random effects. Since the flat prior on the fixed effects is improper, we develop a fractional Bayes factor approach to obtain posterior probabilities of the several competing models. Simulation studies with Poisson GLMMs with spatial random effects and overdispersion random effects show that our approach performs favorably when compared to widely used competing Bayesian methods including deviance information criterion and Watanabe–Akaike information criterion. We illustrate the usefulness and flexibility of our approach with three case studies including a Poisson longitudinal model, a Poisson spatial model, and a logistic mixed model. Our proposed approach is implemented in the R package GLMMselect that is available on CRAN.

GLMMs (Nouvellet et al., 2021;Tredennick et al., 2021).Even though the DIC is widely applicable, we show in a simulation study that the DIC has some undesirable behaviors when applied to GLMMs.To provide more reliable results, here we develop a novel Bayesian model selection approach for simultaneous selection of covariates and random effects for GLMMs.
Specifically, we focus on GLMMs where each random effect has a covariance matrix that is the product of an unknown variance component parameter and a known positive semi-definite symmetric matrix.The class of GLMMs we consider can be used for the analysis of spatial areal data (Banerjee et al., 2014;Clayton and Kaldor, 1987), genome-wide association studies (GWAS) (Williams et al., 2022), and longitudinal data (Breslow and Clayton, 1993;Xu et al., 2016).However, inference for GLMMs is difficult because the integrated likelihood function is not available in closed form.To deal with the issue of integration of random effects, we approximate the integrated likelihood function using a pseudo-likelihood approach (Wolfinger and O'Connell, 1993) that leads to a Gaussian likelihood approximation.We then assign a flat prior for the vector of regression coefficients and an approximate reference prior (Ferreira et al., 2021) for the variance components of the GLMMs, which is inspired by the reference prior proposed by Keefe et al. (2019) for Gaussian data.In addition, we also consider a half-Cauchy prior for the square root of variance components (Gelman, 2006;Polson & Scott, 2012).Because the prior on the vector of regression coefficients is improper, we develop a fractional Bayes factor (FBF) approach (O'Hagan, 1995).We note that Porter et al. (2023) have proposed FBF for Gaussian mixed models for the particular case of spatial areal data.In contrast, here we consider GLMMs.In addition, we consider not only spatial random effects but also many other types of random effects such as overdispersion random effects and longitudinal random effects.Because we use default priors combined with FBF, our proposed model selection approach is fully automatic, which obviates the need for subjective specification of hyperparameters and makes the method more accessible for practitioners.We call our two proposed model selection approaches the approximate reference method (ARM) and the half-Cauchy method (HCM).
To compare the performance of our methods ARM and HCM to the performance of the DIC, the Watanabe-Akaike information Criterion (WAIC) (Watanabe, 2010), and marginal likelihood computed by Integrated Nested Laplace Approximation (INLA) under different parameter settings, we present a simulation study based on Poisson GLMMs with a spatial random effect and an overdispersion random effect.In this simulation study, we vary the sample size, coefficient of non-null covariates, level of spatial dependence, and overdispersion level.The simulation study shows that DIC and WAIC cannot reliably distinguish the random effect when there is another random effect.In contrast, our methods ARM and HCM perform well at detecting covariates and correct dependence structure.In particular, ARM and HCM always correctly detect the case of no random effects.Finally, while the performances of the DIC and WAIC do not improve much with large sample sizes, our proposed ARM and HCM have large improvement with increasing sample size.In addition, the simulation study shows that marginal likelihood computed by INLA has similar performance to our methods ARM and HCM when selecting covariates.However, marginal likelihood computed by INLA does not perform well when selecting random effects.
Apart from the DIC, WAIC, and marginal likelihood, there are not many other Bayesian model selection approaches for GLMMs.One such approach proposed by Cai and Dunson (2006) for simultaneously selecting fixed and random effects in GLMMs assumes that the subjectspecific random effects have a covariance matrix with all its elements being free parameters to be estimated.As a consequence, the method proposed by Cai and Dunson (2006) is only applicable to problems with replications and cannot be readily applied to problems where the vector of observations is a realization from a structured multivariate distribution such as GWAS data and spatial areal data.In contrast, because we assume that each random effect has a covariance matrix that is the product of an unknown variance component parameter with a known positive semi-definite covariance matrix, our methods ARM and HCM can be applied to longitudinal data, GWAS data, and spatial areal data.
The remainder of this paper is organized as follows.Section 2 describes the GLMMs that we consider.Section 3 outlines how the pseudo-likelihood approach approximates GLMMs for non-Gaussian data by computing adjusted observations that are modeled using Gaussian LMMs.Section 4 introduces priors for model selection, the FBF approach for dealing with improper priors, and posterior computation.Section 5 presents the results of a simulation study.Section 6 illustrates our method with applications to two case studies.Section 7 concludes with a discussion and future directions.
The online supporting information contains details about the pseudo-likelihood method (Web Appendix A), additional tables for the case studies (Web Appendix B), one additional case study (Web Appendix C), several additional simulation studies (Web Appendix D), and additional figures (Web Appendix E).

GENERALIZED LINEAR MIXED MODELS
Consider a response vector    = ( 1 ,  2 , … ,   ) ⊤ of  observations.Let    be an  by  matrix of explanatory variables and    be the corresponding -dimensional vector of fixed effects.Let     be an  by   design matrix and     be the corresponding   -dimensional vector of random effects,  = 1, … , .Let vectors     and     be the th rows of    and     , respectively.Conditional on linear predictors  1 , … ,   , the observations  1 , … ,   are independent with probability density function belonging to the exponential family, that is (  |  ) = exp[    −   (  ) +   (  )],  = 1, … , , where the canonical parameter   is modeled as a linear function of fixed and random effects as   =    ⊤     + ∑     ⊤      .Each observation   has mean   =  ′  (  ) and variance   =  ′′  (  ).In addition, we assume that each vector of random effects     has a multivariate normal distribution with mean vector  and covariance matrix   Σ Σ Σ  , where the variance component parameter   is unknown and Σ Σ Σ  is a known symmetric positive semi-definite matrix.For example, if    is a vector of overdispersion random effects then the corresponding matrix Σ Σ Σ is an identity matrix.
As another example, in the case of spatial areal data, we assume that    is a vector of spatial random effects that follows a sum-zero constrained Gaussian intrinsic conditional autoregressive model (Keefe et al., 2018(Keefe et al., , 2019)), that is, where Σ Σ Σ is a known positive semi-definite covariance matrix that depends on the neighborhood structure of the spatial subregions.Specifically, an adjacency matrix    is defined such that if subregions  and  are adjacent, the entries in cells (i, j) and (j, i) are 1, otherwise 0. Let     be a diagonal matrix with each diagonal element equal to the summation of the corresponding row of   .Then, the covariance matrix Σ Σ Σ is the Moore-Penrose inverse of     −    (Keefe et al., 2018(Keefe et al., , 2019)).We note that computations for this model may be performed using the precision matrix.In addition, we note that the knowledge about the covariance matrix Σ Σ Σ has allowed, for the case of Gaussian hierarchical models with ICAR random effects, the derivation of a reference prior for the parameters (Keefe et al., 2019), and formal Bayesian model selection (Porter et al., 2023).

PSEUDO LIKELIHOOD FUNCTION FOR GENERALIZED LINEAR MIXED MODELS
A key step in Bayesian model selection is to integrate out random effects from the likelihood function.However, while for LMMs the random effects can be integrated out analytically, for GLMMs that is not possible.To overcome this difficulty, here we use a pseudo-likelihood approach that approximates a GLMM for non-Gaussian data by computing adjusted observations that are modeled using an approximate Gaussian LMM.
Let ) In Equation (2), the random effects    cannot be integrated out analytically.Our method approximates the integral in Equation (2) with a Gaussian LMM via a pseudo-likelihood approach.For a Gaussian LMM, the corresponding integral can be solved analytically, and then the likelihood function of parameters has an analytic expression.
The pseudo-likelihood approach was first proposed by Wolfinger and O'Connell (1993).The pseudo-likelihood approach is an iterative procedure that starts by writing the model as    =    +   , where    = ( 1 , … ,   ) ′ and    is a vector of errors with Cov(  ) =    = diag( 1 , … ,   ).Let α α α, β β β, μ μ μ, and V   be the current estimates of   ,   ,   , and   .Here, β β β is initialized at the estimate from a GLM fit.Now, approximate   with a first-order Taylor expansion around    = α α α and    = β β β.Rearrange all the terms in    =    +    such that the terms that depend on   , α α α, β β β, and μ μ μ appear on the left side of the equation and the remaining terms appear on the right side of the equation.Multiply both sides by V   −1 . As a result, the left side of the equation will (   − μ μ μ) +    β β β + ∑      α α α .The vector    ⋆ is known as the vector of pseudo-observations or the vector of adjusted observations.Equating    ⋆ to the right side of the equation, we obtain the following model for the adjusted observations: ∼ (0 0 0,   ). (3) Thus, the pseudo-likelihood approach assumes that    follows a Gaussian distribution with mean vector  and covariance matrix   .Substituting    with V   in Equation (3),    ⋆ can be approximately modeled with the LMM ).Therefore, we have the closed-form pseudo-likelihood function )−1 Further details about the pseudo-likelihood approach appear in Web Appendix A. To perform model selection, we first use the pseudo-likelihood function in Equation (4) in an iterative manner to estimate the parameters and to obtain adjusted observations    ⋆ .We then use these adjusted observations  ⋆  ⋆  ⋆ rather than the original observations    to perform model selection.

MODEL SELECTION
We perform model selection based on the pseudolikelihood function given in Equation (4).Similar to Ten Eyck and Cavanaugh (2018), we use the same vector of adjusted observations to compare all candidate models' posterior probabilities.Specifically, we compute the vector of adjusted observations using the full model with all candidate regressors and all candidate random effects.In addition, consider the model space  = {  ,  = 1, … , }, with C possible models.We assume that model . In Section 4.1, we specify the priors for model parameters.In Section 4.2, we specify the priors on the model space.Section 4.3 discusses approximation of the integral in Equation ( 5).In Section 4.4, we propose an FBF approach (Porter et al., 2023) to perform model selection with improper priors.

Priors for model parameters
We consider the approximate reference prior proposed by Ferreira et al. (2021) in the context of LMMs for    and the reciprocal of , which is based on the reference prior proposed by Keefe et al. (2019).In what follows, we consider the implied reference prior for  obtained by transformation of variables.For simple notation, let  without subscript represent a general model,    represent the corresponding vector of regressor coefficients, and  represent the variance component.In the reference prior (Keefe et al., 2019), all the parameters are independent.The vector of regression coefficients    is assigned a uniform prior on   .In addition, as  goes to infinity the reference prior () is proportional to  −2 .Further, as  goes to 0 the reference prior is proportional to a constant.Based on the tail behavior of the reference prior for , Ferreira et al. (2021) proposed the approximate reference prior , where   is a hyperparameter.We set   equal to 2. The choice of   = 2 is equivalent to the choice made by Ferreira et al. (2021) for Gaussian data.In addition, our simulation study shows that this choice also works well for GLMMs.Hence, for    we use the flat prior (  |) ∝ 1, and for  we use the approximate reference prior This approximate reference prior is related to the half-Cauchy prior , which has the same tail behavior.Gelman (2006)   has more mass near zero and more mass for large values of  than the approximate reference prior for  given in Equation ( 6).Here, we consider two variants of our pseudo-likelihood-based method: ARM, which uses the approximate reference prior given in Equation ( 6); and HCM, which uses the half-Cauchy prior for √ .We compare our methods ARM and HCM to the DIC and WAIC in the simulation studies presented in Section 5.

Priors on the model space
Let K denote the number of candidate covariates and Q denote the number of candidate random effects types.For example, in an application where we may have spatial random effects and/or overdispersion random effects,  = 2.
In addition, let   denote the number of covariates in Model   .For fixed effects, we use priors from Scott and Berger (2010), which automatically correct for multiplicity.Specifically, the prior probability for model With respect to random effects, there are 2  possibilities for inclusion and exclusion of random effects.Assuming that each random effect has 0.5 prior inclusion probability, the prior probability for Model   with   types of random effects is (  with   types of random effects) = 1∕2  .Because usually in practice the number of candidate random effects types Q is small, a discrete uniform prior for the inclusion of random effects is reasonable.Assuming a priori independence of inclusion of fixed effects and random effects, the prior probability for model

Integrated likelihood methods
After the priors for parameters have been defined, the integrated likelihood given in Equation ( 5) based on the adjusted observations )−1 The vector of regression coefficients     can be integrated out analytically.After integrating out     , we can write the integrated likelihood as where . Note that the vector of variance components     cannot be integrated out analytically.To compute the integral in Equation ( 7), we first perform a logarithm transformation on     .Let     = log(    ) be the vector obtained by applying the logarithm transformation to each element of     .Then, we integrate out     using a Laplace approximation to obtain where is the point that minimizes (    ), and  ′′ (    ) is the Hessian matrix.

Fractional Bayes factors
In order to obtain the posterior model probabilities of interest, we use an FBF approach.The FBF is a modification of the Bayes factor that allows for improper priors on parameters (O'Hagan, 1995).
To define the usual Bayes factor, let the baseline model   be the model with the largest integrated likelihood in the model space.Then, the Bayes factor BF  of model   versus the baseline model   is defined as the ratio of their integrated likelihoods BF  = ( ⋆  ⋆  ⋆ |  )∕( ⋆  ⋆  ⋆ |  ).Hence, we can compute the posterior probability of model   as proportional to its prior probability times its Bayes factor versus the baseline model, that is, . Note that the prior on the regression coefficients (    |  ) ∝  is improper, where  is an arbitrary constant.Thus, the Bayes factor computed with the integrated likelihood from Equations ( 7) and ( 8) is only defined up to an unspecified constant of proportionality and cannot be used to compare models directly.
To solve this problem, we use the FBF (O'Hagan 1995) to approximate the Bayes factor.Porter et al. (2023) developed the FBF method for Gaussian hierarchical models with ICAR random effects.We use the FBF approach to train the improper prior so that we can compute a meaningful Bayes factor.By training the improper prior, we mean using Bayes theorem to combine the improper prior with a fraction of the likelihood to obtain a proper distribution (O'Hagan, 1995;Porter et al., 2023).We can then use this latter distribution as a trained prior to compute a meaningful Bayes factor.Specifically, here we train the prior with a fraction  of the likelihood function.(1995), the resulting integrated likelihood of model   , called the fractional integrated likelihood, is equal to (9) The size of the training fraction  should be chosen carefully.When  is too small, the denominator in Equation (9) may diverge.If  is too large, a substantial part of the integrated likelihood is used to train the prior on the parameters, and then the remaining information in the integrated likelihood used to update the prior model probabilities will lead to less distinctive posterior model probabilities.Here, we consider a training fraction size equal to  = ∕, where  is the equivalent training size.
To guide the choice of  in our considered GLMM context, we use the fact that for LMMs with the reference prior proposed by Keefe et al. (2019) the minimal value of  that guarantees propriety of the fractional integrated likelihood is  + 1 (Porter et al., 2023).In particular, in all the GLMM applications we present in Section 6, the training fraction  = ( + 1)∕ yields well-defined Bayes factors.
Then, the FBF of model   versus model   is defined . Next, we compute the posterior probability of model   as proportional to the FBF, BF   , times the prior probability of model   , that is,

SIMULATION STUDY
To investigate the performance of our proposed model selection methods ARM and HCM when compared to the widely used DIC, WAIC, and marginal likelihood computed by INLA, we perform a simulation study for different combinations of parameter settings.Here, we present results for Poisson GLMMs.In the Web Appendix C, we present results for Bernoulli GLMMs.For each combination of parameter settings, we generate 100 datasets.We simulate samples on regular square grids and consider three sample sizes,  = 100, 400, and 900.Each sample may have spatial dependence based on a firstorder neighborhood structure modeled with a vector of spatial random effects    1 following the ICAR distribution given in Equation ( 1).For the variance component  1 of the spatial random effects, we consider values 0, 0.03, 0.05, 0.1, or 1, where  1 = 0 implies no spatial dependence.We also consider the possibility of overdispersion random effect    2 in the model.We set the variance component  2 of the overdispersion random effect to 0, 0.05, 0.1, 0.5, or 1, where  2 = 0 implies no overdispersion.We consider four candidate covariates  1 ,  2 ,  3 , and  4 sampled from a standard normal distribution.We assume that    = ( 0 ,  1 ,  2 , 0, 0) ⊤ , thus the last two covariates are not in the true model.Here,  0 is the intercept, with values equal to 1, 2, or 4. We let  1 =  2 with values 0, 0.1, 0.2, 0.3, 0.5, or 1.When  1 and  2 are both equal to 0, there is no covariate in the true model.Conditionally independent Poisson observations   are generated with the GLMM   |  ind ∼ Poisson(  ),  = 1, … , , with log   =  0 +  1  1 +  2  2 +  3  3 +  4  4 +  1 +  2 , spatial random effects    1 ∼ (0 0 0,  1 Σ Σ Σ), and overdispersion random effects    2 ∼ (0 0 0,  2   ).
For each parameter setting, there are  = 64 candidate models in total.Specifically, there are 2 4 possible combinations of fixed effects.In addition, there are four possible combinations of random effects types, one with both spatial random effects and overdispersion random effects, one with only spatial random effects, one with only overdispersion random effects, and one without any random effects.We calculate posterior model probabilities for all 64 models, and we compute posterior inclusion probabilities for each of the four covariates, for the spatial random effect, and for the overdispersion random effect.
We compare our model selection methods ARM and HCM to the DIC, the WAIC, and marginal likelihood computed by the R package INLA (Rue et al., 2009).For the ARM and HCM, we decide to include a component in the selected model if the posterior inclusion probability of that component is larger or equal to 0.5, that is, if that component is in the median probability model (Barbieri & Berger, 2004).For the criteria computed by INLA, we select the model with the lowest DIC and WAIC values, and the highest marginal likelihood, respectively.For the three criteria computed by INLA, we consider the INLA default prior specification as well as our proposed Approximate reference (AR) prior and HC prior.
Because currently the most widely used criteria for Bayesian selection of GLMMs are the DIC and WAIC computed with INLA default priors, here we compare these criteria with our ARM and HCM.We present a comparison of our methods ARM and HCM to DIC and WAIC computed using our AR and Half Cauchy (HC) priors in Section D4 of Web Appendix D. The conclusions are similar to those for DIC and WAIC computed with default INLA priors presented here.Figure 1 presents the probability of each competing method selecting the correct covariate structure as a function of the value of their regression coefficients  1 =  2 .Here, there are spatial random effects with  1 = 0.05 and overdispersion random effects with  2 = 0.05.Three sample sizes are considered:  = 100, 400, 900.Two values for the intercept are considered:  0 = 1 and 4. Figure 1 shows that the ARM and HCM perform much better than the DIC and the WAIC computed with INLA's default priors .For example, in the most challenging case considered with  = 100 and  0 = 1, the ARM and HCM have a higher probability than the DIC and WAIC of selecting the correct covariate structure when their regression coefficients  1 and  2 are zero.In addition, as the value of  1 =  2 increases, the probability of the ARM and HCM to correctly select the true non-null covariates  1 and  2 increases more quickly than that of the DIC and the WAIC.Finally, the probability of ARM and HCM to correctly select the two non-null regressors increases much closer to 1 than those of the DIC and WAIC as the sample size increases and as the intercept value increases.As the sample size increases, the probability of ARM and HCM detecting covariates with small coefficients increases substantially.For example, the left panels of Figure 1 show that when the intercept is equal to 1, the probabilities of our proposed methods choosing the correct covariates structure when the coefficient is equal to 0.1 are about 10%, 60%, and 90% for sample sizes 100, 400, and 900, respectively.
Figure 2 investigates the impact of different magnitudes of the variance components on the probability of selecting the correct covariate structure as a function of the value of the regression coefficients  1 =  2 .Figure 2a,b presents settings with small ( 1 = 0.01 and  2 = 0) and large ( 1 = 1 and  2 = 1) variance components, respectively.In both panels, the sample size is  = 400 and the intercept is  0 = 1.In the small variance components setting, ARM and HCM perform comparably to the DIC and WAIC for small values of  1 =  2 , but our methods ARM and HCM greatly outperform the DIC and WAIC for moderate to large values of  1 =  2 .Meanwhile, in the more challenging large variance components setting, when  1 =  2 = 0, our ARM and HCM correctly select the model with no regressor in the model for 100% of the simulated datasets samples.In contrast, when  1 =  2 = 0, the DIC and WAIC select the wrong covariate structure for 20% of the simulated datasets, respectively.Finally, as the magnitude of  1 =  2 increases, in comparison to the DIC and WAIC, ARM and HCM achieve much higher probabilities of selecting the correct model.
Figure 3 presents the probability of selecting correct spatial random effects structure as a function of the value of the variance component for the spatial random effects.Results are shown for sample sizes  = 100, 400, and 900, and variance of overdispersion random effects  2 = 0 and 0.1.Figure 3 shows that the DIC and WAIC have low probability of selecting the model with no spatial random effects when the true model does not have spatial random effects.In addition, this performance does not improve much as the sample size increases from 400 to 900.In contrast, our methods ARM and HCM have large probabilities of selecting the correct spatial random effects structure when the true model does not have spatial random effects, and have large probabilities of selecting spatial random effects when the variance component for the spatial random effects is large.Finally, the performance of ARM and HCM at correctly detecting spatial dependence greatly improves as the sample size increases.
ARM, HCM, DIC, and WAIC's performance when selecting overdispersion random effects is similar to selecting spatial random effects.Web Figure S1 in the Supporting information presents the probability of selecting correct overdispersion structure as a function of the value of the variance for overdispersion.Web Figure S1 shows that the DIC and WAIC have low probability of selecting a model with no overdispersion random effects even when overdispersion is not present in the true model, and this undesirable performance does not improve much when the sample size increases.In contrast, our methods ARM and HCM have large probabilities of selecting correct overdispersion structure when overdispersion is not present in the true model, and have large probabilities of selecting overdispersion when the proportion of variance due to overdispersion is large.Finally, the performance of ARM and HCM at correctly detecting overdispersion greatly improves as the sample size increases.
Web methods at selecting covariates when coefficients of covariates are small.INLA marginal likelihood with INLA's default prior or INLA marginal likelihood with our proposed priors are better than our methods ARM and HCM when the regression coefficient is large.For spatial random effects inclusion, Web Figure S13 shows that INLA marginal likelihood with any of the considered priors has difficulties to detect spatial random effects.For overdispersion random effects, Web Figure S14 shows that when there is no spatial random effects in the model, INLA marginal likelihood can correctly select overdispersion random effects.However, when there are spatial random effects in the model, marginal likelihood computed by INLA cannot correctly select overdispersion random effects.In summary, INLA marginal likelihood with our proposed priors works well for selection of regressors but does not work well for the selection of random effects.Meanwhile, our ARM and HCM methods work well in both cases.Strong dependence structure.Dependence structure can affect our method's performance for detecting covariates with small coefficients.However, the DIC and WAIC have difficulty detecting covariates even with large coefficients when spatial dependence is strong.

Longitudinal epilepsy seizure data
We analyze a dataset on epilepsy seizures previously analyzed by Thall and Vail (1990), Breslow and Clayton (1993), and others.The data were collected in four biweekly visits of 59 epileptics during a clinical trial to evaluate the effectiveness of a drug to control seizures (Leppik et al., 1987).The response variable is the number of seizures   for patient  on visit .The most general model we consider is   |  ind ∼ Poisson(  ), with log(  ) =    ⊤     +  1 +    2 +  3 ,   1 ∼ (0 0 0,  1  59 ),   2 ∼ (0 0 0,  2  59 ), and    3 ∼ (0 0 0,  3  236 ),  = 1, … , 59 and  = 1, … , 4, where     denotes a six-dimensional vector with a one for intercept and five covariates.The 59 subjects were randomly assigned to a new drug or a placebo.The first covariate is the treatment indicator (Trt), where Trt=1 indicates that the patient received the treatment and Trt=0 indicates that the patient received the placebo.The second covariate is the baseline level of seizures (base), equal to the logarithm of the average number of epileptic seizures per two weeks recorded in the 8-week period before the treatment.The third covariate is the interaction term of Base and Trt.The fourth covariate is the logarithm of age (Age).And the fifth covariate is the visit number (Visit), with the four visits coded as −3, −1, 1, and 3. Breslow and Clayton (1993) mentioned that preliminary analysis indicated that the counts were substantially lower during the fourth visit.Thus, they also define a binary variable V4, such that V4=1 indicates the fourth visit and V4=0 indicates the other visits.In the model above,    is the vector of regression coefficients,    1 = ( 11 , … ,  1 59 ) is the vector of patient-specific random effects,    2 = ( 21 , … ,  2 59 ) is the vector of patient-specific random effects for the slope of the variable Visit with    = (−0.3,−0.1, 0.1, 0.3), and    3 = ( 311 , … ,  3 59 1 ,  312 , … ,  3 59 2 , … ,  314 , … ,  3 59 4 ) is the vector of overdispersion random effects.
The covariates Trt, Base, Age, and Visit can be included in the model independently.However, the interaction term between Trt and Base is only included when both Trt and Base are in the model.Thus, there are 20 possible combinations of covariates.For the dependence structure, we follow the four cases considered by Breslow and Clayton (1993): no random effects in the model; only patientspecific random effects    1 ;    1 and overdispersion random effects    3 ;    1 and patient-specific random effects for the slope of the variable Visit    2 .Finally, we assume that the vectors of random effects    1 ,    2 , and    3 are independent.Therefore, the model space has 80 models, composed by 20 combinations of covariates and 4 possible settings of random effects.
Table 1 presents the posterior inclusion probabilities of the fixed and random effects.Both the ARM and the HCM F I G U R E 3 Probability of selecting the correct spatial random effects structure as a function of the value of variance component for spatial random effects.Settings:  0 = 2,  1 =  2 = 1, n=100 (top row), n=400 (middle row), n=900 (bottom row), and  2 = 0.1 (left column),  2 = 0 (right column).If the spatial variance proportion is zero then there is no vector of spatial random effects in the model, and the correct decision is to not select the vector of spatial random effects.
indicate that the baseline level of seizures (Base) should be included in the model.However, the posterior inclusion probabilities do not provide support for any of the other covariates.Further, both ARM and HCM strongly indicate that    2 , the patient-specific random effects for the slope of the variable Visit should not be included in the model.Finally, both ARM and HCM strongly indicate the need to include the patient-specific random effect    1 and overdispersion random effect    3 .
Web Table S1 in the Supporting information presents a summary of the model selection results for the epilepsy data by comparing methods ARM, HCM, DIC, and WAIC.
A check mark appears next to the effects (rows) selected by each method (column).In addition, Web Table S1 presents the selection of fixed effects and variance components based on estimates and standard errors reported by Breslow and Clayton (1993) for two models fitted with PQL, which we denote by PQL1 and PQL2.Web Table S2 presents estimates and standard errors for the parameters based on the full model.Model PQL1 includes random effects    1 and    2 , while Model PQL2 includes random effects    1 and    3 .Interestingly, while the original PQL method cannot choose between Model PQL1 or Model PQL2, our ARM and HCM clearly show that the data support exclusion of random effect    2 and inclusion of random effects    1 and    3 .Further, the DIC and WAIC agree with the ARM and HCM and also choose random effects    1 and    3 .Furthermore, in terms of fixed effects the DIC and WAIC are the least parsimonious, choosing Base, Trt and Trt×Base, while PQL chooses Base and Trt.Finally, the ARM and HCM are the most parsimonious and choose only the Base covariate.
In addition to selecting more parsimonious models, our ARM and HCM provide more definitive support for the inclusion or exclusion of each effect in the form of Bayesian posterior probabilities.For example, the posterior inclusion probabilities of the patient-level random effects    1 , overdispersion random effects    3 , and the covariate Base are all equal to 1. Further, there is a lot less support for the covariate V4, which has posterior inclusion probability of 0.12 by the ARM and 0.11 by the HCM.Furthermore, both ARM and HCM provide posterior inclusion probability equal to zero for the interaction between Trt and Base.Finally, the simulation study presented in Section 5 shows that we can rely on the uncertainty quantification provided by the ARM and HCM.

Spatial lip cancer data
In this section, we present an analysis of the Scottish lip cancer dataset previously analyzed by Clayton and Kaldor (1987), Breslow and Clayton (1993), Ferreira and De Oliveira (2007), among many others.,   1 ∼ (0 0 0,  1 Σ Σ Σ), and    2 ∼ (0 0 0,  2    56 ),  = 1, … , 56, where   is the expected number of lip cancer cases in the th county, calculated based on the age distributions by counties.In this analysis, the   's are assumed to be known constants.In addition, the vector     is a two-dimensional vector with one as the first element and AFF for the th county as the second element.Further,    1 is a vector of spatial random effects following a sum-zero constrained Gaussian intrinsic conditional autoregressive model (Keefe et al., 2018) and modeled by Equation (1).Finally,    2 is a vector of overdispersion random effects.There are two possible combinations for the fixed effects: with or without the covariate AFF.For the random effects, we follow the options considered by Breslow and Clayton (1993).When    1 and    2 are in the model at the same time, the PQL estimate of the overdispersion variance  2 is 0. Thus, we consider models with only three random effects combinations: spatial random effects    1 , overdispersion random effects    2 , and no random effects.
Table 2 presents the posterior inclusion probabilities for the fixed and random effects.Both the ARM and HCM select the model with the covariate AFF and spatial random effect    1 .Web Table S3 in the Supporting information presents the DIC and WAIC for the six models considered.In contrast to the results of the ARM and HCM, DIC and WAIC select the model without the covariate AFF but with spatial random effect    1 .Web Table S4 in the Supporting information summarizes model selection results for the ARM, HCM, DIC, WAIC, as well as the selection of model components based on PQL methods reported by Breslow and Clayton (1993) for two models: PQL1 includes    1 and PQL2 includes    2 .Results from PQL for the AFF regressor agree with the results from the HCM and ARM.An advantage of the HCM and ARM over PQL is that they clearly indicate that the model should include a spatial random effect and not include overdispersion.

DISCUSSION
We have proposed a novel Bayesian method for model selection for GLMMs.Our approach is based on a pseudolikelihood approximation of GLMMs by LMMs leading to a closed form solution for integrating out the random effects.We consider two priors for the model parameters.First, we use an approximate reference prior that is uniform for the fixed effects and has the tail behavior of the half-Cauchy prior for the variance parameters.Second, while keeping the improper flat prior for the fixed effects, we consider the half-Cauchy prior for the square root of the variance parameters (Gelman, 2006;Polson & Scott, 2012).Finally, to deal with the prior impropriety we have developed a FBF approach.
The simulation study has shown that our proposed methods ARM and HCM perform well for correctly selecting both covariates and dependence structure.ARM and HCM assign high posterior inclusion probability to covariates with large coefficients and also high posterior inclusion probability to random effects with large variance components.In particular, ARM and HCM are better than DIC and WAIC at correctly selecting covariates.In cases where random effects have large variances, the ability of DIC and WAIC to correctly select covariates is tremendously reduced.In contrast, ARM and HCM do not suffer as badly when selecting covariates in the presence of random effects with large variances.In addition, DIC and WAIC have high false positive rates and often select null fixed and random effects.In contrast, ARM and HCM correctly assign low posterior inclusion probability to null covariates and to null random effects.We also compared our methods with marginal likelihood computed by INLA.Our results show that when we use INLA with our priors instead of the default INLA priors, the marginal likelihood computed by INLA and the marginal likelihood computed by our pseudo-likelihood approach work similarly for the selection of regression coefficients.However, the marginal likelihood computed by INLA does not work well for the selection of spatial random effects and overdispersion random effects.Therefore, it seems that our pseudo-likelihood approximation works better than the INLA approximation to the marginal likelihood for the selection of random effects.
We illustrate the application of our proposed methods ARM and HCM with three case studies.In the first case study, we consider epilepsy seizures as a type of longitudinal count data.ARM and HCM are more parsimonious, selecting baseline covariate, patient-level random effects, and overdispersion random effects.DIC and WAIC select two more covariates: treatment and interaction term between baseline and treatment.In the second case study, we study Scottish lip cancer data as a type of spatial count data.Our methods ARM and HCM select spatial dependence and covariate AFF, whereas DIC and WAIC select the model without covariate AFF but include spatial random effects.In the third case study, presented in Web Appendix C, we look at binary salamander mating data.For fixed effects, our methods ARM and HCM select WSF and WSF×WSM, whereas DIC and WAIC select all three covariates.For random effects, our two methods ARM and HCM have totally different results than DIC and WAIC: while DIC and WAIC select male random effect, our methods ARM and HCM select female random effect.Given the results from the simulation study, we recommend the models selected by ARM and HCM.
There are many potential avenues for future research.One possible future research topic is the use of Bayesian model averaging for computing credible intervals for regression coefficients.This can be facilitated by the fact that our methods provide posterior probabilities for different models.Another promising research direction is the use of nonlocal priors for the fixed effects.Finally, another possible research topic is model selection for GLMMs when the number of possible regressors is much larger than the sample size.We are currently working on the latter two research topics and will report the results in a future paper.

A C K N O W L E D G M E N T S
The work of Ferreira and Xu was supported in part by National Science Foundation awards DMS 1853549 and DMS 2054173.Computations for this paper have been performed on supercomputers of Advanced Research Computing at Virginia Tech.The authors would like to thank the associate editor and an anonymous referee for their insightful comments that helped substantially improve this paper.

D ATA AVA I L A B I L I T Y S TAT E M E N T
The datasets analyzed in this paper are available in the R package mdhglm (Lee et al., 2018).

R E F E R E N C E S
proposed a half-Cauchy prior, however, for the standard deviation of random effects in a two-level Gaussian model.Assuming a half-Cauchy prior for the square root of the variance component parameter  implies for  the prior density  2 ()  → ∞.Hence, the half-Cauchy prior for √ The trained prior density for model   is obtained by Bayes theorem as   (    ,     ) =   ( ⋆  ⋆  ⋆ |    ,     )(    ,     |  )∕ ∫   ( ⋆  ⋆  ⋆ |    ,     )(    ,     |  )        .The integrated likelihood is then computed as an integral of the product of the likelihood function raised to 1 −  and the trained prior.Following O'Hagan Figures S12-S14 present a comparison of the performance of INLA marginal likelihood with our ARM and HCM methods.Web Figure S12 shows that INLA marginal likelihood with INLA's default priors is worse than our F I G U R E 1 Probability of selecting the correct covariate structure as a function of the value of the regression coefficient, settings:  1 = 0.05,  2 = 0.05, n=100 (top row), n=400 (middle row), n=900 (bottom row), and  0 = 1 (left column),  0 = 4 (right column).
Epilepsy data: posterior inclusion probabilities of fixed and random effects.
TA B L E 1 Lip cancer data: posterior inclusion probabilities of fixed and random effects.