## 1. Introduction

Many studies collect data that have hierarchical or clustered structures. Examples include randomized studies in which patients are clustered within practices, educational studies in which students are clustered in schools or environmental studies in which individuals are clustered in homes clustered in counties. An analysis that ignores such clustering assumes that all observations are independent, resulting in incorrect model-based standard errors that can lead to misleading scientific inferences. Multilevel models are used to account for the correlation of observations within a given group by incorporating group-specific random effects. These random effects can be nested (e.g. repeated observations of students nested in schools, with random effects at the student and school levels), cross-nested (e.g. repeated observations of students nested in high schools taking different courses, with random effects at the student, school and course levels) or even non-nested (e.g. individuals clustered within job categories and states, with random effects at the job and state level). For an introduction to multilevel models, see Gelman and Hill (2007), Fitzmaurice *et al.* (2004), Sullivan *et al.* (1999) and Bryk and Raudenbush (1992).

### 1.1. Motivating data

Birth records were obtained for all live births in New York City (NYC) in 2003 and linked to the hospital discharge data from the statewide planning and research co-operative system by the New York State Department of Health. These data include information on mother's demographic characteristics, previous births, smoking, rate of weight gain during pregnancy, maternal birth outside the USA and the infant's gender, birth weight and gestational age (Savitz *et al.*, 2008), all collected from the birth certificate. These data were also linked to US census data to obtain additional demographic information at the census tract level. Investigators are interested in identifying significant predictors of birth weight among term births adjusting for gestational age, with particular emphasis on exploring disparities that are related to race and ethnicity.

Research has shown a persistent racial disparity in birth outcomes in many countries (e.g. Osypuk and Acevedo-Garcia (2008) and Kelly *et al.* (2009)). Although individual and community level covariates have been shown to account for some of the racial disparity in low birth weight (Buka *et al.*, 2003; Roberts, 1997; Rauh *et al.*, 2001; O'Campo *et al.*, 1997), much of this excess risk remains unexplained. Howard *et al.* (2006) found substantial variability in the risk of preterm birth and low birth weight among black race subgroups that were defined by eight distinct maternal ancestries (African, American, Asian, Cuban, European, Puerto Rican, South and Central American, and West Indian and Brazilian). They also found nativity (US or foreign born) to be a significant predictor that varied by ancestry. Additionally, 48 of 67 (72%) studies that were reviewed by Gagnon *et al.* (2009) found differences in birth weight outcomes between migrants and natives in western industrialized countries. These studies have been limited by coarse ethnic categorization that obscures substantial within-group heterogeneity in behavioural, psychosocial and environmental exposures. Many data sets are also limited to the crude socio-economic indicators on the birth certificate, such as mother's completed years of education.

To expand on this research, investigators in the NYC birth study classified maternal ancestry into 62 country regions to determine whether variability in birth weight across ancestries exists within smaller geographical regions, and whether potential ancestry effects are modified by the effects of race, maternal rate of weight gain and nativity. Such variability and potential patterns therein may help researchers to understand further the factors that are associated with racial disparities in birth outcomes. More specifically, the association with race may depend on maternal ancestry (for example the association with black race may depend on whether the mother has Nigerian or Jamaican ancestry), the association with ancestry may depend on nativity (for example the association with Nigerian ancestry may depend on whether the mother lived primarily in or outside the USA) and the association with maternal rate of weight gain may depend on maternal ancestry (e.g. whether a mother has Nigerian or Jamaican ancestry). Additionally, it is common for individuals with similar demographic characteristics to live close together, resulting in social as well as biological similarities between subjects. Hence, investigators are also interested in controlling for and assessing the effect of residential location as defined by census tract of residence. Neighbourhood factors, such as the neighbourhood deprivation index (NDI), which is a standardized score of various socio-economic factors at the tract level (in which higher scores represent higher levels of deprivation), may explain some racial disparities in birth outcomes.

We fitted a multilevel linear model for infants’ birth weight, predicted by infant gestational age, gender, maternal race, parity, smoking status, age, rate of weight gain, nativity and the NDI. The maternal rate of weight gain is defined as total gestational weight gain (pounds) divided by the length of each woman's pregnancy (weeks). We consider random effects that allow heterogeneity in birth weights across maternal ancestries and across census tract groups, as well as interactions between maternal ancestry and race, maternal rate of weight gain and nativity. To address the important question of whether heterogeneity exists within census tracts and maternal ancestries and whether potential heterogeneity across ancestries is affected by race, maternal rate of weight gain and nativity, we must be able to evaluate whether the variances of the random effects are different from 0.

### 1.2. Testing variance components

Testing whether the variance of a random coefficient is equal to 0 is problematic because the null hypothesis lies on the boundary of the parameter space. Such issues are addressed extensively in the literature in the context of two-level linear models, e.g. strategies that use a mixture of *χ*^{2}-distributions (Self and Liang, 1987; Stram and Lee, 1994), score tests (Lin, 1997; Commenges and Jacqmin-Gadda, 1997; Verbeke and Molenberghs, 2003; Molenberghs and Verbeke, 2007; Zhang and Lin, 2008), Wald tests (Molenberghs and Verbeke, 2007; Silvapulle, 1992) and generalized likelihood ratio tests (Crainiceanu and Ruppert, 2004). These methods are only proposed in the context of the two-level linear model although some may be generalized to further cases. In this paper, we use the term ‘two-level linear model’ to denote a class of linear models that accommodates two levels in the data hierarchy (e.g. repeated observations nested within subjects); a notable example of this model is the standard linear mixed model (see Laird and Ware (1982)) for repeated measures on subjects over time. The broader term ‘multilevel linear model’ is used to denote a class of linear models that can have more than two levels in the data hierarchy or more than one level of clustering (e.g. repeated observations nested within subjects nested in schools). Such clusters can be nested, non-nested or cross-nested with other clusters. The two-level linear model can then be viewed as a special case of the multilevel linear model.
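As a small illustration of the mixture-of-*χ*^{2} strategy mentioned above, the following sketch computes the *p*-value for a likelihood ratio test of a single variance component under an equal mixture of *χ*^{2}-distributions with 0 and 1 degrees of freedom, which is the reference distribution that is usually associated with this simplest case. The function names and the value of the test statistic are hypothetical and are not taken from the analysis in this paper.

```python
import math


def chi2_1_sf(x: float) -> float:
    """Upper tail probability P(X > x) for a chi-squared variable with 1 degree of freedom."""
    # A chi-squared(1) variable is a squared standard normal, so the tail
    # probability is P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


def mixture_pvalue(lrt: float) -> float:
    """p-value under a 50:50 mixture of chi-squared(0) and chi-squared(1)."""
    # chi-squared(0) is a point mass at 0, so it contributes to the p-value
    # only when the statistic itself is 0 (or slightly negative numerically).
    if lrt <= 0.0:
        return 1.0
    return 0.5 * chi2_1_sf(lrt)


lrt_statistic = 2.71  # hypothetical likelihood ratio statistic
naive_p = chi2_1_sf(lrt_statistic)         # ignores the boundary
mixture_p = mixture_pvalue(lrt_statistic)  # accounts for the boundary
print(f"naive p-value: {naive_p:.4f}, mixture p-value: {mixture_p:.4f}")
```

For a positive test statistic the mixture *p*-value is exactly half of the naive *χ*^{2}_{1} *p*-value, so ignoring the boundary makes the test conservative in this one-component setting. No comparably simple reference distribution is available when several variance components are tested jointly.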

Methods for testing variance components in the two-level linear model are useful to some extent in nested multilevel models for testing single variance components, but the null distributions are not easily obtained for testing multiple variance components, and random effects that are non-nested or cross-nested introduce additional complications. There is very little research specifically on testing variance components in multilevel models with more than two levels. Bryk and Raudenbush (1992) proposed a *χ*^{2}-test of the residuals for evaluating variance components in multilevel models and incorporated this test in the multilevel modelling software package HLM. Other approaches for nested models include various versions of the likelihood ratio test (Snijders and Bosker, 1999; Bliese, 2002; Hox, 2002), e.g. using a one-tailed significance level or using a mixture of *χ*^{2}-distributions. Berkhof and Snijders (2001) proposed three score tests for variance components in multilevel models and compared their methods via simulation with the likelihood ratio test, fixed *F*-test and Wald test. Their simulations considered only two-level models and it is not clear whether generalizations to a larger number of levels are possible. Goldstein (1986) proposed a simple algorithm for fitting a multilevel linear mixed effects model for variance components near the boundary but did not provide a method for testing whether or not a variance component is equal to 0. Fitzmaurice *et al.* (2007) proposed a permutation test for variance components in multilevel generalized linear mixed models. They applied their method to two-level generalized mixed models and suggested strategies for multilevel models with more than two levels. Their strategy cannot be directly applied to multilevel models with crossed random effects and can test only one variance component at a time.

Bayes factors, i.e. ratios of marginal likelihoods, which equal the posterior odds when the competing models have equal prior probabilities, provide alternatives to frequentist hypothesis testing (see Kass and Raftery (1995)). In multilevel modelling settings, Bayes factors are ideal for comparing various types of model (e.g. multiple random effects, cross-nested or non-nested random effects), but the marginal likelihoods typically involve high dimensional integrals and are not available in closed form. Hence one must rely on approximations to the Bayes factor. The most widely used approximation to the Bayes factor is based on the Laplace approximation (Tierney and Kadane, 1986), resulting in the Bayesian information criterion (Schwarz, 1978) under certain assumptions. These approximations perform poorly in high dimensions (Kass and Raftery, 1995) and hence have limited applicability in multilevel models. The Bayesian information criterion and Laplace approximations are based on the assumption that the dimension of the parameter vector remains fixed as the sample size goes to ∞. This is problematic in multilevel models because the number of parameters increases with the sample size. In addition, because the regularity conditions underlying the approximation are violated, the Laplace method can fail when the parameter lies on the boundary of the parameter space (Pauler *et al.*, 1999; Hsiao, 1997; Erkanli, 1994).
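For reference, the quantities discussed above can be written in generic form as follows. The notation is introduced here only for illustration and is an assumption rather than the notation that is used later in the paper: *y* denotes the data, *M*_{0} and *M*_{1} the competing models, *θ*_{k} the parameter vector of model *M*_{k} with dimension *d*_{k}, *π* the prior density, *n* the sample size, θ̃_{k} the posterior mode, Σ̃_{k} the inverse of the negative Hessian of log{*p*(*y* | *θ*_{k}, *M*_{k}) *π*(*θ*_{k} | *M*_{k})} evaluated at θ̃_{k}, and θ̂_{k} the maximum likelihood estimate.

```latex
\begin{align*}
B_{10} &= \frac{p(y \mid M_1)}{p(y \mid M_0)}, \qquad
p(y \mid M_k) = \int p(y \mid \theta_k, M_k)\, \pi(\theta_k \mid M_k)\, d\theta_k, \\
p(y \mid M_k) &\approx (2\pi)^{d_k/2}\, |\tilde{\Sigma}_k|^{1/2}\,
  p(y \mid \tilde{\theta}_k, M_k)\, \pi(\tilde{\theta}_k \mid M_k)
  && \text{(Laplace approximation)}, \\
\log p(y \mid M_k) &\approx \log p(y \mid \hat{\theta}_k, M_k) - \frac{d_k}{2} \log n
  && \text{(Schwarz approximation)}.
\end{align*}
```

Both approximations treat *d*_{k} as fixed while *n* grows and rely on the integrand being approximately Gaussian around an interior mode, which is why a growing number of random effects and a mode at or near the boundary are problematic.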

Markov chain Monte Carlo (MCMC) methods provide alternatives for approximating Bayes factors. Many of these methods can fail for certain types of ‘default’ priors on the variance components (Pauler *et al.*, 1999). Bayesian stochastic search variable selection methods using MCMC sampling in the two-level linear model (e.g. Cai and Dunson (2006) and Kinney and Dunson (2008)) may be generalizable to multilevel models, but these methods are generally computationally demanding. Many other MCMC methods exist for model comparisons, e.g. the logarithm of the pseudo-marginal-likelihood (Gelfand, 1996), the deviance information criterion (Spiegelhalter *et al.*, 2002) and other related methods for estimating marginal likelihoods (e.g. Chib and Jeliazkov (2001)). These methods generally require the fitting of each model being compared (i.e. MCMC samples from the posterior distribution for each model) and are computationally demanding in high dimensional models. In addition, even though MCMC methods would in principle yield an exact estimate of the Bayes factor if run for infinitely many iterations, in practice the algorithms can be run for only a finite number of iterations, and existing algorithms may require a very large number of iterations to obtain an accurate estimate. Hence, in practice MCMC-based estimates of Bayes factors are also approximate, and it is not clear that such estimates will in general be closer to the truth (given chains of the length that are typically run for practical reasons) than faster analytic approximations. In an attempt to develop a more efficient approximation to the Bayes factor, Saville and Herring (2009) proposed a method for approximating Bayes factors in the two-level linear model via a relatively simple Laplace approximation to the marginal likelihood. Their method does not require the fitting of a model via MCMC methods but applies only to the two-level linear model.

It is well known that Bayes factors can be sensitive to the choice of prior distributions (Kass and Raftery, 1995). This is challenging in model selection problems in which we have no prior information on the parameters. In these situations it is common to use default priors that do not require subjective inputs. One must choose these default priors with care because, as the prior variance increases, the Bayes factor will increasingly favour the null model (Bartlett, 1957). Berger and Pericchi (1996) discussed various procedures for default priors for model selection via Bayes factors. These include their proposed *intrinsic Bayes factors*, the Schwarz approximation (Schwarz, 1978) and the methods of Jeffreys (1961) and Smith and Spiegelhalter (1980). For improper non-informative priors, the Bayes factor involves an arbitrary constant and hence is not well defined (Spiegelhalter and Smith, 1982). Gelman (2006) discussed various approaches to default priors specifically for variance components. Common approaches include the uniform prior (e.g. Gelman (2007)), the half *t* family of prior distributions and the inverse gamma distribution (Spiegelhalter *et al.*, 2003). These prior distributions can encounter difficulties when the variance components are close to 0. Other discussions of selecting default priors on variance components were presented by Natarajan and Kass (2000), Browne and Draper (2006) and Kass and Natarajan (2006). As an alternative to these approaches, Saville and Herring (2009) scaled the random effects to the residual variance and introduced default priors that were shown to have good frequentist properties in the two-level linear model.
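The sensitivity to the prior variance that was noted by Bartlett (1957) can be seen in a deliberately simple setting that is unrelated to the multilevel model of this paper. The sketch below assumes *n* observations from a normal distribution with unknown mean *θ* and known unit variance, and compares the null model *θ* = 0 with an alternative in which *θ* has a normal prior with mean 0 and variance *τ*^{2}. The function name and the data summary are hypothetical.

```python
import math


def bf01_normal_mean(z: float, n: int, tau2: float) -> float:
    """Bayes factor in favour of H0: theta = 0 against H1: theta ~ N(0, tau2).

    The data are n observations from N(theta, 1), summarized by
    z = sqrt(n) * (sample mean).
    """
    # r is the prior variance relative to the sampling variance of the mean.
    r = n * tau2
    # Ratio of the N(0, 1/n) and N(0, tau2 + 1/n) densities at the sample mean.
    return math.sqrt(1.0 + r) * math.exp(-0.5 * z * z * r / (1.0 + r))


z, n = 2.5, 50  # hypothetical data summary: z-statistic and sample size
prior_variances = [1.0, 1e2, 1e4, 1e6]
bayes_factors = [bf01_normal_mean(z, n, t2) for t2 in prior_variances]
for t2, bf in zip(prior_variances, bayes_factors):
    print(f"tau2 = {t2:g}: BF01 = {bf:.3g}")
```

With the data held fixed, the Bayes factor in favour of the null model grows without bound as *τ*^{2} increases, even though the same data favour the alternative when *τ*^{2} is moderate. This is the reason why default priors for variance components cannot simply be made arbitrarily diffuse.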

As noted previously, testing hypotheses on the boundary is problematic for certain classical (i.e. frequentist) approaches because traditional asymptotic results do not apply directly (for example it becomes more difficult to approximate the *p*-value for a likelihood ratio test). In the Bayesian case, there are no conceptual problems with testing null hypotheses on the boundary of the parameter space, but the Laplace approximation to the marginal likelihood under the alternative can be inaccurate when the parameter lies close to the boundary. We generalize the approach of Saville and Herring (2009) for testing variance components via Bayes factors to multilevel linear models with more than two levels in the data hierarchy (i.e. more than one level of clustering). The method does not require MCMC samples from the posterior distribution or the fitting of each model being compared; hence it is computationally more efficient than many of the current Bayesian methods that are available for multilevel linear models. The strategy is to scale the random effects to the residual variance and to introduce parameters that control the relative contribution of the random effects. This scaling enables us to integrate the random effects and variance components out of the posterior in closed form, such that the remaining integrals that are needed to calculate the Bayes factor are of small dimension and can be efficiently approximated with Laplace's method. In addition, we improve the accuracy of the Laplace approximation by transforming the scale parameter so that the boundary lies at −∞ instead of 0. Our method is relatively fast to implement and can incorporate default prior distributions that have been shown to have good frequentist properties in the two-level linear model (Saville and Herring, 2009). We present the Bayesian model selection problem in Section 2, summarize our method for approximating the marginal likelihood in Section 3 and apply it to the NYC birth data in Section 4. A discussion follows in Section 5.
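To give some intuition for why moving the boundary from 0 to −∞ can help the Laplace approximation, the following sketch uses a toy one-dimensional integrand rather than the marginal likelihood of this paper. It assumes a logarithmic transformation, which is one natural way to send the boundary to −∞, and approximates the integral of *λ*^{*a*−1}exp(−*λ*) over *λ* > 0, whose exact value is the gamma function Γ(*a*). The function names and the values of *a* are hypothetical.

```python
import math


def laplace_original_scale(a: float) -> float:
    """Laplace approximation on the original scale (requires a > 1).

    Integrand: f(lam) = lam**(a - 1) * exp(-lam) on lam > 0.
    """
    mode = a - 1.0                        # maximizer of log f
    log_f = (a - 1.0) * math.log(mode) - mode
    neg_hessian = (a - 1.0) / mode ** 2   # minus the second derivative of log f
    return math.exp(log_f) * math.sqrt(2.0 * math.pi / neg_hessian)


def laplace_log_scale(a: float) -> float:
    """Laplace approximation after the substitution phi = log(lam).

    Integrand, including the Jacobian: g(phi) = exp(a * phi - exp(phi)).
    """
    mode = math.log(a)                    # maximizer of log g
    log_g = a * mode - a
    neg_hessian = a                       # exp(phi) evaluated at the mode
    return math.exp(log_g) * math.sqrt(2.0 * math.pi / neg_hessian)


for a in (1.1, 1.5, 3.0):
    exact = math.gamma(a)
    print(
        f"a = {a}: exact = {exact:.4f}, "
        f"original scale = {laplace_original_scale(a):.4f}, "
        f"log scale = {laplace_log_scale(a):.4f}"
    )
```

On the original scale the approximation deteriorates as the mode approaches the boundary at 0 (*a* close to 1), because the Gaussian approximation places substantial mass on negative values. On the logarithmic scale the integrand has no finite boundary and the approximation remains comparatively accurate for the same values of *a*.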