
Assessing variance components in multilevel linear models using approximate Bayes factors: a case-study of ethnic disparities in birth weight

Authors


Benjamin R. Saville, Department of Biostatistics, Vanderbilt University School of Medicine, S-2323 Medical Center North, 1161 21st Avenue South, Nashville, TN 37232-2158, USA.
E-mail: b.saville@vanderbilt.edu

Abstract

Summary. Racial or ethnic disparities in birth weight are a large source of differential morbidity and mortality worldwide and have remained largely unexplained in epidemiologic models. We assess the effect of maternal ancestry and census tract residence on infant birth weights in New York City and the modifying effects of race and nativity by incorporating random effects in a multilevel linear model. Evaluating the significance of these predictors involves a test of whether the variances of the random effects are equal to 0. This is problematic because the null hypothesis lies on the boundary of the parameter space. We generalize an approach for assessing random effects in the two-level linear model to a broader class of multilevel linear models by scaling the random effects to the residual variance and introducing parameters that control the relative contribution of the random effects. After integrating over the random effects and variance components, the resulting integrals that are needed to calculate the Bayes factor can be efficiently approximated with Laplace's method.

1. Introduction

Many studies collect data that have hierarchical or clustered structures. Examples include randomized studies in which patients are clustered within practices, educational studies in which students are clustered in schools or environmental studies in which individuals are clustered in homes clustered in counties. An analysis that ignores such clustering assumes that all observations are independent, resulting in incorrect model-based standard errors that can lead to misleading scientific inferences. Multilevel models are used to account for the correlation of observations within a given group by incorporating group-specific random effects. These random effects can be nested (e.g. repeated observations of students nested in schools, with random effects at the student and school levels), cross-nested (e.g. repeated observations of students nested in high schools taking different courses, with random effects at the student, school and course levels) or even non-nested (e.g. individuals clustered within job categories and states, with random effects at the job and state level). For an introduction to multilevel models, see Gelman and Hill (2007), Fitzmaurice et al. (2004), Sullivan et al. (1999) and Bryk and Raudenbush (1992).

1.1. Motivating data

Birth records were obtained for all live births in New York City (NYC) in 2003 and linked to the hospital discharge data from the statewide planning and research co-operative system by the New York State Department of Health. These data include information on mother's demographic characteristics, previous births, smoking, rate of weight gain during pregnancy, maternal birth outside the USA and the infant's gender, birth weight and gestational age (Savitz et al., 2008), all collected from the birth certificate. These data were also linked to US census data to obtain additional demographic information at the census tract level. Investigators are interested in identifying significant predictors of birth weight among term births adjusting for gestational age, with particular emphasis on exploring disparities that are related to race and ethnicity.

Research has shown a persistent racial disparity in birth outcomes in many countries (e.g. Osypuk and Acevedo-Garcia (2008) and Kelly et al. (2009)). Although individual and community level covariates have been shown to account for some of the racial disparity in low birth weight (Buka et al., 2003; Roberts, 1997; Rauh et al., 2001; O'Campo et al., 1997), much of this excess risk remains unexplained. Howard et al. (2006) found substantial variability in the risk of preterm birth and low birth weight among black race subgroups that were defined by eight distinct maternal ancestries (African, American, Asian, Cuban, European, Puerto Rican, South and Central American, and West Indian and Brazilian). They also found nativity (US or foreign born) to be a significant predictor that varied by ancestry. Additionally, 48 of 67 (72%) studies that were reviewed by Gagnon et al. (2009) found differences in birth weight outcomes between migrants and natives in western industrialized countries. These studies have been limited by coarse ethnic categorization that obscures substantial within-group heterogeneity in behavioural, psychosocial and environmental exposures. Many data sets are also limited to the crude socio-economic indicators on the birth certificate, such as mother's completed years of education.

To expand on this research, investigators in the NYC birth study classified maternal ancestry into 62 country regions to determine whether variability in birth weight in ancestries exists within smaller geographical regions, and whether potential ancestry effects are modified by the effects of race, maternal rate of weight gain and nativity. Such variability and potential patterns therein may help researchers to understand further the factors that are associated with racial disparities in birth outcomes. More specifically, the association with race may depend on maternal ancestry (for example the association with black race may depend on whether the mother has Nigerian or Jamaican ancestry), the association with ancestry may depend on nativity (for example the association with Nigerian ancestry may depend on whether the mother lived primarily in or outside the USA) and the association with maternal rate of weight gain may depend on maternal ancestry (e.g. whether a mother has Nigerian or Jamaican ancestry). Additionally, it is common for individuals with similar demographic characteristics to live close together, resulting in social as well as biological similarities between subjects. Hence, investigators are also interested in controlling for and assessing the effect of residential location as defined by census tract of residence. Neighbourhood factors, such as the neighbourhood deprivation index (NDI), which is a standardized score of various socio-economic factors at the tract level (in which higher scores represent higher levels of deprivation), may explain some racial disparities in birth outcomes.

We fitted a multilevel linear model for infants’ birth weight, predicted by infant gestational age, gender, maternal race, parity, smoking status, age, rate of weight gain, nativity and the NDI. The maternal rate of weight gain is defined as total gestational weight gain (pounds) divided by the length of each woman's pregnancy (weeks). We consider random effects that allow heterogeneity in birth weights across maternal ancestries and across census tract groups, as well as interactions between maternal ancestry and race, maternal rate of weight gain and nativity. To address the important question of whether heterogeneity exists within census tracts and maternal ancestries and whether potential heterogeneity across ancestries is affected by race, maternal rate of weight gain and nativity, we must be able to evaluate whether the variances of the random effects are different from 0.

1.2. Testing variance components

Testing whether the variance of a random coefficient is equal to 0 is problematic because the null hypothesis lies on the boundary of the parameter space. Such issues are addressed extensively in the literature in the context of two-level linear models, e.g. strategies that use a mixture of χ2-distributions (Self and Liang, 1987; Stram and Lee, 1994), score tests (Lin, 1997; Commenges and Jacqmin-Gadda, 1997; Verbeke and Molenberghs, 2003; Molenberghs and Verbeke, 2007; Zhang and Lin, 2008), Wald tests (Molenberghs and Verbeke, 2007; Silvapulle, 1992) and generalized likelihood ratio tests (Crainiceanu and Ruppert, 2004). These methods are only proposed in the context of the two-level linear model although some may be generalized to further cases. In this paper, we use the term ‘two-level linear model’ to denote a class of linear models that accommodates two levels in the data hierarchy (e.g. repeated observations nested within subjects); a notable example of this model is the standard linear mixed model (see Laird and Ware (1982)) for repeated measures on subjects over time. The broader term ‘multilevel linear model’ is used to denote a class of linear models that can have more than two levels in the data hierarchy or more than one level of clustering (e.g. repeated observations nested within subjects nested in schools). Such clusters can be nested, non-nested or cross-nested with other clusters. The two-level linear model can then be viewed as a special case of the multilevel linear model.
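To make the boundary issue concrete, the following minimal sketch computes a likelihood ratio test p-value for a single variance component under the 50:50 mixture of χ2-distributions with 0 and 1 degrees of freedom (the mixture strategy cited above for the simplest case). The function name and the example test statistic are hypothetical and are not taken from the paper.

```python
from scipy.stats import chi2

def boundary_lrt_pvalue(lrt_stat: float) -> float:
    """P-value for H0: variance = 0 (one variance component),
    using the mixture 0.5 * chi2(0) + 0.5 * chi2(1)."""
    if lrt_stat <= 0.0:
        # The chi2(0) component is a point mass at zero.
        return 1.0
    return 0.5 * chi2.sf(lrt_stat, df=1)

# Hypothetical likelihood ratio statistic, for illustration only.
lrt_stat = 2.71
naive_p = chi2.sf(lrt_stat, df=1)          # ignores the boundary
mixture_p = boundary_lrt_pvalue(lrt_stat)  # accounts for the boundary
print(f"naive p = {naive_p:.4f}, mixture p = {mixture_p:.4f}")
```

The naive χ2 reference distribution with 1 degree of freedom is conservative here: the mixture p-value is exactly half of it for any positive statistic.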

Methods for testing variance components in the two-level linear model are useful to some extent in nested multilevel models for testing single variance components, but the null distributions are not easily obtained for testing multiple variance components, and random effects that are non-nested or cross-nested introduce additional complications. There is very little research specifically on testing variance components in multilevel models with more than two levels. Bryk and Raudenbush (1992) proposed a χ2-test of the residuals for evaluating variance components in multilevel models and incorporated this test in the multilevel modelling software package HLM. Other approaches for nested models include various versions of the likelihood ratio test (Snijders and Bosker, 1999; Bliese, 2002; Hox, 2002), e.g. using a one-tailed significance level or using a mixture of χ2-distributions. Berkhof and Snijders (2001) proposed three score tests for variance components in multilevel models and compared their methods via simulation with the likelihood ratio test, fixed F-test and Wald test. Their simulations considered only two-level models and it is not clear whether generalizations to a larger number of levels are possible. Goldstein (1986) proposed a simple algorithm for fitting a multilevel linear mixed effects model for variance components near the boundary but did not provide a method for testing whether or not a variance component is equal to 0. Fitzmaurice et al. (2007) proposed a permutation test for variance components in multilevel generalized linear mixed models. They applied their method to two-level generalized mixed models and suggested strategies for multilevel models with more than two levels. Their strategy cannot be directly applied to multilevel models with crossed random effects and can test only one variance component at a time.

Bayes factors, or ratios of marginal likelihoods under equal prior probabilities, provide alternatives to frequentist hypothesis testing (see Kass and Raftery (1995)). In multilevel modelling settings, Bayes factors are ideal for comparing various types of model (e.g. multiple random effects, cross-nested or non-nested random effects), but the marginal likelihoods typically involve high dimensional integrals and are not available in closed form. Hence one must rely on approximations to the Bayes factor. The most widely used approximation to the Bayes factor is based on the Laplace approximation (Tierney and Kadane, 1986), resulting in the Bayesian information criterion (Schwarz, 1978) under certain assumptions. These approximations suffer in performance from high dimensionality (Kass and Raftery, 1995) and hence have limited applicability in multilevel models. The Bayesian information criterion and Laplace approximations are based on the assumption that the dimension of parameters is fixed as the sample size goes to ∞. This is problematic in multilevel models because the dimension of parameters increases as the sample size increases. Because of a violation of regularity conditions underlying the approximation, the Laplace method can fail when the parameter lies on the boundary of the parameter space (Pauler et al., 1999; Hsiao, 1997; Erkanli, 1994).
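For orientation, the Schwarz approximation mentioned above can be sketched as follows, assuming the usual definition BIC = −2 log(maximized likelihood) + d log(n), so that log(B10) ≈ −(BIC1 − BIC0)/2. All numerical inputs are hypothetical, and the sketch is meaningful only in the fixed-dimension setting that the paragraph describes.

```python
import math

def bic(max_loglik: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion for a fitted model."""
    return -2.0 * max_loglik + n_params * math.log(n_obs)

def approx_log_bf10(loglik1, d1, loglik0, d0, n_obs):
    """Schwarz approximation: log(B10) ~= -(BIC1 - BIC0) / 2."""
    return -0.5 * (bic(loglik1, d1, n_obs) - bic(loglik0, d0, n_obs))

# Hypothetical maximized log-likelihoods and parameter counts.
log_b10 = approx_log_bf10(loglik1=-1200.0, d1=5,
                          loglik0=-1210.0, d0=4, n_obs=500)
print(f"approximate log(B10) = {log_b10:.3f}")
```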

Markov chain Monte Carlo (MCMC) methods provide alternatives for approximating Bayes factors. Many of these methods can fail for certain types of ‘default’ priors on the variance components (Pauler et al., 1999). Bayesian stochastic search variable selection methods using MCMC methods in the two-level linear model (e.g. Cai and Dunson (2006) and Kinney and Dunson (2008)) may be generalizable to multilevel models, but these methods are generally computationally demanding and time consuming. Many other MCMC methods exist for model comparisons, e.g. the logarithm of the pseudo-marginal-likelihood (Gelfand, 1996), the deviance information criterion (Spiegelhalter et al., 2002) and other related methods for estimating marginal likelihoods (e.g. Chib and Jeliazkov (2001)). These methods generally require the fitting of each model being compared (i.e. MCMC samples from the posterior distribution for each model) and are computationally demanding in high dimensional models. In addition, even though conceptually we can obtain a perfect estimate of the Bayes factor by using MCMC methods run for infinitely many iterations, in practice MCMC algorithms can only be run for a finite number of samples and the existing algorithms may require a very large number of iterations to obtain an accurate estimate. Hence, in practice MCMC-based estimates of Bayes factors are also approximate and it is not clear that such estimates will in general be closer to the truth (given chains of the length that are typically run for practical reasons) than faster analytic approximations. In an attempt to develop a more efficient approximation to the Bayes factor, Saville and Herring (2009) proposed a method for approximating Bayes factors in the two-level linear model via a relatively simple Laplace approximation to the marginal likelihood. Their method does not require the fitting of a model via MCMC methods but applies only to the simple case of a two-level multilevel linear model.

It is well known that Bayes factors can be sensitive to the choice of prior distributions (Kass and Raftery, 1995). This is challenging in model selection problems in which we have no prior information on the parameters. In these situations it is common to use default priors that do not require subjective inputs. One must choose these default priors with care because, as the prior variance increases, the Bayes factor will increasingly favour the null model (Bartlett, 1957). Berger and Pericchi (1996) discussed various procedures for default priors for model selection via Bayes factors. These include their proposed intrinsic Bayes factors, the Schwarz approximation (Schwarz, 1978) and the methods of Jeffreys (1961) and Smith and Spiegelhalter (1980). For improper non-informative priors, the Bayes factor involves an arbitrary constant and hence is not well defined (Spiegelhalter and Smith, 1982). Gelman (2006) discussed various approaches to default priors specifically for variance components. Common approaches include the uniform prior (e.g. Gelman (2007)), the half t family of prior distributions and the inverse gamma distribution (Spiegelhalter et al., 2003). These prior distributions can encounter difficulties when the variance components are close to 0. Other discussions of selecting default priors on variance components were presented by Natarajan and Kass (2000), Browne and Draper (2006) and Kass and Natarajan (2006). As an alternative to these approaches, Saville and Herring (2009) scaled the random effects to the residual variance and introduced default priors that were shown to have good frequentist properties in the two-level linear model.
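The sensitivity to the prior variance noted above (Bartlett, 1957) can be illustrated with a deliberately simple model that is not the multilevel model of this paper: a normal mean with unit-variance observations, H0: θ=0 against H1: θ ∼ N(0,τ2). The function name and the data summary below are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def bf01_normal_mean(ybar: float, n: int, tau2: float) -> float:
    """Bayes factor for H0: theta = 0 versus H1: theta ~ N(0, tau2),
    where ybar is the mean of n unit-variance normal observations."""
    se2 = 1.0 / n
    m0 = norm.pdf(ybar, loc=0.0, scale=np.sqrt(se2))         # marginal under H0
    m1 = norm.pdf(ybar, loc=0.0, scale=np.sqrt(se2 + tau2))  # marginal under H1
    return m0 / m1

# Hypothetical data summary: sample mean 0.3 from n = 100 observations.
ybar, n = 0.3, 100
tau2_grid = [0.1, 1.0, 100.0, 1e4, 1e6]
bfs = [bf01_normal_mean(ybar, n, t2) for t2 in tau2_grid]
for t2, bf in zip(tau2_grid, bfs):
    print(f"tau2 = {t2:>9g}  B01 = {bf:.4g}")
```

Although the data are held fixed, the Bayes factor in favour of the null model grows without bound as τ2 increases over this grid, which is why default priors must be chosen with care.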

As noted previously, testing hypotheses on the boundary is problematic for certain classical (i.e. frequentist) approaches because traditional asymptotic results do not apply directly (for example it becomes more difficult to approximate the p-value for a likelihood ratio test). In the Bayesian case, there are no conceptual problems with testing null hypotheses on the boundary of the parameter space, but the Laplace approximation to the marginal likelihood under the alternative can be inaccurate when the parameter lies close to the boundary. We generalize the approach of Saville and Herring (2009) for testing variance components via Bayes factors to multilevel linear models with more than two levels in the data hierarchy (i.e. more than one level of clustering). The method does not require MCMC samples from the posterior distribution or the fitting of each model being compared; hence it is computationally more efficient than many of the current Bayesian methods that are available for multilevel linear models. The strategy is to scale the random effects to the residual variance and to introduce parameters that control the relative contribution of the random effects. This scaling enables us to integrate over the random effects and variance components from the posterior in closed form, such that the resulting integrals that are needed to calculate the Bayes factor are of small dimension and can be efficiently approximated with Laplace's method. In addition, we have improved the accuracy of the Laplace approximation by transforming the scale parameter so that the boundary lies at −∞ instead of 0. Our method is relatively fast to implement and may incorporate default prior distributions that have been shown to have good frequentist properties in the two-level linear model (Saville and Herring, 2009). We present the Bayesian model selection problem in Section 2. We summarize our method for approximating the marginal likelihood in Section 3 and apply our method to the NYC birth data in Section 4. A discussion follows in Section 5.

2. Bayes factors and the multilevel linear model

2.1. Notation

We define the general multilevel linear model with q random factors as

$$Y_i = x_i'\beta + \sum_{h=1}^{q} z_{ih}'\,b_{h[i]} + \varepsilon_i \tag{1}$$

in which Yi is the response for observation i, i=1,…,m, and xi is a p×1 vector of predictors with corresponding fixed effects β. Defining inline image and inline image, we note that zih is a dh×1 vector of predictors with corresponding random effects bh[i], in which [i] indexes the group in factor h pertaining to the ith observation, and bh[i] ∼ N(0,Ψh) is independent of ɛi ∼ N(0,σ2), with bh[i] independent of bh′[i] for h ≠ h′. From a Bayesian perspective, prior distributions are specified for β, Ψh and σ2 to reflect prior knowledge of the parameters. When one of the q random factors is nested within another random factor (e.g. maternal ancestry nested within geographical region), a hierarchical structure is created. A key feature of multilevel modelling is the incorporation of covariates xi that can be measured at any level of the hierarchy. This allows us to address the effect of a given covariate, say at the ancestry level, while controlling for the effect of a higher level covariate, say at the geographical region level. We must interpret such regression parameters carefully because some covariates can operate at many different levels.

To illustrate, consider the NYC birth data for 2003, in which there are 104710 observations within 62 ethnic ancestries and 2128 census tracts. The aims of our analysis are to identify significant predictors of infant birth weight and to determine whether there is heterogeneity across ancestry groups and census tracts. To start, we shall consider one predictor, maternal rate of weight gain during pregnancy, which has been linked to infant birth weight. Because of social and biological characteristics that are shared by people of the same ancestry, the effect of maternal rate of weight gain may vary by country of origin. A non-nested multilevel linear model, with a random intercept and slope (for rate of weight gain) at the ancestry level and a random intercept at the tract level, can evaluate this hypothesis. One model is

$$Y_i = \beta_0 + \beta_1 x_i + b_{10[i]} + b_{11[i]}\,x_i + b_{20[i]} + \varepsilon_i \tag{2}$$

in which Yi is the weight of infant i, xi is the rate of weight gain of the ith mother, β0 is the model intercept, β1 is the parameter corresponding to rate of weight gain, b10[i] is the random intercept, b11[i] is the random slope corresponding to the ancestry of mother i and b20[i] is the random intercept corresponding to the census tract of mother i. There are a total of 2×62=124 random effects at the ancestry level and 2128 random effects at the census level. To test whether there is heterogeneity in birth weights across ancestries (h=1) or census tracts (h=2), we can conduct a test of whether the variance of the respective random effects is equal to 0. This corresponds to a test of H0:Ψh=0, which lies on the boundary of the parameter space.

2.2. Laplace approximation to the Bayes factor

From a Bayesian perspective, we can evaluate H0:Ψh=0 by calculating the Bayes factor, or posterior odds of M1 versus M0 given equal prior odds, given by

$$B_{10} = \frac{p(Y|M_1)}{p(Y|M_0)} \tag{3}$$

in which M0 is the model corresponding to the null hypothesis (variance components equal to 0) and M1 is the model corresponding to the alternative hypothesis (variance components greater than 0). Calculating the Bayes factor requires the marginal likelihood

$$p(Y|M_k) = \int p(Y|\theta_k,M_k)\,\pi(\theta_k|M_k)\,\mathrm{d}\theta_k \tag{4}$$

in which p(Y|θk,Mk) is the data likelihood for model Mk, θk is the vector of model parameters and π(θk|Mk) is the prior distribution of θk.

To approximate the marginal likelihood, we consider the Laplace approximation, which is based on a second-order Taylor series expansion of $\tilde{l}(\theta_k)=\log\{p(Y|\theta_k,M_k)\,\pi(\theta_k|M_k)\}$ about its mode. The marginal likelihood p(Y|Mk) is estimated by

$$\hat{p}(Y|M_k) = (2\pi)^{d_k/2}\,|\tilde{\Sigma}_k|^{1/2}\,p(Y|\tilde{\theta}_k,M_k)\,\pi(\tilde{\theta}_k|M_k) \tag{5}$$

in which $\tilde{\Sigma}_k$ is the inverse of the negative Hessian matrix of $\tilde{l}(\theta_k)$ evaluated at the posterior mode $\tilde{\theta}_k$, $p(Y|\tilde{\theta}_k,M_k)$ is the marginal posterior evaluated at the posterior mode, $\pi(\tilde{\theta}_k|M_k)$ is the prior evaluated at the posterior mode and dk is the dimension of θk. Hence, to implement the Laplace approximation, we need only the matrix of second partial derivatives and the posterior mode of $\tilde{l}(\theta_k)$, which for small dimensions are easily computed in standard statistical software packages. As noted previously, multilevel models are typically high dimensional and may involve variance components that are near the boundary, meaning that the Laplace approximation cannot be directly applied to the integral in equation (4).
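The two ingredients named above, a posterior mode and a Hessian, are enough to sketch a generic Laplace approximation numerically. The example below is a toy problem chosen because its marginal likelihood is known exactly, not the multilevel model: Poisson counts with a gamma prior on the rate, parameterized on the log-scale so that the parameter is unconstrained. The data, the prior values and all function names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def numerical_hessian(f, x, eps=1e-4):
    """Central finite-difference Hessian of a scalar function f at x."""
    x = np.asarray(x, dtype=float)
    d = x.size
    hess = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d)
            ej = np.zeros(d)
            ei[i] = eps
            ej[j] = eps
            hess[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                          - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * eps ** 2)
    return hess

def laplace_log_marginal(log_joint, theta0):
    """Laplace estimate of log p(Y): (d/2) log(2 pi) + 0.5 log|Sigma| + l(mode),
    where Sigma is the inverse of the negative Hessian of l at the mode."""
    res = minimize(lambda t: -log_joint(t),
                   np.asarray(theta0, dtype=float), method="BFGS")
    mode = res.x
    neg_hess = -numerical_hessian(log_joint, mode)
    _, logdet_neg_hess = np.linalg.slogdet(neg_hess)
    d = mode.size
    return 0.5 * d * np.log(2.0 * np.pi) - 0.5 * logdet_neg_hess + log_joint(mode)

# Hypothetical toy data: Poisson counts with a Gamma(a, b) prior on the rate.
y = np.array([3, 5, 4, 6, 2, 4, 5, 3])
a, b = 2.0, 1.0
n, s = y.size, y.sum()
const = -gammaln(y + 1).sum() + a * np.log(b) - gammaln(a)

def log_joint(theta):
    """log{p(y | theta) pi(theta)} with theta = log(rate), Jacobian included."""
    t = theta[0]
    return const + (s + a) * t - (n + b) * np.exp(t)

approx = laplace_log_marginal(log_joint, [0.0])
exact = const + gammaln(a + s) - (a + s) * np.log(n + b)
print(f"Laplace: {approx:.4f}, exact: {exact:.4f}")
```

In this one-dimensional problem the approximation agrees closely with the exact value; the difficulty in multilevel models is that the integral in equation (4) is high dimensional and may involve parameters near a boundary.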

3. Approximating the marginal likelihood

We outline the general strategy for the methods proposed and provide complete mathematical details in Appendix A. For computational convenience, we reparameterize the multilevel linear model that is given in equation (1) so that all random effects are contained in one vector b. For example, in equation (2), there are 62×2=124 random effects corresponding to the random intercept and slope for maternal ancestry, and there are 2128 random effects corresponding to the random intercept of census tract. These random effects are stacked into one vector b of dimension 2252×1. A corresponding sparse design vector wi of the same dimension is created that contains mostly 0s, with non-zero elements corresponding to the appropriate random effects for observation i. Prior distributions that are specific to a given application are specified for β, σ2 and Ψh, which is the covariance matrix of the random effects bhl corresponding to factor h and classification l. We assume normality of the random effects, bhl ∼ N(0,Ψh), which are independent of the residual error ɛi ∼ N(0,σ2).
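The stacking described above can be sketched with a sparse matrix whose ith row is the design vector wi. The column layout (62 ancestry intercepts, then 62 ancestry slopes, then 2128 tract intercepts) is an assumption made for illustration, as are the function name and the simulated group labels.

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_random_effects_design(x, ancestry, tract, n_ancestry=62, n_tract=2128):
    """Return the m x (2 * n_ancestry + n_tract) sparse matrix whose ith row
    is w_i for the model with an ancestry intercept and slope and a tract
    intercept. Assumed column layout: [ancestry intercepts | ancestry slopes |
    tract intercepts]."""
    m = len(x)
    rows = np.repeat(np.arange(m), 3)  # three non-zero entries per observation
    cols = np.column_stack([ancestry,
                            n_ancestry + ancestry,
                            2 * n_ancestry + tract]).ravel()
    vals = np.column_stack([np.ones(m), x, np.ones(m)]).ravel()
    return csr_matrix((vals, (rows, cols)), shape=(m, 2 * n_ancestry + n_tract))

# Hypothetical inputs: 1000 observations with random group memberships.
rng = np.random.default_rng(0)
m = 1000
x = rng.normal(size=m)                    # e.g. standardized rate of weight gain
ancestry = rng.integers(0, 62, size=m)    # ancestry index of each mother
tract = rng.integers(0, 2128, size=m)     # census tract index of each mother
W = build_random_effects_design(x, ancestry, tract)
print(W.shape, W.nnz)
```

With b the stacked 2252×1 vector of random effects, the random part of the linear predictor for all observations is then the sparse product `W @ b`.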

Extending the work of Saville and Herring (2009), we scale the random effects, which are now denoted as inline image, to the residual variance such that inline image and introduce a parameter vector φh that controls the relative contribution of the scaled random effects. We also allow correlation between the respective random effects for a given factor through a parameter vector γh. For example, consider the cross-classified (non-nested) model that is given by equation (2). The reparameterized model takes the form

image(6)

in which inline image is the scaled random intercept and inline image the scaled random slope corresponding to the ancestry of mother i, inline image is the scaled random intercept corresponding to the census tract of mother i and inline image, in which γ110 allows correlation between the scaled random intercept and slope corresponding to ancestry (where the subscript h10 on γ denotes correlation between inline image and inline image for factor h). The random effects for this example correspond to a random intercept and slope at the ancestry level and a random intercept at the census level. Expression (6) is related to reparameterizations that are used to reduce auto-correlation in MCMC algorithms for multilevel models (Browne et al., 2009), though our focus and motivation are fundamentally different.

Let Y=(Y1,…,Ym)′ and σ2 ∼ InvGam(v,w). The primary reason for scaling the random coefficients to the residual variance is that it allows the integration of the scaled random effects and σ2 from the posterior distribution in closed form, i.e. the marginal posterior p(Y|β,φ,γ) has a multivariate t-distribution. This enables us to obtain an accurate approximation of the marginal likelihood by using Laplace's method. We assume the default prior φhk ∼ N{log(0.3),2} (corresponding to the kth random effect for factor h) that was suggested by Saville and Herring (2009) and use the Laplace method to integrate over (β,φ,γ) to obtain the marginal density p(Y). This default prior was shown to have good frequentist properties in simulation studies in the two-level linear model. The prior distributions for β and γ as well as the values of v and w in the prior for σ2 are set by the investigator on the basis of the specific application. Following Gelman et al. (2006, 2008), we advocate weakly informative priors that are chosen by subject matter knowledge in an application area but with the prior variance modestly inflated relative to one's subjectively chosen prior variance to allow robustness. The elicitation process is illustrated through the motivating application in the following section. Because of the rescaling and subsequent integration, the dimension of the marginal posterior p(Y|β,φ,γ) is much smaller than that of the data likelihood inline image and lacks parameters with boundary constraints (i.e. variance components). For example, in the model that is given by equation (6), the density p(Y|β,φ,γ) incorporates only two parameters in β, three parameters in φ and one parameter in γ. In addition, the number of parameters in the marginal posterior p(Y|β,φ,γ) is fixed as the sample size goes to ∞. Hence the Laplace approximation can be used to approximate the marginal likelihood p(Y) efficiently.
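The closed form integration can be sketched under simplifying assumptions that are ours rather than the paper's (Appendix A gives the actual derivation): no correlation parameters γ, scaled random effects distributed as N(0,σ2) and multiplied by exp(φ), and InvGam(v,w) read as shape v and scale w. Under these assumptions Y given (β,φ) is multivariate t with 2v degrees of freedom, location Xβ and scale matrix (w/v)(I + W D2 W′), where D = diag{exp(φ)}. The toy data and the function name are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_t

def log_marginal_t(Y, X, W, beta, phi, v, w):
    """log p(Y | beta, phi) after integrating out the scaled random effects
    and sigma^2, under the simplifying assumptions stated in the text.
    phi holds one log-scale multiplier per column of W."""
    m = Y.shape[0]
    WD = W * np.exp(phi)                 # scale each column of W by exp(phi)
    Sigma = np.eye(m) + WD @ WD.T        # covariance of Y up to the factor sigma^2
    return multivariate_t.logpdf(Y, loc=X @ beta,
                                 shape=(w / v) * Sigma, df=2.0 * v)

# Hypothetical toy data: 6 observations in 3 groups, random intercept only.
rng = np.random.default_rng(1)
m = 6
X = np.column_stack([np.ones(m), rng.normal(size=m)])
group = np.array([0, 0, 1, 1, 2, 2])
W = np.zeros((m, 3))
W[np.arange(m), group] = 1.0
beta = np.array([3.36, 0.1])
phi = np.full(3, np.log(0.3))            # prior mean of phi used as a plug-in value
v, w = 0.1, 0.1
Y = X @ beta + rng.normal(scale=0.5, size=m)
value = log_marginal_t(Y, X, W, beta, phi, v, w)
print(f"log p(Y | beta, phi) = {value:.4f}")
```

The dense m×m matrix is suitable only for small examples. With data of the size considered here, the same density would have to be evaluated by exploiting the sparse, low rank structure of W rather than forming the matrix explicitly.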

4. Application

We are interested in comparing various multilevel linear models for infant's birth weight, predicted by infant gestational age at delivery, gender, maternal race, parity, smoking status, age, rate of weight gain, maternal nativity and the NDI, with random effects for census tracts and ethnic ancestries. We focus on singleton term births with a gestational age of 37 weeks or longer and a birth weight between 900 g and 5300 g. After exclusions, we have a total of 93938 subjects with complete data available for the analysis.

We consider several competing models with various random-coefficient structures (Table 1). The first model M1 that we investigate allows a random intercept for ancestry (country of origin), defined as

image(7)

with

image(8)

The explanatory variables Blacki, Hispi, Asiani and Otheri are indicator variables for race corresponding to black, Hispanic, Asian or Pacific Islander, and other (white is the reference group). Gesti is the infant gestational age in weeks for subject i and $\mathrm{Gest}_i^2$ is the corresponding quadratic variable. The variables Pbirthi, Femalei, Smokei and Foreigni are indicator variables for any previous births, female infant gender, maternal smoking and maternal birth outside the USA respectively. The variable NDIi is the NDI corresponding to the census tract of subject i (with higher values indicating more deprived living conditions). At the request of our epidemiologist collaborators, maternal age was categorized into the following groups: less than 25 years (the reference group), 26–30 years (Age2i), 31–35 years (Age3i), 36–40 years (Age4i) and older than 40 years (Age5i). The variable Wtgaini is the difference in pounds between maternal prepregnancy weight and weight at delivery (delivery weight minus prepregnancy weight), and $\mathrm{Wtgain}_i^2$ and $\mathrm{Wtgain}_i^3$ are the corresponding quadratic and cubic variables respectively. The continuous variables Gesti, NDIi and Wtgaini are centred and standardized by 2 standard deviations to place the regression coefficients on a similar scale to that of the binary indicators (Gelman, 2008). The quadratic and cubic versions of those variables are based on the standardized variables. The random intercept b1[i] ∼ N(0,ψ1) corresponds to the ancestry of subject i and is independent of ɛi ∼ N(0,σ2).
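The standardization of the continuous predictors can be sketched as follows. The use of the sample standard deviation (denominator n−1) is an assumption, and the gestational ages shown are hypothetical values rather than study data.

```python
import numpy as np

def standardize_2sd(x):
    """Centre a continuous predictor and divide by 2 standard deviations."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (2.0 * x.std(ddof=1))

# Hypothetical gestational ages in weeks.
gest_weeks = np.array([37.0, 38.0, 39.0, 40.0, 41.0, 42.0])
gest_std = standardize_2sd(gest_weeks)
gest_std_sq = gest_std ** 2   # quadratic term built from the standardized variable
print(gest_std.round(3))
```

After this transformation a coefficient describes the change in the response for an increase of 2 standard deviations in the predictor, which is roughly comparable with the change from 0 to 1 in a balanced binary indicator.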

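Table 1. Random effects for models considered

| Factor | Random coefficient | M0 | M1 | M2 | M3 | M4 | M5 | M6 | M7 |
|---|---|---|---|---|---|---|---|---|---|
| Ancestry (country of origin) | Intercept | | × | | × | | | × | × |
| | Intercept: ancestry*race | | | | | × | | | |
| | Intercept: ancestry*nativity | | | | | | × | | |
| | Slope: maternal rate of weight gain | | | | | | | × | |
| Census tract | Intercept | | | × | × | × | × | × | × |
| Geographical region | Intercept | | | | | | | | × |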

We also consider a model M2 with a random intercept for census tracts but without random effects for ancestries,

image(9)

in which b2[i] ∼ N(0,ψ2) is the random intercept corresponding to the census tract of subject i. Incorporating random intercepts for both ancestries and census tracts, the two-factor cross-classified (non-nested) model M3 takes the form

image(10)

in which b1[i] ∼ N(0,ψ1) is independent of b2[i] ∼ N(0,ψ2). As discussed previously, the effect of race may depend on maternal ancestry. Hence we consider a variation of M3 with random intercepts for both ancestry and census tract, but we allow the effect of race to vary by ancestry. This model M4 can be written as

image(11)

in which b1p[i] ∼ N(0,ψ1p) is the random intercept corresponding to the ancestry (factor 1) of subject i within race p, independent of b2[i]. This model assumes that two people of the same ancestry with different races have different random intercepts. Similarly, it may also be that the effect of ancestry varies by nativity. Hence we consider model M5,

image(12)

in which b1s[i] ∼ N(0,ψ1s) is the random intercept corresponding to the ancestry (factor 1) of subject i within nativity s, independent of b2[i]. This model assumes that two people of the same ancestry but different nativity (one foreign born and one not foreign born) have distinct random intercepts. It may also be that the effect of maternal rate of weight gain on infant birth weight is affected by ancestry. This may result from either biological or social factors that are correlated with a given ancestry. We can model this heterogeneity by including a random slope for maternal rate of weight gain for the ancestry factor. Adding this component to model M3, we have model M6:

image(13)

in which b10[i] is the random intercept and b11[i] is the random slope for the rate of weight gain corresponding to the ancestry of subject i, and b1[i]=(b10[i],b11[i]) ∼ N(0,Ψ1) are independent of b2[i].

Previous research has shown heterogeneity of infant birth weights from women in different geographical regions (Howard et al., 2006). Hence we also consider a model that includes random intercepts for the 15 geographical regions on the basis of maternal ancestry in addition to random intercepts for maternal ancestry (country of origin) and census tract, given by model M7,

image(14)

in which b3[i] ∼ N(0,ψ3) is the random intercept corresponding to the geographical region of subject i, independent of b1[i] and b2[i]. Finally, we consider model M0 without random effects,

image(15)

Our goal is to identify the preferred model and to proceed with inference by using this chosen model.

The mean value for infant birth weight is 3362 g with a standard deviation of 460 g. Converting to kilograms for computational convenience, we use prior distributions β0∼N(3.36,1), β∼N(0,I) and σ2∼InvGam(0.1,0.1), which are weakly informative given the scale of the response and predictors. We found very strong evidence for heterogeneity in birth weights across census tracts and across ancestries (inline image, inline image and inline image, in which inline image denotes the estimated Bayes factor comparing Mk with Mk′ as given by equation (3)), with birth weights tending to vary across maternal ancestries in greater magnitude than across census tracts. We found that the effects of race (inline image), nativity (inline image) and maternal rate of weight gain (inline image) do not vary by ancestry. Additionally, birth weights did not vary significantly by geographical region after accounting for maternal ancestry and census tract of residence (log(B73)=−2).

We fitted the preferred model M3 by using MCMC methods and based inference on 20000 samples after discarding 5000 as a burn-in. The posterior means and 95% credible intervals of the fixed effects are given in Table 2. Results are presented in grams for better interpretability. Predictors with 95% credible intervals lying entirely above 0 include parity (99,111), maternal age 26–30 years (45,60), maternal age 31–35 years (64,80), maternal age 36–40 years (74,93), maternal age greater than 40 years (60,92) and maternal foreign nativity (3,19). Hence, previous live births, greater maternal age and maternal birth outside the USA are all associated with greater infant birth weights. Predictors with 95% credible intervals lying entirely below 0 include maternal Asian race (−92,−21), black race (−75,−5), infant female gender (−126,−115), maternal smoking (−186,−143) and higher neighbourhood deprivation (95% credible interval (−23,−9) for an increase of 2 standard deviations). Hence, Asian and black race (compared with white), female infants (compared with males), smokers (compared with non-smokers) and greater NDI values are associated with lower infant birth weights. Both maternal rate of weight gain and infant gestational age showed non-linear associations with infant birth weight. Fig. 1(b) shows that, in the range 0.25–2 lb per week, a greater maternal rate of weight gain is associated with greater infant birth weights; in the range less than 0.25 or greater than 2 lb per week, a greater maternal rate of weight gain is associated with smaller infant birth weights, although some caution should be exercised in the interpretation at the extremes of the data. Fig. 1(a) shows that greater gestational age is associated with greater infant birth weights, but this association flattens somewhat as gestational age nears the right-hand tail of its distribution (44 weeks), perhaps because of inaccurate pregnancy dating.

The variables with the largest effects on infant birth weight are smoking (posterior mean −165 g), female infant gender (posterior mean −120 g), maternal rate of weight gain (non-linear) and infant gestational age (non-linear). Variables with weaker yet ‘significant’ associations include an increase of 2 standard deviations in NDI (posterior mean −16 g), maternal foreign nativity (posterior mean 11 g) and black versus white race (posterior mean −40 g). Although these smaller effects have little clinical relevance at the individual level, they are interesting findings for aetiologic purposes, as a shift in the population distribution by a few grams can push many individuals beyond a critical point in the tail regions, potentially affecting perinatal mortality or other related outcomes at the population level. The 95% credible intervals for Hispanic ethnicity (−31,57) and ‘other’ race (−81,75) contain 0, indicating non-significant associations with infant birth weight. The non-significant result for Hispanic ethnicity may be due to the way in which the variable was constructed. Data were not initially collected for Hispanic ethnicity, and investigators therefore constructed a Hispanic indicator variable by using the ethnic ancestry variable. Hence this predictor may lack the precision of the other race indicator variables. The ‘other’ race group suffered from a small sample size.
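The posterior means and percentile-based 95% credible intervals reported for the fixed effects are simple summaries of the retained MCMC draws. The following is a minimal sketch of that summary step, assuming the draws are stored as an array with one row per iteration; the function name `summarize_draws` and the synthetic draws are illustrative only and are not the authors' code or data.

```python
import numpy as np

def summarize_draws(draws, burn_in):
    """Posterior mean and 2.5/97.5 percentiles per column, after discarding burn-in rows."""
    kept = np.asarray(draws)[burn_in:]
    mean = kept.mean(axis=0)
    lower, upper = np.percentile(kept, [2.5, 97.5], axis=0)
    return mean, lower, upper

# Synthetic stand-in for sampler output (NOT the NYC data): 25000 iterations
# for two coefficients on the kilogram scale used for model fitting.
rng = np.random.default_rng(0)
draws = rng.normal(loc=[-0.165, 0.011], scale=[0.011, 0.004], size=(25000, 2))

mean, lower, upper = summarize_draws(draws, burn_in=5000)

# Report in grams, as in the text, by rescaling the kilogram-scale summaries.
for name, m, lo, hi in zip(["smoke", "foreign"], 1000 * mean, 1000 * lower, 1000 * upper):
    print(f"{name}: mean {m:.0f} g, 95% interval ({lo:.0f}, {hi:.0f}) g")
```

Because the model is fitted in kilograms, summaries are multiplied by 1000 only at the reporting stage; the interval is declared to exclude 0 when both percentiles share the same sign.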

Table 2. Model posterior means and 95% credible intervals

| Parameter | Posterior mean (g) | 2.5 percentile (g) | 97.5 percentile (g) |
|---|---|---|---|
| β0 | 3331 | 3295 | 3366 |
| β1 (Black) | −40 | −75 | −5 |
| β2 (Hisp) | 13 | −31 | 57 |
| β3 (Asian) | −57 | −92 | −21 |
| β4 (Other) | −4 | −81 | 75 |
| β5 (Gest†) | 296 | 290 | 301 |
| β6 (Gest†)² | −63 | −71 | −54 |
| β7 (Pbirth) | 105 | 99 | 111 |
| β8 (Female) | −120 | −126 | −115 |
| β9 (Smoke) | −165 | −186 | −143 |
| β10 (Foreign) | 11 | 3 | 19 |
| β11 (NDI†) | −16 | −23 | −9 |
| β12 (Age 26–30) | 52 | 45 | 60 |
| β13 (Age 31–35) | 72 | 64 | 80 |
| β14 (Age 36–40) | 84 | 74 | 93 |
| β15 (Age >40) | 76 | 60 | 92 |
| β16 (Wtgain†) | 182 | 175 | 189 |
| β17 (Wtgain†)² | 48 | 39 | 57 |
| β18 (Wtgain†)³ | −35 | −41 | −29 |

†Estimates for an increase of 2 standard deviations.

Figure 1. Estimated change in infant birth weight by (a) gestational age and (b) maternal rate of weight gain

Fig. 2 displays 95% credible intervals for the ancestry random intercepts. Ancestries with the greatest estimated infant birth weights include Peru, Morocco and Nigeria, whereas ancestries with the lowest estimated infant birth weights include Guyana, Bangladesh, Gambia and Ivory Coast. There were no notable trends across geographical regions.

Figure 2. Posterior means and 95% credible intervals of random intercepts

5. Discussion

In these data with uniquely rich ancestry and geographic information, we found very strong evidence for heterogeneity in full-term infant birth weights across census tracts and across ancestries. Moreover, the variation in birth weight across maternal ancestries was greater in magnitude than across census tracts and did not vary substantially by race, maternal rate of weight gain or nativity. We note that the tests of heterogeneity for birth weights across maternal ancestries by race or nativity may suffer from low power, because many countries consist predominantly of one race and similar nativity (Table 3). The finding of heterogeneity in birth weights across maternal ancestries is generally consistent with the findings of Howard et al. (2006), although they studied only black women in NYC and observed the effects of nativity to vary by maternal region of ancestry. One limitation of their study was the grouping of West Indian and Brazilian ancestry, which was an artefact of the coding scheme that was used in data collection. Furthermore, Howard et al. (2006) and many previous references focused on preterm birth (gestational age less than 37 weeks), whereas the current paper examines birth weight variability among full-term births only. The advantage of our outcome definition is that it focuses more clearly on variations in fetal growth, as small babies can arise from two mechanisms, shorter gestational age and intrauterine growth retardation, and the aetiologies of these mechanisms may be entirely distinct (Wilcox and Skjaerven, 1992). Although it is more aetiologically focused on infant growth, this approach restricts the distribution of birth weights that were included in our analyses, since much of the natural variability is contributed by gestational age. Therefore we are decomposing a subset of the true variability in birth weights, and our results apply specifically to mechanisms that operate through modifying intrauterine growth rate.

The heterogeneity across ancestries may be due to any of a large number of unmeasured social and biological factors, including diet, physical activity, social support and maternal health conditions. It is also important to note that the coefficient estimates that are shown here are adjusted for measured covariates, but that in reality groups differ widely in mean values for these covariates. For example, the posterior means that are shown in Fig. 2 hold constant all variables that were included as covariates in model M3, whereas these covariates are not constant across these groups in the population. Furthermore, the group means could differ even more dramatically if preterm births were also included.

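Table 3. Frequency counts for ancestry by race

| Region | Ancestry | White | Black | Hispanic | Asian | Other | Total |
|---|---|---|---|---|---|---|---|
| Non-Hispanic US white | Non-Hispanic US white | 24749 | 0 | 0 | 0 | 0 | 24749 |
| North Africa | Morocco | 203 | 21 | 0 | 4 | 0 | 228 |
| | Egypt | 347 | 0 | 0 | 7 | 0 | 354 |
| | Other north Africa | 65 | 44 | 0 | 4 | 0 | 113 |
| Sub-Saharan Africa | Nigeria | 3 | 410 | 0 | 3 | 0 | 416 |
| | Ghana | 2 | 450 | 0 | 0 | 0 | 452 |
| | Guinea | 0 | 256 | 0 | 0 | 0 | 256 |
| | Senegal | 1 | 206 | 0 | 1 | 0 | 208 |
| | Gambia | 0 | 177 | 0 | 0 | 0 | 177 |
| | Ivory Coast | 0 | 161 | 0 | 0 | 0 | 161 |
| | Mali | 2 | 187 | 0 | 0 | 0 | 189 |
| | Other west Africa | 5 | 219 | 0 | 1 | 0 | 225 |
| | Central–east–southern Africa | 38 | 283 | 0 | 4 | 0 | 325 |
| East Asia | China | 25 | 13 | 0 | 5506 | 0 | 5544 |
| | Hong Kong | 0 | 0 | 0 | 36 | 0 | 36 |
| | Taiwan | 1 | 0 | 0 | 65 | 0 | 66 |
| | Korea | 8 | 2 | 0 | 784 | 0 | 794 |
| | Japan | 9 | 3 | 0 | 352 | 0 | 364 |
| | Other east Asia | 19 | 3 | 0 | 51 | 0 | 73 |
| South-east Asia–Pacific Islands | Vietnam | 6 | 4 | 0 | 13 | 0 | 23 |
| | Malaysia | 0 | 0 | 0 | 78 | 2 | 80 |
| | Philippines | 22 | 9 | 0 | 646 | 0 | 677 |
| | Other south-east Asia | 12 | 5 | 0 | 151 | 0 | 168 |
| South-central Asia | India | 8 | 56 | 0 | 1374 | 7 | 1445 |
| | Bangladesh | 30 | 20 | 0 | 1190 | 0 | 1240 |
| | Pakistan | 40 | 10 | 0 | 960 | 0 | 1010 |
| | Afghanistan | 65 | 2 | 0 | 70 | 0 | 137 |
| | Iran | 96 | 0 | 0 | 2 | 0 | 98 |
| | Other south-central Asia | 149 | 3 | 0 | 148 | 0 | 300 |
| Non-Hispanic Caribbean | Jamaica | 5 | 2076 | 0 | 14 | 0 | 2095 |
| | Haiti | 6 | 1269 | 0 | 0 | 0 | 1275 |
| | Trinidad and Tobago | 12 | 1140 | 0 | 283 | 0 | 1435 |
| | Grenada | 0 | 220 | 0 | 3 | 0 | 223 |
| | Barbados | 0 | 175 | 0 | 0 | 0 | 175 |
| | St Vincent | 0 | 160 | 0 | 0 | 0 | 160 |
| | Antigua and Barbuda | 0 | 118 | 0 | 0 | 0 | 118 |
| | St Lucia | 1 | 142 | 0 | 1 | 0 | 144 |
| | Virgin Islands | 2 | 40 | 0 | 0 | 0 | 42 |
| | Other non-Hispanic Caribbean | 16 | 956 | 0 | 13 | 0 | 985 |
| Hispanic Caribbean | Dominican Republic | 0 | 0 | 8426 | 0 | 1 | 8427 |
| | Puerto Rico | 0 | 0 | 7997 | 0 | 3 | 8000 |
| | Cuba | 0 | 0 | 192 | 0 | 0 | 192 |
| Mexico | Mexico | 0 | 0 | 6585 | 0 | 0 | 6585 |
| South America | Guyana | 0 | 0 | 1785 | 0 | 73 | 1858 |
| | Ecuador | 0 | 0 | 3053 | 0 | 0 | 3053 |
| | Colombia | 0 | 0 | 1239 | 0 | 1 | 1240 |
| | Peru | 0 | 0 | 521 | 0 | 0 | 521 |
| | Brazil | 0 | 0 | 178 | 0 | 0 | 178 |
| | Argentina | 0 | 0 | 198 | 0 | 0 | 198 |
| | Venezuela | 0 | 0 | 181 | 0 | 0 | 181 |
| | Other south America | 0 | 0 | 283 | 0 | 0 | 283 |
| Central America | Honduras | 0 | 0 | 740 | 0 | 23 | 763 |
| | El Salvador | 0 | 0 | 640 | 0 | 0 | 640 |
| | Guatemala | 0 | 0 | 397 | 0 | 13 | 410 |
| | Panama | 0 | 0 | 226 | 0 | 0 | 226 |
| | Belize | 0 | 0 | 109 | 0 | 0 | 109 |
| | Nicaragua | 0 | 0 | 114 | 0 | 0 | 114 |
| | Other central America | 0 | 0 | 59 | 0 | 0 | 59 |
| African American | African American | 62 | 12323 | 0 | 12 | 6 | 12403 |
| American Indian | American Indian–Eskimo–Aleut | 5 | 18 | 0 | 0 | 12 | 35 |
| Other ethnicity | Other ethnicity | 59 | 344 | 0 | 137 | 7 | 547 |
| Other US-born Hispanic | Other US-born Hispanic | 0 | 0 | 1356 | 0 | 0 | 1356 |
| Total | | 26073 | 21525 | 34279 | 11913 | 148 | 93938 |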

The estimates here are useful for demonstrating how dramatically subpopulations can differ in outcomes, even when controlling for the important known determinants of birth weight. Groups in Fig. 2 range over several hundred grams in their adjusted mean weights, which is large compared with the effects of known risk factors, such as maternal smoking. Furthermore, despite many references on racial predisposition to adverse birth outcomes (Kistka et al., 2007), the greatest variation observed in these data is at the level of national ancestry. For example, Nigeria and Gambia are both West African populations that are tied to the ancestral origins of the African-American population, and yet the former has an adjusted mean that is about 100 g above the overall grand mean, whereas the latter has an adjusted mean that is about 100 g below the overall grand mean. Unique patterns of selective migration from these countries are among a large number of possible explanations for such patterns, but they are less consistent with theories of racial predisposition. The ancestries with the greatest adjusted infant birth weights include Peru, Morocco and Nigeria, which have no obvious connection. Nor do the ancestries with the lowest adjusted infant birth weights, such as Guyana, Bangladesh, Gambia and Ivory Coast. As noted previously, there were no notable trends across broader geographical regions after accounting for country of origin. This contrasts somewhat with earlier work that found heterogeneity in adverse birth outcomes across large ancestry regions for black women (Howard et al., 2006), though that work did not account for country-specific effects.

In summary, we have developed statistical methodology that has enabled the testing of random effects in the NYC birth weight study. Our approach avoids issues with testing on the boundary of the parameter space, uses low dimensional approximations to the Bayes factor and incorporates default priors for the variance components. Simulation studies (which are available from the authors on request) suggest that these priors have good frequentist properties and large sample consistency. The methodology is applicable to designs with any number of random effects for any number of nested, non-nested or cross-nested factors, although computational limitations may exist for extremely high dimensional problems (see Appendix A for discussion and proposed strategies). A major contribution of our method is the ability to test several variance components from multiple factors simultaneously, and to do so for nested, non-nested or cross-nested multilevel designs.

Acknowledgements

We are grateful to Dr David Savitz of the Mount Sinai School of Medicine for providing the NYC birth data and for his helpful comments on the manuscript. This study was supported by the National Institute of Child Health and Human Development (grant R21-HD050739) and National Institute of Environmental Health Sciences (grant T32-ES007018).

Appendix

Appendix A

A.1. Approximating the marginal likelihood

A.1.1. Reparameterization

For computational convenience, we reparameterize the multilevel linear model that is given in equation (1). Let

image(16)

in which wi stacks the vectors wih over the factors h, wih is an rh×1 vector of predictors with corresponding random effects bh and rh=dhch is the total number of random effects for factor h, with dh the number of random effects for one observation for factor h, and ch the total number of classifications for factor h. For example, in equation (2), d1=2 and c1=62, corresponding to a random intercept and slope (two random effects for observation i) for 62 classifications of ethnicity, and d2=1 and c2=2128, corresponding to a random intercept (one random coefficient for observation i) for 2128 classifications of census tracts. Additionally, wih=(δi⊗zih), in which δi is a ch×1 vector of indicator variables (equal to 1 if yes, and 0 if no) for group membership of observation i in each of the ch classifications, and ‘⊗’ denotes the left Kronecker product. The basic idea of this reparameterization is that all random effects in the model are stacked into one large vector b. The design vector wi will contain mostly 0s, with non-zero elements corresponding to the appropriate random effects for observation i, and has dimension r×1, with r=Σh rh the total number of random effects in the model. Also, b stacks the vectors bh, in which bh=(bh1′,…,bhch′)′ is the vector of all random effects for factor h. We assume that bhl∼Ndh(0dh,Ψh) (corresponding to factor h and classification l) independent of ɛi∼N(0,σ2). Prior distributions are specified for β, Ψh and σ2 that are appropriate for the application.
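To make the stacking concrete, the sketch below builds the design vector for a single factor from a group indicator and the observation's random-effect covariates. It uses NumPy's standard `np.kron` ordering, which places the covariates in the block of the stacked vector that belongs to the observation's group; whether this ordering matches the paper's ‘left’ Kronecker convention exactly is an assumption, and the function name `design_vector` and the toy numbers are illustrative only.

```python
import numpy as np

def design_vector(group, n_groups, z):
    """Return kron(delta, z): z placed in the block of the observation's group, 0s elsewhere."""
    delta = np.zeros(n_groups)
    delta[group] = 1.0          # indicator of group membership (length c_h)
    return np.kron(delta, z)    # length r_h = d_h * c_h

# Toy factor with c_h = 3 groups and d_h = 2 random effects (intercept, slope).
# The observation belongs to group index 1 and has slope covariate 0.8.
w = design_vector(group=1, n_groups=3, z=np.array([1.0, 0.8]))

# Stacked random effects: (intercept, slope) for group 0, then group 1, then group 2.
b = np.array([0.1, -0.2, 0.3, 0.5, -0.4, 0.0])

# Only group 1's effects contribute: 0.3 * 1.0 + 0.5 * 0.8
contribution = w @ b
```

With many groups the vector is almost entirely zeros, which is why sparse storage is attractive for the full design matrix.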

A.1.2. Rescaling the random effects

Extending the work of Saville and Herring (2009), we scale the random effects to the residual variance such that inline image. We then express the model as

image(17)

in which inline image is the vector of scaled random effects and inline image with inline image, and φh=(φh1,…,φhdh) are parameters that control the relative contribution of the random effects. The role of inline image with inline image, in which Γh is a lower triangular matrix with 1dh along the diagonal and lower off-diagonal elements γh, is to induce correlation between the random effects within factor h. There are a total of inline image parameters in the matrix Φ, or one parameter for each ‘random effect’ in the model.
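The role of the scale parameters and the lower triangular matrix can be illustrated numerically. The sketch below assumes that Φh is the diagonal matrix formed from φh, that Γh has unit diagonal with the elements of γh filled row by row below the diagonal, and that the scaled random effects have covariance σ2 times the identity, so that the implied covariance of the unscaled effects is σ2ΦhΓhΓh′Φh′. These are assumptions made for illustration, and `implied_covariance` is a hypothetical helper.

```python
import numpy as np

def implied_covariance(phi, gamma, sigma2):
    """Covariance sigma2 * (Phi Gamma)(Phi Gamma)' under the assumptions stated above."""
    d = len(phi)
    Phi = np.diag(phi)                           # relative contribution of each random effect
    Gamma = np.eye(d)                            # unit diagonal
    Gamma[np.tril_indices(d, k=-1)] = gamma      # below-diagonal entries induce correlation
    L = Phi @ Gamma
    return sigma2 * L @ L.T

# Random intercept and slope (d_h = 2) with one correlation-inducing parameter.
Psi = implied_covariance(phi=[0.5, 0.2], gamma=[0.3], sigma2=4.0)

# Setting an element of phi to 0 removes the corresponding random effect entirely.
Psi_no_slope = implied_covariance(phi=[0.5, 0.0], gamma=[0.3], sigma2=4.0)
```

Under this parameterization a variance component is absent exactly when its φ parameter is 0, so hypotheses about random effects become hypotheses about φ.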

We can stack all observations into one response vector Y and write the model as

image(18)

in which Y=(Y1,…,Ym)′, W=(w1,…,wm)′, X=(x1,…,xm)′ and ɛ=(ɛ1,…,ɛm)′. Let σ2∼InvGam(v,w). By integrating out the scaled random effects and σ2, the marginal distribution p(Y|β,φ,γ) can be shown to be the multivariate t-distribution that is given by

image(19)

in which Γ(·) denotes the gamma function and Σ=WΦΓΓ′Φ′W′+Im.
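A marginal density of this kind can be evaluated on the log-scale with a few lines of linear algebra. The sketch below assumes that, given σ2, Y is normal with mean Xβ and covariance σ2Σ, and that InvGam(v,w) has shape v and scale w; under those assumptions integrating out σ2 gives a multivariate t density with 2v degrees of freedom. The function `log_marginal_t`, the argument `Z` (standing for the product WΦΓ) and the toy data are illustrative, not the authors' implementation.

```python
import math
import numpy as np

def log_marginal_t(y, X, beta, Z, v, w):
    """log p(Y | beta, phi, gamma) with Sigma = Z Z' + I_m and sigma^2 ~ InvGam(shape v, scale w)."""
    m = len(y)
    Sigma = Z @ Z.T + np.eye(m)
    resid = y - X @ beta
    logdet = np.linalg.slogdet(Sigma)[1]
    quad = resid @ np.linalg.solve(Sigma, resid)   # resid' Sigma^{-1} resid
    return (math.lgamma(v + m / 2) - math.lgamma(v) + v * math.log(w)
            - 0.5 * m * math.log(2 * math.pi) - 0.5 * logdet
            - (v + m / 2) * math.log(w + 0.5 * quad))

# Toy data (kilogram scale): four births in two groups, one random intercept per group.
y = np.array([3.1, 3.4, 3.6, 3.2])
X = np.column_stack([np.ones(4), [0.0, 1.0, 0.0, 1.0]])
beta = np.array([3.3, 0.1])
W = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
Z = 0.5 * W            # corresponds to phi = 0.5 for the random intercept
v, w = 0.1, 0.1        # prior parameters for sigma^2 used in the application

log_ml = log_marginal_t(y, X, beta, Z, v, w)
```

Forming Σ explicitly is only practical for small m; for large samples the quantities would need to be computed through lower dimensional matrices.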

A.2. Computational considerations

A.2.1. Product of likelihoods

Although the theory that was previously outlined can accommodate any number of random effects for any number of nested, non-nested or cross-nested factors, there are computational limitations that should be considered. If the number of factors is extremely large (which is unrealistic for most settings), the Laplace approximation may eventually break down because the multivariate t-distribution may not be of sufficiently small dimension. Aside from this issue, for studies with large sample size m, the covariance matrix Σ in equation (19) may be too large to handle computationally. For example, in applying model (2) to the complete 2003 NYC data (m=104710), the covariance matrix Σ is (104710 × 104710). We note that this matrix has the potential to be extremely sparse, and even with very large m may be computationally feasible by using sparse matrix computations. When the matrix is large and not sufficiently sparse, it may be advantageous to work with the product of independent likelihoods (conditional on the random effects) as opposed to the likelihood of the vector of response variables. To illustrate, the marginal distribution can be written as

image(20)

with

image
image

and

image

in which Ir denotes the identity matrix with dimension r×r.

Using this approach, it should be computationally possible to approximate the marginal likelihood regardless of the size of m. The computation is limited, however, by the total number of random effects r. If r is very large, it may not be feasible to compute the inverse and determinant of the r×r matrix A (or may be very computationally expensive). For example, in applying model (2) to the NYC data, r=2252. Although it may be possible to compute the inverse and determinant of A in this example, computations are likely to be very slow. Hence, an alternative computational approach is to write the data likelihood as products of marginal likelihoods for lower dimensional response vectors or scalars.
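One standard way to see why the cost depends on r rather than m is the Woodbury identity together with the matrix determinant lemma. The sketch below assumes Σ=ZZ′+Im with Z=WΦΓ and works with the r×r matrix Z′Z+Ir; this matrix plays the role that the text assigns to A, but its exact form here is an assumption and may differ from the paper's definition. The helper name and the simulated inputs are illustrative.

```python
import numpy as np

def quad_and_logdet_via_r(Z, resid):
    """Compute resid' Sigma^{-1} resid and log|Sigma| for Sigma = Z Z' + I_m using only r x r algebra."""
    r = Z.shape[1]
    A = Z.T @ Z + np.eye(r)                      # r x r matrix (assumed form)
    Zt_resid = Z.T @ resid
    # Woodbury: Sigma^{-1} = I_m - Z A^{-1} Z'
    quad = resid @ resid - Zt_resid @ np.linalg.solve(A, Zt_resid)
    # Determinant lemma: |Z Z' + I_m| = |Z' Z + I_r|
    logdet = np.linalg.slogdet(A)[1]
    return quad, logdet

# Simulated check: m = 200 observations in r = 5 groups, one random intercept per group.
rng = np.random.default_rng(1)
m, r = 200, 5
groups = rng.integers(0, r, size=m)
Z = 0.5 * np.eye(r)[groups]                      # m x r scaled indicator matrix
resid = rng.normal(size=m)

quad, logdet = quad_and_logdet_via_r(Z, resid)

# Direct m x m computation, feasible here only because m is small.
Sigma = Z @ Z.T + np.eye(m)
quad_direct = resid @ np.linalg.solve(Sigma, resid)
logdet_direct = np.linalg.slogdet(Sigma)[1]
```

The two routes agree to numerical precision, but the first never forms an m×m matrix, so its cost grows with r; this is why very large r remains the binding constraint.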

A.2.2. Alternative for non-nested models (cross-classified)

Consider the NYC data in which there are two non-nested (cross-classified) factors: ancestry and census tracts. We denote the factor with fewer groups as h=1 (ancestry) and the factor with a larger number of groups as h=2 (census tracts). We can write the marginal likelihood as

image
image(21)

in which c2 is the number of groups in factor 2, Yk is the vector of responses for group k in factor 2, inline image are the random effects for factor 2, inline image are the random effects corresponding to group k in factor 2, inline image are the random effects for factor 1, mk is the number of subjects in group k of factor 2 and Yki is the response of the ith subject in group k of factor 2. This approach allows us to integrate out the random effects for factor 2 in smaller dimensions, as inline image is only a d2×1 vector. For model (2) applied to the NYC data, inline image is a scalar (representing a random intercept for census tract k) and results in matrices with smaller dimensions than those obtained from equation (20). These derivations are specific to a model with two cross-classified factors, but the general strategy could be applied to models with a larger number of cross-classified factors.

A.2.3. Alternative for nested models

Consider a three-level nested design, such as subjects nested within maternal ancestry nested within geographical region. In such cases we can use the nested structure for easier computation. Let h=1 denote the maternal ancestry factor and h=2 denote the geographical region factor. Then

image(22)

in which c1k is the number of groups for factor 1 within group k of factor 2, mkj is the number of subjects in group j of factor 1 within group k of factor 2, Ykj is the response vector for subjects in group j of factor 1 within group k of factor 2, Ykji is the response of subject i within group j of factor 1 within group k of factor 2, inline image are the random effects for factor 1 within group k of factor 2 and inline image are the random effects corresponding to group j of factor 1 within group k of factor 2. This approach allows us to integrate out the random effects inline image and inline image which have smaller dimensions equal to d1×1 and d2×1 respectively. For the NYC data with a random intercept for maternal ancestry and geographical region, inline image and inline image are both scalars. These derivations are specific to a three-level nested design, but the general strategy could be applied to models with larger numbers of nested factors, or even combinations of nested and cross-nested factors. For example, such strategies could be used on the NYC data, which have both nested and cross-classified random effects via factors for census tracts and maternal ancestry nested within geographical region.
