### Abstract

- Top of page
- Abstract
- 1. Introduction
- 2. Inverse probability weighting in surveys
- 3. Multistage sampling and multilevel models
- 4. Pseudo-maximum-likelihood estimation for multilevel models
- 5. Sandwich estimator of the standard errors
- 6. Application
- 7. Simulation
- 8. Discussion
- Acknowledgements
- References
- Appendix

**Summary.** Multilevel modelling is sometimes used for data from complex surveys involving multistage sampling, unequal sampling probabilities and stratification. We consider generalized linear mixed models and particularly the case of dichotomous responses. A pseudolikelihood approach for accommodating inverse probability weights in multilevel models with an arbitrary number of levels is implemented by using adaptive quadrature. A sandwich estimator is used to obtain standard errors that account for stratification and clustering. When level 1 weights are used that vary between elementary units in clusters, the scaling of the weights becomes important. We point out that not only variance components but also regression coefficients can be severely biased when the response is dichotomous. The pseudolikelihood methodology is applied to complex survey data on reading proficiency from the American sample of the ‘Program for international student assessment’ 2000 study, using the Stata program gllamm which can estimate a wide range of multilevel and latent variable models. Performance of pseudo-maximum-likelihood with different methods for handling level 1 weights is investigated in a Monte Carlo experiment. Pseudo-maximum-likelihood estimators of (conditional) regression coefficients perform well for large cluster sizes but are biased for small cluster sizes. In contrast, estimators of marginal effects perform well in both situations. We conclude that caution must be exercised in pseudo-maximum-likelihood estimation for small cluster sizes when level 1 weights are used.

### 1. Introduction

Surveys often employ multistage sampling designs where clusters (or primary sampling units (PSUs)) are sampled in the first stage, subclusters in the second stage, etc., until elementary units are sampled in the final stage. This results in a multilevel data set, each stage corresponding to a level with elementary units at level 1 and PSUs at the top level *L*. At each stage, the units at the corresponding level are often selected with unequal probabilities, typically leading to biased parameter estimates if standard multilevel modelling is used. Longford (1995a,b, 1996), Graubard and Korn (1996), Korn and Graubard (2003), Pfeffermann *et al.* (1998) and others have discussed the use of sampling weights to rectify this problem in the context of two-level linear (or linear mixed) models, particularly random-intercept models. In this paper we consider generalized linear mixed models.

When estimating models that are based on complex survey data, sampling weights are sometimes incorporated in the likelihood, producing a pseudolikelihood (e.g. Skinner (1989) and Chambers (2003)). For two-level linear models, Pfeffermann *et al.* (1998) implemented pseudo-maximum-likelihood estimation by using a probability-weighted iterative generalized least squares algorithm. For generalized linear mixed models, a weighted version of the iterative quasi-likelihood algorithm (e.g. Goldstein (1991)), which is analogous to probability-weighted iterative generalized least squares, is implemented in MLwiN (Rasbash *et al*., 2003). Unfortunately, this method is not expected to perform well since unweighted penalized quasi-likelihood often produces biased estimates, in particular when the responses are dichotomous (e.g. Rodríguez and Goldman (1995, 2001)). Furthermore, Renard and Molenberghs (2002) reported serious convergence problems and strange estimates when using MLwiN with probability weights for dichotomous responses.

A better approach for generalized linear mixed models is full pseudo-maximum-likelihood estimation, for instance via numerical integration. Grilli and Pratesi (2004) accomplished this by using SAS NLMIXED (Wolfinger, 1999) which implements maximum likelihood estimation for generalized linear mixed models by using adaptive quadrature. However, they had to resort to various tricks and the use of frequency weights at level 2 since probability weights are not accommodated. SAS NLMIXED is furthermore confined to models with no more than two levels. Another limitation is that it provides only model-based standard errors which are not valid for pseudo-maximum-likelihood estimation. Grilli and Pratesi (2004) therefore implemented an extremely computer-intensive nonparametric bootstrapping approach.

In this paper we describe full pseudo-maximum-likelihood estimation for generalized linear mixed models with any number of levels via adaptive quadrature (Rabe-Hesketh *et al*., 2005). Appropriate standard errors are obtained by using the sandwich estimator (Taylor linearization). Our approach is implemented in the Stata program gllamm (e.g. Rabe-Hesketh *et al*. (2002, 2004a) and Rabe-Hesketh and Skrondal (2005)), which allows specification of probability weights, as well as PSUs (if they are not included as the top level in the model) and strata. These methods are applied to the American sample of the ‘Program for international student assessment’ (PISA) 2000 study.

For linear mixed models Pfeffermann *et al.* (1998) pointed out that the scaling of the level 1 weights affects the estimates of the variance components, particularly the random-intercept variance, but may not have a large effect on the estimated regression coefficients (if the number of clusters is sufficiently large and the scaling constants do not depend on the responses). In contrast, for multilevel models for dichotomous responses we expect the estimated regression coefficients to be strongly affected by the scaling of the level 1 weights. This is because the regression coefficients are intrinsically related to the random-intercept variance. Specifically, for given marginal effects of the covariates on the response probabilities, the regression coefficients (which have conditional interpretations) are scaled by a multiplicative factor that increases as the random-intercept variance increases (see Section 3.2). Thus, the maximum likelihood estimates of the regression coefficients and the random-intercept variance are correlated in contrast with the linear case (e.g. Zeger *et al*. (1988)). As far as we are aware, this potential problem has not been investigated or pointed out before. Although Grilli and Pratesi (2004) considered pseudo-maximum-likelihood estimation for dichotomous responses, they focused mostly on the bias of the estimated random-intercept variance. Moreover, they simulated from models with small regression parameters (0 and 0.1), making it difficult to detect multiplicative bias unless it is extreme.

Using estimates from the multilevel model, approximate marginal effects can be obtained by rescaling the regression coefficients (conditional effects) according to the random-intercept variance (e.g. Skrondal and Rabe-Hesketh (2004), page 125). We conjecture that these marginal effects will be less biased and less affected by the scaling of the level 1 weights than the original parameters. This would imply that marginal effects can be more reliably estimated in the presence of level 1 weights.
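The rescaling described above can be illustrated with a minimal sketch. It assumes the common latent-response approximation for a random-intercept logit model, in which a conditional coefficient is divided by the square root of 1 + ψ/(π²/3); the function name `approx_marginal_effect` is hypothetical. Under this assumption, a conditional coefficient of 1 with random-intercept variance 1 gives roughly 0.88, which agrees with the true value reported for the marginal effects in the simulation tables of Section 7.

```python
import math


def approx_marginal_effect(beta: float, psi: float) -> float:
    """Rescale a conditional (cluster-specific) logit coefficient.

    Assumed approximation: divide by sqrt(1 + psi / (pi^2 / 3)), where
    pi^2 / 3 is the variance of the standard logistic level 1 residual.
    """
    return beta / math.sqrt(1.0 + psi / (math.pi ** 2 / 3.0))


# Conditional coefficient 1 and random-intercept variance 1.
print(round(approx_marginal_effect(1.0, 1.0), 2))  # 0.88
```

Because the same factor is applied to every coefficient, an upward bias in the estimated variance and a proportional upward bias in the coefficients can offset each other in the rescaled quantity.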

The plan of the paper is as follows. In Section 2 we briefly review descriptive and analytic inference for complex survey data with unequal selection probabilities. We then extend these ideas to multistage designs and introduce multilevel and generalized linear mixed models in Section 3. In Section 4 we suggest a pseudolikelihood approach to the estimation of multilevel and generalized linear mixed models incorporating sampling weights. We also describe various scaling methods for level 1 weights. In Section 5 we present a sandwich estimator for the standard errors of the pseudo-maximum-likelihood estimators, taking weighting into account. Having described the pseudolikelihood methodology, we apply it in Section 6 to a multilevel logistic model for complex survey data on reading proficiency among 15-year-old American students from the PISA 2000 study. In Section 7 we carry out simulations to investigate the performance of pseudo-maximum-likelihood estimation using unscaled weights and different scaling methods. We also assess the coverage of confidence intervals based on the sandwich estimator and compare estimators by using different sampling designs at level 1. Finally, we close the paper with a discussion in Section 8.

### 5. Sandwich estimator of the standard errors

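From standard likelihood theory (e.g. Pawitan (2001), pages 372–374 and 407), the asymptotic covariance matrix of the maximum likelihood estimator is

$$\operatorname{cov}(\hat{\boldsymbol{\vartheta}}) = \mathcal{I}^{-1} \mathcal{J} \mathcal{I}^{-1}. \tag{10}$$

Here ℐ is the expected Fisher information and

$$\mathcal{J} \equiv E\left[ \left. \frac{\partial \log L(\boldsymbol{\vartheta})}{\partial \boldsymbol{\vartheta}} \right|_{\boldsymbol{\vartheta}_0} \left. \frac{\partial \log L(\boldsymbol{\vartheta})}{\partial \boldsymbol{\vartheta}'} \right|_{\boldsymbol{\vartheta}_0} \right],$$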

where *ϑ*_{0} is the true parameter vector and the expectations are over the (true) distribution of the responses given the covariates. For model-based standard errors, the sandwich form of the covariance matrix in equation (10) collapses to ℐ^{−1} because $\mathcal{J}$=ℐ if the likelihood represents the true distribution of the responses (given the covariates). The expected Fisher information ℐ is typically estimated by the observed Fisher information *I* at the maximum likelihood estimates. Since the pseudolikelihood does not represent the distribution of the responses, the sandwich does not collapse. Instead, we estimate the covariance matrix by *I*^{−1}*J**I*^{−1}, where *I* is the observed (pseudo-) Fisher information at the pseudo-maximum-likelihood estimates and the estimator *J* of $\mathcal{J}$ is obtained by exploiting the fact that the log-pseudolikelihood is a sum of independent contributions log *L*_{t} from the top level units *t*, so that

$$\mathcal{J} = \sum_{t} E\left[ \left. \frac{\partial \log L_t}{\partial \boldsymbol{\vartheta}} \right|_{\boldsymbol{\vartheta}_0} \left. \frac{\partial \log L_t}{\partial \boldsymbol{\vartheta}'} \right|_{\boldsymbol{\vartheta}_0} \right].$$

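We then estimate $\mathcal{J}$ by

$$J = \frac{n}{n-1} \sum_{t=1}^{n} \mathbf{s}_t \mathbf{s}_t',$$

where **s**_{t} is the weighted score vector of the top level unit *t* and *n* is the number of top level units in the sample.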

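We now consider a more complex design where the top level units of the multilevel model are clustered in even higher level clusters. We need to consider only the highest level clusters or PSUs which may have been sampled using stratified sampling. To accommodate this situation we shall use **s**_{hgt} for the weighted score vector of the top level unit *t* in stratum *h*, *h*=1,…,*H*, and cluster *g*, *g*=1,…,*G*_{h}, where *t*, *t*=1,…,*N*_{hg}, is now an index within stratum *h* and cluster *g*. The gradient of the log-pseudolikelihood can then be expressed as

$$\frac{\partial \log L}{\partial \boldsymbol{\vartheta}} = \sum_{h=1}^{H} \sum_{g=1}^{G_h} \sum_{t=1}^{N_{hg}} \mathbf{s}_{hgt}.$$

The corresponding covariance matrix, taking stratification and additional clustering into account, becomes

$$J = \sum_{h=1}^{H} \frac{G_h}{G_h - 1} \sum_{g=1}^{G_h} (\mathbf{s}_{hg} - \bar{\mathbf{s}}_h)(\mathbf{s}_{hg} - \bar{\mathbf{s}}_h)',$$

where

$$\mathbf{s}_{hg} = \sum_{t=1}^{N_{hg}} \mathbf{s}_{hgt}, \qquad \bar{\mathbf{s}}_h = \frac{1}{G_h} \sum_{g=1}^{G_h} \mathbf{s}_{hg}.$$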

Pseudolikelihood inference for complex surveys is discussed in Skinner (1989). The sandwich estimator that is described in this section has been implemented in gllamm.
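As an illustration only, the middle term of the sandwich for a stratified design with PSUs can be sketched in a few lines. The sketch assumes the usual Taylor-linearization form, in which weighted score vectors are summed within PSUs, centred at the stratum mean, and multiplied by a *G*_{h}/(*G*_{h}−1) correction. It is not the gllamm implementation, and the function name `stratified_meat` and its arguments are hypothetical.

```python
from collections import defaultdict


def stratified_meat(scores, strata, psus):
    """Return J = sum_h G_h/(G_h-1) * sum_g (s_hg - sbar_h)(s_hg - sbar_h)'.

    scores: one weighted score vector (list of floats) per top level unit.
    strata, psus: stratum and PSU label of each top level unit.
    """
    p = len(scores[0])
    # Sum the score vectors of the top level units within each PSU.
    totals = defaultdict(lambda: [0.0] * p)
    for s, h, g in zip(scores, strata, psus):
        for k in range(p):
            totals[(h, g)][k] += s[k]
    # Collect the PSU totals of each stratum.
    by_stratum = defaultdict(list)
    for (h, _g), total in totals.items():
        by_stratum[h].append(total)
    J = [[0.0] * p for _ in range(p)]
    for h, psu_totals in by_stratum.items():
        G = len(psu_totals)
        if G < 2:
            raise ValueError(f"stratum {h!r} needs at least two PSUs")
        mean = [sum(t[k] for t in psu_totals) / G for k in range(p)]
        factor = G / (G - 1)
        for t in psu_totals:
            d = [t[k] - mean[k] for k in range(p)]
            for a in range(p):
                for b in range(p):
                    J[a][b] += factor * d[a] * d[b]
    return J


# Three top level units in one stratum; the first two share a PSU.
print(stratified_meat([[0.5], [0.5], [-1.0]], ["a", "a", "a"], [1, 1, 2]))
```

The covariance estimate would then be obtained by pre- and post-multiplying this matrix by the inverse of the observed (pseudo-) Fisher information.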

As we shall see in our application, the procedures that were described above allow us to adopt a hybrid aggregated–disaggregated approach where lower levels of substantive interest are explicitly included in the multilevel model, whereas PSUs are considered a nuisance and are used only to adjust the standard errors.

### 6. Application

We analyse data from the 2000 Organisation for Economic Co-operation and Development PISA study on reading proficiency among 15-year-old American students.

In a three-stage cluster sampling design, geographic areas (PSUs) were sampled at stage 1, schools at stage 2 and students at stage 3. Stage 1 yielded 52 PSUs. In stage 2, public schools with more than 15% minority students were twice as likely to be sampled as other schools. Within high minority and other schools, the probability of selection was proportional to an estimate of size. Of the 220 schools that were sampled, only 128 were both eligible and willing to participate. These schools were supplemented with 32 replacement schools each having similar characteristics to those of a non-participating school. In stage 3, up to 35 students were sampled from 160 schools. In public schools with more than 15% minority students, minority students were oversampled (Lemke *et al*. (2001), appendix 1), but otherwise all students aged 15 years had an equal chance of being selected. Many of the students sampled did not participate owing to ineligibility, withdrawal, exclusion or failure to take assessments. 145 schools with more than 50% student participation were classified as ‘responding’ and eight schools with between 25% and 50% responding as ‘partially responding’. These 153 schools with a total of 3846 participating students are included in the PISA database.

Here *f*_{1j} compensates for non-participation by other schools that are similar to school *j* (in terms of variables including region, metropolitan or non-metropolitan status, percentage minority and percentage of students eligible for free lunch).

The provided student weights (called W_FSTUWT) are given by

We consider the response variable [Proficiency], an indicator taking the value 1 for the two highest reading proficiency levels as defined in Organisation for Economic Co-operation and Development (2000). Specifically, the threshold 552.89 was applied to the weighted maximum likelihood estimates (Warm, 1989) of reading ability. Ability scoring was based on a partial credit model, estimated by maximum marginal likelihood on a subset of the international data; see Adams and Wu (2002) for details.

As student level explanatory variables we use gender and most of the family background variables that were considered in Lemke *et al*. (2001):

- (a) [Female]: the student is female (dummy);
- (b) [ISEI]: international socio-economic index (see Ganzeboom *et al*. (1992));
- (c) [Highschool]: highest education level by either parent is high school (dummy);
- (d) [College]: highest education level by either parent is college (dummy);
- (e) [English]: the test language (English) is spoken at home (dummy);
- (f) [Oneforeign]: one parent is foreign born (dummy);
- (g) [Bothforeign]: both parents are foreign born (dummy).

We also consider *contextual* or *compositional* effects of socio-economic status, i.e. the difference between the between-school and within-school effects. This has attracted considerable interest in education (e.g. Willms (1986) and Raudenbush and Bryk (2002)). For instance, Willms (1986) argued that the benefits of comprehensive schooling depend to a large extent on whether the socio-economic mix of a school has an effect on students’ outcomes above the effect of individual student characteristics. In addition to the student level socio-economic index [ISEI] we therefore also consider its school mean [MnISEI] as a school level covariate.

We use the two-level random-intercept logistic regression model (1) for student *i* in school *j*, where [Proficiency] is regressed on the covariates that are mentioned above. The PISA database does not include any identifier for the PSUs but this information was kindly provided by the National Center for Education Statistics. We do not include PSUs as a level in the model because the variance between PSUs (with undisclosed definition) does not appear to be of substantive interest and estimation would require knowledge of the PSU selection probabilities. PSUs were instead accounted for in the sandwich estimator of the standard errors. Because of missing data on some of the covariates (mostly for [Highschool], [College] and [ISEI]), estimation was based on 2069 students from 148 schools in 46 PSUs. The rescaling of level 1 weights was based on the estimation sample.

For the 148 schools contributing to the analysis, the provided school level weights had mean 262, standard deviation 539, lower decile 32 and upper decile 499. For the 2069 students, the provided student level weights had mean 843, standard deviation 410, lower decile 347 and upper decile 1353. Because the student level weights varied little within schools (a very large intraclass correlation of 0.98), the rescaled student level weights using methods 1 or 2 were close to 1 and almost identical, both having standard deviations of 0.05 and lower and upper deciles of 0.94 and 1.07 respectively.
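To make the rescaling concrete, here is a minimal sketch of the two scaling methods for the level 1 weights within one cluster. It assumes the definitions of Pfeffermann *et al.* (1998): method 1 makes the scaled weights sum to the effective cluster sample size, and method 2 makes them sum to the actual cluster sample size. The function name `scale_level1_weights` is hypothetical.

```python
def scale_level1_weights(weights, method):
    """Rescale the level 1 weights of a single cluster.

    method 1: scaled weights sum to (sum w)^2 / sum w^2 (effective size).
    method 2: scaled weights sum to the number of sampled units.
    """
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    if method == 1:
        scale = total / total_sq
    elif method == 2:
        scale = len(weights) / total
    else:
        raise ValueError("method must be 1 or 2")
    return [scale * w for w in weights]


# Weights that are constant within a cluster are rescaled to 1 by both methods.
print(scale_level1_weights([800.0, 800.0, 800.0], method=1))
print(scale_level1_weights([800.0, 800.0, 800.0], method=2))
```

When the weights hardly vary within clusters, as in the application, both scaling constants are close to the reciprocal of the cluster mean weight, which is why the two sets of rescaled weights nearly coincide.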

Estimates using (unweighted) maximum likelihood and pseudo-maximum-likelihood with scaling method 1 are shown in Table 1. We used 12-point and 20-point adaptive quadrature, giving the same results to the precision that is reported. Model-based standard errors SE are given (for maximum likelihood only), together with robust standard errors from the sandwich estimator not taking PSUs into account (SE_{R}) and taking PSUs into account (SE_{R}^{PSU}). Estimates using scaling method 2 (which are not shown) were almost identical to those using method 1 because the level 1 weights are close to 1.

Table 1. Maximum likelihood estimates (with model-based and robust standard errors) and pseudo-maximum-likelihood estimates by using scaling method 1 (with robust standard errors taking and not taking PSUs into account)

| *Parameter* | *Unweighted maximum likelihood: estimate* | *SE* | *SE*_{R} | *SE*_{R}^{PSU} | *Weighted pseudo-maximum-likelihood: estimate* | *SE*_{R} | *SE*_{R}^{PSU} |
|---|---|---|---|---|---|---|---|
| *β*_{0}, [Constant] | −6.034 | 0.539 | 0.547 | 0.458 | −5.878 | 0.955 | 0.738 |
| *β*_{1}, [Female] | 0.555 | 0.103 | 0.102 | 0.111 | 0.622 | 0.154 | 0.161 |
| *β*_{3}, [ISEI] | 0.014 | 0.003 | 0.003 | 0.003 | 0.018 | 0.005 | 0.004 |
| *β*_{4}, [MnISEI] | 0.069 | 0.009 | 0.009 | 0.009 | 0.068 | 0.016 | 0.018 |
| *β*_{5}, [Highschool] | 0.400 | 0.256 | 0.262 | 0.224 | 0.103 | 0.477 | 0.429 |
| *β*_{6}, [College] | 0.721 | 0.255 | 0.257 | 0.235 | 0.453 | 0.505 | 0.543 |
| *β*_{7}, [English] | 0.695 | 0.285 | 0.269 | 0.301 | 0.625 | 0.382 | 0.391 |
| *β*_{8}, [Oneforeign] | −0.020 | 0.224 | 0.200 | 0.159 | −0.109 | 0.274 | 0.225 |
| *β*_{9}, [Bothforeign] | 0.099 | 0.236 | 0.245 | 0.295 | −0.280 | 0.326 | 0.292 |
| *ψ* | 0.271 | 0.086 | 0.082 | 0.088 | 0.296 | 0.124 | 0.115 |

The pseudo-maximum-likelihood estimates are in accordance with educational theory and previous research. For instance, controlling for other covariates, reading proficiency is better for females than for males, better for students with parents having higher levels of education and better for students having English as their home language. As expected, socio-economic status [ISEI] has a positive effect; for students from the same school, a one within-school standard deviation change in [ISEI] of 15.5 is associated with an increase in the log-odds of 0.22, controlling for the other covariates. For students from different schools, a 1 standard deviation change in school mean socio-economic status [MnISEI] of 8.9 is associated with an increase in the log-odds of 0.74 after controlling for student level [ISEI] and the other covariates. There is thus evidence of a contextual effect of socio-economic status. The intraclass correlation of the latent responses in equation (2), given the covariates, is estimated as about 0.08 (0.14 when the school level covariate [MnISEI] is excluded).
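The intraclass correlation quoted above can be checked with a short calculation. The sketch assumes the standard latent-response formula for a random-intercept logit model, in which the level 1 residual variance is π²/3; the function name `latent_icc` is hypothetical.

```python
import math


def latent_icc(psi: float) -> float:
    """Intraclass correlation of the latent responses for a logit link."""
    return psi / (psi + math.pi ** 2 / 3.0)


# Pseudo-maximum-likelihood estimate of the random-intercept variance in Table 1.
print(round(latent_icc(0.296), 2))  # 0.08
```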

Many of the pseudo-maximum-likelihood estimates are very different from the corresponding unweighted maximum likelihood estimates. This illustrates the importance of weighting and suggests that sampling probabilities are informative in the present application. However, the loss in efficiency due to weighting is also apparent with substantially larger standard errors for pseudo-maximum-likelihood estimators. Note that taking PSUs into account does not necessarily increase the standard errors.

### 7. Simulation

A Monte Carlo experiment was carried out to assess the performance of pseudo-maximum-likelihood estimation and the sandwich estimator.

First, dichotomous responses were simulated for a finite population from the two-level logistic random-intercept model (1) with linear predictor

$$\nu_{ij} = \beta_0 + \beta_1 x_{1j} + \beta_2 x_{2ij} + \zeta_j$$

and *β*_{0}=*β*_{1}=*β*_{2}=1 and *ψ*=1. Since performance of estimators may differ for coefficients of between-cluster and within-cluster covariates, we simulated both types of covariate. For both types of covariate we drew independent samples from a Bernoulli distribution with probability 0.5. For the between-cluster covariate *x*_{1j}, a single value was sampled for the entire cluster and, for the unit-specific covariate *x*_{2ij}, different values were first sampled for each unit and then the cluster mean was subtracted (so that *x*_{2ij} varied purely within clusters). The finite population had 500 level 2 units, each with the same number of level 1 units.

Second, we sampled from the finite population by using the following two-stage sampling design. A subset of the level 2 units were sampled by using stratified random sampling without replacement with approximate (due to rounding) sampling fractions

$$\pi_j \approx \begin{cases} 0.25 & \text{if } |\zeta_j| > 1, \\ 0.75 & \text{if } |\zeta_j| \le 1. \end{cases}$$

The average overall sampling fraction is about 0.6, yielding about 300 level 2 units.

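From each level 2 unit, level 1 units were sampled (again by using stratified random sampling without replacement) with approximate sampling fractions

$$\pi_{i|j} \approx \begin{cases} 0.25 & \text{if } \varepsilon_{ij} > 0, \\ 0.75 & \text{if } \varepsilon_{ij} < 0, \end{cases}$$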

where *ɛ*_{ij} are the residuals in the latent response formulation (2). Sampling at level 1 was similar to the proportional allocation method that was used by Pfeffermann *et al.* (1998) and Grilli and Pratesi (2004) except that our sampling fraction is higher, nearly half the units being sampled from each cluster.

By making the sampling probabilities at stages 1 and 2 dependent on the corresponding residuals *ζ*_{j} and *ɛ*_{ij}, we are ensuring that sampling is informative at both levels. In practice, sampling probabilities would depend on observed design variables and our simulation corresponds to the situation where these design variables are strongly associated with the residuals. In a school survey, sampling at stage 1 could be stratified by school size or type (e.g. private and public) where the oversampled schools are more homogeneous with school level residuals closer to 0 (with a Pearson correlation between stratification variable and absolute value of school level residual of 0.82). At stage 2, strata could be based on some student characteristic (e.g. minority status) which correlates with the student level residual (here a correlation of 0.76).

The weights *w*_{j} and *w*_{i|j} that were used in pseudo-maximum-likelihood estimation were based on the proportions of units that were actually sampled. (Because of the small strata that are involved when sampling level 1 units from level 2 units, the proportion that was sampled could differ considerably from 0.25 and 0.75.)
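The design described above can be sketched as follows. This is a rough illustration under stated assumptions, not the code used for the experiment: level 2 strata are assumed to be defined by whether the absolute random intercept exceeds 1, level 1 strata by the sign of the logistic residual, and the undersampled and oversampled strata are assumed to have sampling fractions 0.25 and 0.75. The names `stratified_sample` and `simulate_and_sample` and the default cluster size of 10 are hypothetical.

```python
import math
import random


def stratified_sample(items, undersampled, rng):
    """Sample about 25% of the undersampled stratum and 75% of the other one.

    Returns (item, weight) pairs; each weight is the inverse of the
    proportion of the stratum that was actually sampled.
    """
    sampled = []
    for flag, fraction in ((True, 0.25), (False, 0.75)):
        stratum = [it for it in items if undersampled(it) == flag]
        n = round(fraction * len(stratum))
        if n == 0:
            continue
        weight = len(stratum) / n
        sampled.extend((it, weight) for it in rng.sample(stratum, n))
    return sampled


def simulate_and_sample(n_clusters=500, cluster_size=10, seed=1):
    rng = random.Random(seed)
    beta0 = beta1 = beta2 = 1.0
    population = []
    for j in range(n_clusters):
        zeta = rng.gauss(0.0, 1.0)          # random intercept, variance 1
        x1 = float(rng.random() < 0.5)      # between-cluster covariate
        raw = [float(rng.random() < 0.5) for _ in range(cluster_size)]
        mean_raw = sum(raw) / cluster_size
        units = []
        for value in raw:
            x2 = value - mean_raw           # varies purely within clusters
            u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
            eps = math.log(u / (1.0 - u))   # standard logistic residual
            y = int(beta0 + beta1 * x1 + beta2 * x2 + zeta + eps > 0.0)
            units.append({"cluster": j, "y": y, "x1": x1, "x2": x2, "eps": eps})
        population.append({"zeta": zeta, "units": units})

    sample = []
    stage1 = stratified_sample(population, lambda c: abs(c["zeta"]) > 1.0, rng)
    for cluster, w2 in stage1:
        stage2 = stratified_sample(cluster["units"], lambda u: u["eps"] > 0.0, rng)
        for unit, w1 in stage2:
            sample.append({**unit, "w2": w2, "w1": w1})
    return sample


sample = simulate_and_sample()
print(len({row["cluster"] for row in sample}), "clusters,", len(sample), "units")
```

Each record carries a level 2 weight `w2` and a conditional level 1 weight `w1`, which is the information that a pseudo-maximum-likelihood routine would need.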

We varied the cluster sizes of the finite population and simulated 100 data sets for each condition. Although a stratified sampling design at level 1 may be unusual with cluster sizes as small as 5 or 10, these situations might correspond to longitudinal data (occasions nested in subjects) where missingness depends on a time-varying covariate that is correlated with the level 1 residual. Five estimation methods were used for each simulated data set:

- (a) unweighted maximum likelihood,
- (b) pseudo-maximum-likelihood using raw unscaled weights,
- (c) pseudo-maximum-likelihood using scaling method 1,
- (d) pseudo-maximum-likelihood using scaling method 2 and
- (e) pseudo-maximum-likelihood using method D.

Estimation was performed by using gllamm with 12-point adaptive quadrature.

Table 2. Cluster size : mean estimates and standard deviations

| *Parameter* | *True value* | *Unweighted maximum likelihood estimate* | *Weighted pseudo-maximum-likelihood: raw* | *Weighted pseudo-maximum-likelihood: method 1* | *Weighted pseudo-maximum-likelihood: method 2* | *Weighted pseudo-maximum-likelihood: method D* |
|---|---|---|---|---|---|---|
| *Model parameters: conditional effects* | | | | | | |
| *β*_{0} | 1 | 0.40 | 1.03 | 0.68 | 0.75 | 0.42 |
| | | (0.11) | (0.19) | (0.16) | (0.15) | (0.15) |
| *β*_{1} | 1 | 1.08 | 1.19 | 0.96 | 0.98 | 1.05 |
| | | (0.18) | (0.32) | (0.26) | (0.26) | (0.25) |
| *β*_{2} | 1 | 1.06 | 1.22 | 0.94 | 0.96 | 1.02 |
| | | (0.22) | (0.35) | (0.25) | (0.26) | (0.26) |
| √*ψ* | 1 | 0.39 | 1.47 | 0.58 | 0.70 | 0.62 |
| | | (0.37) | (0.21) | (0.31) | (0.30) | (0.51) |
| *Rescaled regression coefficients: approximate marginal effects* | | | | | | |
| *β*_{0} (rescaled) | 0.88 | 0.39 | 0.80 | 0.64 | 0.70 | 0.38 |
| *β*_{1} (rescaled) | 0.88 | 1.04 | 0.92 | 0.91 | 0.90 | 0.96 |
| *β*_{2} (rescaled) | 0.88 | 1.02 | 0.94 | 0.89 | 0.89 | 0.93 |

Table 3. Cluster size : mean estimates and standard deviations

| *Parameter* | *True value* | *Unweighted maximum likelihood estimate* | *Weighted pseudo-maximum-likelihood: raw* | *Weighted pseudo-maximum-likelihood: method 1* | *Weighted pseudo-maximum-likelihood: method 2* | *Weighted pseudo-maximum-likelihood: method D* |
|---|---|---|---|---|---|---|
| *Model parameters: conditional effects* | | | | | | |
| *β*_{0} | 1 | 0.37 | 1.04 | 0.83 | 0.88 | 0.37 |
| | | (0.11) | (0.16) | (0.14) | (0.14) | (0.17) |
| *β*_{1} | 1 | 1.13 | 1.06 | 0.91 | 0.94 | 1.13 |
| | | (0.14) | (0.23) | (0.20) | (0.20) | (0.22) |
| *β*_{2} | 1 | 1.14 | 1.11 | 0.91 | 0.97 | 1.13 |
| | | (0.14) | (0.20) | (0.16) | (0.17) | (0.16) |
| √*ψ* | 1 | 0.77 | 1.19 | 0.40 | 0.73 | 1.04 |
| | | (0.10) | (0.13) | (0.34) | (0.16) | (0.12) |
| *Rescaled regression coefficients: approximate marginal effects* | | | | | | |
| *β*_{0} (rescaled) | 0.88 | 0.34 | 0.87 | 0.79 | 0.82 | 0.32 |
| *β*_{1} (rescaled) | 0.88 | 1.04 | 0.89 | 0.87 | 0.87 | 0.98 |
| *β*_{2} (rescaled) | 0.88 | 1.05 | 0.93 | 0.88 | 0.90 | 0.98 |

Table 4. Cluster size : mean estimates and standard deviations

| *Parameter* | *True value* | *Unweighted maximum likelihood estimate* | *Weighted pseudo-maximum-likelihood: raw* | *Weighted pseudo-maximum-likelihood: method 1* | *Weighted pseudo-maximum-likelihood: method 2* | *Weighted pseudo-maximum-likelihood: method D* |
|---|---|---|---|---|---|---|
| *Model parameters: conditional effects* | | | | | | |
| *β*_{0} | 1 | 0.36 | 1.02 | 0.91 | 0.94 | 0.36 |
| | | (0.09) | (0.16) | (0.14) | (0.15) | (0.17) |
| *β*_{1} | 1 | 1.16 | 1.05 | 0.94 | 0.97 | 1.16 |
| | | (0.14) | (0.22) | (0.20) | (0.21) | (0.23) |
| *β*_{2} | 1 | 1.16 | 1.05 | 0.95 | 0.99 | 1.15 |
| | | (0.10) | (0.14) | (0.12) | (0.13) | (0.12) |
| √*ψ* | 1 | 0.82 | 1.09 | 0.70 | 0.83 | 1.10 |
| | | (0.06) | (0.09) | (0.13) | (0.16) | (0.08) |
| *Rescaled regression coefficients: approximate marginal effects* | | | | | | |
| *β*_{0} (rescaled) | 0.88 | 0.32 | 0.87 | 0.84 | 0.85 | 0.31 |
| *β*_{1} (rescaled) | 0.88 | 1.06 | 0.89 | 0.87 | 0.88 | 0.99 |
| *β*_{2} (rescaled) | 0.88 | 1.05 | 0.90 | 0.89 | 0.89 | 0.98 |

Table 5. Cluster size : mean estimates and standard deviations

| *Parameter* | *True value* | *Unweighted maximum likelihood estimate* | *Weighted pseudo-maximum-likelihood: raw* | *Weighted pseudo-maximum-likelihood: method 1* | *Weighted pseudo-maximum-likelihood: method 2* | *Weighted pseudo-maximum-likelihood: method D* |
|---|---|---|---|---|---|---|
| *Model parameters: conditional effects* | | | | | | |
| *β*_{0} | 1 | 0.35 | 1.01 | 0.96 | 0.98 | 0.35 |
| | | (0.08) | (0.13) | (0.12) | (0.12) | (0.14) |
| *β*_{1} | 1 | 1.18 | 1.03 | 0.98 | 1.00 | 1.18 |
| | | (0.11) | (0.17) | (0.17) | (0.17) | (0.19) |
| *β*_{2} | 1 | 1.18 | 1.02 | 0.98 | 0.99 | 1.17 |
| | | (0.06) | (0.08) | (0.07) | (0.07) | (0.07) |
| √*ψ* | 1 | 0.87 | 1.05 | 0.87 | 0.94 | 1.14 |
| | | (0.04) | (0.08) | (0.08) | (0.07) | (0.08) |
| *Rescaled regression coefficients: approximate marginal effects* | | | | | | |
| *β*_{0} (rescaled) | 0.88 | 0.31 | 0.87 | 0.87 | 0.87 | 0.29 |
| *β*_{1} (rescaled) | 0.88 | 1.07 | 0.90 | 0.88 | 0.89 | 1.00 |
| *β*_{2} (rescaled) | 0.88 | 1.06 | 0.88 | 0.88 | 0.88 | 0.99 |

When no weights are used, the deliberate undersampling of level 2 units with large absolute values of *ζ*_{j} leads to a downward bias for the random-intercept standard deviation √*ψ*. The deliberate undersampling of level 1 units with positive values of *ɛ*_{ij} leads to a downward bias for the fixed intercept *β*_{0}. This downward bias for *β*_{0} is also observed for the weighted estimates by using method D because this method uses the cluster averages of overall inclusion weights *w*_{ij} as level 2 weights whereas the level 1 sampling probabilities *π*_{i|j} vary mostly within clusters.

Scaling methods 1 and 2 both appear to overcorrect the positive bias for √*ψ*. This may be due to the within-cluster stratification based on the sign of *ɛ*_{ij} as discussed in Section 4.3.1. Scaling method 2 seems to perform better than method 1, giving results that are intermediate between those for raw weights and scaling method 1 as would be expected since the scaling constants tend to be closer to 1 than for method 1. The three methods employing both level 1 and level 2 weights (raw, method 1 and method 2) produce biased estimates for the regression coefficients whenever they are biased for the random-intercept standard deviation. Interestingly, these biases roughly cancel out in the expression (3) for the marginal effects.

Our simulation results appear to be consistent with the results in Grilli and Pratesi (2004) for informative sampling at both levels with small cluster sizes. For the level 2 variance, their unscaled fully weighted (our ‘raw’) estimators are upward biased whereas scaling method 2 overcorrects this bias. However, for the intercept and regression coefficients, the weighted estimators are less severely biased in Grilli and Pratesi (2004). As mentioned in Section 1, this could be due to the small true values for these parameters. In Grilli and Pratesi (2004), the sampling variability is considerably lower by using scaled weights than by using raw weights, whereas this difference is less pronounced in our simulations. The reason could be the lower sampling fractions that were used in their simulations.

To study the performance of the sandwich estimator, we simulated the model 1000 times for cluster size . We used raw level 1 weights, which produced only small biases for this cluster size. Table 6 shows the mean estimates and their standard deviations, as well as the mean standard errors and coverage for approximate 95% confidence intervals based on the normal distribution. The mean standard errors are almost identical to the standard deviations of the estimates. The coverage is close to the nominal level, even for the random-intercept standard deviation where the normality approximation may be dubious.

Table 6. Coverage of 95% confidence intervals for cluster size by using raw weights (1000 replications)

| *Parameter* | *True value* | *Mean estimate* | *Standard deviation of estimate* | *Mean SE* | *95% confidence interval coverage (%)* |
|---|---|---|---|---|---|
| *β*_{0} | 1 | 1.01 | 0.13 | 0.13 | 94.1 |
| *β*_{1} | 1 | 1.02 | 0.18 | 0.18 | 94.7 |
| *β*_{2} | 1 | 1.03 | 0.08 | 0.08 | 94.1 |
| √*ψ* | 1 | 1.07 | 0.07 | 0.08 | 92.4 |
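The coverage figures in Table 6 are of the following kind: for each replication, a normal-theory interval of the estimate plus or minus 1.96 standard errors is formed, and the percentage of intervals containing the true value is recorded. A minimal sketch, with the hypothetical function name `ci_coverage` and made-up inputs rather than the simulation output:

```python
def ci_coverage(estimates, standard_errors, true_value, z=1.96):
    """Percentage of intervals estimate +/- z * SE that contain true_value."""
    hits = sum(
        1
        for est, se in zip(estimates, standard_errors)
        if abs(est - true_value) <= z * se
    )
    return 100.0 * hits / len(estimates)


# Three illustrative replications; the second interval misses the true value 1.
print(ci_coverage([1.0, 1.3, 0.9], [0.1, 0.1, 0.1], true_value=1.0))
```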

To investigate whether the relatively small bias for √*ψ* using raw weights (and downward bias using scaled weights) is due to the within-cluster stratification as discussed in Section 4.3.1, we conducted further simulations for cluster size , using three stratification methods:

- (a) stratification based on the sign of *ɛ*_{ij} as used in all simulations so far,
- (b) the same design but with stratification determined by the sign of a standard normal random variable that was correlated 0.5 with *ɛ*_{ij} and
- (c) stratification determined by the sign of a standard normal random variable that was uncorrelated with *ɛ*_{ij} and thus independent of the response.

Table 7 shows that the reasonable performance of the raw method that was seen earlier for case (a) deteriorates as the stratification becomes less related to the response. Scaling method 1 works very well for (c) where stratification is independent of the response. Qualitatively the same results (worse performance of the raw method and better performance of scaling method 1 as stratification becomes less related to the response) are obtained when the sampling fractions in both strata are 0.5. This suggests that the results are not due to varying the ‘informativeness’ of the weights (the correlation between the weights and the responses).

Table 7. Effect of stratification method: mean estimates and standard deviations (in parentheses) for cluster size

| *Parameter* | *True value* | *Raw weights, (a)* | *Raw weights, (b)* | *Raw weights, (c)* | *Method 1, (a)* | *Method 1, (b)* | *Method 1, (c)* |
|---|---|---|---|---|---|---|---|
| *Model parameters: conditional effects* | | | | | | | |
| *β*_{0} | 1 | 1.04 | 1.10 | 1.29 | 0.83 | 0.88 | 1.01 |
| | | (0.16) | (0.16) | (0.21) | (0.14) | (0.13) | (0.16) |
| *β*_{1} | 1 | 1.06 | 1.11 | 1.26 | 0.91 | 0.92 | 0.99 |
| | | (0.23) | (0.26) | (0.30) | (0.20) | (0.23) | (0.25) |
| *β*_{2} | 1 | 1.11 | 1.12 | 1.17 | 0.91 | 0.91 | 0.96 |
| | | (0.20) | (0.21) | (0.25) | (0.16) | (0.17) | (0.19) |
| √*ψ* | 1 | 1.19 | 1.33 | 1.77 | 0.40 | 0.61 | 0.98 |
| | | (0.13) | (0.15) | (0.15) | (0.34) | (0.24) | (0.16) |
| *Rescaled regression coefficients: approximate marginal effects* | | | | | | | |
| | 0.88 | 0.87 | 0.88 | 0.92 | 0.79 | 0.83 | 0.89 |
| | 0.88 | 0.89 | 0.89 | 0.90 | 0.87 | 0.86 | 0.86 |
| | 0.88 | 0.93 | 0.90 | 0.83 | 0.88 | 0.86 | 0.84 |
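The rescaling behind the approximate marginal effects is defined earlier in the paper and is not repeated in this section. As a hedged illustration, the sketch below assumes the common approximation in which a conditional coefficient is divided by the square root of 1 + *ψ*/(π²/3), where π²/3 is the variance of a standard logistic level 1 error; the function name `approx_marginal` is hypothetical. Under this assumption, the true values *β* = 1 and *ψ* = 1 reproduce the tabulated true marginal effect of 0.88.

```python
import math

LOGISTIC_VARIANCE = math.pi ** 2 / 3  # variance of a standard logistic error


def approx_marginal(beta, psi):
    """Approximate marginal effect from a conditional coefficient and intercept variance."""
    return beta / math.sqrt(1.0 + psi / LOGISTIC_VARIANCE)


# True marginal effect implied by beta = 1 and psi = 1.
true_marginal = approx_marginal(1.0, 1.0)

# Same rescaling applied to the mean estimates for raw weights, method (c):
# beta_0 = 1.29 and sqrt(psi) = 1.77 in Table 7.
raw_c = approx_marginal(1.29, 1.77 ** 2)

print(round(true_marginal, 2))  # 0.88
print(round(raw_c, 2))          # 0.92
```

Rescaling the mean estimates is not identical to averaging the rescaled estimates over replications, so agreement with the tabulated marginal effects is only approximate in general. The calculation nevertheless shows why the marginal effects are more stable than the conditional ones: upward bias in *β* and in √*ψ* partly cancels in the ratio.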

### 8. Discussion

We have described a pseudolikelihood approach for generalized linear mixed modelling of data from complex sampling designs. Unlike previous contributions (e.g. Pfeffermann *et al.* (1998), Skinner and Holmes (2003) and Grilli and Pratesi (2004)), our approach can handle multilevel models with any number of levels. It also allows for stratification and for PSUs that are not represented by a random effect in the model.

The pseudolikelihood methodology was applied to three-stage complex survey data on reading proficiency from the American PISA 2000 study, using the Stata program gllamm (e.g. Rabe-Hesketh *et al*. (2002, 2004a) and Rabe-Hesketh and Skrondal (2005)). The performance of pseudo-maximum-likelihood for two-level logistic regression using different methods for handling level 1 weights was investigated in a Monte Carlo experiment. This revealed that not only the estimated random-intercept variance but also the (conditional) regression coefficients were biased for small cluster sizes. Thus, considerable caution should be exercised in this case and sensitivity analyses should be conducted by comparing estimates from different scaling methods. It may also be useful to simulate a finite population from the estimated model, select a sample by mimicking the actual sampling design and investigate how well different methods recover the model parameters. The estimated marginal regression coefficients performed well even for small cluster sizes, suggesting that interpretation may best be confined to marginal effects in this case. We also conducted a small Monte Carlo experiment to investigate the performance of the sandwich estimator and found that the coverage was good.
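A sensitivity analysis of the kind recommended above starts by forming differently scaled versions of the level 1 weights within each cluster. The sketch below assumes that scaling methods 1 and 2 follow the definitions of Pfeffermann *et al.* (1998), under which the scaled weights in a cluster sum to the effective cluster size and to the actual cluster sample size respectively; the definitions used in this paper are given in an earlier section and should be checked against it. The function names and the example weights are hypothetical.

```python
def scale_method_1(weights):
    """Scale so that the weights sum to the effective size (sum w)^2 / sum w^2."""
    factor = sum(weights) / sum(w * w for w in weights)
    return [factor * w for w in weights]


def scale_method_2(weights):
    """Scale so that the weights sum to the number of sampled units in the cluster."""
    factor = len(weights) / sum(weights)
    return [factor * w for w in weights]


# Hypothetical raw level 1 weights for one cluster with five sampled units.
raw = [4.0, 4.0, 4.0 / 3.0, 4.0 / 3.0, 4.0 / 3.0]

method_1 = scale_method_1(raw)
method_2 = scale_method_2(raw)
```

The model would then be fitted once with each set of weights (raw, method 1 and method 2) and the estimates compared. When the raw weights within a cluster are all equal, both scalings return unit weights, so the choice of method matters only when the level 1 weights vary within clusters.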

The contribution of this paper is not confined to multilevel models but also applies to factor, item response, structural equation and latent class models. For these models we view the variables, indicators or items measuring the latent variables as level 1 units and the subjects as level 2 units, since the latent variables vary at the subject level (e.g. Skrondal and Rabe-Hesketh (2004)). Most previous pseudolikelihood approaches for latent variable models have been confined to weighting at level 2 (subjects) where the problem of appropriately scaling the weights does not arise. An exception is Skinner and Holmes (2003) who discussed structural equation models for longitudinal data by using weights both at level 2 (the subject) and level 1 (the occasion). Asparouhov (2005) considered pseudo-maximum-likelihood estimation for structural equation modelling with level 2 weights. Muthén and Satorra (1995), Stapleton (2002) and Skinner and Holmes (2003) considered a related approach where weighted mean and covariance matrices are used in the fitting functions of the weighted least squares estimator implemented in standard software for structural equation modelling. Wedel *et al*. (1998), Patterson *et al*. (2002) and Vermunt (2002) discussed pseudo-maximum-likelihood estimation of latent class models by using complex survey data with weighting at level 2 (subjects).

A major practical obstacle in using pseudo-maximum-likelihood estimation for multilevel modelling of complex survey data is that the necessary information is often not provided in publicly available data sets. For instance, many surveys include only a single overall weighting variable for the level 1 units, whereas the pseudolikelihood approach requires the weights corresponding to the levels of the hierarchical sampling design. Approaches to retrieving this information from the overall weights have been suggested by Kovačević and Rai (2003), pages 116–117, and Goldstein (2003), page 79, but little appears to be known regarding the performance of their approximations.

Non-response at any level can easily be addressed by adjusting the weights. However, post-stratification weights are typically constructed by considering sampling proportions for level 1 units by subpopulations such as males and females. These weights are not *conditional* weights dependent on the selected cluster *j* as required for the pseudolikelihood for multilevel models. An exception may be cross-national surveys where the level 2 clusters are nations and post-stratification weights are constructed by nation. Another example where post-stratification weights can be used is panel surveys since the subject-specific weights are then at level 2.

Level 1 weights are not only used in standard multilevel models. In panel surveys, waves can be regarded as level 1 units and subjects as level 2 units. In this case *π*_{j} are the usual sample selection probabilities for the first panel wave whereas the level 1 weights are determined by non-response and attrition (Skinner and Holmes, 2003). Level 1 weighting to account for drop-out has also been used in generalized estimating equations (e.g. Robins *et al*. (1995)).

The pseudolikelihood methodology that is discussed here and implemented in gllamm accommodates the wide range of models subsumed in the generalized linear latent and mixed model framework of Rabe-Hesketh *et al*. (2004b) and Skrondal and Rabe-Hesketh (2004). In addition to conventional multilevel and latent variable models, this framework includes extensions such as multilevel structural equation models and models with nonparametric random effects or latent variable distributions (Rabe-Hesketh *et al*., 2003). Responses can be continuous, dichotomous, ordinal or nominal variables (Skrondal and Rabe-Hesketh, 2003), as well as counts and durations.