Methods for Observed-Cluster Inference When Cluster Size Is Informative: A Review and Clarifications

Clustered data commonly arise in epidemiology. We assume each cluster member has an outcome Y and covariates X. When there are missing data in Y, the distribution of Y given X in all cluster members (“complete clusters”) may be different from the distribution just in members with observed Y (“observed clusters”). Often the former is of interest, but when data are missing because in a fundamental sense Y does not exist (e.g., quality of life for a person who has died), the latter may be more meaningful (quality of life conditional on being alive). Weighted and doubly weighted generalized estimating equations and shared random-effects models have been proposed for observed-cluster inference when cluster size is informative, that is, when the distribution of Y given X in observed clusters depends on observed cluster size. We show these methods can be seen as actually giving inference for complete clusters and may not also give observed-cluster inference. This is true even if observed clusters are complete in themselves rather than being the observed part of larger complete clusters: here methods may describe imaginary complete clusters rather than the observed clusters. We show under which conditions shared random-effects models proposed for observed-cluster inference do actually describe members with observed Y. A psoriatic arthritis dataset is used to illustrate the danger of misinterpreting estimates from shared random-effects models.


Introduction
Clustered data are common in epidemiology. Repeated measures are clustered in individuals; teeth in patients; pups in litters. Suppose interest is in the association between outcome Y and covariates X measured on members of the clusters. Often Y and X are missing for some members of sampled clusters. For simplicity, we assume that a member's X is observed whenever Y is observed. We call members with observed Y "observed members," those with missing Y "missing members," the original clusters "complete clusters," and the subclusters that remain after discarding missing members "observed clusters." Missing data may arise because although a variable could, in principle, be measured, circumstances meant it was not, for example, because an individual missed a visit. We call such missing data "potentially observable." When missing data are potentially observable, a model can be proposed for the distribution of Y given X in all cluster members, and methods used that, under specified assumptions about the missingness (e.g., missing at random, MAR), give consistent estimates for this model. We call this "complete-cluster inference." Alternatively, missing data may arise because in a fundamental sense a variable does not exist. We call such missing data "unobservable." Three examples of unobservable Y are measures of: (1) cognitive function of an individual after death; (2) degree of disablement of an individual who is not disabled; (3) health of a tooth that has been lost. Although missing Y could be set to zero when a patient is dead/not disabled/tooth is lost, in practice often a model is instead proposed for Y given X in observed members only (so conditional on alive/disabled/tooth not lost). We call this "observed-cluster inference." Sometimes observed-cluster inference may be of interest even when missing data are potentially observable.
When missing data are unobservable, "complete-cluster" inference is philosophically problematic: what does it mean to model cognitive function in dead people?
When the size M of complete clusters varies, it is usually assumed that Y is independent of M given X. In observed clusters, however, Y and N may be conditionally dependent given X, where N is size of observed cluster. For example, in a dental study, the fewer teeth a patient has, the worse their condition tends to be. This is called "informative cluster size" (ICS).
So far we have assumed observed clusters are generated from complete clusters by excluding missing members, but ICS can also arise where observed clusters are complete in themselves. For example, in toxicology, exposed dams who are more sensitive to a toxin may tend to have smaller litters and offspring with greater probability of deformation than less sensitive dams, so that Y (pup being deformed) and N (litter size) are dependent given X (exposure of dam).
We shall show that three of the methods proposed for observed-cluster inference under ICS, viz. weighted and doubly weighted generalized estimating equations (GEE) and shared random-effects models, can be seen as actually giving inference for complete clusters. When the Y-X associations in complete and observed clusters are the same, the distinction is unimportant. However, ICS causes them to differ in general. So, it is important to understand when methods proposed for observed-cluster inference really do describe observed clusters. In the literature on modeling repeated measures in cohorts with high death rates (Dufouil et al., 2004; Kurland et al., 2009) a distinction has been made between complete-cluster (termed "immortal-cohort") inference and observed-cluster ("mortal-cohort") inference. However, conditions under which the two inferences are equivalent have not been set out, and in the wider literature the distinction seems to be less well recognized.
In Section 2 we define notation and discuss methods for complete-cluster inference from observed data. Section 3 defines ICS and discusses how ICS relates to missing-data mechanisms. Section 4 relates two weighted GEE methods, one proposed for complete-cluster inference in the missing-data literature, and one for observed-cluster inference in the ICS literature. We also show that doubly weighted GEE, proposed for observed-cluster inference, actually give complete- rather than observed-cluster inference, and that, moreover, there is no single complete-cluster inference. Shared random-effects models give complete-cluster inference, but have also been used for observed-cluster inference. In Section 5 we discuss when this is valid, and in Section 6 we use a psoriatic arthritis dataset to illustrate that some parameters of such a model may be relevant to observed clusters but others not. In brief, we replicate an analysis of association between disability and covariates, with measurements clustered by patient. Our interest is in how sex affects degree of disability in the "observed clusters" of measurements where degree is greater than zero, that is, given disability. The analysis uses models for probability of disability and for degree of disability given disability, and the two models share a random intercept. Because probability of disability is higher in women than in men with the same intercept and other covariates, intercept and sex are not independent given disability and other covariates. Consequently, the effect of sex on degree of disability given disability is less than is suggested by the estimated parameter.

Notation and Complete-Cluster Inference
Let $K$ be the number of complete clusters in the sample. When needed we use subscript $i$ to index cluster, but usually omit this. Let $M$ (known) be size of complete cluster. Let $Y_j$ and $X_j$ ($j = 1, \ldots, M$) be outcome and covariate vector, respectively, for member $j$ of the complete cluster, and let $\tilde{Y} = (Y_1, \ldots, Y_M)^T$ and $\tilde{X} = (X_1, \ldots, X_M)$. Let $R_j = 1$ if $Y_j$ is observed and $R_j = 0$ otherwise, and let $R = (R_1, \ldots, R_M)^T$. Members with $R_j = 1$ are "observed members"; those with $R_j = 0$ are "missing members." Let $N = \sum_{j=1}^{M} R_j$ be size of observed cluster. Assume $(R_i, \tilde{X}_i, \tilde{Y}_i)$ ($i = 1, \ldots, K$) are i.i.d. For any value $r$ of $R$, partition $\tilde{Y} = (\tilde{Y}^{(r)}, \tilde{Y}^{(\bar{r})})$, where $Y_j$ belongs to $\tilde{Y}^{(r)}$ if $r_j = 1$ and to $\tilde{Y}^{(\bar{r})}$ if $r_j = 0$. For example, with $r = (1, 0, 1)$, $\tilde{Y}^{(r)} = (Y_1, Y_3)^T$ and $\tilde{Y}^{(\bar{r})} = Y_2$. Partition $\tilde{X}$ likewise, except that if some elements of $X$ are observed even on missing members, these elements belong to $\tilde{X}^{(r)}$.
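The partition of a cluster's outcomes by a missingness pattern can be made concrete with a minimal Python sketch. The helper name `partition_by_pattern` and the string labels are illustrative assumptions, not notation from the paper.

```python
from typing import List, Sequence, Tuple


def partition_by_pattern(values: Sequence, r: Sequence[int]) -> Tuple[List, List]:
    """Split a cluster's values into observed (r_j = 1) and missing (r_j = 0) parts."""
    if len(values) != len(r):
        raise ValueError("values and r must have the same length M")
    observed = [v for v, rj in zip(values, r) if rj == 1]
    missing = [v for v, rj in zip(values, r) if rj == 0]
    return observed, missing


# Complete cluster of size M = 3 with pattern r = (1, 0, 1), as in the text.
r = (1, 0, 1)
y_observed, y_missing = partition_by_pattern(["Y1", "Y2", "Y3"], r)
n_observed = sum(r)  # observed cluster size N
```

Here `y_observed` holds the members with observed Y and `n_observed` is the observed cluster size N.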
Data are missing at random (MAR) if $P(R = r \mid \tilde{X}, \tilde{Y}) = \pi(r, \tilde{X}^{(r)}, \tilde{Y}^{(r)})$ $\forall r$ for some function $\pi(\cdot)$ (informally, $P(R \mid \tilde{X}, \tilde{Y}) = P(R \mid \tilde{X}^{(R)}, \tilde{Y}^{(R)}, M)$) and missing completely at random (MCAR) if $P(R \mid \tilde{X}, \tilde{Y}) = P(R \mid M)$ (Seaman et al., 2013) (note $M$ is a function of $\tilde{X}$, as $\tilde{X}$ has $M$ columns). Otherwise they are missing not at random (MNAR). We say data are missing with equal probability (MWEP) if $P(R_j = 1 \mid \tilde{X}, \tilde{Y}, N) = N/M$ $\forall j$. MCAR means that which members are observed does not depend on X or Y values in the cluster. This would be so if, for example, missing data had been lost by the researchers. MAR allows missingness to depend on data on observed members plus any observed data on missing members. For example, in a longitudinal study individuals' probability of dropout may depend on past health measurements but not on current health. If it also depends on current health, the data are MNAR. MWEP means the number $N$ of observed members may depend on X and Y but given this number all sets of $N$ observed members are equally likely. This could be so if missingness depends only on cluster-level summaries of X and Y.
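To illustrate the MWEP condition, the following Python sketch draws a missingness pattern R for a cluster of size M once the observed size N has been fixed. The function name `sample_pattern_mwep` is hypothetical, and how N itself is generated (possibly depending on X and Y) is deliberately left outside the sketch.

```python
import random
from typing import Tuple


def sample_pattern_mwep(m: int, n: int, rng: random.Random) -> Tuple[int, ...]:
    """Draw R given N = n so that every set of n observed members is equally likely.

    Under this scheme each member is observed with probability n / m.
    """
    observed = set(rng.sample(range(m), n))
    return tuple(1 if j in observed else 0 for j in range(m))


rng = random.Random(0)
pattern = sample_pattern_mwep(m=5, n=2, rng=rng)
```

Repeating the draw many times, each member's observed frequency should approach N/M, which is the defining property of MWEP.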
To make "complete-cluster" inference, a model is specified for $\tilde{Y}$ given $\tilde{X}$. To fit this using observed data $(\tilde{Y}^{(R)}, \tilde{X}^{(R)})$, an assumption (e.g., MAR) is made about the missingness process and a method used that is valid under this assumption, for example, inverse probability weighting (IPW) or random-effects models (Albert and Follmann, 2009). We consider two approaches to complete-cluster inference that relate to methods proposed for observed-cluster inference. The first specifies a (marginal) model for $E(Y_j \mid X_j = x, M = m)$ and assumes $E(Y_j \mid X_j = x, M = m) = e_C(x)$ for all $j$ and $m$ (1). This model is fitted to observed clusters using GEE with IPW. The second approach uses a shared random-effects model. This gives cluster-specific inference, but random effects can be integrated out to get $e_C(x)$. Whereas $e_T(x)$ is the expectation of Y given X = x giving equal weight to each observed cluster, $e_A(x)$ gives equal weight to each observed member. Clusters with $N = 0$ play no role in $e_T(x)$ or $e_A(x)$.

Semi-Parametric Marginal Models
Hoffman et al. (2001), Williamson et al. (2003) and Benhin et al. (2005) advocate using $e_T(x)$. Use of $e_A(x)$ has been proposed for mortal cohorts when missing data are due to death, and for modeling degree of disability or health of teeth when missing data are due to non-disabled patients or absent teeth (Dufouil et al., 2004; Kurland et al., 2009; Su et al., 2011; Li et al., 2011). Hoffman et al. (2001) gave an estimator for $e_T(x)$. Williamson et al. (2003) and Benhin et al. (2005) gave an asymptotically equivalent and computationally less intensive method: weighted independence estimating equations (WIEE) (see also Wang et al. (2011) for three-level data). The same equations without weighting (IEE) estimate $e_A(x)$. We describe WIEE and IEE in Section 4.1.
Dunson et al. (2003), Gueorguieva (2005), Chen et al. (2011), and Neuhaus and McCulloch (2011) consider cluster-specific inference using a linear or generalized linear mixed model (LMM/GLMM). They interpret non-informative cluster size (NICS) to mean the random effects $u$ in the mixed model are independent of $N$, and ICS to mean they are not. NICS in this sense implies NICS in the sense of Hoffman et al., but the converse is not true. To deal with ICS when fitting the LMM/GLMM, several authors have combined it with a model for $N$ or $R$, with the same or correlated random effect (Dunson et al., 2003; Gueorguieva, 2005; Chen et al., 2011; Su et al., 2009, 2011; Li et al., 2011). We discuss this model in Section 5.
Hoffman et al. (2001) wrote that ICS is "closely related" to violation of the MCAR condition. In fact, MCAR is not a sufficient condition for NICS. For example, suppose all complete clusters have size $M = 2$ and have $\tilde{Y} = (0, 1)^T$, there are no covariates, and $P\{R = (1, 1) \mid \tilde{Y}\} = P\{R = (1, 0) \mid \tilde{Y}\} = 1/2$. It is easy to show that $e_T = 1/4$ but $e_A = 1/3$.
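The two-member example above can be checked by direct enumeration. The Python sketch below uses exact fractions; the variable names are illustrative assumptions, and the two estimands are computed as "average of observed-cluster means" (equal weight per observed cluster) and "expected total of Y over expected number of observed members" (equal weight per observed member).

```python
from fractions import Fraction

# Every complete cluster has outcomes Y = (0, 1); two patterns R, each with probability 1/2.
y = (0, 1)
patterns = {(1, 1): Fraction(1, 2), (1, 0): Fraction(1, 2)}


def observed_outcomes(y, r):
    """Outcomes of the members with r_j = 1."""
    return [yj for yj, rj in zip(y, r) if rj == 1]


# Equal weight to each observed cluster: average the within-cluster means.
e_T = sum(
    prob * Fraction(sum(observed_outcomes(y, r)), len(observed_outcomes(y, r)))
    for r, prob in patterns.items()
)

# Equal weight to each observed member: expected sum of Y over expected number observed.
e_A = sum(prob * sum(observed_outcomes(y, r)) for r, prob in patterns.items()) / sum(
    prob * len(observed_outcomes(y, r)) for r, prob in patterns.items()
)
```

The missingness here does not depend on the data, yet the two quantities differ because the mean of Y differs between the first and second member of a cluster.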

Proposition 1
Cluster size will be non-informative if data are MCAR and, moreover, either i) equation (1) holds, or ii) $N \perp\!\!\!\perp M$ and the data are MWEP.
Note (1) is often assumed with GEEs, but $N \perp\!\!\!\perp M$ is unlikely, as $N \le M$. Proofs of Propositions are in Web Appendices A and E. Just as both ICS and NICS can arise from MCAR mechanisms, so they can from MAR and MNAR (examples in Web Appendix B).
When (1) holds, so $e_C(x)$ is defined, a sufficient condition for $e_C(x) = e_T(x)$ is MWEP and $P(N \ge 1) = 1$, because the Y-X relation in a randomly chosen member of an observed cluster is then the same as in a random member of the corresponding complete cluster.

Weighted GEE (WGEE)
Assume (1) holds and $e_C(x) = g^{-1}(\beta^T x)$, where $g$ is a link function. If $\tilde{Y}$ and $\tilde{X}$ were observed, $\beta$ could be estimated with GEE. With missing data, WGEE can be used. These weight member $j$ by $R_j / P(R_j = 1 \mid \tilde{X}, \tilde{Y})$. Robins et al. (1995) proposed use of WGEE when $M$ does not vary, missingness is monotone and MAR, and $P(N \ge 1) = 1$.
When data are MWEP and $P(N \ge 1) = 1$, weights $R_j / P(R_j = 1 \mid \tilde{X}, \tilde{Y}, N) = R_j M/N$ can be used instead (proof in Web Appendix C). In this case, $e_C(x) = e_T(x)$ (Section 3.3), so WGEE with weights $R_j M/N$ also give observed-cluster inference. In fact, with independence working correlation they are the WIEE proposed by Williamson et al. (2003) for estimating $\beta$ in $e_T(x) = g^{-1}(\beta^T x)$. So, WIEE have a dual interpretation: they estimate $e_T(x)$ under any missingness mechanism; and $e_C(x)$ when data are MWEP and $P(N \ge 1) = 1$. (Dufouil et al., 2004).
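As a rough illustration of the difference between WIEE and IEE, the Python sketch below solves independence estimating equations for the identity link only, where they reduce to weighted least squares. The function name `independence_fit` and its arguments are assumptions for illustration; a real analysis would use a GEE implementation with a sandwich variance estimator, which this sketch omits.

```python
import numpy as np


def independence_fit(x, y, cluster, weight_by_cluster=True):
    """Solve sum_j w_j x_j (y_j - x_j' beta) = 0 over observed members.

    weight_by_cluster=True  -> w_j = 1 / N (WIEE-style, equal weight per observed cluster)
    weight_by_cluster=False -> w_j = 1     (IEE, equal weight per observed member)
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    _, inverse, counts = np.unique(
        np.asarray(cluster), return_inverse=True, return_counts=True
    )
    n_of_member = counts[inverse]  # observed cluster size N for each member
    w = 1.0 / n_of_member if weight_by_cluster else np.ones(len(y))
    xtwx = x.T @ (w[:, None] * x)
    xtwy = x.T @ (w * y)
    return np.linalg.solve(xtwx, xtwy)


# Two observed clusters, intercept-only model: cluster "a" has Y = (0, 1), cluster "b" has Y = (0).
x = np.ones((3, 1))
y = [0.0, 1.0, 0.0]
cluster = ["a", "a", "b"]
beta_wiee = independence_fit(x, y, cluster, weight_by_cluster=True)
beta_iee = independence_fit(x, y, cluster, weight_by_cluster=False)
```

With these three observations the cluster-weighted fit returns the average of cluster means, while the unweighted fit returns the average over members.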

Doubly Weighted GEE (DWGEE)
If there is ICS and the distribution of X depends on N, interpretation of $e_T(x)$ may be awkward, because the Y-X association is confounded by N (Williamson et al., 2003). For example, suppose X is binary and has no effect on Y within clusters, and let $E(Y \mid N)$ and $P(X_j = 1 \mid N)$ be increasing functions of $N$. Then typical members with X = 1 tend to come from larger clusters than typical members with X = 0, so $e_T(1) > e_T(0)$ even though X has no effect on Y within clusters.
Huang and Leroux (2011) proposed DWGEE1 and DWGEE2. DWGEE1 can be used when X is categorical and every observed cluster contains at least one member with each of the possible values of X. DWGEE1 are the same as WIEE except that member $j$ is inversely weighted not by $N$ but by the total number of observed members in the same cluster who have $X = X_j$. Thus, within each cluster, the total weight of members with X = x is the same for all possible x. Rather than estimating $e_T(x)$, DWGEE1 estimate $E(Y \mid X)$ in the population formed by each cluster in the population contributing one member with each possible value of X.
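The DWGEE1 weighting rule described above can be sketched in a few lines of Python. The function name `dwgee1_weights` is hypothetical, and the sketch only computes the weights; it does not fit the estimating equations.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def dwgee1_weights(cluster: Sequence[Hashable], x: Sequence[Hashable]) -> List[float]:
    """Weight each observed member by 1 / (number of observed members in the
    same cluster that share its value of the categorical covariate X)."""
    counts = Counter(zip(cluster, x))
    return [1.0 / counts[(c, xv)] for c, xv in zip(cluster, x)]


# Cluster 1 has X = (0, 0, 1); cluster 2 has X = (0, 1).
cluster = [1, 1, 1, 2, 2]
x = [0, 0, 1, 0, 1]
weights = dwgee1_weights(cluster, x)
```

Within each cluster the weights for a given value of X sum to one, so every cluster contributes the same total weight to each covariate value.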
DWGEE2 was proposed for when not all observed clusters contain a member with each possible value of X. In DWGEE2 observed member j is inversely weighted by the expected (rather than actual, as in DWGEE1) number of observed members with X = X j . In Web Appendix D we show that DWGEE2 estimates E(Y | X) in a population of larger "complete" clusters in which each cluster contains at least one member with each possible value of X. Each cluster in the dataset is considered to be the observed component of one of these larger clusters, with the rest being missing. The problem with this is that, unless observed clusters really do arise from larger clusters in which all values of X are represented (which is not so in Huang and Leroux's example), the larger clusters are purely hypothetical and it is unclear why they should be of scientific interest. Further, as shown in Web Appendix D, the distribution of Y given X in the hypothetical population of complete clusters depends on which predictors are included in the model for the expected number with X = x, and there is no obvious reason to prefer one set of predictors to any other.

LMM, GLMM, and Shared Random Effect Model
The general form of the LMM, given by equations (2)-(4) (continuing to omit the subscript $i$ for cluster), has conditional mean $E(Y_j \mid \tilde{X}, u) = \beta^T X_j + u^T Z_j$, with $u \perp\!\!\!\perp \tilde{X}$, $E(u) = 0$ and $u \sim f_u(u; \alpha)$, where $Z$ is a subvector of $X$, and $u$ a cluster-specific latent variable. This is a model for Y-X association in complete clusters. Assumption $u \perp\!\!\!\perp \tilde{X}$ means that $u \perp\!\!\!\perp M$ and hence that size of complete clusters is non-informative. Elements of $X$ not in $Z$ are said to have fixed effects; those in $Z$ have random effects. It follows from (2) and (3) that $e_C(x) = \beta^T x$. So, $\beta$ also has a marginal interpretation in complete clusters. LMMs are a special case of GLMMs. In GLMMs, $Y_j \mid \tilde{X}, u$ is assumed to belong to the exponential family, (2) is replaced by $g\{E(Y_j \mid \tilde{X}, u)\} = \beta^T X_j + u^T Z_j$ (5), where $g(\cdot)$ is the link function, and (3) and (4) are assumed to hold. If Y is binary, $Z = 1$ and $u$ has a bridge distribution with rescaling parameter $\phi$ ($0 < \phi < 1$), then $\mathrm{logit}\, e_C(x) = \phi\beta^T x$ and so $\beta$ (in combination with $\phi$) has a marginal interpretation in complete clusters (Wang and Louis, 2003). More generally, $\beta$ does not have a marginal interpretation, though $e_C(x)$ can be calculated as $e_C(x) = \int g^{-1}(\beta^T x + u^T z)\, f_u(u; \alpha)\, du$.
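The integral for the marginal mean can be evaluated numerically. The Python sketch below does this for one special case chosen purely for illustration: a logit link with a single normally distributed random intercept (not the bridge distribution discussed above), using Gauss-Hermite quadrature. The function name and the arguments `eta` (the value of the fixed-effect linear predictor) and `sigma` (the random-intercept standard deviation) are assumptions.

```python
import numpy as np


def marginal_mean_logit_normal(eta: float, sigma: float, n_nodes: int = 40) -> float:
    """Approximate the integral of expit(eta + u) against a N(0, sigma^2) density for u.

    Uses Gauss-Hermite quadrature with the change of variable u = sqrt(2) * sigma * t.
    """
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    u = np.sqrt(2.0) * sigma * nodes
    conditional_mean = 1.0 / (1.0 + np.exp(-(eta + u)))
    return float(np.sum(weights * conditional_mean) / np.sqrt(np.pi))


cluster_specific = 1.0 / (1.0 + np.exp(-1.0))         # mean for a cluster with u = 0
marginal = marginal_mean_logit_normal(eta=1.0, sigma=2.0)  # mean averaged over clusters
```

In this normal-intercept case the marginal mean is pulled toward 1/2 relative to the cluster-specific mean at u = 0, which is one way to see why β generally lacks a marginal interpretation in a GLMM.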
The MLE of $(\beta, \alpha)$ from fitting the mixed model to observed clusters is consistent when data are MAR, but not, in general, when MNAR. However, Neuhaus and McCulloch (2011) showed that for LMMs, if (i) $X$ includes an intercept term, (ii) $\tilde{X} \perp\!\!\!\perp u$, (iii) $P(R \mid \tilde{X}, \tilde{Y}, u) = P(R \mid M, u)$, and (iv) the only random effect is an intercept (i.e., $Z = 1$), then $\beta$ is consistently estimated except for the intercept. They found the same was approximately true of GLMMs. More generally, they say that if $u_{sub}$ and $X_{sub}$ are subvectors of $u$ and $X$ with $X_{sub} \perp\!\!\!\perp u_{sub}$ and $P(R \mid \tilde{X}, \tilde{Y}, u) = P(R \mid M, u_{sub})$, then their results suggest that the MLE of elements of $\beta$ corresponding to $X_{sub}$ will be approximately unbiased.
For MNAR data, a model for $P(R \mid \tilde{X}, u)$ can be added to the LMM/GLMM. The result is a shared random-effects model (Albert and Follmann, 2009). When $P(R = r \mid \tilde{X}, \tilde{Y}, u) = \pi(r, \tilde{X}^{(r)}, u)$ $\forall r$ for some function $\pi(\cdot)$, the MLEs of $\beta$ and $\alpha$ from this model are consistent. An indirect way (Su et al., 2009; Li et al., 2011; Su et al., 2011) to model $P(R \mid \tilde{X}, u)$ is to introduce another random effect $v$, assume $Y_j \perp\!\!\!\perp v \mid \tilde{X}, u$, and specify models $f_{u,v}(u, v; \alpha)$ for the distribution of $(u, v)$ and $\pi^*(r, \tilde{X}^{(r)}, v)$ for $P(R = r \mid \tilde{X}, \tilde{Y}, u, v)$. We call the resulting model for $(\tilde{Y}, R)$ "a correlated random-effects model." It is a special case of the shared random-effects model, with $\pi(r, \tilde{X}^{(r)}, u) = \int \pi^*(r, \tilde{X}^{(r)}, v)\, f_v(v \mid u; \alpha)\, dv$ and $f_u(u; \alpha) = \int f_{u,v}(u, v; \alpha)\, dv$.

Interpretation of β and α in Complete Clusters
Partition X and β as X = (X (l) , X (−l) ) T and β = (β (l) , β (−l) ), where X (l) and β (l) are the lth elements of X and β, respectively. If X (l) has a random effect, partition u as u = (u (l) , u (−l) ), where u (l) corresponds to X (l) , and partition Z similarly. If X (l) has a fixed effect, u (l) = z (l) = 0, u (−l) = u and z (−l) = z. Let I (l) denote a vector of the same length as X, with lth element equal to one and all other elements equal to zero.

within-cluster effects
If $X^{(l)}$ is cluster varying with fixed effect, $\beta^{(l)}$ is its within-complete-cluster effect in clusters of size $M \ge 2$. That is, if two members of the same complete cluster have X values that differ only by $I^{(l)}$, then their expected Y values differ by $\beta^{(l)}$ for an LMM. In a GLMM, the expected value is transformed by link function $g$; for example, for logit link, $\beta^{(l)}$ is their log odds ratio. If $X^{(l)}$ is cluster varying with random effect, $\beta^{(l)}$ and $\mathrm{Var}(u^{(l)})$ are the mean and variance of the within-cluster effect.

between-cluster effects
$\beta^{(l)}$ and $\alpha$ can be interpreted in terms of differences between expected Y in members of different complete clusters. That is, if two complete clusters are randomly sampled conditional on one containing a member with $X = x$ and the other a member with $X = x + I^{(l)}$, then the difference between the expected Y values of these two members is given by (7), an integral over $u$ with respect to $f_u(u; \alpha)$. This reduces to $\beta^{(l)}$ for the LMM and to $\phi\beta^{(l)}$ for the GLMM with bridge distribution.

causal effects
If $X^{(l)}$ is manipulable, for example, treatment, $\beta^{(l)}$ may be interpretable as a causal effect in complete clusters. Let $Y_j(x, X_j^{(-l)})$ be the potential outcome of member $j$ when $X_j^{(l)}$ is manipulated to equal $x$. We make "causal assumptions" (Vansteelandt, 2007). First, $P\{Y_j = Y_j(X_j^{(l)}, X_j^{(-l)})\} = 1$, that is, observed outcome equals outcome that would be seen if $X^{(l)}$ were set to its observed value. Second, manipulating $X_j^{(l)}$ does not affect $X^{(-l)}$. The assumptions are taken to hold for all $x \in \mathcal{X}$, where $\mathcal{X}$ is set of possible values of $X^{(l)}$. With these assumptions, the conditional expected causal effect given $X^{(-l)}$ and $u$, denoted $c(x, X^{(-l)}, u)$, reduces to $(\beta^{(l)} + u^{(l)})x$. The conditional expected causal effect given $X^{(-l)}$ alone is $c^*(x, X^{(-l)}) = \int c(x, X^{(-l)}, u)\, f_u(u; \alpha)\, du$, which reduces to $\beta^{(l)}x$ for LMMs and to $\phi\beta^{(l)}x$ for GLMMs with bridge distribution.

Interpretation of β and α in Observed Clusters
Section 5.2 discussed how β and α in the model defined by (2)-(4) or (3)-(5) describe the Y -X association in complete clusters. Now we discuss how the same β and α relate to associations in observed clusters.

within-cluster fixed effects
When (6) holds and $X^{(l)}$ is cluster varying with fixed effect, $\beta^{(l)}$ is not only the within-complete-cluster effect of $X^{(l)}$, it is also the within-observed-cluster effect, which is the same in all observed clusters of size $N \ge 2$. That is, if two members of the same observed cluster of size $N \ge 2$ have X values that differ only by $I^{(l)}$, then their expected values (transformed by link function $g$ in the case of the GLMM) of Y differ by $\beta^{(l)}$.
When considering within-observed-cluster effects of covariates with random effects, between-observed-cluster effects and causal effects, we find it convenient to introduce the concept of the LMM/GLMM given by equations (2)-(4) or (3)-(5) "describing observed random subclusters." For a cluster with $N \ge n$, let $H_n$ denote the set of indices of a simple random sample of size $n$ from the $N$ observed members, and let $\tilde{X}^{(H_n)} = \{X_j : j \in H_n\}$. Note that $H_1$ is the same as what we denoted in Section 3 by $H$. We say "the LMM given by (2)-(4) describes observed random subclusters of size $n$ from observed clusters of size $\ge n$" (or, more concisely, "the LMM describes observed random subclusters of size $n$") if $\{Y_j : j \in H_n\}$ are independent given $\tilde{X}^{(H_n)}$, $u$, $N \ge n$ and equations (8)-(11) hold, where $\beta$ and $\alpha$ in (8)-(11) are the same parameters (i.e., have the same values) as in equations (2)-(4). Similarly, "the GLMM (given by (3)-(5)) describes observed random subclusters of size $n$" if (12) and (9)-(11) hold. If (8)-(11) or (9)-(12) hold for one or more values of $n$, we have a basis for interpreting the estimates of $\beta$ and $\alpha$ obtained by fitting the LMM/GLMM given by (2)-(5) (which describes complete clusters) in terms of effects in observed clusters. We give these interpretations below. Later (Proposition 2) we give sufficient conditions for the LMM/GLMM to describe observed random subclusters of size $n$ and (Section 5.4) show what can happen when these conditions are not satisfied. Note that the statement that LMM/GLMM describes random subclusters of size $n$ is a statement about the Y-X relation only in observed members of clusters with $N \ge n$; the association in missing members or in clusters with $N < n$ is not relevant. We shall focus on $n = 1$ when discussing between-cluster effects, but for within-cluster effects we need $n \ge 2$, because within-cluster comparisons only make sense in clusters with at least two members.
In most realistic settings, if the sufficient conditions (Proposition 2) are satisfied for $n$, they are also satisfied for $n^* < n$.

within-cluster random effects
If the LMM/GLMM describes observed random subclusters of size $n$ (with $n \ge 2$) and $X^{(l)}$ is a cluster-varying covariate with random effect, then $\beta^{(l)}$ and $\mathrm{Var}(u^{(l)})$ are the mean and variance of the within-observed-cluster effect of $X^{(l)}$. That is, if an observed cluster is randomly sampled conditional on $N \ge n$ and on $n$ members randomly chosen from it having X values that differ only in $X^{(l)}$, then the expected values (transformed by link function $g$) of Y of any pair of these $n$ members differ by $(\beta^{(l)} + u^{(l)})\delta$, where $\delta$ is the difference between their $X^{(l)}$ values, and the distribution of $u^{(l)}$ is given by $u \sim f_u(u; \alpha)$.

between-cluster effects
If the LMM/GLMM describes observed random subclusters of size $n = 1$, $\beta$ are the between-observed-cluster effects of X.
That is, if two clusters each with $N \ge 1$ are randomly sampled conditional on $X_H = x$ in one cluster and $X_H = x + I^{(l)}$ in the other, then the difference between the expectations of $Y_H$ in the two clusters is given by (13). Since (13) has the same form as (7), between-cluster effects in observed and complete clusters are equal and $\beta$ and $\alpha$ describe them both. As with (7), (13) reduces to $\beta^{(l)}$ for the LMM. When $X^{(l)}$ has fixed effect, this is true even if $u$ is not independent of $N$, so (10) is not necessary for $\beta^{(l)}$ to be interpreted as a between-observed-cluster fixed effect in an LMM.

causal effects
Let X (l) be manipulable and the "causal assumptions" of Section 5.2 hold. LetX If the LMM/GLMM describes observed random subclusters of size n (n ≥ 1) and , u), then β (l) and α describe a causal effect of X (l) in observed random subclusters of size n. That is, the expected causal effect givenX (Hn) and u in the members whose indices belong to H n is equal to c(x, X ; α), and the expected causal effect givenX (Hn) is equal to c * (x, X M , this causal interpretation is problematic because membership of observed clusters may change as X (l) is manipulated, that is, some observed members would not have been observed if their X (l) values had been otherwise, while some missing members would have been observed.

Proposition 2
The LMM/GLMM describes observed random subclusters of size $n$ if (i) $P(R \mid \tilde{X}, \tilde{Y}, u) = P(R \mid X_{con}, u)$, where $X_{con}$ is a cluster-constant subvector of $X$; either (iia) $(X_1, \ldots, X_M)$ are exchangeable given $M$ or (iib) $P(R = r \mid X_{con}, u) = P(R = r' \mid X_{con}, u)$ whenever $r'$ is a permutation of $r$; and (iii) $P(N \ge n \mid X_{con}, u) = P(N \ge n)$.
Note that (iii) holds if the minimum possible observed cluster size is ≥ n, but is unlikely to hold otherwise; and if (iii) is replaced by the weaker condition P(N ≥ n | X con , u) = P(N ≥ n | u), then (8), (9) and (11) still hold, but (10) may not.

Situations Where Complete- and Observed-Cluster Effects Differ
With the exceptions mentioned above (i.e., within-cluster fixed effects, and between-cluster and causal fixed effects in LMMs when (9) holds), β and α may not be so interpretable in terms of effects in observed clusters if (9) or (10) do not hold.
Suppose that (10) with $n = 1$ does not hold and $X^{(l)}$ has a random effect. The between-observed-cluster effect of $X^{(l)}$ is given by (13) with $f_u(u; \alpha)$ replaced by $f_u(u \mid N \ge 1; \alpha)$. In particular, it does not reduce to $\beta^{(l)}$ for the LMM unless $E(u \mid N \ge 1) = 0$. Similarly, the observed-cluster causal effect $\int c(x, X^{(-l)}, u)\, f_u(u \mid N \ge 1; \alpha)\, du$ is, in general, not the same as the complete-cluster causal effect $c^*(x, X^{(-l)})$; and the within-observed-cluster effect will not, in general, have mean $\beta^{(l)}$ and variance implied by $f_u(u; \alpha)$.
In the following example, (9) does not hold for n = 2. Suppose clusters are old people in a cohort study of cognitive function Y. An LMM is used, with a random effect for time because rate of cognitive decline varies between people. Assume a fixed effect for the intercept. The only missing data are due to death: R ij = 1 if person i is alive at time j; R ij = 0 if dead. So, X j = (1, j) T , Z = X (2) , u = u (2) and missingness is monotone. Suppose people with more rapid decline (more negative u (2) ) tend to die earlier. The within-complete-cluster effect of X (2) has mean β (2) and variance Var(u (2) ). The mean and variance of the within-observed-cluster effect are functions of X (2) : they both diminish as X (2) increases. This is because the subsample still alive at later times is enriched for high u (2) . In this setting "complete-cluster" inference has been called inference for a hypothetical immortal cohort, and it has been suggested that "observed-cluster" inference (describing the population still alive at each timepoint) is of more interest (Dufouil et al., 2004). See Section 6 and Web Appendix F for examples of between-cluster or causal effects differing in complete and observed clusters.
Dunson et al. (2003), Chen et al. (2011) and Gueorguieva (2005) wanted observed-cluster inference when "complete clusters" do not exist, for example, toxicology experiments where clusters are litters. Dunson et al. and Gueorguieva assumed cluster-constant X, P(N ≥ 1) = 1 and P(N |X,Ỹ , u) = P(N | X, u). Chen et al. assumed X was cluster constant or a function of j (e.g., X j = (1, j) T ), P(N |X,Ỹ , u) = P(N | u) and Z = 1. It can be seen that these methods give complete-cluster inference for a hypothetical population of complete clusters in which M i = max(N 1 , . . . , N K ) and from which the population of observed clusters would be generated by applying monotone missingness mechanism P(N |X,Ỹ , u). However, they do not only provide complete-cluster inference.
When, as in Dunson et al. and Gueorguieva, X is cluster constant and P(N ≥ 1) = 1, conditions (i), (iia) and (iii) of Proposition 2 hold with n = 1, so β and α are also between-cluster or causal effects in observed clusters. When, as in Chen et al., X is cluster varying, Z = 1 and P(N |X,Ỹ , u) = P(N | u), nonintercept elements of β are within-observed-cluster effects.
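The enrichment of survivors for high random slopes in the cognitive-decline example can be seen in a small simulation. The Python sketch below is illustrative only: the standard normal slope distribution, the logistic per-interval survival probability `expit(1 + slope)`, and all numeric settings are assumptions, not values from any study.

```python
import numpy as np


def mean_slope_among_survivors(n_people=50_000, n_times=5, seed=0):
    """Mean random slope among people still alive at each of n_times visits.

    Slopes are N(0, 1); more negative slopes (faster decline) lower the
    probability of surviving each interval, so dropout due to death is monotone.
    """
    rng = np.random.default_rng(seed)
    slope = rng.normal(0.0, 1.0, size=n_people)
    alive = np.ones(n_people, dtype=bool)
    means = []
    for _ in range(n_times):
        means.append(float(slope[alive].mean()))
        p_survive = 1.0 / (1.0 + np.exp(-(1.0 + slope)))
        alive &= rng.random(n_people) < p_survive
    return means


survivor_means = mean_slope_among_survivors()
```

At the first visit everyone is alive and the mean slope is close to zero; at each later visit the mean slope among those still alive is higher, so the distribution of the random slope in observed clusters drifts away from the complete-cluster distribution.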

Example: Psoriatic Arthritis
This example shows a model that ostensibly describes observed clusters but some of whose parameters relate only to a population of complete clusters with no obvious meaning. Husted et al. (2007) analyzed a cohort of 382 psoriatic arthritis (PsA) patients. Physical function was measured by the health assessment questionnaire score (HAQ). HAQ is semicontinuous: it is zero (no disability) with positive probability and otherwise varies continuously up to 3 (severe disability). 31% of the 2107 HAQ scores were zero. They separately modeled P(HAQ > 0) (the "binary-part") and HAQ given HAQ > 0 (the "continuous-part"), using, respectively, logistic regression with random intercept v (1) and linear regression with random intercept u (1) . Both parts had the same covariates (sex, time since onset, etc.), and all covariates had fixed effects. Among the conclusions was that being female predicted higher HAQ when HAQ > 0, adjusting for other covariates.
Here, clusters are patients and "observed cluster" means a patient's set of non-zero scores. Su et al. (2009) noted that estimates for the continuous part might be biased because separate modeling of binary and continuous parts did not account for ICS caused by the model for the binary part determining the observed cluster size in the continuous part. So, they modified Husted et al.'s model by replacing v (1) by ψu (1) , where ψ is unknown. They called this shared random-effect model the "latent-process model" (SAS code provided in Web Appendix G). They also used a correlated random effects model, but results were similar.
In the original (misspecified) model of Husted et al., the estimated sex effect in the continuous part was 0.181 (SE 0.051). In the latent-process model, it was 0.246 (SE 0.052) (Table 1). We focus on the meaning of this latter estimate. We emphasize there is nothing intrinsically wrong with the latent-process model. It can validly be used to predict HAQ. What is important is not to misinterpret the parameters in the continuous part. As this is an LMM and sex is cluster-constant with fixed effect, the estimated sex effect, 0.246, describes the between-cluster effect in "complete clusters," that is, in a hypothetical world in which all scores are somehow non-zero. The meaning and scientific interest of this hypothetical world, analogous to the world of "immortal cohorts," is unclear. Su et al. (2009) do not comment on the meaning of their estimated sex effect, but suppose one wished to interpret it as an effect in observed clusters, as done in Husted et al. (2007). As all the covariates have fixed effects, estimates for cluster-varying covariates can be interpreted unproblematically as within-cluster effects in complete or observed clusters. However, sex is cluster-constant. To illustrate the problem with We also used IEE to fit a model for e A (x), the conditional mean of HAQ given sex, time since onset, etc. and HAQ > 0 (Table 1). The estimated sex effect is 0.100 (SE 0.031), which is close to the effect, 0.124, worked out above using empirical Bayes estimates.
In conclusion, the estimated sex effect in the continuous part of the latent-process model (and correlated random-effects model) describes the association between sex and HAQ in a hypothetical population of little scientific interest; for this dataset it overstates the size of the effect in the population of scientific interest. In further work, Su et al. (2011) found an association of genotype HLA-B27 with HAQ when HAQ > 0. The same interpretation problem applies here: this association refers to the hypothetical "complete" clusters.
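The mechanism behind this overstatement can be reproduced in a toy simulation of a two-part model with a shared random intercept. The Python sketch below does not use the psoriatic arthritis data: it takes one score per patient, and every parameter value (`beta_sex`, `gamma_sex`, `psi`, the variances) is a hypothetical choice made only to show the direction of the effect.

```python
import numpy as np


def simulate_sex_effect(n=200_000, beta_sex=0.25, gamma_sex=1.0, psi=1.5, seed=0):
    """Compare the model parameter beta_sex with the observed female-male mean difference.

    Binary part:     logit P(score > 0) = gamma_sex * female + psi * u
    Continuous part: score = 1 + beta_sex * female + u + noise, seen only when score > 0
    The random intercept u is shared by both parts.
    """
    rng = np.random.default_rng(seed)
    female = rng.integers(0, 2, size=n)
    u = rng.normal(0.0, 0.5, size=n)
    p_positive = 1.0 / (1.0 + np.exp(-(gamma_sex * female + psi * u)))
    positive = rng.random(n) < p_positive
    score = 1.0 + beta_sex * female + u + rng.normal(0.0, 0.3, size=n)
    observed_difference = (
        score[positive & (female == 1)].mean() - score[positive & (female == 0)].mean()
    )
    return beta_sex, float(observed_difference)


beta_sex, observed_difference = simulate_sex_effect()
```

Because women are more likely to have a positive score at any given intercept, men with positive scores are more strongly selected for high u, so the observed difference between the sexes among positive scores falls below `beta_sex`.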

Discussion
We have shown that shared random-effect models do not always describe observed clusters, except for cluster-varying covariates with fixed effects or under the conditions of Proposition 2. The models of Dunson et al. (2003), Gueorguieva (2005) and Chen et al. (2011) are unnecessarily restrictive. They assume either cluster-constant X or that N does not depend on X. Proposition 2 shows X can be cluster varying if N depends only on cluster-constant elements. The assumptions required do, however, remain restrictive. WIEE relate to IPW for missing data. DWGEE2 give inference for a hypothetical population of complete clusters that is, in general, neither unique nor of scientific interest.
For binary Y, Li et al. (2011) used a correlated random-intercepts model with bridge distributions, so that $\mathrm{logit}\, e_C(x) = \phi\beta x$. For a single binary X, they compared the log odds ratios in complete and observed clusters. They found the difference was small when the variance of the random intercepts or the correlation between them was small. However, when random-intercept variances and/or correlation are small, cluster size is only weakly informative; when size is strongly informative, inferences for complete and observed clusters will differ more. We replicated Li et al.'s study and found the two log odds ratios could differ by as much as 25% when $\phi = 0.6$, and 56% when $\phi = 0.2$ (see Web Appendix H).
We have assumed Y and X are observed in all members for which we wish to make inference. Dufouil et al. (2004) and Shardell and Miller (2008) give methods for when this is not so.
Having illustrated the danger of misinterpreting estimates, we recommend careful thought about which inference is of scientific interest and which analysis method will give it.

Supplementary Materials
Web Appendices referenced in Sections 3-7 are available with this paper at the Biometrics website on Wiley Online Library.