Keywords:

  • GEE;
  • bootstrap;
  • familial correlation;
  • multifactorial model;
  • binary data;
  • simulation

Summary

  1. Top of page
  2. Summary
  3. Introduction
  4. Methods
  5. Application To Cardiovascular Data Set
  6. Results
  7. Simulation Study
  8. Discussion
  9. Acknowledgements
  10. References

Complex diseases are influenced by both genetic and environmental factors. Studies of individuals or of families can be used to examine the association of genetic factors, such as candidate genes, and other risk factors with the presence or absence of complex disorders. If families are investigated, whether or not they are randomly ascertained, possible familial correlation among observations must be considered. We have compared two statistical approaches for analyzing correlated binary data from randomly ascertained nuclear families. The generalized estimating equations approach (GEE) can be used to adjust for familial correlation. The relationship between covariates and the response is modelled, and the correlations among family members are treated as nuisance parameters. For comparison, we have proposed two strategies from a hierarchical nonparametric bootstrap approach. One strategy (S1) samples family units, preserving the structure and correlation within each family. A second and novel strategy (S2) also samples family units but then randomly samples offspring with replacement in each family. We applied the methods to data from a study of cardiovascular disease, and followed up with a simulation study in which family data were generated from an underlying multifactorial genetic model. Although the bootstrap approach was more computationally demanding, it outperformed the GEE in terms of confidence interval coverage probabilities for all sample sizes considered.


Introduction

In cohort or cross-sectional studies in which families are randomly sampled from a defined population, association analysis of disease status with risk factors such as candidate genes and environmental factors can provide some understanding of their relative impact on complex disorders. These complex disorders usually have a multifactorial pattern of inheritance, where several susceptibility genes are distinctly located throughout the human genome, and are also influenced by other factors such as environmental exposure.

Standard statistical methods that test for marginal relationships of risk factors with disease outcome and treat individual family members as independent are not valid because they ignore the correlation among family members. For example, in many cases the variation within a family will likely be smaller than the variation between families, because related family members share genetic characteristics and environmental influences.

In this report we evaluate two statistical methods for analyzing correlated binary data that avoid strict distributional assumptions. Unlike parametric models, the underlying correlation structure of the data does not need to be strictly specified. One such statistical method is the generalized estimating equations approach (GEE) (Liang & Zeger, 1986; Zeger & Liang, 1986; Zeger et al. 1988). The GEE does not require the complete specification of the joint distribution or likelihood of responses from family members. The GEE models the relationship between covariates and the response, while treating the family correlation as a nuisance parameter. This method requires specification of the mean structure and a working correlation matrix for the vector of responses. Correctly specifying a working correlation structure can improve estimation efficiency, but even if it is misspecified the GEE robust sandwich variance estimate is asymptotically unbiased (Liang & Zeger, 1986). In small samples, however, the GEE robust variance estimator is known to be biased, yielding anti-conservative inference (Drum & McCullagh, 1993) especially for high intra-cluster correlation and variable cluster sizes (Mancl & DeRouen, 2001). As an alternative to the GEE, we consider a bootstrap method which is simple and robust. For family data, we apply the hierarchical bootstrap (Davison & Hinkley, 1997) under two strategies, strategy 1 (S1) and strategy 2 (S2). This bootstrap approach takes into account the within-family information by using the families as bootstrapping units (S1), as well as by random sampling within each family (S2). Unlike other methods for correlated data, the bootstrap does not require any assumptions about the underlying correlation structure to obtain variances of estimates.

Few authors have compared the performance of the GEE and the bootstrap for correlated data in regression analysis and, to our knowledge, none have evaluated the S2 bootstrap strategy in family data. Feng et al. (1996) and Sherman & le Cessie (1997) used a nonparametric S1 bootstrap approach on independent clusters and observed that, in general, when the robust estimate of the standard error was used GEE provided comparable coverage to the bootstrap. Feng et al. (1996) found that the bootstrap estimates were more efficient when the sample size was small. Overall, these authors concluded that the bootstrap approach is superior in small and large samples compared to other methods used for correlated data, including the GEE and mixed linear models.

We conducted a simulation study designed to provide challenging conditions for the methods. It reflects a situation in which the underlying correlation structure is consistent with familial or genetic sharing. For example, a family with larger correlation between parents and between siblings, but with smaller correlation between parents and offspring, would arise with environmental sharing. A family correlation structure that reflects genetic dependence would have high correlation between siblings, lower correlation for parent-offspring pairs, and virtually no correlation between parents. A more complex correlation structure may involve both genetic and environmental sharing among family members. Although it is possible to specify these types of general correlation structures (Bull et al. 1995), evaluations with varying family sizes have been limited. Sherman & le Cessie (1997) used an exchangeable correlation structure to generate binary data in their simulations. Feng et al. (1996) used a more general correlation structure to simulate their data, but examined a quantitative trait. Both of these previous studies looked at clusters of fixed sizes. In our simulation study we investigated families of variable sizes, to reflect a setting such as that found in a population-based association study.

Based on bootstrap approaches from Davison & Hinkley (1997), which we adapt to family data, a novel aspect of this report is the use of a hierarchical bootstrap approach, in which resampling is applied to the offspring within nuclear families as well as to the family units. Previous studies have examined resampling of entire families only. Our work is motivated by studies of candidate genes in population-based studies which include entire nuclear families, not only unrelated individuals. This design is particularly relevant in cross-sectional or cohort studies in populations in which the candidate gene variant is common. Although unrelated individuals rather than families are typically included in a population-based study, we prefer the ‘population-based’ terminology because it reflects the design in which families are ‘randomly’ selected from the population, without reference to their disease status, and are therefore representative of it. Further, our regression analysis is a marginal, so-called population-averaged, model (Neuhaus et al. 1991), in which inference is at the individual level, and any familial clustering is incidental. In a logistic regression, for example, the regression coefficient for a binary genotype indicator compares the odds of disease in the group of individuals with the risk genotype to the odds in the group without it. By contrast, in a mixed effects or “cluster-specific” logistic regression, the regression coefficients correspond to the relationship between disease status and genotypes within a family. Moreover, family-based association studies of nuclear families, that is, transmission-disequilibrium (TDT) type analyses, are based on a conditional within-family model, but families are usually ascertained on offspring disease status, the conditioning is on the parental genotypes, and parental disease status is ignored. 
As is well known in the case-control setting, a marginal analysis is expected to be more efficient than a conditional analysis, provided that confounding by population structure is not an issue. In 1999, the Editors of Nature Genetics recommended that genetic effect estimates be routinely reported in association studies. Confidence intervals for genetic effects, such as the log odds ratio, are useful in assessing the importance of a candidate gene variant.

Methods

GEE for Binary Responses

In the GEE for binary responses one specifies the marginal distribution of the binary outcome or phenotype, $y_{ij}$, for the $j$th individual in the $i$th family, $i = 1, \dots, m$, $j = 1, \dots, n_i$. The marginal expectation of the phenotype, $E(y_{ij}) = \pi_{ij} = P(y_{ij} = 1)$, is explained by an individual-level covariate vector, $x_{ij}$, through the logit link function, $\mathrm{logit}(\pi_{ij}) = x_{ij}\beta$, where the regression parameter vector, $\beta$, is of dimension $p$. A consistent estimate of $\beta$ is found by solving the estimating equations

$$\sum_{i=1}^{m} D_i^{T} V_i^{-1} \{y_i - \mu_i(\beta)\} = 0. \qquad (1)$$

The first term, $D_i = \partial\mu_i(\beta)/\partial\beta$, is an $n_i \times p$ matrix where the $(j, k)$th element is $\partial\pi_{ij}/\partial\beta_k$, and $\mu_i(\beta) = E(y_i)$ is the marginal expectation for $y_i$, the vector of binary phenotypes for family $i$, $i = 1, 2, \dots, m$. The second component specifies the inverse of

$$V_i = A_i^{1/2} R_i(\alpha) A_i^{1/2},$$

which is of dimension $n_i \times n_i$. The diagonal elements of the diagonal matrix $A_i$ contain the marginal variance of $y_{ij}$, $\mathrm{var}(y_{ij}) = \pi_{ij}(1 - \pi_{ij})$, and $R_i(\alpha)$ is the ‘working correlation’ matrix or the ‘model-based’ assumption of the correlation for the $n_i$ individuals in family $i$, fully specified by a vector of parameters $\alpha$. Some common choices for $R(\alpha)$ are the independence structure, where the correlation matrix is the identity matrix, and the exchangeable structure, where correlations are assumed to be equal among individuals within a family. When the marginal mean structure is specified correctly, the GEE regression estimates, $\hat{\beta}$, are consistent and asymptotically follow a Normal distribution. Liang & Zeger (1986) showed that the GEE estimates of $\beta$ are consistent even when the covariance structure is misspecified. In the setting in which a covariate of interest varies within the family, as with a common candidate gene variant, the relative efficiency of the corresponding regression parameter estimate improves when the GEE working correlation structure is close to the true correlation structure.
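As a short worked step, it may help to see what the estimating equations become under the independence working correlation with the logit link. The labels $D_i$ (the $n_i \times p$ derivative matrix described above) and $X_i$ (the $n_i \times p$ matrix whose rows are the $x_{ij}$) are notation assumed here for illustration.

```latex
% Logit link: \partial\pi_{ij}/\partial\beta_k = \pi_{ij}(1-\pi_{ij})\,x_{ijk}, so D_i = A_i X_i.
% Independence working correlation: R_i(\alpha) = I, so V_i = A_i.
\begin{align*}
D_i^{T} V_i^{-1} &= X_i^{T} A_i A_i^{-1} = X_i^{T}, \\
\sum_{i=1}^{m} D_i^{T} V_i^{-1}\{y_i - \mu_i(\beta)\}
  &= \sum_{i=1}^{m} X_i^{T}\{y_i - \pi_i(\beta)\} = 0 .
\end{align*}
```

These are the score equations of ordinary logistic regression. Under this working structure the GEE point estimates therefore coincide with the ordinary logistic regression estimates, and only the variance estimate differs.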

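A model-based or naive estimate of the variance of $\hat{\beta}$ is

$$\widehat{\mathrm{var}}_{M}(\hat{\beta}) = M_0^{-1}, \qquad M_0 = \sum_{i=1}^{m} D_i^{T} V_i^{-1} D_i,$$

which is unbiased only when $V_i$ is correctly specified. Here $D_i = \partial\mu_i(\beta)/\partial\beta$ is the derivative matrix and $V_i$ the working covariance matrix of (1), both evaluated at $\hat{\beta}$. A robust or sandwich estimate (Royall, 1986) of the variance of $\hat{\beta}$ is

$$\widehat{\mathrm{var}}_{R}(\hat{\beta}) = M_0^{-1} M_1 M_0^{-1},$$

where $M_1 = \sum_{i=1}^{m} D_i^{T} V_i^{-1} \widehat{\mathrm{cov}}(y_i) V_i^{-1} D_i$ and $\widehat{\mathrm{cov}}(y_i) = (y_i - \hat{\mu}_i)(y_i - \hat{\mu}_i)^{T}$. $\widehat{\mathrm{var}}_{R}(\hat{\beta})$ will provide valid inferences for $\beta$ even when the covariance structure $V_i$ is misspecified. Liang & Zeger (1986) also found that when independence was assumed among observations in the same cluster, i.e. $R_i(\alpha) = I$, the robust covariance matrix for $\hat{\beta}$ approached its asymptotic variance-covariance matrix as the number of clusters, $m$, increased.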

Conventionally, confidence intervals for $\beta$ are constructed assuming an asymptotic Normal distribution and using the robust variance estimate of the regression coefficient. However, in small samples $\hat{\beta}$ may be subject to finite sample bias and its distribution may not be symmetric. In addition, the robust variance estimate may not be appropriate for small sample sizes (Drum & McCullagh, 1993), since type I errors may be inflated (Pan, 2001). To address this, modifications to the robust variance estimator have been proposed (Mancl & DeRouen, 2001; Pan, 2001; Pan & Wall, 2002), but these have not been widely used in practice.
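The contrast between the model-based and the robust variance can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the independence working correlation, a single binary covariate plus an intercept, and a small made-up data set. The function names and the toy values are hypothetical, and no small-sample correction is applied.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def inv2(m):
    """Invert a symmetric 2x2 matrix given as nested sequences."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul2(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def information(x, beta):
    """X'WX for logit(pi) = b0 + b1*x with W = diag(pi(1 - pi))."""
    h00 = h01 = h11 = 0.0
    for xi in x:
        p = expit(beta[0] + beta[1] * xi)
        w = p * (1.0 - p)
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return [[h00, h01], [h01, h11]]

def fit_logistic(x, y, iters=25):
    """Newton-Raphson; under working independence this solves equations (1)."""
    beta = [0.0, 0.0]
    for _ in range(iters):
        s0 = sum(yi - expit(beta[0] + beta[1] * xi) for xi, yi in zip(x, y))
        s1 = sum(xi * (yi - expit(beta[0] + beta[1] * xi)) for xi, yi in zip(x, y))
        inv = inv2(information(x, beta))
        beta = [beta[0] + inv[0][0] * s0 + inv[0][1] * s1,
                beta[1] + inv[1][0] * s0 + inv[1][1] * s1]
    return beta

def naive_and_robust(x, y, family, beta):
    """Return (model-based, sandwich) covariance matrices of beta."""
    bread = inv2(information(x, beta))
    scores = {}  # per-family sums of score contributions
    for xi, yi, f in zip(x, y, family):
        r = yi - expit(beta[0] + beta[1] * xi)
        u = scores.setdefault(f, [0.0, 0.0])
        u[0] += r
        u[1] += xi * r
    meat = [[sum(u[i] * u[j] for u in scores.values()) for j in range(2)]
            for i in range(2)]
    return bread, matmul2(matmul2(bread, meat), bread)

# Hypothetical toy data: 6 families, x = genotype indicator, y = disease status.
family = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6]
x = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0]
y = [1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0]

beta = fit_logistic(x, y)
naive, robust = naive_and_robust(x, y, family, beta)
```

The family-level sums in `scores` are what distinguish the sandwich estimate from the model-based one: residuals are added within a family before the outer product is taken, so within-family correlation enters the variance.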

Hierarchical Bootstrap for Logistic Regression Models

The bootstrap method offers an alternative approach to confidence interval construction that can take familial correlation into account and does not require asymptotic normality of the estimates. Under the assumption that families are independent, an intuitive way to bootstrap family data would be to sample entire family units. This all-block bootstrap (Sherman & le Cessie, 1997; Davison & Hinkley, 1997) naturally preserves the structure and correlation within each family. Following Davison & Hinkley (1997), we propose an extension of the bootstrap resampling scheme that takes into account two sources of variation observed in family data, variation between families and variation within families, via the hierarchical bootstrap, a method for computing bootstrap estimates in which the resampling scheme follows a two-stage nested design. This approach can be applied to nuclear family data that have a nested or hierarchical structure of parents with offspring.

Davison & Hinkley (1997) presented two strategies for bootstrapping data with a nested structure where sampling is performed at two levels. We adapted these strategies for family data:

Strategy 1 (S1): Stage 1: randomly sample families with replacement, keep parents.

Stage 2: for those families selected in stage 1, randomly sample offspring within each family without replacement;

Strategy 2 (S2): Stage 1: randomly sample families with replacement, keep parents.

Stage 2: for those families selected in stage 1, randomly sample offspring within each family with replacement.

Note that the first strategy is the same as the all-block bootstrap because sampling at the second stage is performed without replacement, which returns each selected sibship unchanged. Because families will normally be of different sizes, the total size of a bootstrap sample can differ from that of the original sample, and a correction factor [expression not recoverable] is applied to each bootstrap sample. The correction involves $\hat{\beta}^*$, the bootstrap estimate from a bootstrap sample of size $n^* = \sum_i n^*_i$ (where $n^*_i$ is the size of the $i$th family in the bootstrap sample), and the size of the empirical distribution, $n = \sum_i n_i$ (where $n_i$ is the size of the $i$th family in the original sample) (Efron & Tibshirani, 1993).
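The two resampling strategies can be sketched as follows. This is a minimal illustration under assumed data structures: each family is a dictionary with hypothetical keys `parents` and `offspring`, and the function name is made up for this example. The size correction mentioned above is not included.

```python
import random

def resample_families(families, strategy, rng):
    """Draw one hierarchical bootstrap sample of nuclear families.

    families: list of {"parents": [...], "offspring": [...]} (hypothetical layout)
    strategy: "S1" keeps each selected sibship intact,
              "S2" redraws offspring with replacement within the family.
    """
    m = len(families)
    sample = []
    for _ in range(m):
        # Stage 1: sample a family with replacement; parents are always kept.
        fam = families[rng.randrange(m)]
        kids = fam["offspring"]
        if strategy == "S1":
            # Stage 2 without replacement: the same sibship, possibly reordered.
            new_kids = rng.sample(kids, len(kids))
        elif strategy == "S2":
            # Stage 2 with replacement: some sibs may repeat, others drop out.
            new_kids = [rng.choice(kids) for _ in kids]
        else:
            raise ValueError("strategy must be 'S1' or 'S2'")
        sample.append({"parents": list(fam["parents"]), "offspring": new_kids})
    return sample

# Hypothetical toy data: individuals are represented by string identifiers.
toy_families = [
    {"parents": ["f1", "m1"], "offspring": ["a1", "a2", "a3"]},
    {"parents": ["f2", "m2"], "offspring": ["b1"]},
    {"parents": ["f3", "m3"], "offspring": ["c1", "c2"]},
]
rng = random.Random(2024)
boot_s1 = resample_families(toy_families, "S1", rng)
boot_s2 = resample_families(toy_families, "S2", rng)
```

Under S1 every resampled family is an exact copy of an observed family, so only between-family variation is resampled. Under S2 the composition of each sibship also varies from one bootstrap sample to the next.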

For binary outcomes degenerate bootstrap samples are occasionally produced. These occur when there are no observations represented in some covariate groups, and arise especially in smaller samples. Hence, the fitting of a logistic model can break down and produce nonexistent maximum likelihood estimates if the number of responses and/or the sample size is small (Albert & Anderson, 1984). To circumvent this problem, Moulton & Zeger (1989) proposed a one-step approximation when bootstrapping generalized linear models. In other words, we perform only one iteration for each bootstrap replication rather than allowing the iteratively reweighted least squares algorithm to fully converge. This method is also advantageous for complicated models and large data sets because computing time is reduced.
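A one-step update of this kind can be sketched as below for a logistic model with an intercept and one covariate. The sketch assumes the single iteration starts from the estimate obtained on the original sample. The function name and the toy data are hypothetical.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def one_step(x, y, beta_start):
    """One Newton (IRLS) iteration for logit(pi) = b0 + b1*x from beta_start."""
    b0, b1 = beta_start
    s0 = s1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = expit(b0 + b1 * xi)
        w = p * (1.0 - p)
        s0 += yi - p          # score for the intercept
        s1 += xi * (yi - p)   # score for the slope
        h00 += w              # information matrix entries
        h01 += w * xi
        h11 += w * xi * xi
    det = h00 * h11 - h01 * h01
    d0 = (h11 * s0 - h01 * s1) / det
    d1 = (-h01 * s0 + h00 * s1) / det
    return b0 + d0, b1 + d1

# Hypothetical original sample: 3 of 12 affected when x = 0, 7 of 9 when x = 1.
x_orig = [0] * 12 + [1] * 9
y_orig = [1] * 3 + [0] * 9 + [1] * 7 + [0] * 2
beta_hat = (math.log(3 / 9), math.log(7 / 2) - math.log(3 / 9))  # closed-form MLE

# A degenerate bootstrap sample: everyone with x = 1 is affected, so the
# fully iterated maximum likelihood estimate of the slope does not exist.
x_star = [0, 0, 0, 0, 1, 1, 1]
y_star = [0, 0, 1, 0, 1, 1, 1]
beta_star = one_step(x_star, y_star, beta_hat)
```

Because the weights are evaluated at the finite starting value, the update stays finite even when the bootstrap sample is separated.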

Bias-corrected and accelerated (BCa) confidence intervals (Efron & Tibshirani, 1986) were used to construct non-parametric confidence intervals for $\beta$. To obtain stable bootstrap confidence intervals Efron & Tibshirani (1986) suggested generating at least $B = 1000$ bootstrap samples. The collection of $B$ bootstrap estimates, $\hat{\beta}^*_1, \dots, \hat{\beta}^*_B$, is ordered and the interval endpoints are read off at positions that depend on the bootstrap confidence level of interest.

For a fixed set of $B$ bootstrap samples, a percentile confidence interval (PI) could be constructed by simply taking the $100(\alpha/2)$th and $100(1 - \alpha/2)$th percentiles of the bootstrap distribution as endpoints. However, the bias-corrected and accelerated bootstrap confidence interval (BCa) improves on the percentile interval, giving better coverage for distributions of $\hat{\beta}$ that may be biased and/or skewed (Efron & Tibshirani, 1986). The BCa interval endpoints are also computed by using percentiles of the bootstrap distribution, but the percentiles used are adjusted. Two quantities, the acceleration, which corrects for skewness, and the bias-correction, are required in the construction of the BCa interval. The bias-correction deals with any lack of symmetry that might exist among the bootstrap estimates, $\hat{\beta}^*$, around $\hat{\beta}$, the estimate based on the original observed data. The acceleration constant is computed using statistics obtained from a jackknife approach. Note that for family data the jackknife is performed by deleting each family in turn rather than each individual. Further details can be found in Efron & Tibshirani (1993).
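The BCa construction can be sketched as follows, using the standard formulas described in Efron & Tibshirani (1993). The function name, the rule for turning a probability into an order statistic, and the clamping of the bias-correction proportion are assumptions made for this illustration. The jackknife values are expected to come from leave-one-family-out refits.

```python
from statistics import NormalDist

def bca_interval(theta_hat, boot, jack, alpha=0.05):
    """BCa interval from bootstrap estimates and leave-one-family-out estimates."""
    nd = NormalDist()
    b = len(boot)
    # Bias-correction: based on the share of bootstrap estimates below theta_hat.
    prop = sum(v < theta_hat for v in boot) / b
    prop = min(max(prop, 1.0 / (b + 1)), b / (b + 1.0))  # keep inv_cdf defined
    z0 = nd.inv_cdf(prop)
    # Acceleration: skewness of the jackknife estimates.
    jbar = sum(jack) / len(jack)
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6.0 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = num / den if den > 0 else 0.0

    def adjusted(q):
        z = nd.inv_cdf(q)
        return nd.cdf(z0 + (z0 + z) / (1.0 - a * (z0 + z)))

    ordered = sorted(boot)

    def pick(q):
        # Assumed convention: the int(q * (B + 1))-th ordered value.
        k = min(b - 1, max(0, int(q * (b + 1)) - 1))
        return ordered[k]

    return pick(adjusted(alpha / 2)), pick(adjusted(1 - alpha / 2))

# Hypothetical check: a symmetric bootstrap distribution centred on the estimate.
boot_values = [i / 100 for i in range(-500, 500)]
jack_values = [-1.0, 0.0, 1.0]
lower, upper = bca_interval(0.0, boot_values, jack_values)
```

When the bias-correction and the acceleration are both zero, the adjusted percentiles equal $\alpha/2$ and $1 - \alpha/2$, and the BCa interval reduces to the percentile interval.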

Application To Cardiovascular Data Set

The Data

We have illustrated the use of GEE versus the bootstrap for population-based association studies of families, using a subset of data from a study to investigate factors associated with cardiovascular disease risk (unpublished data provided by R. Hegele, University of Western Ontario). These data comprise 129 subjects in 33 nuclear families ascertained via a community-wide screening. All individuals in the study population were invited to participate independent of disease status or family history, and subsequent family units were included where, whenever possible, all available siblings within a family were asked to participate. The participation rate was 71%.

In addition to binary disease status, which represented a severe form of disease, potential explanatory variables available for analysis were age, sex and body mass index (BMI). A binary genotype indicator for a specific candidate genotype was of primary interest and occurred in 31% of the sample. There were 56 males and 73 females with a combined average age of 43 years and a combined average BMI of 29.14. For the phenotype, disease prevalence among participants was 50% overall, and 41% and 58% in males and females, respectively.

Results

GEE and bootstrap approaches were used to analyze the cardiovascular data based on the logistic regression model

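$$\mathrm{logit}(p_{ij}) = \beta_0 + \beta_1 \mathrm{CG}_{ij} + \beta_2 \mathrm{age}_{ij} + \beta_3 \mathrm{age}_{ij}^2 + \beta_4 \mathrm{sex}_{ij} + \beta_5 \mathrm{BMI}_{ij},$$

where $p_{ij}$ is the probability of severe disease for individual $j$ in family $i$, $i = 1, \dots, 33$, $j = 1, \dots, n_i$, CG represents the candidate genotype indicator, and sex is an indicator of female sex. The age and BMI variables represent centred values.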

Table 1 presents the regression parameter estimates based on ordinary logistic regression, and 95% confidence intervals and confidence interval widths for regression parameter estimates based on the GEE and bootstrap approaches. For all methods the 95% confidence interval for the candidate genotype effect excluded zero, whereas the 95% confidence intervals for the BMI and quadratic age effects included zero, indicating lack of statistical significance at the 5% level. Associations with linear age and sex were significant at the 5% level for all methods except the second bootstrap strategy (S2). The confidence intervals for S2 were wider than those of the other methods for every coefficient, a result that is consistent with observations from the simulation study reported in the following section. The simulation results indicate that the confidence interval coverage for the second bootstrap strategy is generally closer to nominal.

Table 1. Logistic regression coefficient estimates and 95% confidence intervals (with interval width) for regression coefficients using GEE, S1 (bootstrap strategy 1) and S2 (bootstrap strategy 2) for the cardiovascular study data set

| Variable | Coefficient estimate | GEE 95% CI (width) | S1 95% CI (width) | S2 95% CI (width) |
|---|---|---|---|---|
| CG | 1.489 | [0.520, 2.459] (1.939) | [0.371, 2.470] (2.099) | [0.070, 2.640] (2.570) |
| Age | 0.028 | [0.006, 0.050] (0.044) | [0.004, 0.051] (0.047) | [−0.007, 0.061] (0.067) |
| Age² | −0.001 | [−0.003, 0.000] (0.003) | [−0.003, 0.000] (0.003) | [−0.003, 0.001] (0.004) |
| Sex | 0.786 | [0.001, 1.571] (1.570) | [0.029, 1.730] (1.701) | [−0.310, 1.820] (2.130) |
| BMI | 0.006 | [−0.065, 0.078] (0.143) | [−0.067, 0.090] (0.157) | [−0.092, 0.117] (0.209) |
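Because the coefficients in Table 1 are log odds ratios, exponentiating an estimate and its interval endpoints gives the odds ratio scale. The short sketch below does this for the candidate genotype (CG) row, using the values copied from Table 1. The variable and function names are hypothetical.

```python
import math

# CG row of Table 1 (log odds ratio scale).
cg_estimate = 1.489
cg_intervals = {
    "GEE": (0.520, 2.459),
    "S1": (0.371, 2.470),
    "S2": (0.070, 2.640),
}

def to_odds_ratio(log_odds_ratio):
    """Convert a log odds ratio to an odds ratio."""
    return math.exp(log_odds_ratio)

cg_odds_ratio = to_odds_ratio(cg_estimate)
cg_or_intervals = {
    method: (to_odds_ratio(lo), to_odds_ratio(hi))
    for method, (lo, hi) in cg_intervals.items()
}

for method, (lo, hi) in cg_or_intervals.items():
    print(f"{method}: OR = {cg_odds_ratio:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

An interval on the log scale that excludes zero corresponds to an odds ratio interval that excludes one, which is the case for the CG row under all three methods.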

Simulation Study

A Multifactorial Model for Family Data

Complex genetic disease typically has a multifactorial pattern of inheritance, consisting of many genes having small additive effects (polygenes) and/or a small number of genes each having a major effect (oligogenes), as well as environmental factors that can influence the expression of disease. For the simulation study we developed a multifactorial disease model to randomly generate family data that included a candidate gene, an environmental effect with familial sharing, and a residual component that exhibited the effects of unmeasured polygenes and unmeasured environmental factors. Various scenarios were examined in sample sizes of 25, 50, and 100, including one with a gene-environment interaction that we report in detail. We examined a relatively small number of families because this is precisely the situation in which GEE confidence intervals perform poorly, and we wanted to evaluate the bootstrap strategies in the challenging conditions of small samples, variable family size, and high residual correlation. As the number of families increases, we expect the differences in the methods to be less dramatic, although differences could persist when higher target coverage levels (e.g. 99% or 99.9%) are of interest, and more bootstrap replicates would be required.

The structure of each nuclear family comprised two parents and a random number of offspring drawn from a sibship size distribution. Sibship sizes were generated from a geometric distribution with parameter 0.4551 (Suarez & Van Eerdewegh, 1984), which produced a mean sibship size of 2.2 (1/0.4551 ≈ 2.2). For each family member individual-specific covariates were also simulated, including sex, genotypes for a candidate gene, and an environmental factor.
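Sibship sizes of this kind can be drawn by inverse transform sampling. The sketch assumes that 0.4551 is the success probability of a geometric distribution on 1, 2, 3, ..., which is the reading that gives a mean of 1/0.4551, about 2.2. The function name is hypothetical.

```python
import math
import random

P_SIBSHIP = 0.4551  # assumed to be the geometric success probability

def draw_sibship_size(rng, p=P_SIBSHIP):
    """Geometric draw on 1, 2, 3, ... with P(K = k) = p * (1 - p) ** (k - 1)."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return int(math.log(u) / math.log(1.0 - p)) + 1

rng = random.Random(11)
sizes = [draw_sibship_size(rng) for _ in range(20000)]
mean_size = sum(sizes) / len(sizes)
```

Every simulated family has at least one offspring under this parameterization, and the average of `sizes` should be close to 2.2.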

The genetic component in the multifactorial model consisted of a bi-allelic candidate gene. The parental genotypes were simulated with predefined population allele frequencies. The genotypes for the offspring were then assigned following Mendelian transmission probabilities assuming random parental mating types. In this situation, the candidate gene covariate has intracluster correlation, ρX, that is positive within families and contributes to variance inflation of the associated parameter estimate (Bull et al. 2001). One would expect sharing of alleles between siblings and between parents and offspring, because the children's genes are determined solely by parental genes and governed by transmission probabilities. The value of ρX when the covariate X is a binary candidate gene indicator has an expected value that ranges between inline image and inline image for siblings. When spouses are not related the expected correlation for parents will be close to 0.
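The genotype generation described here can be sketched as follows. The sketch assumes Hardy-Weinberg proportions for the parents (two independent allele draws) and codes genotypes as the number of A alleles. The allele frequency of 0.20 is taken from Table 2 and is assumed to refer to allele A. Function names are hypothetical.

```python
import random

def draw_parent_genotype(rng, freq_a=0.20):
    """Number of A alleles (0, 1 or 2) from two independent allele draws."""
    return sum(rng.random() < freq_a for _ in range(2))

def transmit(genotype, rng):
    """Allele passed to a child: 1 for A, 0 for a."""
    if genotype == 1:                    # heterozygote: either allele, equal odds
        return int(rng.random() < 0.5)
    return genotype // 2                 # homozygote: always the same allele

def draw_offspring_genotype(father, mother, rng):
    """Mendelian transmission: one allele from each parent."""
    return transmit(father, rng) + transmit(mother, rng)

rng = random.Random(5)
father = draw_parent_genotype(rng)
mother = draw_parent_genotype(rng)
sibship = [draw_offspring_genotype(father, mother, rng) for _ in range(3)]
```

Siblings share the same two parental genotypes, which is the source of the positive within-family correlation in the candidate gene covariate. The parents are drawn independently, so their genotypes are uncorrelated in expectation.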

A quantitative environmental factor (EF) was generated assuming that each family had some shared exposure to an environmental influence. For each family a random normal variate with mean $\mu_I$ and variance $\sigma^2_I$ was assigned to represent the mean family environmental exposure. Individual EF levels were then generated from a second normal distribution, using the family EF level for its mean and variance $\sigma^2_J < \sigma^2_I$ so that there was less within-family variability relative to between-family variability. This induced positive correlation in EF among members of the same family. The environmental factor ($\mathrm{EF}_{ij}$) for the $j$th individual in the $i$th family can be expressed as a value arising from a one-way ANOVA model with random effects,

$$\mathrm{EF}_{ij} = \mu_I + ef_i + e_{ij},$$

where $ef_i \sim N(0, \sigma^2_I)$ and $e_{ij} \sim N(0, \sigma^2_J)$. In the case of equal-sized families the intrafamilial correlation in EF is

$$\rho_{EF} = \frac{\sigma^2_I}{\sigma^2_I + \sigma^2_J}.$$
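The two-level generation of EF and the resulting intrafamilial correlation can be checked numerically. In this sketch the family-level mean of 1 and variance of 0.25 follow Table 2 (reading 0.25 as a variance is an assumption), while the within-family variance of 0.10 is a hypothetical value chosen only to satisfy the stated ordering of the two variances. Function names are also hypothetical.

```python
import random

def simulate_ef(family_sizes, rng, mu=1.0, var_between=0.25, var_within=0.10):
    """Family mean drawn first, then individual EF values around that mean."""
    data = []
    for n in family_sizes:
        family_mean = rng.gauss(mu, var_between ** 0.5)
        data.append([rng.gauss(family_mean, var_within ** 0.5) for _ in range(n)])
    return data

def intrafamilial_correlation(var_between, var_within):
    """Between-family variance as a share of the total variance."""
    return var_between / (var_between + var_within)

def pair_correlation(pairs):
    """Pearson correlation between the two members of each pair."""
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(3)
pairs = simulate_ef([2] * 5000, rng)            # 5000 families of two members
theoretical = intrafamilial_correlation(0.25, 0.10)
empirical = pair_correlation(pairs)
```

With these illustrative variances the two values should agree closely, since any two members of a family share the same family-level deviation and differ only through their individual terms.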

A residual familial component was also included in the multifactorial model. This family residual was generated as a correlated vector, $Q_i$, from a multivariate normal distribution, $Q_i \sim \mathrm{MVN}(0, \Sigma_i)$, with mean vector zero and a variance-covariance matrix, $\Sigma_i = \sigma^2 K_i$. The family variance-covariance matrix reflects the residual familial correlation structure (in addition to that due to known covariates) among spouses, parent-offspring pairs, and sibs, and is equal to a constant $\sigma^2$ multiplied by the $i$th family correlation matrix ($K_i$) (Joe, 1997). The residual component for each family contained correlations that were small for spouses but larger for related members in the family, that is, parent-offspring and siblings.

Thus, two sources of familial correlation were incorporated into the simulated data. One source was at the covariate level, since family members share common genes and exposure to an environmental factor, and the other source of correlation was at the residual level reflecting the presence of other unmeasured genes or shared environmental factors.

A logistic regression model was used to simulate correlated binary outcomes for each family i of size ni such that

$$\mathrm{logit}(p_{ij}) = \beta_0 + \beta_1 \mathrm{CG}_{ij} + \beta_2 \mathrm{EF}_{ij} + \beta_3 \mathrm{CG}_{ij} \times \mathrm{EF}_{ij} + q_{ij}, \qquad (2)$$

where $p_{ij}$ is the probability of being affected with the disease of interest, $q_{ij}$ is an element from the correlated vector $Q_i$, $\beta_1$ = log odds ratio for the effect of the candidate gene (CG), $\beta_2$ = log odds ratio for the effect of the environmental factor (EF), and $\beta_3$ = log odds ratio for the interaction effect of CG and EF. The genotypes comprised alleles A and a for the candidate gene (CG) and were coded categorically based on additive inheritance: AA = 2, Aa = 1, aa = 0.

Disease status was randomly determined from the disease probability $p_{ij}$. A uniform random number ($D \sim \mathrm{Unif}(0,1)$) was generated and individual $j$ in family $i$ was classified as affected if $p_{ij}$ was above $D$ or not affected if $p_{ij}$ was below $D$. The logistic regression model in (2) allows correlated binary data to be generated with a general familial dependence structure. This is known as a multivariate logit-normal model (Joe, 1997), because for each family a vector of probabilities, $p_i = (p_{i1}, \dots, p_{in_i})$, is assumed to arise from a multivariate logit normal distribution with parameters $\mu_i$ and $\Sigma_i = (\sigma_{jk})$. It has been suggested by Joe (1997) that $\sigma^2$ of the family variance-covariance matrix $\Sigma_i = \sigma^2 K_i$ be set large to obtain a wide range of dependence.
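The step from the linear predictor to a binary disease status can be sketched as below. The slope values follow Table 2. The intercept is not reported in the text, so the value used here is purely hypothetical, as are the function names and the example covariate values. The residual is passed in as a number, so generation of the correlated vector itself is outside this sketch.

```python
import math
import random

def disease_probability(cg, ef, q, beta0, beta1=1.0, beta2=1.0, beta3=3.0):
    """Inverse logit of the linear predictor of model (2) plus the residual q."""
    eta = beta0 + beta1 * cg + beta2 * ef + beta3 * cg * ef + q
    if eta >= 0:                         # numerically stable inverse logit
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def draw_status(p, rng):
    """Affected (1) if p exceeds a Unif(0, 1) draw, otherwise unaffected (0)."""
    return 1 if p > rng.random() else 0

rng = random.Random(7)
BETA0 = -6.0                             # hypothetical intercept, not from the source
p_example = disease_probability(cg=1, ef=1.2, q=0.5, beta0=BETA0)
status = draw_status(p_example, rng)
```

Comparing $p_{ij}$ with a uniform draw is equivalent to a Bernoulli draw with success probability $p_{ij}$, so family members are conditionally independent given their residuals and covariates.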

Each simulated family and their data were generated independently using model (2). The familial residual vector, $Q_i$, induces heterogeneity between families, producing a cluster (family)-specific effect in the data. Thus, model (2) is essentially a cluster-specific model. In contrast, population-averaged models such as the GEE do not account for the cluster heterogeneity. For linear models the distinction between the population-averaged (PA) approach and the cluster-specific (CS) approach is not important, since both will yield the same estimates and have the same interpretations of the regression coefficients (Neuhaus, 1992). However, with non-linear models for discrete data such as logistic regression, population-averaged and cluster-specific models will provide different interpretations for the regression coefficients. For a binary covariate, $X_{ij}$, a population-averaged regression coefficient would be interpreted as the log odds ratio for disease comparing those in the population with $X_{ij} = 1$ to those in the population with $X_{ij} = 0$. Conversely, a cluster-specific regression coefficient would be interpreted as the log odds ratio for disease comparing family member $j$ with $X_{ij} = 1$ to another family member $j'$ within the same family $i$ with $X_{ij'} = 0$. Zeger et al. (1988) and Neuhaus et al. (1991) showed that population-averaged regression coefficients estimated from a marginal model are attenuated when fitted to data generated from a cluster-specific model, and become more attenuated as the correlation in the $y_{ij}$'s increases.

In population-based association studies interest centres on population-averaged effects, where the fact that families are included may be only a matter of convenience. The GEE and bootstrap methods we used are population-averaged approaches. To evaluate and compare the two methods, values of population-averaged parameters were required. This was accomplished by generating a large population of 100,000 individuals using model (2), where $\mu_j$ represented the explained portion (the linear predictor) of model (2). The parameters were estimated using an ordinary logistic model, since asymptotically the coefficients are consistent in both the independent and correlated data settings (Liang & Zeger, 1986). As discussed by Neuhaus et al. (1992), the maximum likelihood estimate for the population-averaged parameter, $\hat{\beta}_{PA}$, estimated under a misspecified model (that is, a population-averaged model when the cluster-specific model is the true one) converges to the value $\beta_{PA}$, which minimizes the Kullback-Leibler divergence between a cluster-specific and a population-averaged model. Therefore, the value of $\hat{\beta}_{PA}$ in a very large sample simulation can be taken as the true $\beta_{PA}$. Once the data were simulated, parameter estimates describing genetic and environmental effects were compared to the underlying population-averaged values, $\beta_{PA}$.

For the simulation study 1000 replicates of 25 families and 500 replicates of 50 and 100 families were generated to compare the bootstrap and the GEE methods. Several scenarios were considered in simulations with similar patterns emerging (Shin, 1998). In this report, we present a scenario in which there were small effects from the candidate gene and environmental factor but a larger effect from the gene-environment interaction. Table 2 provides the parameter values used. We assumed a baseline population prevalence of 10%.

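Table 2. Parameters used in simulation study: $\beta_i$ = the cluster-specific parameter and $\beta_{PAi}$ = population-averaged value

  • Scenario: small CG effect, small EF effect, large G × E effect, moderate residual effect
  • Allele frequency = 0.20
  • Cluster-specific parameters: $\beta_1 = 1$, $\beta_2 = 1$, $\beta_3 = 3$
  • Residual variance: $\sigma^2 = 25$
  • Environmental factor: $\mathrm{EF}_i \sim N(1, 0.25)$
  • Population-averaged values: $\beta_{PA1} = 0.31$, $\beta_{PA2} = 0.28$, $\beta_{PA3} = 0.95$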
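The size of the attenuation in Table 2 can be roughly checked against a well-known closed-form approximation for logistic models with a normal random effect, due to Zeger et al. (1988): $\beta_{PA} \approx \beta_{CS} / \sqrt{1 + c^2\sigma^2}$ with $c = 16\sqrt{3}/(15\pi)$. The study itself obtained its population-averaged values from a large simulated population, not from this formula, so the sketch below is only a plausibility check. The function name is hypothetical.

```python
import math

C = 16 * math.sqrt(3) / (15 * math.pi)  # constant in the logistic-normal approximation

def approx_population_averaged(beta_cs, sigma2):
    """Approximate attenuation of a cluster-specific log odds ratio."""
    return beta_cs / math.sqrt(1.0 + C * C * sigma2)

SIGMA2 = 25.0                                     # residual variance from Table 2
cluster_specific = {"beta1": 1.0, "beta2": 1.0, "beta3": 3.0}
reported_pa = {"beta1": 0.31, "beta2": 0.28, "beta3": 0.95}

approx_pa = {name: approx_population_averaged(value, SIGMA2)
             for name, value in cluster_specific.items()}

for name in cluster_specific:
    print(f"{name}: approx {approx_pa[name]:.2f}, reported {reported_pa[name]:.2f}")
```

The approximation gives values close to those reported in Table 2, which supports reading the roughly threefold shrinkage as a consequence of the large residual variance on the logit scale.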

The family correlation matrix, Ki of (2), represents unmeasured genetic and/or shared environmental effects. Because Ki is included on the logit scale, the residual correlations of the binary outcome are attenuated from the residual correlation on the logit scale. The residual correlations were therefore set higher on the logit scale to achieve a desired correlation level on the binary response scale. The residual correlations on the logit scale were set to 0.80 for both siblings and parent-offspring. The residual correlation on the logit scale between spouses was set to 0.30 to represent shared environment only. For example, for a family with 2 parents and 3 offspring, Ki would look like

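$$K_i = \begin{pmatrix}
1 & 0.30 & 0.80 & 0.80 & 0.80 \\
0.30 & 1 & 0.80 & 0.80 & 0.80 \\
0.80 & 0.80 & 1 & 0.80 & 0.80 \\
0.80 & 0.80 & 0.80 & 1 & 0.80 \\
0.80 & 0.80 & 0.80 & 0.80 & 1
\end{pmatrix},$$

with rows and columns ordered as the two parents followed by the three offspring.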

Correlations between spouses, parent-offspring and siblings for binary disease status were estimated from the simulations generated, with resulting values of 0.18, 0.42 and 0.48, respectively.

Measures of Performance

In some datasets of 25 and 50 families parameter estimates could not be determined due to lack of convergence to finite values. Therefore, the number of nondegenerate datasets, denoted by ND, was used to assess the GEE and bootstrap methods using the following measures of performance.

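The true variance, computed as the sample variance of $\hat{\beta}$ over all replications, is

$$\mathrm{var}(\hat{\beta}) = \frac{1}{ND - 1} \sum_{s=1}^{ND} \left( \hat{\beta}^{(s)} - \bar{\hat{\beta}} \right)^2$$

where $\hat{\beta}^{(s)}$ is one of $\hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$ from simulation $s$ and where $\bar{\hat{\beta}} = \frac{1}{ND} \sum_{s=1}^{ND} \hat{\beta}^{(s)}$. Note that $\overline{\widehat{\mathrm{var}}}(\hat{\beta}) = \frac{1}{ND} \sum_{s=1}^{ND} \widehat{\mathrm{var}}(\hat{\beta}^{(s)})$ is the average of the variance estimates for the coefficient estimate computed for each replication.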
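As a minimal sketch, the two variance summaries can be computed from per-replicate results as follows. The function name `variance_summary` and the toy numbers are hypothetical, and the use of the ND − 1 denominator for the sample variance is an assumption.

```python
from statistics import mean, variance

def variance_summary(estimates, variance_estimates):
    """Summarise ND nondegenerate replicates for one coefficient.

    estimates          : coefficient estimates, one per replicate
    variance_estimates : estimated variances of those estimates
                         (robust GEE or bootstrap), one per replicate
    Returns (empirical variance, average variance estimate, their ratio).
    """
    empirical = variance(estimates)      # sample variance, ND - 1 denominator
    average = mean(variance_estimates)   # mean of per-replicate variance estimates
    return empirical, average, average / empirical

# Toy numbers only, not taken from the study.
empirical, average, ratio = variance_summary(
    [1.0, 2.0, 3.0], [0.5, 0.7, 0.9]
)
print(empirical, average, ratio)
```

A ratio below 1 indicates that the method's variance estimates are, on average, smaller than the variability actually observed across replicates.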

Confidence interval coverage was chosen as the primary criterion to compare performance of the methods. GEE confidence intervals were based on the assumed asymptotic normality of the coefficient estimates. Therefore, all 95% confidence intervals for the GEE estimates were constructed using z-statistics taken from the standard normal distribution along with the standard error estimates computed from the GEE using the robust variance estimator. The bootstrap BCa intervals were constructed based on 1000 bootstrap replications.
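The normal-theory GEE interval described above amounts to the estimate plus or minus a standard normal quantile times the robust standard error. A minimal sketch, with a hypothetical function name and made-up inputs:

```python
from statistics import NormalDist

def wald_interval(estimate, robust_se, level=0.95):
    """Two-sided normal-theory confidence interval for one coefficient."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for level = 0.95
    return estimate - z * robust_se, estimate + z * robust_se

# Illustrative values only; not estimates from the study.
lower, upper = wald_interval(estimate=1.0, robust_se=0.5)
print(round(lower, 3), round(upper, 3))
```

The interval is symmetric about the estimate by construction, which is one reason it can perform poorly when the sampling distribution of the estimate is skewed.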

The coverage probability in each simulation was computed as the proportion of 95% confidence intervals out of ND data sets that contained the population-averaged parameter. Each coverage probability estimate was tested (using a two-tailed test) against the null hypothesis $H_0$: $p_{\mathrm{coverage}} = 0.95$ to determine if the observed coverage probability was significantly different from the nominal level of 95%. The confidence interval length averaged over all replicates was computed as a measure of overall precision of the interval under different conditions.
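The coverage calculation can be sketched as below. The text does not state which two-tailed test was used, so the normal approximation to the binomial in this sketch is an assumption, as are the function name and the toy intervals.

```python
import math
from statistics import NormalDist

def coverage_summary(intervals, true_value, nominal=0.95):
    """Coverage, two-sided test against the nominal level, and mean length.

    intervals  : list of (lower, upper) pairs, one per nondegenerate data set
    true_value : population-averaged parameter the intervals should contain
    """
    nd = len(intervals)
    covered = sum(1 for lo, hi in intervals if lo <= true_value <= hi)
    coverage = covered / nd
    # Normal approximation to the binomial under H0 (an assumption here).
    se0 = math.sqrt(nominal * (1.0 - nominal) / nd)
    z = (coverage - nominal) / se0
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    mean_length = sum(hi - lo for lo, hi in intervals) / nd
    return coverage, p_value, mean_length

# Toy example: 90 intervals that cover 0.95 and 10 that do not.
toy = [(0.0, 2.0)] * 90 + [(1.0, 3.0)] * 10
coverage, p_value, mean_length = coverage_summary(toy, true_value=0.95)
print(coverage, round(p_value, 3), mean_length)
```

With only a few hundred replicates, an exact binomial test may give a slightly different verdict for borderline coverage values.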

The number of observed degenerate data sets from simulations generated under the alternative hypothesis was minimal. For the parameters in Table 2, 1.9% of data sets were degenerate for 25 families, 0.2% for 50 families and none for 100 families. Occurrences of degenerate data in the bootstrap samples were addressed by the one-step estimators.

Simulation Results

Table 3 shows the variance of the coefficient estimates over all replicates ($\mathrm{var}(\hat{\beta})$) and the average of the variance estimates, $\overline{\widehat{\mathrm{var}}}(\hat{\beta})$, for the methods considered. As expected, the variance of the parameter estimates decreased as the number of families increased. For the GEE, under the independence working correlation assumption, all of the average robust variance estimates were less than $\mathrm{var}(\hat{\beta})$, indicating that the robust standard errors are too small, especially in small samples. For bootstrap strategy 1 (S1) the average of the variance estimates was smaller than $\mathrm{var}(\hat{\beta})$. In most cases, for bootstrap strategy 2 (S2) the averages of variance estimates were larger than $\mathrm{var}(\hat{\beta})$.

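Table 3.  Simulation-based variance estimate of $\hat{\beta}$ and means of variance estimates across simulations for GEE (using independence working correlation), S1 (bootstrap strategy 1) and S2 (bootstrap strategy 2)

| Effect | Number of Families | $\mathrm{var}(\hat{\beta})$ | Mean variance estimate, GEE | Mean variance estimate, S1 | Mean variance estimate, S2 |
| --- | --- | --- | --- | --- | --- |
| β1 (CG) | 25 | 5.91 | 3.91 | 5.63 | 6.52 |
| | 50 | 2.16 | 1.73 | 2.00 | 2.33 |
| | 100 | 0.95 | 0.83 | 0.89 | 1.03 |
| β2 (EF) | 25 | 2.97 | 2.16 | 2.64 | 3.00 |
| | 50 | 1.27 | 1.05 | 1.13 | 1.27 |
| | 100 | 0.62 | 0.52 | 0.54 | 0.61 |
| β3 (CG×EF) | 25 | 5.46 | 3.59 | 5.20 | 6.04 |
| | 50 | 2.05 | 1.60 | 1.86 | 2.16 |
| | 100 | 0.91 | 0.77 | 0.82 | 0.95 |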

With respect to empirical confidence interval coverage (Table 4), the most striking finding was the consistent undercoverage of the GEE for all sample sizes. The bootstrap intervals had closer to nominal 95% coverage. As expected, strategy 2 tended to have higher coverage than strategy 1, corresponding to larger variance estimates of the coefficients and longer average interval lengths (data not shown).

Table 4.  Observed coverage of 95% confidence intervals across simulations for GEE (using independence working correlation), S1 (bootstrap strategy 1) and S2 (bootstrap strategy 2). Asterisks denote coverage probabilities that are significantly different from the nominal level of 95%

| Effect | Number of Families | GEE | S1 | S2 |
| --- | --- | --- | --- | --- |
| β1 (CG) | 25 | 0.890* | 0.952 | 0.970* |
| | 50 | 0.910* | 0.952 | 0.980* |
| | 100 | 0.924* | 0.944 | 0.960 |
| β2 (EF) | 25 | 0.898* | 0.931* | 0.953 |
| | 50 | 0.922* | 0.926* | 0.946 |
| | 100 | 0.924* | 0.916* | 0.934 |
| β3 (CG×EF) | 25 | 0.886* | 0.951 | 0.964 |
| | 50 | 0.916* | 0.944 | 0.968 |
| | 100 | 0.926* | 0.942 | 0.972* |

It is interesting to note that for the continuous environmental risk factor the S1 coverage was less than nominal but the S2 coverage was close to nominal, whereas for the categorical candidate gene risk factor, with only three levels, the S2 coverage was above nominal. The observed S2 overcoverage for the candidate gene may be explained as a consequence of resampling, with replacement, among families. Sparse samples with less variation among risk factor values could be subject to finite-sample bias (away from the null), leading to large one-step-approximation estimates and ultimately to greater variability for the associated regression parameter estimate.

In general, the bootstrap confidence intervals had longer lengths corresponding to the higher coverage probabilities. Although the GEE produced shorter intervals than the bootstrap, this does not imply that the GEE intervals were more accurate. For the GEE, use of a standard normal assumption to construct the intervals, based on asymptotic distributions of estimates, may not always be appropriate. The bootstrap, on the other hand, uses the data at hand to construct intervals and thus is more robust to nonnormal or skewed parameter estimate distributions.
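The resampling step that distinguishes the two bootstrap strategies can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the family data structure and the function name `resample_families` are hypothetical, parents are assumed to be kept as observed under both strategies, and the refitting of the regression model and the BCa interval construction are omitted.

```python
import random

def resample_families(families, strategy, rng):
    """Draw one bootstrap sample of families.

    S1: sample whole families with replacement, leaving each family intact.
    S2: sample families with replacement, then resample the offspring of
        each selected family with replacement (same number of offspring).
    """
    sample = []
    for _ in range(len(families)):
        fam = rng.choice(families)          # stage 1: pick a family unit
        if strategy == "S1":
            offspring = list(fam["offspring"])
        elif strategy == "S2":
            # stage 2: resample offspring within the selected family
            offspring = [rng.choice(fam["offspring"])
                         for _ in range(len(fam["offspring"]))]
        else:
            raise ValueError("strategy must be 'S1' or 'S2'")
        sample.append({"parents": list(fam["parents"]), "offspring": offspring})
    return sample

# Toy families with made-up member labels.
families = [
    {"parents": ["f1", "m1"], "offspring": ["a", "b", "c"]},
    {"parents": ["f2", "m2"], "offspring": ["d"]},
    {"parents": ["f3", "m3"], "offspring": ["e", "g"]},
]
rng = random.Random(0)
s1 = resample_families(families, "S1", rng)
s2 = resample_families(families, "S2", rng)
print(s1)
print(s2)
```

Because S2 adds a second layer of resampling, its bootstrap replicates vary more than those of S1, which is consistent with the larger variance estimates and longer intervals reported for S2.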

Discussion

It is evident from the simulations that, for all sample sizes examined, the BCa intervals for both bootstrap strategies perform better than the standard GEE approach with respect to coverage probabilities. Furthermore, the bootstrap strategies do not require the underlying correlation structure to be specified. Strategy 2 of the bootstrap yielded confidence intervals that tended to be conservative (i.e. too wide) for the candidate gene effect, where there were few possible covariate values. On the other hand, strategy 2 worked well for the continuous covariate, whereas strategy 1 tended to be anti-conservative. Overall, in nuclear families with individual-level covariates, such as candidate genes that vary within families, we recommend the strategy 2 bootstrap method over the GEE method since it performs well for the sample sizes examined (25 to 100 families). A drawback of the bootstrap is its greater computational time relative to GEE; its better performance therefore comes with some modest additional cost in computation.

In most cases, the GEE confidence intervals produced poor coverage probabilities. In our evaluations, we focussed on the use of the independence working correlation matrix, and based the bootstrap confidence intervals on regression estimates obtained under this assumption, but the bootstrap method could be applied just as well to GEE estimates obtained under other correlation structures. Efficiency would be improved to the extent that the actual correlation structure within families would be close to that specified (Lipsitz et al. 1994; Fitzmaurice, 1995). In particular, for a study that includes only groups of siblings, the use of an exchangeable working correlation would be sensible, and bootstrap confidence intervals could be based on coefficients estimated under this assumption. Nevertheless, the underlying problem with GEE in small samples is failure of asymptotic approximations, and improvements in efficiency of parameter estimates may not reduce the bias in the associated standard error estimates. Additional simulations we conducted using an exchangeable working correlation supported this contention. Although the efficiency of the regression estimates increased, and the confidence intervals were somewhat shorter in length, the usual GEE intervals tended to ‘undercover’ in this case as well. In studies that involve extended families the correlation structure would be more complex, and an appropriate working correlation matrix more difficult to specify (see for example Bull et al. 1995). Alternatively, the bootstrap approach could be modified for multi-stage nested designs relevant to the extended family structure. A modified hierarchical bootstrap may be more suitable than the GEE to handle more complex correlation structures, as well as smaller numbers of extended families. However, the presence of multi-generational families and missing data would also complicate the bootstrap method.

The GEE and bootstrap confidence intervals evaluated in this report are appropriate in cross-sectional or cohort designs for association studies in which certain assumptions are valid. First, it is assumed that families are sampled at random and that no bias has been introduced by non-random ascertainment. Second, individuals from different families are also assumed to be independent. This assumption might be violated, for example, when families from isolated communities with complex kinship relationships are recruited. Third, the consistency results for the regression parameters require accurate marginal mean specification, i.e., no important covariates have been omitted. Finally, the potential for confounding due to hidden population stratification must also be considered (Devlin & Roeder, 1999).

In the marginal GEE model we consider for the evaluation of candidate genes, the interpretation of the regression coefficients does not depend on the number or on the characteristics of the family members. However, the fact that there is variation in the cluster size (here the number of family members observed) does raise some additional issues. We assume that unobserved family members are missing completely at random and that disease status, the outcome of interest, is not associated with family size. Within-cluster resampling (Hoffman et al. 2001) offers an alternative approach to obtain valid inferences when this assumption is violated, while Williamson et al. (2003) proposed an equivalent weighted GEE to address this situation. If, however, the observed family size is related to disease status (i.e. is informative), or the relationship between disease status and genotype varies with family size, these latter authors also pointed out that the cluster weighting associated with a non-independence GEE working correlation would produce a different marginal model. More recently, Neuhaus & McCulloch (2006) demonstrated that correlation between family response means and covariates can lead to substantially biased estimates of marginal association. Further research is warranted to evaluate the impact of these conditions on bias and efficiency in settings in which covariates such as candidate genes vary within family clusters, but a randomly selected individual from the population is of interest.

Acknowledgements

This work was supported by project grants from the Canadian Network of Centres of Excellence in Mathematics (MITACS) and the Natural Sciences and Engineering Research Council (Canada). JS was supported by the Ontario Student Opportunity Trust Fund. SBB is a Senior Investigator of the Canadian Institutes of Health Research.

References
