A Multifactorial Model for Family Data
Complex genetic disease typically has a multifactorial pattern of inheritance, consisting of many genes having small additive effects (polygenes) and/or a small number of genes each having a major effect (oligogenes), as well as environmental factors that can influence the expression of disease. For the simulation study we developed a multifactorial disease model to randomly generate family data that included a candidate gene, an environmental effect with familial sharing, and a residual component that exhibited the effects of unmeasured polygenes and unmeasured environmental factors. Various scenarios were examined in sample sizes of 25, 50, and 100, including one with a gene-environment interaction that we report in detail. We examined a relatively small number of families because this is precisely the situation in which GEE confidence intervals perform poorly, and we wanted to evaluate the bootstrap strategies in the challenging conditions of small samples, variable family size, and high residual correlation. As the number of families increases, we expect the differences in the methods to be less dramatic, although differences could persist when higher target coverage levels (e.g. 99% or 99.9%) are of interest, and more bootstrap replicates would be required.
The structure of each nuclear family comprised two parents and a random number of offspring assumed to follow a sibship size distribution. Sibship sizes were generated from a geometric distribution with parameter 0.4551 (Suarez & Van Eerdewegh, 1984), which produced a mean sibship size of 2.2. For each family member, individual-specific covariates were also simulated, including sex, genotypes for a candidate gene, and an environmental factor.
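As a minimal sketch of the sibship step, the snippet below draws sibship sizes with NumPy. It assumes the geometric distribution is supported on 1, 2, ... (so every family has at least one offspring), the parameterization under which a value of 0.4551 corresponds to a mean of 1/0.4551, or about 2.2. The function name and the seed are illustrative and not taken from the study.

```python
import numpy as np

def simulate_sibship_sizes(n_families, p=0.4551, seed=0):
    """Draw one sibship size per family from a geometric distribution on 1, 2, ..."""
    rng = np.random.default_rng(seed)
    # NumPy's geometric counts trials up to the first success, so values start at 1.
    return rng.geometric(p, size=n_families)

sizes = simulate_sibship_sizes(200_000)
mean_size = sizes.mean()  # should be close to 1 / 0.4551
```

A large number of draws is used here only to make the sample mean stable; the study itself used 25, 50, or 100 families per replicate.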
The genetic component in the multifactorial model consisted of a bi-allelic candidate gene. The parental genotypes were simulated with predefined population allele frequencies. The genotypes for the offspring were then assigned following Mendelian transmission probabilities assuming random parental mating types. In this situation, the candidate gene covariate has intracluster correlation, ρX, that is positive within families and contributes to variance inflation of the associated parameter estimate (Bull et al. 2001). One would expect sharing of alleles between siblings and between parents and offspring, because the children's genes are determined solely by parental genes and governed by transmission probabilities. The value of ρX when the covariate X is a binary candidate gene indicator has an expected value that ranges between and for siblings. When spouses are not related the expected correlation for parents will be close to 0.
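The genotype step can be sketched as below. The sketch assumes Hardy-Weinberg proportions for the parents (two independent allele draws per parent), that the allele frequency refers to allele A, and that genotypes are stored as the number of A alleles; the function names are hypothetical.

```python
import numpy as np

def transmit_genotypes(rng, father, mother, n_offspring):
    """Offspring genotypes (count of allele A) given parental genotypes 0, 1 or 2."""
    # Each parent transmits A with probability genotype / 2, i.e. 0, 0.5 or 1.
    from_father = rng.binomial(1, father / 2.0, size=n_offspring)
    from_mother = rng.binomial(1, mother / 2.0, size=n_offspring)
    return from_father + from_mother

def simulate_family_genotypes(rng, n_offspring, freq_A=0.20):
    """Return (parent_genotypes, offspring_genotypes) for one nuclear family."""
    # Assumption: unrelated parents, each with two independent allele draws.
    parents = rng.binomial(2, freq_A, size=2)
    offspring = transmit_genotypes(rng, parents[0], parents[1], n_offspring)
    return parents, offspring
```

For example, `transmit_genotypes(rng, 2, 0, 3)` always returns heterozygous offspring, because an AA parent must transmit A and an aa parent must transmit a.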
A quantitative environmental factor (EF) was generated assuming that each family had some shared exposure to an environmental influence. For each family a random normal variate with mean μI and variance σ2I was assigned to represent the mean family environmental exposure. Individual EF levels were then generated from a second normal distribution, using the family EF level for its mean and variance σ2J < σ2I so that there was less within-family variability relative to between-family variability. This induced positive correlation in EF among members of the same family. The environmental factor (EFij) for the jth individual in the ith family can be expressed as a value arising from a one-way ANOVA model with random effects,
$$EF_{ij} = \mu_I + ef_i + e_{ij},$$
where efi∼N(0, σ2I) and eij∼N(0, σ2J). In the case of equal-sized families the intrafamilial correlation in EF is
$$\rho_{EF} = \frac{\sigma_I^2}{\sigma_I^2 + \sigma_J^2}.$$
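A short simulation can illustrate how the shared family component induces correlation in EF. The sketch below assumes a family mean of 1 and a between-family variance of 0.25, as listed for EF in Table 2; the within-family variance of 0.10 is a hypothetical value chosen only to satisfy the requirement that it be smaller than the between-family variance. The empirical correlation between two members is compared with the ratio of the between-family variance to the total variance.

```python
import numpy as np

def simulate_ef(rng, n_families, family_size, mu_I=1.0, var_I=0.25, var_J=0.10):
    """Simulate EF values with a shared family effect plus individual error."""
    family_effect = rng.normal(0.0, np.sqrt(var_I), size=(n_families, 1))
    individual_error = rng.normal(0.0, np.sqrt(var_J), size=(n_families, family_size))
    # Broadcasting adds the same family effect to every member of a family.
    return mu_I + family_effect + individual_error

def ef_icc(var_I, var_J):
    """Intrafamilial correlation: between-family variance over total variance."""
    return var_I / (var_I + var_J)

rng = np.random.default_rng(0)
ef = simulate_ef(rng, n_families=20_000, family_size=4)
empirical = np.corrcoef(ef[:, 0], ef[:, 1])[0, 1]  # members 1 and 2 across families
theoretical = ef_icc(0.25, 0.10)
```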
A residual familial component was also included in the multifactorial model. This family residual was generated as a correlated vector, Qi, from a multivariate normal distribution, Qi∼MVN(0, Σi), with mean vector zero and variance-covariance matrix Σi=σ2Ki. The family variance-covariance matrix reflects the residual familial correlation structure (in addition to that due to known covariates) among spouses, parent-offspring pairs, and sibs, and is equal to a constant σ2 multiplied by the ith family correlation matrix (Ki) (Joe, 1997). The residual component for each family contained correlations that were small for spouses but larger for related members in the family, that is, parent-offspring and siblings.
Thus, two sources of familial correlation were incorporated into the simulated data. One source was at the covariate level, since family members share common genes and exposure to an environmental factor, and the other source of correlation was at the residual level reflecting the presence of other unmeasured genes or shared environmental factors.
A logistic regression model was used to simulate correlated binary outcomes for each family i of size ni such that
$$\operatorname{logit}(p_{ij}) = \log\frac{p_{ij}}{1 - p_{ij}} = \beta_0 + \beta_1 CG_{ij} + \beta_2 EF_{ij} + \beta_3 \, CG_{ij} \times EF_{ij} + q_{ij}, \qquad j = 1, \ldots, n_i, \tag{2}$$
where pij is the probability of being affected with the disease of interest, qij is an element from the correlated vector Qi, β0 is the intercept, β1= log odds ratio for the effect of the candidate gene (CG), β2= log odds ratio for the effect of the environmental factor (EF), and β3= log odds ratio for the interaction effect of CG and EF. The genotypes comprised alleles A and a for the candidate gene (CG) and were coded numerically under an additive model of inheritance: AA= 2, Aa= 1, aa= 0.
Disease status was randomly determined from the disease probability pij. A uniform random number (D∼ Unif(0,1)) was generated and individual j in family i was classified as affected if pij was above D or not affected if pij was below D. The logistic regression model in (2) allows correlated binary data to be generated with a general familial dependence structure. This is known as a multivariate logit-normal model (Joe, 1997), because for each family a vector of probabilities, pi= (pi1, …, pini), is assumed to arise from a multivariate logit-normal distribution with parameters μi and Σi= (σjk). Joe (1997) suggested that σ2 in the family variance-covariance matrix Σi=σ2Ki be set large to obtain a wide range of dependence.
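The generation step for a single family can be sketched as follows. To keep the example small it assumes a family of two parents and one offspring, uses the logit-scale residual correlations reported for the simulation study (0.30 between spouses, 0.80 between parent and offspring) with σ2 = 25, and takes an intercept of -5 purely for illustration, since the intercept is not reported here. The function name and the genotype and EF values passed in are hypothetical.

```python
import numpy as np

def simulate_trio_status(rng, cg, ef, beta, sigma2=25.0):
    """Simulate disease status for one family ordered father, mother, offspring."""
    b0, b1, b2, b3 = beta
    # Residual correlation matrix on the logit scale for a trio.
    K = np.array([[1.0, 0.3, 0.8],
                  [0.3, 1.0, 0.8],
                  [0.8, 0.8, 1.0]])
    # Correlated residual vector Q ~ MVN(0, sigma2 * K) via a Cholesky factor.
    chol = np.linalg.cholesky(sigma2 * K)
    q = chol @ rng.standard_normal(3)
    # Linear predictor with main effects, interaction, and family residual.
    eta = b0 + b1 * cg + b2 * ef + b3 * cg * ef + q
    p = 1.0 / (1.0 + np.exp(-eta))
    # Affected when the disease probability exceeds a Unif(0, 1) draw.
    d = rng.uniform(size=3)
    return (p > d).astype(int), p

rng = np.random.default_rng(0)
cg = np.array([1, 0, 1])          # hypothetical genotypes (count of allele A)
ef = np.array([1.2, 0.9, 1.1])    # hypothetical environmental factor levels
status, prob = simulate_trio_status(rng, cg, ef, beta=(-5.0, 1.0, 1.0, 3.0))
```

Families with more offspring would need a larger correlation matrix, built the same way and checked for positive definiteness before factoring.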
Each simulated family and its data were generated independently using model (2). The familial residual vector, Qi, induces heterogeneity between families, producing a cluster (family)-specific effect in the data. Thus, model (2) is essentially a cluster-specific model. In contrast, population-averaged models such as the GEE do not account for the cluster heterogeneity. For linear models the distinction between the population-averaged (PA) approach and the cluster-specific (CS) approach is not important, since both will yield the same estimates and have the same interpretations of the regression coefficients (Neuhaus, 1992). However, with non-linear models for discrete data such as logistic regression, population-averaged and cluster-specific models will provide different interpretations for the regression coefficients. For a binary covariate, Xij, a population-averaged regression coefficient would be interpreted as the log odds ratio for disease comparing those in the population with Xij= 1 to those in the population with Xij= 0. Conversely, a cluster-specific regression coefficient would be interpreted as the log odds ratio for disease comparing family member j with Xij= 1 to another family member j′ within the same family i who has Xij′= 0. Zeger et al. (1988) and Neuhaus et al. (1991) showed that population-averaged regression coefficients estimated from a marginal model are attenuated when fitted to data generated from a cluster-specific model, and become more attenuated as the correlation in the yij's increases.
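The size of this attenuation can be gauged with a widely used approximation associated with Zeger et al. (1988) for logistic models with a normal residual of variance σ2 on the logit scale: the population-averaged coefficient is roughly the cluster-specific coefficient divided by sqrt(1 + c²σ2), with c = 16√3/(15π). The sketch below evaluates that factor; it is an approximation offered for intuition and is not necessarily how the values in this study were obtained.

```python
import math

def attenuation_factor(sigma2):
    """Approximate ratio of population-averaged to cluster-specific coefficients."""
    c = 16.0 * math.sqrt(3.0) / (15.0 * math.pi)
    return 1.0 / math.sqrt(1.0 + c * c * sigma2)

factor = attenuation_factor(25.0)  # about 0.32 for a residual variance of 25
```

With σ2 = 25 the factor is about 0.32, broadly in line with the ratios of population-averaged to cluster-specific values listed in Table 2.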
In population-based association studies interest centres on population-averaged effects, where the fact that families are included may be only a matter of convenience. The GEE and bootstrap methods we used are population-averaged approaches. To evaluate and compare the two methods, values of the population-averaged parameters were required. These were obtained by generating a large population of 100,000 individuals using model (2), where μj represented the explained portion (the linear predictor) of model (2). The parameters were estimated using an ordinary logistic model, since asymptotically the coefficients are consistent in both the independent and correlated data settings (Liang & Zeger, 1986). As discussed by Neuhaus et al. (1992), the maximum likelihood estimate for the population-averaged parameter, β̂PA, estimated under a misspecified model (that is, a population-averaged model when the cluster-specific model is the true one) converges to the value βPA, which minimizes the Kullback-Leibler divergence between a cluster-specific and a population-averaged model. Therefore, the value of β̂PA in a very large sample simulation can be taken as the true βPA. Once the data were simulated, parameter estimates describing genetic and environmental effects were compared to the underlying population-average values, βPA.
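The large-sample approach can be illustrated with a simplified sketch. It assumes a single standard-normal covariate with a cluster-specific coefficient of 1, a zero intercept, and independent normal residuals with variance 25, rather than the full family model; the ordinary logistic fit is done with a small Newton-Raphson routine written for the example (the function name is hypothetical).

```python
import numpy as np

def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Ordinary logistic regression by Newton-Raphson; returns the coefficients."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - p)                          # score vector
        hess = X.T @ (X * (p * (1.0 - p))[:, None])   # observed information
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(0)
n = 100_000
x = rng.standard_normal(n)
q = rng.normal(0.0, 5.0, size=n)                  # residual with variance 25
p_cs = 1.0 / (1.0 + np.exp(-(1.0 * x + q)))       # cluster-specific probabilities
y = (rng.uniform(size=n) < p_cs).astype(float)
X = np.column_stack([np.ones(n), x])
beta_pa_hat = fit_logistic(X, y)                  # slope is well below 1
```

The fitted slope is markedly smaller than the generating coefficient of 1, which is the attenuation that makes separate population-averaged reference values necessary.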
For the simulation study 1000 replicates of 25 families and 500 replicates of 50 and 100 families were generated to compare the bootstrap and the GEE methods. Several scenarios were considered in simulations with similar patterns emerging (Shin, 1998). In this report, we present a scenario in which there were small effects from the candidate gene and environmental factor but a larger effect from the gene-environment interaction. Table 2 provides the parameter values used. We assumed a baseline population prevalence of 10%.
Table 2. Parameters used in simulation study: βi= the cluster-specific parameter and βPAi= population-averaged value
|Quantity|Value|
|Allele frequency|0.20|
|Scenario|Small CG effect, small EF effect, large G × E effect, moderate residual effect|
|Simulation parameters|β1= 1, β2= 1, β3= 3, σ2= 25, EFi∼N(1, 0.25)|
|Population-averaged values|βPA1= 0.31, βPA2= 0.28, βPA3= 0.95|
The family correlation matrix, Ki of (2), represents unmeasured genetic and/or shared environmental effects. Because Ki is included on the logit scale, the residual correlations of the binary outcome are attenuated relative to the residual correlations on the logit scale. The residual correlations were therefore set higher on the logit scale to achieve a desired correlation level on the binary response scale. The residual correlations on the logit scale were set to 0.80 for both siblings and parent-offspring. The residual correlation on the logit scale between spouses was set to 0.30 to represent shared environment only. For example, for a family with 2 parents and 3 offspring, with rows and columns ordered as the two parents followed by the three offspring, Ki would look like
$$K_i = \begin{pmatrix} 1 & 0.30 & 0.80 & 0.80 & 0.80 \\ 0.30 & 1 & 0.80 & 0.80 & 0.80 \\ 0.80 & 0.80 & 1 & 0.80 & 0.80 \\ 0.80 & 0.80 & 0.80 & 1 & 0.80 \\ 0.80 & 0.80 & 0.80 & 0.80 & 1 \end{pmatrix}$$
Correlations between spouses, parent-offspring and siblings for binary disease status were estimated from the simulations generated, with resulting values of 0.18, 0.42 and 0.48, respectively.
Measures of Performance
In some datasets of 25 and 50 families parameter estimates could not be determined due to lack of convergence to finite values. Therefore, the number of nondegenerate datasets, denoted by ND, was used to assess the GEE and bootstrap methods using the following measures of performance.
Confidence interval coverage was chosen as the primary criterion to compare performance of the methods. GEE confidence intervals were based on the asymptotic normality assumption of the distribution of the coefficients. Therefore, all 95% confidence intervals for the GEE estimates were constructed using z-statistics taken from the standard normal distribution along with the standard error estimates computed from the GEE using the robust variance estimator. The bootstrap BCa intervals were constructed based on 1000 bootstrap replications.
The coverage probability in each simulation was computed as the proportion of 95% confidence intervals out of ND data sets that contained the population-averaged parameter. Each coverage probability estimate was tested (using a two-tailed test) against the null hypothesis H0: pcoverage= 0.95 to determine if the observed coverage probability was significantly different from the nominal level of 95%. The confidence interval length averaged over all replicates was computed as a measure of overall precision of the interval under different conditions.
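These performance measures can be sketched in a few lines. The Wald interval follows the z-based construction described for the GEE; the two-tailed test of coverage is assumed here to be a normal-approximation test of a binomial proportion, since the exact test used is not specified. Function names and any counts passed in are hypothetical.

```python
import math

Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution

def wald_ci(estimate, robust_se, z=Z_975):
    """95% Wald interval from a coefficient estimate and its robust standard error."""
    return estimate - z * robust_se, estimate + z * robust_se

def coverage_test(n_covered, n_nondegenerate, nominal=0.95):
    """Return (coverage, two-sided p-value) for H0: coverage equals the nominal level."""
    coverage = n_covered / n_nondegenerate
    # Standard error of a proportion under the null hypothesis.
    se0 = math.sqrt(nominal * (1.0 - nominal) / n_nondegenerate)
    z = (coverage - nominal) / se0
    # Two-sided normal tail probability, 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return coverage, p_value
```

For instance, `coverage_test(900, 1000)` corresponds to 90% observed coverage in 1000 nondegenerate data sets and yields a very small p-value, which would be flagged as significantly different from nominal.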
The number of observed degenerate data sets from simulations generated under the alternative hypothesis was minimal. For the parameters in Table 2, 1.9% were degenerate in 25 families, 0.2% for 50 and none for 100 families. Occurrences of degenerate data in the bootstrap samples were addressed by the one-step estimators.
Table 3. Simulation-based variance estimate of β̂ and means of variance estimates across simulations for GEE (using independence working correlation), S1 (bootstrap strategy 1) and S2 (bootstrap strategy 2)
|Effect|Number of Families|var(β̂)|
With respect to empirical confidence interval coverage (Table 4), the most striking finding was the consistent undercoverage of the GEE for all sample sizes. The bootstrap intervals had closer to nominal 95% coverage. As expected, strategy 2 tended to have higher coverage than strategy 1, corresponding to larger variance estimates of the coefficients and longer average interval lengths (data not shown).
Table 4. Observed coverage of 95% confidence intervals across simulations for GEE (using independence working correlation), S1 (bootstrap strategy 1) and S2 (bootstrap strategy 2). Asterisks denote coverage probabilities that are significantly different from the nominal level of 95%
|Effect|Number of Families|GEE|S1|S2|
It is interesting to note that for the continuous environmental risk factor the S1 coverage was less than nominal but the S2 coverage was close to nominal, whereas for the categorical candidate gene risk factor, with only three levels, the S2 coverage was above nominal. The observed S2 overcoverage for the candidate gene may be explained as a consequence of resampling, with replacement, among families. Sparse samples with less variation among risk factor values could be subject to finite-sample bias (away from the null), leading to large one-step-approximation estimates and ultimately to greater variability for the associated regression parameter estimate.
In general, the bootstrap confidence intervals had longer lengths corresponding to the higher coverage probabilities. Although the GEE produced shorter intervals than the bootstrap, this does not imply that the GEE intervals were more accurate. For the GEE, use of a standard normal assumption to construct the intervals, based on asymptotic distributions of estimates, may not always be appropriate. The bootstrap, on the other hand, uses the data at hand to construct intervals and thus is more robust to nonnormal or skewed parameter estimate distributions.