Analysis of Family- and Population-Based Samples Using Multiple Linkage Disequilibrium Mapping

Authors

  • Yen-Feng Chiu,

    Corresponding author
    • Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Taiwan, ROC
  • Chun-Yi Lee,

    1. Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Taiwan, ROC
  • Hui-Yi Kao,

    1. Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Taiwan, ROC
  • Wen-Harn Pan,

    1. Division of Health Services and Preventive Medicine, Institute of Population Health Sciences, National Health Research Institutes, Taiwan, ROC
  • Fang-Chi Hsu

    1. Department of Biostatistical Sciences, Division of Public Health Sciences, Wake Forest School of Medicine, Winston-Salem, NC

Corresponding author: Yen-Feng Chiu, 35 Keyan Rd., Zhunan, Miaoli 350, Taiwan, ROC. Tel: +886-37-246-266 ext. 36107; Fax: +886-37-586-467; E-mail: yfchiu@nhri.org.tw

Summary

We report two methods for linkage disequilibrium mapping that incorporate covariates through parametric modeling and make use of combined case-parent trios and unrelated case and/or control data. The two proposed combined methods were used to map the disease locus of hypertension in the angiotensin-converting enzyme (ACE) gene with incorporation of ACE activity. Relative to the estimates from the trios study, the efficiencies in estimating the disease locus in the hybrid study increased by 351- and 100-fold for the two proposed methods, respectively; relative to the case-control study, they changed by 1.4- and 0.4-fold, respectively. In both the simulations and the hypertension study, the efficiency of disease locus estimates based on the hybrid data was greatly improved compared to case-parent trio studies alone. These newly developed methods preserve the advantages of the previous methods, including flexible modeling and assessment of gene-gene and gene-covariate effects, while providing more power by using all the data combined. The computing program for analysis using the separate and hybrid data sets is freely available on the author's website.

Introduction

Case-control designs are commonly utilized in association studies, since information from unrelated subjects is relatively easy to obtain. This, in turn, facilitates collection of a sample size sufficient to identify genes with small effects that contribute to the risk of complex disease. There are concerns, however, that association analysis of unrelated cases and controls cannot distinguish a true association from spurious associations due to confounding effects, such as population stratification. To avoid such spurious associations, trio (or triad) designs have been recommended (Falk & Rubinstein, 1987). These designs, which include an affected child plus two parents, involve cases and “pseudo” controls that are matched for genetic ancestry and are, therefore, robust to population stratification. Unfortunately, recruitment of parental controls often proves more challenging and expensive than recruitment of unrelated controls. Difficulties in recruiting parental controls can, therefore, contribute to insufficient sample size for identifying genes with small effects on disease risk. Thus, situations may arise where a study collects both parental and unrelated controls for association analysis rather than choosing one specific type of control. For example, studies might collect a sample of trios to confirm previous association results that were identified in case-control studies (Epstein et al., 2005). Also, studies recruiting both parental and unrelated controls might focus on utilization of parental data, since use of unrelated controls may yield spurious associations due to stratification. Unrelated controls, however, do provide an opportunity to increase power in cases where the sample size of trios falls below the anticipated number. It is also possible that probands could be selected from a subset of a large-scale observational study cohort. Families would then be recruited for linkage and association analyses. All subjects from the original study might subsequently be included in a genome-wide association study. In this scenario, the entire sample for association mapping would consist of trios, unrelated controls, and perhaps unrelated cases.

Nagelkerke et al. (2004) proposed a joint analysis of such data using a likelihood-based approach, demonstrating that power is increased in comparison to methods that analyze trios and unrelated subjects separately. Epstein et al. (2005) showed that power is improved after relaxing the assumption of parental mating-type distribution compared to the more general parental mating-type used by Nagelkerke et al. (2004). In addition, they developed formal tests to determine whether data types can be combined, as well as methods for the analysis of incomplete triad data. Schaid and Rowland (1998) developed a pooled score statistic for the combination of family, sibling, and case-control data, which, however, does not account for correlated transmissions of parental alleles to multiple affected offspring in the presence of linkage between the disease and marker loci. Dudbridge (2008) combined allelic relative risk estimates for case-control and trio samples by multinomial regression based on a retrospective likelihood. For combined case-control and affected sibpair family samples, this method accounts for linkage through a conditioning step based on the inheritance vector. Guo et al. (2009) developed a combined haplotype relative risk association test to combine trios and unrelated subjects in a homogeneous population. These approaches are likelihood-based. Kazeem and Farrall (2005) developed a pooled χ2 test with a weighted combined log odds ratio (OR) estimate based on single-sample estimates from case-control and trios data. Glaser and Holmans (2009) compared the power of six combined methods through simulation studies and concluded that the combined statistics were more powerful than single-sample statistics when risk parameters were similar in single samples. They concluded that the four methods by Schaid and Rowland (1998), Nagelkerke et al. (2004), Kazeem and Farrall (2005), and Dudbridge (2008) had similar performance when combining unrelated cases/controls with trios (TDTCC design), as they are all likelihood-based methods. In addition, they found that all combined analysis methods under the TDTCC scenario had better or equal (adjusted) power than single-sample statistics and the pooled χ2-statistics. Chen and Lin (2008) used a weighted least-squares approach to combine separate information from case-parent/case-sibling and case-unrelated control analyses. Compared to the likelihood-based approaches, this method is relatively simple and does not require assumptions or estimations about the mating-type distribution. Infante-Rivard et al. (2009) reviewed and summarized these methods and described their key features. Hsu et al. (2009) proposed a pseudo-likelihood approach that can flexibly accommodate multiple genetic markers, environmental covariates, and gene-gene and gene-environment interactions for hybrid data; the hybrid data consist of cases, parents of cases, and unrelated individuals, and the approach is readily extended to allow for nonoverlapping subsets of case-parents and case-unrelated controls.

In association mapping, precise localization of disease susceptibility genes is of primary importance for studying genes and gene-gene and gene-environment interactions. The multipoint linkage disequilibrium (LD) mapping proposed by Liang et al. (2001b) for trio data is robust to the underlying genetic model, since its only assumption about that model is that a single disease locus is present in the chromosomal region. Rather than testing hypotheses to detect the disease locus, as regular association analysis approaches do, this approach provides estimates of the disease locus location (τ), its genetic effect (C), and their variances through the generalized estimating equation (GEE) method (Zeger & Liang, 1986). This is accomplished by simultaneously incorporating multiple markers, similar to the situation in which individuals have repeated observations in a longitudinal study. Therefore, a 95% confidence interval can also be estimated for the disease locus location and for its genetic effect. Chiu et al. (2008, 2010a) extended this approach to incorporate covariates into the LD mapping, based on either a trios or a case-control study. A significant covariate indicates that the genetic effect is associated with that covariate. In the present study, we developed two new methods that extend this approach to improve the efficiency of the disease locus estimate by analyzing combined case-parent trios and unrelated case-control data with incorporation of covariates.

Materials and Methods

Background

Multipoint LD mapping approaches have been applied to case-parent trio designs (Liang et al., 2001b) and case-control study designs (Liang & Chiu, 2005). The main goal of these approaches is to utilize multiple single nucleotide polymorphisms (SNPs) simultaneously to estimate the location of an unobserved disease locus, along with its confidence interval, for further fine mapping. This differs from most other methods, which focus on testing only. Before introducing the new methods that extend these approaches to combined case-parent trio and case-control data, we review the previous work.

Consider a region R framed by M SNP markers located at inline image that have been genotyped in study subjects. Let t be an arbitrary location in R. In this particular model, we assume that a marker at location t has been genotyped, with inline image being the “target” or putative “high risk” allele.

Case-Parent Trios Designs

The transmission statistic for each SNP is computed as follows. Define inline image as the paternal preferential-transmission statistic inline image (Liang et al., 2001b), where

display math(1)

Similarly, the maternal preferential-transmission statistic can be defined by inline image. A potential problem arises in doubly heterozygous matings with a heterozygous offspring, as maternal and paternal transmissions cannot be determined. Nevertheless, such trios can still be used: because the expectation of the preferential-transmission statistic is identical for both parental sides (see below), either assignment of the paternal and maternal transmissions theoretically makes the same contribution to the score function (equation (9)). Trios in which any individual has a missing genotype at position t are removed from consideration when constructing the transmission counts at position t. Under the assumptions that there is only one disease locus within R, initial complete LD, random mating, and stable inline image over time, the expectation of the preferential-transmission statistic can be written as

display math(2)

where Φ is the event that the sampled offspring is affected, inline image is the recombination fraction between marker inline image and disease locus τ, inline image is the high-risk allele and inline image the normal allele at the disease locus τ, and N is the number of generations since the initiation of the disease variant. The expectation of the transmission statistics is a function of the parameters τ, C, and N (equation (2)). The GEE approach is then applied to estimate the parameters based on the observed values of Y(t) and X(t) and equation (2). The genetic effect of τ in the trios design, when associated with covariates, is characterized by inline image, which is the excess in probability of a target allele being transmitted to the affected offspring versus not being transmitted, given the covariates. Thus, in the adjusted model, the genetic effect of τ is a function of the covariates, inline image, where inline image is a vector of p covariates (Chiu et al., 2010a).

Case-Control Designs

The transmission statistic in case-control designs is defined as follows. Define

display math(3)

Similarly, one can define the indicator variables, inline image and inline image, for the controls (Liang & Chiu, 2005). For case-control data, we are, in fact, unable to determine these quantities, as we cannot tell from which parent a heterozygous case or control received each allele. However, we can determine the sum W(t) = Y(t) + X(t), and it is this sum that is used below. Under the same assumptions as specified previously,

display math

one has

display math(4)

where inline image inline image is the excess probability of carrying the risk allele for a case compared to a control, given the covariates; and inline image (D, inline image), where D denotes cases and inline image denotes controls (Liang & Chiu, 2005). Similar to the trio design, the expectation of the transmission statistic is a function of the parameters C, N, and τ. Note that inline image is, on average, equal to zero if the marker t is in linkage equilibrium with the disease locus. Since the expectations of Y(t) and X(t) are identical, we define W(t) = Y(t) + X(t); thus, the expectation of W(t) equals twice the expectation of Y(t) or X(t), i.e., E(W(t)) = 2E(Y(t)) = 2E(X(t)). The GEE approach is then applied to estimate the parameters based on the observed values of W(t) and equation (4). The parameter C, which measures the genetic effect of τ, is a function of covariates Z, where Z = (Z1, Z2, …, Zp)T is a vector of p covariates in the adjusted model. Covariates for population stratification can be included in the case-control study for adjustment. For the trio design, it is not necessary to adjust for population stratification as long as it is independent of the location of the disease locus. For a quantitative covariate, Z could be the difference within each case-control pair (Chiu et al., 2010a). The unrelated cases and controls can be either matched or unmatched. For the regular unmatched case-control design, all possible case-control pair combinations were considered when computing the transmission statistic (Liang & Chiu, 2005) (see also equation (13) in the appendix).

Incorporating covariates

In the following section, we discuss the method for incorporating covariates into the procedure in more detail. We applied parametric models (Chiu et al., 2008, 2010a) in the association mapping, where C was modeled through logistic regression:

display math(5)

where Z is a vector of covariates, inline image; C(z) could be the genetic effect from the combined study, or inline image and inline image from the case-parent trios and case-control studies, respectively (Chiu et al., 2010a). For a trios design, inline image represents the probability that an affected child receives the risk allele at τ from his or her heterozygous parent. For a case-control design, inline image is the probability that a case receives the risk allele at τ from his or her parent, while his or her “paired” control does not receive the risk allele. Note that the “paired” case-control data are similar to pseudo cases (when the target allele is transmitted) and pseudo controls (when the target allele is not transmitted) in the transmissions of trios.

Combined Designs

When both trios and case-control data are available, it is more efficient to use all the data simultaneously. To enhance the efficiency of estimates for the disease locus location in the association mapping in this circumstance, we developed two methods to obtain estimates, using the combined data. Method 1 is a weighted average of estimates from the two separate studies. Method 2 derives its estimate from the combined score functions of the two separate studies.

  • Method 1: weighted average of estimates from separate studies.

Let δ = (τ, N) denote the common parameters for the combined datasets, where τ represents the disease locus location and N is a nuisance parameter indicating the number of generations since the initiation of the disease variant. δ̂Tr and δ̂CC are the estimates of δ from the GEE approach in a trio (Tr) study and a case-control (CC) study, respectively. The estimator for δ is given by

display math(6)

where

display math
display math

and

display math
  • Method 2: estimates from the combined score functions from separate studies

We also propose another method to estimate δ by solving inline image, where inline image is the GEEs for δ. inline image and inline image are the GEEs in a trio (Tr) study and a case-control (CC) study, respectively.

The detailed theoretical derivations for these two methods are included in the Appendix. Note that the estimates from Methods 1 and 2 are not always identical. To determine whether combining the two studies is legitimate, one can test whether the disease locus locations from the trios (τTr) and case-control (τCC) studies are identical (namely, H0: τTr = τCC).
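As an illustration of the idea behind Method 1 and of the equality test, the following minimal sketch combines two independent estimates of the same parameter vector by inverse-variance (precision) weighting and computes a Wald-type statistic for equal locus locations. The inverse-variance weights, the independence of the trio and case-control samples, and the function names `combine_estimates` and `wald_equal_location` are assumptions made for this example; the weights actually used in equation (6) are those derived in the Appendix and may differ.

```python
import math
import numpy as np


def combine_estimates(est_tr, cov_tr, est_cc, cov_cc):
    """Precision-weighted average of two independent estimates.

    Assumption: inverse-variance weights (not necessarily the paper's
    equation (6)); est_* are parameter vectors such as (tau, N), cov_*
    their estimated covariance matrices.
    """
    est_tr = np.atleast_1d(np.asarray(est_tr, dtype=float))
    est_cc = np.atleast_1d(np.asarray(est_cc, dtype=float))
    prec_tr = np.linalg.inv(np.atleast_2d(np.asarray(cov_tr, dtype=float)))
    prec_cc = np.linalg.inv(np.atleast_2d(np.asarray(cov_cc, dtype=float)))
    # Covariance of the combined estimate is the inverse of the summed precisions.
    cov_comb = np.linalg.inv(prec_tr + prec_cc)
    est_comb = cov_comb @ (prec_tr @ est_tr + prec_cc @ est_cc)
    return est_comb, cov_comb


def wald_equal_location(tau_tr, var_tr, tau_cc, var_cc):
    """Wald z statistic and two-sided normal p-value for H0: tau_Tr = tau_CC.

    Assumption: the two estimates come from independent samples, so the
    variance of their difference is the sum of the two variances.
    """
    z = (tau_tr - tau_cc) / math.sqrt(var_tr + var_cc)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical inputs, chosen only to exercise the functions.
tau_hat, tau_cov = combine_estimates([0.44], [[0.0008]], [0.46], [[0.0005]])
z_stat, p_val = wald_equal_location(0.44, 0.0008, 0.46, 0.0005)
print(tau_hat, tau_cov, z_stat, p_val)
```

Under this weighting, the more precise study receives the larger weight, and the variance of the combined estimate is never larger than that of either study alone, which is the sense in which combining improves efficiency.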

Simulation Study

We examined the performance of the proposed methods through simulation studies. A logistic regression model was used as the penetrance function to generate binary disease outcomes for the phenotype data.

display math(7)
display math
display math

where inline image for affected individuals and inline image for unaffected individuals, the parameter vector inline image is the natural logarithm of the ORs, g1 and g2 are the genotypes at the disease locus, H is the high-risk allele, h is the non-high-risk allele, and X is an environmental factor. Note that the assumed genetic effects (ORs of 9 and 5, respectively) are large by the standards of current genetic association studies, which allows a relatively small sample size to be used in this simulation study.

Age at onset was generated for each affected individual (case) i based on his or her genotype at the disease locus. It was incorporated into the association mapping as a covariate (see details for the modeling in the appendix). The age at onset follows a Weibull distribution, that is, the logarithm of age at onset follows an extreme value distribution (Li & Hsu, 2000),

display math

where λ = 0.02, β = 0.1, γ = 40, si is the number of H at τ, and υi follows the standard extreme value distribution, i = 1, … , n.

Two hundred (i.e., the sample size n) case-parent trios and 200 controls were generated for each sample, and 1000 replicates were generated in the simulation study. Ten biallelic markers spanning 0.9 cM, with an equal interval of 0.1 cM between adjacent markers, were simulated. The LD structures in terms of D’ and r2 (Devlin & Risch, 1995) under different scenarios in the simulation study are presented in Table S1a–i. The minor allele frequencies at the marker loci were set at 0.1. The disease locus was located in the middle of the region at 0.45 cM, with a minor allele frequency of 0.1. The test for equivalence of the disease locus from the trios and case-control studies (namely, H0: τTr = τCC) was conducted prior to combining. We computed the estimates from the trio, case-control, and combined studies, as well as the relative efficiencies (the ratio of the inverses of the variance estimates for the disease locus estimates from two different designs) compared to the separate trios or case-control studies while incorporating covariates. To study the impact of sample size on the relative efficiency (RE) in estimating the disease locus, we simulated 50, 100, and 200 units of cases, controls, and trios, respectively. The results are presented in Table 1. To study the impact of disease allele frequency and marker density on estimating the disease locus, we fixed the sample size at 200 (i.e., 200 cases, 200 controls, and 200 trios). The results for various disease allele frequencies (0.05, 0.1, and 0.2) and marker densities (markers located every 0.1, 0.05, or 0.01 cM) are presented in Tables 2 and 3, respectively.

Additionally, the performance of these methods in the presence of population stratification was investigated (Tables 4a and 4b). A total of 200 trios, 200 unrelated cases, and 200 unrelated controls were generated, among which 160 cases and 40 controls had a target allele frequency of 0.6 for the marker located at 0.9 cM (stratum 1), while its allele frequency was 0.1 for the other 40 cases and 160 controls (stratum 2). Three approaches were applied to perform the association mapping in the presence of population stratification: (1) without taking population stratification into account, (2) incorporating a covariate to adjust for the population stratification, and (3) stratifying individuals according to their stratum. For the second approach, the population stratification was adjusted for either in both the trios and case-control studies (Table 4a) or in the case-control study only (Table 4b). The estimates for the disease locus (τ), its genetic effect (C), and the 95% coverage probability, along with the covariate's coefficient (β) and their variance estimates, are shown in Tables 1–4. The empirical P-value derived from the Wald statistic for the regression coefficient (β) is also provided. In the absence of covariates, one can test whether C = 0 based on the estimate of C and its standard error from the GEE approach. In the presence of covariates, C is a function of covariates, and one can apply the bootstrap method to estimate the 95% confidence interval (CI) for C (Chiu et al., 2010b).
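To see why the stratified design described above produces a spurious signal in unrelated cases and controls, the following small sketch computes the expected pooled target-allele frequency at the 0.9 cM marker for each group. It treats the stated frequencies (0.6 in stratum 1, 0.1 in stratum 2) as expected values and ignores sampling variation; the helper name `pooled_allele_frequency` is hypothetical.

```python
def pooled_allele_frequency(strata):
    """Expected allele frequency in a group pooled over strata.

    `strata` is a list of (number_of_subjects, allele_frequency) pairs.
    """
    total_subjects = sum(n for n, _ in strata)
    return sum(n * freq for n, freq in strata) / total_subjects


# Stratum composition taken from the simulation design in the text.
cases = [(160, 0.6), (40, 0.1)]      # 160 cases in stratum 1, 40 in stratum 2
controls = [(40, 0.6), (160, 0.1)]   # 40 controls in stratum 1, 160 in stratum 2

case_freq = pooled_allele_frequency(cases)
control_freq = pooled_allele_frequency(controls)
print(case_freq, control_freq, case_freq - control_freq)
```

Within each stratum the marker frequency is the same for cases and controls, yet the pooled frequencies come out at about 0.5 for cases and about 0.2 for controls. An unadjusted case-control analysis can therefore be pulled toward the 0.9 cM marker, whereas the trio design, which compares transmitted with nontransmitted parental alleles, is not affected by this imbalance.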

Table 1. Impact of sample sizes on estimating the disease locus using trios, case-control, and combined simulated studies

| | | 50 trios, 50 unrelated controls | | | 100 trios, 100 unrelated controls | | | 200 trios, 200 unrelated controls | | |
| Study design | Statistic | τ | C | β | τ | C | β | τ | C | β |
| Trios | Estimate | 0.45 | 0.20 | −0.19 | 0.45 | 0.20 | −0.19 | 0.45 | 0.19 | −0.19 |
| | Bias | −0.00057 | 0.011 | | −0.0020 | 0.067 | | −0.00049 | 0.0026 | |
| | Sample variance | 0.0026 | 0.0034 | 0.0021 | 0.0013 | 0.0016 | 0.00089 | 0.00069 | 0.00082 | 0.00044 |
| | Mean variance | 0.0035 | | 0.0019 | 0.0016 | | 0.00094 | 0.00074 | | 0.00047 |
| | 95% coverage probability | 0.92 | | | 0.95 | | | 0.96 | | |
| | Empirical P-value for β | | | 1.73E-05 | | | 2.33E-10 | | | <1.00E-18 |
| Case-control | Estimate | 0.45 | 0.21 | −0.19 | 0.45 | 0.20 | −0.19 | 0.45 | 0.20 | −0.19 |
| | Bias | −0.0014 | 0.0096 | | −0.002 | 0.0032 | | −0.00032 | 0.0022 | |
| | Sample variance | 0.0021 | 0.0029 | 0.00088 | 0.00085 | 0.0015 | 0.00040 | 0.00046 | 0.00071 | 0.00020 |
| | Mean variance | 0.0021 | | 0.00079 | 0.0010 | | 0.00041 | 0.00051 | | 0.00021 |
| | 95% coverage probability | 0.91 | | | 0.95 | | | 0.97 | | |
| | Empirical P-value for β | | | 1.40E-10 | | | <1.00E-18 | | | <1.00E-18 |
| Combined (Method 1) | Estimate | 0.45 | 0.21 | −0.19 | 0.45 | 0.20 | −0.19 | 0.45 | 0.20 | −0.19 |
| | Bias | −0.00090 | 0.017 | | −0.0018 | 0.0075 | | −0.00059 | 0.0067 | |
| | Sample variance | 0.0017 | 0.0028 | 0.00084 | 0.00083 | 0.0015 | 0.00039 | 0.00043 | 0.00070 | 0.00019 |
| | Mean variance | 0.0016 | | 0.00078 | 0.00090 | | 0.00040 | 0.00047 | | 0.00021 |
| | RE (vs. trios) | 1.5 | | | 1.6 | | | 1.6 | | |
| | RE (vs. case-control) | 1.2 | | | 1.0 | | | 1.1 | | |
| | 95% coverage probability | 0.90 | | | 0.94 | | | 0.96 | | |
| | Empirical P-value for β | | | 6.43E-11 | | | <1.00E-18 | | | <1.00E-18 |
| Combined (Method 2) | Estimate | 0.45 | 0.21 | −0.19 | 0.45 | 0.20 | −0.19 | 0.45 | 0.20 | −0.19 |
| | Bias | −0.0011 | 0.016 | | −0.0019 | 0.0065 | | −0.00069 | 0.0050 | |
| | Sample variance | 0.0019 | 0.0027 | 0.00094 | 0.00088 | 0.0013 | 0.00045 | 0.00047 | 0.00064 | 0.00022 |
| | Mean variance | 0.0019 | | 0.0010 | 0.0010 | | 0.00051 | 0.00051 | | 0.00026 |
| | RE (vs. trios) | 1.4 | | | 1.5 | | | 1.5 | | |
| | RE (vs. case-control) | 1.1 | | | 1.0 | | | 1.0 | | |
| | 95% coverage probability | 0.91 | | | 0.95 | | | 0.96 | | |
| | Empirical P-value for β | | | 4.62E-10 | | | <1.00E-18 | | | <1.00E-18 |

Note: τ is the disease locus location, C is the genetic effect of τ and is a function of the covariate, β is the regression coefficient between the covariate and C, and RE is relative efficiency.

Table 2. Impact of disease allele frequencies on estimating the disease locus using trios, case-control, and combined simulated studies

| | | Allele frequency 0.05 | | | Allele frequency 0.1 | | | Allele frequency 0.2 | | |
| Study design | Statistic | τ | C | β | τ | C | β | τ | C | β |
| Trios | Estimate | 0.45 | 0.12 | −0.16 | 0.45 | 0.19 | −0.19 | 0.45 | 0.26 | −0.22 |
| | Bias | 0.00056 | −0.0050 | | −0.00049 | 0.0026 | | −0.00010 | 0.027 | |
| | Sample variance | 0.0012 | 0.00056 | 0.00031 | 0.00069 | 0.00082 | 0.00044 | 0.00044 | 0.0011 | 0.00083 |
| | Mean variance | 0.0015 | | 0.00036 | 0.00074 | | 0.00047 | 0.00065 | | 0.00088 |
| | 95% coverage probability | 0.96 | | | 0.96 | | | 0.99 | | |
| | Empirical P-value for β | | | <1.00E-18 | | | <1.00E-18 | | | 1.18E-14 |
| Case-control | Estimate | 0.45 | 0.13 | −0.16 | 0.45 | 0.20 | −0.19 | 0.45 | 0.27 | −0.22 |
| | Bias | 0.00034 | −0.0052 | | −0.00032 | 0.0022 | | −0.00064 | 0.026 | |
| | Sample variance | 0.00073 | 0.00053 | 0.00018 | 0.00046 | 0.00071 | 0.00020 | 0.00031 | 0.0010 | 0.00029 |
| | Mean variance | 0.00093 | | 0.00020 | 0.00051 | | 0.00021 | 0.00045 | | 0.00031 |
| | 95% coverage probability | 0.97 | | | 0.97 | | | 0.99 | | |
| | Empirical P-value for β | | | <1.00E-18 | | | <1.00E-18 | | | <1.00E-18 |
| Combined (Method 1) | Estimate | 0.45 | 0.13 | −0.16 | 0.45 | 0.20 | −0.19 | 0.45 | 0.27 | −0.22 |
| | Bias | 0.00039 | −0.0027 | | −0.00059 | 0.0067 | | −0.00060 | 0.033 | |
| | Sample variance | 0.00070 | 0.00050 | 0.00018 | 0.00043 | 0.00070 | 0.00019 | 0.00029 | 0.0010 | 0.00029 |
| | Mean variance | 0.00086 | | 0.00020 | 0.00047 | | 0.00021 | 0.00041 | | 0.00031 |
| | RE (vs. trios) | 1.8 | | | 1.6 | | | 1.5 | | |
| | RE (vs. case-control) | 1.0 | | | 1.1 | | | 1.1 | | |
| | 95% coverage probability | 0.96 | | | 0.96 | | | 0.98 | | |
| | Empirical P-value for β | | | <1.00E-18 | | | <1.00E-18 | | | <1.00E-18 |
| Combined (Method 2) | Estimate | 0.45 | 0.13 | −0.16 | 0.45 | 0.20 | −0.19 | 0.45 | 0.27 | −0.22 |
| | Bias | 0.00039 | −0.0035 | | −0.00069 | 0.0050 | | −0.00040 | 0.030 | |
| | Sample variance | 0.00079 | 0.00046 | 0.00020 | 0.00047 | 0.00064 | 0.00022 | 0.00032 | 0.00089 | 0.00040 |
| | Mean variance | 0.00098 | | 0.00023 | 0.00051 | | 0.00026 | 0.00044 | | 0.00042 |
| | RE (vs. trios) | 1.6 | | | 1.5 | | | 1.4 | | |
| | RE (vs. case-control) | 0.9 | | | 1.0 | | | 1.0 | | |
| | 95% coverage probability | 0.97 | | | 0.96 | | | 0.98 | | |
| | Empirical P-value for β | | | <1.00E-18 | | | <1.00E-18 | | | <1.00E-18 |

Note: τ is the disease locus location, C is the genetic effect of τ and is a function of the covariate, β is the regression coefficient between the covariate and C, and RE is relative efficiency.

Table 3. Impact of marker density on estimating the disease locus using trios, case-control, and combined simulated studies

Column groups give the distance between two adjacent markers.

| | | 0.1 cM | | | 0.05 cM | | | 0.01 cM | | | 0.003 cM | | |
| Study design | Statistic | τ | C | β | τ | C | β | τ | C | β | τ | C | β |
| Trios | Estimate | 0.45 | 0.19 | −0.19 | 0.45 | 0.19 | −0.19 | 0.45 | 0.20 | −0.19 | 0.45 | 0.20 | −0.19 |
| | Bias | −0.00049 | 0.0026 | | 0.000052 | −0.043 | | 0.00034 | 0.0037 | | −0.00047 | 0.0061 | |
| | Sample variance | 0.00069 | 0.00082 | 0.00044 | 0.00044 | 0.00072 | 0.00039 | 0.00025 | 0.00068 | 0.00034 | 0.00020 | 0.00090 | 0.00040 |
| | Mean variance | 0.00074 | | 0.00047 | 0.00052 | | 0.00042 | 0.00030 | | 0.00033 | 0.00020 | | 0.00040 |
| | 95% coverage probability | 0.96 | | | 0.98 | | | 0.95 | | | 0.94 | | |
| | Empirical P-value for β | | | <1.0E-18 | | | <1.0E-18 | | | <1.0E-18 | | | <1.0E-18 |
| Case-control | Estimate | 0.45 | 0.20 | −0.19 | 0.45 | 0.20 | −0.18 | 0.45 | 0.20 | −0.19 | 0.45 | 0.20 | −0.18 |
| | Bias | −0.00032 | 0.0022 | | 0.00011 | 0.0024 | | 0.000049 | 0.0027 | | 0.000090 | 0.000088 | |
| | Sample variance | 0.00046 | 0.00071 | 0.00020 | 0.00029 | 0.00062 | 0.00016 | 0.00012 | 0.00069 | 0.00012 | 0.00011 | 0.00071 | 0.00014 |
| | Mean variance | 0.00051 | | 0.00021 | 0.00032 | | 0.00017 | 0.00013 | | 0.00011 | 0.00012 | | 0.00011 |
| | 95% coverage probability | 0.97 | | | 0.97 | | | 0.95 | | | 0.94 | | |
| | Empirical P-value for β | | | <1.0E-18 | | | <1.0E-18 | | | <1.0E-18 | | | <1.0E-18 |
| Combined (Method 1) | Estimate | 0.45 | 0.20 | −0.19 | 0.45 | 0.20 | −0.18 | 0.45 | 0.20 | −0.19 | 0.45 | 0.20 | −0.18 |
| | Bias | −0.00059 | 0.0067 | | 0.00011 | 0.0064 | | −0.000087 | 0.0065 | | −8.14E-06 | 0.0038 | |
| | Sample variance | 0.00043 | 0.00070 | 0.00019 | 0.00028 | 0.00061 | 0.00016 | 0.00011 | 0.00069 | 0.00012 | 0.000085 | 0.00070 | 0.00013 |
| | Mean variance | 0.00047 | | 0.00021 | 0.00031 | | 0.00017 | 0.00013 | | 0.00011 | 0.00012 | | 0.00011 |
| | RE (vs. trios) | 1.6 | | | 1.5 | | | 2.2 | | | 2.3 | | |
| | RE (vs. case-control) | 1.1 | | | 1.0 | | | 1.1 | | | 1.3 | | |
| | 95% coverage probability | 0.96 | | | 0.97 | | | 0.96 | | | 0.98 | | |
| | Empirical P-value for β | | | <1.0E-18 | | | <1.0E-18 | | | <1.0E-18 | | | <1.0E-18 |
| Combined (Method 2) | Estimate | 0.45 | 0.20 | −0.19 | 0.45 | 0.20 | −0.18 | 0.45 | 0.20 | −0.19 | 0.45 | 0.20 | −0.18 |
| | Bias | −0.00069 | 0.0050 | | 0.00022 | 0.0052 | | 0.000080 | 0.0060 | | −0.000050 | 0.0044 | |
| | Sample variance | 0.00047 | 0.00064 | 0.00022 | 0.00032 | 0.00057 | 0.00020 | 0.00016 | 0.00060 | 0.00016 | 0.00012 | 0.00065 | 0.00018 |
| | Mean variance | 0.00051 | | 0.00026 | 0.00035 | | 0.00022 | 0.00020 | | 0.00016 | 0.00018 | | 0.00018 |
| | RE (vs. trios) | 1.5 | | | 1.4 | | | 1.5 | | | 1.7 | | |
| | RE (vs. case-control) | 1.0 | | | 0.9 | | | 0.8 | | | 0.9 | | |
| | 95% coverage probability | 0.96 | | | 0.97 | | | 0.97 | | | 0.99 | | |
| | Empirical P-value for β | | | <1.0E-18 | | | <1.0E-18 | | | <1.0E-18 | | | <1.0E-18 |

Note: τ is the disease locus location, C is the genetic effect of τ and is a function of the covariate, β is the regression coefficient between the covariate and C, and RE is relative efficiency.

Table 4a. Impact of population stratification on estimating the disease locus using trios, case-control, and combined simulated studies

Column groups give the adjustment for stratification.

| | | Without adjustment | | | With covariate adjustment | | | | Adjusted via stratifying by known population | | |
| Study design | Statistic | τ | C | β | τ | C | β1 (age onset) | β2 (population) | τ | C | β |
| Trios | Estimate | 0.45 | 0.19 | −0.19 | 0.45 | 0.19 | −0.19 | −0.0077 | 0.45 | 0.19 | −0.19 |
| | Bias | 0.00033 | 0.0025 | | 0.00043 | 0.0023 | | | 0.00033 | 0.0025 | |
| | Sample variance | 0.00071 | 0.00082 | 0.00045 | 0.00071 | 0.00082 | 0.00046 | 0.017 | 0.00071 | 0.00082 | 0.00045 |
| | Mean variance | 0.00080 | | 0.00048 | 0.00080 | | 0.00048 | 0.016 | 0.00080 | | 0.00048 |
| | 95% coverage probability | 0.96 | | | 0.96 | | | | 0.96 | | |
| | Empirical P-value for β | | 3.24E-11 | <1.00E-18 | | 3.24E-11 | <1.00E-18 | 0.95 | | 3.24E-11 | <1.00E-18 |
| Case-control | Estimate | 0.48 | 0.20 | −0.17 | 0.49 | 0.20 | −0.17 | 0.065 | 0.45 | 0.20 | −0.18 |
| | Bias | 0.035 | 0.0056 | | 0.038 | 0.0048 | | | −0.00041 | 0.0070 | |
| | Sample variance | 0.0012 | 0.00059 | 0.00019 | 0.0014 | 0.00059 | 0.00019 | 0.0024 | 0.00065 | 0.0011 | 0.00030 |
| | Mean variance | 0.0012 | | 0.00018 | 0.0013 | | 0.00017 | 0.0024 | 0.00079 | | 0.00030 |
| | 95% coverage probability | 0.86 | | | 0.84 | | | | 0.97 | | |
| | Empirical P-value for β | | | <1.00E-18 | | | <1.00E-18 | 0.19 | | | <1.00E-18 |
| Combined (Method 1) | Estimate | 0.46 | 0.20 | −0.17 | 0.46 | 0.21 | −0.17 | 0.060 | 0.45 | 0.20 | −0.18 |
| | Bias | 0.0069 | 0.012 | | 0.0072 | 0.015 | | | −0.00082 | 0.011 | |
| | Sample variance | 0.00067 | 0.00062 | 0.00019 | 0.00070 | 0.00062 | 0.00019 | 0.0023 | 0.00050 | 0.00081 | 0.00024 |
| | Mean variance | 0.00068 | | 0.00017 | 0.00068 | | 0.00017 | 0.0024 | 0.00055 | | 0.00026 |
| | RE (vs. trios) | 1.1 | | | 1.0 | | | | 1.4 | | |
| | RE (vs. case-control) | 1.7 | | | 1.9 | | | | 1.3 | | |
| | 95% coverage probability | 0.94 | | | 0.93 | | | | 0.95 | | |
| | Empirical P-value for β | | | <1.00E-18 | | | <1.00E-18 | 0.21 | | | <1.00E-18 |
| Combined (Method 2) | Estimate | 0.46 | 0.20 | −0.18 | 0.46 | 0.20 | −0.18 | 0.022 | 0.45 | 0.20 | −0.19 |
| | Bias | 0.011 | 0.0062 | | 0.011 | 0.0073 | | | −0.00075 | 0.0069 | |
| | Sample variance | 0.00062 | 0.00060 | 0.00022 | 0.00063 | 0.00060 | 0.00022 | 0.0045 | 0.00052 | 0.00069 | 0.00027 |
| | Mean variance | 0.00073 | | 0.00024 | 0.00072 | | 0.00024 | 0.0043 | 0.00059 | | 0.00030 |
| | RE (vs. trios) | 1.1 | | | 1.1 | | | | 1.4 | | |
| | RE (vs. case-control) | 1.9 | | | 2.2 | | | | 1.2 | | |
| | 95% coverage probability | 0.94 | | | 0.95 | | | | 0.96 | | |
| | Empirical P-value for β | | | <1.00E-18 | | | <1.00E-18 | 0.74 | | | <1.00E-18 |

Note: τ is the disease locus location, C is the genetic effect of τ and is a function of the covariate, β is the regression coefficient between the covariate and C, and RE is relative efficiency.

Table 4b. Impact of population stratification on estimating the disease locus using trios, case-control, and combined simulated studies

Adjustment for population stratification: with different adjustment.

| Study design | Statistic | τ | C | β1 (Age onset_Tr) | β2 (Age onset_CC) | β3 (Population) |
| Trios | Estimate | 0.45 | 0.19 | −0.19 | | |
| | Bias | 0.00033 | 0.0025 | | | |
| | Sample variance | 0.00071 | 0.00082 | 0.00045 | | |
| | Mean variance | 0.00080 | | 0.00048 | | |
| | 95% coverage probability | 0.96 | | | | |
| | Empirical P-value for β | | | <1.00E-18 | | |
| Case-control | Estimate | 0.46 | 0.20 | | −0.18 | 0.021 |
| | Bias | 0.0069 | 0.0031 | | | |
| | Sample variance | 0.00054 | 0.00068 | | 0.00020 | 0.00037 |
| | Mean variance | 0.00064 | | | 0.00020 | 0.00037 |
| | 95% coverage probability | 0.97 | | | | |
| | Empirical P-value for β | | | | <1.00E-18 | 0.27 |
| Combined (Method 1) | Estimate | 0.46 | 0.19 | −0.19 | −0.16 | 0.067 |
| | Bias | 0.0070 | 0.0020 | | | |
| | Sample variance | 0.00074 | 0.00086 | 0.00042 | 0.00020 | 0.0030 |
| | Mean variance | 0.00069 | | 0.00048 | 0.00017 | 0.0030 |
| | RE (vs. trios) | 1.0 | | | | |
| | RE (vs. case-control) | 1.9 | | | | |
| | 95% coverage probability | 0.93 | | | | |
| | Empirical P-value for β | | | <1.00E-18 | <1.00E-18 | 0.22 |
| Combined (Method 2) | Estimate | 0.46 | 0.19 | −0.18 | −0.17 | 0.066 |
| | Bias | 0.014 | −0.0069 | | | |
| | Sample variance | 0.00069 | 0.00077 | 0.00036 | 0.00020 | 0.0033 |
| | Mean variance | 0.00052 | | 0.000047 | 0.00016 | 0.0033 |
| | RE (vs. trios) | 1.0 | | | | |
| | RE (vs. case-control) | 2.0 | | | | |
| | 95% coverage probability | 0.88 | | | | |
| | Empirical P-value for β | | | <1.00E-18 | <1.00E-18 | 0.25 |

A Data Example

Angiotensin-converting enzyme (ACE) is a key enzyme of the well-described renin-angiotensin-aldosterone system and is pivotal for electrolyte balance and blood-pressure regulation. Studies have provided strong evidence for an association between insertion/deletion polymorphisms and plasma ACE activity, with increased levels among individuals with the D allele. Therefore, ACE activity represents an upstream and internal facet of hypertension (Chung et al., 2010a). We applied the proposed methods to a hypertension study in Taiwan (Chung et al., 2010a) that included 405 general pedigrees. We imputed the genotypes for the original 405 general pedigrees using the integrated genotype inference procedure implemented in MERLIN (Burdick et al., 2006). After the imputation, 71 trios in which one parental genotype was missing, 196 subjects without parental genotypes, and five subjects without measures of ACE activity were excluded. In total, 228 hypertension trios from 152 families and 400 controls with complete phenotype and genotype data were analyzed. Eight SNPs of the ACE gene on chromosome 17 with an MAF of at least 0.05, identified from a genome-wide association study (GWAS), were included in these analyses (Chung et al., 2010a). The SNP information can be found in supplementary Table S2. Table 5 shows estimates from the trio, case-control (using the cases from the trio study), and combined studies of hypertension, with ACE activity levels incorporated as a covariate in the association mapping. The 95% CI for the genetic effect C was estimated by the bootstrap method (Chiu et al., 2010b). A total of 1000 replicates were obtained by resampling. The disease locus estimates were computed for each bootstrap sample and ranked. The lower and upper limits of the 95% CI were the 2.5th and 97.5th percentiles of the 1000 replicates, respectively.
For comparison, we also conducted single-locus association analyses using the UNPHASED software (Dudbridge, 2008) with and without incorporating ACE activity (Table 6).
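
The percentile bootstrap described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' program: the function name `percentile_bootstrap_ci`, the `estimator` callback (a stand-in for the full LD-mapping fit), the choice of resampling unit, and the nearest-rank percentile rule are all hypothetical, and the toy data are not from the hypertension study.

```python
import random


def percentile_bootstrap_ci(units, estimator, n_replicates=1000,
                            lower_pct=2.5, upper_pct=97.5, seed=0):
    """Percentile bootstrap CI: resample, re-estimate, rank, read off percentiles."""
    rng = random.Random(seed)
    n = len(units)
    estimates = []
    for _ in range(n_replicates):
        # Resample the sampling units (e.g. families or subjects) with replacement.
        resample = [units[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    estimates.sort()  # rank the replicate estimates
    # Nearest-rank rule for the percentiles (an assumption of this sketch).
    lo_idx = round(lower_pct / 100 * (n_replicates - 1))
    hi_idx = round(upper_pct / 100 * (n_replicates - 1))
    return estimates[lo_idx], estimates[hi_idx]


# Toy usage: the sample mean stands in for the real estimator.
toy_units = [float(v) for v in range(1, 101)]
ci_low, ci_high = percentile_bootstrap_ci(toy_units, lambda s: sum(s) / len(s))
print(ci_low, ci_high)
```

In a real analysis the callback would refit the mapping model on each resampled data set, which is far more costly than the toy mean used here.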

Table 5. Estimates from the trios, case-control, and combined studies of hypertension while incorporating ACE activity

Study Design         Parameter              τ                     C                β
Combined (Method 1)  Estimate               58.9154               0.028            0.14
                     S.E.†                  0.000039              0.074            0.00029
                     RE (vs. trios)         350.7
                     RE (vs. case-control)  1.5
                     95% C.I.               [58.9153, 58.9155]    [−0.088, 0.17]
                     Z                                                             484.89
                     P-value                                                       <1.0E-18
Combined (Method 2)  Estimate               58.9160               0.068            0.14
                     S.E.†                  0.000073              0.080            0.00062
                     RE (vs. trios)         100.0
                     RE (vs. case-control)  0.4
                     95% C.I.               [58.9158, 58.9165]    [−0.11, 0.19]
                     Z                                                             231.98
                     P-value                                                       <1.0E-18

Note

  1. τ is the disease locus location, C is the genetic effect of τ and is a function of ACE activity, β is the regression coefficient between ACE activity and C, and RE is relative efficiency.

  2. †S.E.: Standard error.

Table 6. Likelihood ratio test using the Dudbridge method implemented in UNPHASED

                       Without incorporation of ACE activity    With incorporation of ACE activity
Marker      Position   Likelihood ratio chi-square   P-value    Likelihood ratio chi-square   P-value
rs4459609   58.90268   −1.50E-11                     1.00       5.68E-14                      1.00
rs4305      58.91196   0.018                         0.89       0.018                         0.90
rs4309      58.91366   0.14                          0.71       0.14                          0.71
rs4311      58.91450   7.73E-12                      1.00       0.00E+00                      1.00
rs4329      58.91719   0.14                          0.71       0.14                          0.71
rs4343      58.91976   0.063                         0.80       0.063                         0.80
rs4362      58.92749   0.27                          0.61       0.27                          0.61
rs4267385   58.93749   0.20                          0.66       0.20                          0.66

Results

Simulation Study

Estimates for the disease locus from case-control studies are thought to have higher efficiency than those from trio studies. When compared to the trio study, the REs for the hybrid study were 1.5, 1.6, and 1.6 for sample sizes of 50, 100, and 200, respectively, while they were 1.2, 1.0, and 1.1 for sample sizes of 50, 100, and 200, respectively, when compared to the case-control study using the combined method 1 (Table 1). Based on the combined method 2, the REs for the sample sizes of 50, 100, and 200 were, respectively, 1.4, 1.5, and 1.5 when compared to the trios study, and were, respectively, 1.1, 1.0, and 1.0 when compared to the case-control study (Table 1). The efficiencies (inverse of the variance estimate) of the disease locus and regression coefficient between the covariate Z (age at onset) and the genetic effect C are comparable among case-control and combined studies, and are lower in the trio study design. The two combining methods are similar, although the REs from Method 1 are often slightly higher than those from Method 2 in the simulation study. The impact of the disease allele frequencies on efficiency of the disease locus estimate is limited (Table 2). However, the efficiencies of all estimates increase when the marker density increases. The RE of the disease locus estimate also increases when compared to the trios study (Table 3). In the presence of population stratification (Table 4a), the bias and variance estimates are higher in case-control studies, as expected. Therefore, this is the only scenario under consideration where the REs are higher when compared to the case-control study, rather than when compared to the trios study. In Table 4aa, population stratification was adjusted for in both trios and case-control studies. However, since population stratification is not an issue in trios designs, following the principle of parsimony, it was not adjusted for in the trios study, as shown in Table 4ab. 
Without this covariate, the estimates for the regression parameters are slightly more efficient than those in which the covariate for population stratification was included. The robustness of the Z-test for testing population stratification was studied under various sample sizes (Table S3). Based on the same sample size of trios (nTr), the false-positive ratio increases when the sample size of case-control data increases. The impact is stronger when the sample size of case-control data is relatively smaller, such as when nTr is 50. With a fixed sample size of case-control (nCC) studies, the impact from increasing trio numbers on the Z-test is rather limited. The false-positive ratio is lowest when the sample sizes of trios and total cases and controls are similar. For sufficient sizes of trios data (namely, nTr ≥ 100 in our simulation study), the performance of the Z-test is quite robust.

A Data Example

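First, the null hypothesis that the disease locus positions from the trio and case-control studies are identical was tested against the alternative that they differ for the hypertension study, using the Z statistic: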

display math

The null hypothesis of identical positions of the hypertension locus from the two separate studies is not rejected (P = 0.41); it is therefore legitimate to combine the two hypertension studies. In this hypertension study (Table 5), the REs of the combined study relative to the trio study were as high as 351 and 100 for combined methods 1 and 2, respectively. As expected, the case-control study was more efficient than the trio study. However, relative to the case-control study, the efficiency of the combined study changed by 1.5- and 0.4-fold for combined methods 1 and 2, respectively. Note that the standard error (SE) of the disease locus estimate from the trios study is about 15-fold larger than that from the case-control study (0.00073 vs. 0.000048). Consequently, under method 2, adding the trios data reduced the efficiency of the disease locus estimate (SE = 0.000072) relative to the case-control study alone. This example suggests that method 2 may be more vulnerable to sampling variation than method 1.
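
Because efficiency is taken here as the inverse of the variance estimate, the REs quoted above can be checked, up to rounding of the published standard errors, with a short calculation. The sketch below is illustrative only: the helper name `relative_efficiency` is ours, and the inputs are the rounded SEs of the disease locus estimate reported in the text and in Table 5.

```python
def relative_efficiency(se_reference, se_new):
    """RE of the new estimate vs. the reference: ratio of inverse variances."""
    return (se_reference / se_new) ** 2


# Rounded SEs of the disease locus estimate, as reported in the paper.
se_trios = 0.00073
se_case_control = 0.000048
se_method1 = 0.000039
se_method2 = 0.000073  # Table 5 value; the running text quotes 0.000072

re_m1_vs_trios = relative_efficiency(se_trios, se_method1)
re_m1_vs_cc = relative_efficiency(se_case_control, se_method1)
re_m2_vs_trios = relative_efficiency(se_trios, se_method2)
re_m2_vs_cc = relative_efficiency(se_case_control, se_method2)

print(f"Method 1: RE vs. trios = {re_m1_vs_trios:.1f}, vs. case-control = {re_m1_vs_cc:.2f}")
print(f"Method 2: RE vs. trios = {re_m2_vs_trios:.1f}, vs. case-control = {re_m2_vs_cc:.2f}")
```

Small differences from the tabulated REs are expected, since the published SEs are rounded to two significant figures.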

The disease locus estimate is 58.9160 cM (≈ 58916.0 kb), near the polymorphism rs4311 (Fig. 1). The empirical statistics in Figure 1 summarize the observed marker data in the studied region; we expected the disease locus to be close to the peak of the empirical statistics. This finding is consistent with several studies. For example, Catarsi et al. (2005) found that the haplotype containing C for rs4295, A for rs4424958, C for rs4309, T for rs4311, the deletion polymorphism for rs13447447, and G for rs4363 in the ACE gene was associated with serum ACE activity and cyclosporine sensitivity (P = 0.0139). Domingues-Montanari et al. (2011) found that rs4311 is associated with hemorrhage recurrence in amyloid angiopathy, a clinical outcome relevant to hypertension. We could not locate the disease locus for hypertension without incorporating ACE activity levels in the proposed approach; hence, those results are not presented. However, the genetic effect of the estimated disease locus with incorporation of ACE activity (0.028 for method 1 and 0.068 for method 2) was not significant, with 95% CIs of [−0.088, 0.17] and [−0.11, 0.19] for methods 1 and 2, respectively (Table 5). Similarly, none of the SNPs showed evidence of association in the Dudbridge approach (Dudbridge, 2008), regardless of whether ACE activity was incorporated (Table 6).

Figure 1.

Empirical statistics and the disease locus estimates from the case-control, trios, and combined studies. Green (dashed) lines are empirical statistics for trios data. Red (dashed) lines are empirical statistics for case-control data. Triangles are the disease locus estimates from the separate studies. Diamonds are the disease locus estimates from the combined studies using method 1 (pink) or method 2 (blue).

Discussion

We proposed two methods to combine data from case-parent trios and unrelated cases and/or controls, comparing the performance of these two methods to approaches that use either trios or case-control data. It is more efficient to analyze the combined data when the disease locus estimates from separate studies are consistent. With the same sample size, case-control studies are more cost-effective than trio studies, since the efficiency of the disease locus estimate is higher in case-control studies. Both simulation and data example studies showed that efficiency in estimating the disease locus in a case-parent trio study can be improved dramatically by combining trios and unrelated subjects.

The simulation study suggested that the two combined methods perform comparably. Although method 1 performed only slightly better than method 2 in the simulation, it showed a much higher RE than method 2 in the data example. It is appropriate to combine case-parent trios and unrelated cases or controls only when their disease locus location is consistent, which can be tested from the estimates and their covariance matrices obtained in the separate and combined studies. Method 2 is still applicable even when the estimate from one single study is not available due to lack of power. When the same cases are used in both designs, method 2 accounts for the repeated use of affected individuals through the GEE approach. These two combined methods are both flexible in that one can use either the same cases from the trio studies, or an independent set of cases in the combined analysis. As there is no need to specify an underlying genetic model of the disease in these proposed association mapping methods, they are simple and flexible to implement in practice.

These two proposed methods preserve the advantages of the previous methods, including flexible modeling, assessment of gene-gene and gene-covariate effects, and easy adjustment for confounding effects (such as population stratification and disease heterogeneity), while simultaneously utilizing multiple markers, siblings, and unrelated subjects. The proposed approaches can accommodate both dyads (affected child with one parental genotype) and monads (affected child without parental genotype) (Liang et al., 2001b) and can be extended to combine general pedigrees with unrelated subjects. In addition, this approach provides an estimate of the disease locus location along with its sampling uncertainty to help investigators pinpoint a region for resequencing. It can also be applied to imputed sequence data as long as they are properly imputed (Liang et al., 2001a). Recently, it has been shown that ignoring LD in association analysis can result in the misinterpretation of GWAS findings and have an impact on subsequent genetic and functional studies (Christoforou et al., 2012). This disease locus localization method based on LD patterns can ease this concern as well as the concern of multiple testing.

Because genome-wide genotype data for large numbers of control individuals are available through various databases, several association tests that account for population stratification using hybrid case-control and family data have been proposed recently (Zhu et al., 2008; Zhang et al., 2009; Lasky-Su et al., 2010; Chung et al., 2010b). The approach proposed by Lasky-Su et al. (2010) is robust to population stratification, as the population-based test statistic uses a rank-based P-value that does not rely on large sample theory. In the approaches from Zhu et al. (2008) and Zhang et al. (2009), population stratification was corrected for by integrating principal component analysis and association tests, while the method by Chung et al. (2010b) accounts for population structures identified by Ward's clustering algorithm (Gao & Starmer, 2007). In our proposed approaches, population stratification extracted from principal component analysis (Zhu et al., 2008; Zhang et al., 2009) or a clustering algorithm (Chung et al., 2010b) can also be accounted for as covariates. By extending GEEs, the approach developed by Zhang et al. (2009) can also conduct univariate or multivariate association tests for the combined data of unrelated subjects and nuclear families. As with our GEE-based approaches, their approach can be extended in a relatively straightforward way to include data from general pedigrees with arbitrary structures. However, most of the above-mentioned work focuses on testing rather than on estimation.

Estimates of the disease location or effects, along with the uncertainty of these estimates, are particularly useful in identifying a potential region for resequencing or in characterizing the roles of gene and environment in disease development. Similar to our approaches, the pseudo-likelihood approach proposed by Hsu et al. (2009) also provides unbiased estimates of the effects of the factors of interest, and can flexibly accommodate multiple genetic markers, environmental covariates, and gene-gene and gene-environment interactions for hybrid data. However, the disease is assumed to be rare in their approach, and it does not account for LD between multiple linked markers; consequently, only unlinked markers are legitimate in their current algorithm. Through either imputation or the conditional distributions of offspring genotypes via the sufficient statistic of missing parental genotypes (Rabinowitz & Laird, 2000), most of the above-mentioned approaches, including ours, can also handle missing genotypes in parents. Although Zhang et al. (2009) and Hsu et al. (2009) also use GEE to estimate model parameters, their statistics differ from ours: in our approaches, genotype rather than phenotype data are treated as random variables, thus easing concerns of ascertainment bias.

In the present study, both the simulation study and data example incorporated a quantitative covariate that was available in affected individuals only. In practice, when a quantitative covariate is available for cases and controls, one can incorporate the sum or difference of the quantitative covariate between all case-control combination pairs, where the genetic effect and regression coefficients are defined differently in each design (Chiu et al., 2009). In addition to increased efficiency in estimating the disease locus from the combined study, the disease locus was identified only when ACE activity levels were incorporated, suggesting that the power for estimating the disease locus increased with incorporation of ACE activity. In the demonstrated example, the estimated disease locus did not have a significant genetic effect, which might be due to the limited sample size for detecting a fairly small effect. In addition, as the estimated locus lay between two markers and was not itself genotyped, the power for detecting its genetic effect was also reduced (Liang et al., 2001a).

Assuming the existence of a disease locus in the studied region, the proposed approaches identify the location with the highest genetic effect on the disease, which in theory is the most likely position of the disease locus in the studied region; however, the genetic effect at the estimated disease locus may not be statistically significant. In such situations, a replication study with a larger sample size or references from the literature will be helpful for further verification or interpretation of the finding.

One limitation of these approaches is that if there is more than one disease locus in the study region, the result may not be valid. It is helpful to check the empirical plot (i.e., the plot of empirical expected statistics vs. marker location) as shown in Figure 1 to determine how many “peaks” are present in the region of interest. For a study region with more than one susceptibility locus, one could divide the region into a few subregions to examine the marginal genetic effect for each disease locus.

Conclusions

In summary, these approaches are practical, since the only assumption required about the underlying genetic mechanism of the disease is that there is only one disease locus in the region of interest. The disease locus location and its 95% confidence interval can be estimated. The magnitude and significance of the associations between the genetic effect at the disease locus and covariates can also be assessed in the LD mapping approaches. The proposed approaches allow investigators to combine case-parent trio data and case-control data, which has two advantages: (1) full use of all available data, and (2) improved efficiency of estimates in the presence of covariates. Gene-gene and gene-covariate interaction effects can also be included as covariates and assessed in this association mapping. The computing program for analysis using the hybrid data set is freely available on our website. These approaches can be further extended to combined studies from all kinds of family studies and population studies.

Acknowledgements

The authors appreciate the editor's and reviewers’ constructive suggestions, which greatly improved the quality of the manuscript. This project was supported by a grant from the National Science Council, Taiwan (NSC98-2118-M-400-002) and a grant from the National Health Research Institutes, Taiwan (PH-099-pp-04). The authors are grateful to Dr. Chia-Min Chung and Mr. Yu-Wei Li for their help with the example data and mathematical derivations, respectively. We thank Ms. Karen Klein (Research Support Core, WFUHS) and Mr. Mark Swofford (Scientific Editing Office, NHRI) for their dedicated editorial contributions to this manuscript. All of the authors have no conflict of interest to declare.

Web Resources

The computing software of the proposed methods is available online via http://www.nhri.org.tw/NHRI_WEB/nhriw1001Action.do?emp_cd=911103&status=Res

Appendix

Let the common parameters inline image from the case-parent trios and unrelated subjects. In the case-parent trios design, from equation (2), assuming inline image in (2) was estimated by inline image (see details in Chiu et al. (2010a)), let

display math(8)

The score function (assuming no missing genotypes) from the trios study is

display math(9)

where n is the number of trios, and M is the number of markers.

Obtaining inline image by solving inline image using the GEE approach,

and

display math(10)

where

display math(11)

and

display math(12)

Similarly, in the case-control study, based on equation (4),

display math(13)

where inline image, inline image, inline image, inline image and inline image.

Obtaining inline image by solving inline image, and

display math

where

display math(14)

and

display math(15)

Method 1

display math(16)

where

display math(17)
display math(18)
display math(19)

where

display math(20)

Method 2

display math(21)

Obtain inline image by solving inline image, where inline image is the generalized estimating equations for δ.

display math(22)

where

display math(23)
display math(24)

Proof of optimal weights for Method 1 and asymptotic normality for estimates from Method 1 and Method 2

The following derivation is based on the linear model theory.

Let $\Upsilon = (\hat{\delta}_{Tr}^{\top}, \hat{\delta}_{CC}^{\top})^{\top}$ be the vector obtained by stacking $\hat{\delta}_{Tr}$ and $\hat{\delta}_{CC}$, the estimates of δ from the trios and case-control studies, respectively. Let $X = (I_q, I_q)^{\top}$ be the matrix formed by stacking two q-dimensional identity matrices $I_q$ (q is the dimension of δ; q = 4 here). When the case-parent trios and unrelated controls are sampled from the same population, the 2q × 1 vector Υ asymptotically follows a 2q-variate normal distribution with mean $X\delta$ and variance-covariance matrix Σ.

Denote the component submatrices of Σ by Σ11, Σ12, Σ21, and Σ22, where Σ11 is the asymptotic variance matrix of $\hat{\delta}_{Tr}$, Σ22 is the asymptotic variance matrix of $\hat{\delta}_{CC}$, and $\Sigma_{12} = \Sigma_{21}^{\top}$ is the asymptotic covariance matrix between $\hat{\delta}_{Tr}$ and $\hat{\delta}_{CC}$. By the linear model theory (Seber, 1997), the optimal (most efficient) estimator for δ based on the linear combination of $\hat{\delta}_{Tr}$ and $\hat{\delta}_{CC}$ is given by the WLS estimator $\hat{\delta} = (X^{\top}\Sigma^{-1}X)^{-1}X^{\top}\Sigma^{-1}\Upsilon$, which, by some matrix algebra, leads to equation (16) (Chen & Lin, 2008).
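
To make the WLS combination concrete, the sketch below works through the scalar special case (q = 1) of combining two possibly correlated estimates of the same parameter. It is an illustration under assumptions, not the authors' implementation: the function name `combine_two_estimates` and the toy inputs are hypothetical, and the paper's setting uses a q = 4 parameter vector with matrix-valued weights. For q = 1, minimizing the variance of a weighted average whose weights sum to one gives weights proportional to (σ22 − σ12) and (σ11 − σ12).

```python
def combine_two_estimates(est_a, var_a, est_b, var_b, cov_ab=0.0):
    """Minimum-variance linear combination of two estimates of one parameter.

    Scalar (q = 1) case of weighted least squares with a 2 x 2 covariance
    matrix [[var_a, cov_ab], [cov_ab, var_b]] and a design of two ones.
    """
    denom = var_a + var_b - 2.0 * cov_ab  # variance of (est_a - est_b)
    if denom <= 0:
        raise ValueError("covariance matrix must make var(est_a - est_b) positive")
    weight_a = (var_b - cov_ab) / denom
    weight_b = (var_a - cov_ab) / denom  # weights sum to one
    combined = weight_a * est_a + weight_b * est_b
    combined_var = (var_a * var_b - cov_ab ** 2) / denom
    return combined, combined_var, (weight_a, weight_b)


# Toy inputs (hypothetical): a noisy estimate and a more precise one.
combined, combined_var, weights = combine_two_estimates(
    est_a=1.0, var_a=4.0, est_b=2.0, var_b=1.0, cov_ab=0.5
)
print(combined, combined_var, weights)
```

With zero covariance this reduces to ordinary inverse-variance weighting; a nonzero covariance, as arises when the same cases appear in both designs, shifts the weights accordingly.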
