Novel family-based approaches to genetic risk in thrombosis

Authors


Dr John Blangero, Department of Genetics, PO Box 760549, San Antonio, TX, USA 78245–0549.
Tel.: +1 210 258 9634; fax: +1 210 670 3317; e-mail: john@darwin.sfbr.org

Abstract

Summary.  The genetic basis of thrombosis is complex, involving multiple genes and environmental factors. The field of common complex disease genetics has progressed enormously over the past 10 years with the development of powerful new molecular and analytical strategies that enable localization and identification of the causative genetic variants. During the course of these advances, a major paradigmatic change has been taking place that focuses on the genetic analysis of measurable quantitative traits that are correlated with disease risk vs. the previous emphasis on the analysis of the much less informative dichotomous disease trait. Because of their closer proximity to direct gene action, disease-related quantitative phenotypes represent our best chance to identify the underlying quantitative trait loci (QTLs) that influence disease susceptibility. This approach works best when data can be collected on extended families. Unfortunately, family-based designs are still relatively rare in thrombosis/hemostasis studies. In this review, we detail the reasons why the field would benefit from a more vigorous pursuit of modern family-based genetic studies.

Introduction

Thrombosis is a common complex disease associated with substantial morbidity and mortality. The major determinants of thrombosis include both environmental and genetic factors [1]. Although it is likely that multiple genes with varying effects are involved in determining susceptibility to thrombosis, surprisingly little information is available on the relative importance of genetic factors in thrombosis risk in the general population. There has been a paucity of family-based studies, the design required to make valid inferences about the overall importance of genes in thrombosis risk. In the Genetic Analysis of Idiopathic Thrombosis (GAIT) family-based study, we previously estimated an additive genetic heritability of 60% for thrombosis [2], suggesting that genes would represent the largest single causal mechanism in the underlying pathophysiologic pathway of this disease. This heritability is greater than or equal to that seen in other common complex diseases such as Type 2 diabetes [3], gallbladder disease [4], alcoholism [5], and obesity [6]. This finding alone suggests that whole-genome approaches to localizing and identifying the quantitative trait loci (QTLs) that underlie thrombosis susceptibility are highly justified.

Researchers in the hemostasis/thrombosis field generally have not actively pursued family studies except for the occasional serendipitous collection of unusual families with high densities of affected individuals. Therefore, most of our knowledge regarding the genetic factors involved in common thrombosis has been limited to association studies that employ case–control designs to look at known polymorphic variations in candidate genes [7–10]. Although such studies provide important indirect evidence for the presence of genetic effects, they have a number of weaknesses. These include their propensity for Type I errors due to hidden population stratification, the failure to deal with a formidable multiple-testing issue, the lack of direct evaluation of familial transmission, and their relative inability to discover novel genes involved in determining intrapopulation variation in thrombosis risk. In this review, we outline the case for increased use of family-based designs for the dissection of the genetic basis of thrombosis.

Trends in the genetics of common complex disease

The last 10 years have produced revolutionary advances in the field of human genetics, allowing a transition from the classical analysis of monogenic disorders to a new emphasis on the genetic basis of common complex disorders such as obesity, diabetes, osteoporosis, atherosclerosis, hypertension, and asthma. Attendant with this transition has been the realization that the genetic basis of such complex diseases may be better dissected via examination of continuous variation in those pathways more proximate to gene action [10]. In particular, there has been a growing realization that by looking at measurable quantitative variation in phenotypes closely related to disease risk, we may have more power to localize and identify disease-related susceptibility genes than by the examination of the even more complex (and statistically less informative) disease outcome itself [11]. A principal benefit of the examination of quantitative variation in physiologic phenotypes is that both normal and affected individuals contribute to the genetic information.

Powerful analytical strategies, such as the variance component method of quantitative trait linkage analysis [12,13], have been developed and extensively utilized to localize human disease-related QTLs in studies of extended families via comprehensive scanning of the human genome. The first localization of a human QTL using the genome scan approach was the discovery of a human obesity QTL at chromosome 2p23 by Comuzzie et al. [6], who obtained a LOD score of 4.95 at this genomic location when looking at variation in serum leptin levels. This was followed by at least four replications of linkage across various populations [14–17]. While the identification of this QTL is still not formally completed, strong association evidence points to the POMC gene [18]. Since 1997, there have been many more localizations of QTLs influencing human quantitative phenotypic variation and it is anticipated that the next few years will yield numerous conclusive identifications of the causative genes. While the field of thrombosis/hemostasis has been relatively slow to utilize these modern genetic approaches, there is considerable merit in the application of these methods to the many phenotypic measures that are known to be correlated with risk of thrombosis.

Quantitative risk factors for thrombosis

The physiologic cascade that underlies the normal formation of thrombin and the pathologic endpoint of thrombosis is complex, with many components involved in the coagulation and fibrinolytic pathways. However, many features of the hemostatic and fibrinolytic systems are known, facilitating the search for quantitative risk factors for thrombosis. Numerous hemostatic factors have been implicated as possible concomitants of both venous [19–22] and arterial thrombosis [23,24].

For example, there is epidemiologic evidence for a positive relationship between both von Willebrand factor (VWF) and factor (F) VIII levels and the risk of venous [21] and arterial thrombosis [25]. High plasma homocysteine levels have been associated with deep-vein [22] and arterial thrombosis [26]. The quantitative measure of activated protein C ratio (APCR) is correlated with risk of venous thrombosis even when a major genetic influence on APCR, the FV Leiden polymorphism, is taken into account [27]. Similarly, levels of FXII [28] and tissue plasminogen activator [23] have been correlated with arterial thrombosis. More recently, results from the LETS study have implicated high plasma levels of FIX [29] and FXI [30] as risk factors for thrombosis. Bivariate genetic analyses in the GAIT data demonstrated significant genetic correlations between thrombosis and APCR, homocysteine, tissue plasminogen activator, VWF, and clotting FVII, FVIII, FIX, FXI, and FXII, suggesting pleiotropic influences on these quantitative measures and risk of thrombosis [2].

Superiority of quantitative traits for genetic dissection

Quantitative traits have inherently more statistical information about genetic signals than that available from discrete traits such as disease status. The reason for this becomes apparent when we examine the typical assumptions about the nature of discrete variation. Classical quantitative genetic theory employs a threshold model to explain discrete variation. In this model, an unobservable (except through correlated continuous phenotypes) variable termed risk or liability is believed to determine the affected status, i.e. if liability is above a particular threshold, the individual is affected, whereas if liability is below the threshold, the individual is not affected. A similar model underlies the utilization of the logistic regression models so favored by epidemiologists.

Figure 1 shows a typical distribution of a quantitative trait or the distribution of risk/liability to thrombosis that is correlated with the trait. It shows the values for a hypothetical sibship of four in which two of the sibs are affected and two are unaffected when the trait is dichotomized so that the prevalence of the disease is 20%. It is clear from the distribution that sib 2, whose trait value lies near the threshold, is much more similar to affected sibs 3 and 4 than to sib 1, whose trait value lies in the far left tail of the distribution. Discretization loses this critical quantitative information that tells us more precisely about the relative similarity/dissimilarity among the siblings.
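The threshold model described above can be illustrated with a minimal sketch. It assumes a standard-normal liability scale, and the four sib liability values are hypothetical numbers chosen only to mimic the situation in the text (they are not read from Figure 1).

```python
from statistics import NormalDist


def liability_threshold(prevalence: float) -> float:
    """Threshold on a standard-normal liability scale for a given prevalence.

    Individuals whose liability exceeds the threshold are scored as affected,
    so the upper-tail area above the threshold equals the prevalence.
    """
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    return NormalDist().inv_cdf(1.0 - prevalence)


# Hypothetical liabilities (in SD units) for a sibship of four.
sibs = {"sib1": -2.0, "sib2": 0.7, "sib3": 1.0, "sib4": 1.5}

threshold = liability_threshold(0.20)  # 20% prevalence, as in the text
status = {name: value > threshold for name, value in sibs.items()}

# Dichotomization keeps only affected/unaffected status, whereas the
# quantitative values preserve how close each pair of sibs really is.
gap_sib2_sib3 = abs(sibs["sib2"] - sibs["sib3"])
gap_sib2_sib1 = abs(sibs["sib2"] - sibs["sib1"])

print(f"threshold = {threshold:.3f}")
print(status)
print(f"|sib2 - sib3| = {gap_sib2_sib3:.1f}, |sib2 - sib1| = {gap_sib2_sib1:.1f}")
```

In this sketch sib 2 and sib 1 receive the same discrete label even though sib 2 lies far closer to the affected sibs on the liability scale, which is the information that discretization discards.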

Figure 1.

Distribution of a hypothetical quantitative trait and the effect of dichotomization on statistical genetic information.

Figure 2 quantifies this loss of information in the context of a gene-localization study. It shows that the relative efficiency of discrete traits is much lower than that of the focal quantitative trait. The relative efficiency allows us to compare the sample sizes needed to obtain the same power for these two situations. If the relative efficiency of a discrete vs. a quantitative trait were 0.25, then four times as many individuals would have to be sampled to map a gene using the discrete phenotype as would be needed if the quantitative phenotype itself were examined. Fig. 2 shows that the discretization of quantitative traits is always statistically inefficient. Depending upon the prevalence of the disease, it requires sample sizes anywhere from nearly 100 times larger (for 1% prevalence) to 3 times larger (for a prevalence of 30%).

Figure 2.

The effect of discretizing quantitative traits. Relative efficiency versus heritability.

Genetic signal-to-noise ratio: population variation

Multiple studies will be required to map genes influencing quantitative variation in thrombosis susceptibility. This stems from the nature of genetic variability across the human species. The likelihood that a QTL can be localized in any given study is a function of the sample size, the pedigree size/complexity and the relative importance of the QTL in the population from which the sample is taken. The genetic signal-to-noise ratio is purely a function of the QTL-specific heritability, which measures the relative phenotypic variance that is accounted for by a QTL. For a simple di-allelic QTL, the total variance that is attributable to the QTL can be written as $\sigma_q^2 = 2p_q(1 - p_q)\alpha^2$, where $p_q$ is the allele frequency of the QTL polymorphic variant and $\alpha$ is half the displacement between the means of the two homozygous genotypes. By dividing the genetic variance attributable to a QTL by the total phenotypic variance, we obtain the QTL-specific heritability, which is a direct measure of the genetic signal-to-noise ratio for a particular QTL. Obviously, this parameter can vary across populations when there are differences in allele frequencies or when there are differences in the actual genotypic effects.
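As a rough illustration of how the QTL-specific heritability depends on allele frequency, the sketch below computes the additive QTL variance, 2p(1 - p)α^2 with p the variant allele frequency, and divides it by the total phenotypic variance. It assumes a purely additive di-allelic QTL and treats the total phenotypic variance as a fixed input. The allele frequencies and the unit values for α and the phenotypic variance are hypothetical and are not taken from any figure or dataset.

```python
def qtl_variance(p: float, alpha: float) -> float:
    """Additive variance of a di-allelic QTL: 2 * p * (1 - p) * alpha**2.

    p     -- frequency of the variant allele (0 <= p <= 1)
    alpha -- half the displacement between the two homozygous genotype means
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    return 2.0 * p * (1.0 - p) * alpha ** 2


def qtl_heritability(p: float, alpha: float, total_variance: float) -> float:
    """QTL-specific heritability: QTL variance over total phenotypic variance."""
    if total_variance <= 0.0:
        raise ValueError("total phenotypic variance must be positive")
    return qtl_variance(p, alpha) / total_variance


# Hypothetical inputs: alpha of one phenotypic SD, total variance scaled to 1.
ALPHA = 1.0
TOTAL_VARIANCE = 1.0

for freq in (0.01, 0.05, 0.10):
    h2q = qtl_heritability(freq, ALPHA, TOTAL_VARIANCE)
    print(f"p = {freq:.2f}  ->  QTL-specific heritability = {h2q:.4f}")
```

Under these simplifying assumptions the same genotypic effect yields a several-fold difference in signal-to-noise ratio across the frequency range shown, because the variance term is driven by 2p(1 - p).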

The importance of interpopulation variation in QTL allele frequencies is strikingly made in Fig. 3 for the case of the FV Leiden polymorphism and its effects on APCR. Assuming a displacement between homozygous means of approximately two standard deviations (data taken from the GAIT study [31]), the figure plots the expected QTL-specific heritability as a function of the frequency of the Leiden variant. The allele frequency range shown covers the observed distribution across human populations [32]. It is clear from the figure that the relative genetic signal-to-noise ratio varies dramatically across populations. This has important implications for gene-mapping studies of similar QTLs. A family study undertaken in Sweden, where the polymorphism accounts for nearly 35% of the total variance in APCR would easily be able to localize this QTL with a modest sample of families. However, a study undertaken in Spain, where the variant accounts for approximately 3% of the variation in the APCR phenotype, would be unlikely to be successful in identifying the effects of the FV gene by a linkage strategy. From this empirical example, we can see the potential value for multiple studies.

Figure 3.

Between-population differences in the genetic signal-to-noise ratio (QTL-specific heritability) as a function of allele frequency.

The primacy of family studies in complex disease genetics

Several sampling designs are used in human genetics to study the genetic basis of quantitative phenotypes. Table 1 lists the major types and the kinds of inferences that can be made under each design. The sampling of unrelated individuals dominates epidemiology and there has been a tremendous growth in the use of such designs in genetic studies as the number of known polymorphisms within candidate genes has increased. However, the genetic information to be gleaned from such a design is sparse. It is limited to inferences about the association between a polymorphism and the quantitative trait, and is therefore dependent upon either directly typing functional variants or upon the markers being in linkage disequilibrium with functional variants. Unfortunately, linkage disequilibrium is highly unpredictable. Failure to find an association has no genetic interpretation other than the trivial one that the variant tested is unlikely to have an effect on the trait studied. Nothing can be stated about the importance or lack of importance of the candidate gene when association is negative, since other unstudied variants in the gene may still be responsible for large amounts of phenotypic variance. Association studies based on unrelated individuals have numerous other problems, such as the high potential for Type I error due to hidden population stratification, the widespread practice of data dredging, and failure to account for multiple testing. Additionally, when only unrelated subjects are available, no general inferences about the overall importance of genes (heritability) on the phenotype can be made. Similarly, there is no linkage information available, so that the mapping of novel QTLs is a remote possibility (even with very high-density single nucleotide polymorphism [SNP] typing). The principal benefit of association studies of unrelated subjects occurs when functional variants have been previously identified. In this situation, epidemiologic population-based studies are very useful for obtaining unbiased estimates of the sizes of the genetic effects.

Table 1.  Major study designs in human genetics and the types of genetic inferences that can be made

                                   Possible inferences
Design                             Heritability   Linkage   Association
Unrelated individuals              -              -         +
Triads (parents, one offspring)    -              +         +
Sibling pairs                      +              +         +
Nuclear families                   +              +         +
Extended pedigrees                 +              +         +

+, inference possible; -, inference not possible.

Studies of triads comprised of both parents and a single offspring are another common design. Typically, while all individuals are genotyped, only the offspring is phenotyped. This type of study again provides no information on the heritability of the trait. It can be used to test the association of a marker with the quantitative trait, but only in the presence of linkage. Thus, this approach represents a safeguard against spurious inferences of association due to hidden population stratification. Unfortunately, there is no power to detect linkage in the absence of association, and thus this approach is essentially limited to fine-mapping studies when a bona fide linkage has already been documented.

The other family-based designs listed can be used to make inferences in all of the major areas of genetics. Data on sib pairs, nuclear families, and extended pedigrees can be used to estimate heritabilities of quantitative traits, localize QTLs via linkage information, and fine-map and identify QTLs using association information. Thus, family-based designs based on larger configurations of relatives possess the ability to look at all types of genetic information relating to quantitative trait variation. At the initial stages of genetic dissection of complex quantitative disease-related phenotypes, family-based designs are essential.

Major family-based studies that have examined the heritabilities of quantitative traits for thrombosis/hemostasis include studies of twins or sib pairs [33,34], nuclear families [35–37] and extended kindreds [2,38–41]. Most linkage studies to date come from the GAIT study of extended pedigrees [42–46]. Additionally, the San Antonio Family Heart study of extended Mexican-American kindreds has obtained one relevant QTL localization [47]. Other linkage studies restricted to candidate gene regions include those on a large Vermont pedigree [48,49] and an early linkage study of quantitative variation in histidine-rich glycoprotein on a large Dutch kindred [50]. Family-based association inferences related to thrombosis have been performed in nuclear families [35,37] and extended pedigrees [42–44,51–53].

The importance of pedigree size for mapping QTLs

Information regarding the location of QTLs comes from correlations between related individuals for the quantitative phenotype and between DNA variation at specific locations across the genome. The pattern of sharing of alleles that are identical by descent is examined between all individuals at all genomic locations and can be tested to see if it predicts the observed patterns of phenotypic similarities. This is the foundation of the variance-component method of quantitative trait linkage analysis [13,54].
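The core of the variance-component idea can be sketched as an expected phenotypic covariance matrix in which the covariance between relatives i and j is modeled as pi_ij * var_qtl + 2 * phi_ij * var_polygenic, plus an individual-specific environmental variance on the diagonal. This is the standard textbook form of the model rather than a formula stated in this review. The function name, the variance values, and the sib-pair inputs below are illustrative assumptions.

```python
def expected_covariance(ibd, kinship, var_qtl, var_polygenic, var_env):
    """Expected phenotypic covariance matrix under a variance-component model.

    ibd      -- matrix of estimated IBD-sharing proportions at the test locus
                (1.0 on the diagonal)
    kinship  -- matrix of kinship coefficients (0.5 on the diagonal for
                non-inbred individuals)
    var_*    -- QTL, residual polygenic, and environmental variances
    """
    n = len(ibd)
    omega = []
    for i in range(n):
        row = []
        for j in range(n):
            value = ibd[i][j] * var_qtl + 2.0 * kinship[i][j] * var_polygenic
            if i == j:
                value += var_env  # environmental variance is individual-specific
            row.append(value)
        omega.append(row)
    return omega


# Hypothetical variance components (they sum to a total variance of 1.0).
VAR_QTL, VAR_POLY, VAR_ENV = 0.3, 0.3, 0.4

# Kinship matrix for a full-sib pair.
sib_kinship = [[0.5, 0.25], [0.25, 0.5]]

for shared in (0.0, 0.5, 1.0):  # proportion of alleles shared IBD at the locus
    sib_ibd = [[1.0, shared], [shared, 1.0]]
    omega = expected_covariance(sib_ibd, sib_kinship, VAR_QTL, VAR_POLY, VAR_ENV)
    print(f"IBD sharing {shared:.1f}: expected sib covariance = {omega[0][1]:.2f}")
```

A linkage test then asks whether allowing var_qtl to be greater than zero improves the likelihood of the observed phenotypes, that is, whether relatives who share more alleles identical by descent at a location are also more alike phenotypically.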

The power to localize QTLs using the variance-component method has been studied extensively and the necessary analytical theory has been developed [55]. It is now known that large extended pedigrees provide substantially more power per study subject than designs that focus on smaller familial units. Table 2 shows several study designs currently being used in studies of thrombosis/hemostasis-related traits and their relative efficiencies for QTL mapping. The most powerful of the studies listed involves an isolated population of eastern Nepal, the Jirels, in which 2000 people have been genome-scanned in a single extended pedigree [56], and who are currently being assessed for a number of hemostasis-related traits. As a benchmark, this study is assigned a relative efficiency of 1.0. Following closely in relative efficiency after this very large pedigree is the study based around a single large protein C-deficient kindred with 331 subjects from Vermont that has been used to study genetic interactions by the University of Vermont research group [48,49,53]. As the table shows, the Vermont pedigree has a relative efficiency that is 91% that of the massive Jirel pedigree. The next most powerful study is the San Antonio Family Heart Study [57], which has a relative efficiency of 0.59. This study is comprised of large Mexican-American families with an average of about 30 individuals per family. Similarly, the GAIT study [2] of extended Spanish families with an average of 19 studied individuals per pedigree exhibits a relative efficiency that is about one-third that of the Jirel study (and thus would require about three times as many individuals to obtain the same power to localize genes). The world-famous Framingham study (the calculations for this study are based on published family descriptions [37,58]) includes primarily nuclear families with some larger extended configurations, which have an average of five phenotyped individuals. Given this structure, the relative efficiency per person is only about one-quarter that of the large isolated population. Also shown are the results for nuclear families with four and three siblings, and the most common design in human genetic studies, the sibling pair. It is clear that nuclear families are much less powerful than the more extended kindreds. Nuclear families with three siblings would require about 10 times as many individuals sampled as the larger pedigree studies. Similarly, the results show that sib-pair studies represent very poor designs to map QTLs, requiring over 20 times as many individuals to be sampled.

Table 2.  Relative per-subject power to localize QTLs for different study designs

Population/study    Relative efficiency   Average pedigree size   Pedigree type
Jirel (Nepal)       1.00                  2000                    Extended, isolate
Vermont             0.91                  331                     Extended
SAFHS               0.59                  31                      Extended
GAIT                0.35                  19                      Extended
Framingham          0.24                  5                       Extended, nuclear
Nuclear (4 sibs)    0.17                  6                       Nuclear
Nuclear (3 sibs)    0.11                  5                       Nuclear
Sib pair            0.04                  2                       Relative pair
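The relative efficiencies in Table 2 translate directly into sample-size multipliers: a design with relative efficiency RE needs roughly 1/RE times as many subjects as the benchmark to reach the same power. The short sketch below applies that arithmetic to the values listed in the table. The function and variable names are illustrative only.

```python
# Relative per-subject efficiencies as listed in Table 2 (Jirel pedigree = 1.00).
RELATIVE_EFFICIENCY = {
    "Jirel (Nepal)": 1.00,
    "Vermont": 0.91,
    "SAFHS": 0.59,
    "GAIT": 0.35,
    "Framingham": 0.24,
    "Nuclear (4 sibs)": 0.17,
    "Nuclear (3 sibs)": 0.11,
    "Sib pair": 0.04,
}


def sample_size_multiplier(relative_efficiency: float) -> float:
    """Factor by which the sample must grow to match the benchmark's power."""
    if not 0.0 < relative_efficiency <= 1.0:
        raise ValueError("relative efficiency must lie in (0, 1]")
    return 1.0 / relative_efficiency


multipliers = {
    design: sample_size_multiplier(efficiency)
    for design, efficiency in RELATIVE_EFFICIENCY.items()
}

for design, factor in multipliers.items():
    print(f"{design:18s} needs about {factor:4.1f}x the benchmark sample")
```

The same function covers the dichotomization example given earlier in the text: a relative efficiency of 0.25 corresponds to a four-fold increase in the number of individuals required.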

These results are all obtained under the assumption of random ascertainment. This assumption is reasonable because of the emphasis that we are placing on quantitative phenotypes. If the focus is on a specific phenotype, non-random sampling (oversampling of families with extreme phenotypes) can be used to improve the power of smaller familial configurations. However, these studies then are essentially limited to the study of a single quantitative phenotype. Given the cost of genome scanning, it may be better to design studies that can serve as resources for mapping genes for many quantitative phenotypes. Once a powerful set of families has been typed for a genome scan, it is very cost-efficient to measure additional relevant quantitative phenotypes on stored plasma/serum samples and perform the quantitative trait linkage analyses to localize genes influencing these new phenotypes.

The story so far: localized QTLs for thrombosis/hemostasis-related traits

Much of our recent success in the initial localization of QTLs influencing quantitative disease risk factors in extended human pedigrees is attributable to the development of the variance-component method of linkage analysis [12,13]. The variance-component approach can accommodate pedigrees of any size and complexity, whereas penetrance-based models rapidly become computationally intractable as pedigree size/complexity increases. Since it is now clear that large complex pedigrees have substantially more power per sampled individual than do smaller families [11,54], the advantage of using variance component methods for localizing QTLs is considerable.

The localization of QTLs influencing thrombosis/hemostasis-related quantitative traits is still in its infancy. However, the progress of the past few years suggests that this approach will be very fruitful. Table 3 provides the current list of localized QTLs relevant to thrombosis. They are listed in order of their statistical evidence, as measured by the LOD score. A LOD score of at least 3 is required to obtain an acceptable QTL localization. The table shows that QTLs affecting levels of factor XII:C, histidine-rich glycoprotein, free protein S, P-selectin, VWF, and FVIII:C plus APCR have been localized. Four of these QTLs have been identified. Two of the QTLs, a novel QTL primarily influencing APCR on chromosome 18 [44] and a QTL due to the F12 structural gene [43], have been shown to have direct pleiotropic effects on the risk of thrombosis. Such tests of the pleiotropic effect of a QTL on disease risk require a sophisticated bivariate linkage analysis of both the quantitative trait and disease affection status [5]. However, a major benefit of such an approach is that the power to detect the direct effect of a QTL on disease risk when jointly analyzed with a quantitative trait that is also influenced by the QTL is much greater than when the disease affection status is analyzed by itself.
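For readers less familiar with the LOD scale, the sketch below shows the usual conversion from a pair of maximized natural-log likelihoods (linkage model vs. null model of no QTL at the tested location) to a LOD score, followed by a check against the threshold of 3 mentioned above. The log-likelihood values are hypothetical. The list of LOD scores is copied from Table 3.

```python
import math


def lod_score(loglik_linkage: float, loglik_null: float) -> float:
    """LOD score: log10 of the likelihood ratio, from natural-log likelihoods."""
    return (loglik_linkage - loglik_null) / math.log(10.0)


def meets_threshold(lod: float, threshold: float = 3.0) -> bool:
    """True when the LOD score reaches the chosen significance threshold."""
    return lod >= threshold


# Hypothetical log-likelihoods from fitting the two models at one location.
example_lod = lod_score(loglik_linkage=-1523.4, loglik_null=-1533.1)
print(f"example LOD = {example_lod:.2f}, passes: {meets_threshold(example_lod)}")

# LOD scores reported in Table 3, in the order listed there.
TABLE3_LODS = [11.73, 4.70, 4.50, 4.17, 4.07, 3.81, 3.53, 3.46, 3.05]
print(all(meets_threshold(lod) for lod in TABLE3_LODS))
```

A LOD of 3 corresponds to the data being 1000 times more likely under the linkage model than under the null model.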

Table 3.  Previously localized thrombosis/hemostasis-related QTLs

Trait(s)                   LOD     Location   Gene   Citation
FXII:C/thrombosis          11.73   5q35       F12    43
FII:C/thrombosis           4.70    11p11      F2     42
APCR/FVIII:C/thrombosis    4.50    18p11      ?      44
HRG                        4.17    3q27       HRG    50
Free protein S             4.07    1q32       ?      45
P-selectin                 3.81    15q26      ?      47
FXII:C                     3.53    10p13      ?      43
VWF                        3.46    9q34       ABO    46
APCR                       3.05    1q24       F5     44

FXII:C, factor XII coagulant activity; FII:C, factor II coagulant activity; FVIII:C, factor VIII coagulant activity; APCR, activated protein C ratio; VWF, von Willebrand factor; HRG, histidine-rich glycoprotein; ABO, ABO blood group.

Considering that very few family-based linkage studies have as yet been performed on hemostasis-related traits, the available results are striking. While several of the QTLs appear to be reflecting obvious variations in well-known candidate genes, others are novel. For example, there are no obvious positional candidate genes in the region on chromosome 18 where a major QTL influencing APCR has been found. Thus, there is great potential to identify a novel genetic contributor and perhaps to learn additional information about the underlying biologic pathway.

Beyond localization: statistical identification of functional polymorphisms

What happens after localization of a QTL? The initial localization using quantitative trait linkage analysis tends to encompass a large genomic region, typically 10–20 cM. To further reduce this interval, the usual trajectory of a study is to employ some type of linkage disequilibrium-based method for fine mapping. Variance-component methods applied to family-based samples can be used to simultaneously exploit both linkage and linkage-disequilibrium information [42,52,59–61]. This is a particularly powerful approach and does not suffer from the usual problem associated with studies of unrelated individuals of increased Type I error due to hidden population stratification. However, even after fine mapping, we still have the difficult problem of identifying the functional variants within a large genomic region encompassing several megabases. There has been little statistical research on this problem and even fewer actual successes. Part of the difficulty comes from the widespread practice of testing for association/disequilibrium in samples other than those in which the original linkage was found. Since there may be considerable luck involved in obtaining a strong linkage signal due to the functional polymorphisms segregating only in certain pedigrees, the original families should continue to be exploited for fine mapping. Additionally, the families in which the linkage was discovered represent a valuable dataset for testing whether any particular SNP or set of SNPs can completely account for the linkage when employed in a combined linkage/linkage-disequilibrium analysis. We have used this approach in the analysis of the prothrombin gene G20210A mutation and its effect on FII plasma activity in the GAIT sample [42].

With the dramatic improvements in resequencing technologies, it is likely that in the future most studies will routinely resequence a large number of individuals from the family-based linkage sample to identify all polymorphisms within a positional candidate region. If we have prior evidence for particular candidate genes in a linkage region, we may pursue these candidates first in the sequencing/polymorphism discovery effort. After the polymorphisms are enumerated, they can be typed in the original linkage dataset using high-throughput SNP typing methods. The critical problem then is how to prioritize polymorphisms for molecular functional characterization. Because of the relative unpredictability of linkage disequilibrium due to its large variance [62,63], it is clear that standard association methods (which exploit linkage disequilibrium) are not optimal for choosing functional polymorphic variants. What is needed is a method that will effectively eliminate the correlation between a marker and a QTL that is due to linkage disequilibrium. Unfortunately, there has been very little work on the subject of statistically finding the main functional effects in high-dimensional SNP data [64,65].

We have developed a promising statistical approach called Bayesian quantitative trait nucleotide analysis, which estimates the posterior probability that any SNP represents a functional polymorphism, and have applied it to completely dissect the allelic architecture of the FVII gene in relation to FVII:C levels (J.M. Soria et al., unpublished data). Although the approach is extremely computer-intensive, it is likely that the identification and dissection of the allelic architecture of QTLs will become a major focus in complex disease genetics.

Acknowledgements

The authors warmly thank Drs E.G. Bovill and S.J. Hasstedt for providing access to the pedigree structure from their study on the large Vermont kindred. Grant support for this work was provided by NIH grants MH59490, MH61622, HL45522, and HL70751.
