Nonreplication in Genetic Studies of Complex Diseases—Lessons Learned From Studies of Osteoporosis and Tentative Remedies


  • Hui Shen,

    1. The Key Laboratory of Biomedical Information Engineering of Ministry of Education and Institute of Molecular Genetics, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, China
    2. Osteoporosis Research Center, Creighton University Medical Center, Omaha, Nebraska, USA
  • Yongjun Liu,

    1. Osteoporosis Research Center, Creighton University Medical Center, Omaha, Nebraska, USA
  • Pengyuan Liu,

    1. Osteoporosis Research Center, Creighton University Medical Center, Omaha, Nebraska, USA
  • Robert R Recker,

    1. Osteoporosis Research Center, Creighton University Medical Center, Omaha, Nebraska, USA
  • Hong-Wen Deng PhD

    Corresponding author
    1. The Key Laboratory of Biomedical Information Engineering of Ministry of Education and Institute of Molecular Genetics, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, China
    2. Osteoporosis Research Center, Creighton University Medical Center, Omaha, Nebraska, USA
    3. Laboratory of Molecular and Statistical Genetics, College of Life Sciences, Hunan Normal University, ChangSha, Hunan, China
    • Osteoporosis Research Center, Creighton University Medical Center, 601 N. 30th Street, Suite 6787, Omaha, NE 68131, USA

  • The authors have no conflict of interest.


Inconsistent results have accumulated in genetic studies of complex diseases/traits over the past decade. Using osteoporosis as an example, we address major potential factors for the nonreplication results and propose some potential remedies.

Over the past decade, numerous linkage and association studies have been performed to search for genes predisposing to complex human diseases. However, relatively little success has been achieved, and inconsistent results have accumulated. We argue that those nonreplication results are not unexpected, given the complicated nature of complex diseases and a number of confounding factors. In this article, based on our experience in genetic studies of osteoporosis, we discuss major potential factors for the inconsistent results and propose some potential remedies. We believe that one of the main reasons for this lack of reproducibility is overinterpretation of nominally significant results from studies with insufficient statistical power. We indicate that the power of a study is not only influenced by the sample size, but also by genetic heterogeneity, the extent and degree of linkage disequilibrium (LD) between the markers tested and the causal variants, and the allele frequency differences between them. We also discuss the effects of other confounding factors, including population stratification, phenotype difference, genotype and phenotype quality control, multiple testing, and genuine biological differences. In addition, we note that with low statistical power, even a “replicated” finding is still likely to be a false positive. We believe that with rigorous control of study design and interpretation of different outcomes, inconsistency will be largely reduced, and the chances of successfully revealing genetic components of complex diseases will be greatly improved.


Osteoporosis is a common skeletal disease characterized by an excessively fragile skeleton and susceptibility to fracture.(1) The genetic contribution to osteoporosis susceptibility is well documented. Many studies have shown that genetic factors play important roles in determining population variation of various osteoporosis-related phenotypes, such as BMD,(1,2) bone structure and quality,(3,4) bone turnover,(5,6) bone loss (although the evidence is less clear),(7) and osteoporotic fracture per se.(1,8,9) Osteoporosis is one of several so-called complex diseases that are influenced by multiple genetic and environmental factors, as well as interactions among them. In other words, many genetic variants will contribute to the susceptibility to osteoporosis, a few with large effects and the majority with modest effects.(10,11)

The identification of genetic variants that are involved in osteoporosis etiology will lead to a better understanding of osteoporosis pathophysiology and to the development of new diagnostic tools and therapeutic agents. The majority of those variants may confer only a modest impact on individuals. However, because of their relatively high frequencies, they may have a dramatic effect on the human population, because a common risk allele can be carried by billions of people.(12) Such an effect has been shown in studies of other complex diseases. For instance, a common polymorphism in the peroxisome proliferative-activated receptor-γ (PPARγ) gene is associated with a modest (1.25-fold) increase in diabetes risk, but because the risk allele is so common in the human population, its modest effect translates into a large population attributable risk, influencing as many as 25% of type 2 diabetics in the general population.(12) On the other hand, knowledge of mutations responsible for rare Mendelian inherited bone diseases may also shed light on the mechanisms of bone metabolism, and in turn, may be useful for understanding or even developing new treatments for osteoporosis. An example is the LRP5 gene mutation identified in subjects with autosomal dominant high bone mass.(13) Nevertheless, because such mutations are normally very rare, their contributions to the risk of osteoporosis in the general population may not be large.(14)
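The link between a modest relative risk and a large population attributable risk can be illustrated with Levin's formula, PAR = f(RR - 1) / [1 + f(RR - 1)], where f is the frequency of the risk factor in the population. The sketch below is only an illustration: the function name and the frequencies in the loop are hypothetical choices, and it is not meant to reproduce the 25% figure cited above, which depends on the genotype frequencies and risk model used in the original report.

```python
def population_attributable_risk(exposure_freq: float, relative_risk: float) -> float:
    """Levin's formula: fraction of cases attributable to a risk factor.

    exposure_freq is the population frequency of the risk factor (hypothetical
    input here); relative_risk is the risk ratio for exposed vs. unexposed.
    """
    if not 0.0 <= exposure_freq <= 1.0:
        raise ValueError("exposure_freq must lie between 0 and 1")
    if relative_risk <= 0.0:
        raise ValueError("relative_risk must be positive")
    excess = exposure_freq * (relative_risk - 1.0)
    return excess / (1.0 + excess)


if __name__ == "__main__":
    # Same modest relative risk (1.25), increasingly common risk factor.
    for freq in (0.01, 0.10, 0.50, 0.85):
        par = population_attributable_risk(freq, 1.25)
        print(f"frequency={freq:.2f}  PAR={par:.3f}")
```

For a fixed relative risk, the attributable fraction grows steadily with the frequency of the risk factor, which is why common variants of modest effect can matter at the population level.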

Although the importance of genetic dissection of osteoporosis has long been accepted, as with other complex human diseases,(15) identifying the genetic variants predisposing to this disease turns out to be quite challenging. Association and linkage approaches are currently the major tools for genetic mapping of complex diseases. However, findings from either linkage or association studies are often difficult to replicate in subsequent independent studies. In association studies of osteoporosis, hundreds of studies on polymorphisms of >30 candidate genes have been published; however, no convincing conclusions have been reached on any of these genes.(16) In addition, >10 whole genome linkage scans for BMD, a strong risk factor for osteoporosis, were reported, with genomic regions showing linkage evidence identified on every human chromosome except the Y.(16) However, except for a few tentative limited replications (e.g., 1p36 and 11q14-23), the identified genomic regions are largely different across various studies.

The situation has led to concerns that results obtained from these studies are unreliable. Some researchers even doubt the value of these study strategies. Here, we address the potential reasons for inconsistent and nonreplicable results in gene mapping of osteoporosis and enumerate tentative strategies to deal with the problems arising from them.


Scarcity of power

The first and foremost factor among the issues surrounding association studies is lack of sufficient power to generate reliable and reproducible results. Apart from the sample sizes, the power of an association study can be affected by multiple other factors, including genetic heterogeneity, the extent and degree of linkage disequilibrium (LD) between the markers tested and the causal variants, and the allele frequency differences between them.(17,18) Any of these conditions may individually or interactively affect the power of a specific study.

Sample size

It is believed that most of the individual genetic determinants of complex diseases have small effects.(15)

So far, in the candidate gene studies reporting significant association with BMD, the genetic variants generally explain modest variation in BMD (<5%).(16) For example, the Sp1 polymorphism at the first intron in the type I collagen α 1 (COL1A1) gene has been related to various bone phenotypes. A number of studies have reached a consensus that the s allele or the ss genotype of this polymorphism may contribute to osteoporosis or osteoporotic fractures, but confers only moderate odds ratios.(16) Considering the modest genetic effects as such, sample sizes at the scale of thousands are generally needed to generate reliable and replicable association (Fig. 1). However, among the 226 published osteoporosis-related association studies as of the year 2002,(16) very few of them reached this level, and >50% of the studies have used sample sizes smaller than 200 and thus are probably not suitable for use in reaching a reliable and robust conclusion (Fig. 2).

Fig. 1.

Statistical power for quantitative traits in a randomly selected population sample. The QTL is assumed to be responsible for 2% or 5% of phenotypic variation, respectively. The frequency of the causal variant (P) is 0.05 or 0.2. The power is estimated under a rather ideal situation, in which the tested marker is assumed to be the QTL itself and the QTL is under additive inheritance. The significance level is set as α = 0.001.
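The kind of calculation behind such a power curve can be sketched as follows. This is a large-sample approximation under stated assumptions and not necessarily the method used to draw Fig. 1: the marker is the QTL itself, inheritance is additive, subjects are unrelated, and the 1-df test statistic is treated as a noncentral chi-square with noncentrality N·h²/(1 − h²), where h² is the fraction of phenotypic variance explained. Under this approximation the allele frequency does not enter separately once h² is fixed. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and SD 1


def qtl_association_power(n_subjects: int, variance_explained: float,
                          alpha: float = 0.001) -> float:
    """Approximate power of a 1-df test of an additive QTL in unrelated subjects.

    Assumption (not from the article): the test statistic is treated as a
    noncentral chi-square with 1 df and noncentrality N * h2 / (1 - h2).
    """
    if not 0.0 < variance_explained < 1.0:
        raise ValueError("variance_explained must lie strictly between 0 and 1")
    ncp = n_subjects * variance_explained / (1.0 - variance_explained)
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    shift = sqrt(ncp)
    # A 1-df noncentral chi-square is the square of a normal with mean sqrt(ncp),
    # so its upper tail is the sum of two normal tail areas.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


if __name__ == "__main__":
    for h2 in (0.02, 0.05):
        for n in (200, 500, 1000, 2000):
            power = qtl_association_power(n, h2)
            print(f"h2={h2:.2f}  N={n:5d}  power={power:.2f}")
```

Even in this idealized setting, a few hundred subjects give little power for a variant explaining 2% of the variance at α = 0.001, whereas samples in the thousands do considerably better.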

Fig. 2.

The sample size distribution of 226 osteoporosis-related association studies published as of 2002. The 226 osteoporosis-related association studies have been exhaustively summarized in a review article by Liu et al.(16)

The merits of large sample sizes have been exemplified in studies on other complex diseases. The PPARγ locus was first reported as being associated with type 2 diabetes in a Japanese population, but four subsequent studies found no significant association. However, when data from all published studies were combined,(12) PPARγ again emerged with statistical significance, confirming the original study instead of refuting it. An independent study with >3000 samples has also confirmed the association.(12)

Combining studies in meta-analyses provides an alternative approach.(19,20) However, meta-analyses are prone to biases resulting from selective reporting or publication of positive results, differing ascertainment and diagnostic criteria, and population stratification of allele frequencies. The first source of bias can be partially alleviated by excluding first reports of positive associations and focusing the meta-analysis on subsequent, independent samples. The other issues are more difficult to tackle unless research protocols are strictly standardized across studies before data collection and quality control is adhered to throughout.(21)

LD patterns

Until recently, most candidate gene studies tested only one or a small number of variants within or nearby the candidate gene.

These polymorphisms were used simply because they were the only ones that were identified previously in the gene or because they were easier to genotype than other polymorphisms. A typical example is the estrogen receptor α (ER-α) gene, which spans ∼300 kb. However, most studies for ER-α merely focused on two common polymorphisms, PvuII and XbaI, both located at intron 1.

The premise of association studies is that the polymorphism tested will be in LD with the causative variant and that this will be reflected in a finding of association. However, the existence and the degree of LD vary greatly across the human genome and in different populations. Emerging data have suggested that the human genome can be portrayed as a series of high LD regions separated by short discrete segments of low LD.(22–25) Those high LD regions, termed “haplotype blocks,” exhibit limited haplotype diversity so that a small fraction of variants (i.e., haplotype tagging single nucleotide polymorphisms [htSNPs]) can distinguish most of the haplotypes in a population.(23,25–27) In contrast, low LD regions can only be characterized adequately by typing highly dense markers.

Because of the variation in the LD pattern in the human genome, testing one or a few markers almost certainly cannot thoroughly examine the candidate gene, especially when the gene stretches over a large genomic region. Furthermore, because LD depends on population history, studies in different populations might have different findings for the same gene. Recent data suggested that the haplotype architecture across the human genome is more complex than previously suggested and the definition of haplotype block is not robust.(28,29) Complete resolution of common haplotypes and block structure is highly dependent on SNP density, with complete SNP discovery leading to different inferences of haplotype structure than incomplete SNP discovery.(28) Hence, without prior knowledge of the complete genetic variation within a candidate gene or of the structure of LD in the region, studies centered on only a few markers are likely to derive inconsistent results and, in this case, both positive and negative results should be interpreted with caution.

Allele frequency difference

It has been pointed out that, in association studies, not only the frequency of the causal variant but also the marker allele frequency would influence the likelihood of detecting association.(18) The interplay among allele frequency, LD strength, and effect size can substantially influence the statistical power of association studies.(18,30,31) For any allele frequency of the causal variants, the power of an association study is greatest when the allele frequencies of the marker and the causal variant match.(32) Differing allele frequencies can have a substantial negative impact on the power of association studies.(31) In Fig. 3, we show the impact of allele frequency difference on the power of association studies. This figure explicitly shows that the greater the allele frequency difference between the marker and the causal variant, the larger the sample size required. Because allele frequencies of the marker and the causal variant can vary considerably by ethnicity and country of origin, such variation may partially account for the observed lack of replication in association studies for BMD across different national and ethnic groups.(33,34) However, we argue that at least for those well-designed studies with sufficient power in populations of the same ethnicity, the results should to a large degree be reproducible, because allele frequencies of both markers and causal variants will not vary much within the same ethnicity or the same population.(34)

Fig. 3.

The impact of the allele frequency difference on the power of association studies. The power estimation was based on a variance component QTL association design for sibships. It shows sample sizes required for achieving 80% power to detect a QTL accounting for 5% of phenotypic variation, with different causal variant frequency (P) and marker allele frequency (x-axis). We assume that the QTL is under additive inheritance and the LD between the QTL and marker is D′ = 0.80.
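A rough way to see why mismatched allele frequencies inflate the required sample size is to convert D′ into the squared correlation r² between marker and causal variant; a widely used rule of thumb is that an indirect test through the marker needs about 1/r² times the sample size of a direct test of the causal variant. The sketch below uses that rule of thumb rather than the variance component sibship model behind Fig. 3, and it assumes the two minor alleles are positively associated; the function names are hypothetical.

```python
def r_squared_from_d_prime(p_causal: float, p_marker: float, d_prime: float) -> float:
    """Squared allelic correlation between two biallelic loci.

    Assumes the alleles with frequencies p_causal and p_marker are positively
    associated (D > 0), so D = D' * min(pA * (1 - pB), (1 - pA) * pB).
    """
    for p in (p_causal, p_marker):
        if not 0.0 < p < 1.0:
            raise ValueError("allele frequencies must lie strictly between 0 and 1")
    d_max = min(p_causal * (1.0 - p_marker), (1.0 - p_causal) * p_marker)
    d = d_prime * d_max
    return d * d / (p_causal * (1.0 - p_causal) * p_marker * (1.0 - p_marker))


def sample_size_inflation(p_causal: float, p_marker: float, d_prime: float) -> float:
    """Approximate factor by which N grows when testing the marker, not the QTL."""
    return 1.0 / r_squared_from_d_prime(p_causal, p_marker, d_prime)


if __name__ == "__main__":
    for p_marker in (0.05, 0.10, 0.20, 0.35, 0.50):
        factor = sample_size_inflation(0.20, p_marker, 0.80)
        print(f"causal=0.20  marker={p_marker:.2f}  N inflation={factor:.1f}x")
```

With matched frequencies, r² equals D′², so the inflation is smallest; it grows as the marker frequency moves away from that of the causal variant, even though D′ is held at 0.80.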

Population stratification

Population stratification could generate spurious association outcomes, not only false-positive but also false-negative results.(35) The actual impact of population stratification on association studies has been a matter of some debate.(36,37) Although the bias caused by population stratification might not be as significant as previously feared,(38,39) modest amounts of population stratification were indeed detected in case-control and case-cohort association studies.(40) Unless thoroughly addressed, population stratification will remain a potential source of bias.

One way to avoid the effects of population stratification is to use family-based association approaches. Among these methods, the transmission disequilibrium test (TDT) is the most popular one.(41) The TDT tests simultaneously for both linkage and association and is robust to population stratification.(42) In comparison with conventional association approaches, this method also has some drawbacks, such as requiring larger sample sizes and parental genotypes. However, advances in this methodology have been made,(43–45) and these drawbacks may be alleviated to a large extent.
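In its basic form for a biallelic marker, the TDT compares how often heterozygous parents transmit a given allele to affected offspring (b) with how often they do not (c), using the McNemar-type statistic (b − c)²/(b + c), which is referred to a chi-square distribution with 1 df. A minimal sketch follows; the function names and the counts in the example are hypothetical.

```python
from math import erfc, sqrt


def tdt_statistic(transmitted: int, untransmitted: int) -> float:
    """Basic TDT chi-square from transmissions by heterozygous parents."""
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("need at least one informative (heterozygous) parent")
    return (transmitted - untransmitted) ** 2 / total


def chi_square_1df_p_value(statistic: float) -> float:
    """Upper-tail probability of a chi-square with 1 df."""
    return erfc(sqrt(statistic / 2.0))


if __name__ == "__main__":
    # Hypothetical counts: 60 transmissions vs. 40 non-transmissions.
    stat = tdt_statistic(60, 40)
    print(f"TDT chi-square={stat:.2f}  p={chi_square_1df_p_value(stat):.4f}")
```

Only heterozygous parents contribute to the statistic, which is one reason family-based designs tend to need more recruited subjects than a case-control study of comparable power.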

Unfortunately, for the majority of contemporary population samples that consist mostly of unrelated subjects, family-based approaches such as the TDT are not applicable. Aware of this limitation, with some assumptions, other methods for detecting and controlling population stratification have been proposed, including Genomic Control(46) and the structured association method.(47) Both methods estimate potential stratification based on genotypes of a set of unlinked markers. The Genomic Control method makes a quantitative estimate of λ, the degree of overdispersion of statistics generated by population stratification, and uses it to adjust the ordinary test statistic. The structured association method first infers population structure using a set of unlinked genetic markers and then performs tests in homogeneous subpopulations. These two methods are well suited for population-based association studies and definitely merit application and further development. Two recent empirical studies used these methods to assess stratification in genetic association studies.(40,48) The results show that a modest amount of stratification can exist even in well-designed studies, and depending on the level of stratification (i.e., λ) that needs to be ruled out, the number of required unlinked markers could vary from a few dozen to hundreds.(40)
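The Genomic Control adjustment can be sketched in a few lines. One common estimator, used here as an assumption rather than as the definitive form of the method, takes λ as the median of the 1-df chi-square statistics at the unlinked markers divided by the median of a 1-df chi-square distribution (about 0.455), and then divides the candidate marker's statistic by λ. The function names and the example statistics are hypothetical.

```python
from statistics import NormalDist, median

# Median of a 1-df chi-square: the square of the 75th percentile of a
# standard normal (about 0.455).
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_control_lambda(null_marker_stats):
    """Estimate lambda from 1-df chi-square statistics at unlinked markers."""
    lam = median(null_marker_stats) / CHI2_1DF_MEDIAN
    return max(lam, 1.0)  # do not inflate statistics when lambda < 1


def adjust_statistic(chi_square, lam):
    """Deflate a candidate marker's chi-square by the estimated lambda."""
    return chi_square / lam


if __name__ == "__main__":
    # Hypothetical statistics from a handful of unlinked markers.
    null_stats = [0.2, 0.5, 0.91, 1.5, 3.0]
    lam = genomic_control_lambda(null_stats)
    print(f"lambda={lam:.2f}  adjusted statistic={adjust_statistic(8.0, lam):.2f}")
```

In practice far more than five unlinked markers are needed for a stable estimate of λ, in line with the range of a few dozen to hundreds noted above.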

Phenotype difference

Complex diseases like osteoporosis typically vary in severity of symptoms and age of onset, which results in difficulty in defining an appropriate phenotype and selecting the best population to study. For example, BMD has been well accepted as the most important measurable risk factor for osteoporosis and is widely used as a surrogate phenotype of osteoporotic fractures. However, other factors such as bone size and structure, bone loss, bone turnover, and propensity to fall(49) have all been shown to be important factors predicting osteoporotic fractures(50) and have been used as phenotypes in genetic studies.(16) Because different phenotypes may have their unique genetic determinants, using different phenotypes to test the same candidate gene or the same variant might generate different outcomes. Caution should be taken in claiming replication or consistent findings between studies using different phenotypes or samples. For instance, a recent study(51) detected significant association between the LRP5 gene and male but not female spine bone mass and declared that the finding was consistent with a previous linkage report.(52) However, the earlier linkage result(52) was detected for female bone mass. Because there may be sex-specific genetic contributions to BMD (see Complex etiology), we cannot conclude, solely on the basis of the earlier study,(52) whether there is significant linkage at this locus in males. Therefore, claiming consistency between those two results is problematic.

So far, no satisfactory answer can be given to the question of what phenotype or combination of phenotypes is the best for gene mapping of osteoporosis. However, we would like to emphasize that subphenotypes, which are phenotypically correlated with the endpoint disease, may not be genetically correlated with the endpoint disease. For instance, although hip BMD has been commonly used as a surrogate phenotype for predicting hip fracture risk, our study showed that the shared genetic variance between hip osteoporotic fractures and hip BMD is modest.(9,50) In other words, genes that contribute to hip BMD variation may not necessarily be relevant to hip fracture. A very recent study showed a similar relationship between wrist fracture and forearm BMD.(53) In addition, other studies also showed that the genetic correlation between BMD and speed of sound (SOS)(54) or between BMD and fat mass or lean mass is weak.(55) To overcome this drawback, with appropriate and feasible approaches available now, employing osteoporotic fracture per se(50) as the phenotype, or combining several phenotypes by principal components or factor analysis,(56,57) may be worthwhile.

Quality control

Errors in either genotyping or phenotyping will result in significantly diminished power or even cause false-positive results.(58–60) Several studies have shown that genotyping errors can substantially affect the power of linkage and association studies.(61–63) For instance, in case-control association studies, every 1% increase in genotyping error rate may require the sample sizes of both cases and controls to increase by 8% to maintain constant asymptotic type I and type II error rates.(61) To minimize the effects of genotyping errors, several methods that are robust to a variety of error models have been developed.(64,65)

Compared with genotyping error, phenotyping error has received less attention. For example, in a large-scale genetic study for BMD variation, the phenotyping task may be performed in different research centers and could last for years, and thus BMD of the subjects is highly likely to be measured by different DXA machines and/or models. The discrepancies and measurement errors are very likely to be more severe in longitudinal studies, in which the data are collected over an extended period of time.(66) However, few such studies have mentioned how to deal with this potential systematic bias. One might argue that this bias is generally minor and could be ignored; however, considering the modest effect of an individual gene (e.g., ∼5%), we cannot tell whether the detected significant results are caused by genetic variation or just the phenotypic variance generated during the measurement.

Generally, a statistical test is based on certain assumptions about the nature of the data, and if these assumptions are violated, the significant results generated may not be valid. The assumptions that underpin statistical tests should be known and should be verified. Normally, there is some “cleaning” of data before formal analysis, such as identifying and excluding outliers.(67) Extreme caution must be used in data cleaning, for it is possible to alter data to get the answer that one desires. Cleaning should be conducted to get the most accurate data, not the most significant or desired results.(60) The best way to deal with these potential errors is to perform a replication study, because the very same methodological artifacts, procedural errors, and investigator biases are unlikely to occur in the replication study. However, it is important to note that if the same experimenter, with the same biases and using the same methods and procedures, repeats a study that led to an erroneous conclusion, then the same erroneous conclusion could be reached again.(60) This could be the likeliest reason underlying detection and replication of quantitative trait loci (QTL) in severely underpowered studies by the same study group (also see Are replicated findings always true?). In addition, robust approaches that can accommodate outliers(68–70) are valuable alternatives.

Multiple testing

Within association studies, multiple genetic markers can be tested for association with a specific phenotype or phenotypes. In addition, association studies often analyze the sample repeatedly for different subphenotypes. Multiple testing creates a substantial risk of false-positive results (type I error). Obviously, the classic nominal significance-threshold framework (p < 0.05) is not appropriate for such studies. A p value of 0.05 means that, if there were no true association, a result at least as extreme as the one obtained would occur by chance 5% of the time. What a p value of 0.05 does not mean is that there is 95% confidence that the positive result is “real” (also see Are replicated findings always true?). It should be noted that, among the dozens of association studies for BMD published in 2003, few have taken the issue of multiple testing into account. Colhoun et al.(71) indicated that the most likely reason for nonreplication is false-positive results by chance in the initial study. The chance of a false-positive result is exacerbated by the fact that statistical tests for multiple loci are assessed in each study and only the positive results are reported.
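The scale of the problem is easy to quantify under a simplifying assumption of independent tests, which real markers and correlated subphenotypes will not fully satisfy: with m tests each at level α and no true association anywhere, the probability of at least one nominally significant result is 1 − (1 − α)^m. The sketch below also shows the Bonferroni per-test threshold α/m; the function names are hypothetical.

```python
def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive among independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    for m in (1, 5, 20, 100):
        fwer = family_wise_error_rate(0.05, m)
        per_test = bonferroni_threshold(0.05, m)
        print(f"tests={m:3d}  P(any false positive)={fwer:.2f}  "
              f"Bonferroni threshold={per_test:.5f}")
```

Because tests on markers in LD or on related phenotypes are correlated, the Bonferroni threshold is conservative in such settings, which is part of the motivation for permutation-based thresholds.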

Because a large number of SNP markers are available for examination, we suggest that extremely careful consideration of significance levels should be undertaken in the study design and interpretation of results, treating the p value quantitatively but not qualitatively. So far, a prudent method for correcting multiple testing might be one involving permutations of each individual dataset.(26) To overcome the publication bias, we suggest a mechanism for publishing negative results in a “brief report” section of specialty journals and storage and dissemination of all association data, perhaps in a widely accepted Web site, such as the recently developed Genetic Association Database (

Complex etiology

Besides the study design-related and data analysis-related factors described above, inconsistent results obtained from different populations might result from real biological differences. The complicated nature of etiology of complex diseases poses special challenges for gene discovery, including locus and allelic heterogeneity,(72,73) epistasis,(74) gene-environment interaction,(75,76) incomplete penetrance,(77) variable expressivity, and pleiotropy.(78) In different populations, all those factors could be different for the same disease/trait, which would result in nonreplicable positive results from different samples. Even when the causal variant is under investigation, the genetic variant may be more or less important in different populations, especially if the variant has low relative risk, variable penetrance, and variable allele frequency in different populations.(78)

Efforts to maximize sample homogeneity by recruiting subjects from the same ethnic group, and rigorously controlling possible environmental or confounding factors, will increase the chance of success. For instance, recent studies suggested that there may be sex-specific genetic contributions to BMD and that gene-sex interactions may exist.(79,80) In addition, the genes that regulate peak bone mass might differ from those that regulate bone loss.(5) Therefore, subgroup analyses, for example, in which samples are stratified based on age or sex,(81,82) may generate more homogeneous subsamples and shed light on potential gene-gene and gene-environment interaction effects. However, such studies can exacerbate the multiple testing problem and require even larger sample sizes to ensure that individual subgroups retain adequate power to detect significant associations.


As in most complex diseases/traits, the results from genome-wide scans for BMD or other osteoporosis-related phenotypes have been disappointing: only a few significant LOD scores have been achieved, with little replication across different scans. Some confounding factors leading to inconsistency in association studies, such as variation in LD patterns, do not pose problems for linkage studies. However, other forms of confounding factors or bias in association studies undoubtedly can have qualitatively similar impacts in linkage mapping. Complex etiology, including genetic heterogeneity, epistasis, incomplete penetrance, variable expressivity, and pleiotropy,(83) is believed to contribute partially to the situation.

Assessment of the genome-wide significance for linkage studies is one of the major challenges facing whole genome linkage studies of complex diseases and has led to extensive debates. Lander and Kruglyak(84) proposed a significance threshold guideline for genome-wide linkage studies, assuming an infinitely dense map of markers. Several authors have suggested that the infinitely dense map assumption may be too conservative.(85,86) Concerned about this, Cheverud(87) proposed a method that calculates the effective number of independent tests performed in a genome scan and thus provides appropriate significance thresholds. On the other hand, Curtis(88) argued that the guideline proposed by Lander and Kruglyak(84) is not useful, because it does not account for multiple models and multiple phenotypes tried in analyses. In this case, an additional multiple analysis correction is necessary. Furthermore, because the multiple analyses usually represent multiple strategies to dissect the same phenotype or set of related phenotypes, they are not independent. Therefore, the traditional Bonferroni correction would be overconservative. To establish an appropriate additional correction factor, Camp and Farnham(89) proposed a method to quantify the dependence that exists between multiple analyses and estimate the number of “effectively independent” analyses that have been performed. Several other approaches have also been proposed for controlling the multiple testing problem. Churchill and Doerge(90) proposed a very general method for determining empirical thresholds by generating many different samples from the actual data by shuffling the phenotype values with respect to the marker genotypes. This method computes the thresholds directly from the actual data being analyzed and thus would be a robust and prudent method for adjustment of the effects of multiple comparisons. However, this method is computationally intensive.
Another method is the false discovery rate (FDR) proposed by Benjamini and Hochberg,(91) which is the expected proportion of the number of erroneous rejections to the total number of rejections. This method has been shown to be valuable in genetic mapping of complex disorders and may provide greater power than the method of Lander and Kruglyak,(84) especially for loci with modest effects.(92,93)
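The Benjamini and Hochberg step-up procedure is simple to state: sort the m p values, find the largest rank k such that the k-th smallest p value is at most q·k/m, and reject the k hypotheses with the smallest p values. A minimal sketch follows; the function name and the example p values are hypothetical, and the procedure as written assumes independent or positively dependent tests.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of hypotheses rejected at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: remember the largest rank whose p value is under its bound.
        if p_values[idx] <= q * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


if __name__ == "__main__":
    # Hypothetical p values from ten marker tests.
    p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
                0.060, 0.074, 0.205, 0.212, 0.216]
    print("rejected indices:", benjamini_hochberg(p_values, q=0.05))
```

Unlike a Bonferroni threshold, the bound loosens with rank, so a hypothesis whose p value misses its own bound can still be rejected if a larger p value further down the list meets its bound.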

Because of multiple testing and polygenic effects, linkage peaks are rather unstable, and a high LOD score alone does not necessarily guarantee that the result is always “true,” particularly in small samples. A concrete empirical example that shows statistical fluctuation of LOD scores in small study samples is as follows. A sibling pair study with 374 sister pairs reported a maximum LOD score of 3.50 near D11S987 for linkage to femoral neck BMD variation.(52) However, a subsequent linkage study by the same group with 595 sister pairs reported a LOD score <2.2 at D11S987.(94) Moreover, the linkage signal at D11S987 completely disappeared in their most recent sample of 774 sister pairs.(95)

It is necessary to emphasize that, in linkage studies, the important feature of sample size is not merely the total number of subjects but also the number of relative pairs that are informative for linkage analyses. Generally, extended pedigrees can be more powerful because of increasingly large numbers of relative pairs informative for linkage analyses. Although sib pairs in extended pedigrees are not necessarily all independent, this may be of limited relevance to practical linkage analyses. Even under completely informative mating, the significance level is only mildly inflated when all possible sib pairs from sibships of varying sizes are treated independently in linkage analyses.(96,97) Under realistic situations when mating types are not completely informative for linkage analyses, all possible sib pairs can be used as if they were independent.(96,97)
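The difference between counting subjects and counting sib pairs is simple arithmetic: a sibship of size s contributes s(s − 1)/2 possible pairs, so the number of pairs grows much faster than the number of siblings. A small sketch, with hypothetical function names and sibship sizes:

```python
def sib_pairs_in_sibship(sibship_size: int) -> int:
    """Number of distinct sib pairs in one sibship: s * (s - 1) / 2."""
    return sibship_size * (sibship_size - 1) // 2


def total_sib_pairs(sibship_sizes) -> int:
    """All possible sib pairs across a collection of sibships."""
    return sum(sib_pairs_in_sibship(size) for size in sibship_sizes)


if __name__ == "__main__":
    sizes = [2, 3, 4]  # hypothetical sibships with 9 siblings in total
    print("siblings:", sum(sizes), " possible sib pairs:", total_sib_pairs(sizes))
```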

As an illustration, we(16) compared the estimated statistical power between two empirical linkage studies for BMD: a sample of 635 subjects from 53 extended pedigrees(98) and a sample of 595 sister pairs.(94) The results(16) indicated that even under an ideal situation in which QTLs have relatively large effects (heritability = 0.25) and the marker tested was the QTL (θ = 0), the power of the study of Koller et al.(94) is <30%, whereas the 53 extended pedigrees(98) may offer much higher statistical power, although still insufficient to detect QTLs with small effects. We do not intend to blame the sib-pair method, but argue that sufficient sample sizes should be at hand before embarking on a sib-pair linkage analysis, because a sample size of thousands of independent sib pairs is generally required for any results to be of decent reliability (Table 1). With sufficient statistical power, sib-pair studies could perform as well as extended-pedigree-based studies and may have advantages on age and cohort matching for an age-dependent trait.(99)

Table 1. Sample Sizes Required to Achieve 80% Power With a LOD Score of 3.0 [table image not reproduced]

There are several other issues we would like to mention for linkage studies. First, ascertainment through a proband in the upper or lower tail of the distribution of a quantitative trait may cause greater phenotypic variability and increase QTL heterozygosity within the pedigree, which can lead to higher statistical power.(100) This sampling scheme will only change the QTL heterozygosity but not identity; thus, it will not introduce locus heterogeneity. Another important issue to keep in mind is that estimates of locus-specific effect sizes at genome-wide LOD score peaks tend to be grossly inflated and can even be virtually independent of the true effect size, even for studies on large samples when the true effect size is small. Overestimation of a locus effect could be one of the major factors responsible for frequent failures to replicate initial claims of linkage or association for complex diseases, even when the initial localization is, in fact, correct.(15,101) In addition, for linkage studies, some investigators have indicated that replication is problematic under conditions of locus heterogeneity.(102) Pooling or Bayesian approaches with raw data may offer advantages,(103,104) and data availability for such analyses would require collaborative efforts at a national or international level. Whereas pooling data from well-designed studies certainly should enhance power to unveil the etiological mysteries underlying complex disease, the effectiveness of such pooled analyses would be affected by differences in marker map density and sample ascertainment criteria between individual studies.(105) Therefore, investigators not only should carefully control the design of their own individual studies, but also should keep this collaboration in mind.


We here propose several criteria and remedies (Table 2), by no means exhaustive, for rigorous control of study design and interpretation of results in association and linkage studies for complex traits. Readers are also referred to several well-written articles(17,18,26,71,100,106–112) that extensively discuss problems in genetic dissection of complex diseases and provide useful guidelines for performing and interpreting such studies. Moreover, we would like to emphasize that a replication study is preferably performed with an independent sample, rather than a sub- or extended sample from the existing dataset.(113) On the other hand, as more researchers realize that genetic dissection of complex diseases requires large samples and tend to recruit more families or combine data from others, the likelihood of overlap between the initial and replication samples increases. If a completely independent replication is not feasible, replication studies with samples that overlap the initial study are still valuable.

Table 2. Potential Confounding Factors Causing Nonreplication in Association and/or Linkage Studies [table image not reproduced]


The likelihood that a detected significant result is “real” can be estimated by the positive predictive value (PPV). This likelihood is important and meaningful to researchers once a study has been done. To illustrate the meaning of PPV, we give a simplified example. Suppose that in an association study, 10% of all tested markers are “truly” associated with the phenotype and the power of this study is 80% for detection of the odds ratio of interest at the 5% significance level. Consequently, if we test 1000 markers, only 100 true associations will exist, and we will detect 80 of them as being significant. Of the remaining 900 nonassociated markers, we will declare 5% (i.e., 45) as being significantly associated with the phenotype. Thus, the overall probability of a declared association being true, the PPV, is only 64% (i.e., 80/125), rather than the 95% (i.e., 1 − 5%) that is generally (although wrongly) presumed. Clearly, PPV depends not only on the significance level, but also on the statistical power and the prior probability that the hypotheses being tested are true.
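The counting argument in this example can be sketched in a few lines of Python. This is only an illustration of the arithmetic described in the text; the function name `ppv_from_counts` and its parameter names are hypothetical and not part of the original study.

```python
def ppv_from_counts(n_tests, prior, power, alpha):
    """Return (true positives, false positives, PPV) for a batch of tests.

    n_tests: number of markers tested
    prior:   fraction of markers truly associated with the phenotype
    power:   probability of detecting a true association
    alpha:   significance level (false-positive rate under the null)
    """
    n_true = n_tests * prior            # markers with a real association
    n_null = n_tests - n_true           # markers with no association
    true_pos = n_true * power           # real associations declared significant
    false_pos = n_null * alpha          # null markers declared significant
    return true_pos, false_pos, true_pos / (true_pos + false_pos)


# Values from the text: 1000 markers, 10% truly associated, 80% power, alpha = 5%
tp, fp, ppv_value = ppv_from_counts(1000, 0.10, 0.80, 0.05)
print(f"true positives = {tp:.0f}, false positives = {fp:.0f}, PPV = {ppv_value:.2f}")
```

Lowering either the prior or the power in this sketch shows how quickly the PPV falls even when the significance level is left unchanged.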

PPV can also be applied to linkage studies.(114) In the following, by estimating PPV for an initial genome-wide linkage scan and a subsequent replication study, we show that, given the low power of a study, a high LOD score may not necessarily be “true,” and even a “replicated” linkage could be a false positive. Linkage for BMD has been detected on chromosome 1q with 595 independent white sister pairs(94) and subsequently replicated with an expanded sample of 938 sister pairs.(115) For the initial whole genome scan (WGS),(94) although the reported LOD score is 3.6, the estimated PPV is only 0.038, even under an ideal situation (see Appendix for detailed calculation). Subsequently, taking into account the PPV of the first WGS and adjusting the significance level for the replication study, the PPV for the independent replication study(115) is estimated to be 0.33 (see Appendix for detailed calculation). In other words, there is at least a 67% chance that the replicated linkage finding on 1q is a false positive.

This concrete empirical example suggests that, to yield reliable linkage results, both the initial and replication studies should have sufficient statistical power. Extreme caution should be exercised in interpreting results obtained from studies with low statistical power, even if replicated findings have been made, especially when the replication study was performed by the same group as the initial report.


Various factors can result in the lack of reproducibility seen in genetic studies of osteoporosis, as well as in other complex diseases; however, this situation does not condemn the approach as being futile, but rather indicates that caution and rigor should be exercised when designing a study and interpreting the results. Adoption of appropriate study design and procedures to limit bias and confounding, and recruitment of sufficiently large samples to ensure adequate power to detect modest genetic effects, would reduce the number of nonreplicated results considerably. We are glad to see that the field has been progressing rapidly, as several well-designed studies with exciting findings have emerged recently.(57,82,116–119)

Finally, we recommend that knowledge from multidisciplinary approaches, including genetic epidemiology, gene expression profiling, and proteomics be incorporated. These approaches are individually not perfect, but they can complement and confirm each other. A tentative example is the recent identification of the Alox15 gene for BMD.(120) With advancement and progress in different fields, we anticipate that more promising opportunities will be provided to disentangle the genetic determinants of complex diseases/traits.


The investigators were partially supported by grants from Health Future Foundation, the State of Nebraska (LB595 and LB692), and the U.S. Department of Energy. The study also benefited from 211 State Key Research Fund to Xi'an Jiaotong University. The study also benefited from grant support from CNSF, Huo Ying Dong Education Foundation, and Hunan Normal University.



For linkage studies, the PPV is the probability that an observed linkage is true. It can be estimated as:

$$\mathrm{PPV} = P(+ \mid L) = \frac{P(L \mid +)\,P(+)}{P(L \mid +)\,P(+) + P(L \mid -)\,[1 - P(+)]} \tag{1}$$

where, “+” and “−” denote the presence and absence of true linkage, respectively. L denotes the rejection of the null hypothesis (i.e., the conclusion that there is linkage). P(L|+) = 1 − β = power, and P(L|−) = α = significance level. The P(+) is the prior probability of linkage, which may not be estimated accurately and thus may be subjective.(114)

We estimated PPV for two linkage studies: an initial WGS that detected significant linkage for BMD on chromosome 1q with 595 independent white sister pairs(94) and a subsequent study that replicated the initial finding with an expanded sample of 938 sister pairs.(115) Assuming an ideal situation in which all sib pairs are independent and the identity by descent (IBD) can be inferred unambiguously, under an overall trait heritability (h²) of 0.6, the power to detect a QTL (h² = 0.25) with the reported LOD = 3.6 in their first WGS(94) is 9.2% (θ = 0.00). Similarly, the power to detect a QTL (h² = 0.25) with the reported LOD = 2.5 in their confirmation study(115) is 8.8% (θ = 0.00). Therefore, even under this optimistic situation, the power of both studies is <10%; in other words, the probability of detecting a QTL with LOD = 3.6 in 595 independent sister pairs is <0.1, and the same holds for their confirmation study. Thus, the chance of not only detecting but also replicating a QTL(94,115) is around 1 in 100, and correspondingly, the chance of repeatedly catching such rare events by the same group for three distinct osteoporosis phenotypes(94,95,115,121,122) would be ∼1 in 1,000,000. If the former could be explained by extreme luck, then the latter is apparently beyond any rational comprehension. For their initial WGS,(94) we apply the prior probability of linkage P(+) = 1/46, which has been used in an earlier calculation.(123) Then, the PPV for their first WGS(94) is only 0.038 (in Eq. 1, P(+) = 1/46, 1 − β = 0.09, α = 0.05).
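The "1 in 100" and "1 in 1,000,000" figures follow from multiplying the two power estimates. The short sketch below reproduces that arithmetic, assuming (as the multiplication in the text implies) that detection and replication are independent events and that the three phenotypes are independent of one another; the variable names are illustrative only.

```python
# Power estimates quoted in the text (theta = 0.00, QTL h^2 = 0.25)
power_initial = 0.092      # initial WGS, reported LOD = 3.6
power_replication = 0.088  # confirmation study, reported LOD = 2.5

# Assumption: detection and replication are independent events
p_detect_and_replicate = power_initial * power_replication

# Assumption: the three osteoporosis phenotypes behave independently
# and each has the same joint probability as above
p_three_phenotypes = p_detect_and_replicate ** 3

print(f"detect and replicate once:   {p_detect_and_replicate:.4f}")
print(f"same for three phenotypes:   {p_three_phenotypes:.2e}")
```

The exact products are somewhat smaller than the rounded figures in the text, which take each joint probability as roughly 0.01.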

It is important to note that when estimating the PPV for a WGS, a genome-wide significance level instead of a point-wise significance level should be used.(114) For instance, if we use the LOD score of 3.6 as the cut-off for significant linkage in a WGS, the appropriate significance level (i.e., α) for computing PPV is 0.05 rather than 2.2 × 10⁻⁵. Otherwise, the PPV would be dramatically inflated,(123) such that even for suggestive linkage (LOD = 2.2) and with statistical power as low as 10%, there would be >75% chance that the detected linkage is true. However, this is unlikely to be the case, given that even for BMD alone, plenty of different genomic regions have been reported with suggestive or significant linkage.(16)
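The inflation described here can be illustrated numerically. The sketch below rests on two assumptions that are not stated in the text: that a point-wise p value is obtained from a LOD score through a one-sided chi-square test with 1 degree of freedom (statistic 2 ln(10) × LOD), and that the prior P(+) = 1/46 from the initial WGS calculation also applies to the suggestive-linkage case. The function names are hypothetical.

```python
import math


def pointwise_p_from_lod(lod):
    """Approximate point-wise p value for a LOD score.

    Assumption: the statistic 2*ln(10)*LOD is referred to a one-sided
    chi-square distribution with 1 df, so p = 0.5 * erfc(sqrt(ln(10) * LOD)).
    """
    return 0.5 * math.erfc(math.sqrt(lod * math.log(10)))


def ppv(prior, power, alpha):
    """Eq. 1: PPV from the prior P(+), the power 1 - beta, and alpha."""
    return power * prior / (power * prior + alpha * (1.0 - prior))


prior = 1 / 46                          # prior used for the initial WGS
p_significant = pointwise_p_from_lod(3.6)
p_suggestive = pointwise_p_from_lod(2.2)

# Correct usage: genome-wide alpha = 0.05 for a LOD 3.6 threshold
ppv_genome_wide = ppv(prior, 0.09, 0.05)
# Incorrect usage: point-wise alpha for suggestive linkage, power = 10%
ppv_inflated = ppv(prior, 0.10, p_suggestive)

print(f"point-wise p at LOD 3.6: {p_significant:.1e}")
print(f"point-wise p at LOD 2.2: {p_suggestive:.1e}")
print(f"PPV with genome-wide alpha: {ppv_genome_wide:.3f}")
print(f"PPV with point-wise alpha:  {ppv_inflated:.2f}")
```

Under these assumptions the point-wise level pushes the PPV to roughly three quarters, whereas the genome-wide level gives a value below 0.05.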

Because their replication study was carried out based on the prior linkage detected in their initial WGS, to compute the PPV for their replication study, we use the estimated PPV of their first study (PPV = 0.038) as the prior probability of linkage, P(+). In addition, the overall significance level for the replication study only needs to be adjusted for the tested linkage region, rather than the entire genome. The chromosome 1q critical interval claimed by Econs et al.(115) is delimited by D1S2777 and D1S2823, spanning ∼46 cM. Therefore, using the formula proposed by Lander and Kruglyak(84) and the reported LOD score on 1q (LOD = 2.5, which corresponds to a point-wise p value of 3.43 × 10⁻⁴), the overall significance level α for chromosome 1q is estimated as 0.0072. Thus, using P(+) = 0.038, 1 − β = 0.09, and α = 0.0072 in Eq. 1, the PPV for their independent replication study(115) is estimated to be 0.33.
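The two-step calculation, in which the PPV of the initial scan becomes the prior for the replication study, can be sketched as follows. The inputs are the values quoted in this Appendix; the function name `ppv` is illustrative, and the region-wide α = 0.0072 is taken as given rather than recomputed from the Lander and Kruglyak formula.

```python
def ppv(prior, power, alpha):
    """Eq. 1: P(+|L) = power*P(+) / (power*P(+) + alpha*(1 - P(+)))."""
    return power * prior / (power * prior + alpha * (1.0 - prior))


# Step 1: initial WGS, genome-wide significance level
ppv_initial = ppv(prior=1 / 46, power=0.09, alpha=0.05)

# Step 2: replication study, with the first PPV as the new prior and
# the region-wide significance level for the ~46 cM interval on 1q
ppv_replication = ppv(prior=ppv_initial, power=0.09, alpha=0.0072)

print(f"PPV of initial WGS:       {ppv_initial:.3f}")
print(f"PPV of replication study: {ppv_replication:.2f}")
```

Carrying the unrounded first-step PPV forward, instead of the rounded 0.038, changes the second result only in the third decimal place.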