Reliability and replicability of genetic association studies


There has been a proliferation of genetic association studies over the last decade, as genotyping costs have decreased and interest has grown in the potential application of this technology to understanding the biological basis of behaviour and psychiatric illness, including addictive behaviours [1]. Unfortunately, this has not been matched by corresponding insight into the genetic architecture of behavioural traits, or their underlying neurobiology. Very few of the genetic associations reported to date can be considered well established. The primary reason for this is that, all too frequently, initially promising findings, which typically generate considerable excitement, subsequently fail to replicate [2], a phenomenon sometimes described as the ‘Winner's Curse’ [3]. The situation has not necessarily improved with the advent of gene–environment interaction studies, which attempt to capture more phenotypic variation by simultaneously assessing the impact of genetic and environmental factors and, critically, their interplay. In principle, this approach might serve to increase statistical power, for example by allowing researchers to focus on genetic effects within a subgroup, defined on the basis of some environmental variable, in which these effects might be expected to be larger. In practice, however, this is almost impossible to achieve, given the difficulty in defining a priori what those subgroups might be [4]. Defining subgroups a posteriori, on the other hand, increases the risk of Type I error.

One issue with genetic association studies is the considerable scope for flexibility in the statistical analyses employed, and the large number of statistical tests that may, in principle, be performed. Given the thousands of genes which are expressed in the brain (and are therefore plausible candidates for most behavioural phenotypes), the multiple polymorphisms within these genes and the multiple phenotypes with which these may be associated, the multiple testing burden becomes very large. Gene–environment interaction studies multiply this burden further through the investigation of multiple subgroups. A recent simulation study by Sullivan [5] illustrates the ease with which potentially publishable ‘findings’ may be obtained from random data. Sullivan simulated a candidate gene association data set and showed that in more than 90% of cases some potentially publishable (i.e. nominally significant) correlation between a genetic variant and a phenotype might be obtained, given multiple possible ways of grouping genotypes (even assuming only one candidate gene is investigated), multiple polymorphisms tested within that gene, and so on. Furthermore, Sullivan showed that in the majority of cases these ‘findings’ can be replicated, given a weak definition of ‘replication’, again using random data.
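The arithmetic behind this multiple testing burden can be sketched in a few lines of Python. This is a minimal illustration and not a reconstruction of Sullivan's simulation: it assumes that every test is independent and that no true association exists, so each p-value is a uniform random draw. The analysis plan (4 polymorphisms, 3 genotype groupings, 2 phenotypes) and the function names are hypothetical. In real data, tests on the same genotypes are correlated, so the true figure would differ.

```python
import random


def familywise_false_positive_rate(n_tests, alpha=0.05):
    """Chance of at least one p < alpha among n_tests independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def simulate_null_studies(n_tests, n_studies=10_000, alpha=0.05, seed=0):
    """Fraction of simulated null studies with at least one 'finding'.

    Under the null hypothesis a p-value is uniform on [0, 1], so each test
    is modelled as one uniform draw (an independence assumption).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    # Hypothetical plan: 4 polymorphisms x 3 genotype groupings x 2 phenotypes
    n_tests = 4 * 3 * 2
    print(f"analytic:  {familywise_false_positive_rate(n_tests):.3f}")
    print(f"simulated: {simulate_null_studies(n_tests):.3f}")
```

Under these assumptions, a plan of only 24 tests yields at least one nominally significant result in roughly 70% of studies of purely random data.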

There is therefore a balance to be struck between publishing all data which meet minimum criteria for study quality, in order to avoid publication bias, and only publishing data which meet the most stringent criteria for study quality, in order to avoid flooding the literature with false positive results. Part of the difficulty lies in the pressures on researchers to present their data in the most favourable light, for example in order to achieve publication in journals with a high ‘impact factor’. There is now considerable empirical evidence that such selective presentation occurs [6–8]. However, nominal statistical significance does not mean that the reported results are in fact true; there is increasing concern that a substantial proportion of published findings, particularly in ‘hot’ areas or where there is considerable scope for multiple testing, may in fact be false [9].

What can be done? First, candidate genes should be selected with great care, either on the basis of known neurobiology or localization through other genetic techniques such as linkage or genome-wide association. Secondly, polymorphisms should be selected which have known functional consequences; if multiple polymorphisms within a candidate gene are investigated, investigators should consider combining these within a haplotype analysis to reduce the number of statistical tests required. Thirdly, the total number of genes and polymorphisms investigated should be reported, in order that the base rate of statistical testing is transparent and appropriate correction can be made. Similarly, the total number of phenotypes investigated should be reported. Fourthly, very stringent criteria should be adopted before claiming replication. A recent review of gene–environment interaction studies concluded that unwarranted claims of replication are common, for example where there is statistical evidence of interaction, but where the nature of this interaction effect differs qualitatively from previous reports [10]. Fifthly, replication data from an independent sample should be reported within a single study wherever possible. Sixthly, sample sizes should be sufficient to detect very small effects; the growing consensus is that single gene effects on behavioural phenotypes are likely to account for less than 1% of phenotypic variance [1]. While some have argued that intermediate phenotypes may afford gains in statistical power, through greater genetic penetrance and correspondingly larger effects [11], this is an empirical question and should not necessarily be assumed [12]. Seventhly, and finally, non-significant results, or failures to replicate, should be reported as such, in particular if characteristics of the study (e.g. sample size) mean that these null results are compelling.
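To give a sense of scale for the sample size recommendation, the following sketch estimates how many participants are needed to detect an effect accounting for 1% of phenotypic variance. It is an approximation under stated assumptions, not a substitute for a formal power analysis: the effect is treated as a simple correlation (r = 0.1 when 1% of variance is explained), the Fisher z approximation is used, and the test is two-sided. The function name and the figure of 24 tests used for the Bonferroni correction are hypothetical.

```python
import math
from statistics import NormalDist


def required_sample_size(variance_explained, alpha=0.05, power=0.80):
    """Approximate n needed to detect a correlation of sqrt(variance_explained).

    Uses the Fisher z approximation for a two-sided test:
        n = ((z_alpha + z_power) / atanh(r)) ** 2 + 3
    """
    r = math.sqrt(variance_explained)
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1.0 - alpha / 2.0)
    z_power = standard_normal.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)


if __name__ == "__main__":
    # A single pre-specified test of an effect explaining 1% of variance
    print(required_sample_size(0.01))
    # The same effect after Bonferroni correction for 24 (hypothetical) tests
    print(required_sample_size(0.01, alpha=0.05 / 24))
```

On these assumptions a single pre-specified test already requires several hundred participants, and correcting for multiple tests raises the requirement substantially.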

Biological samples suitable for genotyping exist in storage in multiple research groups world-wide, and as genotyping costs continue to decrease these will inevitably be put to use to test multiple putative gene–phenotype associations. It is highly unlikely that a report which describes a correlation between a single polymorphism in a single gene and a single phenotype represents the only such analysis performed by that group [13]. Other analyses may be published elsewhere, but this nevertheless represents occult multiple statistical testing. Many analyses will go unreported because they fail to reach nominal statistical significance, while for those which do reach nominal significance a rationale for examining that particular polymorphism in that particular gene is constructed a posteriori. If the ratio of null associations to true associations tested is high (i.e. when exploratory analyses are common), low statistical power is a particular concern, as it will inevitably mean that the ratio of true findings to false findings is unfavourable [14]. Candidate gene association studies which appear to conform to this pattern should be treated with extreme caution.
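The relationship between statistical power, the ratio of true to null associations tested, and the credibility of a nominally significant result can be made concrete with a standard positive predictive value calculation. The sketch below is illustrative only: the function name, the prior ratio of one true association per 100 null associations, and the power values are assumptions chosen for the example, not estimates from the literature.

```python
def positive_predictive_value(power, true_to_null_ratio, alpha=0.05):
    """Proportion of nominally significant findings that are true.

    With R = true_to_null_ratio (true associations per null association tested):
        expected true positives  are proportional to power * R
        expected false positives are proportional to alpha
        PPV = power * R / (power * R + alpha)
    """
    true_positives = power * true_to_null_ratio
    false_positives = alpha
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Hypothetical exploratory setting: 1 true association per 100 null ones
    ratio = 1 / 100
    for power in (0.2, 0.5, 0.8):
        ppv = positive_predictive_value(power, ratio)
        print(f"power = {power:.1f}: PPV = {ppv:.3f}")
```

In this hypothetical setting, fewer than 1 in 20 nominally significant findings is true when power is 20%, and even at 80% power the large majority of such findings remain false.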

Declaration of interest