Missing heritability: the dark matter of the genome
Many rare alleles
Looking in the wrong place
Looking but not seeing
Needles in a haystack
Replicating and verifying associations
The genetic architecture of quantitative traits in plants
Association mapping is rapidly becoming the main method for dissecting the genetic architecture of complex traits in plants. Currently most association mapping studies in plants are performed using sets of genes selected to be putative candidates for the trait of interest, but rapid developments in genomics will allow for genome-wide mapping in virtually any plant species in the near future. As the costs for genotyping are decreasing, the focus has shifted towards phenotyping. In plants, clonal replication and/or inbred lines allow for replicated phenotyping under many different environmental conditions. Reduced sequencing costs will increase the number of studies that use RNA sequencing data to perform expression quantitative trait locus (eQTL) mapping, which will increase our knowledge of how gene expression variation contributes to phenotypic variation. Current population sizes used in association mapping studies are modest and need to be greatly increased if mutations explaining less than a few per cent of the phenotypic variation are to be detected. Association mapping has started to yield insights into the genetic architecture of complex traits in plants, and future studies with greater genome coverage will help to elucidate how plants have managed to adapt to a wide variety of environmental conditions.
A fundamental goal of evolutionary biology is to understand the genetic basis of adaptation in natural populations (Orr & Coyne, 1992) and it is therefore rather surprising that we still know relatively little about the genetic architecture of many adaptive traits (Mackay et al., 2009). The primary reason for this is that phenotypic variation in most adaptive traits in natural populations is caused by the action of many genes, each having only a small to moderate effect on the phenotype. Although Fisher (1918) showed that the underlying principles of Mendelian inheritance also explain segregating variation in quantitative traits, quantitative genetics has traditionally been used to partition phenotypic variation within and among individuals with known degrees of relatedness. However, as a result of the small contribution of individual genes, the methodology and technology required to dissect the genetic architecture of quantitative traits down to individual causal loci have long eluded us.
A route to deeper understanding of the genetic basis of quantitative traits opened with the introduction of quantitative trait locus (QTL) mapping in the 1980s (Lander & Botstein, 1989). In QTL mapping, designed crosses are used to dissect quantitative variation that distinguishes the individuals making up the parental generation of the cross. Progress in using QTL mapping was initially limited to a few model organisms because of the lack of available genetic markers (Mackay et al., 2009). However, with the advent of technologies that allowed for rapid and cost-effective genotyping in almost any organism, QTL mapping has proved to be extremely useful for identifying many genomic regions that influence complex traits in a large number of species (Mauricio, 2001; Doerge, 2002; Mackay et al., 2009). But QTL mapping has a number of drawbacks; for instance, genetic variation in the mapping population is usually quite restricted with only two parents used to initiate the QTL mapping population. Moreover, because a QTL mapping population usually consists of early-generation crosses (usually F1 or F2), the number of recombination events per chromosome is small, which in turn limits the resolution of the genetic map. Finally, in many organisms the generation of mapping populations through controlled crosses is either time-consuming or not even possible, further restricting the utility of QTL mapping. Furthermore, when identified in a mapping population consisting of a few hundred individuals, a typical QTL region can span anywhere from a few to tens of centiMorgans, corresponding to genomic regions that encompass several megabases and typically contain hundreds or even thousands of genes (see Rae et al., 2009 for such an example). Even when a QTL of large effect is identified, tracking down the causal gene is a tedious and time-consuming task.
In addition, a single large-effect QTL often breaks down into multiple, closely linked QTLs of smaller, and sometimes opposite, effects on the phenotype (Doerge, 2002; Mackay et al., 2009).
The wealth of molecular markers developed over the last decade has opened up the possibility to directly study statistical associations (linkage disequilibrium, LD) between genetic markers and adaptive traits in natural populations, so-called association genetics (Nordborg & Weigel, 2008). Using natural populations eliminates many of the drawbacks of traditional, pedigree-based, QTL mapping such as limited sample sizes, low variation and a lack of recombination within pedigrees. An association mapping study utilizes variation segregating in a diverse germplasm and therefore does not suffer from the lack of variation that characterizes many QTL mapping populations. In addition, because recombination events that have occurred throughout the entire evolutionary history of the mapping population are used to delineate linkage blocks that associate with the traits of interest, genomic regions identified using association mapping are usually substantially smaller than those identified in a traditional QTL mapping population, allowing fine-scale mapping (Nordborg & Tavaré, 2002; Nordborg & Weigel, 2008). However, because LD usually extends over much shorter distances in association mapping populations, a substantially greater number of genetic markers are needed to ensure adequate coverage to detect linkage between markers and a causal locus. As the cost of genotyping has dropped dramatically, association mapping has rapidly come into focus as a very promising approach for the genetic dissection of complex traits in plants.
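As a minimal illustration of the LD statistic referred to above, the following sketch computes the squared correlation (r²) between two biallelic loci from one haplotype frequency and the two allele frequencies. The function and argument names are illustrative assumptions and are not taken from any of the cited studies.

```python
def ld_r2(p_ab, p_a, p_b):
    """Squared correlation (r^2) between two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    # D is the deviation of the haplotype frequency from its expectation
    # under independence of the two loci.
    d = p_ab - p_a * p_b
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom


# Complete association: allele A always occurs together with allele B.
print(ld_r2(0.5, 0.5, 0.5))
# No association: the haplotype frequency equals the product of allele frequencies.
print(ld_r2(0.25, 0.5, 0.5))
```

An r² close to 1 means that a genotyped marker is a good proxy for an untyped causal site, which is why the physical extent of LD largely determines how many markers a study needs.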
A major driving force behind the rise of association genetics as a method for complex trait dissection is the rapid development in DNA sequencing and genotyping technologies that has occurred over the last decade. While a few hundred markers were usually sufficient for traditional QTL mapping experiments, genome-wide association studies (GWAS) typically require hundreds of thousands, or even millions, of genetic markers to achieve sufficient coverage (Nordborg & Weigel, 2008). The development of next-generation sequencing technologies provides unprecedented genotyping capabilities, even in nonmodel organisms (Gilad et al., 2009; Simon et al., 2009; Varshney et al., 2009). The high throughput of next-generation sequencing machines stems from a change in methodology compared with the traditional Sanger-sequencing method, which produced read lengths of up to 1 kb from individual DNA clones. Current state-of-the-art second-generation (or ‘next’ generation) sequencing technologies produce read lengths ranging from 30 to 400 bp (although lengths are rapidly increasing) from single DNA molecules arrayed and subsequently PCR-amplified on beads, in wells or immobilized on solid surfaces (Varshney et al., 2009). Current second-generation sequencing technologies routinely assay anywhere from hundreds of thousands to tens of millions of DNA molecules in parallel, with throughput rapidly increasing and with constant improvements to reduce error rates and increase read lengths (Varshney et al., 2009). The next major breakthrough in sequencing technology (third-generation sequencing) has been the development of single molecule sequencing (SMS) that will reduce the need for extensive template preparation and library construction required by all second-generation technologies. 
Complete or draft genome sequences are now available for a large number of plant species, such as Arabidopsis, rice, sorghum, Populus, grape, soybean and Medicago (Arabidopsis Genome Initiative, 2000; International Rice Genome Sequencing Project, 2005; Tuskan et al., 2006; Huang et al., 2009; Paterson et al., 2009; Schnable et al., 2009; Varshney et al., 2009; Schmutz et al., 2010), and the number of species with completely sequenced genomes is expected to rapidly increase in the near future as next-generation sequencing technologies are put to use for de novo genome sequencing (Imelfort & Edwards, 2009).
Another aspect of genotyping that is constantly improving with the development of next-generation sequencing technologies is the genotyping error rate. While current SNP scoring methods are quite robust, the error rate in genotyping can vary significantly between individual SNPs even when scored in a single assay. This is clearly relevant for mapping purposes, since even low error rates (c. 3% or less) are known to have dramatic consequences for the accuracy of estimates of LD and hence also for association mapping (Akey et al., 2001).
An important issue to consider when selecting SNPs for inclusion in an association study is how the selection process itself can potentially bias the results, a process known as ascertainment bias. Ascertainment bias is usually attributed to the process of identifying and selecting SNPs for further use in an association study, most often the result of small SNP discovery panels that will undersample low-frequency mutations. There will thus be a bias towards SNPs occurring at intermediate frequencies (Clark et al., 2005). Ascertainment bias introduces an oversampling of mutations at intermediate frequencies, resulting in amounts of LD that are lower than would be found if SNPs had been selected completely at random. With respect to the power of an association study, the effect of ascertainment bias is more complex, and depends on, among other things, whether low- or intermediate-frequency SNPs are assumed to have larger effect on the phenotypic trait in question (Clark et al., 2005; Manolio et al., 2009).
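The undersampling of low-frequency mutations by small discovery panels can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that a SNP is 'discovered' whenever both alleles appear at least once among the chromosomes of the discovery panel and that chromosomes are sampled independently; the function name is hypothetical.

```python
def discovery_probability(minor_allele_freq, n_chromosomes):
    """Probability that a SNP appears polymorphic in a discovery panel.

    The SNP is missed if all sampled chromosomes carry the major allele
    or all carry the minor allele.
    """
    p = minor_allele_freq
    all_major = (1.0 - p) ** n_chromosomes
    all_minor = p ** n_chromosomes
    return 1.0 - all_major - all_minor


# Compare a rare and an intermediate-frequency SNP in a panel of 8 chromosomes.
for maf in (0.01, 0.05, 0.30):
    print(maf, round(discovery_probability(maf, 8), 3))
```

Under these assumptions a SNP at intermediate frequency is far more likely to enter the marker set than a rare one, which is the source of the frequency bias described above.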
With genotyping costs rapidly declining, an increasing proportion of the budget of most association genetics studies will be devoted to phenotyping traits of interest. While the importance of accurate identification and scoring of genotypes has received a good deal of attention in the literature (Akey et al., 2001; Sobel et al., 2002; Clark et al., 2005), the effects of phenotyping on the power and performance of association genetics studies have not been evaluated in great detail (Myles et al., 2009). This lack of attention to phenotyping is puzzling, as increasing the number of individuals for which phenotype information is obtained has a much greater effect on the power of an association study than increasing the number of SNPs used (Long & Langley, 1999; Myles et al., 2009).
Since typical association genetic studies usually involve a relatively large number of accessions, phenotyping traits with high accuracy and precision can be both costly and time-consuming compared with genotyping. One benefit of working with plants, however, is the ability to replicate individual genotypes, both within and across multiple environments. Replication of genetically identical individuals, either by using inbred strains or through vegetative propagation, allows for far more precise phenotyping as the amount of environmentally induced variation can be estimated and partitioned out in association analyses. Data from replicates of each accession can then be used to estimate a ‘mean’ phenotype of the accession that is less biased by environmental effects or by measurement errors, or all data points can be used in the association study directly. Stich et al. (2008) referred to these two approaches as two-stage and one-stage association mapping designs, respectively. One example of a two-stage approach is the estimation of breeding values from either clonal replicates of a single genotype or from full or half-sib offspring, a common practice in quantitative genetics and breeding (Lynch & Walsh, 1998). Once breeding values have been estimated, they can be included as dependent traits in an association study and linked to genetic variation segregating at SNPs distributed across the genome. Stich et al. (2008) found broad agreement between results based on one and two-stage designs, but other studies have found large differences in power between the two approaches (Kang et al., 2008) and this is an issue that remains to be investigated in greater detail.
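The two-stage design can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: stage one uses a plain arithmetic mean per accession (not a breeding value or a mixed-model estimate), stage two is an ordinary least-squares regression on allele count, and no correction for population structure is attempted. All names and data values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def accession_means(records):
    """Stage 1: average replicate measurements for each accession.

    records: iterable of (accession_id, phenotype_value) pairs.
    """
    by_accession = defaultdict(list)
    for accession, value in records:
        by_accession[accession].append(value)
    return {acc: mean(values) for acc, values in by_accession.items()}


def snp_effect(means, genotypes):
    """Stage 2: least-squares slope of mean phenotype on allele count."""
    accessions = sorted(means)
    x = [genotypes[a] for a in accessions]
    y = [means[a] for a in accessions]
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical replicated measurements and inbred-line genotypes (0 or 2 copies).
records = [("A", 10.0), ("A", 12.0), ("B", 14.0), ("B", 16.0), ("C", 11.0), ("D", 15.0)]
genotypes = {"A": 0, "B": 2, "C": 0, "D": 2}
print(snp_effect(accession_means(records), genotypes))
```

A one-stage analysis would instead fit all individual measurements in a single model, with accession and environment included as model terms.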
Replication across multiple environments also allows for the estimation of genotype × environment interactions (G × E), a topic that has largely been ignored thus far in most association genetics studies. For instance, many of the loci that have been implicated in the control of flowering time in Arabidopsis thaliana under glasshouse conditions (Atwell et al., 2010) cannot be replicated when plants are grown in the field under more realistic conditions (Brachi et al., 2010). This is surprising because glasshouse and growth chamber studies yield flowering time data that are highly correlated (Atwell et al., 2010). It does, however, suggest strong G × E for the loci controlling flowering time, at least under more normal growth conditions.
IV. Study designs
A major focus in association genetics thus far has been the choice between a candidate gene-based approach and whole-genome approaches (GWAS). The critical deciding factor is the extent of LD, as this determines the mapping resolution that can be achieved and also the number of markers needed to ensure an adequate coverage of the genome in a GWAS study (Nordborg & Tavaré, 2002). In species with LD extending over large physical distances, a relatively modest number of markers are needed to ensure adequate genome coverage. For example, in predominantly selfing plants, LD can extend for tens or even hundreds of kb. For instance, in A. thaliana, GWAS is possible using roughly 140 000 tag SNPs (representative SNPs in a genome region of high LD). This corresponds to roughly one marker per kb across the A. thaliana genome (Clark et al., 2007). On the other hand, in many predominantly or obligate outcrossing species such as maize and many forest trees, LD extends at most a few hundred bp. For these species, several million SNPs would have to be genotyped to ensure adequate genome-wide coverage.
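The relationship between the extent of LD and the number of markers required is simple arithmetic, sketched below. The genome sizes used are assumptions made for illustration: 140 Mb is the size implied by the figures quoted above (140 000 tag SNPs at roughly one per kb), and the 500 Mb outcrossing genome with one marker per 200 bp is entirely hypothetical.

```python
def markers_needed(genome_size_bp, marker_spacing_bp):
    """Number of evenly spaced markers needed to cover a genome.

    marker_spacing_bp should be no larger than the distance over which
    LD remains useful for detecting a causal site.
    """
    if marker_spacing_bp <= 0:
        raise ValueError("marker spacing must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-genome_size_bp // marker_spacing_bp)


# Selfing species: one marker per kb over an assumed 140 Mb genome.
print(markers_needed(140_000_000, 1_000))
# Hypothetical outcrossing species: 500 Mb genome, one marker per 200 bp.
print(markers_needed(500_000_000, 200))
```

Even spacing is a simplification, because LD varies along the genome and tag SNPs are chosen from observed LD blocks rather than placed at fixed intervals.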
A recently developed approach that combines aspects of QTL mapping and association mapping is that of nested association mapping (NAM) developed by Yu et al. (2008). In NAM, recombinant inbred lines (RILs) are created from a diverse set of parents. By combining aspects of QTL and association mapping, NAM takes advantage of both historic and recent recombination events and thus requires lower numbers of markers than GWAS while still having a substantially higher mapping resolution than traditional QTL mapping (Yu et al., 2008). NAM was developed specifically with maize in mind because of the excessive number of SNP markers that would be needed for true GWAS mapping and because of the confounding effects of population structure in different maize breeds (McMullen et al., 2009).
For species where GWAS is currently not feasible and where the development of RILs is not possible, one alternative approach is to perform association studies based on the more or less complete gene space of an organism. Much can be accomplished by focusing on the gene space of plant species where complete genome sequences are lacking, and generating such data is neither expensive nor technically challenging today. A gene space study involves association mapping using SNP data generated from transcribed genes, and while gene space studies leave the large, nongene coding regions of a genome unstudied, these regions are available for later study when technology develops and sequencing costs decrease even further.
Finally, the most focused, but also the most limited, approach for association mapping is candidate gene-based association studies. Candidate-gene association mapping is by definition more hypothesis-driven than a GWAS study because mapping is restricted to genes thought to be good candidates for controlling the trait of interest (Neale & Savolainen, 2004; Hall et al., 2010). Although the selection of candidate genes is not always straightforward, genes are usually selected based on information obtained from, for instance, genetic, biochemical, or physiology studies in both model and nonmodel plant species. Therefore, candidate-gene selection is usually facilitated when restricted to well-characterized developmental pathways, such as the flowering pathway (Aranzana et al., 2005; Shindo et al., 2005; Skot et al., 2007; Ehrenreich et al., 2009) or to traits with a well-understood biochemical basis, such as the starch synthesis pathway (Wilson et al., 2004; Tian et al., 2009) or the lignin biosynthesis pathway (Gonzalez-Martinez et al., 2007). Candidate-gene studies are far less demanding in terms of the number of markers required and a growing number of studies have used candidate-gene approaches to study the genetic architecture of adaptive traits in plants. However, it is vital to remember that a candidate-gene approach is inherently limited by the choice of candidate genes used, and candidate-gene association studies will always fail to identify causal mutations that are located in genes not selected as candidates. Moreover, candidate genes are often discovered from loss-of-function mutations in inbred laboratory strains and it is not clear how well such mutations describe the variation that actually underlies quantitative trait variation in natural populations (Nordborg & Weigel, 2008).
Despite these caveats, a number of recent studies in a wide variety of plant species have shown that the candidate-gene approach can be quite successful in identifying genotype–phenotype associations for a large number of different traits, including morphology, phenology, growth and resistance traits (Wilson et al., 2004; Ehrenreich et al., 2007, 2009; Gonzalez-Martinez et al., 2007; Ingvarsson et al., 2008; Eckert et al., 2009).
V. The genetics of the ‘omics’
Traditionally, phenotypes have largely been synonymous with morphological traits and, when feasible, physiological traits. This is perhaps not surprising, at least in the case of plants, as it is these traits that are the targets of breeding programs and there was much hope invested in the use of QTL mapping for marker-assisted selection. The way in which phenotype is considered was dramatically altered by the introduction of the ‘genetical genomics’ concept, a term attributed to Jansen & Nap (2001), who proposed that gene expression values (genomics) should be considered as any other phenotypic trait. This, of course, makes perfect sense, as phenotype is any derived characteristic of the underlying genotype. As a result, the idea of mapping QTL for gene expression values (expression quantitative trait loci, eQTLs) has received much attention (Kirst et al., 2004, 2005; Keurentjes et al., 2007; West et al., 2007; Drost et al., 2010). Consequently, an inevitable extension was made to include other cellular-level phenotypes, such as metabolite and protein amounts. Particularly in the case of genomics, rapid advances in high-throughput quantification methods, such as microarrays, facilitated such approaches at a genome-wide level for entire mapping populations.
The application of genetical genomics is as valid for association mapping as it is for linkage mapping. There have been a small number of linkage mapping eQTL studies published in plant species, including A. thaliana (Keurentjes et al., 2007; West et al., 2007), eucalyptus (Kirst et al., 2004, 2005) and hybrid poplar (Drost et al., 2010), which have offered an insight into the genetic architecture of gene expression control. In plants, as in animal models, the emerging picture has been that the majority of genes have associated eQTLs and that cis mapping QTLs are of a generally larger effect size than trans QTLs. In both A. thaliana and Populus there are also trans loci ‘hotspots’ affecting surprisingly large numbers of genes, something yet to be explained. There is also emerging evidence that gene ‘essentiality’ and cross-tissue expression conservation can affect the genetic architecture of expression control (Drost et al., 2010; Emerson et al., 2010; McManus et al., 2010). A major limitation of eQTL mapping, especially in outbreeding species, is the limited mapping resolution available. This makes it particularly hard to ascertain whether an eQTL maps in cis or trans, with this definition often being limited to whether the eQTL maps to the same chromosome as the gene. This is certainly a source of bias that will influence the relative number of cis and trans QTLs mapped and is something that is likely to explain some of the variation in this ratio that exists between different species examined to date. In many cases, linkage blocks will be so large that a trans acting factor may actually cosegregate with its trans target, leading to misidentification as a cis eQTL. Although in some cases larger population sizes or combined population approaches can improve the situation significantly, such as the population design described previously for maize, at least in outbreeding tree species resolution will always remain limiting.
In these cases use of association mapping populations can offer at least gene-specific mapping resolution.
Second-generation sequencing methods used for RNA sequencing (RNA-Seq) have appeal in this context as they simultaneously provide the quantitative expression data required for phenotypic association mapping and the SNPs required to perform association analysis. Information on allele-specific expression can also be obtained, albeit with the caveat that allele-specific expression will confound lack of genetic variation with lack of expression variation. Thus, to completely dissect the effects of allelic variation in gene expression, nuclear DNA must be used for genotyping.
The cost of RNA-Seq is rapidly falling and it is now feasible to perform population-wide expression profiling using either the latest Illumina or SOLiD sequencing platforms or, in the near future, additional technologies such as that from IonTorrent. Today, per sample RNA-Seq prices are in the region of $450, making the method comparable to whole-genome oligo arrays but with the distinct and significant advantage that genotyping is also included in the cost. It is an appealing proposition to be able to quantify expressed genes, identify a large number of protein-coding SNPs and potentially detect high-abundance splice variants using a single assay.
Association-based eQTL mapping is increasingly used in human studies on relatively small sample sizes (typically in the region of 100–200 individuals) and recent studies present utilization of RNA-Seq to perform eQTL mapping (McManus et al., 2010; Montgomery et al., 2010; Pickrell et al., 2010). An area of immediate attention for plant scientists should be modeling of population sizes required for association eQTL mapping.
VI. Missing heritability: the dark matter of the genome
In human genetics there has been much discussion surrounding the topic of so-called ‘missing heritability’ (Manolio et al., 2009) and it has been pointed out that this missing heritability provides biologists with their own equivalent of a dark matter hunt: we know it’s there because we can observe its influence, but to date the ability to identify its cause eludes us. As such, explaining missing, or more correctly unexplained, heritability components is possibly the largest challenge in genetics currently and the perceived failure of GWAS studies to identify the missing components has resulted in something of a backlash against the use of GWAS approaches. As far as humans are concerned, most studies focus on identifying new genes associated with disease or disorder states, such as autism, obesity, cancer survival, heart disease and diabetes risk (Frayling, 2007; Erdmann et al., 2009; Ma et al., 2009; Simon-Sanchez et al., 2009; Yang et al., 2009; Bolton et al., 2010; or for a comprehensive list see http://www.genome.gov/gwastudies/). For any such case it should not have been unexpected that the ‘common disease, common variant’ hypothesis would not hold true as all such diseases should be expected to be associated with negative selection which, by its very nature, would remove common variants from the population. However, the same cannot be said to hold for height, a trait with high heritability (in the region of 80%) and for which > 40 variants have been identified, explaining a rather uninspiring 5% of phenotypic variation (Visscher, 2008). A source of frequent pessimism is that those 40 loci represent the low-hanging fruit, something that would suggest a potentially huge number of loci contributing to phenotypic variation for height, with each of those loci having an almost vanishingly small effect on phenotype (Goldstein, 2010). Evidence to justify and indulge such pessimism is, however, currently lacking.
There are a number of alternatives remaining to explain the 75% missing heritability (see, for instance, Yang et al., 2010) and it would be sensible to design future studies with these in mind.
VII. Gene interactions
Genomics has provided stark affirmation that interactions among genes abound: gene networks and transcriptional modules have been identified and explored in all phenotypic traits, developmental processes, stress responses and biotic interactions examined across all species (Alcázar et al., 2009; Gutierrez-Gonzalez et al., 2010; He et al., 2010; Kliebenstein, 2010; Lee et al., 2010a,b). The complex interaction structure of gene expression networks suggests that almost no gene can be considered in isolation and that variation in the expression of the majority of genes will exert at least some degree of influence on a number of other genes. Indeed, this is affirmed by the results of eQTL studies, in which the majority of genes have mapped eQTL. However, these eQTLs represent only a fraction of all true QTLs, because of the low power of most current eQTL studies. Following this train of thought, it is evident that a clear limitation of current association studies is the inability to detect even limited degrees of epistasis. New algorithms for detecting epistatic interactions are under development (Zhang et al., 2010) but this is an area of research that deserves more attention. Noyes et al. (2010) provide an insightful example of the number of cis and trans acting loci, their contrasting effect sizes and the sensitivity of trans loci. Their results also caution against any assumption that loci identified in linkage populations will remain relevant in natural populations.
VIII. Many rare alleles
In the field of human genetics and disease susceptibility there is currently a migration from the previous ‘common disease, common variant’ view to one of ‘common disease, many rare variants’. As mentioned earlier, it was somewhat illogical to have assumed that common variants would have accounted for variation in disease susceptibility as these would have been exposed to generations of negative selection pressure. However, it was exactly this type of variant that the HapMap project was set up to detect. The lack of variants identified using GWAS studies in the HapMap population should not disappoint; it is rather reassuring that evolutionary theory held true. Faced with increasing evidence of such negative results, attention is now shifting to the view that common diseases derive from numerous rare alleles (McClellan & King, 2010) and there is evidence to support such a view (Walsh & King, 2007; McClellan & King, 2010). This issue has not yet received similar attention in plants. It is expected that the situation may be somewhat different as plant research focuses on adaptive traits rather than diseases. For example, Atwell et al. (2010) performed a GWAS study on 107 phenotypes in A. thaliana and found that many adaptive traits (such as pathogen resistance and flowering time) were controlled by alleles segregating at appreciable frequencies, suggesting that the genetic architecture for these traits differs appreciably from the genetic basis of most human diseases. The actual phenotypic trait of interest is therefore another area of research deserving careful consideration when designing GWAS studies.
IX. Looking in the wrong place
One result of human GWAS studies that has surprised many is that the majority of significant associations detected to date do not lie in protein coding regions, although there is significant enrichment of SNPs in such regions compared with random expectation (Hardy & Singleton, 2009). Similarly, in plants, the majority of mutations that have been identified as being associated with genetic variation in quantitative traits are not associated with changes in the amino acid composition of proteins. Only c. 15% (27 of 177 associations; see Section XIV) of positive associations in plants are nonsynonymous, whereas almost half (80 of 177) of the associations involve noncoding mutations located in introns, untranslated regions or intergenic regions. It is possible that many of these mutations show up as significant associations because they are in LD with untyped causal mutations that in turn are nonsynonymous mutations. However, in most cases our current knowledge of genome function is too limited to assign biological roles to these polymorphisms, although it should be made clear that this does not render them uninteresting or suggest that they lack biological significance. Current GWAS studies, particularly in plants, are often designed to identify nonsynonymous SNPs within protein coding regions. However, it is increasingly clear that gene expression variation plays an important role in controlling natural variation (Gilad et al., 2008). From an evolutionary perspective, such a finding is logical as it is far less likely that changes in expression domain, degree, timing or response will be severely deleterious (Carroll, 2008). There are few empirical data in this area, although studies such as Kasowski et al. (2010) clearly show how SNPs in regulatory elements can result in significant changes in expression levels between individuals.
A significant challenge in the coming years will be ascribing biological roles to regulatory elements, understanding how sequence variation in those elements affects gene expression, and exploring the sequence and functional conservation of such elements.
Recent results from second-generation sequencing studies have revealed new layers of unexpected genome complexity. For example, long noncoding RNAs of as yet unidentified function, highly complex populations of short RNAs, including microRNAs, and natural antisense transcripts have been identified. Both long noncoding RNAs and short RNAs can trigger and control epigenetic changes. In addition to not looking in the right place for causal variation, it is also the case that many currently assayed SNPs may simply not be in linkage with causal SNPs. The degree to which this is likely to be an issue depends on the linkage structure of the genome of interest. In many cases we assume that current linkage estimates hold true across the whole genome, but in reality we typically know little about linkage structure in nonprotein-coding regions. As such, it would be ideal to design GWAS studies to expect the unexpected, or rather to have no assumptions about where causal polymorphisms lie, which strongly advocates whole-genome approaches.
Finally, structural variations, such as copy number variation (CNV), have also recently received a great deal of attention in the human genetics community and are now commonly included in association genetic studies (McCarroll & Altshuler, 2007). Little information on CNVs exists in plant species to date, but this is certainly another area that will gain significantly from the application of high-throughput genomic sequencing.
X. Looking but not seeing
It has become something of a cliché to cite the example of human height in relation to GWAS studies, with some using the example to argue against future investment in additional GWAS projects. Yang et al. (2010) provide evidence that the cliché may have had its day as they show that the inability to detect loci explaining height is largely a limitation of current study design and statistical methodology. The results presented show that, rather than the typically cited 5% explained variance, current SNP data actually explain 45% of the phenotypic variance, albeit still with each SNP explaining small percentages of that variance. These results suggest that genetic variation for traits that are not under strong selection can readily be estimated using SNPs distributed across the genome in populations of only a few thousand individuals. By contrast, the results suggest that to identify individual loci contributing to such traits, extremely large population sizes are required.
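The contrast between estimating total variance and detecting individual loci can be illustrated with a rough power calculation. The sketch below is not the method of Yang et al. (2010); it assumes a single-SNP test with one degree of freedom, a normal approximation, and an illustrative genome-wide threshold of 5e-8. The function names and default values are hypothetical choices made for this example.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def detection_power(n, var_explained, alpha=5e-8):
    """Approximate power to detect a locus explaining `var_explained`
    (a fraction between 0 and 1) of the phenotypic variance in `n` individuals.
    Assumption: two-sided 1-df test, normal approximation."""
    # Expected test statistic on the z scale (square root of the non-centrality).
    z_effect = math.sqrt(n * var_explained / (1.0 - var_explained))
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return STD_NORMAL.cdf(z_effect - z_crit) + STD_NORMAL.cdf(-z_effect - z_crit)


def required_sample_size(var_explained, power=0.8, alpha=5e-8):
    """Approximate n needed to reach `power`, ignoring the negligible opposite tail."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_power = STD_NORMAL.inv_cdf(power)
    return math.ceil((z_crit + z_power) ** 2 * (1.0 - var_explained) / var_explained)


# Sample sizes needed for loci explaining 5%, 1% and 0.1% of the variance.
for fraction in (0.05, 0.01, 0.001):
    print(fraction, required_sample_size(fraction))
```

Under these assumptions the required population size scales roughly inversely with the variance explained, so a locus explaining 0.1% of the variance needs about ten times as many individuals as one explaining 1%.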
Related to the theme of looking but not seeing is the quality of phenotype data discussed earlier. Association mapping relies on the ability to assay the phenotype of individuals within a population with high accuracy and repeatability. If measurement error between individuals approaches the amount of true phenotypic variance between individuals, association detection will be severely affected. Both technical and biological factors can influence the ability to measure phenotype accurately and careful attention must be paid to both. Although not truly a source of inaccuracy, G × E interaction, or phenotypic plasticity, can also represent a considerable source of phenotypic measurement variance at the level of clones or inbred lines.
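The effect of measurement error can be made concrete with a small calculation. The sketch below assumes a simple additive model in which each observed value is the true genotypic value plus independent error, and phenotypes are averaged over clonal or inbred-line replicates. The function and parameter names are hypothetical.

```python
def observed_fraction_explained(true_fraction, true_var, error_var, n_replicates=1):
    """Fraction of the variance among line means explained by a locus.

    true_fraction: fraction of the true (error-free) variance explained by the locus
    true_var:      true phenotypic variance between individuals or lines
    error_var:     measurement-error variance of a single observation
    n_replicates:  number of replicate measurements averaged per line
    """
    if n_replicates < 1:
        raise ValueError("n_replicates must be at least 1")
    # Averaging over replicates shrinks the error variance by 1 / n_replicates.
    var_of_means = true_var + error_var / n_replicates
    return true_fraction * true_var / var_of_means


# Error variance equal to the true variance, with increasing replication.
for replicates in (1, 2, 4, 8):
    print(replicates, observed_fraction_explained(0.05, 1.0, 1.0, replicates))
```

When the error variance equals the true variance, a single measurement halves the apparent contribution of the locus, whereas replication recovers part of it. This is one reason clonal replication is valuable in plants.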
XI. Needles in a haystack
The situation of potentially huge numbers of SNPs contributing to trait variation represents a proverbial ‘needle in the haystack’ challenge. It is infeasible that every SNP, or even every gene harboring significant SNPs, can be functionally characterized, and as a result approaches are needed to help focus attention on a manageable number of target loci. Integration of multiple sources of genome-wide evidence, which has been termed ‘systems genetics’ (Lucioni et al., 2010), holds much potential and the approach has the added advantage of bringing together many of the ideas outlined earlier relating to missing heritability. Gene expression networks and transcriptional modules contain information on expression-level interactions, and constructing expression networks from population expression data may help identify interesting target genes (Lee et al., 2010a). Expression data can be used in a pseudo-bulk-segregant analysis approach whereby individuals at either end of a phenotypic distribution are assayed and significant differences in expression between the two groups are identified. If individuals at either tail of the distribution are fixed for causal alleles affecting expression, these should be identified when significant expression differences are overlaid on top of genotype data. Even if a common allelic variant is not fixed within each group, such studies are useful to identify whether shared expression variation may be accounting for phenotypic variation. Naukkarinen et al. (2010) used a similar approach where monozygotic twin pairs of contrasting body mass index were assayed to provide expression data that were examined in relation to GWAS results. In suitable plant species, similar approaches using recombinant inbred lines or near-isogenic lines could prove beneficial.
Equally, clonal replication and comparison of expression differences resulting from phenotypic plasticity may prove insightful when viewed in relation to whole-genome association and eQTL results. As causal variation in expression can be of low magnitude (Alimonti et al., 2010), such approaches may succeed where population correlation analysis would fail to achieve statistical significance.
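The pseudo-bulk-segregant idea described above can be sketched as follows. This is a minimal illustration, not a published pipeline: it assumes expression values are already normalized, uses a Welch t statistic as the measure of difference between tails, and applies no multiple-testing correction. The function name, the `tail_fraction` parameter and the toy data are hypothetical.

```python
import math
from statistics import mean, variance


def tail_expression_contrast(phenotypes, expression, tail_fraction=0.2):
    """Rank genes by the Welch t statistic between the two phenotypic tails.

    phenotypes: list of trait values, one per individual
    expression: dict mapping gene name -> list of expression values (same order)
    """
    n = len(phenotypes)
    k = int(n * tail_fraction)
    if k < 2 or 2 * k > n:
        raise ValueError("each tail needs at least 2 individuals and tails must not overlap")
    order = sorted(range(n), key=lambda i: phenotypes[i])
    low, high = order[:k], order[-k:]
    ranked = []
    for gene, values in expression.items():
        lo = [values[i] for i in low]
        hi = [values[i] for i in high]
        diff = mean(hi) - mean(lo)
        se = math.sqrt(variance(lo) / k + variance(hi) / k)
        if se > 0:
            t = diff / se
        else:
            # No within-tail variation: the statistic is undefined, treat as 0 or infinite.
            t = 0.0 if diff == 0 else math.copysign(math.inf, diff)
        ranked.append((gene, t))
    # Largest absolute difference between tails first.
    return sorted(ranked, key=lambda item: abs(item[1]), reverse=True)


# Toy data: geneA tracks the phenotype, geneB does not.
toy_phenotypes = [float(i) for i in range(20)]
toy_expression = {
    "geneA": [0.5 * p + 0.1 * (i % 3) for i, p in enumerate(toy_phenotypes)],
    "geneB": [1.0 + 0.2 * (i % 2) for i in range(20)],
}
print(tail_expression_contrast(toy_phenotypes, toy_expression))
```

Genes at the top of the ranking would then be candidates for overlaying on genotype or eQTL data.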
Combinatorial approaches need not make use of expression data collected from the individuals in which association mapping will be performed. Rather, association data can be overlaid on to a network structure constructed from expression data concerning a phenotype of interest. For example, Baranzini et al. (2009) combined a human protein interaction network and gene-wise SNP P-values to identify novel pathways potentially involved in susceptibility to multiple sclerosis. Nicolae et al. (2010) showed that overlaying ‘phenotypic trait’ (i.e. not expression) and eQTL data can help to identify the most informative SNPs, and discussed the general approach while accounting for local LD structure. For plant researchers, a major limitation to the use of such approaches is the severe lack of genome-wide expression data for any species other than A. thaliana. The falling cost of RNA-Seq will help overcome this barrier as it removes the prerequisite stage of designing whole-genome oligo arrays, opening up genome expression studies to all species. In the not too distant future, comparative analysis of such results will ascertain how commonly variation in the same phenotypic traits is controlled by polymorphisms at orthologous loci or through alterations in conserved gene coexpression modules.
XII. Confounding effects
A significant contributing factor for incurring false positives in association mapping studies is confounding by unmeasured variables. The most commonly discussed factor causing confounding in association mapping studies is population structure, a problem that has been pointed out repeatedly in the literature (Yu et al., 2006; Zhao et al., 2007). The problem of population structure arises whenever phenotypic traits are correlated with the underlying population structure at noncausal loci. In such cases, even loci that are unrelated to the trait will show varying degrees of association because of the confounding effects of population structure. However, population structure is neither a necessary nor a sufficient condition for confounding to occur in GWAS studies. As outlined by Atwell et al. (2010), when dealing with complex traits, the problem of confounding is better thought of as model mis-specification. GWA analyses are usually performed using a single SNP at a time and this implicitly assumes that a multifactorial trait is treated as being caused by a single locus, and all polygenic background variation is effectively ignored in the analysis. From a statistical point of view, any confounding effect, such as that caused by population structure, yields an inflated number of false-positive results. On the other hand, controlling for population structure will inevitably result in an increase in the number of false negatives, that is, true associations that go undetected because their pattern of variation coincides with patterns of population structure (see Atwell et al., 2010 for several such examples). This is really an unsolvable problem, in a statistical sense, and highlights the need for other methods, such as transgenic experiments or controlled crosses, to complement association mapping studies.
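The way population structure generates spurious associations can be shown with a small simulation. The sketch below is a deliberately simplified stand-in for the mixed-model corrections cited above: it assumes two known subpopulations, a SNP with no effect on the trait, and a crude adjustment that centers genotype and phenotype within each subpopulation. All names and parameter values are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def center_within(values, groups):
    """Subtract each group's mean from the values of its members."""
    sums, counts = {}, {}
    for v, g in zip(values, groups):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def simulate_structure(n_per_pop=500, seed=1):
    """Return (naive, adjusted) genotype-phenotype correlations for a NON-causal SNP."""
    rng = random.Random(seed)
    pops, genotypes, phenotypes = [], [], []
    # Per subpopulation: (allele frequency at the SNP, phenotypic mean).
    for pop, (freq, pheno_mean) in enumerate([(0.2, 0.0), (0.8, 1.0)]):
        for _ in range(n_per_pop):
            pops.append(pop)
            genotypes.append(sum(rng.random() < freq for _ in range(2)))
            # The phenotype depends on subpopulation only, never on the genotype.
            phenotypes.append(rng.gauss(pheno_mean, 1.0))
    naive = pearson(genotypes, phenotypes)
    adjusted = pearson(center_within(genotypes, pops), center_within(phenotypes, pops))
    return naive, adjusted


naive_r, adjusted_r = simulate_structure()
print("naive r:", naive_r, "adjusted r:", adjusted_r)
```

The naive correlation is clearly positive even though the SNP is not causal, and it largely disappears once structure is accounted for. The same adjustment would also remove the signal of a truly causal SNP whose frequency happens to track the subpopulations, which is the false-negative cost described above.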
XIII. Replicating and verifying associations
Once an association between a particular SNP and variation in a trait of interest has been established, a crucial yet too often overlooked step is to replicate the association in an independent mapping population. As the number of studies documenting significant associations between SNPs and variation in quantitative traits of interest accumulates, increasing emphasis should be placed on replicating studies to validate effects of significant associations. These issues have been strongly advocated in the human genetics community, where strict guidelines for conducting both initial and replication association genetics studies are being devised (Chanock et al., 2007). Replication of significant associations has proved crucial for separating true from false positives and for providing less biased estimates of allelic effect sizes. In fact, in the human genetics literature, a substantial fraction of all significant results are never replicated in follow-up studies, suggesting a high proportion of false positives. Failure to replicate significant associations can arise for many reasons, including poor experimental design in either the initial or the replication study, difficulties in replicating environmental effects, small sample size or lack of rigorous phenotype scoring (Chanock et al., 2007; Pearson & Manolio, 2008; Manolio et al., 2009).
So far, relatively few of the genotype–phenotype associations found in plants have been replicated and verified in independent studies. Notable exceptions include mutations in FRI and FLC that affect flowering time in A. thaliana (Michaels & Amasino, 1999; Johanson et al., 2000; Gazzani et al., 2003). The FRI and FLC genes act epistatically to delay flowering in A. thaliana in the absence of vernalization, with FLC being negatively regulated by vernalization and positively regulated by FRI. Mutations in FRI that render the protein nonfunctional remove the vernalization requirement and lead to rapid flowering (Michaels & Amasino, 1999; Johanson et al., 2000; Gazzani et al., 2003). Similarly, there is generally a positive correlation between expression of the FLC gene and flowering time, such that increased expression of FLC leads to delayed flowering, and a substantial portion of the variation in FLC gene expression can be attributed to sequence variation in FRI (Michaels & Amasino, 1999; Gazzani et al., 2003). Association mapping studies in A. thaliana have verified the large effects of mutations in FRI on both flowering time and FLC expression (Aranzana et al., 2005; Zhao et al., 2007). However, controlled crosses have shown that the effects of FRI nonsense mutations are likely overestimated in natural populations, because of the strong inbreeding and population structure in A. thaliana (Scarcelli et al., 2007). Because of allelic heterogeneity at FRIGIDA (Johanson et al., 2000), this locus is hard to identify in GWAS studies, despite the large documented effect of individual FRIGIDA alleles (Atwell et al., 2010), suggesting that a failure to replicate a positive association in subsequent studies does not necessarily imply that the initial result represents a false positive.
Another example deals with the Dwarf8 gene that has been shown to be associated with variation in flowering time in maize (Zea mays, Thornsberry et al., 2001). Because of high LD between markers in the Dwarf8 gene, the causal mutation or mutations could not be identified, but one promising candidate was a 6 bp deletion of the SH2-like domain of the Dwarf8 protein (Thornsberry et al., 2001). The association between the Dwarf8 deletion and flowering time has since been verified in several studies, some of which involved substantially larger sets of inbred lines and landraces of maize (Andersen et al., 2005; Camus-Kulandaivelu et al., 2006). Studies from several other plant species have also implicated deletions in the SH2-like domain of homologous genes for involvement in modulating variation in flowering time and plant height (Peng et al., 1999), lending further credibility to the association.
Finally, verification of genotype–phenotype associations does not necessarily have to come from replicate association studies, but can include validation of biological function through transgenic experiments and other molecular biology techniques (Koornneef et al., 2004). In fact, detailed functional characterization will likely be needed both to verify and to work out the biological details underlying many positive associations that are identified in GWAS studies.
XIV. The genetic architecture of quantitative traits in plants
Early QTL studies found that many traits contained QTLs that explained a significant fraction of variation in the phenotypes studied (see Mackay et al., 2009; Flint & Mackay, 2009 for recent reviews). At first this was a rather surprising observation, but it raised high hopes that the genetic architecture of most quantitative traits was only moderately complex, involving only a handful of loci. However, as mapping populations increased in size, which resulted in QTL mapping experiments with greater resolution, these hopes were largely shattered. It became apparent that QTL mapping experiments in small to moderate mapping populations (consisting of a few hundred individuals) were underpowered to detect QTLs with small effects and substantially overestimated the effects of large QTLs (the so-called ‘Beavis effect’; Beavis, 1994; Xu, 2003). As QTL mapping experiments increased in size and power, large-effect QTLs were usually shown to fractionate into many, closely linked QTLs with smaller effects. These smaller-effect QTLs often also have opposite effects on the trait in question, such as was shown in a study of A. thaliana where two tightly linked QTLs with opposite effects on growth rate were identified in a region that showed no evidence for harboring a QTL for growth in a traditional QTL experiment (Kroymann & Mitchell-Olds, 2005). Hopes have been raised that association mapping will circumvent many of the problems that have plagued QTL mapping experiments and provide a better picture of the genetic architecture of quantitative traits, down to individual causal mutations (so-called quantitative trait nucleotides, QTNs). However, many of the limitations that apply to QTL mapping experiments, such as limited power and overestimation of effect sizes, also apply to the identification of QTNs in association mapping studies.
Association mapping studies in plants to date have identified mutations associated with a large number of phenotypic traits. Fig. 1(a) shows the distribution of effect sizes for 267 QTL/QTN loci identified in association mapping studies from 15 different plant species. It is apparent from Fig. 1(a) that the amount of variation explained is usually low, with most associations explaining only a small percentage of the phenotypic variation. However, the distribution is highly skewed, with a few loci explaining a substantial fraction of the phenotypic variation in some traits. Many of these effect sizes are estimated from associations identified in mapping populations of relatively modest size (a few hundred individuals at most), suggesting that they are likely suffering from the same overestimation witnessed in early QTL studies (Beavis, 1994; Xu, 2003). Upon closer inspection (Fig. 1b) of the distribution of effect sizes, it is also apparent that the distribution is truncated at loci with very small effects (< 1–2%), suggesting that most current association mapping studies are generally underpowered to identify such loci. This is also evident from the relatively low combined percentage of variation that most association mapping studies can explain, usually in the range of 5–20%. As discussed previously, one possible explanation for this ‘missing’ heritability is that rare alleles (with a minor allele frequency < 5%) are usually excluded from association studies and these studies also have very limited power to detect the effects of such loci, unless their effects are very large, a finding recently confirmed in the case of human height (Yang et al., 2010). Most association mapping experiments in plants identify SNPs from sequencing a modest number of individuals (usually 20–30 individuals) and consequently low-frequency mutations are most likely excluded from most current association mapping studies.
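The overestimation of effect sizes in modest populations can be reproduced in a few lines of simulation. The sketch below is illustrative only and does not use the data behind Fig. 1: it assumes a standardized, normally distributed genotype score, a locus that truly explains 2% of the variance, and a Fisher z test at an arbitrary significance level. All names and default values are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_overestimation(n=200, true_r2=0.02, alpha=0.001, n_studies=1000, seed=7):
    """Repeat a small study many times; return (mean estimated r2 over all studies,
    mean estimated r2 over significant studies only, fraction significant)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    slope = math.sqrt(true_r2)
    resid_sd = math.sqrt(1.0 - true_r2)
    all_r2, significant_r2 = [], []
    for _ in range(n_studies):
        genotype = [rng.gauss(0.0, 1.0) for _ in range(n)]
        phenotype = [slope * g + rng.gauss(0.0, resid_sd) for g in genotype]
        r = pearson(genotype, phenotype)
        all_r2.append(r * r)
        # Fisher z test of the correlation against zero.
        if abs(math.atanh(r)) * math.sqrt(n - 3) > z_crit:
            significant_r2.append(r * r)
    mean_significant = mean(significant_r2) if significant_r2 else None
    return mean(all_r2), mean_significant, len(significant_r2) / n_studies


mean_all, mean_sig, power = simulate_overestimation()
print("all studies:", mean_all, "significant only:", mean_sig, "power:", power)
```

Because only studies that cross the significance threshold are reported, the average reported effect is several times larger than the true 2%, while most replicates of the study detect nothing at all. Both the inflation and the truncation at small effects follow from low power.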
Data on effect sizes of QTNs from model organisms, such as Drosophila or mice, suggest that the distribution of effect sizes does not differ dramatically between different phenotypic traits (Flint & Mackay, 2009). However, the data from natural plant populations presented in Fig. 1 suggest significant differences in effect sizes between different categories of phenotypes (broadly classified as growth, morphology, phenology, reproduction or resistance, Kruskal–Wallis (KW) χ2 = 88.2, P < 0.001, Fig. 2a). The main reason for this appears to be substantially greater effect sizes for resistance and, to a lesser degree, also for phenology traits (Fig. 2a). It is not surprising that resistance QTNs can explain a substantial fraction of the phenotypic variation, since many resistance traits are qualitative rather than quantitative. There is also an apparent difference in effect size between primarily selfing and outcrossing species (KW χ2 = 54.4, P < 0.001) but this is to a large extent an effect of differences in the types of traits that have been investigated in species with different mating systems. The effect of mating system is only weakly significant when the trait type is taken into account in the analysis (KW χ2 = 4.1, P = 0.043, Fig. 2b). Finally, insertions and deletions have a substantially greater effect than other types of mutations (KW χ2 = 54.1, P < 0.001, Fig. 2c). This is not hard to understand, since insertions and deletions often have dramatic consequences, such as frame shifts, resulting in nonfunctional alleles that can have large phenotypic effects. One such example is the null alleles at FRIGIDA, which result in earlier flowering in A. thaliana (Johanson et al., 2000; Aranzana et al., 2005; Shindo et al., 2005). Even though nonfunctional alleles can have a large effect when studied in isolation, they are often hard to identify in GWAS studies (Atwell et al., 2010; Brachi et al., 2010).
The reason for this is allelic heterogeneity; there are clearly many ways to render a locus inactive through insertions, deletions or other kinds of null mutations.
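For readers unfamiliar with the Kruskal–Wallis test used in the comparisons above, the statistic can be computed as in the following sketch. It implements the standard rank-based H statistic with a tie correction. The effect-size values in the example are toy numbers chosen for illustration, not the data underlying Fig. 2, and the function name is hypothetical.

```python
def kruskal_wallis_h(groups):
    """Return (H, degrees of freedom) for a list of non-empty samples.

    H is compared against a chi-squared distribution with len(groups) - 1
    degrees of freedom; at least two observations in total are required.
    """
    pooled = sorted((value, g) for g, sample in enumerate(groups) for value in sample)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0.0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values share the average of the 1-based ranks i+1 .. j.
        avg_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    h = 12.0 / (n_total * (n_total + 1)) * sum(
        rs ** 2 / len(sample) for rs, sample in zip(rank_sums, groups)
    ) - 3.0 * (n_total + 1)
    correction = 1.0 - tie_term / (n_total ** 3 - n_total)
    # If every value is tied the statistic is undefined; report 0.
    return (h / correction if correction > 0 else 0.0), len(groups) - 1


# Toy percentages of variance explained for three trait categories.
toy_effects = {
    "growth": [1.5, 2.0, 2.2, 3.1],
    "phenology": [3.0, 4.5, 6.0, 7.5],
    "resistance": [8.0, 12.0, 20.0, 35.0],
}
print(kruskal_wallis_h(list(toy_effects.values())))
```

In practice a library routine such as `scipy.stats.kruskal` would normally be used, as it also returns the P-value.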
In the coming years, falling genotyping costs are likely to drive association studies away from candidate gene-based studies towards truly genome-wide studies. This will likely involve whole-genome resequencing of all individuals in a population and will allow an assessment of the effects of point mutations and insertions, deletions and larger structural variation, such as copy number variation. A recent such example involved whole-genome resequencing of Arabidopsis lyrata populations that were adapted to growing on either serpentine or normal soils (Turner et al., 2010). This analysis identified a number of genes that were strongly differentiated between the different populations, including several genes involved in heavy metal transport, suggesting that these are likely candidates for conferring local adaptation to serpentine soils.
In the coming years it will also likely become standard to collect at least some RNA-Seq data to include eQTL mapping in GWAS studies, and some degree of expression network integration is also likely to take place. One of the benefits of the rapid development of high-throughput sequencing technologies is that population choice for GWAS studies will no longer be restricted to current model organisms and will slowly become more focused on which species are most relevant for answering biological questions. As costs are diverted away from genotyping, a substantially greater proportion will be spent on phenotyping. In fact, with plants where clonal replication or development of inbred lines is possible, genotyping can be performed once and phenotyping can be repeated under virtually any environmental conditions, highlighting the need for the development of rapid, high-throughput phenotyping techniques. The utility of most GWAS studies will ultimately be dependent on accurate and reproducible phenotypic data. It is also easy to envision that as GWAS studies move from model organisms to nonmodel organisms, a major limitation will be the availability of funding to establish collections of suitable material that capture phenotypic diversity that is representative of the species or the phenomenon of interest. Finally, a greater emphasis should be placed on the population sizes used in association mapping studies. As outlined in the preceding section, association mapping populations in use today are likely large enough to identify mutations explaining a few per cent of the phenotypic variance. However, substantially larger mapping populations are needed to identify mutations with much smaller effects. Larger mapping populations, preferably replicated in many different environments, are also needed to explore the importance of epistasis, G × E and phenotypic plasticity.
From an applied perspective, it would be extremely valuable to understand the genetic basis and nature of phenotypic plasticity. Since many traits of economic importance in plants are highly plastic, it would be invaluable if plastic traits could be made more rigid, as breeders usually prefer to deal with traits that are as predictable as possible.
The combination of GWAS, eQTL data and whole-genome marker data will yield significant insight into the genetic architecture of complex traits and help elucidate the degree to which protein-coding polymorphisms and variation in gene expression contribute to controlling natural trait variation (Hoekstra & Coyne, 2007; Carroll, 2008). It should also be possible to ascertain whether certain classes of genes are more likely to be ‘targets’ of natural selection than others and to what degree this pattern varies between trait types or between different types of plant species.
Finally, to ensure the greatest utility of GWAS results in the future, all phenotype and genotype data should be made public and deposited in public databases. As such, file-format and minimum information standards need to be established, such as those available for sequence data or microarray experiments. A major obstacle to overcome will be to develop formats for storing information on the SNPs and other polymorphism data at a whole-genome level. For instance, what is the most efficient way to store information on SNP variants and their frequencies? As the number of GWAS studies in plants will likely increase dramatically in the near future, developing tools for efficient storage and dissemination of both phenotypic and genotypic data should have the highest priority.
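One possible answer to the storage question is to pack biallelic genotype calls into two bits each, an idea used by some binary genotype formats. The sketch below is a hypothetical encoding invented for illustration and is not compatible with any existing file format. It assumes calls are coded as the number of alternative alleles (0, 1 or 2), with `None` for missing data.

```python
MISSING_CODE = 3  # hypothetical 2-bit code reserved for a missing call


def pack_genotypes(calls):
    """Pack calls (0, 1, 2 or None) into bytes, four calls per byte."""
    packed = bytearray((len(calls) + 3) // 4)
    for i, call in enumerate(calls):
        if call is None:
            code = MISSING_CODE
        elif call in (0, 1, 2):
            code = call
        else:
            raise ValueError(f"invalid genotype call: {call!r}")
        # Each call occupies 2 bits; the first call sits in the lowest bits.
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, n_calls):
    """Recover the first n_calls genotype calls from packed bytes."""
    calls = []
    for i in range(n_calls):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        calls.append(None if code == MISSING_CODE else code)
    return calls


def alt_allele_frequency(calls):
    """Alternative-allele frequency among non-missing diploid calls."""
    observed = [c for c in calls if c is not None]
    if not observed:
        return None
    return sum(observed) / (2 * len(observed))


example_calls = [0, 1, 2, None, 2, 2, 0]
stored = pack_genotypes(example_calls)
print(len(stored), unpack_genotypes(stored, len(example_calls)))
print(alt_allele_frequency(example_calls))
```

With this layout a SNP typed in 1,000 individuals occupies 250 bytes, and allele frequencies can be recomputed from the packed calls rather than stored separately. Whether frequencies should be stored or derived is a design choice a community standard would need to settle.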
This study has been funded by a research grant from the Swedish Research Council and by a ‘Young Researcher Award’ from Umeå University to P.K.I.