HGV2012: Leveraging Next-Generation Technology and Large Datasets to Advance Disease Research
Contract grant sponsors: NIH Training Grant (T32 GM007175); Fudan University; the National Natural Science Foundation of China (C060501); National Genome Research Institute (R13 HG003953); EU Project GEN2PHEN; University of Leicester; McLaughlin Centre at the University of Toronto; Affymetrix; Illumina; Macrogen; Agilent Technologies; BGI Tech Solutions; Wuxi Genome Center (WuXi AppTec); Genesky Biotechnologies; Complete Genomics; Life Technologies; eBiotrade.
Correspondence to: Nina Gonzaludo, University of California, San Francisco, Department of Bioengineering & Therapeutic Sciences, Cardiovascular Research Institute, 555 Mission Bay Blvd South, MC-3118, Room 282, San Francisco, California 94158. E-mail: Nina.Gonzaludo@ucsf.edu
The 13th International Meeting on Human Genome Variation and Complex Genome Analysis (HGV2012: Shanghai, China, 6th–8th September 2012) was a stimulating workshop where researchers from academia and industry explored the latest progress, challenges, and opportunities in genome variation research. Key themes included advancements in next-generation sequencing (NGS) technology, investigation of common and rare diseases, employing NGS in the clinic, utilizing large datasets that leverage biobanks and population-specific cohorts, and exploration of genomic features.
The 13th International Meeting on Human Genome Variation and Complex Genome Analysis (HGV2012) was held in Shanghai, China on 6–8 September 2012. Over the 3 days of the meeting, 34 multidisciplinary speakers shared their research in nine sessions (Box 1), interspersed among 100 poster presentations. HGV2012 was attended by 151 participants from 26 countries, with all participants presenting original work (a requirement for attending this meeting). HGV2012 was part of the series of meetings that began in 1998, then entitled “SNPs and Complex Genome Analysis,” in response to the emergence of great interest in SNP research. The first small meeting consisted of approximately 50 investigators who came together for a workshop-style brainstorming and debate in Skokloster, Sweden. Although the number of participants has expanded since then, this meeting maintains its collaborative workshop atmosphere. More information about the meeting, along with abstracts and a program, can be found on the meeting website (http://www.hgvmeeting.org/hgv2012/).
The availability of large datasets produced by next-generation technology has accelerated research on disease and clinical phenotypes. New technology has spurred the development of new methods, presented here, that are strongly impacting how researchers address and understand human health, population genetics, and genome characteristics including and beyond DNA sequence. As the price of next-generation sequencing (NGS) technology continues to drop, researchers are faced with an unprecedented opportunity to capture DNA variation in both larger and more specific populations. Results presented at HGV2012 show encouraging progress in the attempt to better understand disease susceptibility and genomic variation. At the same time, preliminary analyses of large datasets highlight the need to continuously question our assumptions and consider factors beyond the genome in our research.
Advancements in NGS Technology
The DNA sequencing market continues to experience rapid growth, with an increasing number of platforms, service providers, and analysis pipelines available to researchers. New technologies allow for genomic measurements that are more accurate and have higher throughput, with all this being achieved at lower costs. One other key benefit from these advances in NGS technology is speed. For example, Peidong Shen described utilizing long padlock probes and an Illumina MiSeq to identify SNPs and copy number variants (CNVs) in 5,471 exons in 524 disease candidate genes. The process, which would have previously taken days or months to complete, was performed in 48 hr. Such improvements open up many possibilities for clinical genomic applications. In addition to technical advancements, significant progress has been made on methods to align sequencing reads and call variants. However, certain regions of the genome still present daunting challenges. Charles Cox presented work focused on the major histocompatibility complex region, which has been associated with many diseases and adverse drug events. Given the structural diversity and highly polymorphic nature of HLA genes, Cox showed that NGS analysis pipelines systematically undercall variants in this region and proposed new strategies for more accurately genotyping this region using alternative reference sequences.
Box 1. HGV2012 Meeting Sessions
- Common disease (Part I)
- NGS research (Part I)
- Rare disease
- Novel dataset analysis
- New technologies
- Beyond GWAS (Part I)
- Common disease (Part II)
- Beyond GWAS (Part II)
- NGS research (Part II)
Other researchers at HGV2012 focused on practical aspects of utilizing NGS. For example, researchers who decide to leverage NGS technology are faced with numerous options. Pui-Yan Kwok shared comparative findings based on samples from five individuals. Kwok presented results comparing data from blood versus saliva samples, whole genome sequencing (WGS) versus whole exome sequencing (WES), and Nimblegen SeqCap EZ Library (Roche NimbleGen, Inc., Madison, WI) versus Agilent SureSelect kits (Agilent, Santa Clara, CA) for exome capture in WES. Anthony Brookes described the phenomenon of “Thermodynamically Ultra-fastened” (TUF) DNA—genome sequences that resist denaturation at temperatures well in excess of 100°C—thereby explaining why assay signal strengths vary across different regions of the genome assay, across methods and technologies, and across DNA samples (dependent upon DNA quality in terms of integrity). Brookes illustrated that using known fragmentation methods before DNA amplification can substantially overcome the effects of TUF DNA on regions surrounding TUF elements.
Although researchers are rapidly adopting currently available NGS technologies, engineers continue to work on improving sequencing speed and accuracy for even higher throughput. Jingyue Ju presented molecular engineering work that leverages nucleotide analogues to potentially enable real-time single molecule sequencing by synthesis with electronic detection.
Investigation of Common and Rare Diseases
Candidate gene analyses and genome-wide association studies (GWAS) have laid the foundation for understanding the contribution of genetic factors to human disease. As NGS technology continues to progress and become more accessible, researchers are investigating diseases and complex traits genome-wide across multiple populations at increasing resolution. Along with increasing NGS availability, the recent release of Pilot and Phase 1 data from the 1000 Genomes Project has created opportunities to more systematically assess the extent and distribution of both risk and protective alleles in the genome, which may provide insight into observed differences in disease frequencies among populations. Yuan Chen used 1000 Genomes Pilot data to show the number of potentially deleterious and disease-associated variants per individual observed in WGS data. The deleterious variants were predicted by Condel scores [González-Pérez and López-Bigas, 2011], whereas the disease variants were derived from the Human Gene Mutation Database [Stenson et al., 2003] and then manually annotated. This work gives medical resequencing projects using next-generation technology a baseline for the number of deleterious and disease variants to expect in the general population and illustrates the need for caution in interpreting apparent disease variants. Steven Brenner presented results from the Second Critical Assessment of Genome Interpretation meeting. Results from a series of prediction challenges with different types of NGS data and phenotypes spurred discussion on the strengths and limitations of current functional prediction and annotation methods.
Although researchers strive to increase sample sizes in GWAS to discover additional variants underlying human disease, it is not clear what sample size characteristics are necessary for prediction methods based on NGS data. Nilanjan Chatterjee assessed the predictive performance of polygenic models of 10 complex traits, investigating the impact of the number and distribution of effect sizes for susceptibility SNPs, training set sample size, and true and false positives associated with SNP selection in models. By optimizing such parameters, predictive polygenic models using comprehensive sets of common SNPs, and later rare variants, could prove valuable input for clinical applications.
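As a rough illustration of what such a polygenic model computes, a polygenic score is commonly formed as a weighted sum of risk-allele counts over the selected SNPs. The following is a minimal sketch under that assumption, not the models Chatterjee evaluated; the function name, effect sizes, and genotypes are hypothetical.

```python
def polygenic_score(allele_counts, effect_sizes):
    """Return the weighted sum of risk-allele counts (0, 1, or 2 per SNP)."""
    if len(allele_counts) != len(effect_sizes):
        raise ValueError("need exactly one effect size per SNP")
    return sum(count * beta for count, beta in zip(allele_counts, effect_sizes))


# Hypothetical per-SNP effect sizes (e.g., log odds ratios) estimated
# from a training set; a negative value marks a protective allele.
effect_sizes = [0.10, 0.25, -0.05]

# Hypothetical genotypes: risk-allele counts at the same three SNPs.
person_a = [2, 0, 1]
person_b = [0, 2, 0]

score_a = polygenic_score(person_a, effect_sizes)
score_b = polygenic_score(person_b, effect_sizes)
print(score_a, score_b)
```

In this simplified view, the parameters discussed above map onto the inputs: a larger training set yields less noisy effect sizes, and SNP selection determines which positions contribute to the sum.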
Other speakers at HGV2012 focused on progress in specific disease areas and complex traits. Hidewaki Nakagawa discussed results from a GWAS and subsequent replication study to identify genetic factors affecting prostate cancer susceptibility (MIM #176807) in a Japanese population. Liang Li presented work that integrated multiple data types to elucidate mechanisms underlying the large interindividual variability observed in response to gemcitabine and cytosine arabinoside treatment. Li leveraged mRNA expression and SNP data from a human lymphoblastoid cell-line model system, as well as cytotoxicity assay phenotypes, to show that cell-based GWAS and functional validation can help to identify biomarkers for drug response and elucidate mechanisms of drug resistance. Hongyan Wang established an unexpected association between cystathionine β-synthase (CBS) and reduced risk of congenital heart disease in a Chinese Han population. Wang reported that a functional variant in the CBS gene promoter region, which increased CBS activity, might provide a protective effect to cells during critical stages of heart development. Using a large Chinese cohort of 2,031 cases and 2,044 controls, Dongxin Lin discovered multiple new genetic susceptibility loci for esophageal squamous-cell carcinoma (MIM #133239), as well as loci showing significant association in a gene–environment interaction analysis of alcohol consumption.
Stephen Chanock presented work by Nathaniel Rothman, noting that studies of disease based only on genetics may produce results contradictory to those of studies based on gene–environment interactions. In-depth research of environmental factors may improve our understanding of disease risk, particularly in regard to cancer. Chanock discussed examples of following up top GWAS loci for cancer susceptibility in the context of environment, and discussed how such analyses may impact public health. Jerome F. Strauss III also discussed the impact of environment on observed phenotypes. Using data from twin studies based on subjects from the United States and Sweden, Strauss showed that both fetal and maternal genetic factors, in addition to the environment that is shared among pregnancies or unique to a specific pregnancy, including virulence genes in the vaginal microbiome, differ substantially by population and may partially explain disparities in rates of preterm birth between Americans of European and African ancestry.
Despite advances in genomic technologies, certain genomic analyses still remain impractical to perform in humans, even with large cohorts. David Buchner described his work on understanding the complex genetic basis of diet-induced obesity by fine mapping of quantitative trait loci (QTL) in mouse congenic strains. Buchner observed many large-effect QTL, epistasis, and unconventional inheritance patterns, highlighting the complex genetic bases of obesity and demonstrating a novel mapping strategy to identify genes implicated in disease. Shi Huang discussed strategies to investigate the possible influence of natural selection on minor alleles of common SNPs. Using recombinant inbred lines in model organisms, Huang found that more minor alleles were significantly correlated with adaptive traits and sensitivity to exogenous compounds. This finding was also observed in 21 human GWAS datasets of common diseases and phenotypes, suggesting a new angle for investigating the genetic basis of complex traits and disease.
Employing NGS in the Clinic
Although results from GWAS and NGS projects have significantly contributed to our understanding of disease through discovery of causal variants, the implementation of such technology in the clinic remains a challenge. One promising application of NGS in the clinic is its use for noninvasive prenatal diagnosis. Y.M. Dennis Lo described progress in methods used to obtain fetal genomic information in NGS data generated from maternal plasma DNA. Lo discussed three proof-of-concept studies showing that the entire fetal genome can be sequenced from maternal plasma. In this regard, selected fetal aneuploidies such as trisomy 21 can already be robustly detected and used clinically. Regardless of application, interpretation of any data type used in clinical practice depends largely on the use of meaningful vocabularies to describe phenotypes of interest. Ségolène Aymé described Orphanet, a collection of published expert classification systems and phenotypes indexed using common vocabularies, which includes nomenclature for rare diseases down to the gene level. Generation of NGS data, particularly in a clinical context, requires that secure and standardized tools for the sharing of results and datasets be in place to allow for the linking of potential mutations to phenotypes. Owen Lancaster discussed the web-based Cafe Variome, an innovative tool that facilitates “open discovery” rather than “open sharing” of sensitive and private data by researchers and diagnostic laboratories.
Utilizing Large Datasets that Leverage Biobanks and Population-Specific Cohorts
Many participants reported progress from consortia and international collaborations investigating specific disease areas or populations. Such resources include numerous data types and sources, allowing for a detailed investigation of genomic and environmental factors underlying clinically observed phenotypes. Many speakers discussed encouraging preliminary results, implementation and interpretation challenges, as well as opportunities for collaboration and replication of methods across institutions. Arthur Holden described the International Serious Adverse Events Consortium, a collaborative effort between pharmaceutical companies, the Wellcome Trust, and numerous research networks and institutions, which is working to identify and validate variants underlying drug-induced serious adverse events by combining datasets and phenotyping methods. Ming Qi described aims and progress of the Chinese Consortium of the Human Variome Project (HVP-CHINA). This project currently spans 11 institutions from five Chinese provinces and aims to develop clinical tests for 100 clinical diseases and phenotypes, supporting clinical and research work with both traditional and NGS technology. Gerome Breen described work from the Psychiatric Genomics Consortium, which combines over 60,000 samples from 46 case-control studies of five major psychiatric diseases. In addition to discussing how downstream pathway analysis of GWAS results revealed considerable overlap in significant pathways across the diseases studied, Breen also provided commentary comparing the many pathway analysis tools available to researchers.
One emerging trend reported at HGV2012 was the use of biobanks linked to electronic medical record (EMR) systems for genomic studies. Catherine Schaefer provided an overview of the Genetic Epidemiology Research Study on Adult Health and Aging, a collaborative effort between Kaiser Permanente and University of California, San Francisco that aims to leverage patient genomic data from 100,000 individuals (later increasing to 500,000) and EMRs to identify genetic and environmental factors associated with health and aging. Schaefer described early results of a GWAS of plasma lipid levels and response to statins, validating known candidate loci and supporting the use of EMRs to derive phenotypes for genetic studies. Nina Gonzaludo discussed GWAS results from this same resource in which EMRs were used to derive the phenotype of weight gain induced by atypical antipsychotics. Cisca Wijmenga also described an EMR-linked biobank resource called Genome of the Netherlands (GoNL), which includes both NGS and array-based genotyping of 769 individuals, based on 250 trio or quartet families. She discussed how results from this project have helped to identify disease-associated variants in a cohort of patients with celiac disease using the GoNL as a reference for imputation of array-based GWAS genotyping. Samples in this resource are also being assessed by other technologies, such as microbiome profiling and RNA-Seq to establish eQTLs.
Highlighting Population Variation
The recent availability of large, high-resolution datasets across multiple populations has also positively impacted our ability to investigate population-specific variation and migration patterns. Many HGV2012 participants leveraged 1000 Genomes data to address these topics, as well as to develop new methods that utilized detailed population data to improve genomic studies. Principal Component Analysis, which is often used to estimate population substructure reflected in GWAS, appears to inflate Type I errors when applied to rare variants identified by WGS. Dandi Qiao described a novel method for identifying study subjects that may create population substructure and bias results in studies that leverage WGS. Her method requires minimal computational capacity and was able to distinguish samples from close but differing populations in the 1000 Genomes Phase 1 data. Identifying regions of positive selection in the human genome using coalescent theory requires a large number of genomes. Hang Zhou leveraged 1000 Genomes data to develop a new coalescence-based method for fine mapping such regions. To better understand the impact of archaic hominid admixture, Li Jin also leveraged this resource. Jin described an algorithm to identify archaic segments in non-African individuals and characterized the genealogy of each segment by analyzing coalescence in Neanderthal, Denisovan, African, and chimpanzee data.
Jean Alain Trejaut described a phylogenetic analysis of the distribution of Y-chromosome haplogroups in 1,400 Southeast Asians to delineate migration routes of modern humans from Mainland Southeast Asia into Island Southeast Asia. Using high-resolution genotyping, Trejaut also compared genetic migration patterns with known linguistic migration study results. Observed distributions are suggestive of migratory directions and may help to identify the origin of native tribes.
Exploration of Genomic Features
As the technology and methods available for detecting genomic level variation improve, so too does our ability to assess variation beyond DNA sequences. Such variation can have direct impact on transcription and translation, ultimately affecting phenotype and disease susceptibility. Charles Lee described analyses of structural genomic variation, including CNVs, within model organisms, including an observation of increased genetic variation even within a strain of mice. These observations support regular genotyping of model organisms as flawed assumptions of homogeneity can greatly affect research results. Methods for detecting CNVs in WGS data cannot be applied to WES because of the differences in read distributions. Günter Klambauer presented cn.MOPS, a novel method that uses a Bayesian approach to detect CNVs in both WGS and WES data based on read variation mixture components and Poisson distributions. Pui-Yan Kwok presented a novel method for long-range genome mapping that uses long DNA molecules labeled at specific sequence motifs and stretched in nanochannels with fluorescent imaging. This approach is suitable for whole genome CNV analysis and haplotyping, gaining high fidelity with just 10× coverage of the genome. This method also provides scaffolds for de novo DNA sequence assembly.
Yun Zheng showed that conservation, genomic context, secondary structure, and functional importance of human microRNAs (miRNAs) affect the frequency of SNPs in miRNA genes. Zheng observed a correlation between variation in miRNA SNP frequencies and geographical distributions of various populations. Although such variation may explain some observed posttranscriptional effects, understanding how genome-wide SNPs affect translational activity remains challenging. Constantin Polychronakos approached this challenge with a high-throughput method that measured actively translated transcripts with polysomal mRNA as a proxy for translational efficiency and tested for association with SNPs. Related to this, in Wijmenga's discussion of GoNL, she described research on association of celiac disease-associated SNPs with lincRNA expression levels that may help elucidate the link between noncoding SNPs and protein-coding genes.
Stephen Chanock cautioned researchers to be aware of genetic mosaicism in GWAS data, particularly in cancer samples, but also in diseases related to aging. Analysis of more than 100,000 samples, including controls, revealed a higher than expected number of structural somatic events and chromosomal abnormalities. Such findings are generally filtered out during quality control steps but may play an important role in disease progression and warrant further study, particularly in the context of predisposition to diseases strongly linked to aging, as well as the instability of genomes with age.
Envisioning Future Progress
The increasing accessibility of next-generation technology will continue to drive the development of methods and approaches to further our understanding of the human genome, disease, and populations. The availability of large datasets, many openly and freely available, provides an unprecedented opportunity to study rare and common variants at greater resolution and in more populations than ever before. Tools, methods, analyses, and resources generated from these data and next-generation technology will likely impact future research in this field, as well as the ability to apply new findings in the clinic. The HGV meeting series provides a leading forum for active discussion of new developments, with HGV2013 currently scheduled for 30 September–2 October 2013 in Seoul, South Korea.