Keywords:

  • copy number variation;
  • CNV;
  • text mining;
  • disease;
  • CNV database

Abstract

  1. Top of page
  2. Abstract
  3. Introduction
  4. The CNVD Database
  5. Conclusions and Future Perspectives
  6. Acknowledgements
  7. References

Copy number variation (CNV) is a type of chromosomal structural rearrangement that has been detected in recent years mainly by high-throughput biological technologies. CNVs are ubiquitous in many species, and accumulating evidence indicates that they are closely related to complex diseases. The investigation of chromosomal structural alterations has begun to reveal important clues to the pathologic causes of diseases and to the disease process. However, many published studies have focused on a single disease and, so far, the experimental results have not been systematically collected or organized. We built the Copy Number Variation in Disease database (CNVD) by manual text mining of 6301 published papers. CNVD contains CNV information for 792 diseases in 22 species from diverse types of experiments, which supports both high confidence in, and comprehensive coverage of, the relationships between CNVs and diseases. In addition, CNVD provides multiple query modes and visualized results. With its user-friendly interface and integrated CNV information for different diseases, CNVD offers a comprehensive platform for disease research based on chromosomal structural variations. The CNVD interface is accessible at http://bioinfo.hrbmu.edu.cn/CNVD. © 2012 Wiley Periodicals, Inc.


Introduction

Copy number variations (CNVs) are segments of DNA that range from about one kilobase (1000 nucleotide bases) to several megabases in size [Feuk et al., 2006; Goidts et al., 2006]. CNVs represent copy number imbalances between two individuals of a species [Freeman et al., 2006]. Genome variants such as deletions, duplications, triplications, translocations, or insertions can all result in CNVs [Stankiewicz and Lupski, 2010; Wain et al., 2009]. CNVs are longer than previously characterized genomic variations such as SNPs and indels. In recent years, CNVs have been detected using a variety of techniques, including array-based comparative genomic hybridization (aCGH), deep sequencing, single nucleotide polymorphism arrays (SNP-A), quantitative real-time PCR (qPCR), and fluorescence in situ hybridization (FISH).

Imbalances in the genome caused by CNVs are related to the occurrence and progression of diseases. Studies have shown that CNVs are likely to influence gene expression via gene dosage effects and structural variations, which in turn affect individual phenotypes and can ultimately cause disease [Rodriguez-Revenga et al., 2007]. Linzmeier and Ganz [2005] reported CNVs in the alpha-defensin genes DEFA1 and DEFA3 and in the beta-defensin genes DEFB4 and DEFB103. They found that the variations could affect the expression of these genes and were related to the prevention of infection in the human body. Legartova et al. [2009] found that the Mcl-1S variant in the human genomic region 1q21 was highly expressed in multiple myeloma (MM) MOLP-8 cells and reported that this could cause the oncogenesis of MM. They also suggested that amplification of the 1q21 region was an important diagnostic marker of MM. De Wilde et al. [2011] discovered that copy number alterations and hypermethylation of LKB1 contributed to pancreatic acinar cell carcinoma in humans. A growing number of studies show that CNVs are closely related to complex diseases. The study of CNVs, and particularly of the genes located in CNV regions, will offer important clues to the pathologic causes of diseases and to the disease process.

Increasing amounts of CNV data, covering almost the whole human genome, have been produced by high-throughput biological technologies in recent years. These data provide an unprecedented opportunity to explore genetic variation in human diseases; however, storing and analyzing the CNV data associated with so many diseases is a challenge. Since the release in 2006 of the first comprehensive map of CNVs in the human genome [Redon et al., 2006], many online CNV databases have been developed. However, despite the important roles of CNVs in diseases such as cancers, no database comprehensively integrates the CNVs that have been experimentally detected in multiple diseases. For example, the Database of Genomic Variants (DGV, http://projects.tcag.ca/variation/) [Iafrate et al., 2004; Zhang et al., 2006] contains only structural variations that have been identified in healthy control samples and therefore cannot be used to study the role of CNVs in diseases. CNVVdb (http://CNVVdb.genomics.sinica.edu.tw/) [Chen et al., 2009], a database of copy number variations across vertebrate genomes, uses pairwise sequence alignment based on the Blastz algorithm to identify putative CNVs in 16 vertebrate genomes. However, CNVVdb contains neither gene information nor information about the relationships between CNVs and diseases and, importantly, the reliability of predictions obtained by sequence search algorithms is still debatable. CaSNP (http://cistrome.dfci.harvard.edu/CaSNP/) [Cao et al., 2011], a database for interrogating copy number alterations in cancer, collects cancer-related CNV data from SNP-A using the DNA-Chip Analyzer (dChip) software; however, it does not cover other diseases or other experiment types.

To provide a comprehensive and reliable web interface that links CNVs to diseases, we developed the Copy Number Variation in Disease database (CNVD) based on text mining. Specifically, we used the EndNote software to download from PubMed original CNV-related papers published from 2006 to 2012. We manually extracted CNV information that included the associated diseases, genes, chromosome segments, and descriptions of the CNVs, with the aim of building a database that contains the most comprehensive and reliable data currently available.

The CNVD Database

The Composition of CNVD

A total of 6301 original papers published from January 2006 to March 1, 2012 were obtained from the PubMed database by searching titles and abstracts with the keywords ‘copy number’, ‘CNV(s)’, ‘CNA(s)’, ‘CNP(s)’, ‘genomic imbalance(s)’, ‘genomic rearrangement(s)’, ‘microdeletion(s)’ and/or ‘microduplication(s)’. The number of papers obtained per year is shown in Table 1. For each CNV, the associated disease, genes, chromosome segment, and description were manually extracted. To ensure the reliability of the data, we mainly included CNVs detected by experimental methods such as aCGH, SNP-A, FISH, and qPCR.

Table 1. Number of papers retrieved per year, 2006 to March 2012
Year               2006   2007   2008   2009   2010   2011   2012   Total
Number of papers    787    884   1081   1060   1153   1030    306    6301
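
The literature search described above can be approximated with a PubMed query string. The following is a minimal sketch, not the authors' actual procedure (they report using EndNote): the function name `build_pubmed_query`, the expansion of each "(s)" keyword into singular and plural forms, and the start date of January 1, 2006 are assumptions made for illustration. The field tags `[Title/Abstract]` and `[Date - Publication]` are standard PubMed search tags.

```python
from datetime import date

# Keywords quoted from the text; each "(s)" form is written out as
# singular and plural (an assumption about how the search was expanded).
KEYWORDS = [
    "copy number", "CNV", "CNVs", "CNA", "CNAs", "CNP", "CNPs",
    "genomic imbalance", "genomic imbalances",
    "genomic rearrangement", "genomic rearrangements",
    "microdeletion", "microdeletions",
    "microduplication", "microduplications",
]


def build_pubmed_query(keywords, start, end):
    """Return a PubMed query: keywords OR-ed over title/abstract, AND a date range."""
    terms = " OR ".join(f'"{kw}"[Title/Abstract]' for kw in keywords)
    date_range = (
        f'("{start:%Y/%m/%d}"[Date - Publication] : '
        f'"{end:%Y/%m/%d}"[Date - Publication])'
    )
    return f"({terms}) AND {date_range}"


# January 2006 to March 1, 2012, as stated in the text.
query = build_pubmed_query(KEYWORDS, date(2006, 1, 1), date(2012, 3, 1))
print(query)
```

The resulting string can be pasted into the PubMed search box; the number of hits today will differ from the 6301 papers reported here because PubMed indexing changes over time.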

The CNVD database covers 792 diseases (Fig. 1) in 22 species: Homo sapiens, Mouse, Dog, Arabidopsis thaliana, Bacillus thuringiensis, Bovine, Caenorhabditis elegans, Chimp, Drosophila, Gorilla, Grapevine, Great Apes, Maize, Pig, Plasmodium falciparum, Primates, Saccharomyces cerevisiae, Sheep, Chicken, Salmonella paratyphi C, Pseudomonas, and Mycobacterium ulcerans. This significantly increases the scope of the information available about CNVs in diseases. Neoplasms account for 28.41% (225) of the diseases in CNVD, implying that CNVs potentially influence specific processes in the induction of cancers. Previously published studies support this point [Beroukhim et al., 2010; Zhao et al., 2004].

Figure 1. Various kinds of diseases included in the CNVD. DEMP: diseases of the ear and mastoid process (7, 0.88%); CIPD: certain infectious and parasitic diseases (7, 0.88%); CMDCA: congenital malformations, deformations and chromosomal abnormalities (7, 1.27%); DRS: diseases of the respiratory system (45, 5.68%); PCP: pregnancy, childbirth and the puerperium (15, 1.89%); DEA: diseases of the eye and adnexa (20, 2.53%); DDS: diseases of the digestive system (14, 1.77%); DBBFO: diseases of the blood and blood-forming organs (23, 2.90%); DSMS: diseases of the skin and musculoskeletal system (26, 3.28%); ENMD: endocrine, nutritional and metabolic diseases (17, 2.15%); MBD: mental and behavioural disorders (26, 3.28%); DIIM: disorders involving the immune mechanism (26, 3.28%); DMSCT: diseases of the musculoskeletal system and connective tissue (40, 5.05%); DCS: diseases of the circulatory system (39, 4.92%); DGS: diseases of the genitourinary system (56, 7.07%); DNS: diseases of the nervous system (80, 10.10%); Neoplasms (225, 28.41%); others (75, 9.47%).

The information mined from the literature was extended to include “details of the record”, “result map” and “information of gene”. In the “details of the record” section, chromosome regions, genome browsers and disease categories can be obtained. In the “result map” section the chromosome regions were divided according to NCBI Map Viewer Build 36.3 (http://www.ncbi.nlm.nih.gov/projects/mapview/). In the “information of gene” section, gene ID, gene name, and location from NCBI Gene (NCBI's database for gene-specific information, ftp://ftp.ncbi.nih.gov/gene) are available along with the Gene Ontology (GO) [Ashburner et al., 2000] description downloaded from the Ensembl database (http://www.ensembl.org/) [Hubbard et al., 2002].

Using the CNVD Database

CNVD provides multiple modes to query disease-related CNVs. The query modes include gene search, disease search, chromosome search, map view and advanced search that allow users to access the interface most relevant to their needs. On the help page, the parameters used for the query interface are explained and the query results are described. This information helps the user to easily query the database to obtain the best results.

For example, to find which CNV regions have copy numbers that differ significantly between patients with follicular lymphoma and healthy people, and to find the genes involved, the user can choose the “Disease Search” option on the top page of the website, which leads to the interface shown in Figure 2-a. Selecting “Homo sapiens” as the species and entering “follicular lymphoma” in the “Other disease” textbox returns the results shown in Figure 2-b. The results are presented as a table that includes the species, chromosome number, start and end positions, the genes involved in the CNV region, the associated disease and a clickable link to the corresponding PubMed entry. All the information in the results tables was extracted from the literature by manual text mining. The ‘view’ button in the first column of the table links to the corresponding record, which contains detailed information about the selected CNV. Each record includes the platform/method used to detect the CNV, the number of case and control samples, the description of the CNV (amplification or deletion), and the disease class, and provides graphic links to UCSC and Ensembl, as well as links to NCBI Gene and GO (Fig. 2-d). Clicking on the ‘Result Map’ tab at the top right-hand corner of the result page (see Fig. 2-c) brings up a page that shows, in this example, the distribution of follicular lymphoma-related CNVs on the human chromosomes. Clicking on a gene name on the results page, for example “C8orf17”, displays a page of annotation details, such as gene ID, gene symbol, GC content, gene name, gene type, location, PDB ID, Pfam ID, OMIM ID, GO term, GO accession number, and PubMed ID (Fig. 2-e). All the IDs displayed on this page are clickable.

Figure 2. An example of a disease search. a) The interface for entering the diseases to query. b) The results of the disease search. c) Visualization of all query results. d) The detailed information for each record. e) The detailed information for a gene.
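
The result table described in the disease search example can be pictured as a list of records filtered by species and disease name. The sketch below is illustrative only: the class `CNVRecord`, the function `disease_search`, and all rows (coordinates, gene names, PubMed placeholders) are hypothetical and do not reflect the actual CNVD schema or data. Only the field names follow the columns listed in the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CNVRecord:
    """One row of a result table, using the fields named in the text."""
    species: str
    chromosome: str
    start: int                 # start position of the CNV region
    end: int                   # end position of the CNV region
    genes: Tuple[str, ...]     # genes involved in the CNV region
    disease: str
    description: str           # "amplification" or "deletion"
    pubmed_id: str


def disease_search(records: List[CNVRecord], species: str, disease: str) -> List[CNVRecord]:
    """Return records for one species whose disease name contains the query text."""
    wanted = disease.strip().lower()
    return [r for r in records
            if r.species == species and wanted in r.disease.lower()]


# Made-up placeholder rows, NOT real CNVD data.
RECORDS = [
    CNVRecord("Homo sapiens", "8", 1000, 2000, ("GENE_A",),
              "follicular lymphoma", "amplification", "PMID_PLACEHOLDER_1"),
    CNVRecord("Homo sapiens", "1", 5000, 9000, ("GENE_B", "GENE_C"),
              "multiple myeloma", "amplification", "PMID_PLACEHOLDER_2"),
    CNVRecord("Mouse", "2", 300, 700, ("GENE_D",),
              "follicular lymphoma", "deletion", "PMID_PLACEHOLDER_3"),
]

hits = disease_search(RECORDS, "Homo sapiens", "Follicular Lymphoma")
for r in hits:
    print(r.chromosome, r.start, r.end, ",".join(r.genes), r.description, r.pubmed_id)
```

Substring matching on a lower-cased name is used here so that a shorter keyword such as "lymphoma" also returns the record; how the CNVD server matches disease names internally is not described in the text.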

On the “Disease Search” query page, the eight diseases that appear in the most papers in CNVD (breast cancer, prostate cancer, lung cancer, schizophrenia, gastric cancer, ovarian cancer, epilepsy and autism) can be conveniently accessed using the quick query option. These eight diseases are displayed with checkboxes at the top of the page and can be selected by checking one or more of the boxes, eliminating the need to enter their names each time. If the results of a query do not satisfy the user, fuzzy matching searches or shorter keywords are likely to produce better results. In addition, the “Download” option allows users to download a list of all disease names contained in the CNVD database, which can help the user set up a useful query. Lists of species names, gene symbols, segments, and papers, as well as all CNVD records, can also be downloaded using the “Download” option. The query results can be downloaded as Text or Excel files using the “send to” drop-down menu at the top right-hand corner of the results page. Users can either select the records they want to download or download all of them; by default, all the results are downloaded without the need to make any selection. The option to upload data is also available. The other search options that are available from the top page (Fig. 3) can all be used in ways that are similar to the “Disease Search” example described above. Users can apply these search interfaces flexibly according to their requirements.

Figure 3. The CNVD search modes.
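
The fuzzy matching and text download described for the disease search can be sketched locally against a downloaded list of disease names. This is an assumption-laden illustration: the text does not say how CNVD implements fuzzy matching, so Python's standard `difflib.get_close_matches` is used as a stand-in, and the helper names `suggest_diseases` and `to_tab_delimited` are hypothetical. The name list below contains only the eight quick-query diseases named in the text.

```python
import csv
import difflib
import io

# The eight quick-query diseases listed in the text; a real list would come
# from the "Download" option of the database.
DISEASE_NAMES = [
    "breast cancer", "prostate cancer", "lung cancer", "schizophrenia",
    "gastric cancer", "ovarian cancer", "epilepsy", "autism",
]


def suggest_diseases(query, names, n=3, cutoff=0.6):
    """Return up to n disease names that closely match a possibly misspelled query."""
    lowered = {name.lower(): name for name in names}
    close = difflib.get_close_matches(
        query.strip().lower(), list(lowered), n=n, cutoff=cutoff
    )
    return [lowered[c] for c in close]


def to_tab_delimited(header, rows):
    """Serialize rows as tab-delimited text, similar to a plain-text download."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


suggestions = suggest_diseases("brest cancer", DISEASE_NAMES)
print(to_tab_delimited(["disease"], [[name] for name in suggestions]))
```

Lowering the `cutoff` value returns more distant candidates, which mirrors the advice in the text to loosen the query when the first search returns too little.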

The Role of the CNVD Database in Disease Research

The data in the CNVD database have been carefully mined from most of the publicly available CNV-related literature published in recent years. CNVD contains information on 792 diseases. Most of the results documented in CNVD were derived from CNV detection experiments that are generally considered reliable. The high coverage and reliability of CNVD are expected to open up new opportunities for disease research from the point of view of genetic variation.

Specifically, new links between CNV regions and diseases may be discovered by querying the database. These CNVs could be considered genetic markers for the disease, and a genome-wide association analysis of the CNV regions may help locate potential candidate genes for the disease. Moreover, the extensive experimental data in CNVD could contribute to clinical diagnosis and prognosis. In addition, the CNVs that are found to be related to several diseases could be used to study similarities between those diseases, which may help identify common targets. Further, once expression data are included, CNVD will support comprehensive analysis of the relationship between CNVs and gene expression.
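
One simple way to study similarities between diseases from their CNV-associated genes is to compare gene sets with the Jaccard index (shared genes divided by all genes in either set). The text does not prescribe a similarity measure, so the choice of Jaccard, the function names, and the disease and gene labels below are all assumptions for illustration; the data are made-up placeholders, not CNVD content.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_disease_pairs(genes_by_disease):
    """Score every disease pair by gene-set overlap, most similar first."""
    pairs = []
    for d1, d2 in combinations(sorted(genes_by_disease), 2):
        g1, g2 = genes_by_disease[d1], genes_by_disease[d2]
        pairs.append((jaccard(g1, g2), d1, d2, sorted(g1 & g2)))
    return sorted(pairs, key=lambda p: p[0], reverse=True)


# Hypothetical disease-to-gene mapping, NOT taken from CNVD.
GENES_BY_DISEASE = {
    "disease_x": {"G1", "G2", "G3"},
    "disease_y": {"G2", "G3", "G4"},
    "disease_z": {"G5"},
}

ranking = rank_disease_pairs(GENES_BY_DISEASE)
for score, d1, d2, shared in ranking:
    print(f"{d1} vs {d2}: {score:.2f} shared={shared}")
```

The shared genes of the top-ranked pairs are the natural starting point when looking for common targets, although overlap alone does not establish a shared mechanism.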

Conclusions and Future Perspectives

CNVD is a practical online database for disease studies based on CNVs. As far as possible, all recently published, publicly available information was collected from the CNV-related literature. The reliability of the data is supported by the fact that the manually mined results came mainly from experiments. Given the power of algorithms in identifying CNVs, computationally predicted results were also included. Because no other database collects and systematizes CNV data across a wide variety of diseases, CNVD was built to provide a wider platform for studies of the role of CNVs in many diseases. CNVD has a user-friendly interface that provides various ways of querying the database and visualizing the results. With integrated, reliable CNV data for 792 diseases, CNVD can be expected to become a convenient and efficient resource for disease researchers.

As described in this paper, CNVD has many advantages over other CNV databases; however, it can still be improved. The CNVD database will be updated at regular intervals to include new data. In addition to the existing query features, we plan to add analysis functions to CNVD. For example, an alignment tool that allows user-submitted sequences to be searched against the sequences in CNVD could help predict the risk of diseases or identify new candidate targets.

Acknowledgements

This work was supported by the Natural Science Foundation of Heilongjiang province (Grant No. D201116), the Postdoctoral science-research developmental foundation of Heilongjiang province (Grant No. LBH-Q11044) and the Innovation Research Fund for Graduate Students of Heilongjiang province (YJSCX2012).

References

  • Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT and others. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics 25:25-9.
  • Beroukhim R, Mermel CH, Porter D, Wei G, Raychaudhuri S, Donovan J, Barretina J, Boehm JS, Dobson J, Urashima M and others. 2010. The landscape of somatic copy-number alteration across human cancers. Nature 463:899-905.
  • Cao Q, Zhou M, Wang X, Meyer CA, Zhang Y, Chen Z, Li C, Liu XS. 2011. CaSNP: a database for interrogating copy number alterations of cancer genome from SNP array data. Nucleic acids research 39:D968-D974.
  • Chen FC, Chen YZ, Chuang TJ. 2009. CNVVdb: a database of copy number variations across vertebrate genomes. Bioinformatics 25:1419-21.
  • de Wilde RF, Ottenhof NA, Jansen M, Morsink FH, de Leng WW, Offerhaus GJ, Brosens LA. 2011. Analysis of LKB1 mutations and other molecular alterations in pancreatic acinar cell carcinoma. Modern pathology : an official journal of the United States and Canadian Academy of Pathology, Inc 24:1229-36.
  • Feuk L, Carson AR, Scherer SW. 2006. Structural variation in the human genome. Nature reviews. Genetics 7:85-97.
  • Freeman JL, Perry GH, Feuk L, Redon R, McCarroll SA, Altshuler DM, Aburatani H, Jones KW, Tyler-Smith C, Hurles ME and others. 2006. Copy number variation: new insights in genome diversity. Genome research 16:949-61.
  • Goidts V, Cooper DN, Armengol L, Schempp W, Conroy J, Estivill X, Nowak N, Hameister H, Kehrer-Sawatzki H. 2006. Complex patterns of copy number variation at sites of segmental duplications: an important category of structural variation in the human genome. Human genetics 120:270-84.
  • Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T and others. 2002. The Ensembl genome database project. Nucleic acids research 30:38-41.
  • Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C. 2004. Detection of large-scale variation in the human genome. Nature genetics 36:949-51.
  • Legartova S, Krejci J, Harnicarova A, Hajek R, Kozubek S, Bartova E. 2009. Nuclear topography of the 1q21 genomic region and Mcl-1 protein levels associated with pathophysiology of multiple myeloma. Neoplasma 56:404-13.
  • Linzmeier RM, Ganz T. 2005. Human defensin gene copy number polymorphisms: comprehensive analysis of independent variation in alpha- and beta-defensin regions at 8p22-p23. Genomics 86:423-30.
  • Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W and others. 2006. Global variation in copy number in the human genome. Nature 444:444-54.
  • Rodriguez-Revenga L, Mila M, Rosenberg C, Lamb A, Lee C. 2007. Structural variation in the human genome: the impact of copy number variants on clinical diagnosis. Genetics in medicine : official journal of the American College of Medical Genetics 9:600-6.
  • Stankiewicz P, Lupski JR. 2010. Structural variation in the human genome and its role in disease. Annual review of medicine 61:437-55.
  • Wain LV, Armour JA, Tobin MD. 2009. Genomic copy number variation, human health, and disease. Lancet 374:340-50.
  • Zhang J, Feuk L, Duggan GE, Khaja R, Scherer SW. 2006. Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome. Cytogenetic and genome research 115:205-14.
  • Zhao X, Li C, Paez JG, Chin K, Janne PA, Chen TH, Girard L, Minna J, Christiani D, Leo C and others. 2004. An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer research 64:3060-71.