The technology for mutation screening has developed rapidly, and massively parallel sequencing is now available through next-generation sequencing applications [Voelkerding et al., 2009]. With this technology, much research has aimed at discovering causative mutations in genetic diseases, and the number of identified disease-causing mutations has increased dramatically. Although human gene mutations and variations differ by disease phenotype and ethnicity, very little organized information about these characteristics is publicly available. Clinicians and researchers typically have to search multiple resources, such as publications and databases, to determine whether a mutation they have discovered has been characterized previously. Ideally, they need instant access to all the information regarding a mutation in their gene or locus of interest to conduct their research efficiently and to bring “genetic healthcare” to the highest standards [Cotton et al., 2007]. Up-to-date and accurate information regarding mutations is important in the diagnosis of conditions that affect human health. To assist clinicians and genetics researchers, a systematic collection of human mutations is required.
A mutation database is a repository in which allelic variations are described and assigned to a specific gene locus [Knoppers and Laberge, 2000]. The major mutation databases are the National Center for Biotechnology Information (NCBI) Online Mendelian Inheritance in Man (OMIM), the Human Gene Mutation Database, and the database of the Human Genome Variation Society (HGVS). Locus-specific databases are linked to these Websites to consolidate individual mutation data [Horaitis et al., 2007]. The mutation spectra of genes differ by disease type and are also well known to depend on ethnicity [Rosenberg et al., 2002]. The existence of population-specific mutations for single-gene disorders has been well documented, and there is also good evidence for ethnic differences in the frequencies of genetic variations. Thus, country-specific databases have recently been developed. The HGVS Website has links to several national and ethnic mutation databases, such as the Singapore Human Mutation and Polymorphism Database [Tan et al., 2006]. In the United Arab Emirates, a countrywide mutation cataloging process has also been reported, along with a discussion of the value of nationwide mutation reporting [Al-Gazali and Ali, 2010]. In Korea, the goal has been to build a publicly available database that consolidates ethnicity-specific mutation data through an automated pathway requiring minimal effort.
The Korean Mutation Database (KMD) was created to organize, catalog, store, and distribute mutations found in Koreans in disease-related genes. The database is freely accessible at http://kmd.cdc.go.kr. KMD is a repository in which individual researchers can electronically register newly discovered mutation information, making it immediately accessible to others. This will ensure adequate curation of human variation knowledge from a country-specific perspective, with a view to improving accuracy, reducing errors, and developing a comprehensive data set that ultimately comprises all human genes. From an individual country's perspective, country-specific databases are expected to assist in the delivery of improved genetic healthcare. We aim to provide organized and accurate information to genetic disease researchers, thereby facilitating research in this field and the development of diagnostic markers and possible therapeutics.
We collected mutation data primarily via two routes. One involved searching published journal articles and the other was through data collections from the genetic diagnostic laboratories. Data from journal articles, identified by searching in PubMed and Korean local journals and confirmed by geneticists, were compiled and assembled into KMD. We used relevant articles dated from 1989 to 2009 as a reference in our database and articles were obtained from 299 different journals. We also collected unpublished data directly from the genetic diagnostic laboratories, which are affiliated with major genetics clinics in Korea. Because some of the laboratory data are being researched for publications, the submitter could postpone the release of mutation data with a “data hold.” Among 4,143 individual patients, 1,283 patients (31%) and 1,026 patients (25%) were registered from published journal articles through PubMed and Korean local journals, respectively. Among the unpublished data for 1,834 patients, data for 419 patients (10%) were openly available and data for 1,415 patients (34%) were on hold.
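The reported proportions can be checked directly from the patient counts. The following is a minimal sketch of that arithmetic; the dictionary keys and the helper name `share` are illustrative labels, not identifiers used by KMD.

```python
# Patient counts by data source, as reported in the text above.
patients_by_source = {
    "PubMed articles": 1283,
    "Korean local journals": 1026,
    "Unpublished, openly available": 419,
    "Unpublished, on hold": 1415,
}

# Total number of individual patients across all sources.
total = sum(patients_by_source.values())


def share(count, total):
    """Return count as a whole-number percentage of total."""
    return round(100 * count / total)


shares = {source: share(n, total) for source, n in patients_by_source.items()}

for source, n in patients_by_source.items():
    print(f"{source}: {n} ({shares[source]}%)")
print(f"Total: {total}")
```

The four counts sum to the 4,143 patients stated in the text, and the rounded shares reproduce the reported 31%, 25%, 10%, and 34%.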
Mutation data were collected and sorted by gene transcript. To date, KMD contains 1,654 mutations and 4,143 individual patients from 245 genes (Table 1). For each gene, KMD provides basic information, the mutation spectrum in Koreans, variations from the KMD project and the Single Nucleotide Polymorphism Database (dbSNP), and protein motif information. Basic information includes the RefSeq ID, gene name, protein ID, cytogenetic map location (cytomap), and related disease information. OMIM numbers for the genes and diseases are also provided. This information is displayed graphically and in tables. Mutation information includes mutation type, mutation name, location, and references. Each mutation is named according to the nomenclature format approved by the HGVS. For each mutation, a separate view presents associated information, including DNA type, typing method, typing range, gender, ethnicity, heredity pattern, related disease, and published journal articles. The KMD project also provides single-nucleotide polymorphism (SNP) data for comparison with the registered mutations of each gene. SNPs were identified in 96 healthy Koreans by sequencing coding regions. Polymerase chain reaction primers were designed to amplify exons and flanking intronic regions. At present, data for 56 SNPs from 12 genes (ATP7B, HBB, MPZ, ASS1, VHL, GJB1, PARK2, PTPN11, SPAST, STK11, MECP2, and FBN1) registered in KMD are included in the database and linked to dbSNP. For genes with SNP data, these data are displayed with graphics and a table when a user searches for a mutation. In the future, we plan to make whole-exome sequence data available. Protein information in KMD includes protein domains and motifs, and KMD is connected to the UniProt and InterPro databases.
Table 1. Database Contents
Number of entries
Composition and Availability of Database
The composition of the KMD Website is shown in Figure 1. It consists of a user page for search and registration and an administrator page. A user can search registered mutation information without logging in. To register new mutation and individual patient data, a user must log in with certification. After logging in, the user can delete or modify his or her own registered data. An administrator must also log in with certification. The administrator has authority over the registration of researchers and institutes and can approve new mutation and individual patient data. There are three types of user status, each with a different level of access to the database (Table 2).
Table 2. User Access Levels in KMD
User types: user for search; user for registration
Capabilities listed: access to registered mutation data; access to statistics of data; registration and login required; insert and edit mutation data
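The access rules described above can be sketched as a small role-to-permission mapping. This is an illustration only: the action names (`search`, `register`, `edit_own`, `approve`, `manage_accounts`) are hypothetical, the mapping follows the prose rather than KMD's actual implementation, and access to data statistics is left out because the text does not say which user types have it.

```python
from enum import Enum


class Role(Enum):
    SEARCH_USER = "user for search"
    REGISTERED_USER = "user for registration"
    ADMINISTRATOR = "administrator"


# Hypothetical action names; the assignment of actions to roles is an
# assumption based on the description in the text.
PERMISSIONS = {
    Role.SEARCH_USER: {"search"},                              # no login needed
    Role.REGISTERED_USER: {"search", "register", "edit_own"},  # login required
    Role.ADMINISTRATOR: {"search", "approve", "manage_accounts"},
}


def is_allowed(role, action, *, owns_record=True):
    """Return True if the role may perform the action.

    Editing or deleting is limited to records the user registered.
    """
    if action == "edit_own" and not owns_record:
        return False
    return action in PERMISSIONS[role]


print(is_allowed(Role.SEARCH_USER, "search"))
print(is_allowed(Role.SEARCH_USER, "register"))
print(is_allowed(Role.ADMINISTRATOR, "approve"))
```

Keeping the rules in a single table makes it easy to see that a newly registered record becomes public only after an administrator performs the `approve` action.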
The homepage of KMD is shown in Figure 2. Genes and diseases are ordered according to the first letter of their name. All genes and diseases under each letter of the alphabet can be easily shown by clicking on that letter. It is also possible to search genes and diseases by OMIM ID, mutation name, and reference journal.
Mutation Searches at KMD
When searching for a gene or disease by name, the user first finds the transcript information for that gene. The exact gene name and related disease information, including the RefSeq ID, are shown in the top table and linked to the NCBI Website. Mutation data are displayed with graphics and tables, and mutation types are distinguished by color for easy comparison. Figure 3 shows the result of a gene symbol search for mutations in ATP7B, which is related to Wilson's disease. ATP7B has two transcripts, and the user can select the transcript of interest. In the graphics, mutations are placed according to their location in the coding region. Protein domains and motifs are also placed according to their location. Clicking on a location reveals more detailed information. For mutations, the mutation ID is provided at each position, and the length of the bar indicates the number of registered patients. The mutation table lists related information such as mutation type, mutation name, location, and references. The location and type of each mutation are also shown against the transcript sequence. Individual patient information is also available: clicking on an individual ID shows details such as hereditary pattern and typing method (Fig. 3).
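The relationship between registered patients and bar length in the mutation graphic can be sketched as a simple aggregation. The mutation IDs, positions, and patient registrations below are invented for illustration and are not KMD data; `bar_chart_rows` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical mutations of one transcript: ID and coding position
# (invented values, not taken from KMD).
mutations = [
    {"id": "MUT-B", "position": 2100},
    {"id": "MUT-A", "position": 350},
    {"id": "MUT-C", "position": 3900},
]

# Hypothetical patient registrations, each pointing at one mutation ID.
patient_mutation_ids = ["MUT-A", "MUT-B", "MUT-B", "MUT-B", "MUT-C", "MUT-A"]


def bar_chart_rows(mutations, patient_mutation_ids):
    """Order mutations by coding position and attach the patient count."""
    counts = Counter(patient_mutation_ids)
    ordered = sorted(mutations, key=lambda m: m["position"])
    return [(m["position"], m["id"], counts[m["id"]]) for m in ordered]


rows = bar_chart_rows(mutations, patient_mutation_ids)
for position, mutation_id, n_patients in rows:
    # One '#' per registered patient stands in for the bar length.
    print(f"{position:>6}  {mutation_id}  {'#' * n_patients}")
```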
Mutation Registering at KMD
To register a new mutation and individual information, a user needs to log in. Registration is divided into two steps, first the mutation data and then the individual phenotype information, to reduce the burden and avoid errors due to repeated data insertion. The user first searches for the gene or disease related to the information he or she is about to submit. If the mutation has not been reported previously, the user can start registering the mutation data from the transcript information page by clicking on the “add mutation” tab in the mutation table. The mutation name must follow the HGVS format to be registered. Other annotation information, such as exon and intron number, mutation position, and mutation type, is extracted automatically from the database annotation or from the HGVS name itself (Fig. 4). This reduces the number of items that need to be entered, which facilitates mutation data registration and reduces error rates. The registered mutation is displayed publicly after an administrator has granted permission. By clicking on the mutation ID in the mutation table, the user can add new individual information for the eight mandatory items: DNA type, typing method, typing range, gender, ethnicity, heredity pattern, disease, and published (yes/no). On this page, the user may also add optional information such as age of onset and phenotype details (Fig. 4).
Guidelines for registering mutation data and HGVS examples are presented on the help page in KMD.
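The automatic extraction of annotation from an HGVS name can be illustrated with a simplified parser. This is a sketch under stated assumptions, not KMD's implementation: it handles only a small subset of HGVS coding DNA names (substitutions, deletions, duplications, insertions, and deletion-insertions), the example names are illustrative, and the function names and field keys (`parse_hgvs_cdna`, `missing_items`, `dna_type`, and so on) are hypothetical.

```python
import re

# A coding DNA position: optional '-' (5'-UTR) or '*' (3'-UTR) prefix,
# digits, and an optional intronic offset such as '+1' or '-2'.
_POS = r"[-*]?\d+(?:[+-]\d+)?"
_HGVS_CDNA = re.compile(
    rf"^c\.(?P<start>{_POS})(?:_(?P<end>{_POS}))?"
    r"(?P<change>[ACGT]>[ACGT]|delins[ACGT]+|del[ACGT]*|dup[ACGT]*|ins[ACGT]+)$"
)


def parse_hgvs_cdna(name):
    """Derive mutation type, position, and region from a simple HGVS name."""
    match = _HGVS_CDNA.match(name)
    if match is None:
        raise ValueError(f"not a supported HGVS coding DNA name: {name!r}")
    change, start = match["change"], match["start"]
    if ">" in change:
        mutation_type = "substitution"
    elif change.startswith("delins"):
        mutation_type = "deletion-insertion"
    elif change.startswith("del"):
        mutation_type = "deletion"
    elif change.startswith("dup"):
        mutation_type = "duplication"
    else:
        mutation_type = "insertion"
    if start.startswith("-"):
        region = "5'-UTR"
    elif start.startswith("*"):
        region = "3'-UTR"
    elif "+" in start or "-" in start:
        region = "intron"  # offset relative to an exon boundary
    else:
        region = "coding"
    return {"type": mutation_type, "start": start,
            "end": match["end"] or start, "region": region}


# The eight mandatory items for individual information (hypothetical keys).
MANDATORY_ITEMS = ("dna_type", "typing_method", "typing_range", "gender",
                   "ethnicity", "heredity_pattern", "disease", "published")


def missing_items(record):
    """List mandatory items that are absent or empty in a submission."""
    return [key for key in MANDATORY_ITEMS if record.get(key) in (None, "")]


print(parse_hgvs_cdna("c.2333G>T"))
print(parse_hgvs_cdna("c.76_78del"))
print(missing_items({"dna_type": "genomic", "published": False}))
```

Deriving the type and position from the name means the submitter enters the mutation once, which is the error-reduction argument made above. A production system would rely on a full HGVS parser and on transcript annotation for exon and intron numbers.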
Characteristics of Mutation Data in KMD
Statistical information is available about the Korean mutation data registered in KMD. Figure 5 shows the distribution according to mutation type and location. Most are single-base substitutions in coding regions (Fig. 5). There are also some deletions, insertions, duplications, and simple sequence repeats. Among the 1,654 mutations registered in KMD, 1,127 mutations are substitutions and the rest are deletions (n = 285), insertions (n = 70), duplications (n = 27), simple sequence repeats (n = 129), and others (n = 16). By location, 1,397 mutations are located in coding regions and the rest are located in the 5′-untranslated region (UTR) (n = 21), 3′-UTR (n = 54), splicing region (n = 130), and others (n = 51).
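The breakdown by mutation type can be expressed as percentages of the registered total. The following is a minimal sketch using the counts quoted above; the dictionary keys and the helper name `percentages` are illustrative.

```python
# Mutation counts by type, as reported in the text above.
mutations_by_type = {
    "substitution": 1127,
    "deletion": 285,
    "insertion": 70,
    "duplication": 27,
    "simple sequence repeat": 129,
    "other": 16,
}


def percentages(counts):
    """Return each count as a percentage of the sum, to one decimal place."""
    total = sum(counts.values())
    return {key: round(100 * value / total, 1) for key, value in counts.items()}


type_total = sum(mutations_by_type.values())
type_percent = percentages(mutations_by_type)

print(f"Total mutations: {type_total}")
for mutation_type, pct in type_percent.items():
    print(f"{mutation_type}: {pct}%")
```

The type counts sum to the 1,654 registered mutations, and substitutions account for roughly two-thirds of them.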
Table 3 contains information about the most common genes and diseases registered in KMD. The gene with the highest number of mutations in KMD is the NF1 gene, with a total of 77 mutations, followed by the FBN1 gene with 53, the PAH gene with 51, the ATP7B gene with 50, and the DMPK gene with 35 mutations. In our database, neurofibromatosis type 1 (NF1; MIM #162200) has the largest number of mutations. NF1 has been reported to be one of the most common autosomal dominant disorders, affecting all ethnic groups [Friedman, 1999]. Our data confirm that mutations in the NF1 gene differ among patients and that only a few mutations are common in the Korean population. Other common genetic diseases are Wilson's disease (MIM #277900), Marfan syndrome (MIM #154700), phenylketonuria (MIM #261600), and myotonic dystrophy (MIM #160900).
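Table 3. Information About Korean Mutation Data Registered in KMD
Top five genes (number of mutations): NF1 (77), FBN1 (53), PAH (51), ATP7B (50), DMPK (35)
Top five diseases: neurofibromatosis type 1, Wilson's disease, Marfan syndrome, phenylketonuria, myotonic dystrophy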
We have developed a systematic online database of mutations in disease-related genes discovered in Koreans. It provides updated information about Korean mutations and reference sequence data, and it will be a useful resource in the era of functional genomics. We are willing to provide the software information to any institution or organization, provided that it is used noncommercially.
The most important aspect of a mutation database is the continuous curation of registered data. KMD maintains high standards of data security in collection and curation. Because this database has been constructed by the Korea National Institute of Health, we have received funding for the project from the Korean government and expect this support to continue. We also have collaborators in diagnostic and genetics research who collect country-specific mutation data, and we plan to add these data to our system continuously.
KMD provides a convenient approach to mutation searches and registration. It contains guidelines for accurate mutation registration and variant information. It also provides reference information about mutations through comparison with KMD SNPs. KMD is connected to other databases such as the database of Korean rare disease information (http://helpline.cdc.go.kr), dbSNP, OMIM, UniProt, and InterPro. We plan to further expand the database system and content. Additionally, we will work with the Human Variome Project [Cotton et al., 2008] as an international collaboration. We expect KMD to prove valuable for improving research in genetic diseases, the development of diagnostics, and therapeutic optimization.
We thank Hwanseok Rhee, Gu-Hwan Kim, Seung-Tae Lee, and Chul-Ho Lee for data collection, and Dong Hoon Oh, You-Bok Cho, and Ki-Seok Yoon for DB construction.