LOVD–DASH: A comprehensive LOVD database coupled with diagnosis and an at‐risk assessment system for hemoglobinopathies

Abstract Hemoglobinopathies are the most common monogenic disorders worldwide. Substantial effort has been made to establish databases to record complete mutation spectra causing or modifying this group of diseases. We present a variant database which couples an online auxiliary diagnosis and at‐risk assessment system for hemoglobinopathies (DASH). The database was integrated into the Leiden Open Variation Database (LOVD), in which we included all reported variants focusing on a Chinese population by literature peer review‐curation and existing databases, such as HbVar and IthaGenes. In addition, comprehensive mutation data generated by high‐throughput sequencing of 2,087 hemoglobinopathy patients and 20,222 general individuals from southern China were also incorporated into the database. These sequencing data enabled us to observe disease‐causing and modifier variants responsible for hemoglobinopathies in bulk. Currently, 371 unique variants have been recorded; 265 of 371 were described as disease‐causing variants, whereas 106 were defined as modifier variants, including 34 functional variants identified by a quantitative trait association study of this high‐throughput sequencing data. Due to the availability of a comprehensive phenotype‐genotype data set, DASH has been established to automatically provide accurate suggestions on diagnosis and genetic counseling of hemoglobinopathies. LOVD‐DASH will inspire us to deal with clinical genotyping and molecular screening for other Mendelian disorders.


| INTRODUCTION
Hemoglobinopathies are the most common monogenic disorders worldwide. The major β-hemoglobinopathies, especially sickle cell disease and β-thalassemia, are lethal disorders that impose a global health burden because of their severe pathogenicity and high prevalence (Taher, Weatherall, & Cappellini, 2018; Weatherall, 2010). Prevention programs have proved effective in Mediterranean populations, especially for β-thalassemia, which fell from 1:250 live births to 1:1660 in 2009 (Cao & Kan, 2013). The birth rates of hemoglobinopathies, however, remain high. In China, an estimated 12,900 newborns are affected by hemoglobin disorders of various types each year, which in turn may cause a serious social burden (Shang et al., 2017; Xiong et al., 2010). The number of fetuses with hemoglobinopathies could be greatly reduced if a robust system were developed to help clinicians master standard guidelines and rapidly make correct clinical management choices for hemoglobinopathy patients or at-risk couples.
Thus, an accurate diagnosis of hemoglobinopathies requires proper genotyping not only of the disease-causing mutations in the globin gene clusters but also of newly identified variants in modifier genes, such as KLF1, BCL11A, and GATA1, which alter the expression of γ-globin and influence β-thalassemia severity (Bauer & Orkin, 2015; D. Liu et al., 2014; Thein et al., 2007). Substantial effort has been made to establish databases that record the global mutation spectra causing and modifying hemoglobinopathies (Giardine et al., 2011; Kountouris et al., 2014). HbVar, built by Giardine et al. (Hardison et al., 2002), has thus far been an authoritative hemoglobinopathy database for both researchers and clinicians. We present herein a comprehensive variant database of hemoglobinopathies focusing on a Chinese population, recording the details of all reported variants through literature peer review-curation and existing databases. Moreover, unpublished data from our laboratory, including all the phenotype-genotype datasets derived from high-throughput sequencing of 2,087 hemoglobinopathy patients and 20,222 general southern Chinese individuals, were also merged into the database (Shang et al., 2017). In addition, 34 novel functional variants in these genes were detected using this high-throughput approach. All the variants are classified according to American College of Medical Genetics and Genomics (ACMG) recommendations using the standard terminology: "pathogenic", "likely pathogenic", "uncertain significance", "likely benign", and "benign" (Richards et al., 2015; Table S1). Details of all the variants are integrated into the Leiden Open Variation Database, which is available at http://www.genomed.zju.edu.cn/LOVD3/genes.
An online auxiliary diagnosis and at-risk assessment system for inherited hemoglobinopathy (DASH) has also been established based on the following: (a) the integrity of the hemoglobinopathy mutation spectrum of a Chinese population; (b) the availability of a comprehensive phenotype-genotype data set corresponding to the 22,309 samples; and (c) the detailed information of variants according to the latest version of HbVar (Giardine et al., 2014).
Aiming to accomplish the molecular screening and clinical genotyping of hemoglobinopathies in a Chinese population, DASH consists of three main workflows. First, DASH infers the thalassemia trait from the input hematologic phenotype. Second, it recognizes uploaded copy number variant (CNV) and single nucleotide variant (SNV) data and interprets them with a specific hemoglobinopathy annotation library; both disease-causing and modifier variants are evaluated in a combined analysis, which ultimately leads to an overall hemoglobinopathy diagnosis. Third, the system conducts an at-risk assessment of known disease-causing mutations and reveals critical clinical information for potential offspring. A diagnostic and assessment report is generated automatically, which could provide accurate suggestions on diagnosis and genetic counseling of hemoglobinopathies. DASH is available at www.smuhemoglobinopathy.com.
In this study, we portrayed the most comprehensive mutation spectrum of hemoglobinopathies in the Chinese population. In addition, LOVD-DASH will contribute to research and clinical application and provide a new method for the treatment and prevention of hemoglobinopathies in Chinese patients. With the LOVD database and DASH system, we are one step closer to complete molecular screening and accurate clinical genotyping of hemoglobinopathies.
were categorized into disease-causing and modifiers, and would be accepted as keywords to perform literature mining on hemoglobinopathies. Currently, 371 unique variants have been recorded; 265 of the 371 unique variants were described as disease-causing variants, while 106 were defined as modifier variants (Table 1), including 34 functional variants identified by a quantitative trait association study of the high-throughput sequencing data in Plink (Figure S1; Table 2; Table S2). Because of the inevitable bias caused by haplotypes, the causative variant(s) among them cannot be identified by statistical approaches alone. For further research, the corresponding phenotypic information and genotypic data of the 2,087-patient hemoglobinopathy cohort, including the disease-causing and identified modifier variants in detail, are displayed at http://www.genomed.zju.edu.cn/LOVD3/individuals.

| Data submission
The LOVD-China database is available for public submission. The

| DASH for hemoglobinopathy inference
The hemoglobinopathy inference module provides a judgment algorithm according to the traditional routine strategy for thalassemia carrier screening (Traeger-Synodinos et al., 2015). Hematologic and biochemical tests and subsequent molecular genetic testing are required for identification (Danjou et al., 2015). Thus, both basic information and hematologic phenotype are required as the input. Basic information includes age, gender, and native place, while the hematologic phenotype includes red blood cell indices (HGB, MCH, and MCV) and the Hb pattern (Hb A2 and Hb F). A standard criterion was used for the judgment of the thalassemia trait (Figure S2). For example, a 5-year-old girl from Guangdong province of China with the following hematologic phenotype: MCH, 24 pg; MCV, 73 fL; Hb F, 3%; and Hb A2, 6% (http://www.genomed.zju.edu.cn/LOVD3/individuals/00001639) will be inferred as having the β-thalassemia trait in the output report. In addition, clinical genotyping is highly recommended for this individual. Individuals with "silent" forms of thalassemia, however, are undetectable by this screen because they have normal or borderline red cell indices and/or Hb A2 levels (Hallam et al., 2014). Moreover, it is important to note that iron deficiency, alone or coexisting with a thalassemia, can also cause microcytic hypochromic anemia, which could lead to misinterpretation.
If an individual is found to be iron-deficient, it is recommended to repeat the hematologic screen once the individual is iron-replete (Traeger-Synodinos et al., 2015).
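The screening logic described in this section can be illustrated with a minimal sketch. The cutoffs below (MCV < 80 fL, MCH < 27 pg, Hb A2 > 3.5%) are commonly quoted carrier-screening thresholds and are assumptions made for illustration; they are not necessarily the criteria of Figure S2, and the class and function names are hypothetical rather than part of DASH.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative cutoffs only (assumptions, not the criteria of Figure S2).
MCV_CUTOFF_FL = 80.0
MCH_CUTOFF_PG = 27.0
HBA2_CUTOFF_PCT = 3.5


@dataclass
class HematologicPhenotype:
    mcv_fl: float
    mch_pg: float
    hb_a2_pct: float
    hb_f_pct: float
    iron_deficient: Optional[bool] = None  # None means iron status was not tested


def infer_trait(p: HematologicPhenotype) -> str:
    """Return a coarse screening label from red cell indices and Hb A2."""
    microcytic_hypochromic = p.mcv_fl < MCV_CUTOFF_FL or p.mch_pg < MCH_CUTOFF_PG
    if not microcytic_hypochromic:
        # Normal indices do not exclude "silent" carriers.
        return "no trait inferred (silent carriers cannot be excluded)"
    if p.iron_deficient:
        # Iron deficiency can mimic or mask the trait; rescreen when iron-replete.
        return "indeterminate: repeat screen once iron-replete"
    if p.hb_a2_pct > HBA2_CUTOFF_PCT:
        return "suspected beta-thalassemia trait"
    return "microcytosis with normal Hb A2: further testing needed"


# The 5-year-old girl from the example above.
example = HematologicPhenotype(mcv_fl=73, mch_pg=24, hb_a2_pct=6, hb_f_pct=3)
print(infer_trait(example))
```

Under these assumed cutoffs, the example individual is labeled as a suspected β-thalassemia trait carrier, in line with the inference reported in the text. The Hb F value is carried in the record but not used by this simplified rule.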

| DASH for clinical genotyping
The clinical genotyping module consists of two sub-modules (hemoglobinopathy-specific annotation and genotype combinatory analysis), which are executed sequentially. Lists of hemoglobinopathy-related SNVs and CNVs in different formats are recognized as inputs and then annotated with the integrated comprehensive hemoglobinopathy-specific annotation data set. After annotation, disease-causing and modifier variants are evaluated in a combined analysis, which ultimately leads to an overall hemoglobinopathy diagnosis, especially for β-thalassemia. For example, heterozygotes for a β-variant combined with α-globin gene triplicates or quadruplicates will be reported as β-thalassemia intermedia, and compound heterozygotes or homozygotes for variants located in or destroying the zinc finger domain of the KLF1 gene will be reported as atypical thalassemia. For optimal use of this module, the input requirements are detailed on the right side of the page, and guidance on interpreting the output can be obtained in the Q&A from the homepage (http://www.smuhemoglobinopathy.com/question/#tab=1).
Note: Among the 510 β0/β0 samples from the 22,309 sequenced samples, 74 variants were shown to be significant after association analysis in Plink, as judged by the P-values after a Bonferroni correction. Of these, 34 variants were located in our candidate genes or the HMIP region. The 74 variants are available in the supplementary document. Abbreviation: SNP, single nucleotide polymorphism. a The chromosomal locations are given in GRCh37/hg19 coordinates.
ZHANG ET AL.
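The two combinatory rules given as examples in this section can be written as a small rule function. This is only a sketch under stated assumptions: the record fields, the rule order, and the fallback label are hypothetical, and the actual DASH rule set is not described in enough detail here to reproduce.

```python
from dataclasses import dataclass

NORMAL_ALPHA_COPIES = 4  # two alpha-globin genes on each chromosome 16


@dataclass
class AnnotatedGenotype:
    beta_disease_alleles: int                    # disease-causing beta-globin alleles: 0, 1, or 2
    alpha_gene_copies: int = NORMAL_ALPHA_COPIES  # total alpha-globin gene copies
    klf1_zinc_finger_alleles: int = 0             # alleles with a KLF1 zinc finger variant: 0, 1, or 2


def combined_diagnosis(g: AnnotatedGenotype) -> str:
    """Apply the two example rules from the text; the rule order is an assumption."""
    # Compound heterozygotes or homozygotes for KLF1 zinc finger variants.
    if g.klf1_zinc_finger_alleles >= 2:
        return "atypical thalassemia"
    # Beta heterozygote plus alpha-globin triplication or quadruplication.
    if g.beta_disease_alleles == 1 and g.alpha_gene_copies > NORMAL_ALPHA_COPIES:
        return "beta-thalassemia intermedia"
    # Anything else is outside the two rules sketched here.
    return "no combinatory rule matched"


print(combined_diagnosis(AnnotatedGenotype(beta_disease_alleles=1, alpha_gene_copies=5)))
```

Keeping each rule as an explicit, ordered condition makes it easy to see which variant combination produced a given diagnosis, which matters when the output is used for counseling.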

| DASH for at-risk assessment
The at-risk assessment module has been established for couples, the For each individual, clinical genotyping is performed first to obtain an overall hemoglobinopathy diagnosis; a combinatory analysis of the variants of the individual and the spouse then reports whether or not the offspring of the couple will be at risk for hemoglobin disorders. The possible at-risk genotypes and modifier variants of the offspring are flagged to assist clinicians in comprehensive genetic counseling.
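The couple-level analysis can be illustrated by enumerating offspring genotypes at a single autosomal locus under Mendelian segregation. This is a simplified sketch, not the DASH implementation: the allele labels `beta_var_1` and `beta_var_2` are hypothetical, only one locus is considered, and modifier variants are ignored.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def offspring_distribution(parent_a, parent_b):
    """Map each unordered offspring genotype to its Mendelian probability."""
    # Each parent transmits one of its two alleles with equal probability.
    counts = Counter(tuple(sorted(pair)) for pair in product(parent_a, parent_b))
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


def at_risk_probability(parent_a, parent_b, normal="N"):
    """Probability that a child inherits two disease-causing alleles."""
    dist = offspring_distribution(parent_a, parent_b)
    return sum((p for g, p in dist.items() if normal not in g), Fraction(0))


# Two carriers of different (hypothetical) disease-causing beta-globin variants.
mother = ("N", "beta_var_1")
father = ("N", "beta_var_2")
print(at_risk_probability(mother, father))
```

For two heterozygous carriers, one in four possible offspring genotypes carries two disease-causing alleles; if only one partner is a carrier, the function returns zero for this locus.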

| DISCUSSION
The LOVD-China database, which was first built by Zhejiang University as part of the International Human Variome Project, has properly managed and stored thousands of phenotype-genotype datasets from China in strict accordance with the regulation of
It has always been difficult to accurately determine the morbidity of hemoglobinopathies, mainly because of complicated environmental factors, medical conditions, and individual differences.
Besides, as mentioned before, modifier genes have a significant impact on the severity of hemoglobinopathies. Combined analysis of disease-causing and modifier variants is considered to be the key factor underlying accurate clinical genotyping (Danjou et al., 2015). In the era of next-generation sequencing, a large number of variants and polymorphisms have been identified in hemoglobinopathy-relevant genes, especially in modifier genes and regions such as KLF1, BCL11A, and HMIP (Basak & Sankaran, 2016; Orkin, 2016). More and more variants in these genes have been shown to have clinical effects, while their contributions to clinical severity have shown great ethnic specificity. The reawakening of fetal hemoglobin based on these variants holds promise for new therapies for β-hemoglobinopathies (Bauer, Kamran, & Orkin, 2012). Our group has accumulated various kinds of samples representing different combinations of disease-causing variants and modifiers. For example, the KLF1 gene plays an important role in alleviating the clinical severity of β-thalassemia (D. Liu et al., 2014; Tepakhan et al., 2016). As the cases in Table 3 show, patients with the β0/β0 genotype, which may result in thalassemia major, turn out to have thalassemia intermedia when they carry functional variants in KLF1. Alpha multi-copies are also considered to be modifier variants. The heterozygote of β-thalassemia is asymptomatic, whereas the heterozygote of β-thalassemia combined with alpha multi-copies results in a thalassemia intermedia phenotype (Mettananda, Gibbons, & Higgs, 2015; Table 3 and Table S3). For example, compound heterozygotes of variants located in the zinc finger of the KLF1 gene may lead to microcytic hypochromic anemia (Huang et al., 2015; Perkins et al., 2016). At-risk assessment is highly recommended for the carriers of these variants.
Here, we used hemoglobinopathies as a model to establish the LOVD-China variant database and DASH system because hemoglobinopathies are the most common monogenic diseases worldwide and are associated with multiple mutations in disease-causing genes, as well as in modifier genes. LOVD-China with the DASH system is the first automatic auxiliary diagnosis platform for hemoglobinopathies and thus provides a standard platform for screening, diagnosis, and prevention of hemoglobinopathies. Both websites will be updated and curated as more data are produced by molecular screening and traditional diagnostic approaches and submitted by clinicians. We hope that LOVD-DASH will serve as a paradigm for the online auxiliary diagnosis of genetic disorders and may be an inspiration for other genetic disorders.

ACKNOWLEDGMENTS
The authors thank Dr. Giardine