Prioritization and comprehensive analysis of genes related to major depressive disorder

Abstract Background Major depressive disorder (MDD) is a serious mental health problem in modern society, which is difficult to identify and diagnose in the early stages. Despite strong evidence supporting the heritability of MDD, progresses in large‐scale and individual genetic studies remain preliminary. Methods In this study, a multi‐data source‐based prioritization (MDSP) method was proposed, and an appropriate threshold was determined for the optimization of depression‐related genes (DEPgenes). Analyses on Gene Ontology biological processes, KEGG pathway and the specific pathway crosstalk network were further proposed. Results A total of 143 DEPgenes were identified and the MDD‐specific network was constructed for the pathogenesis investigation and therapeutic methods development of MDD. Comparing with existing research strategies, the genetic optimization and analysis results were confirmed to be reliable. Finally, the pathway enrichment and crosstalk analyses revealed two unique pathway interaction modules that were significantly enriched with MDD genes. The related core pathways of neuroactive ligand‐receptor interaction and dopaminergic synapse supported the neuropathology hypothesis of MDD. And the pathways of serotonergic synapse and morphine addiction indicated the mechanism of drug addiction caused by serotonin used in the treatment. Conclusions This work provided a reference for the study of MDD, although future validation by extensive experimentation is still required.

association studies seemed to be unsuccessful with the lack of distinct understanding of heterogeneity and absence of a biological gold-standard diagnosis (Krystal & State, 2014). Nowadays, strong shreds of heritability evidence of mental diseases have been revealed (Alnaes et al., 2018;Pain et al., 2018), which attracted the studies on the generation of numerous genetic and genomic datasets in MDD studies.
During the past decade, rapid advances in high throughput technologies have helped investigators, aiming to uncover disease causal genes and their actions in complex diseases. Specifically, in psychiatric genetics, there have been numerous datasets from different platforms or sources such as association studies, including genome-wide association studies, genome-wide linkage scans, microarray gene expression, and copy number variation (Michaelson, 2017). Large-scale and individual genetic studies revealed various polymorphisms and overexpression of certain genes in patients presenting with depressive symptoms (Lacerda-Pinheiro et al., 2014;Milanesi et al., 2015). Zhang's group has found that, increased 5-HT1A expression inversely correlated with 5-HT activity via a negative feedback mechanism (Zhang et al., 2014). Moreover, HPA axis hyperactivity was reported as a trigger of MDD due to findings of GR and mineralocorticoid receptor dysfunction in depressed patients (Pariante & Lightman, 2008). However, a pervasive limitation in the existing research is the inherent heterogeneity in MDD studies, which impacts the validity of biomarker data (Young et al., 2016). Thus it is still necessary to simplify these depressionrelated candidate genes to an optimal set for the subsequent biological experiments. Moreover, the incompletion of information resources used in existing calculation and the fixed screening threshold of corresponding online tools also result in arbitrarily preferred results and lower reliability.
In this study, gene information from multiple sources (including OMIM, Phenolyzer, GeneCards and GLAD4U) were integrated and analyzed for MDD. A multi-data-source based prioritization (MDSP) was proposed and an appropriate threshold was determined for the optimization of depression-related genes (DEPgenes). Finally, the acquired genes which were significantly related to depression (DEPgenes) were verified by the receiver operating characteristic (ROC) curve and functional and pathway enrichment analysis. Our work demonstrated a practical framework for complex disease candidate gene analysis, which is of great significance for the comprehensive functional assessment of optimized pathogenic genes.

| MDD candidate genes and optimizing
process OMIM (www.omim.org), which provides vast repositories of rich clinical and genetic knowledge, was considered as a core gene database in this study. For association studies, the susceptibility genes were retrieved by searching all human genetic association studies deposited in Phenolyzer (phenolyzer. usc.edu), GeneCards (www.genecards.org) and GLAD4U (bioinfo.vanderbilt.edu/glad4u), which used as training gene categories. However, the background information of the dataset-related patients is not provided in the database. For all the genes collected, genes presented in a certain training category were assigned a score of 1 point; otherwise, 0 was assigned. Thus, a gene could be represented by a vector of three elements, with each element being 1 or 0. When a gene showed up in all the training categories, all the elements in the vector would be 1's; on the other hand, a gene had at least one element being 1. For each training category, a weight was assigned to measure the category's reliability. A combined score derived from the category-specific weight and gene score in the corresponding category was adopted to measure the correlation between a gene and the phenotype. All the candidate genes were ranked by their combined scores computed from their scores corresponding to the categories and the optimal weights. The combined scores were calculated by equation 1: where i was the training category index, N = 3, W i was the corresponding weight of category i , and Score i and was equal to 1 when a gene showed up in category i ; otherwise, Score i =0.
The combined score of a gene depends on its score from each training category and the corresponding weight value. In order to prioritize the genes collected so that the genes more likely correlated with MDD can be ranked higher in the list, a suitable weight for each training category needs to be determined. In this study, the following procedure was adopted: 1. Randomly selecting weight value between 0 and 1.0 for each training category and normalizing the weight matrix (consisted of the three weights) to have a sum of 1; 2. Calculating the combined score S for all genes by equation 1 and ranking all genes according to their combined scores; 3. Calculating ratio R: calculating the proportion k of core genes known to be related to MDD selected from OMIM in the top 3% of all candidate genes and R = k/23; 4. Reallocate values into the weight matrix and keeping the weight matrix to have a sum of 1. 5. Calculating ratio R after obtaining the new score S and ranking of all candidate genes; 6. Repeating steps 2-5 until no larger R can be found, and then the weight matrix obtained is the optimal weight matrix. (1)

| Evaluation of genetic optimizing results
The ROC curve was employed to assess the discrimination capability of the classifiers proposed in this study. ROC curves represent the performance of a classifier without taking into consideration class distribution or error overheads. And the classification success is then calculated by area under ROC curve (AUC) (Wray, Yang, Goddard, & Visscher, 2010). When the ROC curve deviated from the diagonal, i.e. the AUC value was close to 1, the verified method was evaluated as better reliability.

| Functional and pathway enrichment tests
The relation of the prioritized genes with MDD was evaluated by analyzing the Gene Ontology (GO) biological processes or biochemical pathways enriched in these genes. The Database for Annotation, Visualization, Integration and Discovery (DAVID, david-d.ncifcrf.gov) was used for GO term enrichment analysis, followed by the correction of multiple testing using the Benjamini & Hochberg (BH) method. And the biological processes (BP) term was considered as significantly enriched with a cutoff of PBH < 0.01. In addition, KEGG pathway analysis was performed by WebGestalt online tool (www.webgestalt.org) (Wang, Vasaikar, Shi, Greer, & Zhang, 2017) and PBH < 0.05 was set as the cutoff criterion.

| Pathway crosstalk
The pathway crosstalk analysis was performed to further investigate the interactions of significantly enriched pathways of optimized MDD-related genes. Two pathways are considered to crosstalk if they share a proportion of DEPgenes. Two measurements were introduced to computationally indicate the overlap of a pair of pathways: Overlap coefficient (OC) = �A

| Depression-specific network and cluster analysis by Cytoscape
To construct a depression-specific network, the DEPgenes were imported into the STRING (string-db.org). The information on gene interaction was extracted and used to form a specific network. Module cluster analysis of the depressionspecific network was performed using the MCODE plug-in in Cytoscape. Besides, to verify the nonrandomness of the obtained depression-specific network, the following verification steps were performed: 1. Random network generation: generating 1,000 random networks which had the same node and interaction numbers as the depression-specific network using Erdos-Renyi model in an igraph package of R software; 2. Calculating the average shortest path distance (SPD) and average clustering coefficient (CC) of all the random networks, respectively. 3. Statistics: Calculating the number of the random networks that have shorter SPD than MDD-specific network and the number of random network that have higher CC than MDD-specific network, which denoted as ND and NC, respectively. 4. Calculating the experience p-value: PD = ND/1,000 and PC = NC/1,000, which should reflect the significance of nonrandomness of MDD-specific network.

| Collection of MDD candidate and core genes
A total of 23 genes were collected from OMIM (Table 1), which were regarded as core genes. Besides, 14,144 genes from Phenolyzer, 5,358 genes from GeneCards and 149 genes from GLAD4U were collected regarded as MDD candidate genes. These genes were collected from multi-source, and each gene is showed up in a certain source in Figure 1a. MDSP was proposed and an appropriate threshold was determined for the optimization of MDD candidate genes. As the optimization algorithm flow chart of MDD candidate genes shown in Figure 1b, when a gene shows up in a certain training category, a score of 1 point is assigned; otherwise, 0 is assigned. Each of the four categories has a weight value, which is determined by the optimization algorithm as described in the "Material and Methods" section. The genes are ranked by their combined scores computed from scores of three training categories and their weights. Genes are ranked and prioritized by their combined scores, and further analysis is performed for the selected genes.

| Optimization and evaluation of MDD candidate genes
The combined scores of all candidate genes were calculated based on the optimal weight matrix and the candidate gene score in each source. The MDD candidate genes were ranked according to the combined scores. The gene list and the combined scores distribution of core genes and all candidate genes optimized by our process are shown in Figure 2a. Most of the core genes with higher combined scores appeared in front of the sorted list, and only several appeared in the posterior position, indicating that the distribution of the candidate genes' combined scores was in line with our expectations. From Figure 2b, it was inferred that, the score drops quickly from 1.0 to about 0.848 and then drops to about 0.604; after that, the combined scores decrease slowly. Such a distribution indicated that a relatively small number of genes have higher combined scores, while the majority of genes has moderate or small scores. With a threshold of 0.848, 65.2% of the core genes (15/23) were contained. Although with a threshold of 0.604, 95.7% of the core genes (22/23) could be contained, the number of selected candidate genes would also dramatically increase to 4,105. As the smaller the comprehensive score was, the higher the false positive rate of the prioritized gene was, 143 DEPgenes were identified with a threshold of 0.848 (Table S1).
Finally, the reliability of our method for prioritizing MDD candidate genes was compared with Phenolyzer, GeneCards and GLAD4U through ROC curve. As a result, AUC of MDSP (0.944) is the largest followed by GeneCards (0.893) and Phenolyzer (0.888), and GLAD4U had the smallest AUC value (0.490), which indicated that the results of the MDSP optimization were the best.

| GO enrichment analysis
To explore specific functional features of the 143 DEPgenes, GO enrichment analysis was performed using DAVID.
F I G U R E 2 Optimization and evaluation of MDD candidate genes. (a) Distribution of the combined scores of all candidate genes and the core genes. The percentage of each histogram bin is measured by the genes with scores falling in the bin divided by the total number of candidate genes or the number of the core genes; (b) The distribution of the combined scores of the candidate genes. The genes are ranked by their combined scores. The x-axis is the order of the candidate genes. The y-axis on the left side is the combined score of the candidate genes, and the y-axis on the right side is the number of core genes with higher combined score. (c) ROC curve of different prioritization tools. MDD: major depressive disorder; ROC: receiver operating characteristic Seventy-two biological processes (BP terms) which related to synaptic transmission, neurodevelopment and drug reaction were significantly enriched in DEPgenes (Table 2). The GO terms related to synaptic transmission included synaptic transmission, regulation of synaptic transmission, positive regulation of synaptic transmission and negative regulation of synaptic transmission. The GO terms related to nerve signal transduction included second-messenger-mediated signaling, regulation of transmission of nerve impulse, cell surface receptor linked signal transduction, G-protein coupled receptor protein signaling pathway and glutamate signaling pathway. The GO terms related to neurotransmitter, such as regulation of neurotransmitter levels, regulation of neurotransmitter transport, regulation of neurotransmitter uptake, regulation of catecholamine secretion, regulation of dopamine secretion and regulation of glutamate secretion, while that related to drug reaction (response to tropane, response to cocaine, response to amphetamine and response to histamine) and learning or memory were also significantly enriched.

| Crosstalk among significantly enriched pathways
Since abundant genes and pathways seemed to be involved in MDD, a pathway crosstalk analysis was performed to deeply investigate the relationship between the pathways. As shown in Figure 3a, 16 significantly enriched pathways were identified, including nervous system pathways, such as Dopaminergic synapse, serotonergic synapse, glutamatergic synapse, retrograde endocannabinoid signaling and GABAergic synapse. Besides, the pathways related to drug addiction (cocaine addiction, amphetamine addiction, nicotine addiction, alcoholism and morphine addiction), signal transduction (cAMP signaling pathway, taste transduction and calcium signaling pathway) were enriched. Interestingly, the environmental adaptation processes (circadian entrainment and circadian rhythm) were also involved in the DEPgenes' pathways. In Figure 3b, it was clear that the significantly enriched pathways were clustered into a module which was relevant to the pathogenesis of neurological diseases.

| MDD-specific networks
The information on gene interaction was extracted from the STRING database and used to form a specific network (Figure 4a). To test nonrandomness of the MDD-specific network, we generated 1,000 random networks with same node and edge number with MDD-specific network and compared their SPD and CC. As a result, the average SPD of these random networks was 3.4, which was significantly larger than that of the MDD-specific network with an SPD of 2.5, PD < 0.001. Meanwhile, the CC of random networks was 0.1, which was significantly smaller than that of the MDD-specific networks with a CC of 0.5 (PC < 0.001). So, the nonrandomness of the MDD-specific network could be inferred. Furthermore, two modules were identified by the modular cluster analysis of MDD-specific networks ( Figure  4b,c). KEGG pathway analysis of genes contained in Figure  4b indicated significantly enriched pathways of neuroactive ligand-receptor interaction, dopaminergic synapse and morphine addiction. For genes contained in Figure 4c, the serotonergic synapse was the most significantly enriched pathway.

| DISCUSSION
Drug therapy is still the preferred current clinical treatment for MDD. The most widely used antidepressant drugs are selective serotonin reuptake inhibitors (SSRIs), including fluoxetine, citalopram, and sertraline, which can significantly improve cognitive function of MDD patients (Jakubovski, Varigonda, Freemantle, Taylor, & Bloch, 2016). However, current antidepressant drugs used clinically bring lots of adverse reactions, such as xerostomia, constipation, drowsiness, obesity, cardiotoxicity, and drug withdrawal (Fava, Gatti, Belaise, Guidi, & Offidani, 2015;Hieronymus, Emilsson, Nilsson, & Eriksson, 2016). The lack of approaches on early identification and intervention of MDD patients limits the establishment of safe and effective individualized treatment (Duman, Aghajanian, Sanacora, & Krystal, 2016). Although numerous reports of susceptibility genes or loci to MDD have been reported previously, no disease causal genes and therapeutic target genes were confirmed (Rao et al., 2016). Thus, it is important to reduce the data noise and prioritize candidate genes from multiple datasets and then explore their functional relationships for further validation (Jia, Kao, Kuo, & Zhao, 2011).
In this study, we presented a complete process to collect large-scale genotypic data on MDD from different sources, and provided optimization and comprehensive analyses for the exploration of the pathogenesis and treatment of depression. Twenty-three DEPgenes from OMIM, 14,144 DEPgenes from Phenolyzer, 5,358 DEPgenes from GeneCards and 149 DEPgenes from GLAD4U were collected and optimized for further analyzation. MDSP was proposed and an appropriate threshold was determined for the optimization of MDD-related genes. One hundred and forty-three DEPgenes were identified and used for additional functional and pathway enrichment analyses. Most of these genes, such as PCDH9, MDD1, MDD2, CREB1 and DISC1, have been identified to be associated with MDD (Cacabelos, Torrellas, & Fernandez-Novoa, 2016;Xiao et al., 2018), and some of them (e.g. TPH1, GRIN2B and MAOA) were also this process included gene adjacency, gene fusion, phylogenetic profiles, and gene co-expression based on chip data. A comprehensive score was calculated with the weight matrix of these different methods determined by a scoring mechanism demonstrated above. Finally, the core pathways involved in MDD were shown in the module. The pathways of neuroactive ligand-receptor interaction, dopaminergic synapse and morphine addiction are presented in Figure  4b. And as shown in Figure 4c, the serotonergic synapse seemed to be higher specificity than other pathways. From these results, we inferred that the drug addiction caused by serotonin used in the treatment of MDD might relate to the mechanism of morphine addiction.
The main problems that limit the development of a reliably viable MDD biomarker are the heterogeneity of depressive disorder pathophysiology, etiology, and study designs, which may bring in conflicting data. In this study, a systems biology framework for the genetic information collection, advanced function and pathway analyses for MDD was developed. A total of 143 DEPgenes were identified and the MDD-specific network was constructed for the pathogenesis investigation and therapeutic methods development of MDD. Comparing with existing research strategies, the genetic optimization and analysis results were confirmed to be reliable. As most studies collected data from small samples sizes often consisting of fewer than 100 subjects, this study would F I G U R E 3 KEGG pathway enrichment analysis of DEPgenes. (a) Significantly enriched KEGG pathways of DEPgenes. The abscissa GeneRatio was the ratio of DEPgenes mapped to a KEGG pathway to the total number of genes in the pathway; (b) Visual crosstalk of KEGG pathways. The nodes size represented the number of DEPgenes contained in the pathway. The larger the node was, the more DEPgenes were included. The width of the edge indicated the overlapping degree of genes contained in two pathways. DEPgenes: depression-related genes contribute to improving the precision and generalizability of MDD-related genes in these three databases. However, although this computational framework applied quantity of valuable information that required future validation by extensive experimental, it still provided a reference for the study of other complex disease.