Identification of the gene signature reflecting schizophrenia’s etiology by constructing artificial intelligence‐based method of enhanced reproducibility

Abstract

Aims: As one of the most fundamental questions in modern science, “what causes schizophrenia (SZ)” remains a profound mystery because objective gene markers are lacking. The gene signatures identified by independent studies show extremely low reproducibility, owing to the limitations of available feature selection methods and the lack of any measure for validating the robustness of a signature. These irreproducible results have significantly limited our understanding of the etiology of SZ.

Methods: In this study, a new feature selection strategy was developed, and a comprehensive analysis of nine independent studies was then conducted to ensure reliable signature discovery. Specifically, the new strategy (a) combined multiple randomized sampling with consensus scoring and (b) assessed the consistency of gene rankings among different datasets.

Results: In a first‐ever evaluation of method reproducibility, cross‐validated across nine independent studies, the newly developed strategy was found to be superior to the traditional ones. With the new strategy, 33 genes were consistently identified as differentially expressed across multiple datasets, which might facilitate our understanding of the mechanism underlying the etiology of SZ.

Conclusion: A new strategy capable of enhancing the reproducibility of feature selection in current SZ research was successfully constructed and validated. The group of candidate genes identified in this study holds great potential for revealing the etiology of SZ.
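The two ingredients named in the Methods, repeated randomized sampling with consensus scoring and a check of gene-ranking consistency between datasets, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the per-gene statistic or the consistency measure, so Welch's t statistic and a Jaccard overlap of top-ranked genes are assumed here as stand-ins, and all function names, parameters (n_rounds, frac, top_k) and the synthetic toy data are hypothetical.

```python
import random
from statistics import mean, stdev


def welch_t(a, b):
    """Welch's t statistic between two lists of values (assumed statistic)."""
    var_a, var_b = stdev(a) ** 2, stdev(b) ** 2
    denom = (var_a / len(a) + var_b / len(b)) ** 0.5
    return 0.0 if denom == 0 else (mean(a) - mean(b)) / denom


def consensus_scores(cases, controls, n_rounds=100, frac=0.8, top_k=2, seed=0):
    """Fraction of random subsampling rounds in which each gene ranks in the
    top_k by |t|. `cases` and `controls` map gene -> list of expression values."""
    rng = random.Random(seed)
    genes = list(cases)
    n_case = len(cases[genes[0]])
    n_ctrl = len(controls[genes[0]])
    hits = dict.fromkeys(genes, 0)
    for _ in range(n_rounds):
        # Draw a random subset of samples from each group (without replacement).
        ci = rng.sample(range(n_case), max(2, int(frac * n_case)))
        ki = rng.sample(range(n_ctrl), max(2, int(frac * n_ctrl)))
        scores = {
            g: abs(welch_t([cases[g][i] for i in ci],
                           [controls[g][i] for i in ki]))
            for g in genes
        }
        for g in sorted(genes, key=scores.get, reverse=True)[:top_k]:
            hits[g] += 1
    return {g: hits[g] / n_rounds for g in genes}


def ranking_overlap(scores_a, scores_b, top_k):
    """Jaccard overlap of the top_k genes of two datasets (assumed measure)."""
    def top(scores):
        return set(sorted(scores, key=scores.get, reverse=True)[:top_k])
    a, b = top(scores_a), top(scores_b)
    return len(a & b) / len(a | b)


def toy_dataset(seed, n_samples=20):
    """Synthetic data: 10 genes, of which g0 and g1 are shifted in cases."""
    rng = random.Random(seed)
    genes = [f"g{i}" for i in range(10)]
    cases = {
        g: [rng.gauss(4.0 if g in ("g0", "g1") else 0.0, 1.0)
            for _ in range(n_samples)]
        for g in genes
    }
    controls = {g: [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
                for g in genes}
    return cases, controls


# Two independent toy "studies" scored separately, then compared.
scores_a = consensus_scores(*toy_dataset(seed=1))
scores_b = consensus_scores(*toy_dataset(seed=2))
print(ranking_overlap(scores_a, scores_b, top_k=2))
```

Scoring each gene by how often it is selected across rounds, rather than by a single run on all samples, is what makes the ranking less sensitive to any particular subset of subjects; the overlap function then asks whether two studies agree on the top of that ranking.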

Search history and dataset inclusion for each of the seven electronic databases (GEO, SMRI, HBB, PubMed, PsycINFO, Embase and Cochrane). First, the direct keyword search of GEO, SMRI, HBB, PubMed, PsycINFO, Embase and Cochrane returned 4,256, 20, 1, 505, 942, 1,346 and 13 records, respectively. Second, the numbers of records remaining after each of the five sequential criteria described in the 2nd paragraph of Materials and Methods are provided. Third, the numbers of datasets passing all five criteria in GEO, SMRI, HBB, PubMed, PsycINFO, Embase and Cochrane were 4, 2, 1, 9, 0, 8 and 0, respectively. Finally, after removing the duplicates across all electronic databases, nine independent microarray studies were collected and included in this analysis.

No. of Records under the Multiple Searching Criteria Applied to Seven Popular Databases

| Step | GEO | SMRI | HBB | PubMed | PsycINFO | Embase | Cochrane |
|---|---|---|---|---|---|---|---|
| Step 4: Brain Locus Prefrontal Cortex | 10 | 11 | 1 | 77 | 22 | 58 | 1 |
| Step 5: Two Distinct Sample Groups | 9 | 11 | 1 | 37 | 0 | 47 | 0 |
| Step 6: Availability of Raw Dataset (CEL file) | 4 | 2 | 1 | 9 | 0 | 8 | 0 |

Step 7: Datasets after Removing the Duplicates: Nine Independent Microarray Studies (listed in Table 1)

a The keyword search was conducted using "schizophrenia AND (gene expression OR microarray OR transcriptomics)".

Supplementary Table S2. SCZ gene signatures identified from nine independent studies using the newly proposed strategy.

Gene Symbol
Relevance between SCZ and the Identified DEGs Confirmed by the Comprehensive Literature Reviews

ADM
A promising biomarker of SCZ, up-regulated in both the LB cells and plasma of SCZ patients [1]. Its expression was significantly altered during the pre-dementia stage of mild cognitive impairment [2], and it prevented cognitive decline after chronic cerebral hypoperfusion [3].

ALB
Serum albumin (ALB) was down-regulated in SCZ patients [4] and was differentially expressed in the plasma of subjects with mild cognitive impairment [5].

ANKRD1
The expression of ANKRD1 was upregulated by ZNF804A, a candidate risk gene for SCZ that affected cognitive functions including verbal and spatial working memory [6].

ARMCX5
ARMCX5 interacted with GTF2IRD1, which was considered a candidate gene for childhood-onset SCZ [7].

CCBL2
CCBL2 catalyzed the central and peripheral formation of kynurenic acid (KYNA) [8], which was associated with the cognitive impairments in SCZ [9].

CP
Increases in ceruloplasmin (CP) may result in increased levels of copper, which ultimately proves deleterious in SCZ [10]. A novel mutation in the ceruloplasmin gene causes a cognitive and movement disorder [11].

CRB1
CRB1 was differentially expressed between SCZ patients and healthy controls [12].

HOXB9
In a target analysis of SCZ-associated microRNAs, HOXB9 was found to be downregulated by miRNA microarray analysis [14].

HSD11B1
The HSD11B1 gene, which encodes a protein associated with lipid metabolic processes, was differentially expressed in SCZ patients compared with controls [15].

KCNA3
KCNA3 was regulated by the KCNE2 gene, sequence variants or duplications of which were associated with SCZ [16].

MEFV
MEFV mutation may have a protective effect on cognitive impairment through an unknown mechanism [18].

MFSD1
Hypoxia has been identified as a strong risk factor for SCZ, but MFSD1, which is involved in cell stabilization, was upregulated, which may reflect compensatory responses [19].

NXF1
NXF1-associated gene expression and protein networks that interact with miRNAs were found in cognitive impairment and developmental cognitive disorder [20].

PCSK6
The PCSK6 VNTR genotypes mediate the expression of psychological phenotypes that involve atypical cerebral lateralization, such that this locus apparently exerts pleiotropic effects on both handedness and psychological-cognitive phenotypes [21].

PEX14
PEX14 was down-regulated in SCZ across different brain regions [22]. A patient who began to show symptoms of cognitive deterioration at 9 years of age carried a mutation in the PEX14 gene [23].

PLCG1
PLCG1 was found at high frequency in the top-ranked signaling pathways, which were known to be of importance in SCZ [24]. The abnormal expression and activation of PLCG1 resulted in devastating cognitive, psychological and motor disturbances [25].

RAPGEF1
qPCR confirmed that RAPGEF1 mRNA levels were lower in SCZ than in controls (P<0.05) [26].

RGN
Interestingly, the concentration of regucalcin (RGN) in the cerebral cortex and hippocampus decreases with aging, and the changes in neuronal Ca²⁺ homeostasis with aging may be implicated in age-related disturbance of cognitive functions [27].

RPL36
RPL36 was downregulated among the validated target genes compared between SCZ and controls at the whole-blood microRNA level [28].

S100A8
S100A8 expression changed consistently between schizophrenic patients and controls and was nominally significant in the gene-based association analysis [29]; it also contributed to postoperative cognitive dysfunction in mice undergoing tibial fracture surgery [30].

SCN1B
SCN1B was a dysregulated gene in SCZ patients [31]. Homozygous SCN1B mutations indicated that SCN1B was an etiologic candidate underlying Dravet syndrome, which is characterized by early-onset epileptic seizures followed by ataxia and cognitive decline [32].

TAC1
TAC1 showed a trend toward decreased density in SCZ patients [33]. It emerged as a top candidate gene for cognitive disorders in a unique multi-stage analysis of human genetic linkage [34,35].

TNFSF10
TNFSF10 was consistently found to be differentially expressed between SCZ subjects and healthy controls [36]. An anti-TNFSF10 antibody could reduce brain amyloid-β load and activation of TNFSF10 apoptotic receptors, as well as improve cognition [37].