IRF1 as a potential biomarker in Mycobacterium tuberculosis infection

Abstract Pulmonary tuberculosis (PTB) is a major global public health problem. The purpose of this study was to find biomarkers that can be used to diagnose tuberculosis. We analysed four NCBI GEO data sets: GSE139825 is a lung tissue microarray data set, and GSE83456, GSE19491 and GSE50834 are blood microarray data sets. GSE139825 and GSE83456 yielded 68 and 226 differentially expressed genes (DEGs), respectively, of which 11 were shared. Gene ontology (GO) analysis of the 11 intersection genes revealed that they were mostly enriched in regulation of leucocyte cell-cell adhesion and regulation of T-cell activation. Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis of the DEGs revealed that the host response in TB strongly involves cytokine-cytokine receptor interactions and folate biosynthesis. To further narrow the range of biomarkers, we used protein-protein interaction analysis to establish a hub gene network for each of the two data sets and a network of the 11 candidate genes. Eventually, IRF1 was selected as a biomarker. As validation, IRF1 levels were shown to be up-regulated in patients with TB relative to healthy controls in data sets GSE19491 and GSE50834. Additionally, IRF1 levels were measured in new patient samples using ELISA. IRF1 was significantly up-regulated in patients with TB compared with healthy controls, with an AUC of 0.801. These results collectively indicate that IRF1 could serve as a new biomarker for the diagnosis of pulmonary tuberculosis.


KEYWORDS
biomarker, differentially expressed gene (DEG), network analysis, protein-protein interaction, tuberculosis (TB)

Liwei Wu, Qiliang Chen and Zilu Wen are co-first authors.

also estimated that about one quarter of the world's population is infected with Mycobacterium tuberculosis and is at risk of developing tuberculosis. 2 Mycobacterium tuberculosis is an intracellular parasitic bacterium. Immunity against Mycobacterium tuberculosis is mainly cellular and involves macrophages, T cells and NK cells. 3 However, MTB can inhibit oxidative stress, apoptosis and autophagy, and can inhibit the synthesis of histocompatibility complex molecules, thereby affecting antigen presentation. 4 These mechanisms inhibit specific immunity and the natural killing activity of macrophages, helping Mycobacterium tuberculosis to escape immune killing by the host. 5,6 Therefore, a comprehensive view of the immune response to Mycobacterium tuberculosis infection has essential theoretical significance for clinical diagnosis and for the study of novel tuberculosis vaccines and immunotherapy. The immune response in lung tissue comprehensively reflects the lung's response to Mycobacterium tuberculosis, so related molecular markers can be screened as sensitive indicators of tuberculosis infection. In this study, bioinformatics methods were used to compare gene expression data from patients with pulmonary tuberculosis and from healthy people, with the aim of finding genes that may play an important role in the pathogenesis of tuberculosis, revealing the molecular immune mechanism of tuberculosis and discovering potential biomarkers of TB.

| Acquisition of RNA data
Data from human lung tissue samples and blood samples infected with TB were extracted from the GEO database. 7  HIV-infected samples and 21 HIV/TB co-infected samples. All sample data were downloaded for further analysis. Because all sample data came from public databases, informed consent and ethical approval were not required.

| Identification of DEGs
The original expression matrix had already been normalized by the uploader, so no further normalization was performed. The differentially expressed genes (DEGs) were screened out with the limma package. 9 The t test method was used to calculate the P-value of each gene, and the adjusted P-value was calculated by Benjamini and Hochberg's method.
The following selection criteria were used to screen out differentially expressed genes: |log FC| > 1 between the two sample groups, and adjusted P-value < .05. The DEGs of GSE139825 and GSE83456 were then intersected using a Venn diagram.
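The selection rule above can be illustrated with a short Python sketch. It is not the limma workflow used in the study: it assumes that log fold changes and raw P-values have already been computed, and the gene names and numbers are made-up toy values. It only shows the Benjamini and Hochberg adjustment, the two thresholds and the intersection step.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so monotonicity can be enforced.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(table, lfc_cutoff=1.0, alpha=0.05):
    """table maps gene -> (log fold change, raw P-value)."""
    genes = list(table)
    adjusted = benjamini_hochberg([table[g][1] for g in genes])
    return {
        g for g, q in zip(genes, adjusted)
        if abs(table[g][0]) > lfc_cutoff and q < alpha
    }


# Hypothetical toy inputs, NOT values from GSE139825 or GSE83456.
lung = {"GENE_A": (2.0, 0.001), "GENE_B": (1.5, 0.002),
        "GENE_C": (0.5, 0.001), "GENE_D": (-1.2, 0.6)}
blood = {"GENE_A": (1.8, 0.0005), "GENE_C": (2.2, 0.001),
         "GENE_E": (-1.5, 0.01)}

lung_degs = select_degs(lung)
blood_degs = select_degs(blood)
shared = lung_degs & blood_degs  # the Venn-diagram intersection
print(sorted(shared))
```

A set intersection plays the role of the Venn diagram here; in practice the adjusted P-values would come directly from the limma output.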

| Protein-protein interaction network construction
We used the Search Tool for the Retrieval of Interacting Genes (STRING) online database (http://string-db.org; version 11.0) 13 to build the protein-protein interaction (PPI) network, because functional interactions between proteins can reveal the mechanisms of disease occurrence and development. PPI networks of the DEGs were built, and interactions with scores >0.4 were considered significant. Cytoscape (version 3.8.0) is an open-source software platform for visualizing complex networks and integrating these with any type of attribute data. 14 CytoHubba 15 is a Cytoscape plugin that was used to discover the hub genes of the PPI network. We used the maximal clique centrality (MCC) method in the CytoHubba plugin to screen the top 10 genes.
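As a rough illustration of the hub gene step, the sketch below filters a scored edge list at 0.4 and ranks nodes by an MCC-style score. It assumes the commonly cited cytoHubba definition of MCC (maximal clique centrality: for each node, the sum of (|C| - 1)! over the maximal cliques C that contain it). The edge list, node names and helper functions are hypothetical, and the code is not the cytoHubba implementation.

```python
from math import factorial


def build_graph(edges, min_score=0.4):
    """Keep edges whose interaction score exceeds min_score."""
    graph = {}
    for a, b, score in edges:
        if score > min_score and a != b:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def maximal_cliques(graph):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)
            return
        for v in list(candidates):
            expand(current | {v}, candidates & graph[v], excluded & graph[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(graph), set())
    return cliques


def mcc_scores(graph):
    """Sum (clique size - 1)! over every maximal clique containing a node."""
    scores = {node: 0 for node in graph}
    for clique in maximal_cliques(graph):
        for node in clique:
            scores[node] += factorial(len(clique) - 1)
    return scores


# Hypothetical toy network: (protein, protein, interaction score).
toy_edges = [("A", "B", 0.9), ("A", "C", 0.8), ("B", "C", 0.7),
             ("C", "D", 0.5), ("D", "E", 0.3)]
graph = build_graph(toy_edges)
scores = mcc_scores(graph)
# Rank by descending score, breaking ties alphabetically; keep the top 10.
hubs = sorted(scores, key=lambda n: (-scores[n], n))[:10]
print(hubs)
```

In the toy network the D-E edge falls below the cut-off, so E never enters the graph, and C ranks first because it belongs to both the A-B-C triangle and the C-D edge.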

| Enzyme-linked immunosorbent assay (ELISA)
The gene selected from the intersection of the three gene lists was taken as the candidate biomarker for experimental verification. For this purpose, we used blood samples from 20 patients with pulmonary tuberculosis and 20 healthy controls. The patients were from the Shanghai Public Health Clinical Center. The candidate biomarker was validated using Human IRF1 kits (USCN Life Sciences; Wuhan, China). The experiment was carried out according to the manufacturer's instructions.

| Verification of potential biomarker in verification data sets
Two data sets, GSE19491 and GSE50834, were used to evaluate the expression level of the potential biomarker. We used the idmap R package to retrieve the expression level of the potential biomarker from the two expression matrices. The ggplot2 R package was used to draw box plots showing the expression level in the different groups. GSE19491 contained three groups: the HC, PTB and latent TB groups. GSE50834 contained two groups: the HIV-infected and HIV/TB co-infected groups.

| Gene Set Enrichment Analysis (GSEA) of two gene sets
Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g., phenotypes). We used the GSEA software to conduct the analysis. A P-value < .05 was considered statistically significant.
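To make the idea of a "concordant difference" concrete, the following Python sketch computes the running-sum enrichment score that GSEA is built on, following the standard published definition (hits weighted by the absolute ranking score, misses penalized uniformly). It is only an illustration with made-up genes and scores; it does not reproduce the GSEA software used here, and it omits the permutation step that produces the P-value.

```python
def enrichment_score(ranked_genes, gene_set, weight=1.0):
    """ranked_genes: list of (gene, score) sorted by descending score.

    Returns the maximum deviation from zero of the running sum
    P_hit - P_miss along the ranked list.
    """
    hits = [gene in gene_set for gene, _ in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    norm = sum(abs(score) ** weight
               for (_, score), hit in zip(ranked_genes, hits) if hit)
    if norm == 0 or n_miss == 0:
        raise ValueError("gene set must cover part, but not all, of the list")

    running = 0.0
    best = 0.0
    for (_, score), hit in zip(ranked_genes, hits):
        if hit:
            running += abs(score) ** weight / norm
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranked list (e.g. ordered by a high-vs-low group statistic).
ranked = [("g1", 3.0), ("g2", 2.0), ("g3", 1.0),
          ("g4", -1.0), ("g5", -2.0), ("g6", -3.0)]
es_top = enrichment_score(ranked, {"g1", "g2"})
es_bottom = enrichment_score(ranked, {"g5", "g6"})
print(es_top, es_bottom)
```

A gene set concentrated at the top of the ranking gives a positive score and one concentrated at the bottom gives a negative score; significance is then assessed by comparing the observed score with scores from permuted data.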

| Statistical analysis
All statistical analyses were carried out in R (version 3.6.2).
Student's t test was used to test the difference between two groups.
The receiver operating characteristic (ROC) curve was used to evaluate the reliability of candidate indicators as diagnostic biomarkers. An area under the curve (AUC) greater than 0.8 was taken to indicate that a candidate biomarker was highly reliable. All P-values were two-sided, and P < .05 was considered statistically significant.
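The AUC has a simple interpretation that a few lines of Python can show: it equals the probability that a randomly chosen patient has a higher marker value than a randomly chosen control, with ties counted as one half. The sketch below uses this pairwise formulation. The measurements are invented toy numbers, not the ELISA data of this study, and the function is an illustration rather than the R routine that was used.

```python
def roc_auc(case_values, control_values):
    """AUC as the fraction of (case, control) pairs ranked correctly."""
    if not case_values or not control_values:
        raise ValueError("both groups need at least one value")
    wins = 0.0
    for case in case_values:
        for control in control_values:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(case_values) * len(control_values))


# Hypothetical marker levels for illustration only.
toy_cases = [3.0, 4.0, 5.0]
toy_controls = [1.0, 2.0, 3.0]
auc = roc_auc(toy_cases, toy_controls)
print(round(auc, 3))
```

Under the criterion stated above, a marker would be regarded as reliable when this value exceeds 0.8.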

| Identification of DEGs in TB
Sixty-eight DEGs were screened from the lung tissue gene expression matrix, among which 65 genes were up-regulated and 3 genes were down-regulated. The 11 intersection DEGs of GSE139825 and GSE83456 were as follows: IL7R, TNFAIP6, KLHDC8B, IRF1, HELZ2, ADM, CALHM6, GCH1, CD274, CCR7 and PSTPIP2. The volcano plot and Venn diagram were plotted based on the analysis results of the gene expression matrix (Figure 1). The GO and KEGG enrichment analyses of the DEGs in GSE139825 and GSE83456 are shown in Table 1 and Table 2.

| PPI network construction and screening candidate biomarkers
We used the STRING online database (version 11.0) to analyse the 11 intersection DEGs and obtained their interaction data (Figure 3A). Then, the 225 DEGs of GSE83456 and the 68 DEGs of GSE139825 were also analysed. The data were exported as nodes for further analysis. Cytoscape software was used to visualize the obtained data. The cytoHubba plugin was used to analyse hub genes with MCC, and the genes with the top 10 scores were identified as hub genes (Figure 3B, C). We then intersected the hub genes of GSE83456, the hub genes of GSE139825 and the intersection DEGs. IRF1 was the only gene common to all three lists (Figure 3D).

| Verification of potential biomarker expression by ELISA
IRF1 was selected as the candidate biomarker for experimental verification. The experimental results showed that IRF1 was up-regulated in the tuberculosis group, which was consistent with the results of our bioinformatics analysis (Figure 4A, B). The ROC curve showed an AUC of 0.801 (Figure 4C), indicating that IRF1 could be a diagnostic biomarker of PTB.

| Verification of potential biomarker expression in verification data sets
In GSE19491, IRF1 was up-regulated in the TB group relative to the HC group (Figure 5A) and relative to the latent TB group (Figure 5B). In GSE50834, IRF1 was up-regulated in the HIV/TB co-infected group relative to the HIV-infected group (Figure 5C). These results were consistent with the results of our bioinformatics analysis.

| Gene Set Enrichment Analysis (GSEA) of two gene sets
According to the level of IRF1 expression, we divided the samples of each of the two data sets into two groups, an IRF1 low expression group and an IRF1 high expression group, and then analysed the two data sets by GSEA (Figure 6).
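The grouping step can be sketched as follows. The text does not state which cut-off separated the low and high groups, so the median split used here is purely an assumption for illustration, and the sample names and expression values are hypothetical.

```python
from statistics import median


def split_by_expression(expression, cutoff=None):
    """Split samples into (low, high) groups by a marker's expression.

    expression maps sample id -> expression value. If no cutoff is
    given, the median is used (an assumed choice, not taken from the
    study).
    """
    if cutoff is None:
        cutoff = median(expression.values())
    low = sorted(s for s, v in expression.items() if v <= cutoff)
    high = sorted(s for s, v in expression.items() if v > cutoff)
    return low, high


# Hypothetical IRF1 expression values for four samples.
toy_expression = {"s1": 5.0, "s2": 7.5, "s3": 6.1, "s4": 8.2}
low_group, high_group = split_by_expression(toy_expression)
print(low_group, high_group)
```

The resulting group labels would then serve as the two phenotypes compared in the GSEA run.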

| DISCUSSION
Tuberculosis has gradually become one of the diseases that threaten human health all over the world. It is caused by Mycobacterium tuberculosis, which parasitizes macrophages. At present, a large amount of data is based on the peripheral blood of patients with TB, but few data are based on the lung tissue of patients with TB. [16][17][18] Therefore, understanding gene expression in lung tissue after tuberculosis infection can help to reveal the pathogenesis of tuberculosis and to develop targeted treatment strategies. In this study, we used data from a lung tissue microarray and a blood microarray to identify candidate biomarkers in PTB. After a series of analyses, the candidate biomarker was determined to be IRF1, and it was verified by ELISA and ROC curves. IRF1 may be used as a biomarker for the diagnosis of tuberculosis.
In order to characterize the host response during tuberculosis infection, we used the Metascape online database to analyse the biological processes (BP) of the 68 DEGs. Among these BP annotations, cellular response to lipopolysaccharide, positive regulation of leucocyte migration and defence response to other organism were considered to play a key role in tuberculosis immunity. Tuberculosis is characterized by the parasitism of Mycobacterium tuberculosis in macrophages and the use of macrophages for proliferation. 19 At the same time, Mycobacterium tuberculosis releases endotoxin, which induces macrophages to release cytokines and attract leucocytes, mainly neutrophils and T cells. 20

FIGURE 6 Gene Set Enrichment Analysis (GSEA) results for the IRF1 low expression and IRF1 high expression groups of GSE83456 and GSE139825. The immunity of the low group was different from that of the high group

In this study, we performed a comprehensive bioinformatics analysis using the gene expression data of patients with TB. Although this study predicted potential genes and mechanisms involved in TB, it remains unclear whether gene expression data from public databases are fully reliable. Our approach improved our understanding of potential biomarkers for TB diagnosis. However, our research also had some limitations. First, to fully clarify the molecular mechanism of the occurrence and development of TB, more gene chip samples from patients with TB are needed. Second, many biomarkers related to TB have still not been characterized, and more experimental verification and bioinformatics analysis are needed to study the genes involved in tuberculosis. In the future, a prospective study may be needed to further investigate the biomarkers predicted in this study.
To conclude, our study found that IRF1 can be a potential biomarker for TB diagnosis. Two data sets, GSE83456 and GSE139825, were analysed comprehensively. These results provide an updated perspective on the immune mechanism of tuberculosis and may be useful for TB diagnosis.

CONFLICT OF INTEREST
The authors declared that they have no conflict of interest.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available in NCBI GEO database at https://www.ncbi.nlm.nih.gov/geo/