OSdlbcl: An online consensus survival analysis web server based on gene expression profiles of diffuse large B‐cell lymphoma

Abstract Diffuse large B‐cell lymphoma (DLBCL) is the most common subtype of non‐Hodgkin lymphoma (NHL) and is a clinically, pathologically, and molecularly heterogeneous disease with highly variable clinical outcomes. Valid prognostic biomarkers for DLBCL are still lacking. To optimize targeted therapy and improve the prognosis of DLBCL, the performance of proposed biomarkers needs to be evaluated in multiple cohorts, and new biomarkers need to be investigated in large datasets. Here, we developed a consensus Online Survival analysis web server for Diffuse Large B‐Cell Lymphoma, abbreviated OSdlbcl, to assess the prognostic value of individual genes. To build OSdlbcl, we collected 1100 samples with gene expression profiles and clinical follow‐up information from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) databases. DNA mutation data were also collected from the TCGA database. Overall survival (OS), progression‐free survival (PFS), disease‐specific survival (DSS), disease‐free interval (DFI), and progression‐free interval (PFI) are available as survival endpoints in OSdlbcl. Moreover, clinical features were integrated into OSdlbcl to allow data stratification according to the user's needs. After the user inputs an official gene symbol and selects the desired criteria, the survival analysis results are presented graphically as a Kaplan‐Meier (KM) plot with hazard ratio (HR) and log‐rank p value. As a proof‐of‐concept demonstration, the prognostic value of 23 previously reported survival‐associated biomarkers, such as transcription factors FOXP1 and BCL2, was evaluated in OSdlbcl, and they were found to be significantly associated with survival as reported (HR = 1.73, P < .01; HR = 1.47, P = .03, respectively). In conclusion, OSdlbcl is a new web server that integrates public gene expression, gene mutation data, and clinical follow‐up information to provide prognosis evaluations for biomarker development for DLBCL.
The OSdlbcl web server is available at https://bioinfo.henu.edu.cn/DLBCL/DLBCLList.jsp.


| INTRODUCTION
Diffuse large B-cell lymphoma (DLBCL) is the most common subtype of non-Hodgkin lymphoma (NHL), accounting for 30%-40% of NHL cases. 1,2 DLBCL is a clinically, pathologically, and molecularly heterogeneous disease, and patients have highly variable clinical outcomes. 3 The current, complex classification of DLBCL is defined by the World Health Organization (WHO). 4,5 Although this disease is curable, 20%-30% of DLBCL patients still experience relapse or refractory disease. 6,7 To assist clinical treatment, prognostic biomarkers are being investigated to optimize targeted therapy and to predict the prognosis of high-risk DLBCL patients. 8 So far, several unfavorable prognostic factors for DLBCL have been reported, such as a high international prognostic index (IPI), MYC rearrangement, double-hit lymphoma, double-expression lymphoma, and high p53 and CD5 expression. 1,6 However, more reliable biomarkers with high repeatability and predictive power to identify high-risk patients are needed to facilitate the development of alternative treatment strategies for DLBCL. 9 The discovery of prognostic biomarkers at the transcriptional level, using microarray or RNA-Seq technologies, is one of the main achievements of cancer genomics. 10 Despite the abundant expression data and corresponding clinical information available in public databases, a web server or tool that can quickly evaluate the prognostic value of potential DLBCL biomarkers is still lacking.
In this study, we collected the gene expression profiles and clinical information of 1100 DLBCL patients from seven independent cohorts in the TCGA and GEO databases. We developed an online consensus survival analysis web server, named OSdlbcl, to assess the prognostic value of genes of interest. This web server will facilitate the development and validation of new prognostic biomarkers in DLBCL.

| Dataset collection
RNA expression profiling data and clinical follow-up information of DLBCL patients were downloaded from two major sources: TCGA (https://portal.gdc.cancer.gov/) and GEO (http://www.ncbi.nlm.nih.gov/geo/). For TCGA, Level 3 RNA-Seq data (HiSeqV2) with clinical information of DLBCL patients were downloaded. For GEO, the database was searched with the keywords "diffuse large B-cell lymphoma" or "DLBCL" and "survival". Only datasets containing ≥ 25 cases with available gene expression profiles and clinical survival information were selected. In addition, DNA mutation data of DLBCL patients with clinical follow-up information were downloaded from TCGA.
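The inclusion rule above can be written as a small filter. The following is a minimal sketch, assuming a hypothetical record layout (`accession`, `n_cases`, `has_expression`, `has_survival`); the accessions and case counts are made-up placeholders and do not correspond to the cohorts used in OSdlbcl.

```python
from dataclasses import dataclass

MIN_CASES = 25  # threshold stated in the text: datasets need >= 25 cases


@dataclass
class Dataset:
    accession: str        # hypothetical identifier, for illustration only
    n_cases: int          # cases with expression and survival data
    has_expression: bool  # gene expression profiles available
    has_survival: bool    # clinical survival information available


def is_eligible(ds: Dataset) -> bool:
    """Apply the stated inclusion rule to one candidate dataset."""
    return ds.has_expression and ds.has_survival and ds.n_cases >= MIN_CASES


def select_datasets(candidates):
    """Return the candidates that satisfy the inclusion rule, in input order."""
    return [ds for ds in candidates if is_eligible(ds)]


# Placeholder candidates (not real GEO series).
candidates = [
    Dataset("EXAMPLE_A", 120, True, True),
    Dataset("EXAMPLE_B", 18, True, True),    # too few cases
    Dataset("EXAMPLE_C", 60, True, False),   # no survival information
    Dataset("EXAMPLE_D", 25, True, True),    # exactly at the threshold
]
selected = [ds.accession for ds in select_datasets(candidates)]
print(selected)
```

The threshold is inclusive, so a dataset with exactly 25 eligible cases is kept.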

| System implementation and server setup
OSdlbcl was developed as previously described. [11][12][13][14] The Kaplan-Meier (KM) plot of cumulative survival probability over time is a hallmark of biomedical survival analysis, 11 and the log-rank test is widely used to compare survival between groups. Thus, the KM plot and log-rank test were used to estimate the risk of events in OSdlbcl. In short, a J2EE (Java 2 Platform Enterprise Edition) architecture and a MySQL server were used to integrate gene expression, DNA mutation, and clinical data. The dynamic web interfaces were written in HTML 5.0 and hosted by Tomcat on a Windows server. The web server works "out of the box": when users input an official gene symbol, the statistical analyses are performed by the R package "survival" to produce the KM curves with hazard ratio (HR, 95% confidence interval) and log-rank p value. OSdlbcl is available at http://bioinfo.henu.edu.cn/DLBCL/DLBCLList.jsp.
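To make the two statistics concrete, the sketch below implements the textbook Kaplan-Meier estimator and the two-group log-rank test in plain Python. It is an illustration only and not the server's code: OSdlbcl itself relies on the R package "survival", the function names here are our own, and the follow-up times are invented. The hazard ratio and its confidence interval are not computed in this sketch.

```python
import math


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  -- follow-up time per patient (e.g. months)
    events -- 1 if the event (e.g. death) was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Gather every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p value)."""
    group_a = list(zip(times_a, events_a))
    group_b = list(zip(times_b, events_b))
    event_times = sorted({t for t, e in group_a + group_b if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for x, _ in group_a if x >= t)        # at risk, group A
        n_b = sum(1 for x, _ in group_b if x >= t)        # at risk, group B
        d_a = sum(1 for x, e in group_a if x == t and e)  # events at t, group A
        d_b = sum(1 for x, e in group_b if x == t and e)  # events at t, group B
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Invented follow-up data (months) for a high- and a low-expression group.
high_times, high_events = [6, 7, 10, 15, 19, 25], [1, 0, 1, 1, 0, 1]
low_times, low_events = [12, 20, 28, 32, 40, 45], [1, 0, 1, 0, 0, 1]

curve = kaplan_meier(high_times, high_events)
chi2, p_value = logrank_test(high_times, high_events, low_times, low_events)
print(curve)
print(chi2, p_value)
```

Censored patients leave the risk set without lowering the survival estimate, which is why the curve only steps down at observed event times.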

| Evaluation of previously reported prognostic biomarkers
To evaluate the prognostic power of previously reported prognostic biomarkers, PubMed was searched with the keywords "Diffuse large B-cell lymphoma" or "DLBCL," "gene expression," and "survival" or "prognosis." In total, 23 biomarkers were collected, and the prognostic values of these reported DLBCL biomarkers were analyzed in OSdlbcl.

| Clinical characteristics of DLBCL datasets used in OSdlbcl
To establish the OSdlbcl web server, we downloaded one TCGA cohort and six GEO cohorts with gene expression profiles and clinical follow-up information (Table 1). [15][16][17][18][19][20][21][22][23] A total of 1100 unique DLBCL cases were collected, all of which have gene expression profiling data and clinical follow-up survival information. Overall survival (OS) is the most important endpoint for clinical outcomes in OSdlbcl. 11 We also collected further survival endpoints, including progression-free survival (PFS), disease-specific survival (DSS), disease-free interval (DFI), and progression-free interval (PFI), for the "survival" option in the OSdlbcl web server. Before survival analysis, users can choose relevant clinical characteristics, such as Ann Arbor stage, age, ECOG performance status, gender, IPI group, or number of extranodal sites, to restrict the analysis to a subgroup of DLBCL patients. The main clinical characteristics of the cohorts in OSdlbcl are shown in Table 1. DLBCL samples in OSdlbcl were collected from various regions, but most were from North America and Europe. Most of the DLBCL patients are male and elderly, with a median age over 60 years. All 1100 patients have OS data, with a median OS of 28.50 months, while 267 patients from TCGA and 29 patients from GSE21864 also have PFS data. In addition, the 267 patients from TCGA have DSS, DFI, and PFI data. The total number of death events is 437, which is 39.72% of all patients in OSdlbcl (Table 1).

| Application of OSdlbcl web server
To evaluate the prognostic value of a gene in the OSdlbcl web server, users first input a gene symbol, choose either an individual cohort or the combined cohorts, select one of the survival outcome types (OS, DSS, DFI, PFI, or PFS), and designate a gene expression cutoff value that is used to split the DLBCL patients for KM analysis 24 (Figure 1A). Users can also restrict the survival analysis to a subgroup of DLBCL patients by setting Ann Arbor stage, age, ECOG performance status, gender, IPI group, or number of extranodal sites (Figure 1A). 11,25 Finally, users click the "Kaplan-Meier plot" button to run the KM analysis. The OS, DSS, DFI, PFI, or PFS for the queried gene is then determined and displayed graphically with HR (95% CI) and log-rank p value on the output web page (Figure 1).
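The query workflow (restrict to a subgroup, then split patients at an expression cutoff) can be sketched as follows. This is an illustrative outline under stated assumptions: the patient records, field names, and the choice of the subgroup median as the cutoff are hypothetical, and the cutoff options actually offered by OSdlbcl are not specified here.

```python
import statistics


def filter_subgroup(patients, **criteria):
    """Keep patients whose fields match every given criterion."""
    return [p for p in patients
            if all(p.get(key) == value for key, value in criteria.items())]


def split_by_cutoff(patients, cutoff):
    """Split into (high, low); 'high' means expression strictly above cutoff."""
    high = [p for p in patients if p["expression"] > cutoff]
    low = [p for p in patients if p["expression"] <= cutoff]
    return high, low


# Invented records; field names are assumptions made for this sketch.
patients = [
    {"id": "P1", "expression": 2.1, "gender": "male", "stage": "III"},
    {"id": "P2", "expression": 5.4, "gender": "male", "stage": "II"},
    {"id": "P3", "expression": 3.3, "gender": "female", "stage": "IV"},
    {"id": "P4", "expression": 7.8, "gender": "male", "stage": "I"},
    {"id": "P5", "expression": 4.0, "gender": "male", "stage": "IV"},
    {"id": "P6", "expression": 6.2, "gender": "female", "stage": "II"},
]

# Step 1: restrict the analysis to one clinical subgroup.
subgroup = filter_subgroup(patients, gender="male")
# Step 2: choose a cutoff (here the subgroup median) and split the patients.
cutoff = statistics.median(p["expression"] for p in subgroup)
high, low = split_by_cutoff(subgroup, cutoff)
print(cutoff, [p["id"] for p in high], [p["id"] for p in low])
```

The two resulting groups are what a KM plot and log-rank test would then compare. Computing the cutoff after subgroup filtering, as done here, is a design choice of this sketch; computing it on the full cohort would give different groups.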

TABLE 1 Clinical characteristics of seven independent DLBCL datasets used in OSdlbcl

| Determination of the prognostic value of DNA mutation in OSdlbcl
In addition to gene expression, gene sequence variation is another common type of prognostic factor. 47 To enable prognosis analysis based on gene sequence variation, we collected DLBCL DNA mutation data from TCGA and integrated them into the OSdlbcl web server, so that users can perform prognosis analysis based on the DNA mutation status of the input gene. For example, PTEN is a tumor suppressor that is frequently mutated in many cancers, and PTEN deletion, mutation, and loss of PTEN expression were of clinical significance in de novo DLBCL. 48 We therefore investigated the prognostic performance of PTEN in OSdlbcl at both the RNA and DNA levels. The results showed that PTEN is an independent favorable prognostic factor for OS at the RNA level (HR = 0.67, P < .05) (Figure 2A), while PTEN mutation is an independent prognostic factor for poorer survival in DLBCL (HR = 0.11, P = .04) (Figure 2B). In addition, the difference in gene expression between DLBCL cases with wild-type (Wt) and mutant (Mut) genotypes can also be investigated in OSdlbcl (Figure 2C).
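The Wt versus Mut comparison of expression can be illustrated with a simple grouping step. This is a minimal sketch with invented values and hypothetical field names (`mutated`, `expression`); it only summarizes each group and does not reproduce the statistics or plots of Figure 2C.

```python
import statistics
from collections import defaultdict


def expression_by_genotype(samples):
    """Group expression values by mutation status and summarize each group."""
    groups = defaultdict(list)
    for sample in samples:
        label = "Mut" if sample["mutated"] else "Wt"
        groups[label].append(sample["expression"])
    return {label: {"n": len(values), "median": statistics.median(values)}
            for label, values in groups.items()}


# Invented samples; values and field names are assumptions of this sketch.
samples = [
    {"id": "S1", "mutated": False, "expression": 5.0},
    {"id": "S2", "mutated": False, "expression": 6.0},
    {"id": "S3", "mutated": True, "expression": 3.0},
    {"id": "S4", "mutated": False, "expression": 7.0},
    {"id": "S5", "mutated": True, "expression": 4.0},
]
summary = expression_by_genotype(samples)
print(summary)
```

The same Wt/Mut labels could also serve as the two groups for a KM plot and log-rank test, in place of the high/low expression groups.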

| DISCUSSION
DLBCL is a heterogeneous disease with highly variable clinical outcomes. 3 Efficient biomarkers can help predict clinical outcomes and identify high-risk patients. However, the biomarkers currently in use reflect only a small spectrum of DLBCL patients. Therefore, we developed a user-friendly online survival analysis web server for researchers and clinicians to assess and identify prognostic biomarkers in DLBCL in a large dataset. Compared with previously published prognostic biomarker tools such as OncoLnc, 49 KM plotter, [50][51][52] and UALCAN, 53 OSdlbcl has the following advantages. First, OSdlbcl is the first survival analysis web server specifically for DLBCL and contains the largest DLBCL sample size (1100 samples) among these databases. Second, the interface of OSdlbcl is straightforward and easy to operate for users with no specific bioinformatics training. The survival analysis results are presented graphically as a KM plot with HR and log-rank p value, which can be used to assess the prognostic value of a gene of interest. Third, in addition to prognosis evaluation at the RNA level, OSdlbcl can also determine the prognostic value of DNA mutations for DLBCL patients. Fourth, OSdlbcl incorporates clinical covariates for DLBCL patients, including Ann Arbor stage, gender, ECOG performance status, number of extranodal sites, and IPI. Last but not least, 23 previously reported prognostic biomarkers were confirmed in the OSdlbcl web server, which indicates the effectiveness of OSdlbcl, and these previously reported biomarkers may have the potential to be translated into clinical applications. A limitation of OSdlbcl is that the number of DLBCL cases available for DNA mutation survival analysis is small. However, continuously updating the OSdlbcl database with the latest gene variation and expression profiles with accurate follow-up information will help to strengthen the performance of OSdlbcl.
In conclusion, OSdlbcl is a user-friendly online consensus survival analysis web server for efficiently identifying prognostic biomarkers, and the OSdlbcl database will be updated regularly as new DLBCL data become available. Our web server can help reveal the impact of RNA expression and gene variation on the prognosis of DLBCL and may support future targeted therapy aimed at improving clinical outcomes.