Integrated weighted gene co‐expression network analysis reveals biomarkers associated with prognosis of high‐grade serous ovarian cancer

Abstract Background Ovarian cancer is the gynecologic tumor with the highest fatality rate, and high‐grade serous ovarian cancer (HGSOC) is the most common and malignant type of ovarian cancer. One important reason for the poor prognosis of HGSOC is the lack of effective diagnostic and prognostic biomarkers. New biomarkers are necessary for the improvement of treatment strategies and to ensure appropriate healthcare decisions. Methods To construct the co‐expression network of HGSOC samples, we applied weighted gene co‐expression network analysis (WGCNA) to assess the proteomic data obtained from the Clinical Proteomic Tumor Analysis Consortium (CPTAC), and module‐trait relationship was then analyzed and plotted in a heatmap to choose key module associated with HGSOC. Subsequently, hub genes with high connectivity in key module were identified by Cytoscape software. Furthermore, the biomarkers were selected through survival analysis, followed by evaluation using the relative operating characteristic (ROC) analysis. Results A total of 9 modules were identified by WGCNA, and module‐trait analysis revealed that the brown module was significantly associated with HGSOC (cor = 0.7). Ten hub genes with the highest connectivity were selected by protein‐protein interaction analysis. After survival and ROC analysis, ALB, APOB and SERPINA1 were suggested to be the biomarkers, and their protein levels were positively correlated with HGSOC prognosis. Conclusion We conducted the first gene co‐expression analysis using proteomic data from HGSOC samples, and found that ALB, APOB and SERPINA1 had prognostic value, which might be applied for the treatment of HGSOC in the future.


| INTRODUC TI ON
Ovarian cancer is the gynecologic cancer with the highest mortality in the world, and ranks third in the incidence of gynecologic cancers, next to cervical cancer and endometrial cancer; 313,959 new cases and 207,252 deaths were estimated to have happened in 2020. 1 High-grade serous ovarian cancer (HGSOC), belonging to epithelial ovarian cancer (EOC), is the most common and fatal type of ovarian cancer, and the first leading cause of ovarian cancerrelated deaths. [2][3][4] Since the ovaries are located deeply in the pelvic cavity, and there are no specific symptoms and effective screening methods of the early stages of HGSOC, most patients are often in an advanced clinical stage (stage III/IV) with metastasis at the time of diagnosis. The standard treatment of HGSOC is similar to other ovarian cancers, which is to perform primary debulking surgeries followed by chemotherapy combining platinum and paclitaxel. 2 But 80% of patients with advanced HGSOC will experience recurrence and chemotherapy resistance, leading to a 5-year survival rate of 30%, 5 and more and more evidences indicate that the metastasis-prone characteristics of HGSOC play an important role in its relatively poor prognosis. 6,7 In advanced HGSOC, tumor cells have spread out from the ovaries and pelvic organs to the peritoneum and abdominal organs, which could impede the normal function of vital organs in the abdomen by uncontrolled proliferation, as well as function of circulatory and respiratory system by generating large amounts of ascites and increasing intra-abdominal pressure, 5 Over the past two decades, with the breakthrough of highthroughput sequencing technology, a huge amount of nextgeneration sequencing data of various human tissues has been accumulated, which has greatly promoted the progress of biomedicine research, such as the screening of probable cancer prognostic biomarkers. Specifically, in the field of ovarian cancer research, a variety of mRNA-or lncRNA-based signatures have been identified for survival prediction in patients with ovarian cancer. For example, one study analyzed the differentially expressed genes in samples of ovarian cancer at different clinical stages and discovered specific gene co-expression modules related to the clinical stage and finally identified COL3A1, COL1A1, COL1A2, KRAS and NRAS as potential prognostic genes for ovarian cancer. 9 Another study showed that 5 lncRNAs (LINC00664, LINC00667, LINC01139, LINC01419 and LOC286437) could be used as independent risk factors for recurrence of ovarian cancer. 10 Meanwhile, other researchers have proved that changes in expression levels of genes were related to platinum resistance in ovarian cancer. 11 But almost all of these integrated analyses are performed at the transcriptome level, not at the proteome level.
It is well known that protein is the main executor of life activities, and all life activities depend on the correct function of protein.
And with the advancement of research methods for proteomics and the accumulation of public proteomic data, performing integrated proteomics analysis is completely feasible at present.
In the post-genomic era, the consensus that the occurrence of complex diseases such as cancer is not determined by a single gene that has gradually gained the approval of most researchers. Gene coexpression network research can provide us with information about gene expression correlations and potential functional relationships, thereby assisting us comprehend biological systems and explore the relationship between the relevant functional genes. 12 And weighted gene co-expression network analysis (WGCNA) is a method for concretizing this idea, which has been comprehensively applied to multiple cancer-associated studies to identify hub genes related to various traits.
Recent proteomics studies using ovarian tissue samples from HGSOC patients by Clinical Proteomic Tumor Analysis Consortium (CPTAC) have aided us better understand the mechanism of tumorigenesis from a novel insight and have also identified some candidate therapeutic targets. 2,13,14 In this study, we applied WGCNA to reanalyze these published proteomic data in order to discover proteins and pathway related to occurrence and development of HGSOC and identified a significant correlation between the brown module and the HGSOC, clinical stage, histological grade and patient survival time. Ten hub genes were selected from this module and verified by survival and relative operating characteristic (ROC) analysis. Finally, it was identified that ALB, APOB and SERPINA1 might be the potential biomarkers related to the prognosis of HGSOC. To our knowledge, this is the first study of prognostic biomarkers for HGSOC applying proteomic data, which provides some new insights into the occurrence and progress of HGSOC.
And the unshared peptides expression matrices analyzed through the Common Data Analysis Pipeline were used for subsequent data analysis. Samples lacking information about clinical stage, tumor histological grade and survival time and proteins with missing value of relative abundance among all samples were excluded from our study. Other sample inclusion criteria and data processing procedures were described in previous studies, 2,13,14 in short, that were (1) five samples were removed causing without TP53 mutation; (2) the median values of relative protein abundance over all proteins in every sample were calculated and re-centered to value of 0; (3) the normalized relative protein abundances of overlapping samples were averaged and used as their protein abundances. Finally, 2,892 proteins were identified to perform WGCNA analysis.

| Co-expression network construction and module detection
We used the WGCNA package (Version 1.70, https://CRAN.R-proje ct.org/packa ge=WGCNA) to construct the co-expression network for the identified proteins. 15 First, sample cluster analysis was carried out by the function hclust of WGCNA package to assess whether there were any significant outliers in the selected sample.
Next, a suitable soft threshold power for scale-free network construction was calculated and chosen with the function pickSoft-Threshold of WGCNA package. After, an adjacency matrix was built to bring about weighted separation of co-expression with the chosen soft threshold power value. 16 Co-expression similarity for paired proteins from adjacency matrix was calculated by measuring the topological overlap dissimilarity, and then we got a topological over-

| Identification of module-trait correlations and module preservation
The correlations between modules and clinical traits including sample type, clinical stage, pathological grade and survival time were assessed by the Pearson correlation coefficients and a heatmap was plotted to demonstrate the correlation value of interaction between modules and traits. The student t-test was used to get the p value of the correlation, and a p value of < 0.05 was considered statistically significant. The brown module with the highest value of correlation coefficients was mainly focused on and the correlation between gene significance (GS) for HGSOC and module membership (MM) in brown module was checked to identify module-trait associations.

| Identification of hub genes
Genes closely connected to the intramodular nodes are regarded as hub genes which usually have more important biological function than other nodes. 19 Protein-protein interaction (PPI) network analysis was performed via the online database Search Tool for Retrieval of Interacting Genes (STRING, Version 11, https://strin g-db.org/), 20 then the result of PPI analysis was imported to Cytoscape software (Version 3.8.0) to screen out top 10 hub genes ranked by degrees in the network of key modules using CytoHubba plug-in (Version 0.1). 21

| Gene set enrichment analysis
Gene set enrichment analysis (GSEA) of the biomarkers was performed with GSEA software (Version 4.1.0). 24 The package "h.all. v7.4.symbols.gmt"of the Molecular Signature Database (MsigDB, https://www.gsea-msigdb.org/gsea/msigd b/) was selected as reference gene set. 25 The normalized enrichment scores and p value were generated, and p value of < 0.05 was considered statistically significant.

| Identification of co-expression modules using WGCNA
It is believed that genes with comparable co-expression patterns are usually controlled by relative regulatory manner or have similar or parallel pathways of functional interaction. 12 In this study, we obtained the proteomic data from the CPTAC database according to Data Use Agreement. After the necessary quality control and manual check and screening, a matrix of relative abundance of 2,892 proteins from 25 normal fallopian tube and 235 HGSOC samples with clinical information were selected to construct the co-expression networks. To ensure the reliability of the coexpression network, sample clustering analysis was performed to investigate the outliers among all samples, and no outliers were detected (Additional file 1). Finally, the relative abundances of these 2,892 proteins and 260 samples were applied to identify the modules of co-expression genes (Additional file 2). To obtain scale-free topology, a value of 7 of soft threshold power was selected based on scale independence analysis (R^2 = 0.927), and the mean connectivity analysis was also relatively high under this soft threshold power (Figure 1). Thirteen modules were generated firstly, and 4 modules were merged into adjacent modules due to their high relevance of module eigengenes with adjacent modules, thus a total of 9 modules were included in our subsequent analysis ( Figure 2A, B). The hierarchical clustering dendrogram of proteins also showed the analogous results ( Figure 2C). Numbers of proteins in each module were displayed in Figure 2D, and the detailed result is summarized in Additional file 3. Afterward, the interactive relations among all modules and all proteins were visualized by a heatmap plot based on TOM ( Figure 3).

| Brown module significantly relates to HGSOC
To determine whether co-expression modules were associated with sample types and clinical traits, the module-trait relationship analyses were performed ( Figure 4A). The brown module was identified to be positively related to HGSOC with the top relevance and to survival time and negatively to stage and grade. In addition to brown module, the tan module also showed a lower correlation with above traits compared with brown module. Then, the correlation between GS for HGSOC and MM in brown module was analyzed, and the correlation coefficients value was of 0.82 ( Figure 4B).

| Functional enrichment analysis of proteins in brown module
To further explore the biological function of the proteins in brown module, GO and KEGG pathway enrichment analyses were per-

| Identification of biomarkers related to the prognosis of HGSOC
The 545 proteins in brown module were uploaded and analyzed by STRING database to identify the interaction between them, and then the top 10 proteins (ALB, AKT1, APOB, C3, APOA1, FGA, FGG, SERPINA1, MAPK1 and AHSG) ranked by degree of connectivity were selected as hub genes by cytoHubba plug-in of Cytoscape software. The network with neighbors generated by topological analysis of degree is shown in Figure 6. Furthermore, the relationships between the 10 hub genes and the overall survival time of patients with F I G U R E 1 Scale independence and mean connectivity of co-expression modules based on different soft threshold

| Gene set enrichment analysis
To further understand the biological function of ALB, APOB and SERPINA1 in HGSOC, we performed the GSEA based on our proteomic data. As shown in Figure 9, all of low expression of ALB, APOB and SERPINA1 were significantly associated with terms of "DNA repair," "G2 M checkpoint" and "MYC targets V2."

| DISCUSS ION
High-grade serous ovarian cancer remains the most common type of ovarian cancer with the highest incidence and the strongest fatality rate all over the world, and there is no definite research conclusion on its tumorigenesis mechanism. Meanwhile, due to the lack of effective early screening methods, most patients with HGSOC are diagnosed at the advanced stage, accompanied by extensive peritoneal metastasis, and furthermore, most patients will experience tumor recurrence, the In this study, we obtained proteomic data of HGSOC samples from the CPTAC database. Then, these data were assessed by WGCNA, and the brown module was identified to be significantly related to HGSOC. Interestingly, the brown module is not only significantly related to HGSOC but also significantly related to the pa- some new supports for these conclusions. In turn, these conclusions also support our findings. Moreover, the samples in our study were collected from patients with new-onset HGSOC, which may provide a certain research basis for the further understanding of the role of ALB in the occurrence of HGSOC. APOB is the main apolipoprotein of chylomicrons and low-density lipoproteins (LDL) and is the ligand for the LDL receptor, 35 and the low or absent levels of APOB in plasma usually lead to familial hypobetalipoproteinemia and abetalipoproteinemia. 36 Interestingly, increasing studies uncover that loss-of-function mutations of APOB frequently occur in multiple cancers including melanoma, liver cancer, stomach, esophageal, head and neck, uterine, and lung cancers. For liver cancer, Lee et al. 37,38 find that loss or inactivation of APOB in hepatocellular carcinoma is significantly associated with poor survival of HCC patients, whereas another group finds that elevated APOB predicts poor prognosis after surgery in patients with hepatocellular carcinoma, 39   The limitation of this study is that our results were left without verification due to the lack of other independent proteomic data.
Secondly, due to the limitation of mass spectrometry technology, the relative abundance of most proteins is missing, and in order to ensure the reliability of the results, these proteins are not included into our analysis, which may cause bias to our conclusion.
Furthermore, these findings need to be confirmed by further clinical practices in the future.
In summary, the HGSOC-associated module was revealed through WGCNA. ALB, APOB and SERPINA1 were identified as the prognostic biomarkers for HGSOC, the protein levels of which were positively correlated with the survival time of patients with HGSOC.

ACK N OWLED G EM ENTS
Data used in this publication were generated by the Clinical Proteomic Tumor Analysis Consortium (NCI/NIH). We thank Haifeng Li from the Institutes of Shanghai Pudong Decoding Life for assistance and valuable discussion with the data analysis.