Multiple criteria optimization joint analyses of microarray experiments in lung cancer: from existing microarray data to new knowledge

Abstract Microarrays can provide large amounts of data for genetic relative expression in illnesses of interest such as cancer in short time. These data, however, are stored and often times abandoned when new experimental technologies arrive. This work reexamines lung cancer microarray data with a novel multiple criteria optimization‐based strategy aiming to detect highly differentially expressed genes. This strategy does not require any adjustment of parameters by the user and is capable to handle multiple and incommensurate units across microarrays. In the analysis, groups of samples from patients with distinct smoking habits (never smoker, current smoker) and different gender are contrasted to elicit sets of highly differentially expressed genes, several of which are already associated to lung cancer and other types of cancer. The list of genes is provided with a discussion of their role in cancer, as well as the possible research directions for each of them.


Introduction
According to the International Agency for Research on Cancer, the world's most commonly diagnosed cancer is lung cancer, with 1.8 million cases or 13% of total in 2012. Additionally, lung cancer was the first cause of death in the world, with 1.6 million deaths or 19.4% in 2012 [1]. This analysis was conducted in 184 countries. This work intends to facilitate uncovering new information related to cancer using publicly available lung cancer microarray data. The aim is to find those genes that changed their relative expression the most in order to propose potential lung cancer biomarkers.
Microarray experiments quantify the relative expression of tens of thousands of genes. These experiments have been highly utilized in the past decade to study a number of health conditions, including cancer [2,3]. These experiments, however, are often times measured in different units, thus making it difficult to analyze several of them simultaneously. Furthermore, because the measured level of expression is relative, a normalization process is commonly required. All of these have hampered the search for cancer biomarkers in the past.
The strategy to detect potential biomarkers utilized in this work is based on mathematical optimization.

ORIGINAL RESEARCH
Multiple criteria optimization joint analyses of microarray experiments in lung cancer: from existing microarray data to new knowledge Optimization can be defined as a decision-making process aimed to obtain the best possible values in a series of performance measures (PMs) of interest. The decision variables are habitually constrained to fall within specific ranges or to maintain mathematical relationships among them [4,5]. Mathematical optimization (MO) has been widely used in many fields, including Economics and Engineering, and clearly it can be applied to biological analysis. MO can make a system or design effective, functional, or in its most basic form, possible [6,7]. Multiple Criteria Optimization (MCO) is an optimization problem that finds a set of solutions corresponding to the best possible balances among two or more conflicting PMs under study [8]. These solutions are known as Pareto-Efficient solutions and are mathematically characterized by the well-established Pareto-optimality conditions. In general terms, then, the idea behind a MCO problem is to find the Pareto-Efficient Frontier formed by the Pareto-efficient solutions.
In this work the analysis of a publicly available microarray database for lung cancer is presented as a MCO problem. The genetic expression changes in this analysis were quantified using two metrics that do not have a perfect correlation and thus, are in conflict: difference of means and difference of medians. For the analysis, the MCO solutions will be those genes that have associated the greatest differences in the selected metrics. The solution genes are the ones that changed their expression the most between the compared conditions and could be potential biomarkers and can, after further study and confirmation, help with the diagnosis, prognosis, treatment, and recurrence prediction for the condition under analysis [9]. It can be appreciated that the method seeks to minimize the likelihood of false positives due to its focus on frontier analysis. This, expectedly, comes at the cost of false negatives in genes that might not appear in the Pareto-efficient frontier due to experimental error. A simple strategy of finding several consecutive frontiers is proposed to alleviate this issue.
The analysis strategy in this paper has, however, the advantage of providing objectivity, as it does not require the analyst to change or adjust any parameters, thereby fostering repeatability across analysts. It has also been shown to have a high discrimination power. The method used to solve the associated MCO problem is a full pairwise comparison scheme that effectively finds the genes that show high expression change across multiple PMs. This scheme is an improvement in terms of precision and convergence over the Data Envelopment Analysis approach presented by our group in [9]. The genes identified this way are located on the Pareto-efficient frontier of the MCO problem, that is, they are demonstrably Paretoefficient [10] and are, in consequence, proposed as potential biomarkers.

Literature Review
Microarray experiments have been very popular among researchers [11]. In the Gene Expression Omnibus (GEO)  as of May 2015 there are 3848 databases with 1,392,278 samples. Microarray experiments are sufficiently accepted as a reliable technology where the most common use is to find differentially expressed genes between two experimental conditions or samples [12]. Moreover, microarrays have been used to study how different biological processes or pathways work in several organisms [13]. To analyze the experimental data, statistics have been used for these types of studies [14,15]. However, producing a standard method for analysis has never been accomplished.
In the literature there are many methods to find highly differentially expressed genes to characterize them as potential biomarkers. Most of them focus on statistical procedures [15,16]. This research adopts multiple criteria optimization and Pareto conditions to find biomarkers following the direction of our research group [9,17], and proposes extending the application to this end through simultaneous analysis of multiple independent experiments, that is carrying out meta-analysis. In 2010, in our group, Sanchez-Peña [9] used a combination of two performance measures (two P-values) obtained from a single-microarray database to cast the MCO problem and Data Enveloped Analysis (DEA) to solve it. The pairwise comparison scheme in the present work yields a more precise Pareto-efficient frontier than DEA, as it can deal with nonconvexity from the onset.
An important direction of this work is to use the proposed method for meta-analysis of high throughput biological experiments, starting with microarrays. Glasser and Duval [18] provide the definition: "Meta-analysis refers to methods for the systematic review of a set of individual studies or patients within each study, with the aim to quantitatively combine their results." Meta-analysis is a method capable of taking independent, but associated studies to obtain a set of solutions through all studies. It is possible to find different applications and examples about meta-analysis. Li and his research group led a systematic review and meta-analysis to determine whether two polymorphisms (V89L and A49T) are associated with the risk of prostate cancer. They found 31 articles and reviews related to such risk [19]. On the other hand, R makes available a tool for microarray meta-analysis called MetaOmics. MetaOmics integrates Quality Control (Meta QC), Differentially Expressed (Meta DE), and Pathway (Meta pathway) [20]. Also, Zhuohui et al. (2014) research developed a tool, "MAAMD" [21]. They carried out metaanalysis using Affymetrix microarray data. The tool automates the process to analyze microarrays and requires normalization and several statistical methods to detect K. I. Camacho-Cáceres et al. Joint Analyses in Lung Cancer differential gene expression. To this end, they used Kepler, AltAnalyze and Bioconductor software packages. The parametric approaches in these works differ from our nonparametric approach. Therefore, it is clear that multiple criteria optimization differs from the reviewed approaches and constitute a novelty in meta-analysis. It must be emphasized, however that meta-analysis is a study that comprehends a larger area than afforded by the use of a single technique and that it requires a methodical design to be reliable. Especial care, for instance, must be given to the selection of studies to be included [22], as well as their heterogeneity [23]. Meta-analysis has become a cornerstone for evidence-based medicine [24] and follows widely accepted standards for its realization [25][26][27].
As noted earlier, the MCO problem has been approached in our group by Sánchez-Peña, et al. [9] through Data Envelopment Analysis (DEA). This work approaches the larger problem of analyzing multiple microarray databases simultaneously that is, to carry out meta-analysis, formulating the analysis as an MCO problem and solving it through a pairwise-comparison scheme that facilitates the evaluation of Pareto-efficiency conditions. In the literature, the authors of [28] have successfully applied Paretoconcepts for gene selection coupled with the use of a series of parametric statistical methods [28]. It is the intention of this work to keep the analysis strategy as nonparametric as possible, so as to not depend heavily on statistical assumptions or -in a different sense of nonparametric-the adjustment of parameters by the user that might bias the analysis results. Figure 1 shows the elements of the graphical representation of the MCO problem. G denotes the universe of solutions that comprises the n genes to be analyzed with g i representing each gene under analysis, (i = 1, 2, … n). Figure 2 shows the space defined by two criteria or PMs under analysis, m 1 and m 2 . In the generalization of this figure, m k i is the value for the i-th gene in the k-th PM. Then k = 1, 2, … C, where C is the number of PMs considered in the analysis. The Pareto-efficient frontier in Figure 2 is formed by the genes g * i . These genes have indeed the best possible balances among the two PMs to be minimized and are the ones proposed as potential biomarkers.

Method: MCO Problem Formulation
When it comes to microarray analysis, the PMs of choice are usually related to the difference of gene expression measured in two distinct states for comparison purposes. Looking for the most differentially expressed genes is akin to looking for potential biomarkers, and it is a problem that can be casted as described up to this point.
According to Deb [29] and Ehrgott [30] the Paretoefficient solutions must meet the Pareto-optimality conditions. In practical terms, this relates to finding nondominated solutions in the following sense: a solution X (1) is said to dominate the other solution X (2) , if both conditions 1 and 2 are true: 1. The solution X (1) is no worse than X (2) in all PMs. 2. The solution X (1) is strictly better than X (2) in at least one PM.
These conditions can be evaluated for every single pair of genes to find those that are not dominated by any other gene. These are the Pareto-efficient genes that form the Pareto-efficient frontier of the MCO problem at hand. As stated previously, in the search for the most differentially expressed genes, the expressions of all candidate genes are measured in two states to be then further  A new matrix γ is then build by assessing the values α ij . For C=2, for example, the following assessment applies: In general for any value C ≥ 2 (2) Thus, in summary, this process will result in the γ matrix: In order to find g * i , a vector β is built containing the sums of each row of matrix γ as follows: It is common, then, to use the difference of the means or the medians of the relative gene expression in these two states, for example. In this work, each of the C experiments will contribute one difference of medians between two states termed "control" and "cancer." This translates into each gene being evaluated through C PMs. The absolute value of these differences will then be transformed to follow a minimization direction to match the illustration in Figure 2, where the following notation is introduced: Let us represent the i-th gene in terms of its values on each of the C PMs as Then, the objective of the analysis is to find the set of Pareto-efficient solutions: . This is accomplished through a full pairwise comparison among the n genes as explained next.
First, a matrix δ k is built for the k-th PM resulting in C squared matrices of size n built as follows: where: for i = 1, 2, … , n j = 1, 2, … , n; k = 1, 2, … ,C ; and W is defined as a large positive integer number used as a penalty. In this work, W = 1000 is used. Next, a summation matrix is computed with elements . This is exemplified in Table 1 when C = 2: The Pareto-efficient frontier will, then, contain all solutions that meet equation (4): With this last step, the Pareto-efficient solutions, , are clearly identified. This algorithm identifies all the solutions of the Paretoefficient frontier. The maximum number proved and coded in this work is five PMs. The MatLab code is available in Appendix A1. In addition, Appendix A2 contrasts the proposed method with the use of a volcano plot to detect differentially expressed genes. Indeed, the mathematical description provided here is sufficient for the interested reader to code the method. The next illustration should help in this endeavor.

Implementation of method
The next example will explain the application of the method. The objective is to find the Pareto-efficient solutions g * i for the minimization of two PMs (C = 2). Let G = {g 1 , g 2 , g 3 , g 4 , g 5 , g 6 } be a set of n = 6 genes. The values for the PMs per gene are g 1 (1, 4); g 2 (3,4); g 3 (5,6); g 4 (7,5); g 5 (3,2); g 6 (4,1). This leads to having  Figure 3 shows the MCO problem for the case of minimization of both performance measures and its mathematical solution.
Finally, applying equation (4) the Pareto-efficient solutions implies comparing the beta values to a threshold of 2000. The solutions g * i , for this MCO problem are g * 1 ,g * 5 ,g * 6 . These solutions are graphically shown in Figure 4.

Analysis and Results of Lung cancer Microarray
In this analysis, the database with GEO identifier GDS3257 was used. This database was first reported by Landi MT and collaborators [31]. The database contains measures of relative expression for 22,283 genes from 107 samples: 49 control and 58 cancerous tissues. The age of the donors was between 44 and 79 years old. Samples were from never smokers, former smokers, and current smokers (See Fig. 5).

Case 1: Comparative analysis between different pairs of subgroups
For the first analysis the group of never smokers was considered and the comparison was between controls and cancer samples. There were fifteen controls (HNS) and sixteen cancer (CNS) samples. The absolute value of the differences of means and medians for each gene were calculated. The analysis in MatLab tool was run in a computer with 4 GB of memory RAM and 2.66 GHz CPU. Due to this memory constraint, the Pareto-efficient frontier was found in a tournament fashion [32] as explained next. The 22,283 genes were divided into three groups: two groups of 7500 and one of 7283 genes. The MatLab tool was used to find the locally efficient frontier in each group. Finally, the genes in each one of the three efficient frontiers were analyzed together to find the global Pareto-efficient frontier. It is important to point out that the order of the partition and input of the data does not affect the final efficient frontier, as this is a case of explicit full comparison. In one criterion, the process would be similar to finding the tallest person in a room by picking the tallest one in different subgroups and comparing the local winners in the end to find the global winner. With enough computing memory, partitioning the data is not necessary. For each group, the locally nondominated subset was identified (Fig. 6). Then the locally nondominated subsets were used to obtain the globally optimal Pareto-efficient frontier, as seen in Figure 7. For this first analysis RAGE and SPP1 are the genes in the global Pareto-efficient frontier. It is important to recall that the user does not need to normalize or use a threshold value to achieve this result.
For the second analysis the selected group was the one for the current smokers, and again the comparison was between control current smokers (HCS) and cancer current smokers (CCS). The group had 16 samples for HCS and 24 samples for CCS. The process was performed as in the previous analysis. In this case the global Paretoefficient frontier had just one gene, the SPP1.
A third analysis compared groups HNS and CCS. There were 15 HNS samples and 24 CCS samples and RAGE  In a fifth analysis, the 15 HNS samples are compared with the 16 samples of HCS, resulting in three genes in the efficient frontier: RPS4Y1, CYP1B1, and XIST. When, in the sixth analysis, the comparison is done for the cancer group between nonsmokers (CNS, 16 samples) and currentsmokers (CCS, 24 samples) there is only one gene present in the solution: XIST. Figure 8 shows a summary of the six analyses between never smoker versus current smoker in cancer and control tissues. The circles on the left side represent the controls never smoker (HNS) and controls current smoker (HCS) tissues, while the circles on the right hand side represent the cancer never smoker (CNS) and cancer current smoker (CCS) tissues. Additionally, the upper circles represent never smoker tissues, whereas the lower circles symbolize current smoker tissues.
Case 2: Analysis of lung cancer in women: never smoker versus current smoker in cancer and control tissues Figure 9 shows the result with the same analysis described before, but selecting only for women's tissues. For this representation, the only efficient solution is RAGE, which showed a large change when controls (HNS and HCS) were compared to cancer.
Case3: Analysis of lung cancer in men: never smoker versus current smoker Figure 10 shows the results with an analysis similar to the one described before, but using only men samples. For this representation, as in previous cases, RAGE and SPP1 showed significant changes when controls (HNS or HCS) were compared to cancer. Table 2 shows the scientific names of genes obtained in the Pareto-efficient frontier from all previous analyses.
Case 4: The possibility of meta-analysis with four performance measures: a prototype for meta-analysis In the previous analyses two PMs (absolute value of differences in means and absolute value of differences in medians) were used. In this analysis, MCO meta-analysis is carried out using four PMs, which were the absolute value of differences in medians for each group [16]. The medians were used for their nonparametric characteristics,      Figure 11.
In this way, the four PMs were calculated and MCO was applied to find the genes with high variation levels of the relative expressions throughout all PMs. Among all the 22,283 genes and using four PMs, the genes with high variation were RAGE and SPP1. This analysis supports the potential of the proposed method for meta-analysis. Table 3 presents the summary of genes obtained from eighteen analyses of the lung cancer database. The first group consists of the genes obtained from an analysis   from both women and men. The second group is obtained from a group analysis of only women, and the last group is the results of a group analysis of only men. The common genes for all groups are RAGE, SPP1, and SCGB1A1. The products of these three genes are associated with inflammatory processes and different cancer types including lung [23,[33][34][35][36][37][38]. From this table, three important conclusions are obtained. First, those genes found in the literature as biomarkers such as CYPIB1 [39] and FABP4 [40] validate our method. Secondly, those genes found in the literature as associated with other types of cancer, such as, XIST (a nonprotein coding gene) [41], among others, could eventually be validated and proposed as lung cancer biomarkers with the precursor that they are important genes for other types of cancer and could uncover relations between different cancer types. Also, these genes could possibly have a relation with lung cancer biomarkers in a pathway to be researched. Third, the genes that do not have any evidence found in literature indicating or any identification as biomarkers in other types of cancer, are the opportunities for discovery and thus, offer the potential for a larger contribution.

Conclusions
The method applied in this study could be used to analyze data related to biological health care research where microarrays and other -omics are the driving experiments for exploration.
The tool coded in MatLab can currently analyze five criteria, that is, it can be used to meta-analyze up to five different datasets in one run. The discrimination rate makes the analysis very manageable. Also, the results will be friendly and conveniently available to physicians or biological researchers, as this analysis does not require normalization, preference of objectives, parameter adjustments by user, or the definition of a threshold value. Importantly, the mathematical treatment is easy to translate into a functional code of the analyst's choice.
In It should be taken into consideration that SCGB1A1 was found in this study to have an over expression in both Cancer Never Smoker and Cancer Current Smoker. However, SCGB1A1 expression has been found to be reduced in current smokers [60]. Further biological studies should be performed to validate the results obtained by the methodology applied in this study.
Currently we are working on improving the usability of the code to make the method more amicable to the users. Future work should include further investigation of the potential biomarkers proposed in this document and experimental validation. It is certainly also envisioned the future tests of the proposed method with different -omics.
Joint Analyses in Lung Cancer K. I. Camacho-Cáceres et al.
to build the Volcano plot. As mentioned previously, the Volcano plot requires the user to define thresholds for two parameters: p-value and FC to select genes. A 3 2 factorial experiment was used to explore these parameters as shown in Figure A1. The results are shown in Table A1.
From Table A1, it can be seen how the results depend highly in the user's selection of thresholds. The combinations that exactly match the output of MCO in this instance are the ones with FC = 24 at any of the values chosen for P-value in this experiment.  P-value = 10 -12 ; Fold change = 2 P -value = 10 -12 ; Fold change = 8 P-value = 10 -12 ; Fold change = 24 Figure A1. Continued