Explaining the optimistic performance evaluation of newly proposed methods: a cross-design validation experiment



Introduction
In the literature on data analysis methods, including statistical journals, machine learning journals, and conference proceedings, most articles describe new methods, thus contributing to an increasing multitude of potential methods addressing various data analysis problems. It is commonly claimed by the authors proposing these new methods that they perform better than existing ones in some sense. For anecdotal evidence in the context of supervised classification, see Boulesteix et al. (2013). The fact that new methods are typically claimed to be better than existing ones does not necessarily imply that these statements are wrong. In fact, this is what we would expect if we assume continuous scientific progress. However, the recurrent character of these claims, combined with the requirement of journals and reviewers to make this sort of claim regarding the superiority of the proposed methods, makes them somewhat suspicious (Norel et al., 2011; Boulesteix et al., 2013). A recent survey of papers that compare pre-processing methods for a special type of high-throughput molecular data indicates that, at least in this specific context, the paper introducing a method is indeed more optimistic regarding its performance than subsequent papers that are more neutral towards the method (Buchka et al., 2021).
In a different but related approach, several studies demonstrate that it is relatively easy to make a method appear better than it actually is (Jelizarow et al., 2010; Ullmann et al., 2022; Pawel et al., 2022; Nießl et al., 2022; Sonabend et al., 2022; Boulesteix and Strobl, 2009). These studies suggest that over-optimistic statements regarding a method's performance may be partly attributed to the non-neutral attitude of the authors, who are naturally interested in presenting their method in a positive light. More precisely, it is argued that this non-neutrality may translate into a conscious or subconscious optimization of the method and the study design in which it is evaluated (e.g., by selectively reporting the considered data sets or simulation parameters) such that the proposed method shows good performance.
Imagine there are two methods, method A and method B, available to address a specific data analysis task. We set the number of methods to two for the sake of simplicity, but the following arguments can be extended to a setting with more than two methods. Further, imagine the typical situation in which the authors of A and the authors of B both claim that their method performs well. The study designs they use to support their claims are different. We will call them design A and design B, respectively. Following the conjecture discussed in the previous paragraph that study designs used by authors of methods may overfit their methods and vice versa, a natural question is how method A would perform when re-evaluated using design B and how method B would perform when re-evaluated using design A. In the present paper, we put this idea into practice by conducting a systematic experiment, which we call "cross-design validation of methods". More precisely, we consider two exemplary data analysis tasks, namely multi-omic data integration for cancer subtyping and differential gene expression analysis, and for each exemplary task, we select two papers that propose a new method. For each of these two methods, we reproduce the evaluation shown in the paper that introduced it, and then re-evaluate it on the design used by the authors of the other paper. In this context of methodological research, where data analysis methods are considered as research objects, the study design includes data sets (with a focus on real data for the first task and simulated data for the second task), competing methods, and evaluation criteria.
The goal of this cross-design validation experiment is two-fold. Firstly, it allows dissection of the variability of study designs and its impact on the results in the context of methodological research, in a similar way as so-called multi-analyst experiments (Silberzahn et al., 2018) do in application fields of statistics. Secondly, the cross-design validation experiment provides insights into the mechanisms leading to performance discrepancies such as those observed by Buchka et al. (2021) between the original paper (i.e., the paper that introduces the method of interest) and subsequent papers (i.e., papers that propose another method and include the method of interest as competitor, or papers that are dedicated to method comparison itself). Importantly, these are real-world observations, as opposed to the previous experiments by Jelizarow et al. (2010), Ullmann et al. (2022), and Pawel et al. (2022), which mimicked the behavior of fictional researchers who wish to present their method in a favorable light. Finally, our experiment also provides insights regarding the reproducibility of results and the difficulty of performing fair method comparisons as a by-product.
Because the authors of our selected papers made code and data available for the purpose of reproducibility, our experiment can be performed without involving them personally, which considerably simplifies its organization and execution. Moreover, while we could also gain insights from re-evaluating the methods on a study design selected by ourselves, the cross-design character of the experiment guarantees a certain degree of neutrality of our comparisons.
The remainder of this paper is structured as follows. The general structure of our cross-design validation experiment is outlined in Section 2. Sections 3 and 4 present the two data analysis tasks, while a discussion of the mechanisms leading to the observed performance differences, along with a summary, can be found in Section 5. We conclude the paper in Section 6.
2 Preliminary remarks and design of the experiment

Terminology
Before describing the experiment in more detail, we briefly clarify the terminology used throughout the paper. Similar to Klau et al. (2020) and Buchka et al. (2021), we define the term method not just as the statistical testing or modeling approach, but as the full analysis pipeline, potentially including steps such as data normalization. All methods considered in the experiment have several parameters that can be set by the method user (e.g., the maximum number of clusters or the type of multiple testing correction), which we refer to as method parameters.
Moreover, we define the study design as the combination of all components that contribute to the performance assessment of the method of interest. The study design consists of three main components, namely data sets (real or simulated), competing methods (including their respective method parameters), and evaluation criteria (in our exemplary analysis tasks referring to the evaluation metric and the way the results are aggregated across the real data sets or simulation repetitions). Note that data pre-processing can be seen either as part of the method or as part of the data component. In our experiment, we consider pre-processing steps as belonging to the data component if performed for all methods, and as belonging to the method (i.e., the method of interest or the competing methods) otherwise.

Selection of the papers
As a preliminary step, we first have to select appropriate papers for both exemplary data analysis tasks that we consider in our experiment, namely cancer subtyping using multi-omic data and differential gene expression analysis. Both are applications from the field of biostatistics at the interface with bioinformatics. Apart from the requirement that the paper must introduce a new method, there are two eligibility criteria related to reproducibility: (i) the code to reproduce the results presented in the paper is publicly available and can be run without errors, and (ii) the code is written in R.
While the restriction to R as programming language (ii) excludes some papers, the majority of papers fail criterion (i). In many cases, authors only provide the code to use their method (e.g., an R package) but not to reproduce the results shown in the paper. In other cases, the link to the code is broken, the code supposedly included in the supplement cannot be found, or some of the files needed to reproduce the results are missing (e.g., the file containing the empirical data the simulation shown in the paper is based on). Note that we purposely refrain from contacting the authors if the code is not publicly available, to make the selection of papers independent of the authors' willingness to respond and provide the code. Although we do not restrict the number of papers to two per data analysis task in advance, the above-mentioned difficulties lead us to stop the search after finding two eligible papers, resulting in 2 × 2 = 4 papers included in our experiment. The conclusion of this informal search process is that the practice of making code and data openly available is far from being the standard in the methodological literature, beyond positive exceptions such as the Biometrical Journal (Hofner et al., 2016). The four papers included in our experiment (Nguyen et al., 2019; Rappoport and Shamir, 2019; Zhou et al., 2021; Osabe et al., 2021) should thus be seen as rare positive examples of open research practices in methodological research.

Design of the experiment
While all four papers evaluate their respective method extensively in various settings, our experiment includes only the results that (i) are presented as figures or tables and appear in the main paper, i.e., excluding the supplement (to keep the experiment feasible), and (ii) compare the method of interest to competing methods (since the considered papers use different evaluation metrics that do not allow a direct comparison of absolute performance values, we can only compare the relative performance of a method). If the results are based on both real and simulated data, we only consider the results of the data type that is predominantly employed in the paper. In some cases, we exclude further results, which is reported and justified in Sections 3.1 and 4.1.
For each of the four papers, we first compare the results obtained by running the available code to the results presented in the corresponding paper. We use the same R and R package versions and do not modify the code in a way that would change the results, even in cases where we notice discrepancies between the code and the procedure described in the paper (referred to as "design-implementation gap" by Lohmann et al., 2021, in the context of simulation studies). Exceptions to this rule are explicitly reported in Sections 3.2 and 4.2.
For both data analysis tasks, we then re-evaluate each method on the study design used by the authors of the other paper and compare the resulting performances. Our experiment can thus be seen as a "cross-design validation of methods" (see Table 1). As stated above, the study design consists of three main components, namely data sets, competing methods, and evaluation criteria. We also vary these components individually, which allows us to assess their individual impact on the performance of the selected methods. Some challenges arise when re-evaluating the methods on the new study design, in particular the choice of method parameters, which we set before viewing the performance results to avoid the risk of favoring one of the methods. Moreover, while we generally adhere to the code used to reproduce the results when "crossing" the designs, some modifications are necessary. Details on how we address these challenges for each data analysis task can be found in Sections 3.2 and 4.2.
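The resulting grid of study-design variants can be sketched as a simple enumeration (an illustrative Python sketch only; the experiment itself is implemented in R):

```python
from itertools import product

# Each of the three study-design components is taken either from the
# method's original paper or from the other ("crossed") paper,
# yielding 2^3 = 8 design variants per method.
components = ["data sets", "competing methods", "evaluation criteria"]
variants = [dict(zip(components, choice))
            for choice in product(["original", "crossed"], repeat=3)]
```

The first variant (all components original) reproduces the original paper, and the last one (all components crossed) is the fully crossed design.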
The R code and data to reproduce the experiment are openly available at https://doi.org/10.6084/m9.figshare.20754028.v1

3 Data analysis task I: Cancer subtyping using multi-omic data

The first exemplary data analysis task we consider in our experiment is cancer subtyping through clustering of patients based on multi-omic data, an active research field with many newly proposed methods in recent years (see Duan et al., 2021 for an overview). The aim of these methods is to identify clusters (in this context referred to as subtypes) with common biological characteristics or clinical phenotypes (e.g., survival time or drug response). This process helps to understand the etiology of the disease and to develop better diagnostic tools and personalized treatment strategies (Subramanian et al., 2020; Duan et al., 2021; Tepeli et al., 2020). Recently developed cancer subtyping methods are usually able to integrate multiple types of high-dimensional molecular data such as genomics, epigenomics, transcriptomics, or proteomics (hence the term multi-omic data; Subramanian et al., 2020). The two methods selected for our experiment are PINSPlus and NEMO, which were proposed by Nguyen et al. (2019) and Rappoport and Shamir (2019), respectively. Information on where to find the original code provided by the authors is listed in our code documentation. We will abbreviate Nguyen et al. (2019) and Rappoport and Shamir (2019) by N19 and R19.

Study design in the original papers
In the following, we outline and compare the study designs that are used to assess the performance of PINSPlus and NEMO in their respective original papers and that meet the inclusion criteria of our experiment (see Table 2 for an overview). We also report the authors' justifications of their design choices. For this purpose, we will also refer to Nguyen et al. (2017), which proposes PINS, the predecessor method of PINSPlus, and to Rappoport and Shamir (2018), a benchmark study, intended as neutral, that was previously conducted by the authors of NEMO. All results of NEMO's competing methods originate from this benchmark study, i.e., the results of NEMO were simply added to the results of the previously published benchmark study. Since both R19 and N19 mainly use real data sets to evaluate their methods, we do not further consider the simulation study presented by R19.
Data Both R19 and N19 use data sets from The Cancer Genome Atlas Research Network (TCGA; https://www.cancer.gov/tcga), where each data set corresponds to a different cancer type (e.g., kidney renal clear cell carcinoma or acute myeloid leukemia). The two author teams also consider the same three types of omic data (gene expression, methylation, miRNA expression), but use different numbers of data sets (34 in N19 vs. 10 in R19). Neither N19 nor R19 explicitly comment on the number of data sets and the selected cancer types, although 34 seems to be close to the maximum number of available data sets for the three considered types of omic data at the time of publication. Moreover, neither N19 nor R19 discuss their choice of omic data types, which seems to be a general issue in papers proposing new cancer subtyping methods, as criticized by Duan et al. (2021).
Although the ten cancer types included by R19 are also considered in N19, the corresponding data sets have different numbers of patients and omic variables. This is mainly caused by the different pre-processing steps applied by N19 and R19 (see supplement Section A.1 for details). In addition, the two papers probably also use different data set versions (it was not possible to identify the data version used by N19).
Note that N19 also consider two breast cancer data sets that do not originate from TCGA and exhibit different omic types. However, we exclude them from our experiment since some evaluation criteria of R19 require six clinical variables (see below), which are either not available or cannot be clearly identified for these two data sets. Moreover, we do not include the partial data sets (i.e., data sets where some patients do not have any measurements for one or more omic data types) used in R19 to demonstrate NEMO's ability to analyze this kind of data, since PINSPlus assumes complete data and would require potentially suboptimal imputation.

Table 2 Overview of the study design components used for performance assessment of PINSPlus and NEMO. Included are only components (i) for which the corresponding results are presented as figures or tables in the main paper (i.e., not in the supplement), (ii) that compare the method of interest to other competing methods, and (iii) that correspond to the performance assessment based on real data. In addition, some components are not included in the experiment, which are indicated by asterisks (*). Competing methods and evaluation criteria for data sets not included in the experiment are not shown. [Table columns: Study design component | PINSPlus (Nguyen et al., 2019) | NEMO (Rappoport and Shamir, 2019)]

Competing methods R19 and N19 use different numbers and types of competing methods to assess the relative performance of their proposed methods. While R19 use nine competing methods, N19 only consider three. The only method that is included in both papers is Similarity Network Fusion (SNF; Wang et al., 2014). The difference in the number of competing methods is not surprising given that the performance evaluation of NEMO is, in contrast to that of PINSPlus, based on a benchmark study with a focus on method comparison itself (Rappoport and Shamir, 2018). Such studies typically aim to compare as many methods as possible to generate comprehensive guidelines for method users. Interestingly, R19 include PINS, the predecessor method of PINSPlus, as a competing method. PINSPlus itself is not included since it did not exist yet when Rappoport and Shamir (2018) conducted their benchmark study. Concerning the choice of competing methods, Rappoport and Shamir (2018) report that they aim to represent diverse multi-omic clustering approaches, and that within each approach, they choose widely used methods with available software and clear usage guidelines. N19 refer to the selected competing methods as established subtyping methods.
Regarding the parameter selection of the competing methods, NEMO's authors state in Rappoport and Shamir (2018) that they choose the method parameters following the guidelines given by the authors of the respective method (which involves performing a parameter search if suggested) and construct parameter selection methods themselves if no guidelines are available. N19 make no comparable statement, except for the number of clusters for the method Consensus Clustering (Monti et al., 2003), which, as stated in Nguyen et al. (2017), is determined as suggested by Monti et al. (2003). For SNF, the only method that is considered as a competing method for both PINSPlus and NEMO, N19 and R19 both normalize the omic variables to have a mean of 0 and a standard deviation of 1 (which, as stated in Section 2.1, we consider as a method parameter since it is not applied for all methods in both papers). However, they choose different values for the number of neighbors (20 vs. number of samples/10), the number of iterations (10 vs. 30), the number of clusters (estimated according to eigen-gaps vs. rotation cost), and the maximum number of considered clusters (5 vs. 15). See Table 4 for the method-specific pre-processing steps, as well as N19 and R19 for all other parameters of the remaining methods.
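The shared normalization step applied before SNF (each omic variable centered to mean 0 and scaled to standard deviation 1) can be sketched as follows; this is a hypothetical Python helper for illustration, not the authors' R code:

```python
import numpy as np

def standardize_omics(x):
    """Center each column (omic variable) of a samples-by-variables
    matrix to mean 0 and scale it to (sample) standard deviation 1,
    mirroring the normalization both N19 and R19 apply before SNF."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
```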
Note that we have to exclude two competing methods (rMKL-LPP and MultiNMF) considered by R19 from the experiment since we are not able to run them (see supplement Section A.2 for details).
Evaluation criteria With regard to the evaluation criteria, N19 focus on the methods' ability to identify clusters with significant survival differences using the logrank test. Note that in this context, the logrank test is equivalent to performing a Cox regression (which is the term used by N19), but we will refer to it as logrank test since this seems to be the more commonly used term in cancer subtyping methodology. Nguyen et al. (2017) note that the same logrank test was also used by the authors proposing SNF (Wang et al., 2014), which can be seen as a justification for their choice. For each method, N19 highlight by color the data sets with significant (i.e., p < 0.05) and most significant (i.e., the smallest significant p-value across all methods) p-values.
In R19, the assessment of significant survival differences is also based on the logrank test. In addition, the authors assess "clinical enrichment" by testing the association between the identified clusters and six clinical variables (gender, progression of the tumor, cancer in lymph nodes, metastases, total progression, age at initial diagnosis), although not all variables are available in each clinical data set. R19 employ the χ² test for independence for discrete variables and the Kruskal-Wallis test for continuous variables, and additionally adjust for multiple testing using the Bonferroni correction. In contrast to N19, R19 estimate the logrank p-values using a permutation procedure, arguing that in the cancer subtyping context, the χ² distribution assumed for the logrank test statistic is often an inaccurate approximation and leads to increased type 1 errors (see supplement Section A.3 for details). The same applies to the χ² test for independence and the Kruskal-Wallis test. For each method, R19 aggregate the individual results of each data set by reporting the number of data sets with significant logrank p-values, the number of data sets with at least one enriched clinical variable, the mean −log10 logrank p-value, and the mean number of enriched clinical variables per data set. R19 thus consider four evaluation criteria regarding survival and clinical enrichment. Note that one of these criteria (the number of data sets with significant logrank p-values) is very similar to the criterion used by N19 (the number of data sets with [most] significant logrank p-values), the only differences being the estimation of the p-value (approximation-based vs. permutation-based) and the inclusion of the number of data sets with the most significant p-values as a second-order ranking criterion in N19.
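The permutation approach of R19 can be illustrated generically. The sketch below uses a simple mean-difference statistic in place of the logrank statistic (computing the latter would require survival times and censoring indicators), so it conveys only the principle, not the authors' implementation:

```python
import numpy as np

def permutation_p_value(stat_fn, values, labels, n_perm=1000, seed=0):
    """Estimate a p-value by repeatedly permuting the cluster labels:
    the p-value is the fraction of permutations whose statistic is at
    least as extreme as the observed one, with an add-one correction
    so that the estimate is never exactly 0."""
    rng = np.random.default_rng(seed)
    observed = stat_fn(values, labels)
    exceed = sum(
        stat_fn(values, rng.permutation(labels)) >= observed
        for _ in range(n_perm)
    )
    return (exceed + 1) / (n_perm + 1)

def mean_difference(values, labels):
    """Toy test statistic: absolute difference in group means."""
    values, labels = np.asarray(values), np.asarray(labels)
    return abs(values[labels == 0].mean() - values[labels == 1].mean())
```

Replacing `mean_difference` with a logrank statistic computed on survival data yields the permutation-based logrank p-value that R19 report.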
In addition to analyzing survival differences and clinical enrichment, R19 also report the number of clusters and the runtime of each method. However, we do not consider these criteria in our experiment, since the number of clusters has no clear optimal value and the runtime is not comparable due to different computational resources.
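As a minimal sketch of how R19's survival-related criteria aggregate per-data-set results, consider a vector of logrank p-values (the numbers below are made up for illustration, not results from the experiment):

```python
import numpy as np

def aggregate_survival_criteria(p_values, alpha=0.05):
    """Two of R19's four aggregation criteria: the number of data sets
    with significant logrank p-values and the mean -log10 p-value
    across data sets. Illustrative sketch, not the authors' R code."""
    p = np.asarray(p_values, dtype=float)
    return {
        "n_significant": int(np.sum(p < alpha)),
        "mean_neglog10_p": float(np.mean(-np.log10(p))),
    }

# Hypothetical p-values for one method across three data sets:
result = aggregate_survival_criteria([0.001, 0.2, 0.04])
```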

Challenges when conducting the experiment
Reproducibility The results presented in N19 are fully reproducible, except for one p-value of iCluster+. In contrast, the results presented in R19 cannot be exactly reproduced. Besides the two methods that cannot be run at all, the performance results of the remaining methods are slightly different compared to the original paper, especially for the clinical enrichment criteria (the difference between the original and reproduced results with regard to NEMO's performance is reported in Section 3.3). Interestingly, 76 of the 80 clustering solutions (8 methods × 10 data sets) are equal to the clustering solutions provided by Rappoport and Shamir (2018), with two of the remaining four solutions only differing in one and three individuals, respectively. This means that the reproducibility problems (also observed for some of the 76 settings yielding identical clustering solutions) might be caused by the permutation tests. Moreover, the provided code is probably not the exact code used by R19, as indicated by the fact that R19 refer to Rappoport and Shamir (2018) for the code to reproduce the results, but also mention that the implementations for MCCA, LRAcluster, and k-means were slightly changed compared to Rappoport and Shamir (2018).
When reproducing the results, we do not modify the code provided by the authors in a way that would change the results. However, we have to set a different number of cores in some settings and use a different R version for running the permutation tests by R19 due to different computational resources (see our code documentation for details), which may contribute to the reproducibility issues.
Crossing the designs Evaluating the performance of PINSPlus and NEMO using each other's data sets, competing methods, and evaluation criteria poses a number of challenges. The most important one is the choice of parameters, both for the two methods of interest, PINSPlus and NEMO, and for the competing methods. Whenever a method is applied to a new (set of) data set(s), the method user needs to carefully select its parameters or a corresponding parameter selection method, which of course also applies to our experiment. Since both N19 and R19 use the same three types of omic data from the same source (TCGA), we set the parameters of PINSPlus and NEMO as in their respective original papers, which corresponds to their default parameter settings. Note that we also do not change the range of possible values for the number of clusters, a parameter that can be specified for both methods and is set to [2, 5] for PINSPlus and to [2, 15] for NEMO. We also attempt to use the same parameters for the competing methods when applying them to the new data sets. However, for two competing methods of N19 (iCluster+ and Consensus Clustering), the optimal number of clusters has to be selected by the user according to plots generated by the method when run on a specific data set. When applying these two methods to the data sets by R19, we thus have to manually choose the optimal number of clusters for every data set, and although we try to imitate the decisions of N19 on their data sets, a clear determination is not always possible (an issue that is also noted by Duan et al., 2021). Moreover, some refinements regarding the method-specific pre-processing steps are necessary for two competing methods of R19 (see Section A.1 in the supplement).
In addition to the choice of method parameters, some challenges arise when applying the evaluation criteria by R19 to the data sets by N19. Specifically, the logrank permutation test by R19 does not converge for some methods on two data sets by N19, resulting in a p-value of 0 in 15 method-data combinations. In these cases, we use the approximation-based logrank p-values. Moreover, clustering solutions resulting from the data set UCS (N19) are not tested for clinical enrichment (R19), since it only includes one of the six clinical variables ("gender") with only one value ("female").

Results
Performance based on the original study design The upper panels of Figures 1 and 2 show the reproduced performance results of PINSPlus and NEMO based on their original study designs. Note that the representation in these figures slightly differs from the original papers to achieve a comprehensive and yet clear summary of the results. The most important difference is that the papers also report the individual performance results for each data set (we provide the individual performance results in Tables 6 and 7 in the supplement).
When evaluated based on its original design, PINSPlus seems to be clearly superior to the three competing methods. It achieves the largest number of significant p-values (p < 0.05) regarding survival, with 21 of its 25 significant p-values being the smallest across all methods. NEMO also shows good performance in its original study design, although its superiority over the competing methods is not as clear as that of PINSPlus. It achieves the highest numbers of data sets with significantly different survival and with at least one enriched clinical variable (although two competing methods achieve the same number of data sets with clinical enrichment). Moreover, none of the competing methods achieves both a higher mean −log10 logrank p-value and a higher mean number of enriched clinical variables. Only MCCA obtains a higher mean −log10 logrank p-value than NEMO, but has a lower mean number of enriched clinical variables. Note that despite the reproducibility issues, both the absolute performance of NEMO (i.e., the values of the four evaluation criteria considered by R19) and its relative performance (i.e., when comparing these values to the competing methods) correspond to the results shown in the original paper. The only difference affecting the relative performance of NEMO is that in the original paper, one of the two methods that could not be reproduced (rMKL-LPP) achieves a higher mean number of enriched clinical variables than NEMO but a lower mean −log10 logrank p-value.
Performance based on the crossed design The performance results of PINSPlus and NEMO based on each other's study design (i.e., data sets, competing methods, and evaluation criteria) are presented in the lower panels of Figures 1 and 2. In the study design of R19, PINSPlus does not outperform the competing methods. It is only the fourth and sixth best method with regard to the mean number of enriched clinical variables and the mean −log10 logrank p-value, respectively. It belongs to the three worst methods with regard to the number of data sets with significantly different survival and only outperforms PINS, its predecessor method, with regard to the number of data sets with at least one enriched clinical variable. In contrast, NEMO still outperforms the competing methods in the design by N19, although its superiority is not as pronounced as that of PINSPlus in the same design (PINSPlus achieves 25 significant p-values while NEMO only achieves 16 for the same 34 data sets).
We also analyze the performance of PINSPlus and NEMO when data sets, competing methods, and evaluation criteria are varied individually. Figure 3 shows the resulting ranks of PINSPlus and NEMO for all eight combinations of the three components, where each component can either be set to the original or the crossed version (2³ = 8). For each criterion (one by N19 and four by R19), a rank of 1 corresponds to the best method. If more than one method achieves the same value for a certain criterion, the minimum, maximum, and average rank are reported. As can be seen from Figure 3, the ranks of PINSPlus and NEMO generally vary for each combination of data sets, competing methods, and evaluation criterion. Apart from its original design, PINSPlus achieves rank 1 for the evaluation criteria related to survival (i.e., the number of [most] significant p-values and the mean −log10 logrank p-value) in all combinations where the data sets by N19 are used. However, PINSPlus belongs to the worst performing methods according to survival when applied to the data sets by R19. As mentioned in Section 3.1, the ten data sets corresponding to different cancer types that are used by R19 are also included in N19. Interestingly, PINSPlus achieves a significant p-value for nine of these ten data sets in N19, indicating that the difference in performance for these data sets is mainly due to the different pre-processing steps. With regard to the clinical evaluation criteria, PINSPlus seems to have average performance, neither clearly performing better nor worse than the other methods.

[Figure 3: Ranks of PINSPlus and NEMO for all combinations of the study design components. If more than one method achieves the same value for a certain criterion, the minimum, maximum, and average rank are reported. The black lines correspond to the number of compared methods, i.e., the highest possible rank.]
In comparison to PINSPlus, the ranks of NEMO are more robust across the different study designs. For six of the eight study designs, it achieves rank 1 or 2 for all evaluation criteria (if the minimum or average rank is considered). The only study design where NEMO's performance is considerably worse for two evaluation criteria is the design where only the data sets are taken from N19, while the evaluation criteria and competing methods correspond to the original paper. Moreover, it can be noted that while the slightly different calculation of the number of data sets with significant logrank p-values in N19 and R19 does not have an impact on the ranks of PINSPlus, NEMO tends to achieve better ranks for the version of N19. For example, it achieves rank 1 instead of 2 for the settings where the data and competing methods are by N19. A comparison of approximation-based and permutation-based p-values for all methods and data sets can be found in the supplement (Figure 7), showing that the approximation-based p-values are indeed generally smaller. The supplement also provides a comparison of the two different parameter settings of SNF that are specified by N19 and R19 (Figure 8), which reveals a considerable but non-systematic performance difference between the two implementations.
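The tie handling used in Figure 3 (reporting minimum, maximum, and average ranks when methods achieve equal criterion values) can be sketched as follows; a toy helper, not the code used in the experiment:

```python
def tie_ranks(scores):
    """Rank a list of criterion values (higher = better, rank 1 = best).
    Tied values share a block of ranks, for which the minimum, maximum,
    and average rank are reported."""
    ranks, pos = {}, 1
    for v in sorted(set(scores), reverse=True):
        n = scores.count(v)
        ranks[v] = {"min": pos, "max": pos + n - 1, "avg": pos + (n - 1) / 2}
        pos += n
    return [ranks[s] for s in scores]
```

For example, with the hypothetical scores [5, 3, 3, 1], the two tied methods share ranks 2 and 3 and receive the average rank 2.5.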

Data analysis task II: Differential gene expression analysis
The second data analysis task we consider in our experiment is differential gene expression analysis, which aims at identifying genes that show differences in their expression levels between two or more conditions (Soneson and Delorenzi, 2013). Of the many methods that have been proposed for this task (Seyednasrollah et al., 2013), the more recent ones usually expect RNA-seq data as input, which means that gene expression is measured as non-negative counts (Rigaill et al., 2018). The two methods for differential expression analysis included in the experiment are SFMEB and MBCdeg, which were recently proposed by Zhou et al. (2021) and Osabe et al. (2021), respectively, and require RNA-seq data as input. As stated in Section 2, these papers were selected because they make the code to reproduce the results openly available (information on where the code can be found is reported in our code documentation). We will abbreviate them as Z21 and O21 in the following.

Study design in the original papers
In this section, we review the data sets, competing methods, and evaluation criteria that are used to assess the performance of SFMEB and MBCdeg in their respective original paper and that meet the inclusion criteria of our experiment (see Table 3 for an overview). We also report the justifications of the design choices provided by the authors. Since Z21 and O21 primarily use simulated data to evaluate their methods, we do not further consider their real data analyses.
Data Both Z21 and O21 generate simulated count data representing RNA-seq read counts of p genes in 2 × n_obs samples from two groups. O21 also simulate count data from three groups, but we exclude these settings from the experiment because SFMEB does not seem to be intended for this type of data (all evaluations in the original paper by Z21 are based on two-group data). The simulation frameworks of Z21 and O21 are based on different code implementations (code by Robinson and Oshlack, 2010, and the compcodeR R package, Soneson, 2014, vs. the TCC R package, Sun et al., 2013) as well as different distributions to generate the count data (Poisson and negative binomial distribution vs. only negative binomial distribution). Moreover, the two papers choose different numbers of simulation repetitions (20 vs. {50, 100}), different sample sizes per group ({1, 2, 5, 8} vs. 3), and different numbers of genes ({15000, ..., 29800} vs. 10000).
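A minimal sketch of this kind of two-group count simulation is given below. This is not the actual compcodeR or TCC code; the mean, dispersion, and fold-change parameters are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_counts(p=1000, n_obs=3, prop_de=0.1, log2_fc=2.0, dispersion=0.2):
    """Simulate an RNA-seq count matrix (p genes x 2*n_obs samples, two groups).

    Counts are drawn from a negative binomial distribution with gene-wise
    means; the first `prop_de` fraction of genes is up-regulated in group 2.
    """
    base_mean = rng.lognormal(mean=4.0, sigma=1.0, size=p)  # illustrative means
    mean_g1 = base_mean
    mean_g2 = base_mean.copy()
    n_de = int(prop_de * p)
    mean_g2[:n_de] *= 2.0 ** log2_fc  # DE genes up-regulated in group 2

    def draw(mu):
        # NB parameterized via mean mu and dispersion: var = mu + dispersion*mu^2
        n = 1.0 / dispersion
        prob = n / (n + mu[:, None])
        return rng.negative_binomial(n, prob, size=(len(mu), n_obs))

    counts = np.hstack([draw(mean_g1), draw(mean_g2)])
    is_de = np.zeros(p, dtype=bool)
    is_de[:n_de] = True
    return counts, is_de

counts, is_de = simulate_counts()
print(counts.shape)  # (1000, 6): 1000 genes, 3 samples per group
```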
In contrast to O21, Z21 apply pre-filtering of the genes (e.g., filtering of genes with mean count ≤ 2) for all methods, although some of their considered methods additionally filter genes internally. Moreover, in some settings, Z21 consider heterogeneous data composed of two data sets with different simulation parameters (log2 fold-change, number of genes, etc.). In the results included in the experiment, O21 only vary the proportion of DE genes and the proportion of up-regulated genes, but in a fully factorial manner, which results in 6 × 4 = 24 simulation settings. However, it should be noted that O21 also vary other parameters (e.g., the log2 fold-change) in settings not considered in our experiment since they did not meet the inclusion criteria (e.g., because the corresponding figures are shown in the supplement). In the simulation settings by Z21 that are included in our experiment (15 settings in total), more parameters are varied, but not in a fully factorial manner. More specifically, the 15 included settings originate from five "studies" (each consisting of 3 settings) with different simulation parameters. Within each study, one simulation parameter is varied (see Table 3).
Understandably, neither Z21 nor O21 provide a justification for every single simulation parameter, but they often refer to similar parameter values observed in real data. Regarding the choice of the number of simulation repetitions, however, neither of the two papers provides a justification. As criticized by Morris et al. (2019), this seems to be a general issue in papers presenting simulation studies.
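One way to justify the number of repetitions, following the logic of Morris et al. (2019), is to check the Monte Carlo standard error of the aggregated performance estimate. A minimal sketch with simulated AUC values follows; the numbers are illustrative, not taken from Z21 or O21:

```python
import math

def monte_carlo_se(values):
    """Monte Carlo standard error of the mean performance over n_sim repetitions."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(var / n)

# illustrative AUC values from 20 hypothetical simulation repetitions
aucs = [0.82, 0.79, 0.85, 0.81, 0.78, 0.84, 0.80, 0.83, 0.77, 0.86,
        0.81, 0.79, 0.82, 0.84, 0.80, 0.83, 0.78, 0.85, 0.81, 0.82]
print(round(monte_carlo_se(aucs), 4))
```

If the resulting standard error is small relative to the performance differences of interest, the chosen number of repetitions can be considered sufficient; otherwise more repetitions are needed.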
Competing methods Z21 compare SFMEB with five competing methods they consider as widely used. Two of these methods are referred to as edgeR and DESeq (Robinson et al., 2009; Anders and Huber, 2010; see below for more details), which closely corresponds to the methods selected by O21 (edgeR and DESeq2; Love et al., 2014). In addition to these two methods, O21 also consider the less well-known method TCC (Sun et al., 2013), arguing that it is not sufficient to compare a newly proposed method to the most commonly used methods (edgeR and DESeq2) as those might not be the ones best suited for the analysis. Moreover, they see TCC as the main alternative to their proposed method since the normalization algorithm used by TCC corresponds to the normalization algorithm used by one version of MBCdeg.
Interestingly, Z21 and O21 use different implementations of edgeR. While the implementation by O21 corresponds to one of the edgeR standard workflows, Z21 use three different implementations of edgeR across their simulation settings, of which only one would typically be considered as edgeR (but still with different parameters than O21), while the other two are only edgeR-like. One reason for this choice is that some simulation settings in Z21 do not have biological replicates (i.e., n_obs = 1 in each group), for which the standard edgeR implementation yields an error (see supplement Section B.2 for details). Regarding the implementation of DESeq/DESeq2, Z21 actually use both DESeq and DESeq2, although they generally refer to the method as DESeq, the predecessor of DESeq2. This might be explained by the fact that, similar to edgeR, DESeq2 is not intended for settings without biological replicates and thus yields an error, which is why Z21 use DESeq in these settings. Note that it has been shown that DESeq and DESeq2 perform differently (Love et al., 2014). Both Z21 and O21 use the same parameters for DESeq2. For the parameters of the remaining methods, see Z21 and O21 as well as the referenced code.
Evaluation criteria Both Z21 and O21 assess the methods' ability to correctly identify DE genes using the area under the receiver operating characteristic curve (AUC). They both justify this decision with the fact that the AUC, unlike other popular measures, does not require the choice of a threshold value. The AUC takes values from 0 to 1, where 1 corresponds to perfect discrimination of DE and non-DE (i.e., non-differentially expressed) genes, and 0.5 corresponds to random assignment. However, due to an unfortunate default option in the R package used by Z21 to calculate the AUC, the resulting AUC values are 1 minus the correct AUC for some repetitions (see supplement Section B.1 for details). Apart from the different R packages used to calculate the AUC, Z21 also employ a smoothed ROC curve to estimate the AUC in some of their simulation settings (study 5), which can lead to slightly different results compared to the non-smoothed ROC curve. Regarding the aggregation of AUC values across the simulation repetitions, both Z21 and O21 use boxplots.
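The pitfall can be illustrated with a plain rank-based AUC computation (a generic sketch, not the R package used by Z21): if the comparison direction is flipped, the result is exactly 1 minus the correct AUC.

```python
def auc(scores_pos, scores_neg):
    """AUC via the Mann-Whitney statistic: P(score of a DE gene > score of a
    non-DE gene), with ties counted as 1/2."""
    total = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                total += 1.0
            elif sp == sn:
                total += 0.5
    return total / (len(scores_pos) * len(scores_neg))

# hypothetical scores (e.g., -log10 p-values) for DE and non-DE genes
de = [3.1, 2.4, 1.9, 2.8]
non_de = [0.4, 1.1, 0.9]

correct = auc(de, non_de)
flipped = auc(non_de, de)  # wrong comparison direction
assert abs(correct + flipped - 1.0) < 1e-12  # flipped AUC = 1 - correct AUC
```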

Challenges when conducting the experiment
Reproducibility When reproducing the results presented in O21 and Z21, we do not modify the original code in a way that would change the results, with one exception: we change the number of simulation repetitions from 10 to 20 (i.e., the number reported in the paper) in the code provided by Z21, since the results using 20 repetitions are more similar to the results shown in Z21 (note that we make this change before crossing the designs). As stated in Section 2, we also use the same R and R package versions as in the original papers. However, Z21 do not provide this information, which is why we use the most recent package versions available when conducting the experiment (see our code documentation for the exact version information). The code by Z21 also does not include a random number seed, which we accordingly set ourselves but which is most likely different from the seed used by Z21. Note that for reproducing the results of Z21, we use their AUC implementation potentially yielding incorrect results, but additionally calculate the correct version.
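A minimal sketch of the kind of bookkeeping that makes such reproduction easier, namely recording a seed and the software versions alongside the results (the seed value and file name here are hypothetical, not those of Z21 or O21):

```python
import json
import platform
import random

SEED = 20240101  # hypothetical seed; Z21 do not report one
random.seed(SEED)

def session_info():
    """Record the minimal information needed to rerun the analysis."""
    return {
        "seed": SEED,
        "python": platform.python_version(),
        "platform": platform.platform(),
    }

# store the session information next to the results
with open("session_info.json", "w") as f:
    json.dump(session_info(), f, indent=2)
```

In R, the same purpose is served by `set.seed()` and saving the output of `sessionInfo()`.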
Based on these modifications, running the code of Z21 and O21 results in very similar but not exactly the same boxplots as shown in the original papers. More specifically, the relative performance of each method is the same in the original and reproduced versions, but some boxplots have, e.g., different outliers. For Z21, this relatively high degree of reproducibility is noteworthy considering the fact that the provided code does not include a seed or version information. The only three settings that do not yield similar results are the settings from study 5 by Z21 (the differences between the original and reproduced results are described in Section 4.3). Apart from the aforementioned missing seed and version information, the differing results in these settings could be due to the fact that the code might not have been provided in its final version.
Crossing the designs As already stated in the first example on cancer subtyping, conducting the cross-design experiment implies that all considered methods are applied to new data sets (new in the sense that these data sets have not been included in the original paper). It is thus necessary to carefully specify the method parameters of SFMEB, MBCdeg, and all competing methods. Although the simulation settings of Z21 and O21 are less comparable than the real data sets of N19 and R19 in the cancer subtyping example, we nevertheless adopt the parameter values from the original papers because we consider the risk of running the methods with suboptimal parameter settings to be lower for the parameters used by Z21 and O21 than for parameters selected by ourselves (especially because we select the parameters before seeing the results to avoid the risk of favoring one of the methods, as stated in Section 2.3). However, both Z21 and O21 consider more than one parameter value for some methods, and Z21 even use different methods across the simulation settings (i.e., DESeq and DESeq2). For all methods evaluated in Z21 (i.e., SFMEB and its competing methods), we adopt the parameters from study 5 since they are the most similar to the simulation settings considered in O21 (i.e., non-heterogeneous data, generated using the negative binomial distribution, with replicates). In all simulation settings of O21 included in our experiment, the authors evaluate two versions of MBCdeg, which are denoted as MBCdeg1 and MBCdeg2 and correspond to two different normalization options. Since MBCdeg1 and MBCdeg2 are also implemented separately in the code, we include both versions in the experiment but decide, before seeing any results, to focus on MBCdeg2, which was observed to be slightly more stable and accurate in O21. Although O21 do not vary any other parameters of MBCdeg or the competing methods, we note that the main parameter of MBCdeg that is extensively discussed by O21 might not be ideal for some simulation settings of Z21. We thus conduct a sensitivity analysis using two different values for this parameter (see Section B.3 for details).
Since O21 and Z21 use the same evaluation criterion (i.e., boxplots representing the AUC of each simulation repetition), we only re-evaluate the performance of SFMEB and MBCdeg on each other's competing methods and data. When crossing the designs, we do not consider the AUC that is based on the smoothed ROC curve used by Z21 in some simulation settings. Of course, we also do not use the incorrectly calculated version of the AUC.
Note that not all design components of Z21 and O21 are compatible. More specifically, the DESeq2 and edgeR implementations in O21 result in an error when applied to the simulation settings without biological replicates by Z21. As stated in Section 4.1, this is because DESeq2 and edgeR are not intended for settings without biological replicates and O21 do not use a (possibly non-ideal) workaround solution as done by Z21.

Results
Performance based on the original study design The upper panels of Figures 4 and 5 show the reproduced performance results of SFMEB and MBCdeg2 with an additional dashed line corresponding to the median AUC of the respective method of interest over all simulation repetitions. Note that the method labels are adopted from the original papers although, as discussed above, the competing methods DESeq and edgeR in Z21 do not exactly correspond to the actual method in some simulation settings.
For SFMEB, we show both the reproduced AUC values that are potentially biased towards higher values and the correct AUC values. As stated in the previous section, we only observe a noteworthy performance difference between the reproduced results and the results shown in Z21 for three simulation settings (i.e., study 5). In these settings, two competing methods consistently show better performance in the reproduced version, leading to SFMEB being the second best instead of the best performing method in two settings. However, these differences become irrelevant when looking at the correct AUC results. In fact, only the AUC values of the competing methods are affected by the incorrect AUC calculation in some settings, resulting in SFMEB outperforming its competing methods more clearly than initially claimed by its authors. The performance results of SFMEB based on the corrected AUC values are thus still consistent with the conclusion of Z21 that SFMEB outperforms its competitors in most settings (achieving rank 1 according to median AUC in 13 out of 15 settings).
MBCdeg2 also performs well in its original study design. As noted by O21, the method tends to achieve higher AUC values in the settings with a small (≤ 0.45) proportion of DE genes. In some settings where the proportion of DE genes is ≥ 0.55, however, the method seems to fail, often resulting in AUC values below 0.25 and not being able to outperform any of its competing methods (the same applies to MBCdeg1). O21 discuss the occasional failure of MBCdeg extensively and conclude that the identification of the non-DE gene cluster (which they state to be the key to the proposed framework) fails in these cases, which leads to an incorrect classification of DE and non-DE genes. However, MBCdeg2 generally performs better than the competing method TCC in settings where TCC performs well (the same applies to MBCdeg1). Since TCC could be expected to outperform the other methods (the data sets are generated using the TCC R package, and the normalization algorithm used by TCC was designed for settings with asymmetric, i.e., ≠ 0.5, up-regulation as considered by O21), O21 see this result as the main contribution of their study.
Performance based on the crossed design The lower panels of Figures 4 and 5 display the performance results of SFMEB and MBCdeg2 based on each other's simulation data and competing methods. In the study design of O21, SFMEB generally shows worse performance than in its original design, having lower median AUC values than all of its competitors in 17 out of 24 settings. However, in 5 out of the remaining 7 settings (the settings with a high proportion of DE genes that are mostly up-regulated in one group), SFMEB clearly outperforms the competing methods. Interestingly, this difference in relative performance is mainly caused by the varying AUC values of the competing methods edgeR, DESeq2, and TCC. SFMEB itself, on the other hand, shows very robust AUC values across all settings. However, with a median AUC of about 0.65 in each setting, SFMEB's absolute performance is worse than in the original study, where the lowest median AUC of SFMEB is 0.72.
Similar to SFMEB, MBCdeg2 generally performs worse than in its original design. In 10 out of 15 settings, it is outperformed by all competing methods. However, it is the second best method in 4 of the remaining 5 settings (based on median AUC). In contrast to SFMEB, its absolute performance varies more across the settings and only reaches a value comparable to the original design (excluding the settings where the method failed) in two settings. Similar to its original design, there are four settings where MBCdeg2 shows extremely low AUC values, which again seems to be caused by the incorrect identification of the non-DE cluster (note that these are all settings where the proportion of DE genes is ≥ 0.6, which is consistent with O21's observation in the original paper). As stated in Section 4.2, we also conduct a sensitivity analysis where MBCdeg2's main parameter is set to a different value. However, this does not improve the AUC values (see Figure 9 in the supplement).
Figure 6 shows the resulting performance ranks of SFMEB and MBCdeg2 when data and competing methods are varied individually. For both SFMEB and MBCdeg2, the data sets and competing methods can either be set to their original or the crossed version, which results in four (2² = 4) different study designs. Within each study design, the ranks are calculated separately for each simulation setting based on the median AUC and are summarized as boxplots (i.e., each boxplot consists of 15 or 24 ranks, which corresponds to the total number of settings considered by Z21 and O21, respectively). All AUC values are calculated correctly. For both SFMEB and MBCdeg2, the performance mainly depends on which simulated data sets are considered. In contrast, using different competing methods has no considerable impact on the distribution of ranks, except that the maximum possible rank, reflecting the worst method, varies according to the number of competing methods. This is also due to the partial overlap of competing methods in both designs.
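The per-setting rank computation described here (rank 1 = best method by median AUC across repetitions) can be sketched as follows; the AUC values in the example are hypothetical:

```python
import statistics

def ranks_by_median_auc(auc_per_method):
    """Rank methods (1 = best) by their median AUC across simulation repetitions."""
    medians = {m: statistics.median(v) for m, v in auc_per_method.items()}
    ordered = sorted(medians, key=medians.get, reverse=True)
    return {m: ordered.index(m) + 1 for m in ordered}

# hypothetical AUC values over repetitions in one simulation setting
setting = {
    "MBCdeg2": [0.91, 0.88, 0.90],
    "edgeR": [0.86, 0.89, 0.87],
    "DESeq2": [0.84, 0.85, 0.83],
}
print(ranks_by_median_auc(setting))
```

Repeating this over all 15 or 24 settings of a study design yields the rank distributions summarized by one boxplot in Figure 6.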
The results of MBCdeg1 based on the crossed design are very similar to the results of MBCdeg2 and can be found in the supplement (Figure 10).

[Figure 6 caption: Ranks of SFMEB and MBCdeg2 when data and competing methods are either original (SFMEB: Zhou et al., 2021; MBCdeg2: Osabe et al., 2021) or crossed (SFMEB: Osabe et al., 2021; MBCdeg2: Zhou et al., 2021). Each boxplot consists of 15 or 24 ranks (additionally represented as black points), which corresponds to the number of data settings considered by Zhou et al. (2021) and Osabe et al. (2021), respectively. The ranks of SFMEB and MBCdeg2 in each data setting are calculated based on the median AUC value across all simulation repetitions. The black lines correspond to the number of compared methods, i.e., the highest possible rank. Note that for one combination of data and competing methods, not all competing methods can be applied to all settings, which is why the number of compared methods varies between 2 and 4.]
Table 3 Overview of the study design components used for the performance assessment of SFMEB and MBCdeg. Included are only components (i) for which the corresponding results are presented as figures or tables in the main paper (i.e., not in the supplement), (ii) that compare the method of interest to other competing methods, and (iii) that correspond to the performance evaluation based on simulated data. In addition, some components are not included in the experiment, which are indicated by asterisks (*). Competing methods and evaluation criteria for data sets not included in the experiment are not shown. In case of design-implementation gaps, the table refers to the code for reproducing the results.

Summary of results and limitations
In this paper, we conducted a systematic experiment, which we refer to as "cross-design validation of methods" and in which we re-evaluated methods based on the data sets, competing methods, and evaluation criteria of a paper proposing a method for the same data analysis task. We considered two exemplary data analysis tasks, namely cancer subtyping using multi-omic data and differential gene expression analysis. For each analysis task, we selected two methods: PINSPlus (Nguyen et al., 2019) and NEMO (Rappoport and Shamir, 2019) for cancer subtyping, and SFMEB (Zhou et al., 2021) and MBCdeg (Osabe et al., 2021) for differential expression analysis.
Although we did not conduct our cross-design validation experiment on a large scale, several interesting findings emerged. First, the difficulties in finding eligible papers showed that many papers are still being published without openly available code for reproduction. For the papers that were selected, running the provided code did not yield the exact same results as presented in the respective paper. The results of PINSPlus, however, were close to being fully reproducible, with only one differing p-value in one of the competing methods. Nevertheless, the reproduced results of all four methods were consistent with the conclusion of the original papers that the respective method shows good performance.
Second, the experiment concretely illustrated the researcher degrees of freedom regarding the performance assessment of a method. Notably, all four study designs seemed well thought-out, and the authors provided justifications in most cases. Interestingly, even for the design components that were similar in both papers, the exact implementation was often different. For example, SNF and edgeR were included as competing methods in both papers of the cancer subtyping and differential expression analysis tasks, respectively, but were run with different parameters.
Third, the experiment showed how differences in the study design can affect the performance of a method. Three out of the four considered methods (PINSPlus, SFMEB, and MBCdeg) performed worse when assessed on the crossed study design. Only one method, NEMO, performed well when evaluated on the study design by PINSPlus and only showed slightly worse performance in some settings where data sets, competing methods, and evaluation criteria were varied individually. For both analysis tasks, using different data sets (real or simulated) had the largest impact on the performance results, which was particularly surprising for the real data sets of the cancer subtyping example, where both papers used the same data type and source.
It is important to note that while the findings of our experiment might help to see the performance reported in the original papers from a different perspective, they cannot be seen as evidence of any of the four methods generally having good or bad performance. First, our experiment is limited in the sense that we did not include all study designs and corresponding results reported in the papers, which gives an incomplete picture regarding the study design of the papers and, importantly, the individual strengths and weaknesses of each method. This also includes qualitative evaluation criteria such as PINSPlus' user-friendliness regarding the choice of the number of clusters (which was also noted by Duan et al., 2021), NEMO's simplicity and support of partial data, the avoidance of potentially error-prone data normalization when using SFMEB, and the high interpretability of MBCdeg's main parameter. Second, the method performances observed in the experiment clearly depend on (i) our own expertise regarding each method, and (ii) the respective new design we re-evaluated each method on (in principle, the 2 × 2 table in Section 2 [Table 1] could be extended to a K × K table by including further designs, and each of them would probably yield substantially different results). This can be seen as a limitation of our study.

Mechanisms leading to an optimistic performance evaluation and possible solutions
In our experiment, we observed that three out of four methods performed worse when evaluated on a new study design, which seems to be consistent with the general concern that the performance of newly proposed methods is over-optimistic (Buchka et al., 2021; Norel et al., 2011; Boulesteix et al., 2013). Although a "sample size" of four and the re-evaluation of the methods on only one new design do not allow generalization of these results (neither for the considered methods nor for methodological research in general), the experiment provides insights into the mechanisms that can lead to performance differences between original and subsequent studies. In the following, we will discuss four of these mechanisms, which have either been addressed frequently in the literature or are rarely mentioned in the literature but seem to have been present in our experiment. In addition, we point to possible solutions that can help to avoid large performance discrepancies between original and subsequent studies.
Overfitting of study design to method Our experiment illustrated the many degrees of freedom existing in the assessment of a method's performance. This flexibility can tempt researchers to choose the study design in favor of their proposed method. This may happen both at the planning stage, when researchers primarily select a study design in which their method is expected to perform well (e.g., leaving competing methods at their default parameters), and after seeing the results, when they add and/or omit certain design components (e.g., simulation parameters or evaluation criteria; Ullmann et al., 2022; Nießl et al., 2022; Pawel et al., 2022). Focusing on advantageous designs at the planning stage is not necessarily a questionable research practice but becomes problematic if not clearly stated. Changing the study design after seeing the results may be legitimate in some cases as far as it is transparently reported, for example if the originally chosen evaluation criterion turns out to behave inadequately for all methods. But changing the study design is bad practice if it is performed in a cherry-picking fashion, i.e., excluding or including results depending on whether they convey the expected message or not. The "overfitting" of the study design to the method increases the risk of obtaining different, less optimistic conclusions in a subsequent comparison study in which the authors have fewer incentives to present the corresponding method in a favorable light. As already noted by Simmons et al. (2011) in the context of applied research, such optimizations most often do not reflect malicious intent. Instead, they are usually the result of self-serving interpretations of ambiguity convincing honest researchers that the decisions (in our case, regarding the study design) matching their expectations and hopes are the most appropriate ones for various other reasons. These mechanisms are certainly encouraged by publication pressure and publication bias (Boulesteix et al., 2017). Selective reporting after seeing the results can be largely avoided by pre-registering study designs and documenting all changes that have to be made subsequently (Morris et al., 2019; Pawel et al., 2022). However, pre-registration does not prevent authors from selecting advantageous designs from the start when planning their study. This pitfall could be avoided by adapting the designs from previous studies conducted by different authors; however, this might not always be suitable to demonstrate all features of the method.
For the papers considered in our study, we do not assume that any components regarding the data sets, competing methods, or evaluation criteria have been optimized to make the corresponding method of interest appear better than it actually is. On the other hand, we cannot completely rule out this possibility, although it is especially unlikely for NEMO, which was evaluated using a study design adopted from a previously conducted comparison study (Rappoport and Shamir, 2018), similar to pre-registration where the design is fixed in advance.
Overfitting of method to study design Just as the study design can be "overfitted" to the method of interest, the method of interest can also be "overfitted", i.e., over-optimized, to the study design. This was already noted by Jelizarow et al. (2010) and Ullmann et al. (2022) with a focus on overfitting to the considered data sets. Since method development is, in itself, an optimization process that usually consists of several improvements after seeing the performance results, it is difficult to determine the point where further optimization amounts to overfitting the method to the design used for performance assessment. Note that this issue not only concerns the method characteristics that are not intended to be changed by the user, but also the parameters that can be set by the method user and whose optimal values for specific applications might also be overfitted to the considered study design (Ullmann et al., 2022; Pawel et al., 2022).
To avoid overfitting of the method of interest to the study design, evaluating the method extensively is recommended. This includes using a large number of data sets and/or simulation settings and several evaluation criteria, as well as checking the robustness of the method with respect to small changes in the study design, since this makes it more difficult for the method to be artificially optimized (Norel et al., 2011; Boulesteix, 2015; Ullmann et al., 2022; Nießl et al., 2022). In principle, this is comparable to the classical context of regression, where overfitting is less likely to occur if the number of observations is large.
Moreover, it may be helpful to re-evaluate newly developed methods using a different design after the termination of the trial-and-error process, which might yield slightly worse but likely more realistic performance results (in the sense that the performance discrepancy between original and subsequent papers decreases). Although previous literature usually focuses on evaluating the method on new data (Jelizarow et al., 2010; Norel et al., 2011; Ullmann et al., 2022), considering different competing methods and evaluation criteria could also be reasonable. To reduce the risk of choosing the new design in favor of the proposed method, one could apply the design of a previous study conducted by different authors. As discussed above, the design of a previous study might not be suited to present all features of the proposed method, but this might be less relevant if the design is considered as an additional "external validation design". An external validation design could be, for example, the design of a neutral comparison study or, similar to our experiment, the design used for a previously proposed method (e.g., a method that was included as a competing method). This procedure is only feasible without much additional effort if the authors of the previous paper have made the code for reproducing the results openly available; moreover, it does not protect against systematic manipulation (e.g., modifying the method after seeing the results and thus consciously biasing the external validation).
When reading a paper, it is typically not possible to identify whether the method of interest has been overfitted to the design used for method development and, unless explicitly stated, whether there are any settings that have been separated from the development process. This also applies to the papers included in our experiment, which do not contain a corresponding statement. However, MBCdeg is mainly based on an algorithm that was developed by different authors for a different analysis task (i.e., clustering of genes that have already been identified as differentially expressed), which means that this part of the method could not be overfitted to the design of Osabe et al. (2021).
Different levels of expertise While the mechanisms discussed above are mostly attributed to the non-neutrality of the authors proposing their new method, there are also other potential mechanisms leading to deteriorating performances in subsequent papers. One of them originates from the fact that, as already noted by Duin (1996), the performance of a method is not just dependent on the design it is evaluated on but also on the skill of the person who applies the method. The difference in performance between original and subsequent papers may thus also be due to the lower expertise level of the subsequent authors, whose parameter choices when applying the method to the new data are likely to be less optimal than the parameters that the authors of the method would have selected. Of course, the degree to which the performance deteriorates due to the lack of expertise may be different for each method (Boulesteix et al., 2017) and also depends on how much the new design in which the method is applied differs from the design of the original paper.
As described in Sections 3.2 and 4.2, we also faced the challenge of choosing appropriate method parameters when applying the methods of our experiment to the new data sets, and we cannot rule out that these decisions led to a worse performance than if the authors of the original papers had chosen the parameters themselves. In the first example on cancer subtyping, we note that although the data sets in both papers had the same data type and originated from the same source, the authors of NEMO and PINSPlus might have set different parameters (including method-specific pre-processing steps) for their respective methods, since the data sets have a different distribution of samples and omic variables (due to the different pre-processing steps and number of data sets). For example, the authors of PINSPlus might have normalized the data when applying it to the data sets of NEMO. The same applies to the differential expression analysis example, where we decided to set SFMEB's parameters for the crossed simulation data as in the simulation setting of the original paper that seemed most similar to the new simulation. It is possible that the authors of SFMEB, who are experts on this method, would have used a different parameter setting. For MBCdeg, we also cannot rule out that our low level of expertise contributed to the deteriorating performance of the method. Although we evaluated different values for one parameter of MBCdeg as a sensitivity analysis, we only did so to a limited extent, and the considered values may still be suboptimal (for example, the authors did not specify how the parameter should be set in the presence of uniquely expressed genes, which are not considered in their simulation settings). It also has to be noted that we are non-expert users of many of the competing methods used for each paper; for instance, the performance of Consensus Clustering and iCluster+ (competing methods of PINSPlus) certainly depends on the expertise level of the user, since the optimal number of clusters has to be specified manually based on different types of plots and is thus very subjective. However, the difference in expertise (i.e., comparing our expertise vs. that of the authors of the four papers) is probably less drastic with regard to the competing methods than for the methods of interest, and is thus not of equal relevance.
One possibility to avoid the systematic deterioration of performance in subsequent studies due to a lower level of expertise is to involve the authors of the method in the respective study (Boulesteix et al., 2017; Morris et al., 2019; Pawel et al., 2022). This can be realized by having them implement their method themselves, as done for example in the study by Zapf et al. (2021), which involved the authors of all considered methods as co-authors, or in benchmark studies that are organized as challenges, such as the DREAM challenges (https://dreamchallenges.org/). Alternatively, the authors of a method can be contacted to make sure that their method is implemented correctly, as done in the comparison study by Herrmann et al. (2021). However, while the authors of a method could potentially be involved in the majority of comparison studies that assess their method, they will not be able to verify the correct implementation of their method in every applied study. While there is value in studying the performance of a method when used by an expert, it might thus be even more important to assess its performance when applied by non-experts (Duin, 1996; Boulesteix et al., 2017), as we did in this experiment. Note, however, that even among the non-experts of a method, there are different levels of expertise, or a different willingness to gain expertise by becoming more familiar with the method (which may apply in particular to authors who use the method as a competitor for their own method).
In general, it might thus be advisable for authors to make the performance of their method less dependent on user expertise, e.g., by providing concrete guidelines on how to choose optimal parameter values in different applications or by implementing automated parameter selection. The latter, although not always feasible, would protect against the above-mentioned tendency to leave the parameters of competing methods at default values. Moreover, reporting the robustness of the method's performance with respect to different parameter values (as done by all four papers considered in the experiment) allows method users to gain an understanding of which parameters need to be carefully specified (Ullmann et al., 2022). Of course, reducing the effect of different levels of expertise also requires effort from the authors of subsequent papers, who need to consider the available guidelines and information on how to set the method parameters.
Different fields of application An insight we gained from the experiment that seems to be rarely addressed in the literature, but plays an important role for the optimistic performance evaluation of newly proposed methods, relates to the appropriate field of application of a method and its individual strengths within this field. If a method performs worse in a subsequent paper, this can indeed be due to the mutual overfitting of method and design or to a lack of expertise, as discussed above. However, the deteriorating performance could also be explained by the fact that the field of application of the subsequent study does not exactly match the field of application the method is intended for. Unfortunately, our experiment suggests that it is often hard to assess whether this is the case.
For example, although NEMO and PINSPlus obviously have the same general field of application (i.e., cancer subtyping using multi-omic data), it was clear that PINSPlus, in contrast to NEMO, is not intended to be used on partial multi-omic data sets (i.e., data sets where some patients do not have any measurements for one or more omic data types), which is why we excluded these from our experiment. On the other hand, PINSPlus was initially (i.e., in its original paper) only evaluated based on its ability to find subtypes with significantly different survival, while NEMO was additionally assessed based on the enrichment of certain clinical variables such as tumor stage. We did not exclude the clinical enrichment criterion, although it could be argued that PINSPlus is only intended for applications where it is relevant to find subtypes with different survival.
Similarly, in the differential expression analysis example, we excluded the three-group simulated data used to assess the performance of MBCdeg in the original paper, since the authors of SFMEB did not explicitly mention that their method is intended for this type of application. On the other hand, we did not exclude the settings without biological replicates (i.e., n_obs = 1 in each group) used by the authors of SFMEB from our experiment, although the authors of MBCdeg did not explicitly state that settings without biological replicates belong to MBCdeg's field of application (and other popular methods such as edgeR and DESeq2 are explicitly not intended for this setting). Moreover, it is not clear whether MBCdeg can be applied in settings with uniquely expressed genes (i.e., genes with zero counts in one condition), which were included in most settings used to evaluate SFMEB.
These examples show that it is often not clear to method users what a method's exact field of application is, which consequently makes decisions on whether it is appropriate to apply the method to a new design more difficult and subjective. On the other hand, authors proposing a new method cannot be expected to provide an exact definition of the method's appropriate field of application that accounts for every imaginable design, and some authors explicitly state that the method simply requires more evaluation in certain designs to assess whether these belong to its appropriate field of application. For example, the authors of MBCdeg mention that their method still needs to be evaluated on additional simulation frameworks and on real data with different experimental settings and organisms.
In general, authors proposing a new method should thus try to study and define its field of application as comprehensively as possible, while authors applying the method in a subsequent study should carefully check whether the inclusion of the method is appropriate.
An issue related to the field of application is that methods often have specific strengths or features within their field of application, which is typically reflected by the design and is not problematic if reported transparently (as discussed above). However, the method's strengths and special features may not be highlighted to the same extent by the design of the subsequent study (which may, for instance, have been selected to highlight the strengths of a different method), thus leading to a deteriorating performance.
We also observed this in our experiment. As mentioned above, a special feature of NEMO is that it can handle missing values in the omic data. However, this feature does not come into play in the study design of PINSPlus, which cannot handle missing values (so that its authors did not consider designs with missing data). Notably, NEMO outperformed the competing methods in the original paper even more clearly for the data sets with missing values than for the full data sets, and although NEMO showed good performance in the design of PINSPlus, its performance might have been even better if the crossed design had also included data sets with missing values. In the differential expression example, the authors of SFMEB emphasize its strength of not requiring data normalization, an essential step for most other methods that can mislead downstream analyses if not done correctly. The authors of SFMEB include several data settings where normalization can be error-prone, such as heterogeneous data sets with clearly different fold-changes between the conditions. This particular strength is, however, not relevant for the settings of MBCdeg included in our experiment, which may also have contributed to SFMEB's deteriorating performance.
In contrast to the mismatch regarding the appropriate field of application discussed above, it is not necessarily inappropriate if a subsequent study disregards the strengths of a method, but this should ideally be mentioned. Note that the discussed mechanisms also apply to the competing methods of the original and subsequent papers, whose fields of application and specific strengths might be more or less reflected by the study design.

Conclusion
Based on the insights gained from the cross-design validation experiment, we conclude that while the discrepancy between original and subsequent studies assessing the performance of a method may be, in part, attributed to the non-neutrality of the method's authors, there are also other mechanisms, related to different levels of expertise and fields of application, that can contribute to a deteriorating method performance. It is important that both the authors proposing a method and the authors applying the method in a subsequent study acknowledge and counteract these mechanisms. Moreover, a minimum requirement for all papers proposing and/or comparing methods should be to openly provide the code and, if possible, the data needed to reproduce the results. This does not guarantee but at least facilitates the detection of potential over-optimistic statements in the original papers and of inappropriate use of the methods in subsequent papers. In the long run, these efforts will increase the reliability of studies proposing new methods.

A Additional information: Cancer subtyping

A.1 Pre-processing

[...] done for all methods in R19). Note that the information on pre-processing shown in Table 4 is based on the published code and, as far as early pre-processing that generates the data provided by the authors is concerned, on the text in the papers. This means that there could have been additional pre-processing steps that are not reported. Table 5 shows the resulting number of patients and omic variables for N19 and R19 after applying the pre-processing steps.
As stated in Section 2.1, we consider all pre-processing steps that are performed for all methods as belonging to the data component and method-specific pre-processing steps as belonging to the respective methods. However, some refinements are necessary when crossing the designs. More specifically, we note that iClusterBayes and LRACluster (competing methods of R19) have very long runtimes when run on the data sets of N19. This is because N19 do not perform any variable selection as a general pre-processing step (only for iCluster+). Hence, when running iClusterBayes and LRACluster on the data from N19, we select the 2000 omic variables with the highest variance for each omic data type, as is done for k-means, spectral clustering, MCCA, and MultiNMF in R19.
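For illustration, this variance-based filtering step can be sketched as follows (a minimal NumPy sketch rather than the original R code; the matrix dimensions are made up, and only the threshold of 2000 variables is taken from the text):

```python
import numpy as np

def select_top_variance(X, n_keep=2000):
    """Keep the n_keep columns (omic variables) with the highest variance.

    X: samples x variables matrix for one omic data type.
    """
    variances = X.var(axis=0)
    # Indices of the n_keep most variable columns
    keep = np.argsort(variances)[::-1][:n_keep]
    # Return the selected columns in their original order
    return X[:, np.sort(keep)]

# Toy example: 50 patients, 5000 omic variables (dimensions are made up)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5000))
X_filtered = select_top_variance(X, n_keep=2000)
print(X_filtered.shape)  # (50, 2000)
```

In the crossed design, such a filter would be applied separately to each omic data type before passing the data to the clustering methods.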

A.2 Reproducibility issues for two competing methods
We have to exclude two competing methods of NEMO (rMKL-LPP and MultiNMF) from the experiment. In the README file accompanying the code of Rappoport and Shamir (2018), the authors state that reproducing the results of rMKL-LPP requires the source code of the method, which they report is only available on request from the authors of rMKL-LPP. It seems that the method can by now also be run on a web server (www.web-rMKL.org), which, however, is not available at the time of writing (last checked in August 2022). Moreover, we have to exclude MultiNMF since running the R code provided by Rappoport and Shamir (2018) (and thus by R19) requires the user to insert MATLAB commands, which we are not able to specify correctly. Note that Tepeli et al. (2020) were also not able to reproduce the results of MultiNMF shown in Rappoport and Shamir (2018).

A.3 Approximation-based vs. permutation-based p-values

Rappoport and Shamir (2018) note that the χ2 distribution assumed for the test statistics of the logrank, the χ2, and the Kruskal-Wallis test is not an accurate approximation for small sample sizes and unbalanced cluster sizes, especially for large values of the test statistic. Hence, Rappoport and Shamir (2018) (and thus also R19) estimate the p-values using permutation procedures, i.e., they randomly permute the cluster labels and calculate empirical p-values as the fraction of permutations for which the test statistic is greater than or equal to the test statistic yielded by the original clustering. Rappoport and Shamir (2018) report that they observed large differences between approximation-based (i.e., assuming a χ2 distribution) and permutation-based p-values, with the former yielding increased type 1 error rates. They conclude that, at least for TCGA data sets, analyses that use approximation-based p-values might not be valid. In our experiment, the approximation-based p-values are indeed generally smaller, as can be seen from Figure 7.

A.4 Reproduced performance results of PINSPlus and NEMO for each data set

A.5 Comparison of SNF implementations
The cancer subtyping method SNF is used as a competing method for both PINSPlus and NEMO. However, N19 and R19 set different method parameters for SNF. Figure 8 shows the logrank p-values and numbers of enriched clinical variables resulting from the two different implementations, revealing a considerable but non-systematic performance difference.

B Additional information: Differential gene expression analysis

B.1 Incorrect AUC calculation
Z21 and O21 use different R packages for calculating the AUC, namely pROC (Robin et al., 2011) and ROC (Carey and Redestig, 2021), respectively.
In the pROC package, the function that calculates the ROC curve (roc) takes the argument direction, which determines whether values higher or lower than the threshold should be considered as cases (i.e., DE genes in this context). By default, the package sets the direction automatically according to the medians of the predicted values (see the direction argument of the roc function in the pROC manual), which implies that the ROC curves are biased towards higher AUC values if the direction argument is not set explicitly. More precisely, this means that if the automatically determined direction is not correct, the resulting AUC will be 1 minus the correct AUC. It seems that Z21 were not aware of this unfortunate default, since they did not explicitly specify the direction argument, potentially leading to incorrect AUC values.
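The consequence described above can be illustrated with a small sketch (shown in Python with scikit-learn rather than R's pROC, and with made-up labels and scores): if the assumed orientation of the predictor is flipped relative to the true one, the computed AUC equals 1 minus the correct AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=200)  # 1 = DE gene, 0 = non-DE gene
# A made-up score where *smaller* values indicate DE genes (like a p-value)
scores = np.where(labels == 1,
                  rng.uniform(0.0, 0.4, size=200),
                  rng.uniform(0.2, 1.0, size=200))

# Wrong orientation: treats large scores as evidence for DE genes
auc_wrong = roc_auc_score(labels, scores)
# Correct orientation: negate the scores so large values indicate DE genes
auc_correct = roc_auc_score(labels, -scores)
print(auc_wrong, auc_correct)  # the two values sum to 1
```

An automatic direction choice based on medians, as in pROC, would in this example report the larger of the two values regardless of which orientation is actually correct, which is the source of the upward bias described in the text.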

B.2 Different edgeR implementations
While the edgeR implementation used by O21 corresponds to one of the standard edgeR workflows, Z21 use three different versions of edgeR, of which only one can be considered a standard edgeR workflow (and even this one differs slightly from the version used by O21). In six simulation settings, Z21 only use an edgeR-like implementation, which is not based on the negative binomial distribution usually considered for edgeR but on the Poisson distribution (presumably because the counts in these settings are generated using a Poisson distribution). Since Z21 also include settings with no biological replicates (i.e., n = 1 in each group), where edgeR results in an error, they instead use a testing procedure involving a binomial test. While the edgeR user manual (Section 2.12, "What to do if you have no replicates") does suggest several options for settings with no biological replicates (although it is stated that these options are not ideal), the procedure used by Z21 is not among them. Instead, it is mentioned as an option for technical replicates (i.e., repeated measurements of the same sample that represent independent measures of the random noise associated with protocols or equipment; Blainey et al., 2014).
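To illustrate the general idea of such a binomial test (a hedged sketch, not Z21's actual code or the edgeR implementation; the gene counts and library sizes are made up): under the null hypothesis of no differential expression, a gene's reads are expected to split between the two libraries in proportion to the library sizes.

```python
from scipy.stats import binomtest

def binomial_de_test(x1, x2, lib_size1, lib_size2):
    """Two-sided binomial test for a single gene measured once per condition.

    Under the null, each of the gene's x1 + x2 reads falls into library 1
    with probability lib_size1 / (lib_size1 + lib_size2).
    """
    p_null = lib_size1 / (lib_size1 + lib_size2)
    return binomtest(x1, n=x1 + x2, p=p_null, alternative="two-sided").pvalue

# Made-up counts: 90 vs 10 reads for one gene in two equally sized libraries
p = binomial_de_test(90, 10, lib_size1=1_000_000, lib_size2=1_000_000)
print(p)  # small p-value: the split deviates strongly from 50:50
```

This conditioning argument is appropriate when the counts are Poisson-distributed, as with technical replicates, but it ignores the extra biological variability between independent samples, which is presumably why the edgeR manual reserves it for technical replicates.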

B.3 Sensitivity analysis of MBCdeg
The main parameter of MBCdeg (which is based on a clustering algorithm) is the number of clusters K to be found by the method. The number of clusters does not have a default value and is set to K = 3 by O21 in the simulation settings that we reproduce in our experiment. This reflects the assumption that there are three gene expression patterns: non-DE genes, DE genes up-regulated in group 1, and DE genes up-regulated in group 2 (where up-regulated in group j again means having higher expression in group j). However, O21 note that for settings in which the genes up-regulated in one group show different degrees of differential expression (i.e., fold-changes), allowing MBCdeg to generate a higher number of clusters could lead to more accurate results. This could apply to the settings of studies 2 and 4 considered in Z21, which consist of two data sets with two different log2 fold-changes (i.e., 2 and 3). As a sensitivity analysis, we thus set K = 5 for these settings, reflecting non-DE genes and the two different degrees of differential expression for both groups; this, however, does not result in higher AUC values (see Figure 9). Moreover, O21 state that for settings where all DE genes are up-regulated in one group, the true number of clusters is actually K = 2, reflecting non-DE genes and DE genes (all up-regulated in one group). Since this situation is present in the three settings of study 5 in Z21, we also run MBCdeg with K = 2, which, however, does not lead to improved results (see Figure 9).
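To illustrate the role of K, the following generic sketch clusters made-up per-gene log2 fold-changes with a Gaussian mixture (scikit-learn's GaussianMixture; this is an analogy only, not the actual MBCdeg algorithm, which works on count data with a dedicated model): with K = 3, one cluster centered near zero plays the role of the non-DE group, and the other two capture genes up-regulated in either group.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Made-up per-gene log2 fold-changes: non-DE near 0, DE genes around +/-2
lfc = np.concatenate([
    rng.normal(0.0, 0.3, size=800),   # non-DE
    rng.normal(2.0, 0.3, size=100),   # up-regulated in group 1
    rng.normal(-2.0, 0.3, size=100),  # up-regulated in group 2
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=3, random_state=0).fit(lfc)
# The component whose mean is closest to zero is interpreted as non-DE
non_de = np.argmin(np.abs(gmm.means_.ravel()))
clusters = gmm.predict(lfc)
print((clusters == non_de).sum())  # roughly 800 genes in the non-DE cluster
```

In the same spirit, information criteria such as the BIC (available via GaussianMixture.bic) could in principle guide the choice of K when the number of fold-change patterns is unknown; whether such automated selection matches MBCdeg's intended usage is an open question.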

B.4 Experiment results of MBCdeg1
Figure 10 presents the performance ranks of MBCdeg1, which, in contrast to MBCdeg2, uses the default normalization algorithm.

Figure 1
Figure 1 Performance of PINSPlus based on the original study design by Nguyen et al. (2019) (upper panel) and the study design by Rappoport and Shamir (2019) (lower panel).

Figure 4 Figure 5
Figure 4 Performance of SFMEB based on the original study design by Zhou et al. (2021) (upper panel) and the study design by Osabe et al. (2021) (lower panel). The boxplots correspond to n_sim simulation repetitions, where n_sim = 20 for Zhou et al. (2021) and n_sim ∈ {50, 100} for Osabe et al. (2021). The red dashed line corresponds to the median AUC of SFMEB across all simulation repetitions. In the original paper by Zhou et al. (2021), the AUC has not been calculated as intended by the authors, which is why the correct AUC values are provided in addition to the reproduced and potentially incorrect AUC values.

Figure 6
Figure 6 Performance ranks of SFMEB and MBCdeg2 based on data and competing methods that either correspond to the original (SFMEB: Zhou et al., 2021; MBCdeg2: Osabe et al., 2021) or crossed design (SFMEB: Osabe et al., 2021; MBCdeg2: Zhou et al., 2021). Each boxplot consists of 15 or 24 ranks (additionally represented as black points), which corresponds to the number of data settings considered by Zhou et al. (2021) and Osabe et al. (2021), respectively. The ranks of SFMEB and MBCdeg2 in each data setting are calculated based on the median AUC value across all simulation repetitions. The black lines correspond to the number of compared methods, i.e., the highest possible rank. Note that for one combination of data and competing methods, not all competing methods can be applied to all settings, which is why the number of compared methods varies between 2 and 4.

Figure 7
Figure 7 Comparison of approximation-based and permutation-based p-values. Each point refers to the logrank p-value of a method when applied to a data set. All methods and data sets considered by N19 and R19 are included, resulting in 528 points (12 methods × 44 data sets).

Figure 8
Figure 8 Logrank p-values and number of enriched clinical variables resulting from the two different SNF implementations specified by N19 and R19. The left panel includes 88 points, representing the two different p-value estimation procedures on all 44 (34 + 10) data sets considered by N19 and R19. The right panel includes 44 points, since the p-values for clinical enrichment are only calculated based on permutation tests.

Figure 9
Figure 9 Performance results for MBCdeg1 and MBCdeg2 when using different values for K.

Figure 10
Figure 10 Performance ranks of MBCdeg1 based on data and competing methods that either correspond to the original (Osabe et al., 2021) or crossed design (Zhou et al., 2021). Each boxplot consists of 15 or 24 ranks (additionally represented as black points), which corresponds to the number of data settings considered by Zhou et al. (2021) and Osabe et al. (2021), respectively. The ranks of MBCdeg1 in each data setting are calculated based on the median AUC value across all simulation repetitions. The black lines correspond to the number of compared methods, i.e., the highest possible rank. Note that for one combination of data and competing methods, not all competing methods can be applied to all settings, which is why the number of compared methods varies between 2 and 4.

Table 1
Illustration of the cross-design validation experiment.

Performance of NEMO based on the original study design by Rappoport and Shamir (2019) (upper panel) and the study design by Nguyen et al. (2019) (lower panel).

Table 5
Number of patients and omic variables (gene expression, methylation, miRNA expression) after all pre-processing steps (except method-specific pre-processing) have been performed.
Tables 6 and 7 display the reproduced results of NEMO and PINSPlus for each data set.

Table 6
Reproduced performance results (logrank p-values) of PINSPlus and its competing methods for each data set based on the original study design by N19.

Table 7
Reproduced performance results (number of enriched clinical variables / −log10 logrank p-values) of NEMO and its competing methods for each data set based on the original study design by R19.