A multiple testing framework for diagnostic accuracy studies with co-primary endpoints

Major advances have been made in the use of machine learning techniques for disease diagnosis and prognosis based on complex and high-dimensional data. Despite all justified enthusiasm, overoptimistic assessments of predictive performance are still common in this area. Predictive models and medical devices based on such models should therefore undergo a thorough evaluation before being implemented in clinical practice. In this work, we propose a multiple testing framework for (comparative) phase III diagnostic accuracy studies with sensitivity and specificity as co-primary endpoints. Our approach challenges the frequent recommendation to strictly separate model selection and evaluation, that is, to assess only a single diagnostic model in the evaluation study. We show that our parametric simultaneous test procedure asymptotically allows strong control of the family-wise error rate. A multiplicity correction is also available for point and interval estimates. Moreover, we demonstrate in an extensive simulation study that our multiple testing strategy on average leads to a better final diagnostic model and increased statistical power. To plan such studies, we propose a Bayesian approach to determine the optimal number of models to evaluate simultaneously. For this purpose, our algorithm optimizes the expected final model performance given previous (hold-out) data from the model development phase. We conclude that an assessment of multiple promising diagnostic models in the same evaluation study has several advantages when suitable adjustments for multiple comparisons are employed.

the data quantity is severely limited. Bias induced by selection (e.g., data source, data splitting modalities, performance measure, comparator, etc.) is also a recognized issue. 7-9 This is especially important because trying out a wide variety of learning algorithms is common practice in applied machine learning. This approach is somewhat justified by the no free lunch theorem of statistical learning, which states that no single algorithm gives universally best results for all prediction tasks. 10 Modern algorithms additionally involve the tuning of several hyperparameters. For instance, the performance of neural networks depends on depth (number of layers), width (neurons per layer), activation function(s) and further hyperparameters. 11 In effect, usually dozens, hundreds or even thousands of candidate models are trained and compared. 13-17 In the machine learning community, the corresponding data sets are commonly denoted as training, validation and test set.
19-22 Relevant sources for overoptimistic conclusions are selection of nonconsecutive patients, analysis of retrospective data and focus on severe cases and healthy controls. 20,24,25 This may prevent other research teams from replicating the findings on similar problems (replicability) or even on the same data (reproducibility). 26,27 The focus of this work is statistical methods for diagnostic accuracy studies. While the literature often targets traditional diagnostic or prognostic tests, the majority of concepts translate to the evaluation of (machine-learned) prediction models for diagnostic or prognostic purposes. In the terminology of Knottnerus and Buntinx (chapter 2), 28 our work is mainly tailored toward cross-sectional phase III (comparative) diagnostic test accuracy studies which ask the question: "Among patients in whom it is clinically sensible to suspect the target disorder, does the level of the test result distinguish those with and without the target disorder?" 28 Translated to our scenario, the goal of such a study is to provide evidence that a novel diagnostic or prognostic model (test) outperforms either a given comparator or, if no such competing model (test) is available, a given performance threshold. Performance, that is, sensitivity and specificity, is measured with regard to the (defined) ground truth, which in the optimal case is derived by a so-called gold standard. It may, however, be infeasible to implement such a gold standard in the study, or one may not even exist. Thus, typically a reference standard is defined as the best available approximation of the gold standard. Common reasons to utilize a new model (test) in clinical practice are lower invasiveness or lower costs. This balances the fact that, by definition, the index model (test) cannot have better performance than the reference standard. 24,25 A recent example of a diagnostic accuracy study in the context of applied machine learning covers the IDx-DR device which allows
diagnosis of diabetic retinopathy in diabetic patients via deep learning based on retinal color images. 32 Diagnostic accuracy was assessed in an extensive observational clinical trial involving 900 patients, which led to a marketing permit by the FDA*. An overview of the current regulatory framework for machine learning empowered medical devices in the EU and the United States from the viewpoint of radiology has recently been provided. 33 Important developments are in particular the new Medical Devices Regulation (MDR) and the new In Vitro Diagnostic Medical Device Regulation (IVDR) in the EU. 34,35 These new regulations are certainly relevant for the application of machine learning methods for medical testing purposes and apply from May 2020 and May 2022, respectively. In the IDx-DR study, only a single model was (successfully) prospectively evaluated on the independent test dataset. 32 This default strategy is often advised in machine learning and diagnostic research to avoid any selection induced bias. It is reasonable if previously available data for model training and selection is (a) large in number and (b) representative of the intended target population. However, when potentially hundreds of modeling approaches are compared on (quantitatively and/or qualitatively) modest datasets, the model selection process can rarely be concluded with confidence. While the default approach enables unbiased estimation and simple statistical inference, one is thus bound to this one-time model choice under uncertainty. In effect, this strategy is quite inflexible, as it is impossible to retrospectively correct a flawed model selection without compromising the statistical inference in the evaluation study.
To address this issue, an established multiple testing approach 36 was recently adapted to explicitly take into account that multiple models are assessed simultaneously on the same evaluation dataset. 37 In effect, model selection can be improved with the help of the test data. 38 The employed simultaneous test procedure is based on the (approximate) multivariate normal distribution of performance estimates. An advantage of this procedure is that the multiplicity adjustment needs to be less strict when candidate models give (highly) similar predictions. This approach allows approximate control of the overall type I error rate and construction of simultaneous confidence regions as well as corrected (median-conservative) point estimates. Moreover, it was found that selecting multiple promising models can increase statistical power for the evaluation study compared to the default approach where only the best (cross-)validation model is evaluated.
Multiple testing methodology is not particularly popular, neither for medical test evaluation nor in machine learning. 38 As indicated above, the recommendation to evaluate a single final model or test on independent data has prevailed in both domains. In this case, no adjustments for multiplicity are necessary as the final model is selected independently of (i.e., prior to) the evaluation study. Our approach has some overlap with so-called benchmark experiments in predictive modeling. Several authors showcase different approaches for the comparison of learning algorithms over multiple (real or artificial) datasets and inference regarding their expected performances. 39,40 In contrast, we aim to evaluate prediction models (conditional on the learning data), which is arguably more important at the end of the model building process, right before implementation of a specific model in (clinical) practice. In medical testing applications, it is natural to consider only a single index (candidate) test. This is particularly the case when different tests are based on separate biological samples. It would then often not be considered ethically justifiable to assess more than a single index test and the reference test on the same patients. In contrast, when all index tests are derived from the same data sample, for example, retinal images, the number of index tests (diagnostic models) should be primarily guided by statistical considerations as long as the type I error rate can be controlled.
This work covers two distinct contributions. Firstly, we extend the existing multiple testing approach 37,38 to diagnostic accuracy studies with co-primary endpoints sensitivity and specificity. Secondly, we describe a novel Bayesian approach to determine the optimal number of models to include in the final evaluation study based on preliminary data. The remainder of this work is structured as follows. In Section 2, we motivate our work by a real-world application, namely the prediction of abnormal fetal health status with machine learning methods. In Section 3, we introduce core assumptions and notation and establish our inference framework. In Section 4, we describe the proposed Bayesian model selection approach. In Section 5, we show several numerical results. We apply our methods to the real data introduced in Section 2 (Section 5.1). We show results from a large simulation study in the machine learning context with the aim to compare different approaches for model selection prior to the evaluation study (Section 5.2). Lastly, we assess our multiple testing approach with regard to finite sample type I error rates under least-favorable parameter configurations (Section 5.3). Finally, in Section 6, we summarize our results, point out limitations of our framework, and give an outlook on possible extensions.

MOTIVATING EXAMPLE: SCREENING FOR ABNORMAL FETAL STATE
Cardiotocography (CTG) measurements capture the fetal heart rate and uterine contractions during pregnancy and delivery. The use of CTG can contribute to improved management of high-risk pregnancies. In the following, we utilize the publicly available Cardiotocography Data Set † to illustrate the approach introduced in this work as well as its use case in clinical research. 41 The dataset consists of 2126 observations and is labeled with a measurement date, which allows us to emulate a prospective evaluation study. To that end, the chronologically first n_𝓛 = 1594 observations are utilized for model development. The prediction target is an abnormal (suspect or pathologic) fetal state, which has a prevalence of 24% in the learning data. The reference standard is a consensus vote of medical experts. Five different classes of machine learning models are utilized, namely penalized logistic regression (elastic net) models, decision trees, random forests, support vector machines, and gradient boosted trees.
For each learning algorithm, 200 distinct hyperparameter configurations are investigated, sampled randomly as implemented in the R package caret. 42 To assess model performance, a hold-out validation set with n_𝓥 = 318 observations is used to estimate out-of-sample sensitivity and specificity of each of the 1000 candidate models. Empirical validation performance is quite high, with 82 models exceeding 90% sensitivity and specificity, of which 48 are random forests (18 elastic net models, 2 support vector machines, 14 gradient boosted trees).
Assuming a sensitivity of 80% and a specificity of 70% have been declared as the absolute minimal acceptance criteria for implementation in clinical practice, a typical research question is now: "Which of the developed prediction models should be evaluated with regard to these co-primary endpoints in an independent, prospective accuracy study?" To counter the uncertainty in the preliminary validation estimates, in this work we will select multiple promising models rather than a single model, as is often recommended (Section 4). The selected models are then evaluated simultaneously while adjusting for multiple comparisons (Section 3). The evaluation study for our example utilizing the new methodology is illustrated in Section 5.1.

Prerequisites
We consider predicting a binary label Y ∈ {0, 1} based on P ∈ ℕ features X ∈ ℝ^P. Our canonical example in this work will be the distinction between diseased (having a given target condition, Y = 1) and healthy (Y = 0) subjects. In medical applications, the target Y should be obtained by the (best available) reference standard and forms the ground truth for learning and evaluation. The goal of supervised machine learning is to provide a prediction model f : x ↦ ŷ with high performance. In practice, this is achieved by a learning algorithm (e.g., stochastic gradient descent) which trains a prediction model f (e.g., a neural network) based on learning data 𝓛 = {(x_i, y_i)}, i = 1, …, n_𝓛. We emphasize that we are concerned with the assessment of a specific model f trained on specific data 𝓛 and not with the properties of the algorithm that was used to learn f. Finally, we note that, from the viewpoint of predictive modeling, disease diagnosis and prognosis essentially only differ regarding the time lag between capturing the features x and the label y, at least if this lag is approximately equal in the population of interest. Otherwise, a time to event analysis may be more appropriate. For the sake of brevity, we will focus on diagnosis tasks in the following. In medical diagnosis, sensitivity (Se) and specificity (Sp) are often both assessed simultaneously. We thus consider the tuple θ = (Se, Sp) as the performance measure of interest, where Se = P(f(X) = 1 | Y = 1) and Sp = P(f(X) = 0 | Y = 0). This is advantageous compared to just considering the overall accuracy Acc = P(f(X) = Y) = π Se + (1 − π) Sp in the sense that neither the accuracy among the diseased nor among the healthy alone dominates our perception of the predictive performance if the disease prevalence π = P(Y = 1) is small.
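For concreteness, the empirical counterparts of these quantities can be computed directly from labels and predictions. The following Python sketch (illustrative only; the paper's own software is written in R) verifies the decomposition Acc = π·Se + (1 − π)·Sp on a small toy example:

```python
import numpy as np

def sens_spec(y_true, y_pred):
    """Empirical sensitivity and specificity of binary predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    se = np.mean(y_pred[y_true == 1] == 1)  # estimate of P(f(X)=1 | Y=1)
    sp = np.mean(y_pred[y_true == 0] == 0)  # estimate of P(f(X)=0 | Y=0)
    return se, sp

# Toy data with prevalence pi = 0.3.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 1, 1])

se, sp = sens_spec(y_true, y_pred)
pi = np.mean(y_true)
acc = np.mean(y_true == y_pred)

# Accuracy decomposes exactly as pi * Se + (1 - pi) * Sp.
print(se, sp, acc, pi * se + (1 - pi) * sp)
```

With a low prevalence π, the overall accuracy is dominated by the specificity term, which is precisely why the tuple (Se, Sp) is preferred here.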
As we have outlined in Section 1, it is usually recommended to conduct a final performance assessment on independent evaluation or test data 𝓔 = {(x_i, y_i)}, i = 1, …, n_𝓔. Performance estimates on the learning data are generally a bad indicator of the true performance, in particular when a highly complex model might have overfitted the learning data. We assume that 𝓔 is an i.i.d. sample from the unknown joint distribution of X and Y. Moreover, we assume that M ∈ ℕ models f_m, m ∈ 𝓜 = {1, …, M}, have been initially trained in the learning phase via different learning algorithms. A subset 𝓢 ⊂ 𝓜 of these initial candidate models is selected for evaluation. To simplify the notation, we assume that the candidate models are ordered such that 𝓢 = {1, …, S} with 1 ≤ S ≤ M. Different approaches to identify 𝓢 are discussed in Section 4. This is implemented in practice by training models first only on the so-called training data 𝓣 ⊂ 𝓛, a subset of the learning data. The remaining validation data 𝓥 = 𝓛 ⧵ 𝓣 is then used for a performance assessment of the resulting preliminary models f_m^- (trained only on 𝓣). The model(s) selected for evaluation are then re-trained with all available data 𝓛 = 𝓣 ∪ 𝓥 before going into the evaluation study, with the expectation that this (slightly) increases their performance. This procedure is known as (simple) hold-out validation. There exist several variations and alternative strategies such as cross-validation or bootstrapping. 12,16 They can stabilize model selection but are computationally more expensive, as models have to be retrained several times, which is not always feasible in practice. For simplicity, we will focus on simple hold-out validation in this work.
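The hold-out workflow just described (train on 𝓣, select on 𝓥, re-train on 𝓛 = 𝓣 ∪ 𝓥, then evaluate on 𝓔) can be sketched end to end. The following toy example is purely illustrative: it uses synthetic data and simple threshold classifiers as hypothetical stand-ins for trained models, not the actual learning algorithms from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic feature-label data (hypothetical stand-in for real data).
def draw(n):
    y = rng.binomial(1, 0.3, n)
    x = rng.normal(loc=y, scale=1.0)  # one informative feature
    return x, y

x_L, y_L = draw(600)   # learning data L
x_E, y_E = draw(400)   # independent evaluation data E

# Hold-out split of L into training data T and validation data V.
x_T, x_V = x_L[:450], x_L[450:]
y_T, y_V = y_L[:450], y_L[450:]

# "Candidate models" are cutoff classifiers; the quantile level q
# plays the role of a hyperparameter.
def fit(x, q):
    return np.quantile(x, q)            # "training": pick a cutoff

def predict(cutoff, x):
    return (x > cutoff).astype(int)

grid = [0.5, 0.6, 0.7, 0.8]
prelim = [fit(x_T, q) for q in grid]    # preliminary models on T
val_acc = [np.mean(predict(c, x_V) == y_V) for c in prelim]

# Select on V, then re-train the chosen model on all of L
# before it enters the evaluation study.
best = int(np.argmax(val_acc))
final = fit(x_L, grid[best])
print(np.mean(predict(final, x_E) == y_E))
```

The key point of the sketch is structural: the evaluation data `x_E, y_E` are touched exactly once, after selection and re-training are completed.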

Study goal
Our overall goal is to identify a prediction model with high sensitivity and specificity from the M initially trained models. Additionally, we aim to show superiority of at least one model compared to prespecified thresholds θ_0 = (Se_0, Sp_0) ∈ (0, 1)². Alternatively, θ_0 = θ(f_0) is the unknown performance of a comparator f_0, that is, an established diagnostic testing procedure which is also estimated in the evaluation study. In the following, we will focus on the former scenario for brevity. Throughout this work, we assume that the evaluation study is declared successful if and only if superiority in both endpoints for at least one candidate model m ∈ 𝓢 relative to the comparator θ_0 can be demonstrated. This so-called co-primary endpoint analysis is the standard approach in confirmatory diagnostic accuracy studies. 43,44 Other approaches have also been discussed in the literature. 45 The system of null hypotheses is thus given by

H_m : Se_m ≤ Se_0 or Sp_m ≤ Sp_0, m ∈ 𝓢.  (1)

This corresponds to a union-intersection test over models and an intersection-union test over parameters (Se, Sp).
Equivalently, the hypothesis system can be expressed as

H_m : ϑ_m ≤ 0 with ϑ_m = min(Se_m − Se_0, Sp_m − Sp_0), m ∈ 𝓢.

The goal of the evaluation study is to estimate the unknown sensitivities and specificities, provide confidence bounds for these estimates and obtain a multiple test decision for the hypothesis system (1).
A multiple test ψ = (ψ_1, …, ψ_S) depends on the evaluation data 𝓔 and results in a binary vector ψ ∈ {0, 1}^S of test decisions, whereby H_m is rejected if and only if ψ_m = 1. In a regulatory setting, it is usually required that ψ shall control the family-wise error rate (FWER) strongly at significance level α, that is, we require for all possible parameter configurations θ that

FWER_θ(ψ) = P_θ(∃ m ∈ 𝓜_0 : ψ_m = 1) ≤ α.  (2)

Hereby, 𝓜_0 = 𝓜_0(θ, θ_0) ⊂ 𝓜 is the index set of true null hypotheses, that is, we have either Se_m ≤ Se_0 or Sp_m ≤ Sp_0 for all m ∈ 𝓜_0. Any parameter θ for which FWER_θ(ψ) becomes maximal is called a least favorable parameter configuration (LFC). In this work, we will restrict our attention to asymptotic control of the FWER, that is, (2) shall only hold as n = n_𝓔 → ∞.
We will investigate the finite sample FWER in realistic and least-favorable settings in Section 5. We extend ψ to a multiple test for the actually relevant hypothesis system concerning all initial candidate models f_m, m ∈ 𝓜, by setting ψ_m = 0 for all m ∈ 𝓜 ⧵ 𝓢. 37 That is to say, a model m cannot be positively evaluated (ψ_m = 1) when it is not selected for evaluation. This natural definition has two consequences. Firstly, the extended test retains (asymptotic) FWER control as only non-rejections are added. Secondly, we can compare different model selection strategies because the extended multiple test always operates on all of 𝓜 and not only on the selected subset 𝓢. 37

Parameter estimation
We assume that a subset of promising models 𝓢 = {1, …, S} ⊂ 𝓜 has been selected prior to the evaluation study. Suitable strategies for that matter are presented in Section 4. The observed feature-label data {(x_i, y_i)}, i = 1, …, n, from the evaluation study is transformed to the actually relevant binary similarity matrices Q^Se ∈ {0, 1}^{n_1 × S} and Q^Sp ∈ {0, 1}^{n_0 × S} for the diseased (y = 1) and healthy (y = 0) subpopulations, respectively. Hereby, n_1 and n_0 are the numbers of diseased and healthy subjects among the n = n_𝓔 = n_1 + n_0 evaluation study observations. The entry of Q in row i and column m is equal to one if the prediction of the ith observation by the mth model is correct and zero if it is wrong. The sample averages Ŝe and Ŝp can thus be calculated as the column means of these similarity matrices. Moreover, we can estimate the covariance matrices Σ_Se = cov(Ŝe) and Σ_Sp = cov(Ŝp) as the sample covariance matrices of the similarity matrices divided by n_1 and n_0, respectively. A problem arises when a sample proportion of one (or zero) is observed. In this case, say if Ŝe_m = 1, the plug-in variance estimate Ŝe_m(1 − Ŝe_m)/n_1 collapses to zero. A secondary effect are confidence intervals of zero width, which are difficult to interpret. In this work, we tackle this problem by utilizing a shrinkage estimator which is based on the posterior distribution of a multivariate Beta-binomial model. 46 A detailed description is provided in Appendix S1. 48,49 For finite sample sizes, the influence of this adaptation is conservative, as mean estimates are slightly shrunk toward 0.5 and variance estimates are slightly inflated. In addition, our approach induces shrinkage of the empirical correlation matrix toward the identity matrix. Asymptotically, the adapted estimators are equivalent to (4) and the asymptotic properties described in the next section thus remain unchanged.
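The construction of the similarity matrices and the plain plug-in estimates (before the shrinkage adjustment described above, which is not implemented here) can be sketched in a few lines of Python:

```python
import numpy as np

def similarity_matrices(y_true, preds):
    """Build Q_Se (diseased rows) and Q_Sp (healthy rows).

    preds: (n, S) array of binary predictions, one column per model.
    Entry (i, m) is 1 if model m predicts observation i correctly."""
    y_true, preds = np.asarray(y_true), np.asarray(preds)
    correct = (preds == y_true[:, None]).astype(int)
    return correct[y_true == 1], correct[y_true == 0]

# Toy evaluation data: n = 7 observations, S = 2 models.
y = np.array([1, 1, 1, 0, 0, 0, 0])
preds = np.array([[1, 1], [1, 0], [0, 1],
                  [0, 0], [0, 1], [0, 0], [1, 0]])
q_se, q_sp = similarity_matrices(y, preds)

# Column means give Se-hat and Sp-hat; the sample covariance over
# rows, divided by n1 (resp. n0), estimates cov(Se-hat) (resp. cov(Sp-hat)).
se_hat, sp_hat = q_se.mean(axis=0), q_sp.mean(axis=0)
cov_se = np.cov(q_se, rowvar=False) / q_se.shape[0]
cov_sp = np.cov(q_sp, rowvar=False) / q_sp.shape[0]
print(se_hat, sp_hat)
```

The off-diagonal entries of `cov_se` and `cov_sp` carry the model similarity information that the maxT-approach later exploits.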

Multiple comparison procedure
Our goal is to apply the so-called maxT-approach to hypothesis system (1). 36 This simultaneous test procedure has previously been utilized for the assessment of the overall classification accuracy of several binary classifiers. 37,38 This multiple test enables us to reduce the need for multiplicity adjustment in case several similar models are evaluated. Similarity is measured by the correlation of (empirical) performances, which in our context depends on the probability of a common correct prediction. 37 The maxT-approach relies on determining a common critical value c_α, which depends on the empirical correlation structure, such that each hypothesis H_m is rejected if and only if both T_m^Se > c_α and T_m^Sp > c_α. The test statistics are defined in a standard manner via

T_m^Se = (Ŝe_m − Se_0) / (Σ̂^Se_mm)^{1/2},  (5)

and the test statistics T_m^Sp are defined accordingly. When only a single model is evaluated with regard to co-primary endpoints, the tests regarding sensitivity and specificity each need to be conducted at local level α. This is because the LFC is of the form

(Se, Sp) = (Se_0, 1) or (Se, Sp) = (1, Sp_0),  (6)

that is, one parameter (Se or Sp) is equal to one and the other parameter lies on the boundary of the null hypothesis.
Similarly, for our case of S candidate models, the two cases described in (6) are possible for each dimension m ∈ 𝓢 = {1, …, S}, resulting in 2^S potential LFCs. We could attempt to identify the single least favorable configuration in terms of multiple testing adjustment, which depends on the true correlation matrix. However, the computational burden of this extensive approach increases exponentially in S, as 2^S critical values would have to be computed. We will instead focus on a more direct approach. Assume we knew which of the two cases in (6) applies for each model m, encoded by a vector b ∈ {Se, Sp}^S; the associated asymptotic correlation matrix R_b is determined by b, whereby I_S is the S-dimensional identity matrix. We can then define the S-dimensional vector of test statistics T_b with entries T_m^{b_m}. Assuming the LFC given by b holds, T_b asymptotically follows a multivariate normal distribution with correlation matrix R_b. This result is the foundation for the application of the maxT-approach and is proven in the supplementary material. Together with a consistent estimator R̂_b of the correlation matrix R_b, we can calculate a common critical value c_α such that Φ_S(c_α, …, c_α) = 1 − α holds under the 'estimated LFC' θ̂_b. Hereby, b̂ and R̂_b are simply defined by plugging in Ŝe for Se and Ŝp for Sp, and φ_S and Φ_S denote the density and distribution function of the S-dimensional standard normal distribution with mean 0 and covariance matrix R̂_b. In practice, c_α can be found by numerical integration, for example, with help of the R package mvtnorm. 50 Finally, we can define our actual test statistic T with entries T_m = min(T_m^Se, T_m^Sp), for which T ≤ T_b holds (deterministically). This allows us to define a multiple test ψ via ψ_m = 1 ⇔ T_m > c_α, which asymptotically controls the FWER (2) at level α. A rigorous formulation of this result is provided in Appendix S1.
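The paper computes c_α with the R package mvtnorm; an equivalent Python sketch (an assumption of ours, not the authors' code) finds the equicoordinate quantile Φ_S(c, …, c) = 1 − α by root finding on the multivariate normal distribution function. It also illustrates the key practical benefit: correlated (similar) models need a smaller critical value than independent ones:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import multivariate_normal, norm

def maxt_critical_value(R, alpha=0.025):
    """Equicoordinate critical value c_alpha with
    Phi_S(c, ..., c; R) = 1 - alpha for correlation matrix R."""
    S = R.shape[0]
    mvn = multivariate_normal(mean=np.zeros(S), cov=R)
    f = lambda c: mvn.cdf(np.full(S, c)) - (1 - alpha)
    return brentq(f, 0.0, 10.0, xtol=1e-6)

# Independent statistics require the full (Sidak-type) adjustment ...
c_indep = maxt_critical_value(np.eye(3))

# ... while highly correlated model performances need less correction.
R = np.full((3, 3), 0.9)
np.fill_diagonal(R, 1.0)
c_corr = maxt_critical_value(R)

print(round(c_indep, 3), round(c_corr, 3))
```

For the identity matrix the result coincides with the Sidak-style quantile Φ⁻¹((1 − α)^{1/S}), while for ρ = 0.9 the critical value moves closer to the unadjusted Φ⁻¹(1 − α) ≈ 1.96.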
We can also construct (e.g.) one-sided simultaneous comparison regions for Se and Sp via

CR_m^Se = [Ŝe_m − c_α (Σ̂^Se_mm)^{1/2}, 1], CR_m^Sp = [Ŝp_m − c_α (Σ̂^Sp_mm)^{1/2}, 1], m ∈ 𝓢.  (8)

As these regions are based on the critical value derived under the estimated LFC, CR = (CR^Se, CR^Sp) should be reported as a comparison region and not as a confidence region or interval. 51 A regular, wider confidence region may, however, also be obtained via the maxT-approach and reported in addition. For this, one simply needs to replace R̂_b in (7) with the block correlation matrix composed of R̂_Se and R̂_Sp. When setting α = 0.5, the lower bounds of (8) can be used as corrected point estimators of Se and Sp with the property that the overall probability that both estimates overestimate their target for any m ∈ 𝓢 is asymptotically bounded by 50%. 37
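Given a critical value, the lower comparison bounds and the α = 0.5 correction are mechanical to compute. The sketch below uses made-up numbers (the critical value and variances are illustrative, and the median quantile shown assumes independent statistics rather than the estimated correlation matrix):

```python
import numpy as np
from scipy.stats import norm

def lower_bounds(est, var, c):
    """One-sided simultaneous lower bounds est_m - c * sqrt(var_m)."""
    return est - c * np.sqrt(var)

se_hat = np.array([0.92, 0.90])
var_se = np.array([0.001, 0.0012])
c_alpha = 2.2  # hypothetical maxT critical value at alpha = 0.025

lb = lower_bounds(se_hat, var_se, c_alpha)
print(lb)

# With alpha = 0.5 the same construction gives median-conservative
# corrected point estimates; under independence of S = 2 statistics
# the equicoordinate median quantile solves Phi(c)^2 = 0.5.
c_median_indep = norm.ppf(0.5 ** (1 / 2))
print(round(c_median_indep, 3))
```

Note that the median correction uses a strictly positive quantile (about 0.54 here), so the corrected estimates are always pulled below the raw empirical proportions.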

Study conclusion
Our framework provides a multiple test decision for the hypothesis system (1). After the evaluation study, one might therefore partition the set of candidate models 𝓜 = 𝓜⁺ ∪ 𝓜⁻ into positively and negatively evaluated prediction models. In the case 𝓜⁺ = ∅, no performance claim can be made and the evaluation study has failed in a strict confirmatory sense. When |𝓜⁺| = 1, only a single model meets the requirements and the situation is clear. However, if |𝓜⁺| > 1, the investigator has the choice which of the models m ∈ 𝓜⁺ should be implemented in practice.
When applying the maxT-approach, m* = argmax_{m ∈ 𝓜⁺} T_m is an obvious final model choice. In the co-primary endpoint setting, this corresponds to m* = argmax_{m ∈ 𝓜⁺} min(T_m^Se, T_m^Sp), that is, we select the model for which the evidence that both endpoints are significant is maximal. It might also be reasonable to select the final model according to further criteria not directly tied to the discriminatory performance. For instance, the calibration of predicted probabilities is a distinct important attribute of prediction models. 53-55 Moreover, the interpretability of a prediction model is also recognized as an important requirement for implementation in clinical practice. 56,57

MODEL SELECTION PRIOR TO THE EVALUATION STUDY
A prominent recommendation in machine learning and in medical testing is to strictly separate model development (training and selection) and model evaluation (performance assessment); compare Section 1. The most straightforward implementation of this strategy is to select a single final model prior to the evaluation study. Oftentimes, this is done by choosing the model with the highest empirical performance on a (hold-out) validation dataset out of the M initially trained models, as described in Section 3.1. When evaluating binary classifiers regarding their overall accuracy, previous work has shown that it is beneficial to evaluate multiple promising models simultaneously, for example, all models within one standard error of the best validation model. 37,38 This so-called within 1 SE selection rule can lead to evaluation studies with higher power and improved (final) model selection. Upward biased accuracy estimation can be corrected in a conservative fashion. To obtain multiplicity adjusted test decisions and corrected point estimates, the so-called maxT-approach 36,37,38 may be applied, which we adapted to the co-primary endpoint setting in Section 3.4.
To arrive at a suitable selection of prediction models for the evaluation study, more work is necessary when sensitivity and specificity are assessed simultaneously. The within 1 SE rule 37 can be adapted to the co-primary endpoint scenario by basing it on the weighted accuracy wAcc = w Se + (1 − w) Sp, w ∈ (0, 1), instead of the overall classification accuracy. We will investigate the performance of this approach with the weight w = ((1 − Se_0)/(1 − Sp_0) + 1)^{−1} in Section 5. This specific choice for w can be motivated by requiring that the two parameter configurations (Se, Sp) = (Se_0, 1) and (Se, Sp) = (1, Sp_0) are assigned an equal wAcc value. If sensitivity and specificity are deemed equally important (Se_0 = Sp_0), this implies a weight of w = 0.5. While the heuristic within 1 SE approach is intuitive and has proven to work empirically, it lacks a thorough theoretical justification. In particular, the question which multiplier k ≥ 0 results in the best within k SE rule remains open.
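A minimal Python sketch of this adapted rule is given below. The variance formula for wAcc follows from the independence of the Se and Sp estimates (they come from disjoint subsamples); the validation sample sizes and performance values are invented for illustration:

```python
import numpy as np

def wacc_weight(se0, sp0):
    """w = ((1 - se0)/(1 - sp0) + 1)^(-1); equals 0.5 when se0 == sp0."""
    return 1.0 / ((1 - se0) / (1 - sp0) + 1)

def within_one_se(se_hat, sp_hat, n1, n0, se0, sp0):
    """Indices of all models within one standard error of the best
    validation model in terms of weighted accuracy."""
    w = wacc_weight(se0, sp0)
    wacc = w * se_hat + (1 - w) * sp_hat
    # Se-hat and Sp-hat are independent binomial proportions, so the
    # variance of wAcc is the weighted sum of the two variances.
    var = (w ** 2 * se_hat * (1 - se_hat) / n1
           + (1 - w) ** 2 * sp_hat * (1 - sp_hat) / n0)
    best = np.argmax(wacc)
    return np.flatnonzero(wacc >= wacc[best] - np.sqrt(var[best]))

# Four hypothetical candidate models on a validation set (n1=80, n0=240).
se_hat = np.array([0.95, 0.93, 0.90, 0.80])
sp_hat = np.array([0.80, 0.84, 0.82, 0.90])

print(wacc_weight(0.80, 0.70))                       # 0.6
sel = within_one_se(se_hat, sp_hat, 80, 240, 0.80, 0.70)
print(sel)
```

With Se_0 = 0.80 and Sp_0 = 0.70, the weight is (0.2/0.3 + 1)⁻¹ = 0.6, that is, sensitivity is weighted slightly higher because its threshold is more demanding.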
In the following, we introduce a more elaborate algorithm which aims to derive the models to be evaluated in an optimal fashion. We are dealing with a subset selection problem, that is to say, our goal is to find the best subset 𝓢 ⊂ 𝓜, whereby "best" is yet to be defined. This topic has been extensively studied in the literature, for example, from the viewpoint of Bayesian decision theory [58, chapter 9]. From this perspective, the usual approach is to pick the decision (here: the subset 𝓢) from an appropriate decision space (here: {0, 1}^M) which minimizes the posterior expected loss, or equivalently maximizes the posterior expected utility. Several loss functions for the subset selection problem have been proposed in the literature [58, p. 548]. For instance, we may define the utility of picking subset 𝓢 ⊂ 𝓜 given the true parameter θ = (Se, Sp) and ϑ = min(Se − Se_0, Sp − Sp_0) (understood component-wise) as

U(𝓢, θ) = max_{m ∈ 𝓢} ϑ_m − c |𝓢|, c > 0.  (9)

In (9), we balance the best selected model performance max_{m ∈ 𝓢} ϑ_m against the subset size S = |𝓢|. Note that the second term in (9) is independent of θ. This trade-off is necessary as we could easily maximize max_{m ∈ 𝓢} ϑ_m by just selecting all models, 𝓢 = 𝓜. The parameter c thus has to be adjusted to guide this trade-off. This is also the case for most other popular utility functions proposed in the literature. 58 This is problematic in practice because there is (again) no clear-cut way to specify the "hyperparameter" c.
To overcome this issue, we will use a utility function which has no hyperparameter in the above sense that needs to be specified. For this, it is important to realize that we are really dealing with a two-stage selection problem. First, our goal is to select a suitable subset 𝓢 ⊂ 𝓜 of models before the evaluation study. Secondly, from 𝓢 we aim to select a final model m* ∈ 𝓢 with help of the evaluation study for implementation in practice. As our ultimate goal is to obtain a model or medical test with high performance, that is, high sensitivity and specificity, we propose to optimize the expected final model performance

EFP(𝓢) = E[ϑ_{m*(θ̂)}],  (10)

depending on the subset of models 𝓢 ⊂ 𝓜 selected for evaluation. The final model m* = m*(θ̂) is chosen based on the empirical performances in the evaluation study and is thus a random variable, unlike the fixed parameter θ = (Se, Sp). Our default final model choice will be m* = argmax_{m ∈ 𝓢} min(T_m^Se, T_m^Sp); compare Section 3.5. The expectation in (10) is taken with respect to p(θ̂ | 𝓢), the distribution of θ̂ in the evaluation study with n observations, which depends on θ. Assuming the prediction models are deterministic, p is a deterministic transformation of the distribution of the feature-label data (X, Y), as described in the beginning of Section 3.3. Of course, both distributions are unknown in practice, because θ is not known.
In formulation (10), EFP depends on 𝓢 ⊂ 𝓜, for which in principle there are 2^M choices. In order to simplify the decision problem of which models to evaluate and its numerical optimization, we rank our models before choosing 𝓢, which leaves us with only M choices instead of 2^M, namely 𝓢 = {1}, 𝓢 = {1, 2}, …, or 𝓢 = 𝓜. Consequently, we may write EFP(S) instead of EFP(𝓢) after ranking the models. How to rank models is not as obvious as in the single endpoint case concerned with the overall classification accuracy. Our default choice will be a ranking according to min(T_m^Se(𝓥), T_m^Sp(𝓥)), whereby the test statistics are defined as in (5) but based on the validation data. In other words, the models are ranked according to the evidence before the evaluation study that H_m is false.
The main idea behind our novel optimal EFP selection rule is to maximize an estimate ÊFP(S) of EFP(S) before the evaluation study. Taking the viewpoint of Bayesian decision theory, we aim to maximize the posterior expected utility at the time point of decision-making, right before the evaluation study. The expectation is taken with regard to the posterior distribution π = π(θ | 𝓥) of θ given the validation data 𝓥. Different Bayesian models could be applied to obtain said posterior distribution. In this work, we assume that π = (π_Se, π_Sp) is composed of two multivariate Beta (mBeta) distributions, Se ∼ π_Se and Sp ∼ π_Sp, which allows modeling arbitrary dependency structures between different Beta variables. 46 As an initial prior distribution π(θ), we use a vague (uniform) prior by default. This approach allows us to sample θ = (Se, Sp) from π and subsequently to sample estimates θ̂ = (Ŝe, Ŝp) from the multivariate binomial likelihood p = p(θ̂ | θ). Utilizing this Bayesian model allows us to forecast both the true parameter values and their respective estimates, including the dependency structure between them. This enables a simulation of the selection process in the evaluation study and the inherent selection-induced bias problem. In our numerical implementation, we iteratively repeat this process until our computational budget has been exhausted and finally average the results to approximate the (double) expectation in (11) for all S.
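The simulation loop can be sketched as follows. This is a deliberately simplified stand-in for the paper's procedure: independent Beta(1, 1) priors per model replace the multivariate Beta model (so cross-model dependence is ignored), and the final model is chosen by the unstandardized margin min(Ŝe − Se_0, Ŝp − Sp_0) rather than the min of the test statistics; all counts and sample sizes are invented:

```python
import numpy as np

rng = np.random.default_rng(42)

def efp_hat(k1, n1v, k0, n0v, n1e, n0e, se0, sp0, s_max, draws=2000):
    """Monte Carlo estimate of EFP(S) for S = 1, ..., s_max.

    k1/n1v, k0/n0v: validation successes / sample sizes per model
    (models assumed already ranked); n1e, n0e: evaluation study
    subsample sizes for diseased and healthy subjects."""
    efp = np.zeros(s_max)
    for _ in range(draws):
        # Draw "true" parameters from the (independent Beta) posterior.
        se = rng.beta(1 + k1, 1 + n1v - k1)
        sp = rng.beta(1 + k0, 1 + n0v - k0)
        # Forecast the empirical estimates of the evaluation study.
        se_hat = rng.binomial(n1e, se) / n1e
        sp_hat = rng.binomial(n0e, sp) / n0e
        theta = np.minimum(se - se0, sp - sp0)          # true vartheta
        t = np.minimum(se_hat - se0, sp_hat - sp0)      # evidence proxy
        for S in range(1, s_max + 1):
            m_star = np.argmax(t[:S])                   # final choice
            efp[S - 1] += theta[m_star]
    return efp / draws

# Hypothetical validation counts for M = 4 ranked models.
k1 = np.array([76, 75, 74, 70])
k0 = np.array([180, 184, 178, 190])
efp = efp_hat(k1, 80, k0, 240, 120, 360, 0.80, 0.70, s_max=4)
print(np.round(efp, 3))
```

The index of the maximum of `efp` then plays the role of S* in the selection rule; note how the simulation naturally reproduces the selection-induced bias, since the final model is picked on noisy forecasted estimates but scored on the drawn true parameters.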
Hereby, we specify a maximal number of models S_max for evaluation in advance to reduce the computational burden of the optimization. A solution to the subset selection problem in dependence of the validation data 𝒱 is then simply 𝒮 = {1, …, S*} with S* = argmax_{1 ≤ S ≤ S_max} ÊFP(S). Given the assumed prior distribution, the initial model ranking, and the statistical modelling assumptions (concerning π and p), the procedure results in an approximate Bayes action, as we approximately maximize the posterior expected utility EFP(S). The numerical implementation of the selection rule is described in more detail in Appendix S1. In Section 5, the optimal EFP approach is applied to the real data example from Section 2 and to synthetic data from an extensive simulation study.
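A minimal Monte Carlo sketch of ÊFP(S) and the resulting Bayes action is given below. It simplifies the paper's model in two labeled ways: independent Beta(1, 1) posteriors per model replace the mBeta distributions (so cross-model dependence is ignored), and the final model on the simulated evaluation data is chosen by max of min(Ŝe, Ŝp) as a stand-in for the test statistic T_m. The function name `efp_hat` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def efp_hat(tp, n1, tn, n0, n_eval_1, n_eval_0, s_max, n_sim=2000):
    """Monte Carlo estimate of EFP(S) for S = 1..s_max (models assumed pre-ranked)."""
    tp, tn = np.asarray(tp), np.asarray(tn)
    efp = np.zeros(s_max)
    for _ in range(n_sim):
        # draw true parameters from the (independent-Beta) posterior given validation counts
        se = rng.beta(1 + tp, 1 + n1 - tp)
        sp = rng.beta(1 + tn, 1 + n0 - tn)
        # forecast the estimates that the evaluation study would produce
        se_hat = rng.binomial(n_eval_1, se) / n_eval_1
        sp_hat = rng.binomial(n_eval_0, sp) / n_eval_0
        for S in range(1, s_max + 1):
            # simulate final selection among the first S models, record true performance
            m_star = np.argmax(np.minimum(se_hat[:S], sp_hat[:S]))
            efp[S - 1] += min(se[m_star], sp[m_star])
    return efp / n_sim

efp = efp_hat([95, 93, 90], 100, [85, 88, 80], 100, n_eval_1=200, n_eval_0=200, s_max=3)
S_star = int(np.argmax(efp)) + 1  # approximate Bayes action, as in the text
```

The double loop mirrors the double expectation in (11): the outer draws forecast both the true performances and their noisy evaluation-study estimates, the inner argmax simulates the selection step.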

5.1
Model evaluation study for motivating example
We continue the real-world data example from Section 2. We can now apply the optimal EFP selection rule from Section 4 to answer the previously stated question of which prediction models to evaluate in the prospective accuracy study.
Based on empirical hold-out validation performance, S = 16 models are selected for evaluation. All of them are support vector machines, as this model class was apparently best suited to train models with higher sensitivity than specificity, which aligns with our study goals (Se_0 = 0.80, Sp_0 = 0.70). This can be explained by the fact that the particular caret implementation utilizes a hyperparameter to adjust for different misclassification costs. Of the 16 selected models, none empirically performs worse than Ŝe(𝒱) = 0.96 and Ŝp(𝒱) = 0.80 in terms of hold-out validation performance.
In the subsequent evaluation study with n_ℰ = 532 observations, the best model has an empirical performance of Ŝe(ℰ) = 0.929 and Ŝp(ℰ) = 0.771. Adjusting for multiple comparisons using the maxT approach from Section 3.4, we obtain multiplicity-adjusted lower confidence bounds of 0.863 and 0.722, respectively. As both bounds are greater than the respective thresholds Se_0 = 0.80 and Sp_0 = 0.70, the null hypothesis (1) can be rejected. This result can also be deduced from the corresponding multiplicity-adjusted P-value, which is 0.001 = max(0.001, 4 ⋅ 10⁻⁶) for the best model and as such smaller than the (one-sided) significance level of α = 0.025.
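The co-primary decision logic above can be written out in a few lines: a model's null hypothesis is rejected only if both multiplicity-adjusted lower bounds clear their thresholds, so the model's adjusted p-value is the larger of its two endpoint p-values. This is a sketch of the decision rule only; function names are illustrative.

```python
def co_primary_reject(lb_se, lb_sp, se0, sp0):
    """Reject H_m iff BOTH adjusted lower confidence bounds exceed their thresholds."""
    return lb_se > se0 and lb_sp > sp0

def co_primary_p(p_se, p_sp):
    """Adjusted p-value of a model: the larger of its two endpoint p-values."""
    return max(p_se, p_sp)

# numbers from the best model in the example
reject = co_primary_reject(0.863, 0.722, 0.80, 0.70)
p_adj = co_primary_p(0.001, 4e-6)
```

With these inputs, `reject` is true and `p_adj` equals 0.001, below α = 0.025, reproducing the test decision reported in the text.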
Interestingly, all other models fail to replicate their good performance on the evaluation data; at least, there is no further significant result. In particular, the seemingly best model before the evaluation study (Ŝe(𝒱) = 0.973, Ŝp(𝒱) = 0.881) now fails to accurately classify healthy subjects as such (Ŝe(ℰ) = 0.936, Ŝp(ℰ) = 0.614). In summary, in this illustrative example, the selection and evaluation of multiple promising models saved the evaluation study from being unable to demonstrate a sufficient screening accuracy for at least one prediction model.

5.2
Simulation study: machine learning and evaluation

Setup
The goal of this first simulation study is to assess the properties of different model selection rules and our statistical inference approach under realistic parameter settings in the machine learning context. To this end, we investigate different methods on a large simulation database which was generated for related methodological comparisons.38 While the overall binary classification accuracy was our only performance measure of interest in earlier work, we turn to the investigation of sensitivity and specificity as co-primary endpoints in the following. The simulation database consists of 144 000 instances of the complete machine learning pipeline. As shown in Figure 1, a single instance consists of training (𝒯), validation (𝒱), and evaluation (ℰ) datasets, which are all comprised of feature-label observations (x, y). Different binary classification models were trained on the so-called learning data ℒ = 𝒯 ∪ 𝒱, compare Section 3.1. For that matter, we employed four popular learning algorithms (elastic net, classification trees, support vector machines, extreme gradient boosting) with help of the R package caret.12,42 More concretely, we trained M = 200 initial candidate models, 50 models obtained from randomly sampled hyperparameters per algorithm, on the training data 𝒯 ⊂ ℒ. The distinct validation data 𝒱 = ℒ ⧵ 𝒯 is then used for model selection. The models 𝒮 ⊂ ℳ = {1, …, M} selected for the evaluation study are then retrained on the complete learning data ℒ = 𝒯 ∪ 𝒱, as this can be expected to slightly increase their predictive performance. The selected models then undergo a final assessment on the independent evaluation data ℰ, which mimics a diagnostic accuracy study. The goal of this final evaluation study is to estimate the sensitivities and specificities of the selected models and to obtain a test decision regarding hypothesis system (3). Ultimately, a final model m* ∈ 𝒮 is selected to be implemented in practice, if positively evaluated, compare Section 3.5.
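The pipeline can be sketched end to end with toy stand-ins: a one-dimensional threshold classifier with a random cut-off replaces the caret-trained models, and overall accuracy stands in for the co-primary endpoints. All names and the data-generating process are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Synthetic stand-in for draws from the feature-label distribution P(X,Y)."""
    x = rng.normal(size=n)
    y = (x + rng.normal(size=n) > 0).astype(int)
    return x, y

def train(x, y, cut):
    """Toy 'model': a fixed decision threshold; `cut` plays the role of a hyperparameter."""
    return lambda x_new: (x_new > cut).astype(int)

# learning data L = T ∪ V, split into training and hold-out validation parts
x_L, y_L = sample(400)
x_T, y_T, x_V, y_V = x_L[:300], y_L[:300], x_L[300:], y_L[300:]

# train M candidates on T, select on V, retrain the selection on all of L
cuts = rng.uniform(-1, 1, size=20)
models = [train(x_T, y_T, c) for c in cuts]
val_acc = [np.mean(m(x_V) == y_V) for m in models]
selected = np.argsort(val_acc)[-5:]                   # crude stand-in for a selection rule
final = [train(x_L, y_L, cuts[i]) for i in selected]  # retrain on complete learning data

# final assessment on independent evaluation data E
x_E, y_E = sample(400)
eval_acc = [np.mean(m(x_E) == y_E) for m in final]
```

The retraining step mirrors the text: the selected models are refit on ℒ = 𝒯 ∪ 𝒱 before the independent evaluation on ℰ.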
For each individual simulation instance, the distinct data sets ℒ and ℰ are sampled from the same joint probability distribution P(X,Y). To generate the whole database, different distributions P(X,Y) with varying characteristics were specified. The database consists of linear and nonlinear tasks (ratio 1:2), independent and dependent feature distributions (ratio 1:1), and different disease prevalences ρ ∈ {0.15, 0.3, 0.5} (ratio 1:1:1). The learning data ℒ is of size n_ℒ ∈ {400, 800} (ratio 1:1). The model selection was always conducted on a hold-out validation dataset 𝒱 ⊂ ℒ of size n_𝒱 = n_ℒ/4. Each learning data set (including the trained models) was used twice, once in connection with each of the considered evaluation data set sizes n_ℰ ∈ {400, 800} (ratio 1:1).
The true model performances θ = (Se, Sp) of all models are approximated with high precision on a large population data set (n = 100 000). This dataset was not used for any other purpose. The truly best model is defined in line with the study goal outlined in Section 3.2 as

m⊕ = argmax_{m ∈ ℳ} min(Se_m, Sp_m + Δ_0).   (12)

The parameter Δ_0 is set to 0 in this simulation study, which expresses that sensitivity and specificity are deemed equally important. The corresponding maximal performance is denoted as ϑ⊕ = ϑ_{m⊕}. The final model m* ∈ 𝒮 is chosen based on the evaluation data as m* = argmax_{m ∈ 𝒮} T_m, compare Section 3.5. In case of a tie, m* is chosen randomly from the set argmax_{m ∈ 𝒮} T_m. R packages and scripts related to the simulation study are publicly accessible.§ We initially divided the simulation database into two parts via a stratified random split with regard to the above-mentioned factors. The smaller subset with 24 000 instances was used in advance to perform a few methodological comparisons with the goal of fine-tuning certain aspects of the optimal EFP selection rule described in Section 4. For instance, we altered the way models are ranked initially and several details associated with the numerical complexity of the algorithm (number of iterations, convergence criterion), as outlined in the supplementary material. The final algorithm was then fixed and investigated on the main part, which consists of 120 000 instances, as described in the following.
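The two selection steps above, the truly best model m⊕ of (12) and the final model m* with random tie-breaking, translate directly into code. The test statistics T_m are taken as given inputs here; function names are illustrative.

```python
import numpy as np

def best_model(se, sp, delta0=0.0):
    """Truly best model m⊕ in the sense of (12): maximize min(Se_m, Sp_m + Δ0)."""
    return int(np.argmax(np.minimum(np.asarray(se), np.asarray(sp) + delta0)))

def final_model(t_stats, rng):
    """m* = argmax_m T_m on the evaluation data; ties broken uniformly at random."""
    t = np.asarray(t_stats, float)
    winners = np.flatnonzero(t == t.max())
    return int(rng.choice(winners))

m_best = best_model([0.9, 0.8], [0.7, 0.85])  # min(0.8, 0.85) = 0.8 beats min(0.9, 0.7) = 0.7
m_star = final_model([1.2, 3.4, 3.4], np.random.default_rng(0))
```

In the tied example, `m_star` is drawn uniformly from the two models sharing the maximal test statistic, matching the tie rule described in the text.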

Goals
Our overall goal is an assessment of important operating characteristics of the employed evaluation strategies. An evaluation strategy is comprised of a selection rule (before evaluation) and a statistical testing procedure (for evaluation). The latter will always be conducted via the maxT approach introduced in Section 3.4. The primary objective of this simulation study is therefore the comparison of different selection rules, as summarized in Table 1.59,60 We are mainly interested in the performance of our novel Bayesian approach designed to optimize the EFP, compare Section 4. This approach is referred to as the optimal EFP selection rule. Its main competitor is the within 1 SE rule, for which 𝒮 is defined as all models m ∈ ℳ with preliminary balanced accuracy bAcc_m(𝒱) = (Ŝe(𝒱) + Ŝp(𝒱))/2 (estimated on 𝒱) not more than one standard error below the maximal bAcc_m(𝒱). The default rule selects the single best model in terms of bAcc(𝒱) for evaluation and will serve as a secondary comparator. Additionally, we consider the oracle selection rule, which cannot be implemented in practice but gives an insight into the theoretically achievable performance (of selection rules). The oracle rule always selects the truly best model 𝒮 = {m⊕} as defined in (12). To reduce the computational complexity of the simulation study, we hard-thresholded the number of evaluated models to S_max = √n_ℰ. Our main research hypothesis going into the simulation is that the optimal EFP rule improves the expected final performance (EFP) and statistical power compared to the within 1 SE rule while having comparable estimation bias. Moreover, we require that the FWER is still controlled at the desired level, which is set to α = 2.5% (one-sided) in all scenarios.
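The within 1 SE rule can be sketched as below. A binomial-based standard error of bAcc is assumed here (sqrt(Var(Ŝe)/4 + Var(Ŝp)/4)); the paper's exact SE formula may differ, and the hard threshold S_max = √n_ℰ is applied after ranking.

```python
import numpy as np

def within_1_se(se_hat, sp_hat, n1, n0, s_max):
    """Select all models whose validation bAcc is within one SE of the best bAcc (sketch)."""
    se_hat, sp_hat = np.asarray(se_hat), np.asarray(sp_hat)
    bacc = (se_hat + sp_hat) / 2
    # Var(bAcc) = Var(Se_hat)/4 + Var(Sp_hat)/4 under independent binomial sampling
    var = se_hat * (1 - se_hat) / n1 + sp_hat * (1 - sp_hat) / n0
    se_bacc = np.sqrt(var) / 2
    best = np.argmax(bacc)
    keep = np.flatnonzero(bacc >= bacc[best] - se_bacc[best])
    keep = keep[np.argsort(-bacc[keep])][:s_max]  # hard threshold S_max = sqrt(n_E)
    return keep

sel = within_1_se([0.95, 0.94, 0.70], [0.85, 0.86, 0.99], n1=100, n0=100, s_max=20)
```

In this toy call the first two models share bAcc = 0.90 and are both kept, while the third (bAcc = 0.845) falls more than one SE below the maximum and is dropped.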

Results
Firstly, we investigate how well the feature-label relationships can be learned by the considered algorithms. Figure 2 (left) shows the distribution of the truly best model performance ϑ⊕ over all 60 000 unique learning instances. The results are stratified by disease prevalence ρ and learning sample size n_ℒ. We observe that ϑ⊕ increases in both these factors, as expected. Overall, the median of ϑ⊕ lies between 79% and 86%, which we consider to be a very realistic range. Moreover, Figure 2 (right) shows the distribution of Δ⊕ = Se⊕ − Sp⊕, which indicates how balanced the accuracy of the best model is across the diseased and healthy subpopulations. Note that for the oracle rule, the model selection is 𝒮 = {m⊕} and thus m* = m⊕ is also the final model. Figure 3 illustrates the true performance of the final selected model m* ∈ 𝒮, whereby 𝒮 depends on the employed selection rule. Depending on the threshold value ϑ_0 = Se_0 = Sp_0, Figure 3 shows the (simulated) probability P(ϑ_m* ≥ ϑ_0 | 𝒮) that the final chosen model has at least a performance of ϑ_0 = ϑ⊕ − δ, δ > 0.
The general picture is that the optimal EFP and within 1 SE rules both clearly outperform the default rule. Concerning our main comparison, we observe that the optimal EFP approach uniformly outperforms the within 1 SE rule. The difference is, however, rather small. For instance, using the optimal EFP rule leads to a probability of 86.1% of selecting a final model with sensitivity and specificity higher than 0.75, compared to 80.9% with the within 1 SE rule (difference 5.2%; 99% CI: 5.0%-5.5%; average over all scenarios). Further detailed results are provided in Appendix S1.
Figure 4 shows the rejection rate rr(δ) = P(φ_m* = 1 | δ) as a function of δ = ϑ⊕ − ϑ_0. The case δ = 0, or ϑ_0 = ϑ⊕, corresponds to the smallest threshold value ϑ_0 such that the global null hypothesis (none of the initially trained models has a high enough performance) is true, compare (3). Thus, for δ ≤ 0 the global null is true and the rejection rate rr(δ) should be bounded by α = 2.5% for all selection rules. Note that this parametrization via δ is necessary, as ϑ⊕ is not fixed over all simulation instances, as illustrated in Figure 2. The results clearly show that FWER control is given for all rules in all scenarios. For δ ≤ 0, the rejection rate rr(δ) is actually not zero but positive, as reported in table 1 in Appendix S1. In the same table, we also show the FWER when the (pre-)selection process is not incorporated in the test decision. More precisely, we are then looking at the type I error rate of the final statistical test for hypothesis system (1), which is equal to P(φ_m* = 1 | ϑ_0 = ϑ_m*). We refer to this quantity as the conditional FWER, which is necessarily larger than rr(0) but still controlled, that is, smaller than α in all cases. Strictly speaking, it is only an upper bound of the proper conditional FWER P(φ_m* = 1 | ϑ_0 = max_{m ∈ 𝒮} ϑ_m), which was not "recorded" in our simulation database.
In the situation δ > 0, the threshold value ϑ_0 is small enough such that the global null is no longer true, that is, there is at least one model m ∈ 𝒮 with true performance ϑ_m > ϑ_0 which we seek to detect. Larger rejection rates are thus desirable, as they correspond to an increased expected statistical power. The expectation hereby is taken over the generative distribution defined by the simulation design. Concerning expected power, the ranking of selection rules is the same as in the EFP comparison (Figure 3). The multiple testing approaches, optimal EFP and within 1 SE, clearly outperform the default rule. The expected power increase is up to 30%, depending on the scenario and δ. The optimal EFP rule outperforms the within 1 SE rule for all investigated values of δ. The difference in expected power is more pronounced in the case n_ℰ = 800 compared to n_ℰ = 400.
Additional results are reported in Appendix S1. For instance, the properties of the corrected point estimators θ̃_m* = (S̃e_m*, S̃p_m*) are assessed, compare Section 3.4. When only a single model is evaluated, θ̃ corresponds to θ̂, the raw (slightly regularized) point estimate, compare Section 3.3. For example, the combined mean absolute deviation is decreased when multiple models are evaluated simultaneously (optimal EFP: 0.040; within 1 SE: 0.048, averaged over both cases n_ℰ ∈ {400, 800}) compared to the default rule (0.077). This might seem counter-intuitive at first but is in line with previous findings.38 This observation can be explained by the on average increased final model performance ϑ_m* = min(Se_m*, Sp_m*), which results in a decreased variance of the binomially distributed estimators. In contrast, the estimation bias is more pronounced when multiple models are evaluated. Note that the bias E(ϑ̃_m* − ϑ_m*) is negative (by construction), as we use the corrected estimator ϑ̃_m* = min(S̃e_m*, S̃p_m*), compare Section 3.4.
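The variance argument can be illustrated directly: the standard deviation of a binomial proportion estimator, sqrt(p(1 − p)/n), shrinks as the true proportion moves toward 1, so a better final model is also estimated more precisely.

```python
import math

def binom_sd(p, n):
    """Standard deviation of a binomial proportion estimator: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

sd_good = binom_sd(0.95, 200)  # higher true performance -> smaller sampling variability
sd_fair = binom_sd(0.85, 200)
```

With n = 200 per class, the estimator of a 0.95 performance has a standard deviation of about 0.015 versus about 0.025 at 0.85, which is the mechanism behind the reduced mean absolute deviation reported above.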
Finally, we note that the optimal EFP rule on average selects fewer models for evaluation when n_ℰ = 400 compared to the within 1 SE rule. On the other hand, when n_ℰ = 800 samples will be available in the evaluation study, the optimal EFP rule selects slightly more models on average, compare table 1 in Appendix S1.

5.3
Simulation study: LFCs
In the previous section, the operating characteristics of our simultaneous inference framework were assessed in a variety of realistic scenarios in the predictive modeling context. In contrast, we now investigate the worst-case type I error rate of the maxT approach for different sample sizes. For that matter, we simulate synthetic data under LFCs, as described in Section 3.4, and also close to LFCs. We assume that S ∈ {1, 10, 20} models are assessed on the evaluation data and consider the case where Se_0 and Sp_0 are identical. For each simulation run, we randomly draw a binary vector b such that |b| = S/2; for S = 1, we set b = (1). Hereby, the parameter γ ∈ {0, 0.001, 0.002} expresses how close we are to a true LFC, which corresponds to γ = 0. The true parameters Se and Sp are defined as

Se_{b,γ,m} = b_m (Se_0 − γ(m − 1)) + (1 − b_m),  Sp_{b,γ,m} = b_m + (1 − b_m)(Sp_0 − γ(S − m)),  m = 1, …, S.   (13)

For instance, for S = 10, Se_0 = Sp_0 = 0.8, and γ = 0.001, this yields a nearly least favorable parameter configuration. If γ were 0 instead of 0.001 in (13), all elements in Se_{b,γ} and Sp_{b,γ} different from 1 would change to Se_0 = Sp_0 = 0.8 (assuming the same b). We mainly consider a (fixed) disease prevalence of ρ = 0.2. Hence, there are n_1 = ρn diseased and n_0 = n − n_1 healthy subjects, whereby n ∈ {200, 400, 800, 4000, 20 000} in our simulation. Additionally, the scenarios with γ = 0 were also repeated with a balanced class distribution (ρ = 0.5). In all simulations, the true correlation between all estimators belonging to nontrivial variables (true mean not equal to one), for example, Ŝe_1 and Ŝe_2 in the example above, is set to 0.5. Results from a sensitivity analysis regarding this specification are reported at the end of this section. Any empirical sensitivity is of course independent of any empirical specificity. Data generation, that is, sampling of the similarity matrices Q_Se and Q_Sp (Section 3.3) based on the above specifications, was conducted with help of the bindata package in R.61,62
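The construction of the (nearly) least favorable parameters in (13), as reconstructed above, can be sketched as follows; the function name is illustrative.

```python
import numpy as np

def lfc_params(S, se0, sp0, gamma, b):
    """(Nearly) least favorable parameters following the form of (13).

    b is a binary vector with |b| = S/2 (b = (1) for S = 1); gamma = 0 gives the exact LFC:
    models with b_m = 1 sit at Se_0 with Sp = 1, the others at Sp_0 with Se = 1.
    """
    b = np.asarray(b, float)
    m = np.arange(1, S + 1)
    se = b * (se0 - gamma * (m - 1)) + (1 - b)
    sp = b + (1 - b) * (sp0 - gamma * (S - m))
    return se, sp

se, sp = lfc_params(4, se0=0.8, sp0=0.8, gamma=0.001, b=[1, 1, 0, 0])
```

For γ = 0 this reproduces the configuration described in the Figure 5 caption: S/2 models with Se_m = Se_0 and Sp_m = 1, and vice versa for the remaining S/2 models.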
The number of simulation runs was set to N_sim = 10 000, which results in a (point-wise) upper bound for the SE of the simulated FWER of 0.5/√10 000 = 0.005. Figure 5 shows the simulation results stratified by all relevant factors. In all scenarios, the results match the theoretical findings from Section 3.4 in that the asymptotic FWER under LFCs converges to α = 0.025 as a function of the sample size n. When the disease prevalence is increased from ρ = 0.2 to 0.5, the FWER is (slightly) decreased. In comparison, the number of models S and the parameter γ have a stronger influence on the FWER. In the most extreme case, the FWER is close to 14% for S = 20 and Se_0 = Sp_0 = 0.9 at n = 200 and reaches its target α = 0.025 not before n = 20 000. Remember that there are only n_1 = ρn diseased subjects, so the "relevant" sample size is much smaller, particularly in the case ρ = 0.2. When we relax the LFCs (γ = 0) to only slightly less favorable parameters (γ = 0.001) such as in (13), the FWER declines much faster in n. For Se_0 = Sp_0 = 0.8, the target significance level is met already at around n = 400 ≈ exp(6) total or n_1 = 80 diseased observations. For Se_0 = Sp_0 = 0.9, this number more than doubles to roughly n = 1100 ≈ exp(7) for S = 10 and n = 1800 ≈ exp(7.5) for S = 20.
In an ancillary simulation (N_sim = 5000), we also investigated results under different true correlation structures of the estimators Ŝe, Ŝp (independence, equicorrelation, autocorrelation) and different correlation strengths. None of these changes resulted in qualitatively different results than those shown in Figure 5. Additionally, we investigated the cases Se_0 = 0.8, Sp_0 = 0.9 and vice versa in the LFC scenario (N_sim = 5000). Hereby, it became apparent that the FWER development depends almost solely on the sensitivity threshold(s), at least in the low prevalence case (ρ = 0.2). This can be explained by the lower number of diseased subjects and hence the lower "relevant" sample size. When ρ = 0.5, the problem becomes symmetric, and the FWER curves for ϑ_0 = (0.8, 0.9) and ϑ_0 = (0.9, 0.8) are thus not distinguishable and lie in between those shown in Figure 5, that is, between the cases ϑ_0 = (0.8, 0.8) and ϑ_0 = (0.9, 0.9).
The worst-case projection of parameter values to one, for example, of Sp_1, Sp_2, Se_3, … in (13), is not particularly realistic. For instance, in the case Se_m = 0.9, Sp_m = 1, and ρ = 0.2, the overall accuracy is Acc_m = 0.98. In real-world problems, such a high accuracy is rarely attainable. We therefore also considered an alternative data-generating scenario where (13) is modified such that the overall accuracy of each model m is bounded by 0.95. This change decreased the FWER by 0.5 to 2.0 percentage points, depending on the exact scenario. Otherwise, the results from this sensitivity analysis do not deviate noticeably from those shown in Figure 5.

F I G U R E 5
Family-wise error rate (FWER) in dependence of the total sample size n = n_1 + n_0 (logarithmic scale) under different parameter configurations (specified by ϑ_0, γ) and disease prevalences ρ = n_1/n. In all cases, S/2 (randomly selected) models have Se_m = Se_0 and Sp_m = 1, and vice versa for the remaining S/2 models. For S = 1, Se = Se_0 and Sp = 1. The target significance level is α = 2.5% (one-sided); N_sim = 10 000

Summary
In this work, we have investigated statistical and study-design-related methods to improve diagnostic accuracy studies with the co-primary endpoints sensitivity and specificity. The main idea is to allow multiple promising candidate models to be assessed simultaneously on the evaluation data, thereby increasing the likelihood of identifying one that performs sufficiently well. In modern medical testing applications this is necessary because often hundreds or thousands of model architectures and preprocessing pipelines are compared on preliminary data, making a final decision before the evaluation study difficult and prone to error. Our approach allows us to wait for the test data before a final decision is made, thereby possibly correcting an erroneous model ranking.
The main advantages of the multiple testing approach are the on average increased final model performance and the vastly improved statistical power. As multiple comparisons are now conducted on the final dataset, inferences need to be adjusted via a multiple testing procedure. For that matter, we employed a variation of the so-called maxT approach, which is based on a multivariate normal approximation.36 This framework also enables corrected, conservative point and interval estimation. On the downside, the unbiased estimation of the default single-model evaluation strategy is lost. An important question of practical interest is how models should be selected for evaluation. Besides the heuristic within 1 SE rule, we also employed the novel optimal EFP rule. This Bayesian approach aims to optimize the EFP before the evaluation study. The underlying model is based on the hold-out validation data from the model development phase, which is readily available before the evaluation study. The comparison regarding all important criteria (performance, power, estimation bias) turns out in favor of the more elaborate optimal EFP approach. However, the advantage relative to the within 1 SE rule was small to moderate in most scenarios. We conclude that the optimal EFP rule yields a vast improvement over the default approach, but the conceptually and implementation-wise simpler within 1 SE rule yields almost as good results.
As laid out in Section 4, an advantage of the optimal EFP selection rule is that it does not involve any tuning parameter, in contrast to other potential decision-theoretic approaches and the within k SE rule (k > 0). Another major advantage is that the optimal EFP rule takes into account the evaluation sample size, that is, it selects more models when more data is (expected to be) available in the evaluation study. This is not the case for the simpler within 1 SE rule, which only takes into account estimation uncertainty in the validation stage, before the evaluation study. A general advantage of evaluating multiple models simultaneously is that other decision criteria, such as model calibration, can be used to guide the final selection once superiority with regard to model accuracy has been established for multiple models. A disadvantage is the increased statistical (and computational) complexity.

Limitations
Our simultaneous test procedure guarantees asymptotic strong control of the FWER. As illustrated in Section 5.3, the finite-sample performance under LFCs can be unsatisfactory, depending on the exact scenario. As expected, the FWER depends primarily on the threshold values Se_0 and Sp_0, because the employed normal approximation performs worse near the boundary of the unit interval. Control of the FWER is also worse when the number of models S is large or when the relevant smaller subsample size, usually the number of diseased subjects, is small. In effect, our framework may not always be suitable in situations where only a very small number of diseased subjects is available for evaluation purposes. Future work could aim to improve on this limitation. However, as indicated in the main simulation (Section 5.2), the average FWER under realistic conditions is far below the significance level. Note that in contrast to the simulation study presented in Section 5.3, the analysis in Section 5.2 is an average risk assessment. The average is taken with respect to different generative distributions over the relevant parameter values, that is, model performances and their dependency structure. These generative distributions are not known but rather implicit, in the sense that the relevant parameters depend on the (feature-label) data-generating distributions P(X,Y) (which are known) and the properties of the learning algorithms. Their resulting characteristics are partially described in Figure 2 and deemed to be realistic for real-world prediction tasks. Note that the results from Section 5.2 can be generalized to other prediction tasks and learning algorithms which result in a similar generative distribution over parameters θ (and dependency structures thereof).
Together, the results from both simulation studies, realistic scenarios in Section 5.2 and the worst-case assessment in Section 5.3, indicate that parameter configurations in reality are rarely least favorable. An associated observation was made at the end of Section 5.3: when the LFC is made only slightly less favorable, the FWER declines much faster in n. In the machine learning context, it would be reasonable to assume that the larger the number of models S, the less likely least favorable parameters become. After all, the LFC corresponds to the setting that, out of M initial candidate models, S have the exact same performance at the boundary of the null hypothesis and are all selected for evaluation. While this reasoning may be convincing to the practitioner, it does not apply from a purely frequentist viewpoint, as probability statements over true parameters are irrelevant when (worst-case) FWER control in the strong sense is the goal.
The multiple testing approaches certainly increase the flexibility of the evaluation study. However, prediction models still need to be fixed entirely before the study. Verifying this could pose challenges in a regulatory context, as merely specifying the learning algorithm and hyperparameters would not be enough. Because the exact resulting models, that is, their weights, often depend on the random initialization in the training process, only a full specification of model architecture and weights can be considered sufficient in this regard.
It can be seen as a limitation of this work that the learning algorithms employed in the real data example and simulation study are not necessarily optimal for the considered classification tasks. On the other hand, this also applies to the real world, as researchers rarely know which learning algorithms perform poorly before applying them to the problem at hand. While we expect that the present numerical results generalize to more complex problems (e.g., imaging tasks) and algorithms (e.g., deep neural networks), this was not demonstrated in the present work and should thus be investigated in the future.

Extensions
Our framework can be extended to classification problems with more than two classes, for example, different disease types or severity grades. More generally, it can be employed in arbitrary prediction tasks for which not only the overall performance is important but also the performance in different subpopulations. So far, we have not considered such problems in simulation studies. It can, however, be expected that the operating characteristics of our multiple testing framework will largely depend on the sample size of the smallest subpopulation. This could render the approach practically infeasible for problems with many subclasses or an unbalanced distribution of subclasses. An adaptation which might be considered in the future is hypotheses weighting. The employed maxT approach can rather easily be adapted for that matter.63 That is to say, rather than splitting the significance level α equally over the S hypotheses, one could spend more of it on the most promising models. A natural source of information to determine the weighting is of course the empirical data from the model development phase. Further research on optimal weighting and its impact on, for example, statistical power is however necessary. A step further in connecting the evidence before and after the evaluation study would demand Bayesian methods.
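The idea of spending the significance level unequally can be illustrated with a simple weighted-Bonferroni-style split. This is only a stand-in for a proper weighted maxT procedure, which would instead adjust the joint critical values; the function name and weights are illustrative.

```python
def weighted_alpha(alpha, weights):
    """Split the one-sided significance level across S hypotheses proportionally
    to nonnegative weights (weighted-Bonferroni stand-in for weighted maxT)."""
    total = sum(weights)
    return [alpha * w / total for w in weights]

# spend more of alpha = 0.025 on the most promising model (weight 3 vs 1 vs 1)
local_alphas = weighted_alpha(0.025, [3, 1, 1])  # approximately [0.015, 0.005, 0.005]
```

The local levels always sum back to α, so the overall error budget is preserved regardless of the chosen weights.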
Our optimal EFP subset selection algorithm is designed to maximize the expected final model performance. While this is a natural goal, there are many other potential utility functions that could reasonably be optimized before the evaluation study. In particular, utility functions based on statistical power, estimation bias, or a weighted combination of different criteria appear reasonable. Our novel approach could also be used here to make the specification of the utility function independent of the subset size and thereby simpler for the practitioner. In its current form, our implementation uses several approximations and can still be improved regarding its numerical efficiency.
For simplicity, we assume in the following that Δ_m^Se and Δ_m^Sp are not exactly equal. Define the indicator variables b_m = 1(Δ_m^Se < Δ_m^Sp), b = (b_1, …, b_S), and the corresponding diagonal matrix B = diag(b). Moreover, note that the regions CR^Se_{1−α} and CR^Sp_{1−α} together do not define a 1 − α confidence region in the full (2S)-dimensional parameter space, as the coverage probability P_θ(Se ∈ CR^Se_{1−α} ∧ Sp ∈ CR^Sp_{1−α}) may be smaller than 1 − α. Instead, we have asymptotically that

F I G U R E 1
Machine learning pipeline of model training, selection, and final evaluation of diagnostic or prognostic accuracy. This depiction shows a single problem instance in the simulation study, whose main goal is to compare (subset) selection rules to improve the planning of model evaluation studies

F I G U R E 2
Distribution of ϑ⊕ and Δ⊕, stratified by disease prevalence ρ and learning sample size n_ℒ. Each boxplot is based on 10 000 distinct points, which amounts to 60 000 unique simulation instances in total

F I G U R E 3
Probability of the event that the final model performance ϑ_m* = min(Se_m*, Sp_m* + Δ_0) exceeds a given threshold ϑ_0 = ϑ⊕ − δ. ϑ⊕ is the true performance of the best candidate model. Each scenario of learning and evaluation sample sizes (n_ℒ, n_ℰ) is based on 30 000 repetitions of the complete machine learning pipeline

F I G U R E 4
Rejection rate rr(δ) = P(φ_m* = 1 | δ) for the global null hypothesis (all prediction models have performance ϑ_m ≤ ϑ_0) in the evaluation study after models have been selected for evaluation by the specified selection rule. δ = ϑ⊕ − ϑ_0 specifies whether the global null is true (δ ≤ 0) or false (δ > 0). Each scenario of learning and evaluation sample sizes (n_ℒ, n_ℰ) is based on 30 000 repetitions of the complete machine learning pipeline. The significance level is α = 2.5% (one-sided, dashed horizontal line)
T A B L E 1
Key features of the simulation study: emulation of the machine learning pipeline under different scenarios (w.r.t. sample size, dependency structure, type of feature-label relationship, disease prevalence), compare Figure 1