Permutation tests for ASCA in multivariate longitudinal intervention studies

Permutation tests are the standard technique for significance testing in Analysis of Variance Simultaneous Component Analysis (ASCA). However, there is a vast number of alternative approaches to permutation testing, and the number of choices grows with the complexity of the study design. In this paper, we focus on longitudinal intervention studies with multivariate outcomes, a relevant experimental design in clinical studies where the outcome is an omics profile (as in genomics, metabolomics, and the like). We propose a new technique to derive power curves tailored to the size and (un)balanced nature of the data set in the study. This technique is useful to identify misleading permutation tests, with a lack of power or overly optimistic outcomes. We found that choosing the best permutation approach is far from intuitive and that there is a significant risk of deriving incorrect conclusions in real-life analyses. Our approach avoids this risk and can be extended to other complex designs of interest. The code is available for free use.

We believe that the methods and results provided here can be of wider interest in the application of ASCA to complex studies. A typical longitudinal intervention study includes at least two kinds of subjects, for example, those that receive a treatment and a control group, and experimenters would like to test the effect of said treatment in comparison to the baseline/starting point. 8 Therefore, the most interesting significance test is the one for the interaction between time (difference between baseline and post-treatment) and treatment (difference between treatment and control). Furthermore, the experimental design and corresponding ANOVA model are moderately complex if we want to get rid of the non-negligible inter-subject variance.
In a related publication assessing differential responses to cancer treatment with metabolomics, 9 we observed that different permutation tests provided inconsistent outcomes for the significance of the interaction between time and treatment. This points to a fundamental problem in the testing ability of permutation tests in ASCA and should be treated carefully, since such inconsistencies imply totally contrary conclusions in the study.
Motivated by the previous observation, we propose in this paper a new technique for power analysis in the permutation testing of ASCA models. This technique allows us to derive power curves tailored to the size and (un)balanced nature of the data set in the study, and it is useful to identify misleading permutation tests, which lack power or provide overly optimistic outcomes. The application of the proposed technique to the data sets in Díaz et al 9 demonstrates that the choice of the best permutation approach is far from intuitive and that there is a significant risk of deriving incorrect conclusions in real-life analyses.
The rest of the paper is organized as follows. Section 2 discusses the typical structure of factors and interactions in a longitudinal intervention study and its analysis with ASCA. Section 3 reviews a number of variants of permutation tests that are of interest for this paper. Section 4 presents our approach for the computation of power curves. Section 5 describes the data sets considered in the experimental part. Section 6 presents the simulation results, and Section 7 draws the conclusions of the work.

| ANALYSIS OF THE LONGITUDINAL INTERVENTION STUDY DESIGN WITH ASCA
A typical longitudinal intervention study includes three factors and one interaction:
• Treatment factor, denoted in the remainder of the paper as A, measures the difference between the treatment and control groups. This is a fixed factor.
• Time factor, denoted as B, measures the variance during the study, in two or more measurement points in the longitudinal study. This is also treated as a fixed factor.
• Subject factor, denoted as C(A), measures the inter-subject variance. This is a random factor nested in factor A. However, in this paper, we treat it as a fixed factor for simplicity, after finding similar results with alternative methods 8 that do consider its random nature.
• Interaction of treatment X time, denoted as AB, measures to which extent the application of the treatment caused a differential evolution in time between the treatment and control groups.
Let us call X the N × M data matrix of outcomes retrieved in a longitudinal intervention study. The data in X can be decomposed as follows using ASCA:

X = 1m + A + B + C(A) + AB + E, (1)

where 1 is a vector of ones of suitable length, m represents the overall mean, and A, B, and C(A) represent the factor matrices, AB the interaction matrix, and E the residual matrix. In this paper, we use the technique referred to as ASCA+ 10 to account for the study's unbalancedness. Basically, the decomposition is derived as the least squares solution of a regression problem, where X is regressed onto a coding scheme, D, based on the experimental design:

X = DΘ + E, (2)

where D is defined following Thiel et al 10 and Θ and E are obtained from:

Θ = (DᵀD)⁻¹DᵀX, E = X − DΘ. (3)

Thus, the effect matrix associated to factor/interaction F is computed as

F = D_F Θ_F, (4)

where D_F and Θ_F contain the columns of D and the rows of Θ corresponding to F. This solution minimizes the variance in E.
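To make the regression step concrete, the ASCA+ decomposition can be sketched with NumPy as follows; the function name, the term-to-column mapping, and the toy coding are illustrative assumptions on our part, not part of the original formulation:

```python
import numpy as np

def asca_plus(X, D, terms):
    """Sketch of the ASCA+ decomposition: X (N x M) is regressed onto
    the coding scheme D (N x K), and the effect matrix of each term is
    rebuilt from the columns of D and the coefficients that belong to it.
    terms: dict mapping a term name to the column indices of D."""
    Theta, *_ = np.linalg.lstsq(D, X, rcond=None)  # least squares solution
    E = X - D @ Theta                              # residual matrix
    effects = {f: D[:, idx] @ Theta[idx, :] for f, idx in terms.items()}
    return effects, Theta, E
```

By construction, the effect matrices of all terms plus E add up to X, and E is orthogonal to the columns of D, which reflects the minimum-variance property mentioned above.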
Understanding the interdependence among the factors and the interaction is relevant to derive the proper testing structure and statistic. Following Anderson and ter Braak, 7 the treatment factor and time factor have order 1; the subject factor has order 2, since it is nested in treatment; and the interaction also has order 2. The complete ordering structure is depicted in Figure 1, inspired by the use of Hasse diagrams in Marini et al. 11 In particular, this structure is useful to compute F ratios, exchangeable units, and type III sums of squares. 7,10

| PERMUTATION TESTING FOR MULTIVARIATE EXTENSIONS OF ANOVA
Permutation testing in the context of ASCA can be performed by randomly shuffling the rows of X in Equation 3, yielding a new set of regression coefficients:

Θ* = (DᵀD)⁻¹DᵀX*, (5)

where superscript * stands for permuted. Then, the permuted partition of variance of any factor/interaction F is recomputed as F* = D_F Θ*_F and the error as E* = X* − DΘ*. Equivalently, we can permute the rows in D instead of X, or simply modify the values of D, which is sometimes necessary for some forms of permutation tests in unbalanced designs (e.g., permutation of AB cells of different size as exchangeable units).
Permutation tests are carried out by comparing a given statistic computed from the partition of variance of the real data set with the corresponding statistics computed from hundreds or more permutations. P values are obtained as

p = (#{S* ≥ S} + 1) / (P + 1), (6)

where S refers to the statistic computed from the real data, S* stands for the statistics of the permutations, and P is the number of permutations. There are several choices for the statistic. The (type I) sum of squares of the factor matrix, ‖F‖²_F, was proposed as the original permutation statistic in ASCA. 5 Shortly after that, and given that ASCA is also an exploratory technique to visualize the data, typically in terms of the first two principal components (PCs), Zwanenburg et al 6 proposed testing the sum of squares of the first two PCs of the factor matrix. This is also the method used in Thiel et al. 10 More recent variants of ASCA 11,12 employ the F ratio, computed as the ratio of the mean sum of squares of the factor/interaction to that of the suitable next order factor/interaction, often the residuals. The F ratio is also the statistic used in PERMANOVA. 3,4 Finally, for the ASCA extensions to unbalanced data, 10,12 the type III sum of squares is proposed. The latter is computed from the difference between the residuals of the reduced and full models.
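As an illustration, the following sketch performs an unconstrained permutation test on the raw observations with the sum-of-squares statistic, using the common Monte Carlo estimator p = (#{S* ≥ S} + 1)/(P + 1); the names and the coding below are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def perm_test_ss(X, D, cols, P=199):
    """Unconstrained permutation test on the raw observations.
    cols: column indices of the coding matrix D for the tested term.
    Statistic: Frobenius sum of squares of the effect matrix."""
    def ss_term(Xc):
        Theta, *_ = np.linalg.lstsq(D, Xc, rcond=None)
        return np.sum((D[:, cols] @ Theta[cols, :]) ** 2)
    S = ss_term(X)                                      # observed statistic
    S_perm = [ss_term(X[rng.permutation(len(X))]) for _ in range(P)]
    return (np.sum(np.array(S_perm) >= S) + 1) / (P + 1)
```

For instance, with a two-group factor coded as a single +1/−1 column (plus intercept), a strong group difference yields a p value near the lower bound 1/(P + 1).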
Aside from the statistic under analysis, there is also a handful of variants of permutation tests. Anderson and ter Braak 7 discuss four main types of permutation tests: (i) exact tests and (ii) approximate tests by unconstrained permutation of raw observations, (iii) of residuals of the reduced model, and (iv) of residuals of the full model.

FIGURE 1 Ordering structure among factors and interaction in the longitudinal intervention study

Exact tests provide an expected type I error equal to the significance level regardless of the sample size. Anderson and ter Braak 7 provide a description of an exact test for the F ratio; please refer to that reference for a detailed explanation of this concept. For the test of a given factor/interaction, exact tests permute raw observations (Equation 5) by constraining the permutations to occur within levels of the factors/interactions of the same or higher order. As an example, and following Figure 1, to compute the exact test of factor B, we should constrain the permutations within levels of factor A, so that permutations leave unaltered the columns in the coding scheme D that correspond to A. Furthermore, in an exact test, the exchangeable units (permuted blocks) need to match the factor/interaction in the denominator of the F ratio. Continuing with the example, to compute the exact test of factor B, we should use AB cells as the exchangeable units, so that each complete set of observations belonging to the same cell is permuted together to another cell. Unfortunately, only factor B has a clear exact test following the description above, and we will show that this test is powerless. Therefore, the concept of exact test is not of relevance for this paper.
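The within-level constraint described above can be implemented by shuffling row indices separately inside each level of the constraining factor; the helper name below is our own, and the sketch assumes a single constraining factor:

```python
import numpy as np

rng = np.random.default_rng(1)

def constrained_perm_idx(levels):
    """Return a row permutation that only shuffles observations sharing
    the same level of the constraining factor, so that the corresponding
    columns of the coding scheme D are left unaltered."""
    levels = np.asarray(levels)
    idx = np.arange(len(levels))
    for lev in np.unique(levels):
        mask = levels == lev
        idx[mask] = rng.permutation(idx[mask])
    return idx
```

Applying X[constrained_perm_idx(a)] then permutes the rows of X only within levels of factor A; permuting whole AB cells as exchangeable units would require an additional block-wise step.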
Approximate tests may be asymptotically exact 7 and can be a more powerful alternative for some factors/interactions. The two approximate approaches considered in this work are unconstrained permutations of the raw observations, traditionally employed in ASCA, 5,6 and unconstrained permutations of the residuals of the reduced model, the most promising approximate test following Anderson and ter Braak. 7 In the latter, we use the residuals of the reduced model to compute the factor/interaction matrix of interest and then permute this computation. For instance, in the case of the interaction, the interaction matrix AB is computed as follows:

E_r = X − 1m − A − B − C(A), (7)
AB = D_AB Θ_AB, (8)

with Θ_AB the least squares coefficients obtained by regressing E_r onto D, and then permutations are only performed on E_r. This approach is intuitively similar to constrained permutations, in the sense that the variance of the other factors/interactions is subtracted in the first equation, and we only permute the variance of interest.
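One permutation draw of this reduced-model scheme can be sketched as follows; D_full and D_red denote codings of the full model and of the reduced model (without the interaction), and the function name is an illustrative choice of ours:

```python
import numpy as np

rng = np.random.default_rng(2)

def reduced_model_perm_stat(X, D_full, D_red, cols_AB):
    """Permute the residuals of the reduced model and recompute the
    interaction effect from them; returns its sum of squares."""
    Theta_r, *_ = np.linalg.lstsq(D_red, X, rcond=None)
    E_r = X - D_red @ Theta_r                   # reduced-model residuals
    E_perm = E_r[rng.permutation(len(E_r))]     # unconstrained shuffle
    Theta_p, *_ = np.linalg.lstsq(D_full, E_perm, rcond=None)
    AB_perm = D_full[:, cols_AB] @ Theta_p[cols_AB, :]
    return np.sum(AB_perm ** 2)
```

Repeating this for P permutations and comparing against the statistic of the unpermuted E_r gives the permutation null distribution for the interaction.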

| APPROACH FOR THE GENERATION OF POWER CURVES
To generate power curves, we need to simulate data that progressively rejects the null hypothesis of the different factors/interactions in the test. For that, we generate a set of basic matrices following a multinormal distribution of 0 mean and unit standard deviation. In the supporting information, we repeat the same computations using the multivariate simulation tool SimuleMV, 13 which allows us to simulate a multivariate data set with a certain level of correlation.* Our approach follows these steps:

0. INPUT:
   Set characteristics from the real data set:
   * S, the number of subjects of the experiment
   * T, the number of time points
   * N, the number of total observations
   * M, the number of variables
   * y, the allocation of the experimental units to levels of factors
   Analyst choices:
   * R, the number of repetitions
   * P, the number of permutations
   * δ, incremental steps in the abscissas
   * α, the imposed significance level
1. Initialize power_θ = 0 for the effect size θ from 0 to 10δ in δ steps.
2. For each repetition from 1 to R:
   2.1.–2.3. Generate the basic matrices X_s (subjects), X_t (time points), and X_r (background) from the multinormal distribution.
   2.4. Normalize the Frobenius norms of X_s, X_t, and X_r.
   2.5. For each observation n from 1 to N:
      • If the corresponding subject belongs to class "treatment" then:
        x_n^alt = x_n^s + x_n^t,
        with x_n^alt the nth row of X_alt, and x_n^s and x_n^t the corresponding rows of X_s and X_t, respectively, according to y.
      • Otherwise (class "control"):
        x_n^alt = x_n^s.
   2.6. For θ from 0 to 10δ in δ steps:
      2.6.1. Yield the simulated data: X_θ = X_r + θ · X_alt.
      2.6.2. Compute the ASCA+ partition and p value (through permutation with P permutes).
      2.6.3. If the p value is below α, do power_θ = power_θ + 1.
3. Normalize power_θ = power_θ/R for θ from 0 to 10δ in δ steps.

*Controlling the level of correlation can be useful in some cases, for example, to simulate background correlation or a treatment effect that is correlated in time. Note that, whether the basic matrices are correlated or uncorrelated, the resulting final data sets will show correlation due to the effects induced in the algorithm.

The algorithm works as follows. In step 0, we set the general characteristics of the data set under study and make a number of choices that are discussed below. In step 1, the algorithm initializes the values in the power curve, which is the output of interest. In step 2, we iterate through a number of repetitions to compute the curve. Each repetition consists of the simulation of a new data set, its factorization with ASCA+, and statistical inference through permutation testing. If the computed significance is below the imposed significance level, the power is increased by one. Finally, the power is normalized by the number of repetitions.
A power curve starts from a point in absence of effect and reaches a point in which a clearly detectable effect size is present. The crucial part of the algorithm is to simulate data sets that lack significance at θ = 0 but become clearly significant as θ increases (in all factors and interactions). For that, we generate matrices X_s, X_t, and X_r that form the structure for the subjects, the time points, and the background, respectively. In step 2.5, we define the structural data for subjects that receive a significant treatment: the average effect of the treatment is modeled by X_t, with differential information in time, while the lack of treatment in the control class is modeled by the absence of this differential information. In step 2.6.1, we control the trade-off between pure background (i.e., lack of significance) and structure.
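Under our reading of the algorithm above, the simulation loop for the time X treatment interaction can be condensed into the following Python sketch; the default sizes and the exact way the effect is injected are illustrative assumptions, and the permutation test itself is passed in as a callback:

```python
import numpy as np

rng = np.random.default_rng(3)

def power_curve(test_pvalue, S=10, T=2, M=5, R=50, delta=0.1, alpha=0.05):
    """Sketch of the power-curve algorithm for the interaction.
    test_pvalue(X) must return the permutation p value of the
    interaction for a simulated data set X."""
    subj = np.repeat(np.arange(S), T)                  # subject of each row
    treat = rng.permutation([0, 1] * (S // 2))[subj]   # class per row (from y)
    thetas = np.arange(11) * delta                     # effect sizes 0..10*delta
    power = np.zeros(len(thetas))
    for _ in range(R):
        Xs = rng.standard_normal((S, M))[subj]                      # subjects
        Xt = rng.standard_normal((T, M))[np.tile(np.arange(T), S)]  # time
        Xr = rng.standard_normal((S * T, M))                        # background
        for Y in (Xs, Xt, Xr):
            Y /= np.linalg.norm(Y)                     # normalize Frobenius norms
        Xalt = Xs + np.where(treat[:, None] == 1, Xt, 0.0)  # step 2.5
        for i, th in enumerate(thetas):
            X = Xr + th * Xalt                 # background/structure trade-off
            if test_pvalue(X) < alpha:         # steps 2.6.2-2.6.3
                power[i] += 1
    return thetas, power / R                   # step 3
```

In practice, any of the ASCA+ permutation tests of Section 3 would be plugged in as test_pvalue.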
We can use the previous algorithm to simulate data with the same structure (inputs S, T, N, M, and y) as the real data at hand, to compute power curves, and to infer the best permutation test approach tailored to the data characteristics. Our expectation is that any test performing adequately will show a power curve with a shape similar to that of the optimal test in Figure 2: it starts at the imposed significance level and goes as quickly as possible to 1 for a small effect size. The figure also depicts an overly optimistic test, with power above the imposed significance level at θ = 0, as well as a test that is less powerful than the optimal one. The simulation does not provide the exact shape of the power curve of the optimal test; that is, we do not know a priori the optimal parameter δ, yet the approach is useful to identify overoptimistic tests and to compare the power of a set of permutation approaches.

| DATA SETS
The data sets considered in this work were recently gathered in a precision medicine study of breast cancer using metabolomics. In the treatment of the disease, surgery after successive combinations of drugs has been considered the gold standard for assessing tumor response. 14,15 Neoadjuvant chemotherapy (NACT) is the systemic presurgery therapy. 16,17 The improvement of breast-conserving surgery (BCS) rates and the opportunity to search for prognostic or predictive response biomarkers are some of the advantages that NACT has had in clinical practice. 16,18,19 However, not all breast cancer patients benefit from NACT, and it is critical to differentiate between those patients that will respond positively and those who will not, in order to choose alternative and more effective therapies. Liquid chromatography coupled with high resolution mass spectrometry (LC-MS) is studied as a fast and non-invasive approach to allow this differentiation. A total of 92 breast cancer patients at the Medical Oncology Unit of the University Hospital of Jaén (Spain) were included in the study in order to detect metabolomic changes associated with NACT efficiency. Forty-eight patients diagnosed as HER2− and ER+ with Ki67 > 15% were defined as the Luminal B (LB) group. Twenty-one patients who neither expressed hormone receptors (PR− and ER−) nor overexpressed human epidermal growth factor 2 (HER2−) were considered triple negative breast cancer patients (TN). The rest of the patients were not considered in this work. Samples for metabolomics analysis were taken at three time points: before the first therapy with anthracyclines (basal, t1), once the patients received the treatment with taxol (pre-surgery, t2), and after the breast conserving surgery (post-surgery, t3). All the samples were analyzed using liquid chromatography (Agilent 1290 series) coupled to high resolution mass spectrometry (TTOF5600, SCIEX), in different batches for TN and LB.
After quality control, LB and TN data sets included 117 and 112 variables, respectively.
Samples obtained during surgery underwent a pathological analysis in order to determine the Miller and Payne (MP) grade post-surgery. 20 Treatment response was divided into two groups according to the tumor reduction percentage: response and non-response. In the LB molecular subtype, 26 out of 48 patients had a response (54.17%), while 22 did not respond after treatment (45.83%). The TN phenotype showed 13 out of 21 patients with a response to the NACT (61.9%) and 8 patients with no response (38.1%).
To compare the results of the permutation tests between balanced and unbalanced data, we artificially created a balanced TN data set by discarding 5 randomly selected patients from the response group. In the following section, we describe the results in separate sections for the different terms (factors and interaction) of the longitudinal intervention study. Given that most permutation approaches can be performed using different models which consider all or a subset of factors/interactions, we evaluated tests over what we call the tree model, which includes all terms in Figure 1; the branch model, which includes the terms in a given branch of Figure 1; and the minimum model, which includes the minimum terms possible. Furthermore, since some approaches use a full model, others use a reduced model, and others use both, we describe each considered approach in the tables below. For the sake of simplicity, we do not show results for all potential variants of permutation tests but only those that we consider potentially interesting.

| Interaction time X treatment
As already discussed, a principal goal of a longitudinal intervention study is to assess the significance of the interaction "time X treatment." Statistical significance of the interaction can be interpreted as a temporal response associated to the treatment under study, such as a positive reaction in the patient's condition due to the administration of a medicine. Thus, the adequate evaluation of the significance of the interaction is central in the analysis of the results of the study, as the assessment of the interaction could be the reason why the intervention study was carried out in the first place.
Following the data generation procedure discussed in Section 4, we used the data size and labeling of the TN dataset, the LB dataset, and the balanced TN dataset to generate and compare the power curves associated to different significance tests based on permutations. The results are displayed in Figure 3, and the different permutation approaches are described in Table 1. Given that the interaction cannot be constrained (see the discussion in Anderson and ter Braak 7 on the derivation of exact tests), we assess the difference among unconstrained permutations computed: (i) on the residuals of the reduced model and on the raw observations; (ii) on the tree and branch models; and (iii) using the type I sum of squares of the interaction matrix, its projection onto the first two PCs, the F ratio between the mean sum of squares of the interaction and the residuals, and the type III sum of squares.
The comparison among power curves is fairly consistent between the two data sets (TN and LB) and the unbalanced and balanced versions of TN. The "Res TreeRM" method, which corresponds to the permutation of residuals of the tree reduced model (X = m + A + B + C(A)), leads to an extremely optimistic test. As a matter of fact, it was this finding that originally motivated the research in this paper: we found statistical significance for the interaction in the real data sets TN and LB, an incorrect result due to the use of the misleading "Res TreeRM" test. A power curve starting around 1 means that roughly all data sets with no real significance in the interaction are incorrectly classified as significant. This would lead to the conclusion that a disease treatment is effective when it is actually not. The type III sum of squares version of the same test ("Res TreeRM III") shares the same problem. However, if we use the branch reduced model instead (X = m + A + B, "Res BranchRM"), which does not model the subject variability, or if the F ratio is used ("Res TreeRM F"), the test provides an adequate power curve. All tests based on the raw observations ("Obs *")† performed adequately and with small differences. Among those, the tests based on the tree full model and the F ratio yielded the best performance, but results depend on the specific data set.
It can be useful to provide a more detailed comparison of test performance at null effect size, in order to check whether some of the tests that seem to behave correctly present mild biases with respect to the imposed significance level. For that, we can take into account the variability of the results, for example, through confidence intervals, box plots, or significant difference plots computed through multiple testing corrections. To assess the variability of the results, we need replicates of a power curve for the same conditions: same data characteristics, same permutation test, and same factor/interaction. We can obtain these replicates by repeating several runs of the algorithm in Section 4, but this is generally time consuming.‡ Another possibility, which was our choice, is to use bootstrapping from a single run. This approach led to Figure 4, in which we compare the performance of a selected set of permutation tests for the TN dataset in the absence of effect (θ = 0). The figure shows that most tests are slightly below the imposed significance level on average at the baseline, but a non-negligible number of outcomes (for some tests around 50%) are above this level. In general, care should be taken in real experiments when drawing conclusions from p values close to the imposed significance level.

†Symbol * should be understood as any test whose label includes the text that precedes it; in this case, any test that includes the word "Obs" in its label.
‡A single run of our algorithm to compute, say, Figure 3A takes approximately half an hour on a regular computer. If we wanted to obtain 100 replicates, this would mean roughly two days of computation, which in any case can be sped up through parallelization.
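The bootstrap step can be sketched as resampling, with replacement, the per-repetition accept/reject outcomes of a single run and recomputing the power for each resample (the function name is ours):

```python
import numpy as np

rng = np.random.default_rng(4)

def bootstrap_power(rejections, B=1000):
    """Bootstrap replicates of the empirical power from a single run.
    rejections: 0/1 vector of test decisions over the R repetitions at a
    given effect size. Returns one power estimate per bootstrap sample."""
    rejections = np.asarray(rejections, dtype=float)
    idx = rng.integers(0, len(rejections), size=(B, len(rejections)))
    return rejections[idx].mean(axis=1)
```

Percentiles of the returned vector give the confidence intervals or box plots at the null effect size, as in Figure 4.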
The results of this section show that the identification of a correct test can be unexpectedly complex and that the risk of deriving incorrect tests and conclusions is not negligible. Note that the approach of this paper, based on the derivation of power curves from the size and labeling of a data set using simulated data, was key to identify the best testing approaches (any of the tests in Figure 4 can probably do the job with similar performance) and, more importantly, to identify which tests are misleading. In the supporting information, we show the corresponding results with SimuleMV as the random number generator, which generally agree with the ones presented here, except perhaps for the results at the baseline, which present slightly optimistic outcomes.

FIGURE 3 Power curves for permutation tests in ASCA and in the interaction between time and treatment (AB) of repeated measures studies, using the labeling scheme of the TN dataset (A), the LB dataset (B), and the balanced TN dataset (C). The testing approaches are described in Table 1

TABLE 1 Testing approaches for the permutation tests in ASCA and in the interaction between time and treatment (AB). Columns: Test, Permuted on, Model, Constraints, Exchangeable units, Statistic. Note: SS_AB* stands for the sum of squares of the projection onto the first two PCs of the factor matrix.

| Subject factor
The test for the subject factor is often the least relevant in the study, but the factor itself should be adequately modeled to capture inter-subject differences and prevent these from confounding the effect of the other factors. The power curve results are displayed in Figure 5, and the different permutation approaches are described in Table 2. We assess the difference among permutations computed: (i) on the raw observations constrained within levels of the higher order factors A and B, and unconstrained on the residuals of the reduced model and on the raw observations; (ii) on the tree and branch models; and (iii) using the type I sum of squares of the factor, its projection onto the first two PCs, the F ratio between the mean sum of squares of the factor and the residuals, and the type III sum of squares.
The approaches based on the residuals of the reduced model ("Res *") are either overly optimistic or of reduced power. The approaches based on unconstrained permutations ("Obs *" without constraints k) are also suboptimal or even powerless (see also results in Figure S3), with the exception of the one that employs the F ratio and the tree model ("Obs TreeFM F"). The latter and all variants that constrain the permutations within the levels of both the treatment and the time ("Obs k(AB) *") seem to work adequately. Among those, we do not find significant gains by using the type III sum of squares or the F ratio. Constraints only within the treatment factor ("Obs k(A) *") yield less powerful outcomes and can even be powerless (see Figure S3). If we look at the results at the baseline (θ = 0) of a selected set of permutation tests for the TN dataset (Figure 6), we find that methods are slightly optimistic on average and that the variability is wide, which again suggests that statistical significance should be determined carefully in real experiments near the significance threshold. Any of the selected tests seems suitable for the specific case of TN.

| Time factor
The time factor test is relevant to identify time-changing responses in the multivariate outcome that are consistent in both the treatment and control cohorts. Regardless of whether the time X treatment interaction is significant, we should study the significance of time, since the multivariate response data may be partially significant for both time and the interaction due to differential behavior in the multiple variables.

FIGURE 5 Power curves for permutation tests in ASCA and in the subject factor (C(A)) of repeated measures studies, using the labeling scheme of the TN dataset (A), the LB dataset (B), and the balanced TN dataset (C). The testing approaches are described in Table 2

The power curve results are displayed in Figure 7, and the different permutation approaches are described in Table 3. We assess the difference among permutations computed: (i) on the raw observations constrained within A or C(A), and unconstrained on the residuals of the reduced model and on the raw observations; (ii) on the tree, branch, and minimum models (X = m + A + B); and (iii) using the type I sum of squares of the factor matrix, its projection onto the first two PCs, and the F ratio between the mean sum of squares of the factor and of the interaction. We do not show results of type III sum of squares since, as in the previous analysis, we do not find significant improvements when using them instead of type I sum of squares.

FIGURE 6 Box plots for the power for permutation tests in ASCA and in the subject factor (C(A)) of repeated measures studies, using the labeling scheme of the TN dataset, and for a null effect size

FIGURE 7 Power curves for permutation tests in ASCA and in the time factor (B) of repeated measures studies, using the labeling scheme of the TN dataset (A), the LB dataset (B), and the balanced TN dataset (C). The testing approaches are described in Table 3
Finally, we also investigate whether it would be convenient to use the AB cells as exchangeable units instead of individual observations. We found several unexpected results in this factor, which again support the use of the methodology of the paper. First, contrary to what has been suggested for univariate responses, 7 using the suitable exchangeable units (the AB cells) rather than individual observations is not only unnecessary but counter-productive in this example, since the tests that permute AB cells ("eu(AB) *") were powerless. This may be a consequence of the limited number of possible permutations. 7 Moreover, we found that the permutation test on the residuals of the tree reduced model ("Res TreeRM") is overly optimistic, again an unexpected outcome that can lead to misleading results in practical problems. Unconstrained permutations ("Obs *" without constraints k) perform as expected, but with less power than the best approaches, with the notable exception (again) of the one that employs the F ratio and the tree model ("Obs TreeFM F"). The best approaches were generally the residuals on the minimum reduced model (X = m + A, "Res minRM"), with the downside of being slightly optimistic at the baseline (see Figure 8), and the permutations of raw observations on the minimum full model constrained within the individual ("Obs minFM k(C(A))"), which would be the method of choice for the TN dataset but not necessarily for the other two datasets. The latter is an interesting result given that, in principle, 7 the time factor is higher in order than C(A) and permutations are usually constrained within levels of factors of higher or equal order. At θ = 0, most tests tended to be slightly optimistic, and therefore, care should be taken not to over-interpret results close to the significance level.

FIGURE 8 Box plots for the power for permutation tests in ASCA and in the time factor (B) of repeated measures studies, using the labeling scheme of the TN dataset, and for a null effect size

| Treatment factor
The test for the treatment factor is relevant to identify sustained differences between the treatment and control cohorts. For the same reason as before, regardless of whether the time X treatment interaction is significant, we should study the significance of treatment.
The power curve results are displayed in Figure 9, and the different permutation approaches are described in Table 4. We assess the difference among permutations computed: (i) on the raw observations constrained within factor B and using different exchangeable units, on the residuals of the reduced model, and unconstrained on the raw observations; (ii) on the tree, branch, and minimum models (X = m + A); and (iii) using the type I sum of squares of the factor matrix, its projection onto the first two PCs, and the F ratio between the mean sum of squares of the treatment factor and that of the interaction or the individual. Again, we do not show results of type III sum of squares since we do not find any improvement when using them.
We found that tests based on the residuals ("Res *") do not generally work well for this factor and the TN data set, either because they are optimistic at the baseline or because they are powerless. However, for LB they do perform well and may be the option of choice. This differential behavior of a test when applied to different data sets can be a problem when analyzing real data. Again, the approach of this paper, which characterizes the behavior of the test under the specificity of the real data set, can be useful to avoid risks. Constrained tests ("Obs k(B) *") gave good results only with the subject as exchangeable unit ("eu(C(A))"). Unconstrained permutations on raw data ("Obs *" without constraints k) were generally more powerful and would be the choice for TN. Among those, the one that employs the F ratio and the tree model ("Obs TreeFM F") again shows a more powerful curve, but it is slightly optimistic at the baseline (Figure 10).

FIGURE 9 Power curves for permutation tests in ASCA and in the treatment factor (A) of repeated measures studies, using the labeling scheme of the TN dataset (A), the LB dataset (B), and the balanced TN dataset (C). The testing approaches are described in Table 4

| CONCLUSION
In this paper, we demonstrate that the adequate choice of the permutation approach for significance testing in ANOVA Simultaneous Component Analysis can be challenging, especially for data coming from a complex design, in particular a longitudinal intervention study. Some permutation approaches can lead to overly optimistic results, which can cause analysts to wrongly conclude that a factor or interaction is significant and which can put the whole interpretation of the study at risk. We can also have the opposite problem, that is, a test that is powerless and therefore not useful to find significant factors. We provide a simulation methodology to compute power curves tailored to the characteristics of a real data set under analysis, which are useful before performing inference on the real data so that the analyst can be sure that the permutation strategy of choice is suitable and safe. We applied this approach to two real data sets from a precision medicine clinical trial in cancer using metabolomics. Our results show that significant performance differences exist among the overwhelming number of potential tests applicable to each single factor or interaction, in terms of power and/or accuracy in the approximation to the imposed significance level. These differences do not seem to follow a general trend, with the notable exception that most tests tend to be slightly optimistic; therefore, care should be taken not to over-interpret p values close to the imposed significance level. We also found generally good performance of the traditional method used in ASCA (unconstrained permutations on the raw observations), provided that the performance index to permute is the F ratio rather than the more popular sum of squares.

FIGURE 10. Box plots of the power for permutation tests in ASCA for the treatment factor (A) of repeated measures studies, using the labeling scheme of the TN dataset, for a null effect size.
However, we cannot be sure whether these two trends will hold in other data sets or experimental designs. Furthermore, we cannot provide any other general rule for practitioners, since the observed results depend on the specific factor/interaction to test and on the particularities of the data set, such as its size and whether the study is balanced or unbalanced.
As a principal conclusion, we believe it is hopeless in practice to forecast the optimal test for a factor/interaction without a numerical approach like the one proposed in this paper. Rather, we believe that such an approach should become common practice when using permutation testing in complex multivariate study designs for ASCA. To simplify its application, we have made our code available to the community (at https://github.com/josecamachop/PowerCurvesASCA). Our results may also be extended to other analysis tools such as PERMANOVA.
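The power-curve methodology itself can be sketched in a few lines: simulate data sets at increasing effect sizes, apply the chosen permutation test, and record the rejection rate at the nominal level. The self-contained Python toy below (one treatment factor, normal noise, illustrative names; not the code released at the repository above) shows the idea: a well-behaved test should sit near alpha at zero effect and climb steeply afterwards, whereas values above alpha at zero effect flag an optimistic test and a flat curve flags a powerless one.

```python
import numpy as np

def pseudo_f(Y, treat):
    # Pseudo-F for a single treatment factor: between-group vs
    # within-group mean sums of squares, summed over all variables
    Yc = Y - Y.mean(axis=0)
    levels = np.unique(treat)
    A = np.zeros_like(Yc)
    for lvl in levels:
        A[treat == lvl] = Yc[treat == lvl].mean(axis=0)
    E = Yc - A
    df_a, df_e = len(levels) - 1, len(Y) - len(levels)
    return (np.sum(A**2) / df_a) / (np.sum(E**2) / df_e)

def perm_pvalue(Y, treat, n_perm=199, rng=None):
    # Unconstrained permutation p value for the pseudo-F
    if rng is None:
        rng = np.random.default_rng()
    f_obs = pseudo_f(Y, treat)
    hits = 1
    for _ in range(n_perm):
        if pseudo_f(Y[rng.permutation(len(Y))], treat) >= f_obs:
            hits += 1
    return hits / (n_perm + 1)

def power_curve(effect_sizes, n=20, p=5, n_rep=50, alpha=0.05, seed=1):
    # Empirical power: fraction of simulated data sets rejected at
    # level alpha, for each imposed effect size
    rng = np.random.default_rng(seed)
    treat = np.repeat([0, 1], n // 2)
    curve = []
    for eff in effect_sizes:
        rej = 0
        for _ in range(n_rep):
            Y = rng.standard_normal((n, p))
            Y[treat == 1] += eff          # inject the treatment effect
            if perm_pvalue(Y, treat, rng=rng) < alpha:
                rej += 1
        curve.append(rej / n_rep)
    return curve
```

The paper's methodology differs in that the simulated data sets mimic the size, (un)balancedness, and labeling scheme of the real data set under analysis, rather than an idealized normal model.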

ACKNOWLEDGMENTS
This work is partly supported by the Agencia Andaluza del Conocimiento, Regional Government of Andalucía, in Spain, and ERDF (European Regional Development Fund) funds through project B-TIC-136-UGR20. We acknowledge the comments by anonymous reviewers, which improved the quality of the final manuscript.

PEER REVIEW
The peer review history for this article is available at https://publons.com/publon/10.1002/cem.3398.

DATA AVAILABILITY STATEMENT
Data sharing is not applicable: no new data were generated. The scripts for the generation of the simulated data and the power curves are provided.