Testing for no effect in regression problems: a permutation approach

Often the question arises whether $Y$ can be predicted based on $X$ using a certain model. Especially for highly flexible models such as neural networks one may ask whether a seemingly good prediction is actually better than fitting pure noise or whether it has to be attributed to the flexibility of the model. This paper proposes a rigorous permutation test to assess whether the prediction is better than the prediction of pure noise. The test avoids any sample splitting and is based instead on generating new pairings of $(X_i,Y_j)$. It introduces a new formulation of the null hypothesis and rigorous justification for the test, which distinguishes it from previous literature. The theoretical findings are applied both to simulated data and to sensor data of tennis serves in an experimental context. The simulation study underscores how the available information affects the test. It shows that the less informative the predictors, the lower the probability of rejecting the null hypothesis of fitting pure noise and emphasizes that detecting weaker dependence between variables requires a sufficient sample size.


| INTRODUCTION
With data now ubiquitous, the question often arises whether a response $Y$ can be predicted from predictors $X$.
The rise of highly capable machine learning and deep learning techniques increases the ability to fit any kind of data.
However, the ability to fit pure noise increases as well. We propose a method to test whether a model is only fitting noise. It extends testing for no effect from linear to nonlinear models. No sample splitting is performed, so the power of the test can draw on the size of the whole sample. No nested sequence of models is needed; in fact, no alternative models are needed at all.
Our method is based on recombining the pairings between predictors and responses through permutations. In this way artificial reference datasets are created, and the performance of the model on the original data can then be assessed by comparing it to its performance on the artificial reference datasets. The purpose of our test is to ascertain whether the model is capable of fitting the data more effectively than mere random noise. Our method is not restricted to linear models, since it is not a test for specific parameters in the model. Rather, it tests for the ability of a model to predict the responses.
The main contribution of this paper is a rigorous formulation of a permutation test for dependence between model predictions and responses. The test uses $R^2$ as the test statistic but can be performed with any measure of goodness of fit in regression analysis. Because of its interpretability, $R^2$ is our test statistic of choice, but this can be adapted if necessary. The method generates new pairings $(X_i, Y_j)$ conditional on the $X_i$ for $i = 1, \dots, n$ and the $Y_j$ for $j = 1, \dots, n$.
This paper introduces a new formulation of the null hypothesis and provides a rigorous justification for a permutation test that has been described in various forms in the literature, for instance in the two-sample problem (Good (2002); Commenges (2003); Huang et al. (2006); Hutson and Wilding (2012)), the stochastic dominance problem (Arboretti Giancristofaro and Bonnini (2008, 2009)) or the subgroup discovery problem (Duivesteijn and Knobbe (2011)). The main use case for this method is in the initial stages of the data analysis, to test whether a given model fits only noise or is able to capture some essential structure in the data.
The outline of the paper is as follows. Section 2.1 formulates the problem and introduces the necessary notation. Section 2.2 contains the formal formulation of the null hypothesis, theoretical considerations and a succinct description of the method. Section 3.1 applies the permutation test to sensor data of tennis serves in order to demonstrate the method in practice and showcase its power in a real-life scenario. Section A presents a simulation study, where the permutation test is demonstrated in various scenarios for predictors and responses.

| Problem description
Consider a regression setting. We are given an observed pair $(X, Y)$, where $X$ is a random vector and $Y$ is a real random variable. $Y$ is modelled as
$$Y = f(X) + \epsilon,$$
where $\epsilon$ is a centered random variable independent of $X$ and $f \in \mathcal{F}$ for some class of functions $\mathcal{F}$. An example of $\mathcal{F}$ could be the set of all linear functions corresponding to a linear regression model with a fixed number of variables, or the set of functions that can be described by a neural network. Nonparametric classes of functions can also be considered, for instance the set of log-concave functions. For the remainder of the paper, we will focus on $R^2$ as the goodness-of-fit measure.
Since the actual relationship between $X$ and $Y$ is not known in practice, the chosen class of functions $\mathcal{F}$ through which that relationship is described need not be appropriate. The class of functions $\mathcal{F}$ is misspecified if it does not contain the true $f$, while if it contains too many functions, the model might overfit by memorizing the noise $\epsilon$. In real-world scenarios, we often face datasets that feature high-dimensional, time-dependent or functional variables. The question of whether there is a relationship between $X$ and $Y$, and which model to choose for describing it, is crucial. In this paper, we focus on the following aspect:
• Can a given class of functions $\mathcal{F}$ distinguish $Y$ from pure noise?
Consider this simple example. Let $X_1, X_2$ be independent standard normal variables and $Y = X_1^2 + X_2^2 + \epsilon$, where $\epsilon \sim N(0, 0.01)$. Consider a multi-layered neural net as the model of choice to predict $Y$ using $X_1$ and $X_2$. For small sample sizes, shuffling the vector of responses and applying our prediction model to this shuffled dataset can yield values of $R^2$ higher than the $R^2$ calculated for the prediction model applied to the original dataset. Ten random samples of size 10 were drawn. Five yielded a higher value of $R^2$ for at least one shuffled dataset than for the original pairing (we considered 200 shuffles of the sample). In applied settings, where the sample size is fixed and difficult to increase, this presents an inherent issue. Sample size has an immediate influence on the credibility of the model and needs to be taken into account.
Related problems have been addressed before in the literature in different settings and with a variety of solutions.
In this paper we focus on the permutation test.

| Permutation approach to testing for no effect
The main appeal of permutation tests stems from the fact that they do not require any distributional assumptions on the population. This lack of assumptions is increasingly interesting to researchers as deep learning methods become more popular, since these likewise do not rely on distributional assumptions. Permutation tests are completely data-driven, as pointed out by Berry et al. (2002). This can be very appealing, as the data is the main factor in shaping the distribution of the test statistic, i.e. the test statistic can be chosen to be more easily interpretable without focusing on its distribution.
Our work differs from previous works mostly in the formulation of the null hypothesis. In the literature different null hypotheses exist. Some involve the concept of exchangeability, e.g. Romano (1990); Schmoyer (1994); Good (2002); Commenges (2003); Huang et al. (2006); Hutson and Wilding (2012). Some involve equality of means, e.g. Zhang (2009), and some involve zeroing of the coefficients, e.g. Cardot et al. (2004). In contrast to this, we focus on the concept of independence, which is not widely used for permutation tests. Permutation tests of independence have existed before, e.g. see Bell and Doksum (1967). However, we do not test independence of two random variables $X$ and $Y$; rather, we state the null hypothesis in terms of the model and whether it is able to capture the dependence.
The choice of the null hypothesis can also be directly connected to the model considered in the problem. For instance, it is natural to use zeroing of the functional coefficient as the null hypothesis when considering a functional linear regression model, e.g. see Cardot et al. (2004). We do not restrict ourselves to any particular model in our work; we only consider the model as given, and the null hypothesis is not specifically tailored to the model.
Our goal is to investigate whether a class of functions $\mathcal{F}$ can capture any dependence between $X$ and $Y$. We consider a test with the null hypothesis stated as follows:
$$H_0: \ Y \text{ and } (f(X))_{f \in \mathcal{F}} \text{ are independent}. \quad (1)$$
$H_0$ represents the problem as described in section 2.1. If it is true, then our class of functions $\mathcal{F}$ will not be able to capture the relation between $X$ and $Y$ in a meaningful manner: under $H_0$, a dataset with permuted responses looks no different to the class of functions $\mathcal{F}$. If $H_0$ is false, then the class of functions $\mathcal{F}$ will be able to capture some aspects of the relation between $X$ and $Y$, although this does not guarantee that the model is suitable and readily applicable.
The null hypothesis $H_0$ as stated in (1) does not guide the choice of the test statistic. In order to choose a suitable test statistic, further understanding of $H_0$ is needed.
given the empirical measure $P_n^X$ of $X_1, \dots, X_n$ and the empirical measure $P_n^Y$ of $Y_1, \dots, Y_n$ is the same for all permutations $\tau$ of the set $\{1, \dots, n\}$.
Proof Let $i \in \{1, \dots, n\}$ be fixed. For a given finite collection of functions $f_1, f_2, \dots, f_m \in \mathcal{F}$ and a permutation $\tau$, the conditional joint distribution of $(f_1(X_i), Y_{\tau(i)}), \dots, (f_m(X_i), Y_{\tau(i)})$ given $P_n^X$ and $P_n^Y$ is the same as the joint distribution (3), thanks to the assumption of independence of $(f(X))_{f \in \mathcal{F}}$ and $Y$. Note that (3) is invariant with respect to permutations of the $Y_i$. This statement remains true when extended to the joint distribution of (2), thanks to the Kolmogorov extension theorem (Kolmogorov (1956)); hence the joint conditional distribution of (2) given $P_n^X$ and $P_n^Y$ is invariant with respect to permutations of the $Y_i$.
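Before Proposition 1 is translated into a result in terms of $R^2$, we formally define $R^2$. Consider $n$ realizations of $(X, Y)$ and denote them as $(x_1, y_1), \dots, (x_n, y_n)$. Let $L$ be a loss function and $\hat{f}$ be an empirical risk estimator in the sense that
$$\hat{f} \in \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{n} L(f(x_i), y_i). \quad (4)$$
Then $R^2$ is defined as
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \quad (5)$$
where $\bar{y}$ is the mean of the $y_i$. This definition of $R^2$ is the natural one if the loss function $L$ in equation (4) is chosen to be the squared error loss. In the context of $R^2$, Proposition 1 implies the following result.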
a permutation $\tau$ of $\{1, \dots, n\}$ and a loss function $L$ defining an empirical risk estimator as in (4). Then, conditionally on the empirical measure $P_n^X$ of $X_1, \dots, X_n$ and the empirical measure $P_n^Y$ of $Y_1, \dots, Y_n$, the distribution of $R^2$ calculated based on the data $\{(X_i, Y_{\tau(i)})\}$ using the aforementioned empirical risk estimator does not depend on $\tau$.
Proof Proposition 1 implies that the conditional distribution of given $P_n^X$ and $P_n^Y$ is the same for all permutations $\tau$ of the set $\{1, \dots, n\}$. This is a two-dimensional empirical process indexed by the class of functions $\mathcal{F}$. Plugging the arg min of the second component into the first component still gives a distribution that does not depend on $\tau$. Hence, combining the definition (4) of $\hat{f}$ and (5) of $R^2$, we conclude that for each permutation $\tau$, the $R^2$ calculated for $\{(X_i, Y_{\tau(i)})\}$ is sampled from the same distribution conditioned on $P_n^X$ and $P_n^Y$.
This allows us to consider $R^2$ as a viable choice for the test statistic. Under the null hypothesis, the $R^2$ calculated for $(x_i, y_i)$ is sampled from the same distribution as the $R^2$ calculated for $(x_i, y_{\tau(i)})$ for some permutation $\tau$. The test itself is based on permutations of the pairings $(x_i, y_i)$. We reject $H_0$ only if the observed $R^2$ is much larger than "most" of the $R^2$ values obtained via random permutations. Essentially, we compare the observed $R^2$ to the distribution of $R^2$ under $H_0$ given the specific realizations of $X$ and $Y$, but not their pairings. It is notable that $R^2$ can also be replaced by some other statistic, as long as it can be calculated using the sample $\{(f(x_i), y_i)\}_i$: Proposition 1 permits other statistics to be used instead of $R^2$. Taking $R^2$ as the test statistic is equivalent to taking the empirical risk with respect to quadratic loss as the test statistic. In that sense, other tests can also be constructed by considering empirical risks with respect to other losses, e.g. absolute loss or Huber loss.
If the class of functions $\mathcal{F}$ contains the constant functions and the predictors are optimized with respect to the quadratic loss, then the $R^2$ calculated for a given $\mathcal{F}$ is always non-negative. This is true since, given a set of observations $y_i$,
$$\arg\min_{c \in \mathbb{R}} \sum_{i=1}^{n} (y_i - c)^2 = \bar{y}, \quad (6)$$
which yields $R^2 = 0$, and the minimizer over all of $\mathcal{F}$ can only do at least as well. Note that including the constants in $\mathcal{F}$ does not disturb the independence of $Y$ and $(f(X))_{f \in \mathcal{F}}$, since $Y$ is always independent of a set of constant random variables. While $R^2$ is always non-negative in linear regression models (if the intercept is included), that is not the case, for instance, in the setting of neural nets.
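To make the definition of $R^2$ and the remark on constant functions concrete, the following is a minimal sketch. It assumes squared error loss and takes $\mathcal{F}$ to be the affine functions of the predictors, so that the empirical risk estimator is the ordinary least squares fit. The helper names `fit_least_squares` and `r_squared` and the simulated data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def fit_least_squares(x, y):
    """Empirical risk minimizer for squared loss over affine functions of x."""
    design = np.column_stack([np.ones(len(y)), x])      # intercept + predictors
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)

    def predict(x_new):
        x_new = np.asarray(x_new)
        return np.column_stack([np.ones(x_new.shape[0]), x_new]) @ beta

    return predict

def r_squared(y, y_pred):
    """R^2 = 1 - sum (y_i - f(x_i))^2 / sum (y_i - y_bar)^2."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# Illustrative data (an assumption): one predictor with a linear signal.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)

f_hat = fit_least_squares(x, y)
r2 = r_squared(y, f_hat(x))                              # fitted model
r2_const = r_squared(y, np.full_like(y, np.mean(y)))     # constant predictor y_bar
print(f"R^2 of least squares fit: {r2:.3f}, R^2 of constant fit: {r2_const:.3f}")
```

Because the constant function $\bar{y}$ belongs to the affine class, the least squares fit cannot have a larger residual sum of squares than the constant fit, which is why `r2` is at least `r2_const` here.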
Given a chosen level $\alpha$ (default $\alpha = 0.05$), the precise implementation of the test is as follows:
1. Given the original pairings $(x_i, y_i)$, calculate the $R^2$ of the class of functions $\mathcal{F}$, which we denote by $r_0^2$ (the specific method of prediction of $\hat{Y}_i$ is stated in 4).
2. Find the distribution of $R^2$ under the null hypothesis conditionally on the observed $x_i$ and $y_{(i)}$ for $i = 1, \dots, n$. This distribution is approximated by the empirical distribution function of $R^2$ values based on a uniform sample of permutations of the original pairings $(x_i, y_i)$: for each sample $\{(x_i, y_{\tau(i)}) : 1 \le i \le n\}$, where $\tau$ is a permutation, $R^2$ is calculated. Notably, the model is refit for each permuted sample.
3. If $r_0^2 > q_{1-\alpha}$, where $q_{1-\alpha}$ is the $1 - \alpha$ quantile of the empirical distribution of $R^2$ values, then we reject the null hypothesis; otherwise we do not reject it.
Any tuning parameters used in steps 1 and 2 are not adjusted for each permutation. This implementation assumes that $R^2$ is the statistic of choice, but it can be adapted to suit other statistics as well. We prioritize $R^2$ primarily because of its interpretability and ease of use. It is also important to note that in practice the distribution of $R^2$ under the null hypothesis will not be determined exactly in most cases. To obtain the exact distribution we would need to run through all $n!$ permutations. Even for $n > 10$ the computational cost of such an operation is prohibitive, and sampling from the true distribution is more reasonable.
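The three steps above can be sketched in a few lines of code. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the model is a least squares linear regression with intercept (standing in for a generic class $\mathcal{F}$), the quantile is computed with NumPy's default interpolation, and the names `r2_linear` and `permutation_test` as well as the simulated data are hypothetical. Any other routine that refits the model and returns a goodness-of-fit value could be passed as `fit_and_score`.

```python
import numpy as np

def r2_linear(x, y):
    """Refit a least squares linear model with intercept; return in-sample R^2."""
    design = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - np.sum(resid ** 2) / np.sum((y - np.mean(y)) ** 2)

def permutation_test(x, y, fit_and_score=r2_linear, n_perm=200, alpha=0.05, seed=0):
    """Return (observed R^2, empirical (1 - alpha) quantile, reject H0?)."""
    rng = np.random.default_rng(seed)
    # Step 1: statistic for the original pairings.
    r2_obs = fit_and_score(x, y)
    # Step 2: refit on uniformly sampled permutations of the responses.
    r2_perm = np.array([fit_and_score(x, rng.permutation(y)) for _ in range(n_perm)])
    # Step 3: compare with the (1 - alpha) quantile of the permuted values.
    q = np.quantile(r2_perm, 1.0 - alpha)
    return r2_obs, q, bool(r2_obs > q)

# Illustrative data (an assumption): Y depends linearly on the first predictor.
rng = np.random.default_rng(1)
x = rng.normal(size=(100, 2))
y = x[:, 0] + rng.normal(size=100)

r2_obs, q, reject = permutation_test(x, y)
print(f"observed R^2 = {r2_obs:.3f}, quantile = {q:.3f}, reject H0: {reject}")
```

Passing the scoring routine as an argument keeps the permutation logic independent of the model, which mirrors the fact that the test is not tailored to a specific class of functions.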
FIGURE 1 Histogram of the distribution of generated $R^2$ using permutation of $y$ values. The model considered here is a 3-layered neural net. The sample size is 10. The red line denotes the observed $R^2$ for the true pairings of $X$ and $Y$, the green line denotes the 95%-quantile of the empirical distribution of $R^2$ (approximation using 200 permutations).
The closer the values of $R^2$ for the permuted data are to 1, the greater the capability of the model to fit the noise. Close proximity of $q_{1-\alpha}$ to $r_0^2$ in the case of $r_0^2 > q_{1-\alpha}$ and small $r_0^2$ implies that the model's predictive ability may not be satisfactory even though the null hypothesis is rejected by the test. Because of its general form, the test is widely applicable and easily adapted to different types of models. It also provides an interesting commentary on the predictive abilities of a chosen model.
In the event that the quantile $q_{1-\alpha}$ for one model, $\mathcal{F}_1$, significantly exceeds the same quantile for another model, $\mathcal{F}_2$, we conclude that $\mathcal{F}_1$ is either overfitting, indicating a need for reduction of the set of independent variables or model simplification, or is better able to extract meaningful information from unrelated data.
We close this section with a continuation of the example from section 2.1. Let $X_1, X_2$ be independent standard normal variables and $Y = X_1^2 + X_2^2 + \epsilon$, where $\epsilon \sim N(0, 0.01)$. We consider a neural net as the model of choice to predict $Y$ using $X_1$ and $X_2$. A random sample of size 10 is drawn. We conduct the permutation test. As seen in figure 1, the test rejects the null hypothesis. However, for 2 out of 200 permutations the model achieves a higher $R^2$ than for the original pairings. Even though the model is capable of capturing the relationship between $X_1$, $X_2$ and $Y$, there are permutations of the vector of responses that can lead to a better performance of the model.

| Tennis serve dataset
This section concerns an application of the permutation test to a tennis serve dataset. Seven professional athletes wearing inertial measurement units (IMUs) performed tennis serves. Each athlete followed a protocol of first and second serves. Sensors were placed on 4 body parts: lower and upper arms, trunk and pelvis, as can be seen in fig. 2.
Each IMU contained a triaxial accelerometer and a triaxial gyroscope. The data consists of 7 uninterrupted time series of 24-dimensional data (4 body parts × 2 types of sensors × 3 axes). The dataset is further described in the Master's thesis (Faneker (2021)).
FIGURE 2 Segment model of right-handed player and racquet (back view, frontal plane).
Additionally, a dataset containing personal characteristics of the players and performance characteristics of each serve has been included. The personal characteristics are the sex, age, height and weight of the players. The performance characteristics are the ball velocity, an indication of whether the ball went in or out, and the velocity-accuracy index (VA index). The VA index for a single serve was introduced and motivated by Kolman et al. (2017) and is defined in (7), where achieved points refer to the number of points assigned to a serve based on its closeness to a target area on the court (see fig. 3). The number of points assigned to a serve is based on a new Serve Tennis Test (STT) adapted from Kolman et al. (2017). Originally, the point system was devised based on the ellipses in the serve box where aces were hit in male tennis matches during the Australian Open (Whiteside and Reid (2017)). However, the system has been improved upon since then. The points are discrete. Nine points are given for hitting the center of the target area. Six and three points are assigned for areas further from the center. One point is assigned for a ball much further from the target area that is still valid, while zero points are given to a serve that went out. Each participant performed approximately 48 serves. In total, 29.6% of serves were faults (and as a result had a VA index of 0).
FIGURE 3 Target areas for the tennis serve. The scenario considered here is a serve in the wide direction. The points given on each target area correspond to the number of accuracy points needed to calculate the VA index of the serve.
We will use the tennis serve dataset to demonstrate an application of the permutation test to real-life data. We will focus on the prediction of ball speed and of the VA index. The functional predictors have been transformed into vectors, using a Fourier basis representation, in order to be able to use the linear regression model with the class of functions $\mathcal{F}_{LR}$ and the neural net with the class of functions $\mathcal{F}_{NN}(300, 300, 300)$. Using Fourier coefficients as predictors was the most natural way of incorporating information from the time series. First, a prediction of ball speed was considered. The permutation test rejected the null hypothesis for both models, as seen in fig. 4a and 4b, although the higher values of $R^2$ achieved by the neural net for the original pairings suggest a greater capability of that model to detect the dependence.
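The transformation of a multichannel time series into a vector of Fourier coefficients, as mentioned above, might look roughly as follows. The paper does not state how the coefficients were computed or how many were kept, so this sketch makes several assumptions: it uses the discrete Fourier transform (`numpy.fft.rfft`) as a stand-in for a projection onto a Fourier basis, keeps the `n_coef` lowest frequencies per channel, and runs on synthetic data. The name `fourier_features` is hypothetical.

```python
import numpy as np

def fourier_features(series, n_coef=5):
    """Map a (time, channels) array to a flat vector of low-frequency
    Fourier coefficients (real parts first, then imaginary parts)."""
    series = np.asarray(series, dtype=float)
    n_time = series.shape[0]
    # Real-input FFT along time; keep the n_coef lowest frequencies per channel.
    coef = np.fft.rfft(series, axis=0)[:n_coef] / n_time
    return np.concatenate([coef.real.ravel(), coef.imag.ravel()])

# Synthetic stand-in for one serve: 200 time points, 24 sensor channels.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200, endpoint=False)
serve = np.column_stack([np.sin(2 * np.pi * (k % 3 + 1) * t) for k in range(24)])
serve += 0.05 * rng.normal(size=serve.shape)

features = fourier_features(serve, n_coef=5)
print(features.shape)  # one predictor vector per serve
```

Stacking such vectors over all serves would give a design matrix that both a linear regression and a neural net can consume. Truncating to low frequencies is one way to control the number of predictors relative to the small number of serves.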
In the case of prediction of the VA index as defined in (7), the permutation test did not reject the null hypothesis either for the linear regression model with the class of functions $\mathcal{F}_{LR}$ or for the neural net model with the class of functions $\mathcal{F}_{NN}(300, 300, 300)$. Fig. 5a shows results for the linear regression model and fig. 5b shows results for the neural net. The values of $R^2$ are quite low for both models, and for many permutations of the $y$-values the generated $R^2$ is much higher than the observed $R^2$ for the true pairings. These results convince us that a good prediction using the linear regression model or the neural net model is not possible at the moment. The issue may lie with the current size of the dataset or the number of serves per player, or simply with the fact that the relation, as far as it can be described by the neural net, is not strong. The fact that the number of Fourier coefficients used in this prediction was increased to achieve a more favourable $R^2$ for the original pairings $(x_i, y_i)$ (at least in the case of the deep learning model) shows how complex this task is and that additional information is needed in the data to increase the $R^2$.
FIGURE 5 Results of the permutation test for the VA index prediction using $\mathcal{F}_{LR}$ and $\mathcal{F}_{NN}(300, 300, 300)$.

| CONCLUSION AND DISCUSSION
This paper concerns the theoretical foundations and the application of the permutation approach for testing whether a model can capture the dependence structure between predictors and responses. The test is a tool to determine whether a model is able to fit the data better than pure noise. We are mostly interested in whether $X$ has any effect on $Y$, and we pursue that interest with the help of a chosen, fixed model. The null hypothesis is formulated in terms of independence of $Y$ and $(f(X))_{f \in \mathcal{F}}$ and in this form cannot be found in previous literature. Proposition 1 allows us to consider the test formally as a permutation test, and Proposition 2 allows us to consider $R^2$ as a test statistic. This approach is data-centered, and the results of the test depend on just one model without the need to directly compare different models. We also do not require sample splitting, so the test can rely on the power of the whole sample size, which can be vital in smaller datasets. Our findings are supported through an application to the tennis serve dataset. In this case, the test gave evidence that a seemingly well-fitting model is not necessarily trustworthy. Either the prediction is not possible with the given sensor data and model, or a larger sample size is needed to predict the VA index more accurately.

| Declaration of interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

| Data availability
The data that support the findings of this study are available from the corresponding author upon request.

| Acknowledgments
We thank the two anonymous referees for valuable comments on the manuscript, in particular on the formulation of an earlier version of the null hypothesis.

A | SIMULATION STUDY
We apply our permutation test in multiple scenarios. This section focuses specifically on simulated datasets to assess the test's performance on datasets with varying levels of dependence between $X$ and $Y$ and two different classes of functions $\mathcal{F}$. An empirical example is considered in section 3.1. In all scenarios we consider the $R^2$-based test.
Two different models will be used to fit the data throughout this section. One of them is a linear regression model, which models the relationship between a random vector $X$ and a random variable $Y$ in a linear manner:
$$Y = X^\top \beta + \epsilon.$$
The parameter vector $\beta$ will always be estimated using the least squares method. Regardless of the length of the vector $X$, the class of functions associated with this model will be referred to as $\mathcal{F}_{LR}$. The other model we consider is a neural net. A neural net is a collection of neurons arranged into layers, with neurons from different layers connected to each other. Typically, a neural net consists of an input layer, multiple hidden layers and an output layer. The parameters of a neural net, i.e. the weights associated with the neurons and the edges between them, are estimated by feeding multiple training sets of inputs and outputs into the net. The weights are adjusted each time based on a predefined cost function.
The class of functions associated with neural nets will be referred to as $\mathcal{F}_{NN}$, with the number of neurons on each layer specified as a $k$-tuple, where $k$ refers to the number of layers, e.g. $\mathcal{F}_{NN}(30, 30, 30)$ is a neural net with 3 hidden layers, each of which contains 30 neurons.
In the first two examples, we will compare the permutation test to two existing methods: Spearman's rank correlation coefficient (also referred to as Spearman's ρ) and the Kendall rank correlation coefficient (also referred to as Kendall's τ). Both are statistics used to measure the rank correlation between two variables, and both can be used as test statistics in a test for independence of two variables. Since our examples have more than one explanatory variable, multiple statistics will be given. It is worth noting that neither statistic is applicable when there is no natural ordering in the data, e.g. in the case of functional data, when the datapoints are functions.
Let $X_1, X_2 \sim N(0, 1)$ and $Y \sim U([0, 1])$ be independent random variables. We consider two models and the two classes of functions associated with them, $\mathcal{F}_{LR}$ and $\mathcal{F}_{NN}(30, 30, 30)$, and a sample of size 100. In both cases the null hypothesis is not rejected, see fig. 6a and 7a. We also consider 1000 repetitions of the experiment in the same setup to see the behavior of the test on a larger number of examples. As seen in fig. 6b and 7b, the null hypothesis is rejected in only a small fraction of repetitions for both models, namely 4.7% for the linear model and 4.5% for the neural net. This shows that rejection of the null hypothesis can still happen even in the case of independence. Most importantly, the rejection rate is close to the significance level $\alpha = 5\%$. Spearman's ρ test rejects the null hypothesis of independence of $X_1$ and $Y$ in 5.2% of all cases and rejects the independence of $X_2$ and $Y$ in 5% of all cases. Kendall's τ test rejects the null hypothesis of independence of $X_1$ and $Y$ in 5.2% of all cases and rejects the independence of $X_2$ and $Y$ in 5.3% of all cases. For both of these tests, the rejection rate is also close to the significance level.
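The repeated experiment under independence can be sketched as a small Monte Carlo loop. This is an illustration under assumptions rather than a reproduction of the reported numbers: only the linear model is used, the number of repetitions is reduced from 1000 to 100 to keep the runtime short, and the helper names `r2_linear` and `rejects` are hypothetical. The estimated rate will therefore fluctuate around the nominal level more than the values quoted in the text.

```python
import numpy as np

def r2_linear(x, y):
    """In-sample R^2 of a least squares linear model with intercept."""
    design = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - np.sum(resid ** 2) / np.sum((y - np.mean(y)) ** 2)

def rejects(x, y, rng, n_perm=200, alpha=0.05):
    """One run of the permutation test; True if H0 is rejected."""
    r2_obs = r2_linear(x, y)
    r2_perm = [r2_linear(x, rng.permutation(y)) for _ in range(n_perm)]
    return bool(r2_obs > np.quantile(r2_perm, 1.0 - alpha))

rng = np.random.default_rng(0)
n_rep, n = 100, 100          # fewer repetitions than in the text (assumption)
hits = 0
for _ in range(n_rep):
    x = rng.normal(size=(n, 2))      # X1, X2 ~ N(0, 1)
    y = rng.uniform(size=n)          # Y ~ U([0, 1]), independent of X
    hits += int(rejects(x, y, rng))

rate = hits / n_rep
print(f"empirical rejection rate under independence: {rate:.3f}")
```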
where $\epsilon \sim N(0, 1)$ is the noise. Consider a sample of size 100. For both $\mathcal{F}_{LR}$ and $\mathcal{F}_{NN}(30, 30, 30)$, the permutation test rejects the null hypothesis, since the values of $R^2$ for the original pairings are much higher than for any of the permuted pairings. For the behavior of the test in a single example see fig. 8a and 9a. In this case the neural net outperforms the linear model significantly, thanks to its complexity. Fig. 8b and 9b show that the rejection rate in this case is quite high when repeating the experiment 1000 times, close to 95% for the linear model and 94% for the neural net. This particular example illustrates the test's applicability in the case of a functional relation between predictors and responses. The model is not just fitting the noise; there is some relation between predictors and responses. It might not be captured well using a linear regression model, but the model is still able to capture more than pure noise. Spearman's ρ test and Kendall's τ test have also been performed in this example, but they show a slight difference from what we see in the case of the permutation test. Spearman's ρ test rejects the null hypothesis of independence of $X_1$ and $Y$ in 99.6% of all cases and rejects the independence of $X_2$ and $Y$ in 9.5% of all cases. Similarly, Kendall's τ test rejects the null hypothesis of independence of $X_1$ and $Y$ in 99.7% of all cases and rejects the independence of $X_2$ and $Y$ in 11.7% of all cases. This shows that the relationship between $X_1$ and $Y$ is easier to capture than the relationship between $X_2$ and $Y$, and with high probability the test will indicate that $X_1$ and $Y$ are not independent. The relationship between $X_2$ and $Y$ is not as easy to capture using Kendall's τ or Spearman's ρ, which is to be expected due to the application of a nonlinear function with a minimum at the mean of $X_2$ when defining $Y$.
FIGURE 7 Results of the permutation test for $\mathcal{F}_{NN}(30, 30, 30)$ with data generated in the following manner: $X_1, X_2 \sim N(0, 1)$ and $Y \sim U([0, 1])$.
For the remaining scenarios in this section, we consider only the linear regression model with the class of functions $\mathcal{F}_{LR}$. We inspect how slight changes in the distribution affect the test, in order to ensure that statistical analysis using the test is reliable and accurate. For a real parameter $a$, let $X_1 \sim N(a, 1)$, $X_2 \sim N(0, 0.1)$ be independent and $Y = \log |X_1| + X_2^2 + \epsilon$, where $\epsilon \sim N(0, 0.1)$ is the noise. Consider a sample of size 100. Note that the variance of $X_2$ has been decreased in comparison to the previous example. Only for values of $a$ close to 0 is the null hypothesis not rejected (fig. 10a). This makes sense, since the logarithm changes most rapidly close to 0, and for those arguments it is difficult to fit a linear function that describes this relationship well. The same pattern appears in the average rejection rate of $H_0$ when the experiment is repeated 100 times for each value of $a$, see fig. 10b. For values of $a$ greater than 0.6, $H_0$ is almost always rejected. When the variance of $X_2$ increases to 0.5, the null hypothesis is no longer rejected for some values of $a$ larger than 5 (fig. 11). This particular case shows the influence of the available information on rejecting the null hypothesis. The less informative the predictors are, the more likely it is that the null hypothesis is not rejected; we can see that as the parameter $a$ increases, $\log |X_1|$ becomes flatter, slowly losing its predictive value. Meanwhile, the influence of $X_2^2$ on the value of $Y$ increases, and given that the model can only predict linearly in $X_2$, the power of the test decreases.
Fig. 12 and 13 show explicitly the influence of the sample size on the test's capability to reject $H_0$ for the linear regression model with the class of functions $\mathcal{F}_{LR}$. In the case when $H_0$ is true (fig. 12), the null hypothesis is rejected at a rate of 2-8% on average regardless of the sample size. In the case when $H_0$ is false (fig. 13), specifically with $Y = \log(X) + \epsilon$ for $\epsilon \sim N(0, 1)$, the null hypothesis is rejected much less often for smaller sample sizes, and the rejection rate increases with the sample size, reaching close to 95% at sample size 300. We can conclude that the power of our test increases up to a sample size of around 300, at which point the type II error is particularly low. Meanwhile, rejection of a true null hypothesis is rare, even for the smallest sample sizes.
Using the bivariate normal distribution with varying correlation, we can empirically detect the point at which the test rejects $H_0$ for the linear regression model with the class of functions $\mathcal{F}_{LR}$ as the variables become more and more correlated. Fig. 14 shows that as the correlation reaches 0.3, the test starts to reject $H_0$ almost always in the case of sample size $n = 100$. We conclude that for sample size $n = 100$, the dependence is only reliably detectable by the test when the correlation between the variables is greater than 0.3. This particular example shows that for a given sample size a certain threshold of correlation exists at which the test starts to reject the null hypothesis. As the correlation increases, rejection becomes more and more likely for a given sample size.
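A rough version of this correlation experiment can be sketched as follows. It relies on the standard fact that, for simple linear regression with an intercept, the in-sample $R^2$ equals the squared sample correlation, which avoids refitting the model explicitly. The grid of correlations, the reduced number of repetitions and the names `r2_simple` and `rejection_rate` are assumptions made for illustration; the rates obtained this way are noisy estimates and need not match fig. 14.

```python
import numpy as np

def r2_simple(x, y):
    """R^2 of simple linear regression with intercept (= squared correlation)."""
    return np.corrcoef(x, y)[0, 1] ** 2

def rejection_rate(rho, n=100, n_rep=50, n_perm=200, alpha=0.05, seed=0):
    """Fraction of repetitions in which the permutation test rejects H0."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    hits = 0
    for _ in range(n_rep):
        # Draw n pairs from a bivariate normal with correlation rho.
        x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        r2_obs = r2_simple(x, y)
        r2_perm = [r2_simple(x, rng.permutation(y)) for _ in range(n_perm)]
        hits += int(r2_obs > np.quantile(r2_perm, 1.0 - alpha))
    return hits / n_rep

rates = {rho: rejection_rate(rho) for rho in (0.0, 0.3, 0.6)}
for rho, rate in rates.items():
    print(f"correlation {rho:.1f}: rejection rate {rate:.2f}")
```

Repeating the call with different values of `n` gives a simple way to explore how the correlation threshold for reliable detection moves with the sample size.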
Lastly, we present a comparison of our permutation test with a permutation test found in Pesarin and Salmaso (2010). This is also a test for no effect, but specifically in the linear regression model. Its formulation requires a sample

$R^2$ is bounded by 1 from above for any model. The proximity of the $R^2$ values calculated from the permuted data or of the original $R^2$ values to 1 or to each other can provide insight into the goodness of fit of a model. The closer the values of

FIGURE 4 Results of the permutation test for the ball speed prediction using $\mathcal{F}_{LR}$ and $\mathcal{F}_{NN}(300, 300, 300)$. (a), (b) Histogram of the distribution of generated $R^2$ using permutation of $y$ values. The sample size is 46. The red line denotes the observed $R^2$ for the true pairings of $X$ and $Y$, the green line denotes the 95%-quantile of the empirical distribution of $R^2$ (approximation using 200 permutations).

(a), (b) Histogram of the distribution of generated $R^2$ using permutation of $y$ values. The sample size is 34. The red line denotes the observed $R^2$ for the true pairings of $X$ and $Y$, the green line denotes the 95%-quantile of the empirical distribution of $R^2$ (approximation using 200 permutations).

FIGURE 6 Results of the permutation test for $\mathcal{F}_{LR}$ with data generated in the following manner: $X_1, X_2 \sim \mathcal{N}(0, 1)$ and $Y \sim U([0, 1])$. (a) Histogram of the distribution of generated $R^2$ using permutation of $y$ values. The model considered here is linear regression with the class of functions $\mathcal{F}_{LR}$. The sample size is 100. The red line denotes the observed $R^2$ for the true pairings of $X$ and $Y$, the green line denotes the 95%-quantile of the empirical distribution of $R^2$ (approximation using 200 permutations). (b) Scatterplot of the $R^2$ values for the original pairings against the 95% quantiles of the empirical distribution of $R^2$. The orange line shows the identity function.

(a) Histogram of the distribution of generated $R^2$ using permutation of $y$ values. The model considered here is a 3-layered neural net with the class of functions $\mathcal{F}_{NN}(30, 30, 30)$. The sample size is 100. The red line denotes the observed $R^2$ for the true pairings of $X$ and $Y$, the green line denotes the 95%-quantile of the empirical distribution of $R^2$ (approximation using 200 permutations). (b) Scatterplot of the $R^2$ values for the original pairings against the 95% quantiles of the empirical distribution of $R^2$. The orange line shows the identity function.

FIGURE 8 Results of the permutation test for $\mathcal{F}_{LR}$ with data generated in the following manner: $X_1 \sim \mathcal{N}(1, 1)$, $X_2 \sim \mathcal{N}(0, 1)$ and $Y = \log|X_1| + X_2^2 + \epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$. (a) Histogram of the distribution of generated $R^2$ using permutation of $y$ values. The model considered here is linear regression with the class of functions $\mathcal{F}_{LR}$. The sample size is 100. The red line denotes the observed $R^2$ for the true pairings of $X$ and $Y$, the green line denotes the 95%-quantile of the empirical distribution of $R^2$ (approximation using 200 permutations). (b) Scatterplot of the $R^2$ values for the original pairings against the 95% quantiles of the empirical distribution of $R^2$. The orange line shows the identity function.

FIGURE 9 Results of the permutation test for $\mathcal{F}_{NN}(30, 30, 30)$ with data generated in the following manner: $X_1 \sim \mathcal{N}(1, 1)$, $X_2 \sim \mathcal{N}(0, 1)$ and $Y = \log|X_1| + X_2^2 + \epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$. (a) Histogram of the distribution of generated $R^2$ using permutation of $y$ values. The model considered here is a 3-layered neural net with the class of functions $\mathcal{F}_{NN}(30, 30, 30)$. The sample size is 100. The red line denotes the observed $R^2$ for the true pairings of $X$ and $Y$, the green line denotes the 95%-quantile of the empirical distribution of $R^2$ (approximation using 200 permutations). (b) Scatterplot of the $R^2$ values for the original pairings against the 95% quantiles of the empirical distribution of $R^2$. The orange line shows the identity function.

FIGURE 10 Results of the permutation test for $\mathcal{F}_{LR}$ with data generated in the following manner: $X_1 \sim \mathcal{N}(a, 1)$, $X_2 \sim \mathcal{N}(0, 0.1)$ and $Y = \log|X_1| + X_2^2 + \epsilon$, where $\epsilon \sim \mathcal{N}(0, 0.1)$. (a) The plot shows the results of performing the permutation test for the linear regression model with the class of functions $\mathcal{F}_{LR}$. The sample size is 100. The blue dots show the observed $R^2$ and the orange dots show the 95% quantile of the empirical distribution of generated $R^2$ (approximation using 200 permutations). The test has been performed for values of $a$ ranging between 0 and 10. (b) Average rejection rate of $H_0$ with parameter $a$ varying from 0 to 1. For each $a$ 100 repetitions were made.

FIGURE 11 Results of the permutation test for $\mathcal{F}_{LR}$ with data generated in the following manner: $X_1 \sim \mathcal{N}(a, 1)$, $X_2 \sim \mathcal{N}(0, 0.5)$ and $Y = \log|X_1| + X_2^2 + \epsilon$, where $\epsilon \sim \mathcal{N}(0, 0.1)$. (a) The plot shows the results of performing the permutation test for the linear regression model with the class of functions $\mathcal{F}_{LR}$. The sample size is 100. The blue dots show the observed $R^2$ and the orange dots show the 95% quantile of the empirical distribution of generated $R^2$ (approximation using 200 permutations). The test has been performed for values of $a$ ranging between 0 and 10. (b) Average rejection rate of $H_0$ with parameter $a$ varying from 0 to 10. For each $a$ 100 repetitions were made.

FIGURE 12 Results of the permutation test for $\mathcal{F}_{LR}$ with data generated in the following manner: $X, Y \sim \mathcal{N}(\mu, \Sigma)$, (a) The plot shows the results of performing the permutation test for the linear regression model with the class of functions $\mathcal{F}_{LR}$. The blue dots show the observed $R^2$ and the orange dots show the 95% quantile of the empirical distribution of generated $R^2$ (approximation using 200 permutations). The test has been performed for sample sizes ranging between 10 and 1000. (b) Average rejection rate of $H_0$ with sample size varying from 10 to 1000. For each sample size 100 repetitions were made.

FIGURE 13 Results of the permutation test for $\mathcal{F}_{LR}$ with data generated in the following manner: $X \sim \mathcal{N}(5, 1)$ and $Y = \log|X| + \epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$. (a) The plot shows the results of performing the permutation test for the linear regression model with the class of functions $\mathcal{F}_{LR}$. The blue dots show the observed $R^2$ and the orange dots show the 95% quantile of the empirical distribution of generated $R^2$ (approximation using 200 permutations). The test has been performed for sample sizes ranging between 10 and 300. (b) Average rejection rate of $H_0$ with sample size varying from 10 to 300. For each sample size 100 repetitions were made.

FIGURE 14 Results of the permutation test for $\mathcal{F}_{LR}$ with data generated in the following manner: $X, Y \sim \mathcal{N}(\mu, \Sigma)$, (a) The plot shows the results of performing the permutation test for the linear regression model with the class of functions $\mathcal{F}_{LR}$. The sample size is 100. The blue dots show the observed $R^2$ and the orange dots show the 95% quantile of the empirical distribution of generated $R^2$ (approximation using 200 permutations). The test has been performed for values of correlation $\rho$ ranging between 0 and 1. (b) Average rejection rate of $H_0$ with parameter $\rho$ varying from 0 to 1. For each $\rho$ 100 repetitions were made.