Order Invariant Tests for Proper Calibration of Multivariate Density Forecasts

Established tests for proper calibration of multivariate density forecasts based on Rosenblatt probability integral transforms can be manipulated by changing the order of variables in the forecasting model. We derive order invariant tests. The new tests are applicable to densities of arbitrary dimensions and can deal with parameter estimation uncertainty and dynamic misspecification. Monte Carlo simulations show that they often have superior power relative to established approaches. We use the tests to evaluate GARCH-based multivariate density forecasts for a vector of stock market returns.


Introduction
The use of density forecasts has recently become common in many scientific fields (Gneiting and Katzfuss, 2014) and, in particular, in many areas of economics. Density forecasts are increasingly used, for instance, in the fields of energy economics (Huurman et al., 2012), demand management (Taylor, 2012), finance (Shackleton et al., 2010;Kitsul and Wright, 2013;Hallam and Olmo, 2014;Ghosh and Bera, 2015), and macroeconomics (Clark, 2011;Herbst and Schorfheide, 2012;Aastveit et al., 2014;Wolters, 2015;Amisano and Geweke, 2017). Many tasks, such as the computation of Value-at-Risk measures for portfolios containing multiple assets or the planning of production for a firm that serves many markets from one central production facility, require the construction and evaluation of multivariate density forecasts. Beginning with Smith (1985) and Diebold et al. (1999), the literature has proposed several approaches for testing whether a sequence of multivariate density forecasts coincides with the corresponding true densities (e. g., Smith, 2000, 2002;Corradi and Swanson, 2006a;Bai and Chen, 2008;González-Rivera and Yoldas, 2012;Ko and Park, 2013a;Ziegel and Gneiting, 2014).
This strand of literature has neglected two important issues (Ziegel and Gneiting, 2014 being an exception). First, established tests depend on the order of variables in a multivariate model. As a consequence, researchers need to present a myriad of (sometimes inconclusive) results. The issue even offers room for data mining if a researcher decides to report only those results which correspond to one particular ("preferred") order. Essentially, this is one form of "data snooping" as discussed in White (2000). Second, all empirical applications and many of the theoretical results focus on the bivariate case. However, many applications, especially in finance, require models of higher dimensionality to be useful. We address both issues in this paper.
Following Diebold et al. (1999), the most commonly used approach for testing the calibration of multivariate density forecasts is based on the Rosenblatt (1952) probability integral transform (PIT). Examples include Clements and Smith (2000), Clements and Smith (2002), Ko and Park (2013a), and Ko and Park (2013b). This approach relies on the factorization of the multivariate forecast distribution into conditional distributions because these, in turn, can be used to form independent PITs which, for well-specified models, follow a uniform distribution. 1 Suitable transformations of these conditional PITs then lead to a reduction of the multivariate testing problem to a univariate one. How well a testing approach works depends crucially on the chosen transformation. The univariate tests can be implemented using any goodness-of-fit test (e. g., Neyman's smooth test, the Kolmogorov-Smirnov test, or the Anderson-Darling test).
Two aspects that are of utmost importance in practical applications that involve parametric forecasting models are parameter estimation uncertainty and dynamic misspecification. Corradi and Swanson (2006b) present a comprehensive overview about these aspects. Dynamic misspecification refers to the fact that a forecaster potentially uses only a subset of the relevant information to form a conditional density forecast. In most fields of economics and finance this is very likely. Parameter estimation uncertainty arises whenever a parametric forecast model is used to construct density forecasts whose parameters are estimated based on finite samples. Whether estimation uncertainty has to be dealt with when evaluating a sequence of predictive densities depends on the exact formulation of the null hypothesis one is interested in. One common approach is to test whether the forecast distribution belongs to a given parametric density family with parameters evaluated at their pseudo-true values. Thus one tests whether the underlying model specification is suitable to generate appropriate conditional density forecasts. An alternative view, proposed in Rossi and Sekhposyan (2016), is to test for the ability of a model to produce correct forecast distributions evaluated at the estimated parameter values. This means one tests whether the estimated model is suitable to generate appropriate conditional density forecasts. The tests that we propose are, in general, capable of handling both views.
A number of approaches have been suggested in the literature to address parameter estimation uncertainty. Bai (2003) (for the univariate case) and Bai and Chen (2008) (for multivariate densities) combine the Kolmogorov test with Khmaladze's martingale transformation to obtain a test which is distribution free in the presence of estimated parameters. Andrews (1997) solves this problem by using a parametric bootstrap. Duan (2004) uses a suitable sequence of transformations to obtain a parametric test that does not suffer from parameter estimation error. More recently, Chen (2011) adapts a number of tests from the parameter-free context to parameter-dependent density forecast evaluation, building on insights from Newey (1985) and Tauchen (1985) in the in-sample case and from West (1996) and West and McCracken (1998) in the out-of-sample case. This is the approach that we use in our paper.
Dynamic misspecification causes the PITs to be serially correlated. A number of papers propose tests that are robust against dynamic misspecification, i. e., preserve this misspecification under the null hypothesis. Usually, at the same time, those papers also consider the effects of parameter estimation uncertainty. Pioneering work in this context has been made by Corradi and Swanson (2006a), who show that a block bootstrap can be used to adjust Kolmogorov-type tests under such conditions. Their parametric approach has the advantage of a higher rate of convergence relative to the non-parametric test proposed in Hong and Li (2005). Both papers assume a stationary data generating process. In contrast, Rossi and Sekhposyan (2013) relax this assumption and propose a test for correct specification of density forecasts that is robust, in addition, to structural  Diebold et al. (1998), Diebold et al. (1999), Smith (2000, 2002), Ko and Park (2013a) Andrews (1997), Bai (2003), Chen (2011) Accounted for Knüppel (2015) Corradi and Swanson (2006a) Tested for Berkowitz (2001), Rossi and Sekhposyan (2016) Hong and Li (2005), Hong et al. (2007), Ko and Park (2013b), Lin and Wu (2017), Gonzlez-Rivera et al. (2011), González-Rivera and Yoldas (2012), González-Rivera and Sun (2015) Notes: This table contains a non-exhaustive collection of papers taking different views on how parameter estimation and dynamic misspecification should be treated when testing the calibration of (predictive) densities.
breaks. Serial dependence of PITs is also an issue in the context of multi-step forecasts (see, e.g., Knüppel, 2015). PITs based on h-step-ahead density forecasts will generally follow a moving average process of order h − 1.
Other papers test jointly for uniformity and the i.i.d. property of the PITs, thereby testing the null hypothesis of completely calibrated densities (Mitchell and Wallis, 2011). The first contributions in this context use simultaneous tests. Berkowitz (2001), for instance, develops a likelihood ratio test for this joint null hypothesis, allowing for serial correlation of different order under the alternative. Hong et al. (2007) (using a nonparametric kernel density estimate) and Ko and Park (2013b) (using a parametric density estimate) propose a test based on the joint distribution of consecutive PITs. Lin and Wu (2017), in contrast, propose a sequential procedure consisting of a data driven smooth test for uniformity, preceded by a test of serial independence of the PITs. An alternative approach that relies on so-called (generalized) autocontours has recently been proposed by Gonzlez-Rivera et al. (2011) and González-Rivera and Sun (2015).
Thus, there is a wide range of views about how density forecasts should be tested which we summarize in Table 1. In practice, the exact formulation of the testing problem depends on several factors such as the type of application or whether predictive densities are model-based or obtained via a survey. The methods that we propose below are compatible with any combinations of views about how dynamic misspecification and parameter uncertainty should be treated.
In this paper, we contribute to the literature on the evaluation of multivariate density forecasts in the following way. We propose new transformations of the conditional PITs which can be combined with any goodness-of-fit test for univariate distributions. Our preferred transformations (called Z 2 t * and Z 2 t † below) are constructed as the sum of squares of inverse normal transformations of the conditional PITs corresponding to all possible orders of the variables. The new transformations have a number of advantages. First, they are order invariant, a concept we define below, meaning that test results do not depend on the order of variables in the forecasting model. We show that the distortions in rejection rates caused by a tendentious application of the established tests, which are not order invariant, can be very substantial. Second, the new tests are applicable to densities of arbitrary dimension. Third, they have better power (relative to established tests) against a wide range of alternatives. Furthermore, we show that our tests can also be used when dynamic misspecification and parameter uncertainty have to be taken into account. In an application, we show that the new tests are helpful for testing the appropriateness of density forecasts based on sophisticated multivariate models for vectors of financial returns. In particular, we show that the potential for data mining is immense when using the established tests in practice and that our order invariant tests are required to draw unambiguous conclusions.
The remainder of this paper is organized as follows. In Section 2, we describe the testing problem, generalize established tests, and derive new tests to evaluate multivariate densities. In Section 3, we assess the finite sample properties of different tests by means of Monte Carlo simulations. In Section 4, we demonstrate the usefulness of the newly proposed tests in an application to forecasting the distribution of a vector of stock returns. Section 5 concludes. The Appendix contains all proofs and technical derivations as well as additional simulation results.

Setup and Test Hypothesis
] be a vector-valued continuous random variable with true (but unknown) conditional distribution function (CDF) G Yt (Y t |I t−1 ), where I t−1 denotes the relevant information set available at time t − 1. Furthermore, we consider the predictive CDF F Yt (Y t |Ω t−1 , θ 0 ) with corresponding conditional probability density function (PDF) f Yt (Y t |Ω t−1 , θ 0 ), where Ω t−1 ⊆ I t−1 is the information set available to the researcher and θ 0 denotes a parameter vector with compact and finite parameter space Θ. This framework takes into account that density forecasts are often constructed using parametric models and allows for dynamic misspecification as defined, for instance, by Corradi and Swanson (2006a). For the time being, we treat θ 0 as known. In practice, the parameters have to be estimated from the data. We discuss below how this affects the testing problem and how we can take estimation uncertainty into account.
Consider a sample {Y t , Ω t−1 } n t=1 of which the first R observations can potentially be used to estimate θ 0 and the remaining P observations are used to evaluate the predictive densities generated by F Yt (Y t |Ω t−1 , θ 0 ). We are interested in testing whether the model F Yt (Y t |Ω t−1 , θ 0 ) is correctly specified in the sense that (2.1) To specify the null exactly, assumptions need to be made about whether θ 0 has to be estimated and about whether dynamic misspecification can be ignored, should be controlled for, or should jointly be tested. In Table 1 above, we provide an overview about the assumptions made in the literature.
In the univariate case, H 0 implies that the probability integral transform (PIT), given by U t = F Yt (Y t ), is uniformly distributed between 0 and 1 (see, e.g., Gneiting and Katzfuss, 2014). This fact can be used to test for proper density calibration (e. g., Dawid, 1984;Diebold et al., 1998). The uniformity can be checked either by graphical methods, such as QQ-plots and histograms, or by goodness-of-fit tests, such as the Kolmogorov-Smirnov test, the Anderson-Darling test, or Neyman's smooth test.
Unfortunately, matters are more complicated in the multivariate case because the distribution of the multivariate PITs of Y t under the null is unknown, in general, for d > 1 (see, e. g., Genest and Rivest, 2001). In essence, the task then is to reduce the multivariate problem to a univariate one by using suitable transformations. One way to approach this problem, proposed in Ziegel and Gneiting (2014), is to work with the Kendall distribution function for F Yt (Y t |Ω t−1 , θ 0 ). 2 The more commonly used way to approach this problem is based on the factorization of the joint densities into the product of conditional densities. Let F Y i (Y i,t |Ω t−1 , θ 0 ) denote the marginal CDF for the i th element of Y t and denote by . . , Y 1,t . Suppressing the dependence on Ω t−1 and θ 0 , we can then write (2.2) Rosenblatt (1952) shows that the sequences of conditional PITs for the elements of Y t We experimented with this approach but results were much worse (in terms of power) than those based on alternative approaches presented below. Therefore, we do not report them in this paper. are independent of each other and distributed U(0, 1). The next step is to obtain a univariate testing problem based on this vector of PITs. Diebold et al. (1999) achieve the reduction of dimension by stacking all conditional PITs. More formally, if we let Instead of stacking the conditional PITs, a commonly used alternative is to transform the vector-valued random variable Y t into a scalar random variable and to compute PITs for this transformed random variable. This is also the approach that we use below when developing our new tests. The computation of the conditional PITs is often an intermediate step in such transformations. To formalize the idea, consider the general transform function g t (·) : R d → R and define the transformed series W t = g t (Y t ) with distribution function F Wt . The PIT of W t is given by Testing H 0 then is equivalent to testing whether U W t ∼ U(0, 1). In the absence of dynamic misspecification, the PITs are also independently distributed across time under H 0 , i. e., U W ∼ U(0, 1). Mitchell and Wallis (2011) call density forecasts which satisfy both features completely calibrated.

The Order of Variables
So far, we have implicitly assumed that there exists a natural order of variables from 1 to d. This, of course, is not true as already mentioned in most papers on the topic (Diebold et al., 1999;Clements and Smith, 2002;Hong and Li, 2005;Ishida, 2005). Ordering the elements in Y t in a different way will generally lead to different results because the Rosenblatt transform in (2.3) clearly depends on the order of the variables. Consequently, the outcome of a hypothesis test will depend on the selected order. This is an undesirable property for a test since a researcher who is interested in supporting or discrediting a certain model may perform the hypothesis test for all distinct orders and only report the outcome with the largest or smallest p-value. 3 While it is certainly true that for lowdimensional cases results for all possible permutations can be presented and discussed, this becomes quickly impossible for larger d. In addition, even when multiple test statistics are presented, it is unclear how an overall decision should be made. We use the following notation for different permutations of the variables. Let π k , for k = 1, . . . , d!, be the set of all possible permutations of the data. Furthermore, let π k (i) denote the index (or "position") of variable i in the k th permutation. Then, the conditional PITs under permutation π k are given by (2.6) The following definition formalizes the concept that the initial order of the data is not relevant for the test outcome.
In the next section, we show that established transformations are order invariant only under very restrictive conditions and we derive new transformations that are always order invariant.

Established Transformations
Different transformations g t (·) have been considered in the literature. Clements and Smith (2000) propose to evaluate density forecasts based on the product of the conditional PITs corresponding to one particular permutation of the variables. In this case, the transformation function g t (·) is given by (2.7) where we define U 1|1:0 t = U 1 t . Ko and Park (2013a) explain why tests based on CS t,d have good power only against correlations lower than the hypothesized value. They suggest a location-adjusted version which does not suffer from this asymmetry and is given by (2.8) Tests based on the transformations suggested by Diebold et al. (1999), Clements and Smith (2000), and Ko and Park (2013a) are not, in general, insensitive to the choice of the permutation.

New Transformations
The first transformation that we propose leads to order-invariant test statistics under less restrictive conditions and forms the basis for additional transformations that always lead to order invariant tests. Consider the following transformation which is based on the squares of inverse normal transforms of the PITs for one particular permutation of the data: where Φ(·) denotes the CDF of the standard normal distribution.
, when Y t follows a multivariate normal distribution with mean vector µ and covariance matrix Σ.
Of course, Z 2 t can also be used to test non-Gaussian densities. In this case, however, the corresponding test statistics are not generally order invariant, except for the obvious case of independence. The proof of Proposition 2 in the appendix shows that under the null hypothesis of normality it holds that , which is the transformation proposed by Ishida (2005).
Ideally, however, we would like to obtain a transformation that is order invariant in general. A transformation that fulfils this criterion is similar in structure to Z 2 t but considers the sum over all distinct conditional PITs. Consider all possible permutations π k for k = 1, . . . , d! and the corresponding sequences of conditional PITs defined by (2.6). This yields a total of d × d! terms. However, the number of distinct PITs is only To formalize, let γ k i , for k = 1, . . . , 2 d−1 , be the set of all sets of conditioning variables (including the empty set) corresponding to all distinct conditional PITs for Y i,t . Then the suggested transformation has the form (2.10) Since all distinct conditional PITs enter into this transformation and, thus, the initial order of the variables in Y t is irrelevant, order invariance is clearly ensured for any test statistic based on Z 2 t * .
When d increases, the number of terms entering Z 2 t * can become prohibitively large. In this case, it appears sensible to use a transformation for which the number of terms grows only linearly with d. Such transformation that, at the same time, is always order invariant can be obtained by picking a suitable subset from the set of all distinct conditional PITs on which Z 2 t * is based. The transformation that we propose considers only such conditional PITs corresponding to each Y i,t which are conditional on all other variables. Denoting those conditional PITs by U i|−i t , the transformation is given by (2.11) We think that this particular choice has some merits since the considered subset of conditional PITs contains rich information about the dependence structure of the elements of Y t .

Distribution of Transformations
To test H 0 based on the transformations, we need to know their distribution under the null hypothesis as indicated by (2.5). The distributions of the established transformations are as follows. S t is simply a vector of independent uniformly distributed random variables under H 0 . Clements and Smith (2000) derive the distribution of CS t for d = 2, 3. In the appendix, we show that for arbitrary d its distribution under H 0 is described by the following PDF and CDF: 4 Note that for d = 2, 3 the densities derived in Clements and Smith (2000) is recovered. Ko and Park (2013a) provide the distribution of KP t for d = 2. 5 In the appendix, we show that for arbitrary d its distribution under H 0 is described by the following PDF and CDF: In the next two subsections, we derive the distributions of the new transformations. We distinguish two cases. In the first case, we do not make any assumptions about the distribution of Y t , except that it is continuous. The corresponding results include, for instance, cases in which H 0 implies non-Gaussian parametric distributions of Y t or its distribution is not available analytically so that the conditional PITs have to be calculated numerically. In the second case, we show that if Y t is normally distributed the distributions of Z 2 * and Z 2 † become much more tractable.

Distributions of New Transformations: General Case
As shown in Section 2.1, the different conditional PITs for one particular permutation are independent. Therefore, H 0 implies that Z 2 t,d ∼ χ 2 d , where χ 2 d denotes the chi-squared distribution with d degrees of freedom. Denoting its CDF by F χ 2 d , the random variable However, due to the fact that the summands in (2.10) are not independent in general, Z 2 t * no longer follows a χ 2 distribution under H 0 . The same argument applies in the case of Z 2 t † . However, we can straightforwardly obtain the distributions of the transformations by Monte Carlo simulation as long as it is possible to generate random draws from the density model under H 0 .
The following algorithm describes how the distributions of Z • t = {Z 2 t * , Z 2 t † } can be approximated numerically to compute U Z • t . We would like to stress that this algorithm is exclusively used to approximate this distribution for a given parameter value that can be either θ 0 or an estimateθ. We show below how parameter uncertainty is accounted for in the latter case at another step of the testing procedure.
1. Generate M conditional forecasts, y (m) t , based on the model under H 0 , i.e., draw repeatedly from the conditional predictive densities f Yt (·).
by simply counting how often the transformed statistic based on the actual realizations is smaller than the transformed statistics based on conditional forecasts that are generated under H 0 .
is distributed U(0, 1) for M sufficiently large. Validity of this simulation approach is straightforward as we simulate directly from the (parametric) null distribution and only apply continuous transformations.

Distributions of New Transformations: Gaussian Case
Under the assumption that Y t is normally distributed, the distributions of Z 2 t * and Z 2 t † are available analytically and do not need to be simulated. In this case, the terms Φ −1 U i|γ k i t jointly follow a multivariate normal distribution. However, since their marginal distributions are not independent, the transformations do not follow a chi-squared distribution but a mixture of chi-squared distributions, where the weights depend on the dependence structure of the Φ −1 U i|γ k i t . For Z 2 t * , we obtain the following result: A typical entry of R Z * is given by where the Σ r,c (r, c ∈ {i, γ k i }) are scalars, vectors, and matrices containing those elements of Σ that are defined by the row(s) corresponding to the variable(s) defined by r and the column(s) corresponding to the variable(s) defined by c.
The distribution of Z 2 t † in the Gaussian case is given by the following corollary: where the index −i denotes all rows/columns of Σ except for the i th one.
Note that, of course, Z 2 t,d ∼ χ 2 d continues to hold under H 0 in the Gaussian case.

Tests for Proper Calibration
In this section, we describe how we can construct tests of H 0 based on the transformations derived in the previous section. Depending on the formulation of the null hypothesis, this involves either testing that U W t ∼ U(0, 1) or jointly testing U W t ∼ U(0, 1) and independence across time. In the former case, we use Neyman's (1937) smooth test and distinguish two cases. First, we do not account for potential autocorrelation in U W t and assume that we know the parameters of the model that is used to generate the densities. Then, we construct tests that are robust against the existence of autocorrelation in U W t and assume that we have to estimate the parameters of the density model. To test the joint hypothesis in the second case, we resort to the concept of Generalized Autocontour (G-ACR) suggested by González-Rivera and Sun (2015).

Known Parameters and no Dynamic Misspecification
The first case that we consider ignores the issues of parameter uncertainty and dynamic misspecification. We assume θ 0 is known and Ω t−1 = I t−1 . The null hypothesis is U W t ∼ U(0, 1). 6 Many tests can be used in this context. We follow, Bera and Ghosh (2002) and De Gooijer (2007) who advocate testing uniformity with Neyman's (1937) smooth test.
To understand Neyman's smooth test, consider the alternative family of smooth distributions with b 0 a normalization constant and ψ i the orthonormal Legendre polynomials. Testing uniformity (and hence H 0 ) against all distributions nested in (2.12) boils down to testing b i = 0 for all i = 1, . . . , k. Here we consider the first four Legendre polynomials, but in principle one could also determine the number of polynomials in a data-driven fashion as suggested by Ledwina (1994) and applied, for instance, by Lin and Wu (2017). A score test is easily computed as follows. Denoting the vector of (log-)scores of (2.12) where I 4 is the 4 × 4 identity matrix. The Neyman smooth test statistic is then given by (2.14) which follows a χ 2 4 distribution under H 0 . This result, however, only holds when the model parameters are assumed to be known and when there is no dynamic misspecification.

Estimated Parameters and Accounting for Dynamic Misspecification
Parameter uncertainty and dynamic misspecification are often relevant in practice when parametric forecast models are used and the DGP of the variables to be forecast (including the true values of the relevant parameters) is unknown to the forecaster. Ignoring both issues will, in general, lead to oversized tests in an out-of-sample evaluation. Thus, we now assume estimates of the parameters,θ, are obtained using a √ T -consistent estimator and Ω t−1 ⊂ I t−1 . We test again if U W t ∼ U(0, 1). We adjust Neyman's smooth test by relying on results in West (1996) and West and McCracken (1998) to derive suitable tests in the presence of parameter uncertainty and potential dynamic misspecification. Recall that we split our n observations into R insample observations, which we use to estimate the parameters, and P out-of-sample observations, which we use to evaluate the forecast model.
] denote the Legendre polynomials in the estimated PITs of the (univariate) transformed series W t , whereÛ W t has been computed using the in-sample parameter estimates. It follows that under H 0 the elements ofξ t are no longer independently distributed with unit variance as in (2.13), but where (using the notation in Chen, 2011) . The constants η 1 and η 2 are determined by the sampling scheme (fixed, rolling, or recursive) used to estimate the parameters and the limiting value of the ratio of in-sample and out-of-sample observations λ = lim n→∞ P/R; see Chen (2011) for the precise formulas.
In order to avoid the evaluation of the matrices A and C (the latter of which may be particularly tedious to obtain), we use the fact that the equalities A + B = 0 and C + D = 0 continue to hold even under dynamic misspecification, even though in this case they cannot be interpreted as (generalized) information matrix equalities; see White (1994). Thus, we can rewrite (2.16) as (2.17) The matrices B and D can be estimated straightforwardly by their sample counterparts. In contrast, S * , B * , and D * need to be estimated by an appropriate estimator that is autocorrelation consistent. While in principle the widely used HAC estimator by Newey and West (1987) could be used, we found results in finite samples to be better (in terms of size) if we use the quadratic spectral estimator proposed by Andrews (1991). Neyman's smooth test statistic is then given by which follows a χ 2 4 distribution under H 0 . In addition, we can consider two intermediate cases.

Estimated Parameters and Testing for Independence
To jointly test for proper calibration and temporal independence of PITs, we have to test the null hypothesis U W t i.i.d.
∼ U(0, 1). The G-ACR proposed in González-Rivera and Sun (2015) can be used to test this hypothesis and seems to work well in practice. Therefore, we adapt this approach to our testing problem. The idea is the following: consider the pair of PITs [U W t , U W t−k ] ⊂ R 2 . Under the null hypothesis, the G-ACR for a level of α ∈ (0, 1) is given by the set of points B such that We apply the G-ACR method to the PITs of the transformed variable(s) because we would loose the order invariance of our results if we directly compute the G-ACRs for the multivariate densities as the multivariate version proposed by González-Rivera and Sun (2015) also relies on the Rosenblatt transform and, thus, one particular order of the variables. A test is then based on the comparison of the empirical frequency with which the (current and lagged) PITs fall into the square defined by the G-ACR to its distribution under the null hypothesis (potentially involving results for a range of values for α and k). For details on the test statistic we refer the reader to Proposition 3 in González- Rivera and Sun (2015). In our simulations, we implement the test for k = 1 and with a set of 13 quantile values for α (jointly). Note that in the case of estimated parameters a bootstrap needs to be applied to obtain critical values of the test.

Monte Carlo Simulations
We use Monte Carlo simulations to analyze how severe the size and power distortions caused by data mining can be in the case of the order-dependent approaches and how the size and power of the tests based on the different transformations compare. In the main text, we consider the problem of testing the null hypothesis of a multivariate normal distribution for the cases of i) known parameters and no dynamic misspecification and ii) estimated parameters and accounting for dynamic misspecification. In the appendix, we consider the following additional settings: i) testing the null hypothesis of a multivariate t distribution, ii) testing the null hypothesis of a multivariate GARCH model, and iii) jointly testing the null hypothesis of a multivariate normality and independence using the G-ACR approach.

Known Parameters and no Dynamic Misspecification
Assume that the data generating process (DGP) under the null hypothesis is given by The d × d covariance matrix Σ is constructed such that all elements of y t have unit variances (σ 2 i = 1 for i = 1, . . . , d) and the correlation between any two elements of y t is equal to 0.5 (ρ ij = 0.5 for all i = j). We consider different dimensions between d = 2 and d = 50 7 , and (predictive) sample sizes of P = {50, 100, 200}. Throughout the paper, we use 10,000 iterations for our Monte Carlo simulations. We consider seven alternative DGPs which imply different (combinations of) deviations from H 0 : • Alternative 1 (H 1 ): The innovations are generated from a multivariate normal distribution with σ i = 1.1 and ρ ij = 0.5.
• Alternative 6 (H 6 ): The innovations are generated from a multivariate t distribution with 8 degrees of freedom with σ i = 1.1 and ρ ij = 0.4.
We use Neyman's smooth test as described in Section 2.6.1 for this set of simulations.

Estimated Parameters and Accounting for Dynamic Misspecification
We consider two types of simulations with dynamic misspecification and estimated parameters. First, we generate data by the following homoskedastic dynamic model: 8 To simulate dynamic misspecification, we generate predictive densities ignoring the autocorrelation in the conditional mean. Since parameters are now estimated, we consider only alternative 5 from the above list. Rejecting the null hypothesis might be very hard in small samples if the alternative is a t distribution with 8 degrees of freedom. Therefore, we consider an additional alternative with a more substantial deviation from H 0 : The innovations are generated from a multivariate t distribution with 4 degrees of freedom with σ i = 1.0 and ρ ij = 0.5.
In addition, we consider multivariate Gaussian CCC-GARCH(1,1) models with parameters (ω, α, β) = (0.05, 0.1, 0.85) under the alternative. The first assumes no dynamic misspecification in the mean equation (by simulating data from a DGP without dynamics in the unconditional mean), even though we control for it when testing due to the fact that there is dynamic misspecification in the second moments.
The second case involves GARCH effects and dynamic misspecification in the mean: • Alternative 10 (H 10 ): Same as H 7 but with conditional mean dynamics given in To estimate the model parameters, we consider a fixed estimation scheme and set λ = 1/4 to determine the size of the estimation sample R = P/λ. We use the modified Neyman's smooth test as described in Section 2.6.2 for this set of simulations.

Potential for Data Mining
In this section, we present results that show whether considering different permutations of the data can have a serious impact on the outcomes of the tests that are not order invariant. All results here are based on Neyman's smooth test. The idea is the following: a researcher who wants to discredit (support) the hypothesis that a particular model produces good density forecasts could, in principle, search across all permutations and select the one which yields the highest (lowest) test statistic. We present results for H 0 and H 5 based on P = 100 and the assumption that parameters are known; results are similar for other settings and available upon request. The left part of Figure 1 shows how severe data mining can be under the null hypothesis. The solid line indicates the nominal size of 5 % which, as we show below, is obtained when tests are applied properly (meaning that the order of variables is chosen randomly). The other lines refer to the rejection frequencies that we obtain for the tests based on S, CS, and KP , respectively, when we always choose the permutation for which we obtain the highest (lowest) test statistic. At the lower end of obtainable rejection rates, it is clearly possible to virtually never reject the null hypothesis for any dimension. On the other hand, the null hypothesis can be rejected much too frequently if one chooses those permutations that yield high test statistics. For d = 2 the scope for data mining is rather limited, with obtainable rejection rates being around 10 %. However, once the dimension (and consequently the number of possible permutations) increases, obtainable rejection rates increase quickly. They lie above 50% for d = 6 for all transformations considered and reach virtually 100 % for the test based on KP .
In the right part of Figure 1, the solid lines indicate the power that is obtained when the tests are applied properly. The upper (lower) lines show the rejection rates that one obtains when always selecting the highest (lowest) test statistic across all possible permutations. The range of obtainable rejection rates is considerable in all cases. For tests based on CS and KP , the lower line is very close to 0. This means that even though the data are generated from a different DGP, a researcher would be able to purposely select permutations in such a way that H 0 is almost never rejected.  fication. Focusing on the upper panel of the table, we see that none of the approaches suffers from notable size distortions. In all cases, the obtained actual sizes are very close to the nominal size of 5 %. The second panel of the table reveals that tests based on our three new transformations and on S have the best power when the alternative implies deviations of the variances (H 1 ). Tests based on our new transformations outperform the test based on S for large dimensions, as also documented by the performance against H 2 , the more challenging alternative in which we change only a third of the variances. Results for H 3 show that the three new approaches consistently outperform the tests based on established transformations when deviations from H 0 are specified in terms of the correlation structure of the multivariate density. In the case of simultaneously misspecified variances and correlations (H 4 ), the new approaches consistently outperform the tests based on CS or KP . For small samples (P = 50), they also outperform the test based on S; for larger samples tests based on either S or the new transformations quickly approach a power of 1.

Known Parameters and no Dynamic Misspecification
Turning to the power of the different tests for detecting misspecification of the kurtosis (H 5 ), we see that the new approaches outperform all established tests by a wide margin. Especially for P = 50 the results are stunning: the power of the new approaches exceeds that of even the best-performing established approach by a factor of more than two in many cases. Adding wrongly calibrated variances and correlations to the misspecification of the distribution in H 6 leads to a decrease of this outperformance because there is little room for the power of the new approaches to improve while, at the same time, tests based on the old transformations gain a lot of power through these additional deviations from H 0 . However, our new approaches still clearly outperform the established approaches. Finally, the new transformations have also better power against GARCH effects (H 7 ). In light of our previous results, this is expected given that this alternative leads to innovations that are unconditionally distributed with excess kurtosis. The second set of simulations which assumes a multivariate t distribution as the null model yields very similar results with respect to the relative performance of the different transformations; results can be found in the appendix.

Estimated Parameters and Accounting for Dynamic Misspecification
Now, we turn to the case where the model parameters have to be estimated from in-sample data and we account for dynamic misspecification. In general, the results indicate that tests based on all transformations are substantially oversized if one does not adjust Neyman's smooth test (Table 3). Using the adjusted version described in Section 2.6.2, yields correctly sized tests for P ≥ 100. Furthermore, the results indicate that having to deal with estimated parameters and dynamic misspecification results in a considerable loss of power against H 5 . Large evaluation samples seem to be necessary to detect this deviation from the null hypothesis reasonably well (especially for low dimensional densities). Power increases considerably for sample sizes of P ≥ 100 in the case of H 8 which implies a much stronger deviation from H 0 . At the same time, the ranking of the competing tests remains largely unaffected under both alternatives so that the new tests proposed in this paper continue to perform substantially better than established tests. Interestingly, when GARCH effects are present, the test based on S outperforms other tests both when dynamic misspecification is present (H 10 ) and when it is not (H 9 ).
In further simulations that can be found in the appendix, we show that the new tests have the highest power against CCC-t-GARCH models when the null hypothesis is a Gaussian CCC-GARCH model if no dynamic misspecification is present. If, on the other hand, dynamic misspecification is present, tests based on S have the highest power for small P and comparable power to the new tests for larger evaluation samples. Furthermore, we show that tests based on the idea of G-ACRs (discussed in Section 2.6.3) are properly sized and that tests of this kind based on S have the best power against dynamic misspecification while tests based on our new transformations have the best power against distributional misspecifications. Again, results can be found in the appendix.

Predicting the Distribution of Stock Market Returns
In this section, we provide an application of the tests discussed above. The application shows that using tests that are not order invariant offers room for data mining in many  Power against H 1 P = 50 P = 100 P = 200 Power against H 2 P = 50 P = 100 P = 200   Power against H 5 P = 50 P = 100 P = 200 Power against H 6 P = 50 P = 100 P = 200 Power against H 7 P = 50 P = 100 P = 200    Power against H 5 P = 50 P = 100 P = 200  Power against H 9 P = 50 P = 100 P = 200 Power against H 10 P = 50 P = 100 P = 200 situations. We consider the problem of forecasting the joint distribution of five international stock market indices. Our data consist of weekly returns of the MSCI indices for the US, Japan (JA), UK, Australia (AU), and Germany (GE) which we obtained from Datastream. The sample spans the period from January 1971 until October 2013 for a total of 2,232 weekly returns. We consider eight different time periods of four years for which we evaluate density forecasts. These (out-of-sample) evaluation periods are 1981-1984, 1985-1988, 1989-1992, 1993-1996, 1997-2000, 2001-2004, 2005-2008, and 2009-2013. In addition, we evaluate the forecast models over the entire sample from 1981-2013. For each period, the previous ten years are considered as in-sample data to estimate the models of interest. The models are re-estimated for each week using a recursive scheme. Three competing models of increasing complexity are considered: (i) a Gaussian DCC-GARCH model (Engle, 2002), (ii) a time-varying t-copula with DCC-type dynamics and t-GARCH margins, and (iii) a time-varying t-copula with skewed-t-GJR-GARCH margins. For the DCC-GARCH model the marginal models for i = 1, . . . , d are given by The correlation matrix R t of the innovations z t = [z 1,t , . . . , z d,t ] is given by with α c , β c ≥ 0, α c + β c ≤ 1, andQ = E(z t z t ), which, in practice, is estimated using the sample covariance matrix of z t . For the second model, the marginal models are the same as above, with the difference that the innovations z i,t follow a t distribution with ν i degrees of freedom. The dependence between the t-distributed GARCH innovations z t is given by a t-copula with degrees of freedom ν c and correlation matrix R t . For details and properties of the t-copula see, e. g., Joe (2014). The evolution of the correlation matrix is given by (4.1) and (4.2), but with z i,t replaced by T −1 νc (U i,t ) νc−2 νc . Note that this model is slightly more flexible than a DCC-GARCH model based on a multivariate t distribution since the copula approach allows all marginal series to have distinct degrees of freedom. The estimation of the copula-based model is naturally done in two steps, ensuring numerical stability at the price of a small loss in statistical efficiency (Joe, 2005).
The third model is even more flexible by assuming that the GARCH innovations z i,t follow the skewed t distribution of Hansen (1994) and by relying on the GJR-GARCH model of Glosten et al. (1993), for which the conditional variance follows The dependence is again given by the DCC-t-copula model. For each model and each time period, we compute the Rosenblatt PITs and apply the established and new transformations described above. Recall that for non-Gaussian models the distribution of Z 2 * t and Z 2 † t is not known but can be computed numerically as explained in Section 2.5.1. The null hypothesis of correctly predicted densities is then tested with Neyman's smooth test (Neyman, 1937), accounting for parameter estimation and potential dynamic misspecification as explained in Section 2.6.2. We estimate the long-run covariance matrices using a quadratic spectral kernel and the automatic lag selection as proposed in Andrews (1991). In Table 4, we report the p-values based on the different transformations. For those tests which are not order invariant, we consider all 5! = 120 permutations of the data. We report the p-value of a random permutation of the variables (based on the arbitrary order in which we downloaded the data: US, JP, UK, AU, GE) and, in brackets, the lowest and highest p-values across all permutations.
The results are mixed and depend on the time period under study. However, a few things clearly stand out. First, the Gaussian DCC model is rejected by almost all tests for all time periods except the 1997-2000 period. Second, model specifications (ii) and (iii) perform much better, but are still rejected for some periods. Notably, most tests reject these models for the aforementioned 1997-2000 period. This suggests that during that period returns had much lighter tails than during other periods. A shorter insample period may be appropriate to reflect such non-stationarities. Third, the most flexible specification (iii) does not consistently outperform specification (ii), confirming the known fact that model complexity may yield superior in-sample fit, but not necessarily a better forecasting performance. Fourth, when we use the entire sample, all models are rejected. However, given the very large number of observations and the fact that none of the models features time-varying parameters that might help dealing with structural breaks, this is not too surprising.
Finally, the potential for data mining using the tests based on S, CS, and KP , respectively, is immense. For the wide majority of periods one can find permutations that reject and permutations that do not reject the null hypothesis of properly calibrated density forecasts for any of the models. Note, however, that in line with our results from Section 3.2, the range of p-values for tests based on S is, on average, smaller than for  (Neyman, 1937) that accounts for parameter estimation and potential dynamic misspecification as explained in Section 2.6.2. The data are weekly MSCI stock index returns for the US, Japan, UK, Australia and Germany. Forecasts are evaluated for the stated periods and the previous 10 years of data are used as the in-sample period. For transformations which are not order invariant, the numbers in brackets show the lowest and highest obtained p-values across all permutations of the variables; for these transformations, the first p-value is for an arbitrarily selected permutation.
the ones based on CS and KP . Finally, turning to the results for Z 2 which are not order invariant for the non-Gaussian models, one can see that the range of the p-values is very limited and that there is only moderate scope for data mining based on this transformation.
In summary, we recommend evaluating the density forecasts based on Z 2 * and Z 2 † , and possibly based on Z 2 . The results based on the other transformations are not reliable as different permutations can lead to substantially different conclusions regarding the performance of the models. Furthermore, our Monte Carlo simulations show that the new tests are superior in terms of power. Using a 1% significance level, specifications (ii) and (iii) are rejected by the test based on Z 2 * (Z 2 † ) for only 2 (2) and 3 (2) sub-samples, respectively. When using a Bonferroni correction to address the fact that this is a case of multiple testing, specification (ii) is only rejected for the 1993-1996 period (based on both Z 2 * and Z 2 † ) and for the 1997-2000 period (based on Z 2 * ). 9 Thus, overall, the t-GARCH model with a time-varying t-copula can be recommended for modeling and predicting the joint density of weekly stock market returns.

Conclusion
In this paper we derive order-invariant tests for proper calibration of multivariate density forecasts of arbitrary dimension. We demonstrate that distortions in rejection rates can be very large when established tests, which are not order invariant, are used for data mining. Furthermore, we show that the new tests have very good power against a wide range of deviations from the null hypothesis; this holds true, in particular, when the data exhibit fat tails that are not taken into account by the null model. We do not find that one of our new tests dominates the others in terms of power regardless of the alternatives. Our Monte Carlo simulations indicate that tests based on Z 2 * have slightly better power against misspecification of the variance and in the presence of fat tails while tests based on Z 2 † have slightly better power against misspecification of the correlation structure. Therefore, we recommend using simultaneously the tests based on Z 2 * and Z 2 † whenever there is no strong prior about the nature of potential deviations from the specified null model. We want to stress again that our approach, which essentially relies on transforming the multivariate problem to a univariate one, is compatible with any method for testing univariate distributions. We recommend using the powerful Neyman smooth test and show how it can be adjusted to account for parameter uncertainty and dynamic misspecification. If the aim is to test the joint hypothesis of completely calibrated distributions, G-ACR-based tests seem to work well.
We believe there is a wide range of other applications in various fields. First, the proposed methods are useful whenever properly calibrated density forecasts are crucial to form well-informed decisions (about production, investment, pricing, etc.) and will foster the use of multivariate density forecasts in situations in which decisions are based on loss functions that take more than one variable as input arguments. Our tests could, for instance, be used to assess the overall forecast performance of macroeconomic DSGE models used at central banks. Second, the proposed methods are useful to improve the specification of multivariate models taking higher moments into account; obvious applications of this kind are common in financial econometrics, e. g., for estimating the Value-at-Risk of a portfolio, but it can be expected that the modeling of the dependence structure of higher moments of multivariate data becomes more common also for demand management or in macroeconomics.
Our study leaves room for future research along several dimensions. First, especially for financial applications, it would be interesting to extend the analytical results of our paper which are limited to the case of multivariate Gaussian processes under the null hypothesis to more general settings. Second, we believe it may be possible to develop tests with even better power for very high-dimensional densities; this could be achieved by selecting the terms entering the Z 2 * transformation in a data-driven way or by assigning weights to the conditional PITs entering the transformations that are based on the relevant loss function. Finally, it might be worthwhile to investigate whether powerful order invariant tests can be constructed that are not based on the Rosenblatt transformation.

APPENDIX Appendix A Proofs and Derivations
Proof of Proposition 1. Under independence, we have U i|1:i−1 t = U i t , i. e., the conditional CDF is equal to the marginal CDF. In this case, the product transformation reduces to P t,d = d i=1 U i t . This is clearly robust to permutations. The same argument can be made for the location-adjusted version P * t,d . The stacked transformation then becomes S t = [U 1 t , . . . , U d t ] , which again is obviously order invariant. Now consider the following two permutations: π 1 = (1, 2, 3, . . . , d) and π 2 = (2, 1, 3, . . . , d). For these permutations, the product transformations only differ in their first two components. So w.l.g., we only check that independence is needed for to hold. The latter equality is equivalent to for all t, which does not hold in general, unless we have independence.
For these two permutations order invariance in S t is given only if t ] for all t, which again only holds under independence.
Proof of Proposition 2. W.l.g. let µ = 0, which can be achieved by demeaning the original data. Rewrite Y t as with Z i,t normally distributed. Writing this more compactly we obtain is the matrix of population regression coefficients, whose precise form in terms of the covariance matrix is directly available using standard results on conditional normal random variables. It holds that Consequently, The last term is clearly invariant to the order of the variables.
Derivation of the distribution of CS t,d for arbitrary d. This can be shown by induction. To simplify notations we drop the time subscript and replace the conditional PITs by a sequence of independent U(0, 1) random variables U 1 , U 2 , . . ..
Step 1 (d = 2): For d = 2 the density is given by which is equal to the density derived in Clements and Smith (2000). Note that we could also start at d = 1, for which the density is equal to 1, corresponding to the uniform distribution.
Step 2 (d → d + 1): Consider the change of variables The determinant of the Jacobian for the inverse transformation is The joint density of CS d+1 and U d+1 is To show that the CDF is correct first note that It follows that Derivation of the distribution of KP t,d for arbitrary d. Again, this can be shown by induction and again for simplicity we consider a sequence of independent U(0, 1) random variables U 1 , U 2 , . . ..
Step 1 (d = 2): Consider the change of variables The determinant of the Jacobian is J = 1 U * 2 , so the joint density of KP 2 and U * 2 is given by Integrating out U * 2 gives where the second equality follows from the symmetry around 0 and the fact that |2KP 2 | < |U * 2 | < 1/2.
Step 2 (d → d + 1): Consider the following change of variables The determinant of the Jacobian is and therefore the joint density of KP d+1 and U * d+1 is Again the symmetry around 0 and the fact that |2 d KP d+1 | < |U * d+1 | < 1/2 was used.
Now consider the CDF. Note that Then using the product rule The addition of 1/2 ensures that the CDF lies between 0 and 1.

Proof of Proposition 3. Consider the generic term
stands for a set of indices representing the conditioning variables. Under normality, these terms are also jointly normally distributed. Then the fact that Z * 2 t has a mixture of independent χ 2 1 random variables follows directly from Lemma 17.1 in van der Vaart (1998). The weights of the mixture are given by the eigenvalues of the covariance matrix of the terms Φ −1 U i|γ k i t for all i = 1, . . . , d and k = 1, . . . , 2 d−1 . This matrix is actually a correlation matrix due to the unit variance of the inverse normal transformation.
To compute this correlation matrix, we start with the covariance between Y t i|γ k i and Y t j|γ l j . Then, dropping the time index, Y i conditional on the vector Y γ k i is Then the computation of the covariance/correlation is straightforward. The reduced rank of R Z * follows from the fact that all conditional variables Y t i|γ k i are a linear combination of the original d variables.

B.1 Multivariate t Distribution under the Null Hypothesis
In addition to the results in Section 3.1.1, we also investigate the test performances when Y t follows a multivariate t distribution with 5 degrees of freedom under the null. We modify the alternatives 1-7 such that the innovations also follow a multivariate t distribution in the case of alternatives 1-4 and 7 (keeping variances and correlations as in the simulations with the Gaussian null model) and a multivariate normal distribution in the case of alternatives 5 and 6. Results are given in Table B.1 and confirm the relative performance that we find in our baseline simulations.

B.2 GARCH Dynamics under the Null Hypothesis
In addition to the results in Section 3.1.2, we also investigate the test performances when the innovations under the null follow a Gaussian CCC-GARCH(1,1) process while they are generated by CCC-t-GARCH(1,1) models with 8 and 4 degrees of freedom under the alternatives: • Alternative 11 (H 11 ): The innovations are generated from a multivariate CCCt-GARCH(1,1) model with 8 degrees of freedom.
The unconditional variances and correlations are the same as in the baseline simulations. We consider one set of simulations where dynamic misspecification in the conditional mean is present and one set of simulations where it is not. Since we need more data to reliably estimate the GARCH models and want to hold λ = 1/4 fix, we also consider P = 500 in these simulations. Again, we use a fixed estimation scheme and the modified smooth test.  Power against H 1 P = 50 P = 100 P = 200 Power against H 2 P = 50 P = 100 P = 200  Power against H 4 P = 50 P = 100 P = 200 Power against H 5 P = 50 P = 100 P = 200 Power against H 6 P = 50 P = 100 P = 200   Power against H 11 P = 50 P = 200 P = 500   Power against H 12 P = 50 P = 200 P = 500 Notes: Rejection frequencies of Neyman's smooth test based on the transformations introduced in Sections 2.3 and 2.4 for the null hypothesis of a Gaussian CCC-GARCH(1,1). The alternatives are CCC-t-GARCH(1,1) models with 8 (H 11 ) and 4 (H 11 ) degrees of freedom, respectively. All Monte Carlo simulations are based on 10,000 iterations.

B.3 Jointly testing for Multivariate Normality and Independence
For this set of simulations, we use similar DGPs as in Section 3.1.2. What is different is the formulation of the null hypothesis and that we rely on G-ACRs to implement a joint test of the combined null hypothesis (see Section 2.6.3). The null model is the multivariate normal distribution given by equation (3.1). We consider the following alternatives: • Alternative 13 (H 13 ): The data is generated according to equation (3.2). To simulate dynamic misspecication, we generate predictive densities ignoring the autocorrelation in the conditional mean.
• Alternative 14 (H 14 ): The data is generated by a static model with innovations following a multivariate t distribution with 8 degrees of freedom.
• Alternative 15 (H 15 ): The data is generated according to equation (3.2) but with innovations following a multivariate t distribution with 8 degrees of freedom.
To simulate dynamic misspecication, we generate predictive densities ignoring the autocorrelation in the conditional mean.
• Alternative 16 (H 16 ): The data is generated according to equation (3.2) but with innovations following a multivariate t distribution with 4 degrees of freedom.
To simulate dynamic misspecication, we generate predictive densities ignoring the autocorrelation in the conditional mean.
Like above, we consider a fixed estimation scheme with λ = 1/4 and evaluation samples of size P={50,100,200}. Results are given in Table B.3. All tests are properly sized. If only distributional deviations from the null model but no dynamic misspecification is present (H 14 ) the new transformations clearly outperform the established ones also with the G-ACR approach. When both features are present under the alternative, the new test and tests based on S clearly outperform the other two tests. The ranking of the former depend on whether the distributional deviation from the null is moderate (H 15 , power is highest for S) or strong (H 16 , power is highest for the new tests). The new tests are clearly dominated by tests based on S and interestingly also CS when the only deviation from the null hypothesis is given by dynamic misspecification (H 13 ); hence in this-rather unlikely-case there is apparently a price that one has to pay to make tests order invariant.  Power against H 13 P = 50 P = 100 P = 200 Power against H 14 P = 50 P = 100 P = 200 Notes: Rejection frequencies for joint tests of uniform distribution and independence of U W t using the G-ACR approach described in Section 2.6.3. All Monte Carlo simulations are based on 10,000 iterations. H 13 involves only dynamic misspecification. H 14 involves only a misspecification of the distribution. H 15 and H 16 involve dynamic misspecification plus a deviation from the null distribution.