##### 2.2.1. CART Basics

[14] For many kinds of predictive tasks, one seeks to establish a functional dependence *d* between predictors from a (multidimensional) predictor domain *X* and a predictand from a (unidimensional) predictand space *Y*, *d*: *X* → *Y*. In our case, *X* has 8 dimensions and corresponds to the phase space spanned by the predictors (2 + 2 SLP statistics, 3 WBAL averages, and elevation), and *Y* corresponds to the unidimensional phase space of the persistent hot days. Multilinear regression analysis is often used to model such a dependence *d*. However, it can be advantageous to define *d* piecewise, that is, to represent the dependence by several models defined over different subdomains of *X*. See, for example, *Schomburg et al.* [2010] for an application of threshold-based rules for downscaling RCM output to spatial scales of less than a kilometer. CARTs also belong to this class of approaches, being especially suitable where piecewise constant models over disjoint subdomains of *X* are appropriate.

[15] An exemplary CART outcome is illustrated in Figure 1a for two predictors (spatial patterns of elevation and time-averaged water balance), which explain the predictand “pattern of heat waves.” In this example, strong heat waves occur only in regions of low elevation and mostly in those low-elevation regions with a negative water balance. As shown, CARTs provide disjoint rectangular subdomains *L*_{i} ⊆ *X* (hereafter called “leaves,” see below) and predict one common value of *y* for each *x* within a given *L*_{i}.

[16] For a given training sample (*x*, *y*)_{k}, *k* = 1 … *N* (where in our case *k* is the index of the grid points), CARTs iteratively construct the set of *L*_{i} ⊆ *X* which minimizes the sum of squared residuals,

$$\sum_{i} \sum_{x_k \in L_i} \left( y_k - d(x_k) \right)^2, \qquad (1)$$

although this minimum is possibly a local one (see below). *d* is piecewise constant over the *L*_{i}, simply consisting of the averages of the *y*_{k} values associated with the *x*_{k} within the *L*_{i} subdomains (or leaves),

$$d(x) = \frac{1}{\#(x_k \in L_i)} \sum_{x_k \in L_i} y_k \quad \text{for } x \in L_i, \qquad (2)$$

where #(*x*_{k} ∈ *L*_{i}) denotes the number of *x*_{k} within *L*_{i}. In contrast to correlation and multilinear regression analysis, CARTs do not attempt to find one model that describes the dependence between *x* and *y* over the entire domain *X*, which enables them to adapt naturally to nonlinearities and interactions in the predictors.
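The piecewise-constant prediction and the residual sum it minimizes can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation (rpart); leaf membership is given by hand, and all names and values are hypothetical.

```python
# Minimal sketch of the piecewise-constant CART prediction (the leaf average,
# equation (2)) and the sum of squared residuals it minimizes (equation (1)).
# Leaf membership is assigned by hand here; values are purely illustrative.

def leaf_mean(ys):
    """Predicted value for a leaf: the average of its y_k values."""
    return sum(ys) / len(ys)

def rss(leaves):
    """Sum of squared residuals over all leaves."""
    total = 0.0
    for ys in leaves:
        m = leaf_mean(ys)
        total += sum((y - m) ** 2 for y in ys)
    return total

# Two hypothetical leaves with their associated predictand values y_k:
leaves = [[1.0, 2.0, 3.0], [10.0, 12.0]]
print([leaf_mean(ys) for ys in leaves])  # [2.0, 11.0]
print(rss(leaves))                       # 2.0 + 2.0 = 4.0
```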

[17] As seen in Figure 1a, the *L*_{i} leaves are determined by binary splits *S*_{l} = (*p*_{l}, *v*_{l}) in single predictors, which define which predictor *p* splits a (sub)domain ⊆ *X* and at which value *v* it does so. In Figure 1a they are represented by the dashed separating lines. Splits and subdomains/leaves can be conveniently represented in a tree-like structure (see Figure 1b), where the branches of the tree represent the left or right sides of the *S*_{l} splits and the leaves at the very ends of the branches correspond to the *L*_{i} subdomains. The tree-like representation is especially suitable for high-dimensional predictor domains (a representation such as the one in Figure 1a is obviously limited to the two-dimensional case) and can assume any shape, not necessarily symmetric as in Figure 1b.

[18] The CART algorithm starts by screening possible binary splits of all predictors (in terms of a “greedy search” [see *Breiman et al.*, 1993]) and chooses the one that minimizes equation (1). The splitting is recursively repeated for each of the subdomains thus identified, until preset parameters, such as a required minimum number of observations per subdomain, prevent further splitting. To avoid overfitting the data, the size of a tree (that is, the number of subdomains, or leaves, *L*_{i}) is reduced (“pruned”) after growing, usually based on cross validation. For growing and pruning the trees, we use the rpart package, available for R (a statistical computing environment) [*R Development Core Team*, 2008].
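One step of the greedy search can be sketched as follows: try every predictor and candidate threshold, and keep the split that minimizes the summed squared residuals of the two resulting subdomains. This Python toy is a simplified stand-in for rpart, which additionally handles stopping rules and cross-validation pruning; the sample values are invented.

```python
# Sketch of one step of the greedy split search: screen all predictors p and
# candidate thresholds v, keeping the split (p, v) that minimizes the summed
# squared residuals of the two halves. Toy data; not the paper's implementation.

def sse(ys):
    """Squared deviations from the mean of ys (one leaf's contribution)."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(xs, ys):
    """xs: predictor vectors, ys: predictand values; returns (score, p, v)."""
    best = None
    for p in range(len(xs[0])):
        for v in sorted({x[p] for x in xs}):
            left = [y for x, y in zip(xs, ys) if x[p] <= v]
            right = [y for x, y in zip(xs, ys) if x[p] > v]
            if not left or not right:
                continue  # degenerate split, nothing to compare
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, p, v)
    return best

# Toy sample: y jumps where predictor 0 crosses 0.5, so the greedy search
# should split on predictor 0 at v = 0.2.
xs = [(0.1, 3.0), (0.2, 1.0), (0.8, 2.0), (0.9, 0.0)]
ys = [1.0, 1.2, 5.0, 5.2]
print(best_split(xs, ys))  # best split: predictor 0 at v = 0.2
```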

[19] In contrast to standard multilinear regression analysis, CARTs take both interactions and nonlinear relations in the data into account in a natural way; see, for example, the (hypothetical) amplifying effect of low elevation combined with negative water balances on heat waves in Figure 1.

[20] The fact that CARTs define the splits by screening the predictors separately can cause problems if predictors are correlated, making the predictor choice for a given split unstable. Ideally, therefore, CARTs are applied to independent predictors. In our case, however, some of the predictor patterns are quite similar to each other, as reflected, for example, in the high correlations between WBAL_{ANN} and WBAL_{MAM} in Table A1. Another source of unstable splits is that CARTs often detect only secondary minima of equation (1), as a consequence of the “greedy search” approach. Bootstrapping can be used to assess the robustness of a tree in such cases.

##### 2.2.2. Assessing the Robustness of CARTs

[21] Since CARTs are inherently unstable, an assessment of their robustness can be necessary. We are primarily interested in their structural robustness (see below) and evaluate it as follows. For each individual RCM simulation, 50 bootstrap trees are grown, each based on a random subsample consisting of 66% of the original (*x*, *y*)_{k} data. If the structures of these 50 trees turn out to be similar, we assume that the tree gives a robust representation of the relationships within the complete (*x*, *y*)_{k} sample. As described in the following, structural similarity is evaluated by analyzing the predictors chosen for the *S*_{l} splits, which define the *L*_{i} leaves.
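The subsampling scheme just described can be sketched in Python. The `grow_tree` step is omitted here (it stands in for the actual CART fitting with rpart); only the hypothetical 66% subsampling loop is shown.

```python
# Sketch of the bootstrap scheme: draw 50 random subsamples, each holding
# 66% of the (x, y)_k pairs, on which the bootstrap trees would be grown.
# The tree-fitting step itself (rpart in the paper) is not reproduced here.
import random

def subsample(pairs, fraction=0.66, rng=random):
    """Random subsample without replacement, as used for the bootstrap trees."""
    n = max(1, int(round(fraction * len(pairs))))
    return rng.sample(pairs, n)

pairs = [((i * 0.1,), float(i)) for i in range(100)]  # toy (x, y)_k sample
random.seed(0)                                        # reproducible draws
samples = [subsample(pairs) for _ in range(50)]       # 50 bootstrap samples
print(len(samples), len(samples[0]))                  # 50 66
```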

[22] Formally, the *L*_{i} leaves can be obtained by a concatenation of left and right operators, which return the subdomain of any domain *D* ⊆ *X* that lies left (right) of a split *S*_{l}, that is,

$$\mathrm{left}(S_l, D) = \{x \in D \mid x_{p_l} \le v_l\}, \qquad \mathrm{right}(S_l, D) = \{x \in D \mid x_{p_l} > v_l\}.$$

[23] For example, *L*_{3} in Figure 1b can be written as *L*_{3} = left(*S*_{3}, right(*S*_{1}, *X*)).
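The operator concatenation can be sketched by representing each (sub)domain as a membership predicate on *x*. The split positions and variable roles below are invented for illustration and do not reproduce Figure 1b.

```python
# Sketch of the left/right operators: each split S_l = (p_l, v_l) halves a
# (sub)domain, and a leaf is a concatenation of such operators. Domains are
# modelled as predicates on x; the splits S1, S3 below are hypothetical.

def left(split, domain):
    p, v = split
    return lambda x: domain(x) and x[p] <= v

def right(split, domain):
    p, v = split
    return lambda x: domain(x) and x[p] > v

X = lambda x: True    # the full predictor domain
S1 = (0, 500.0)       # hypothetical split on predictor 0 at 500
S3 = (1, 0.0)         # hypothetical split on predictor 1 at 0

# A leaf written as L3 = left(S3, right(S1, X)), as in the example above:
L3 = left(S3, right(S1, X))
print(L3((800.0, -0.5)))  # True: right of S1 and left of S3
print(L3((200.0, -0.5)))  # False: fails the right(S1, .) condition
```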

[24] If two of the bootstrap trees have the same number of *L*_{i} leaves, which are additionally sorted in ascending order of *y*, then a direct pairwise comparison of the leaves from the two trees is possible. A measure of structural similarity for such a pair of leaves is the fraction of agreeing predictors chosen for their splits. That is, for each leaf of the two trees, the *S*_{l} splits in the concatenation of left and right operators defining the respective leaf are analyzed and the *p*_{l} predictors from the involved *S*_{l} splits are extracted. In the example of *L*_{3} in Figure 1b, this would yield the set {elevation, water balance}, where elevation is from *S*_{1} and water balance from *S*_{3}. If in one of the bootstrap trees *L*_{3} happened to be defined by elevation only, the intersection of these two predictor sets ({elevation, water balance} and {elevation}) would be the set {elevation}. We define the similarity of the two leaves as the number of distinct predictors in the intersection set divided by the average number of distinct predictors in the two individual predictor sets. In the example, this would be 1/((2 + 1)/2) = 2/3. A value of 1 of this similarity measure indicates total agreement, while a value of 0 indicates total disagreement. The average of these values over all leaf pairs characterizes the similarity of the two bootstrap trees. Perfect structural similarity in this sense would be given if the predictors defining the leaves of the two trees agreed for each leaf. This assessment requires two properties of the compared trees: first, the number of leaves has to be the same, and second, the leaves within each compared pair must correspond to each other. This is detailed in section 2.2.3.
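The similarity measure for one leaf pair reduces to a short set computation, sketched here in Python with the worked example from the text.

```python
# Sketch of the leaf-pair similarity measure: the number of distinct
# predictors shared by the two leaves' split sets, divided by the average
# size of the two sets. Predictor names are those of the text's example.

def leaf_similarity(preds_a, preds_b):
    a, b = set(preds_a), set(preds_b)
    shared = len(a & b)                  # predictors in the intersection set
    avg_size = (len(a) + len(b)) / 2.0   # average number of distinct predictors
    return shared / avg_size

# The worked example: {elevation, water balance} vs. {elevation}
s = leaf_similarity({"elevation", "water balance"}, {"elevation"})
print(s)  # 1 / ((2 + 1) / 2) = 2/3
```

Averaging this value over all leaf pairs of two trees then yields their tree-level similarity.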

[25] This metric is rudimentary, but it captures the predictor choices which characterize the extreme event patterns within the trees, which is one of our main interests here. More sophisticated metrics could, for example, include the distance between the *v*_{l} values in the corresponding *S*_{l} splits of two trees, the order of the chosen predictors from the “trunk” to the leaves, or the distances of the splits from the trunk. Such extensions are left to future studies.

[26] We apply this framework to each RCM simulation and compare the structural similarities found between the bootstrap trees of one single RCM to the structural similarities between the trees of the different RCM simulations. If the similarities between the bootstrap trees are high, and in particular, are higher than the inter-RCM similarities, this supports the robustness of the structures identified for each RCM. See section 3.1 for the results of this assessment.

##### 2.2.3. CARTs in Our Study

[27] Here we grow an individual CART for each RCM simulation (and the ERA40 reanalysis). As a result of the cross validation pruning, not all of the trees will have the same number of leaves. To enable their pairwise comparability, they are therefore first pruned further, down to the minimum number of leaves found across all trees. Note that this pruning may lead to an overgeneralization for certain trees; however, it is needed for their intercomparability. The *L*_{i} leaves are then reordered in ascending order of *y* to establish a direct pairwise correspondence between the *L*_{i} leaves of different trees. This pruning and reordering is applied both to the trees from the RCM ensembles and to the bootstrap tree ensembles used to evaluate their robustness.
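The alignment step can be sketched as follows. Leaves are modelled simply as (predicted *y*, predictor set) pairs, and truncation stands in for the actual cost-complexity pruning of a fitted rpart tree, which is more involved; all values are invented.

```python
# Sketch of making trees comparable: prune each tree to the minimum number
# of leaves found across the ensemble, then sort leaves in ascending order
# of their predicted y. Leaves are modelled as (mean_y, predictor_set) pairs;
# truncation is a crude stand-in for the real pruning of a fitted tree.

def align(trees):
    n_min = min(len(t) for t in trees)
    aligned = []
    for t in trees:
        pruned = t[:n_min]                               # stand-in for pruning
        aligned.append(sorted(pruned, key=lambda leaf: leaf[0]))
    return aligned

# Two hypothetical trees with differing numbers of leaves:
tree_a = [(3.0, {"elevation"}), (1.0, {"elevation", "water balance"})]
tree_b = [(2.5, {"elevation"}), (0.5, {"water balance"}), (4.0, {"elevation"})]
for t in align([tree_a, tree_b]):
    print(t)  # both trees now have 2 leaves, sorted by predicted y
```

After this step, the *i*-th leaf of one tree can be compared directly with the *i*-th leaf of any other.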

[28] As already mentioned, our main interest lies in the partition of the predictor domain that is defined by a CART and the relationships between the predictors and the extreme event contained in it. In order to capture the qualitative aspects of these relationships (such as “long persistent hot days concur with pronounced water balance deficits”), we express the partition in terms of thresholds based on quantiles, which can thereby differ between the models. Thus, in our evaluation, attributes like “long” or “pronounced” always refer to the predictor distributions of the individual models, which makes the evaluation independent of possible model biases and facilitates a consistent qualitative synthesis of the relations in the models.
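Translating a split value into a quantile of a model's own predictor distribution can be sketched in a few lines; the sample values below are purely illustrative and not taken from any of the RCMs.

```python
# Sketch of expressing a split value as a quantile of the model's own
# predictor distribution, which keeps qualitative labels such as
# "pronounced deficit" independent of model biases. Toy values only.

def empirical_quantile(values, v):
    """Fraction of sample values less than or equal to v."""
    return sum(1 for x in values if x <= v) / len(values)

wbal = [-3.0, -2.0, -1.5, -0.5, 0.0, 0.5, 1.0, 2.0]  # hypothetical WBAL sample
split_value = -1.5
q = empirical_quantile(wbal, split_value)
print(q)  # 3/8 = 0.375: the split lies in the lower ~40% of this distribution
```

Two models with different absolute WBAL values can thus still agree that a split separates, say, the driest 40% of grid points.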

[29] Note that even the most robust CART, like any statistical method, cannot detect the causality of relations between different variables [see, e.g., *Orlowsky and Seneviratne*, 2010]. Nor can CARTs evaluate true physical relations if these are misrepresented in the RCMs. Physical interpretations of CART outcomes therefore always reflect the relations within the model world, as with any other multimodel output analysis (e.g., one based on ensemble averages).