Semi-supervised multivariate regression trees: putting the ‘circulation’ back into a ‘circulation-to-environment’ synoptic classifier

Authors

  • Alex J. Cannon

    Corresponding author
    1. Meteorological Service of Canada, Environment Canada, Pacific & Yukon Region, 201-401 Burrard Street, Vancouver, BC, V6C 3S9, Canada

Abstract

Multivariate regression trees (MRTs) have been used in synoptic climatology to construct “circulation-to-environment” synoptic classifications. Because the goal of an MRT is to maximize discrimination of the environmental predictand variables, performance in terms of the synoptic-scale circulation predictors is typically sacrificed. This paper introduces a semi-supervised approach in which a weighted combination of synoptic-scale predictors and environmental variables serves as the predictands in an MRT. Results for southern British Columbia, Canada, indicate that (1) a semi-supervised MRT can outperform a fully supervised MRT in terms of discrimination of the synoptic-scale circulation without sacrificing performance in terms of the surface environment; (2) weighting allows the synoptic classifier to behave as a fully unsupervised model, a fully supervised model, or intermediate between the two ends of the spectrum; and (3) the optimum trade-off between circulation and environment must be chosen by the user depending on specific needs. © 2011 Crown in the right of Canada. Published by John Wiley & Sons Ltd.

1. Introduction

In synoptic climatology, the ‘circulation-to-environment’ approach to synoptic classification is used to investigate relationships between synoptic-scale atmospheric circulation conditions and the local-scale surface environment (Barry and Perry, 1973; Harman and Winkler, 1991; Yarnal, 1993; Yarnal et al., 2001). Typically, this involves applying some form of unsupervised clustering algorithm to the synoptic-scale circulation data and then relating the resulting classes to the surface environment data.

Recently, supervised approaches, ones in which information about the surface environment guides development of the synoptic classifier, have been developed as a means of better resolving linkages between the two scales. As examples, Hughes et al. (1993) and Zorita et al. (1995) used recursive partitioning trees, also known as classification and regression trees (Breiman et al., 1984), to develop synoptic classification systems relating gridded atmospheric circulation data to station precipitation. Cannon et al. (2002a, 2002b) formalized the use of recursive partitioning in synoptic climatology, including a general extension to multiple environmental variables via multivariate regression trees (MRTs) (De'ath, 2002). Unlike standard divisive clustering algorithms in which a single dataset (i.e. the synoptic-scale circulation) is split into increasingly similar groups (Kašpar and Müller, 2010), splits in the MRT are defined by a predictor dataset (i.e. the synoptic-scale circulation) such that a separate predictand dataset (i.e. the surface environment) is partitioned into increasingly similar groups. In the former, the algorithm is unsupervised, whereas in the latter, it is supervised.

Because an MRT provides an explicit link between the synoptic-scale circulation and the environmental response, atmospheric circulation patterns relevant to local environmental conditions are likely to be selected. However, because the goal is to maximize discrimination of the environmental predictand variables, performance in terms of the synoptic-scale circulation variables will be sacrificed. In other words, the synoptic classes will tend to exhibit more internal synoptic-scale variability than those from an unsupervised classification approach applied to the same predictors. How can a compromise between performance in terms of ‘circulation’ and ‘environment’ be reached?

As a potential solution, this paper introduces a semi-supervised approach to synoptic classification in which a weighted combination of synoptic-scale predictors and environmental variables serves as the predictands in an MRT. The weighting determines the relative influence of the synoptic-scale dataset in guiding the model. The approach is demonstrated on data from the south coast of British Columbia, Canada.

2. Multivariate regression tree

An MRT is structured as a binary tree with nodes defined by simple decision rules applied to predictors X of dimension N × I, where N is the number of cases and I is the number of predictor variables. All cases start out assigned to a single node. Cases in this top-level node are divided into two groups by a decision rule defined by one of the predictors. Depending on the outcome of the decision, cases follow one of the two branches from the node. New decision rules are created at nodes by a splitting algorithm until one or more stopping criteria are met. Cases are assigned to classes based on the terminal (unsplit) nodes they reach in the tree.

A splitting algorithm selects the decision variables and their thresholds. At each step, it is responsible for determining the decision rule that will best partition the remaining cases into classes that are as homogeneous as possible. Homogeneity is measured with respect to a set of predictands Y of dimension N × J, where J is the number of predictand variables. This requires a quantitative measure of error for nodes in the tree to be defined. Each node is characterized by the centroid of its assigned cases, where the centroid for a specified variable is defined as

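$$\bar{y}_{j}^{(k)} = \frac{1}{N_k} \sum_{t=1}^{N_k} y_{j}^{(k)}(t) \tag{1}$$

where $y_{j}^{(k)}(t)$ is the value of the jth of J predictand variables for the tth of $N_k$ cases assigned to the kth of K nodes. The error measure for the kth node is then given by the within-cluster sums of squared deviations from the centroid

$$\mathrm{WSS}_k = \sum_{j=1}^{J} \sum_{t=1}^{N_k} \left( y_{j}^{(k)}(t) - \bar{y}_{j}^{(k)} \right)^2 \tag{2}$$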

The overall error for a tree is calculated by summing the $\mathrm{WSS}_k$ values over the K terminal nodes

$$\mathrm{WSS} = \sum_{k=1}^{K} \mathrm{WSS}_k \tag{3}$$
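The centroid and error definitions above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names (`centroid`, `node_wss`, `tree_wss`) and the toy data are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Sequence


def centroid(rows: Sequence[Sequence[float]]) -> List[float]:
    """Mean of each predictand variable over the cases in one node (Equation 1)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def node_wss(rows: Sequence[Sequence[float]]) -> float:
    """Sum of squared deviations from the node centroid (Equation 2)."""
    c = centroid(rows)
    return sum((v - m) ** 2 for row in rows for v, m in zip(row, c))


def tree_wss(Y: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Node errors summed over all terminal nodes (Equation 3)."""
    nodes: Dict[int, List[Sequence[float]]] = {}
    for row, k in zip(Y, labels):
        nodes.setdefault(k, []).append(row)
    return sum(node_wss(rows) for rows in nodes.values())


# Toy predictands (hypothetical): N = 4 cases, J = 2 variables, K = 2 terminal nodes.
Y = [[1.0, 2.0], [3.0, 2.0], [10.0, 0.0], [14.0, 4.0]]
labels = [0, 0, 1, 1]

print(centroid(Y[:2]))      # centroid of the first node
print(tree_wss(Y, labels))  # overall error of the two-node tree
```

With this toy data the first node has centroid (2, 2) and an error of 2, the second has centroid (12, 2) and an error of 16, so the overall error is 18.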

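To build the tree, splits are chosen to maximize the decrease in error between the existing parent node, $\mathrm{WSS}_A$, and the new child nodes, $\mathrm{WSS}_B$ and $\mathrm{WSS}_C$

$$\Delta \mathrm{WSS} = \mathrm{WSS}_A - \left( \mathrm{WSS}_B + \mathrm{WSS}_C \right) \tag{4}$$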

The search for the best split is exhaustive, considering each value of the I predictor variables as a potential threshold. Data are split until each terminal node contains a specified minimum number of cases, until no further splits can be made because WSS has been minimized, or until a desired level of complexity has been reached. Once a tree has been built, the proportion of explained predictand variance EV can be calculated as
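As a rough sketch of the exhaustive search just described, the following plain-Python function evaluates every observed value of every predictor as a candidate threshold for a single node. The names `wss` and `best_split`, the `min_cases` argument, and the "less than or equal" form of the decision rule are assumptions made for illustration; the paper does not prescribe an implementation.

```python
from typing import Optional, Sequence, Tuple


def wss(rows: Sequence[Sequence[float]]) -> float:
    """Sum of squared deviations of the rows from their column means."""
    if not rows:
        return 0.0
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return sum((v - m) ** 2 for row in rows for v, m in zip(row, means))


def best_split(
    X: Sequence[Sequence[float]],
    Y: Sequence[Sequence[float]],
    min_cases: int = 1,
) -> Optional[Tuple[int, float, float]]:
    """Return (predictor index, threshold, decrease in WSS) for one node,
    or None when no admissible split exists.

    Assumed rule: cases with X[i] <= threshold go to child B, the rest to child C.
    """
    wss_parent = wss(Y)
    best = None
    for i in range(len(X[0])):  # every predictor variable
        # Every observed value except the largest, which would leave child C empty.
        for threshold in sorted({row[i] for row in X})[:-1]:
            child_b = [y for x, y in zip(X, Y) if x[i] <= threshold]
            child_c = [y for x, y in zip(X, Y) if x[i] > threshold]
            if len(child_b) < min_cases or len(child_c) < min_cases:
                continue
            decrease = wss_parent - (wss(child_b) + wss(child_c))  # Equation 4
            if best is None or decrease > best[2]:
                best = (i, threshold, decrease)
    return best


# Toy data (hypothetical): the second predictor separates the predictand perfectly.
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 6.0], [4.0, 2.0]]
Y = [[0.0], [10.0], [0.0], [10.0]]
print(best_split(X, Y))
```

Applying `best_split` recursively to the resulting child nodes, until a stopping criterion is met, would grow the full tree. The brute-force recomputation of the error for every candidate is chosen here for readability, not speed.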

$$\mathrm{EV} = 1 - \frac{\mathrm{WSS}}{\mathrm{WSS}_T} \tag{5}$$

where $\mathrm{WSS}_T$ is the total within-cluster sums of squared deviations for the top-level node. New cases can be assigned to classes using the splits defined at the decision nodes. As mentioned above, each node is characterized by the centroid of cases assigned to it during fitting. These values serve as predictions for new cases, so that predictive error, EV, etc. can be estimated for the model.
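The assignment of new cases and the calculation of EV might look as follows. The nested-dictionary tree layout, the key names (`var`, `thr`, `low`, `high`, `label`, `centroid`), and the toy values are all hypothetical. For new cases, the total sum of squares is taken here about the mean of the new data, which is an assumption; the text does not specify this detail.

```python
from typing import Sequence


def assign(tree: dict, x: Sequence[float]) -> dict:
    """Follow the decision rules from the top-level node to a terminal node."""
    node = tree
    while "centroid" not in node:  # decision nodes carry a rule, not a centroid
        node = node["low"] if x[node["var"]] <= node["thr"] else node["high"]
    return node


def explained_variance(
    Y: Sequence[Sequence[float]], predicted: Sequence[Sequence[float]]
) -> float:
    """EV = 1 - WSS / WSS_T, with node centroids as predictions (Equation 5)."""
    n = len(Y)
    means = [sum(col) / n for col in zip(*Y)]
    wss_total = sum((v - m) ** 2 for row in Y for v, m in zip(row, means))
    wss = sum(
        (v - p) ** 2 for row, pred in zip(Y, predicted) for v, p in zip(row, pred)
    )
    return 1.0 - wss / wss_total


# Hypothetical fitted tree: one decision node and two terminal nodes (classes).
tree = {
    "var": 0,
    "thr": 2.5,
    "low": {"label": 0, "centroid": [1.0]},
    "high": {"label": 1, "centroid": [9.0]},
}
X_new = [[1.0], [2.0], [3.0], [4.0]]
Y_new = [[0.0], [2.0], [8.0], [10.0]]

leaves = [assign(tree, x) for x in X_new]
labels = [leaf["label"] for leaf in leaves]
ev = explained_variance(Y_new, [leaf["centroid"] for leaf in leaves])
print(labels, ev)
```

In this toy case the residual sum of squares is 4 and the total is 68, so EV is 1 - 4/68, or roughly 0.94.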

3. Semi-supervised multivariate regression tree

A semi-supervised MRT is built as described above, with one exception: instead of using Y as the predictand dataset, a weighted combination of the predictors X and the original predictands Y serves as a new set of predictands. The MRT is thus applied with X as predictors and the N × (I + J) matrix

$$\mathbf{Z} = \left[ (1 - \alpha)\,\mathbf{X} \quad \alpha\,\mathbf{Y} \right] \tag{6}$$

where 0⩽α⩽1 determines the relative weighting of the two matrices, as predictands. If α = 1, the method reduces to a standard MRT; if α = 0, the method is auto-associative and performs unsupervised clustering of the predictors X (Chavent, 1998). Note that the relative scaling of individual variables (and, similarly, collinearity and redundancy between variables) will affect the final classification. Fovell and Fovell (1993) and Mimmack et al. (2001) discuss biases related to scaling and redundancy in climatological cluster analyses. The same caveats apply here. Redefinition of X and Y using methods such as truncated principal component analysis may be warranted, depending on the problem.
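A minimal sketch of how the concatenated predictand matrix could be assembled is given below. It follows the convention used in the results and conclusions of this paper, where α = 1 is fully supervised and α = 0 is fully unsupervised, and it assumes that the weights multiply the two matrices directly. The function name `concatenate_predictands` and the toy data are hypothetical.

```python
from typing import List, Sequence


def concatenate_predictands(
    X: Sequence[Sequence[float]], Y: Sequence[Sequence[float]], alpha: float
) -> List[List[float]]:
    """Build the N x (I + J) matrix with X weighted by (1 - alpha) and Y by alpha.

    Assumed convention: alpha = 1 leaves only Y (standard MRT),
    alpha = 0 leaves only X (unsupervised clustering of the predictors).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [(1.0 - alpha) * v for v in x_row] + [alpha * v for v in y_row]
        for x_row, y_row in zip(X, Y)
    ]


# Toy data (hypothetical): N = 2 cases, I = 2 predictors, J = 1 predictand.
X = [[1.0, 2.0], [3.0, 4.0]]
Y = [[10.0], [20.0]]

print(concatenate_predictands(X, Y, 0.25))
print(concatenate_predictands(X, Y, 1.0))  # X columns are zeroed out
print(concatenate_predictands(X, Y, 0.0))  # Y columns are zeroed out
```

A zeroed block contributes nothing to the sums of squared deviations, so at either extreme only the remaining block drives the choice of splits. The tree is still split on the unweighted predictors X in all cases.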

The appropriate value of α must be decided by the user. What level of degradation in predictand performance is acceptable for a given improvement in predictor performance? Because an MRT is predictive, one can use cross-validation techniques to evaluate performance trade-offs in the same manner that they are traditionally used to assist in selecting the number of synoptic classes (Elsner et al., 1996; Cannon et al., 2002b). The semi-supervised MRT is demonstrated via application to a real-world dataset in the next section.

4. Synoptic classifications

Following Cannon et al. (2002a, 2002b), MRT synoptic classifiers are developed for surface weather conditions at Vancouver, British Columbia, Canada. Gridded sea level pressure and 500-hPa geopotential height data from the National Centers for Environmental Prediction–National Center for Atmospheric Research (NCEP–NCAR) Reanalysis Project (Kalnay et al., 1996) are used as inputs to the synoptic classifications. Daily averages for 1961–2005 are obtained for 40°N to 62.5°N and 157.5°W to 110°W. Daily mean values of surface temperature, dewpoint depression, cloud opacity, and u-/v-wind components for Vancouver International Airport (49°11′31″N, 123°10′53″W) are used to describe surface weather conditions. Ninety days (0.5% of the record) contained missing data and are excluded from the analysis. To reduce the impact of seasonality on the classifications, all data are expressed as anomalies from a climatological mean state estimated by least squares regression on the first four harmonics of the seasonal cycle (Narapusetty et al., 2009).

Semi-supervised MRT models are fitted with NCEP–NCAR grid-points as the X dataset and the five surface weather variables as the Y dataset. Surface weather variables are standardized to zero mean and unit standard deviation so that each contributes equally to the WSS cost function. For the semi-supervised MRT models with α < 1, synoptic-scale predictors are similarly standardized prior to entry into the concatenated matrix Z. In this case, because there are more predictors than predictands, X and Y entries in Z are rescaled to have the same total sums of squares $\mathrm{WSS}_T$. An α value of 0.5 thus means that equal weight is given to the two sets of variables.
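The standardization and block rescaling described above could be sketched as follows. Several details are assumptions, since the text does not give them: the population form of the standard deviation, rescaling the X block to match the Y block (rather than the reverse), and all function names and toy values.

```python
from math import sqrt
from typing import List, Sequence


def standardize(M: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale each column to zero mean and unit (population) standard deviation.

    A constant column would cause a division by zero; such columns are
    assumed to have been removed beforehand.
    """
    n = len(M)
    cols = []
    for col in zip(*M):
        mean = sum(col) / n
        sd = sqrt(sum((v - mean) ** 2 for v in col) / n)
        cols.append([(v - mean) / sd for v in col])
    return [list(row) for row in zip(*cols)]


def total_ss(M: Sequence[Sequence[float]]) -> float:
    """Total sum of squared deviations from the column means (top-level node error)."""
    n = len(M)
    means = [sum(col) / n for col in zip(*M)]
    return sum((v - m) ** 2 for row in M for v, m in zip(row, means))


def rescale_to(M: Sequence[Sequence[float]], target_ss: float) -> List[List[float]]:
    """Multiply every entry by one factor so that total_ss(M) equals target_ss."""
    factor = sqrt(target_ss / total_ss(M))
    return [[factor * v for v in row] for row in M]


# Toy data (hypothetical): N = 4 cases, I = 3 predictors, J = 1 predictand.
X = [[1.0, 0.0, 2.0], [2.0, 1.0, 0.0], [3.0, 0.0, 1.0], [4.0, 3.0, 5.0]]
Y = [[5.0], [7.0], [6.0], [10.0]]

Xs, Ys = standardize(X), standardize(Y)
X_scaled = rescale_to(Xs, total_ss(Ys))

print(total_ss(Xs), total_ss(Ys))  # unequal before rescaling (more columns in X)
print(total_ss(X_scaled))          # matches the Y block after rescaling
```

After standardization each column contributes a sum of squares of N, so the block with more columns would dominate the cost function without the rescaling step.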

Models are fitted to data from 1961 to 1990 for 0⩽α⩽1 and are validated on data from 1991 to 2005. Synoptic classifications with K = 5, 10, 15, 20, 25 and 30 synoptic classes, a range spanning the typical number found in established synoptic climatologies (Philipp et al., 2010), are developed to investigate the dependence of model performance on K. Training and validation performance is shown in Figure 1 for the synoptic-scale circulation (X) and the local surface weather variables (Y). For reference, consider values of EV for semi-supervised MRT models with K = 10. On the training dataset, performance in terms of Y decreases in a nearly monotonic fashion as the influence of X in the concatenated Z matrix increases (i.e. as α goes from 1 to 0). This is, naturally, accompanied by an attendant increase in EV for X. Moving to the validation dataset, performance for Y is initially relatively flat as α is reduced from 1 and then degrades rapidly as α→0. For X, performance improves as α is reduced from 1 to 0, with the rate of improvement in EV slowing as α→0. Other than an overall increase in EV, the same general pattern holds with different numbers of classes K. As expected, the rate of improvement in EV for both X and Y for a given value of α slows as K increases.

Figure 1.

Values of EV for semi-supervised MRT models with α ranging from 0 to 1 and the number of classes K equal to 5, 10, 15, 20, 25 and 30 for: (a) training dataset (1961–1990); and (b) validation dataset (1991–2005). Each dot shows mean values of EV averaged over all synoptic-scale variables X and surface weather variables Y for a given value of α

What are the specific effects of this general pattern? If results for the standard MRT model (i.e. α = 1) are compared against those from semi-supervised MRT models with α < 1, it is evident that a semi-supervised model can offer a modest improvement in terms of X (i.e. the circulation) without sacrificing performance in terms of Y (i.e. the environment). For instance, with K = 10 the validation EV for X is 31.8% for a standard MRT versus 39.2% for a semi-supervised MRT model (α = 0.085), whereas values of EV are nearly identical (23.4 and 23.0%, respectively) for Y. Further improvements in terms of the synoptic-scale circulation variables can be obtained with α < 0.085, but only if one accepts some level of degradation in terms of the surface weather variables. Taken to the logical extreme, a completely unsupervised MRT model (α = 0) leads to an increase in EV for X to 40.5% but a decrease in EV for Y to just 14.1%.

5. Conclusions

The semi-supervised MRT model allows a user to balance the performance of a ‘circulation-to-environment’ synoptic classification system in terms of both the synoptic-scale circulation and a set of local environmental variables. Rather than using the MRT algorithm to link the synoptic-scale atmospheric circulation predictors directly to the surface environmental predictand variables, which can result in a loss of performance in terms of the predictors, one instead uses a weighted combination of the circulation and environment data as predictands in the model. Depending on the weighting factor α, the model can, in effect, be used in unsupervised (α = 0), supervised (α = 1), or semi-supervised (0 < α < 1) modes.

In the example given for the south coast of British Columbia, Canada, results suggest that a semi-supervised model (α = 0.085) can be used to improve the ability to discriminate circulation variables relative to a fully supervised model (α = 1) without sacrificing performance in terms of surface weather elements at a local station. In some situations, however, maximizing the discrimination of the circulation variables might be of more importance. This might be true if one were interested in developing a general purpose synoptic climatology for a large region, for example, for use in validating a global climate model (McKendry et al., 2006). In this case, a fully unsupervised model (α = 0) might be more appropriate. As with many of the decisions involved in synoptic climatology (e.g. the number of classes), an appropriate trade-off between circulation and environment must be selected by the user according to specific needs.
