Cascade multivariate regression tree: a novel approach for modelling nested explanatory sets
Article first published online: 23 JAN 2012
© 2012 The Authors. Methods in Ecology and Evolution © 2012 British Ecological Society
Methods in Ecology and Evolution
Volume 3, Issue 2, pages 234–244, April 2012
How to Cite
Ouellette, M.-H., Legendre, P. and Borcard, D. (2012), Cascade multivariate regression tree: a novel approach for modelling nested explanatory sets. Methods in Ecology and Evolution, 3: 234–244. doi: 10.1111/j.2041-210X.2011.00171.x
- Issue published online: 4 APR 2012
- Received 24 May 2011; accepted 30 October 2011 Handling Editor: David Warton
- multivariate regression tree;
- nested explanatory assessment;
- species composition drivers
1. Ecological data analysis frequently calls for the assessment of the relationship between species composition and a set of explanatory variables of interest. The assessment may have to be pursued while taking into account the influence of another set of explanatory variables. The hypothetical nature and structure of the influence of an explanatory set on the effect of a distinct explanatory set guides the proper choice of modelling methodology for a combined explanatory assessment.
2. Here, we describe a framework where the relationship between the response data and a main set of explanatory variables is not linear. It may, for example, take the form of abrupt changes in the response following thresholds of the explanatory variables, or any other nonlinearizable relationship. The influence of a second set of explanatory variables is determined a posteriori, after the influence of the main explanatory set has been taken into account. This is useful when one of the sets is thought to have an effect that varies as a function of the other.
3. To achieve this type of assessment, we propose a cascade of multivariate regression trees (CMRT). We decompose the total dispersion of a response matrix between two explanatory data sets in a nested manner. By handling each leaf (group) resulting from the first-level multivariate regression tree (MRT) analysis as separate independent data sets in following analyses, we can separate the explanatory power of the first partition from those of the subordinate partitions computed using a second explanatory set. A preliminary biological hypothesis will guide the choice of which set of explanatory variables should be used to compute the main partition. The method could be extended to more than two explanatory data sets whose effects on the response data are hierarchical.
4. Cascade of multivariate regression trees allows users to impose a nested structure on their causal hypotheses in MRT analysis. To illustrate this new procedure, we use the well-known and readily available Doubs fish and oribatid mite data sets and provide the necessary R functions in a package available on CRAN (http://cran.r-project.org).
Modelling field data in ecology often translates into the study of the effect of more than one set of explanatory variables on a response data set (Legendre & Legendre 2012). Species assemblages, in particular, can respond to a great number of environmental factors, and most of these may play an important explanatory role, but their effects on the response are not necessarily independent from one another.
The most common methodologies used to assess the influence of multiple explanatory data sets in ecology are linear regression modelling and ANOVA, as well as their multivariate extensions: canonical analysis [redundancy analysis (RDA), and canonical correspondence analysis (CCA)] and MANOVA (Legendre & Anderson 1999; Anderson 2001a,b; McArdle & Anderson 2001). In the linear modelling framework, where we want to model a response as a function of two sets of explanatory variables, we use partial linear regression in the univariate case and partial canonical analysis in the multivariate case (partial RDA: Davies & Tso 1982; partial CCA: ter Braak 1988). The effect of two or several explanatory data sets on response data can be untangled by variation partitioning (Borcard, Legendre, & Drapeau 1992; Borcard & Legendre 1994; Anderson & Gribble 1998; Peres-Neto et al. 2006). The effects of both explanatory sets are then hypothesized to be additive. Partial RDA and partial CCA both allow a constrained ordination of the response matrix Y on the explanatory variables X to be computed while controlling for the linear effect of a matrix of covariables W. In the MANOVA case, the effect of two (or more) factors is assessed, and interaction can be tested if replicates are available.
In this study, we use available statistical tools in a new combination to show how to tackle ecological data assessment when the relationship between a main explanatory data set and the response is nonlinear. An extreme example is when strong discontinuities in species composition exist along particular variables of a main explanatory data set. In such a case, thresholds better describe the relationship between the two data sets than linear models. Subsequently, the variation in each leaf (or group at the end of the tree) depicted by the discontinuities is to be independently explained by other explanatory variables of interest in a (possibly) different manner. Thus, we study the effect of both explanatory sets simultaneously by keeping in mind that the effect of one set might change as a function of the other. Multivariate regression tree analysis (MRT) is the perfect tool to undertake such a task, and we call the global procedure by the name cascade multivariate regression tree analysis (CMRT).
Multivariate regression tree analysis has stimulated growing interest in several ecological fields during the past few years. For instance, we find applications of MRT in microbial ecology (Auguet, Barberan, & Casamayor 2010), limnology (Davidson et al. 2010), forestry (Chen et al. 2010), reef studies (DeVantier et al. 2006), entomology (Koivula & Vermeulen 2005), ornithology (Ouellette et al. 2005), arachnology (Pinzón & Spence 2010) and wetland studies (Sheaves, Abrantes, & Johnston 2007). This method, introduced in the ecological literature by De’ath (2002) and Larsen & Speckman (2004), is a recursive binary partitioning algorithm that allocates objects of the response matrix to homogeneous groups, with partition criteria imposed by the explanatory variables. MRT is particularly useful to detect abrupt changes in community composition along an environmental gradient, because thresholds in the explanatory variables are used to delimit the leaves. In the procedure, the data set is split a large number of times to form the tree, then a pruning procedure is applied to reduce the large tree and obtain the best predictive tree size.
Cascade of multivariate regression trees is a procedure modelling the response data by means of two sets of explanatory variables that are taken into account in an order that reflects their hypothesized nested influence. The explanatory variables may be of any mathematical type as quantitative and qualitative explanatory variables can be used by MRT analysis. Moreover, because it is based on MRT analysis, this new procedure does not require that the relationships between the response and explanatory variables be linear, or the residuals normally distributed. It can also deal with missing values. These features make CMRT a valuable modelling technique for ecological data, where stringent statistical assumptions are seldom met.
Materials and methods
CMRT: The Procedure
Because CMRT is a new procedure, we first provide the necessary associated terminology (see Box 1 for illustration). We use the word wave to describe each level of the nested structure imposed by the user, and the word drop for each data set analysed at each level; see Fig. 1 for a diagram of the general structure. The number of waves is equal to the number of explanatory data sets in the user’s nested structure. Before launching the procedure, it is essential to identify which of the explanatory sets will have the main effect and which will have the subordinate effect. This decision should not be taken lightly because it strongly influences the inferences that can be drawn from the resulting model; see Discussion.
Let Y be the response matrix, A and S be, respectively, the main and subordinate explanatory tables. First, an MRT model is computed with Y as the response and A as the explanatory table. Variables in A may represent spatial scales: broad, medium and fine scales, or else landscape and microhabitat variables, which are another representation of scales. The hierarchy can also be based on the nature of the explanatory data sets, for example morphometry of a river (main) and land use (subordinate). See the Nested hypotheses in ecology subsection of the Discussion for more examples. Cross-validation is carried out to prune the tree and complete the first wave of the cascade.
Pruning is achieved by a resampling method called v-fold cross-validation (Breiman et al. 1984). First, the response data set is randomly split into v test subsets of roughly equal numbers of objects. These test subsets correspond to randomly chosen response row vectors, e.g. sites, with their corresponding species abundances. Then, v trees are built from v learning sets constructed by removing each of the v test sets one at a time from the whole set of objects. All trees are fully grown, and for each tree size, the cross-validated relative error is calculated as follows:
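$$\mathrm{CVRE} = \frac{\sum_{k=1}^{v} \sum_{i \in \text{test set } k} \sum_{j=1}^{m} \left( y_{ij(k)} - \hat{y}_{ij(k)} \right)^{2}}{\sum_{i=1}^{n} \sum_{j=1}^{m} \left( y_{ij} - \bar{y}_{j} \right)^{2}}$$

where yij(k) is one observation in test set k, ŷij(k) is the predicted value of this observation in the k-th tree computed from the corresponding learning set, ȳj is the mean of response variable j over all observations, n is the number of observations, and m is the number of variables in the response matrix Y. Cross-validation for the v test subsets produces predicted values for all n observations, which are all included in the calculation of the CVRE statistic. If the response data contain species abundances, the predicted response for observation i is a particular species composition.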
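As an illustration of the CVRE statistic, the sketch below computes it from an observed response table and the table of cross-validated predictions. It is a minimal example written for this text, not the code used by the mvpart package; the function name `cvre` and the toy data are assumptions.

```python
def cvre(observed, predicted):
    """Cross-validated relative error for an n x m response table.

    observed  -- list of n rows, each a list of m response values (Y)
    predicted -- same shape; row i holds the prediction made for observation i
                 by the tree grown without the test subset containing i
    """
    n = len(observed)
    m = len(observed[0])
    # Column means of Y, needed for the total sum of squares (denominator).
    means = [sum(row[j] for row in observed) / n for j in range(m)]
    # Numerator: squared prediction errors pooled over all test subsets.
    ss_error = sum((observed[i][j] - predicted[i][j]) ** 2
                   for i in range(n) for j in range(m))
    # Denominator: total variation in the response.
    ss_total = sum((observed[i][j] - means[j]) ** 2
                   for i in range(n) for j in range(m))
    return ss_error / ss_total


# Toy data (hypothetical): 4 sites, 2 species.
Y = [[1, 0], [3, 0], [0, 2], [0, 4]]
Y_hat = [[2, 0], [2, 0], [0, 3], [0, 3]]  # predictions from the v trees
print(cvre(Y, Y_hat))
```

A CVRE of 0 corresponds to perfect prediction of the test objects, and a value near 1 means that the tree predicts no better than the overall mean composition.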
The first wave thus consists of analysing a single drop containing all observations through an MRT model. The response set is hypothesized to vary as a function of A (main effect). The first wave will identify the groups of sites with the most homogeneous species composition, split by the explanatory variables coded in A.
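The search for homogeneous groups can be pictured with a single split on one quantitative explanatory variable. The sketch below tries every threshold and keeps the one that minimizes the summed within-group sum of squares of the response. It is a simplified illustration only: the function names are hypothetical, thresholds are placed midway between neighbouring values by assumption, and a real MRT would repeat the search over all variables of A and over all nodes.

```python
def within_ss(rows):
    """Sum of squared deviations from the column means (multivariate SS)."""
    if not rows:
        return 0.0
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    return sum((r[j] - means[j]) ** 2 for r in rows for j in range(m))


def best_threshold(x, Y):
    """Return (threshold, ss_after) for the best binary split of Y on x.

    Returns (None, total SS) when no split reduces the sum of squares.
    """
    values = sorted(set(x))
    best = (None, within_ss(Y))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate threshold (midpoint: an assumption)
        left = [row for xi, row in zip(x, Y) if xi < t]
        right = [row for xi, row in zip(x, Y) if xi >= t]
        ss = within_ss(left) + within_ss(right)
        if ss < best[1]:
            best = (t, ss)
    return best


# Toy data (hypothetical): an abrupt change in composition between x = 2 and x = 10.
x = [1, 2, 10, 11]
Y = [[1, 0], [1, 0], [0, 1], [0, 1]]
print(best_threshold(x, Y))
```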
The complexity parameter of an MRT model is the minimum contribution to the R² of the tree for a split to be considered. The value of the complexity parameter selected for the first drop shapes the partition produced by this first wave by controlling the number of splits. The value is determined by the user: a split will not be performed unless it explains at least as much variation as the chosen R² value. For the first wave of analysis, it is important to set the complexity parameter high enough to identify only the main factors determining variation in species composition.
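The role of the complexity parameter can be sketched as a simple test on the gain in R² brought by a candidate split. The function below is an illustration under the definition given above, with hypothetical names and numbers; it is not the internal code of rpart or mvpart.

```python
def split_is_kept(ss_total, ss_node, ss_left, ss_right, cp):
    """True if the split raises the R-squared of the tree by at least cp.

    ss_total -- total sum of squares of the response (root node)
    ss_node  -- sum of squares of the node being split
    ss_left, ss_right -- sums of squares of the two child nodes
    """
    gain = (ss_node - (ss_left + ss_right)) / ss_total
    return gain >= cp


# Hypothetical sums of squares: a strong first split and a weak later one.
print(split_is_kept(100.0, 100.0, 40.0, 45.0, cp=0.10))  # gain of 0.15
print(split_is_kept(100.0, 45.0, 20.0, 22.0, cp=0.10))   # gain of 0.03
print(split_is_kept(100.0, 45.0, 20.0, 22.0, cp=0.01))   # same split, lower cp
```

With a high value such as 0·10, only splits carrying a large share of the total variation survive; lowering the value to 0·01 lets finer splits through.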
Let g be the number of leaves resulting from the first wave (see Fig. 1b). In a second step, the variation in the response data in each leaf (the leaf response tables are noted Yh for h = 1, …, g) is modelled independently with the S explanatory table to form the subordinate drops. For these drops, the complexity parameter may be reduced to a small value as the second wave is intended to model finer variation in species composition. The default value of the complexity parameter is 0·01 in the mvpart() R function; it is passed from rpart.control(); both functions are found in the mvpart package (available on cran.r-project.org).
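The construction of the subordinate drops amounts to grouping the rows of Y and S by the leaf to which the first wave assigned each site. A minimal sketch follows, assuming the leaf membership is available as a list of labels; the function name and the toy tables are hypothetical.

```python
def subordinate_drops(leaf_of_site, Y, S):
    """Group the rows of Y and S by first-wave leaf.

    Returns a dict {leaf label: (Y_h, S_h)}; each pair is then modelled on
    its own, independently of the other leaves.
    """
    drops = {}
    for leaf, y_row, s_row in zip(leaf_of_site, Y, S):
        y_h, s_h = drops.setdefault(leaf, ([], []))
        y_h.append(y_row)
        s_h.append(s_row)
    return drops


# Toy example (hypothetical): 4 sites assigned to leaves 2 and 3 by the first wave.
leaves = [2, 3, 2, 3]
Y = [[1, 0], [0, 2], [3, 0], [0, 4]]   # response table
S = [[7.1], [6.8], [7.4], [6.5]]       # subordinate explanatory table
for leaf, (Y_h, S_h) in subordinate_drops(leaves, Y, S).items():
    print(leaf, Y_h, S_h)
```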
The algorithm used to fit all MRT models is the standard recursive greedy splitting algorithm described in the study by Breiman et al. (1984) and De’ath (2002). Each tree is fully grown and its final size is chosen by v-fold cross-validation, where v is often chosen to be 10 (Breiman et al. 1984). The user may choose between the ‘min’ or ‘1se’ rules (Breiman et al. 1984; De’ath 2002) for the validation. Breiman et al. (1984) suggested that both rules should lead to about the same risk. The less complex model, obtained with the ‘1se’ rule, should be chosen in most cases because the aim is to minimize both risk and complexity. In the case of a regression tree, the risk is the cross-validated relative error (the variation between true and predicted values of test objects divided by the total variation in the response) and the complexity is the number of splits. Thus, the risk vs. complexity assessment can be made by examining the plot of the cross-validated relative error as a function of the size of the tree, to see whether both rules lead to similar risks. In this study, the within-group sum of squares around the mean is used as the criterion to be minimized, even though other criteria could be used.
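The two selection rules can be compared with a short sketch. It assumes that, for each candidate tree size, the cross-validated relative error and its standard error are already available; the function name and the numbers are hypothetical. The 'min' rule keeps the size with the smallest error, while the '1se' rule keeps the smallest tree whose error lies within one standard error of that minimum.

```python
def choose_tree_size(sizes, cv_error, cv_se, rule="1se"):
    """Pick a tree size from cross-validation results.

    sizes    -- candidate tree sizes (number of leaves)
    cv_error -- cross-validated relative error for each size
    cv_se    -- standard error of each cross-validated error
    rule     -- "min" or "1se"
    """
    best = min(range(len(sizes)), key=lambda i: cv_error[i])
    if rule == "min":
        return sizes[best]
    if rule != "1se":
        raise ValueError("rule must be 'min' or '1se'")
    limit = cv_error[best] + cv_se[best]
    # Smallest tree whose error does not exceed the minimum plus one SE.
    return min(s for s, e in zip(sizes, cv_error) if e <= limit)


# Hypothetical cross-validation results for trees of 1 to 5 leaves.
sizes = [1, 2, 3, 4, 5]
errors = [1.00, 0.70, 0.62, 0.60, 0.65]
ses = [0.05, 0.05, 0.04, 0.04, 0.05]
print(choose_tree_size(sizes, errors, ses, rule="min"))
print(choose_tree_size(sizes, errors, ses, rule="1se"))
```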
The combined model, called cascade, is exactly that: a cascade of models, depicting in a nested manner the partitioning of the response data by two sets of explanatory variables (Fig. 1). Two general conclusions may emerge from a cascade: either the explanatory variables or splits of the second wave are the same for all leaves identified in the first wave, which means that the subordinate effect is the same over all subordinate data sets, or they are not. Therefore, the subordinate drops may be examined in turn to identify the differences in splits and explanatory variables among them. This approach is conceptually analogous to the search for interaction in MANOVA.
We define the R² of a single MRT tree as 1 minus the relative error defined by De’ath (2002). Thus, a single coefficient of determination (R²) can be computed for each drop. Consequently, a coefficient of determination (R²) can be obtained for the global analysis by pooling the R² of the first wave with the weighted R² obtained in the second wave (Fig. 1b). The CMRT procedure implies that the subordinate R² are computed as proportions of the variation in their corresponding leaf as defined in the first wave. Each of these secondary drop R² must then be re-expressed as a proportion of the total variation in the response data; the overall R² is finally obtained by summing the R² of the main drop and those of the subordinate drops. This procedure is valid because the subordinate drops are independent from one another.
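The pooling can be written out as a short calculation. In the sketch below, each subordinate R² is weighted by the share of the total sum of squares that remains in its leaf after the first wave, and the weighted values are added to the R² of the main drop. The function name and the numbers are hypothetical and serve only to illustrate the arithmetic.

```python
def overall_r2(leaf_ss, r2_sub, ss_total):
    """Pool the R-squared of a two-wave cascade.

    leaf_ss  -- within-leaf sum of squares of each first-wave leaf
    r2_sub   -- R-squared of the subordinate drop fitted within each leaf
    ss_total -- total sum of squares of the response table
    Returns (r2_main, contributions, r2_overall).
    """
    # R-squared of the main drop: 1 minus its relative error.
    r2_main = 1.0 - sum(leaf_ss) / ss_total
    # Each subordinate R-squared re-expressed as a share of the total variation.
    contributions = [ss_h / ss_total * r2_h
                     for ss_h, r2_h in zip(leaf_ss, r2_sub)]
    return r2_main, contributions, r2_main + sum(contributions)


# Hypothetical values: total SS of 100, two leaves left with SS of 30 and 50.
r2_main, contributions, r2_all = overall_r2([30.0, 50.0], [0.5, 0.1], 100.0)
print(r2_main, contributions, r2_all)
```

In this example the main drop explains 20% of the variation, the two subordinate drops add 15% and 5%, and the cascade as a whole explains 40%.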
The diagram provided by the CasMRTR2() function of the MVPARTwrap package takes the form of a square of unit area. The entire area represents the total variation in the response data. The proportion of the total response variation explained by each drop is represented as a shaded box of corresponding area. The box for the drop of the first wave is at the far left. Its height is 1, so that its width represents its R². The partitioning of the remaining variation is represented by drawing a box for each drop produced by the first wave. The widths of these boxes are proportional to the unexplained variation of the response table in the corresponding leaves of the first drop, so their sum is equal to the relative error of the first drop. The heights of the rectangles represent the R² of the subordinate drops within their leaf. Therefore, their areas represent their R² as ratios of the total response variation.
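The geometry of this diagram can be reproduced from the same quantities. The sketch below returns the horizontal position, width and height of each shaded box in the unit square. It is not the code of CasMRTR2(); the function name is hypothetical, and laying the subordinate boxes side by side from left to right is an assumption made for the illustration.

```python
def diagram_boxes(leaf_ss, r2_sub, ss_total):
    """Return (x, width, height) for each shaded box of the unit square.

    The first box is the main drop (height 1, width equal to its R-squared);
    one box per first-wave leaf follows.
    """
    r2_main = 1.0 - sum(leaf_ss) / ss_total
    boxes = [(0.0, r2_main, 1.0)]
    x = r2_main
    for ss_h, r2_h in zip(leaf_ss, r2_sub):
        width = ss_h / ss_total          # unexplained variation left in the leaf
        boxes.append((x, width, r2_h))   # height: R-squared within the leaf
        x += width
    return boxes


# Hypothetical values: total SS of 100, two leaves left with SS of 30 and 50.
for x, width, height in diagram_boxes([30.0, 50.0], [0.5, 0.1], 100.0):
    print(round(x, 3), round(width, 3), height, "area:", round(width * height, 3))
```

The widths add up to 1 and the shaded areas add up to the overall R² of the cascade.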
We illustrate the CMRT procedure by using two data sets that were submitted to different types of analyses by Borcard, Gillet, & Legendre (2011) and are readily available in R (R Development Core Team 2010). For both case studies, a complexity parameter of 0·10 is used for the first wave and the usual 0·01 value is used for the second wave. Also, both community response matrices are Hellinger transformed prior to the analysis (Legendre & Gallagher 2001).
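For readers unfamiliar with the Hellinger transformation, it divides each abundance by the site total and takes the square root of the result. The sketch below is a plain illustration with a hypothetical function name; leaving sites with a zero total as rows of zeros is an assumption made here to avoid a division by zero.

```python
import math


def hellinger(Y):
    """Hellinger-transform a sites x species abundance table."""
    out = []
    for row in Y:
        total = sum(row)
        if total == 0:
            # Empty site: kept as zeros (assumption of this sketch).
            out.append([0.0] * len(row))
        else:
            out.append([math.sqrt(v / total) for v in row])
    return out


# Toy abundance table (hypothetical): 3 sites, 2 species.
print(hellinger([[1, 3], [4, 0], [2, 2]]))
```

After transformation, the squared values of each non-empty row sum to 1.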
The first data set consists of three data tables (species abundances of oribatid mites, micro-environmental variables and spatial coordinates) extracted from 70 peat moss cores collected in a small area in the peat blanket surrounding Lac Geai (Québec, Canada), going from the edge of the forest to the open water of this bog lake (Borcard & Legendre 1994). The sampling area is 2·5 m × 10 m in size; the small size of these arthropods calls for small sampling units and extent. In the usual MRT analysis run with all variables, water content (g dm−3) determines the first split of the model (Fig. 2). Oribatids are not aquatic, and in this extremely wet environment some oribatids prefer more water and others less, which gives this explanatory variable a direct effect. The water content also has an indirect effect on the biota by structuring the vegetation. Other substrate and micro-environmental variables are available as explanatory variables, in particular density of the substrate (g dm−3), type of substrate (seven unordered classes), shrub density (none, few and many) and microtopography (blanket-hummock). This data set is available in the vegan R package (cran.r-project.org) as well as in the electronic material provided with the book of Borcard, Gillet, & Legendre (2011).
In the CMRT analysis, we use the variable ‘shrub density’ as the main effect because shrubs impose particular microclimate and microsubstrate changes for the mites: they increase shade and top the original substrate (sphagnum moss) with additional woody matter.
The first drop of the cascade divides the sites into two groups separating the sites with no shrubs, with indicator morphospecies Trimalaconothrus sp., Tectocepheus cf. vietsi and Ceratozetidae sp3, from the sites with a few or many shrubs (indicator morphospecies Tectocepheus velatus, Malaconothrus cf. egregius, Oppiella nova, Fuscozetes setosus, Hypochthoniella sp1 & sp2 and Galumnidae).
In subordinate drops 2 and 3 (Fig. 3), different explanatory variables are identified to split each subset of sites into two: for the sites without shrubs, substrate density is the splitting explanatory variable and the splitting point is 50·36 g dm−3 and for sites with shrubs, water content at 385·1 g dm−3 is the delimiter. For the sites without shrubs, we have only one indicator morphospecies per group: for low substrate density, we have O. nova and for high substrate density, Trhypochthonius cf. tectorum. For the sites with shrubs and high water content, the indicator morphospecies are Nanhermannia coronata, Limnozetes rugosus and Limnozetes cf. ciliatus, whereas for low water content, we have T. velatus, F. setosus, Hypochthoniella sp. 2 and Rhysotritia ardua. After forcing the shrub variable at the top of the model, the R² of the first drop is low (0·163) and the CVRE is high (0·94). Yet, we are still able to extract new insight from the cascade, not available in the global MRT: where there is no shrub, substrate density has stronger control over the species composition, whereas where shrubs are present, water content is the most discriminating explanatory variable. Figure 4 shows the variation partitioning of the original response of the oribatid mite data by drops.
Doubs river fish
The Doubs River fish data were collected by Verneaux (1973; see also Verneaux et al. 2003) who considered the fish species composition to be an ecological indicator of water quality along the Doubs River in the Jura mountains, near the France–Switzerland border. The data set presented here is a subset of the original data in Verneaux’s thesis, i.e. 30 sites described by three data tables: the fish species composition (abundance classes ranging from 0 to 5), explanatory variables describing the water quality and river morphology and finally the spatial coordinates of the sites. It is provided as electronic material with the book of Borcard, Gillet, & Legendre (2011). In the original MRT analysis (Fig. 5), the distance from the source provides the first split; actually, this split identifies two zones that had been identified by Verneaux as the Salmonid region (upstream) and the Cyprinid region (downstream). To illustrate the CMRT procedure, we use the morphological variables ‘mean discharge’ and ‘slope’ as the main explanatory set. It should be noted that these variables are likely to have been represented in the original MRT analysis by their proxy, i.e. distance from the source. We will comment on this choice, made for demonstration purposes only, in the Discussion. The physical and chemical variables [calcium concentration (hardness), pH, phosphate, nitrate, ammonium, dissolved oxygen and biological oxygen demand] are selected as the subordinate explanatory set.
The resulting cascade is shown in Fig. 6. In the first drop, the sites are split by a mean discharge of 23·65 m3 s−1. On the left is the Salmonid region of Verneaux (1973) (group 2), whereas the Cyprinid region (group 3) is found in the right-hand branch of the tree. Indicator species analysis (Dufrêne & Legendre 1997) with Holm correction for multiple testing shows that the Salmonid region is characterized by the brown trout (Salmo trutta fario, a Salmonid) and the common minnow (Phoxinus phoxinus, a Cyprinid). The Cyprinid region has the bleak (Alburnus alburnus), the common nase (Chondrostoma nasus), the ruff (Acerina cernua), the pumpkinseed sunfish (Lepomis gibbosus), the European bitterling (Rhodeus amarus), the European eel (Anguilla anguilla), the roach (Rutilus rutilus), the spirlin (Spirlinus bipunctatus), the common carp (Cyprinus carpio), the whitebream (Blicca bjoerkna), the common barbel (Barbus barbus), the common bream (Abramis brama), the rudd (Scardinius erythrophthalmus) and the south-west European nase (Chondrostoma toxostoma) as indicator species.
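The Holm correction applied to the indicator species tests can be sketched as follows. The p-values are sorted in increasing order, each is multiplied by the number of tests not yet examined, and the adjusted values are forced to be non-decreasing and capped at 1. The function name and the p-values below are hypothetical; they are not results from the Doubs analysis.

```python
def holm_adjust(p_values):
    """Holm step-down adjustment of a list of p-values (original order kept)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # keep values non-decreasing
        adjusted[i] = running_max
    return adjusted


# Hypothetical p-values for three species.
print(holm_adjust([0.01, 0.04, 0.03]))
```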
Within each zone identified by the first drop, the water quality variables are used in the subordinate analyses to identify and explain finer differences in species composition. No further splits are found in the Salmonid region (v-fold cross-validation pointed to one group). It is not the case for the Cyprinid region (right-hand leaf of drop 1, called drop 3 in our analysis), which showed three species assemblages responding to two explanatory variables: ammonium concentration and dissolved oxygen; see Fig. 6 for a map of the sites along the river and the cascade of analyses and Fig. 7 for a summary of the explained variation.
Group 2 of the tree of drop 3 contains sites 23–25, characterized by large concentrations of ammonium (≥ 0·45 mg L−1) and, by correlation, by large concentrations of phosphorus (r = 0·9695) and high biological oxygen demand (r = 0·8858); these two variables, which would produce the same split, are not shown in the tree. The bleak A. alburnus, the chub Leuciscus cephalus cephalus and the roach R. rutilus are the indicator species of this group (sites 23–25). The bleak is present at sites 21–30 but particularly successful at the highly eutrophized sites 23–25. This species feeds on zooplankton near the surface (Horppila & Kairesalo 1992) which is, for this species, an important habitat for feeding (de Nie 1987) and to lay eggs (Pihu 1996). Thus, the indicator value of this species at sites 23–25 corresponds to the presence of macrophytes, which are in turn associated with high nutrient concentrations (Carr & Chambers 1998). The same applies to the roach for which macrophytes are also an important feeding habitat. As shown by Borcard, Gillet, & Legendre (2011, Fig. 2·5), this group is found in a zone where there is a significant drop in species richness and where one is more likely to find perturbation-tolerant species.
Group 4, which includes sites 17–20, is also part of drop 3. It is characterized by high levels of dissolved oxygen (≥ 9·65 mg L−1) and small concentrations of ammonium (< 0·45 mg L−1). The indicator species in this case are the stone loach (Nemacheilus barbatulus, Kottelat & Freyhof 2007), the western vairone (Telestes soufia agassizi, Kottelat & Freyhof 2007), the common minnow (P. phoxinus, DORIS 2010b), the south-west European nase (C. toxostoma, Chappaz, Brun, & Olivari 1989), the spirlin [S. bipunctatus, (Kottelat & Freyhof 2007)] and the common dace Leuciscus leuciscus (DORIS 2010a). All these species have a common preference for intermediate to high oxygen levels (see associated references).
Lastly, from drop 3, we get group 5, which is characterized by low dissolved oxygen levels (< 9·65 mg L−1) and small concentrations of ammonium (< 0·45 mg L−1). Low dissolved oxygen levels are found in stagnant turbid waters linked to a muddy bed, of which all the following species are indicators. First, the European eel (A. anguilla) is found near river mouths; this species migrates to the sea for reproduction and prefers to live close to the bottom in mud or crevices (Deelder 1984). The bream (A. brama) prefers slow-flowing waters (Kottelat & Freyhof 2007), and the catfish (Ictalurus melas) is found in slow current, pools and backwaters (Page & Burr 1991), just like the northern pike (Esox lucius) (Crossman 1996); A. cernua (or Gymnocephalus cernua) is favoured by eutrophic conditions (Kottelat & Freyhof 2007). The carp (C. carpio) prefers warm, deep, slow to still waters (Kottelat & Freyhof 2007), the silver bream (B. bjoerkna) still waters (Kottelat & Freyhof 2007), and the pumpkinseed (L. gibbosus) vegetated pools (Page & Burr 1991).
We could not identify further splits in the Salmonid region using the physical and chemical explanatory variables. For the Cyprinid region, however, the ammonium and dissolved oxygen variables delimited first a polluted region, sites 23–25. Then, among the less polluted sites, two groups were discriminated by the low oxygen level, which is a proxy for less agitated waters, which in turn is a proxy for the type of river bed.
General Remarks on the Procedure
Cascade of multivariate regression trees offers the opportunity to address ecological hypotheses in a preferential order, allowing one to override the original explanatory order of the variables presented in MRT analysis to explore specific avenues by testing the influence of precise variables on the response data. Both MRT and CMRT produce a hierarchy; the peculiarity of the CMRT procedure resides in the possibility to preselect the explanatory set of variables that will be used to compute the first few bipartitions. Therefore, when creatively applied with specific hypotheses in mind, the cascade provides new insights into the data structure that would not have been available in simple MRT analysis. In order to exploit the CMRT procedure to its full potential, the explanatory variables selected for the first wave should differ from those defining the first bipartition of the simple MRT; if they were the same, the two analyses would depict the same pattern. Actually, if we had used in CMRT the same first explanatory variables that were identified by the simple MRT model, the resulting CMRT model would have been the same as the MRT result, but with a smaller number of leaves because the independent cross-validations conducted in the drops would have reduced the overall power. This is what happened with our second example (Doubs fish data), where the variable selected among those chosen for the first wave, mean minimum discharge (Fig. 6, drop 1), was represented by its proxy, distance from the source, in the non-hierarchical MRT (Fig. 5). Furthermore, because distance from the source actually explains much of the chemical variation along the river, it also appears in further splits of the original MRT. This variable being absent from the CMRT, the corresponding splits have been identified by true physical or chemical variables, leading to similar, if not completely identical results.
In the linear procedures – partial linear regression and canonical analysis (RDA) – where we include the use of covariables, the use of residuals is necessary to partial out the variation explained by one of the explanatory sets (Legendre, Oksanen, & ter Braak 2011; Legendre & Legendre 2012). Here, as each leaf of the first wave is treated and modelled separately by the subordinate set of explanatory variables, there is no need to use the residuals of the first wave in the second wave. Actually, if one uses the residuals of the first wave for the subordinate analyses, one obtains exactly the same cascade structure and R² as with the original data; thus, this practice is useless.
For the total sum of squares of Y to be meaningful, the response variables have to be dimensionally homogeneous, so that the sums of squares of individual variables are additive. If the variables are not dimensionally homogeneous, e.g. environmental descriptors, they have to be standardized prior to MRT and CMRT analysis.
The Case Studies
We propose two contrasting case studies to stress the importance of the choice of the explanatory variables of the first wave. The oribatid mite example illustrates a case where a hypothesis about the effect of shrub vegetation leads us to impose the corresponding variable as a first-level effect. This is clearly different from the spontaneous order of the variables, as revealed by the standard MRT, but it allows us to gain new insights into the hypothesized effect. Therefore, the application of CMRT adds to our knowledge of the ecological processes at work.
The Doubs River example, on the contrary, shows that the choice of a variable representing many of the important ecological drivers of the river system and strongly correlated to the first-rank variable identified in the classical MRT is not adequate. This choice not only leads to a result that closely resembles the one obtained by classical MRT, but also leads to a result that is impoverished by the lack of power induced by the sequential nature of the CMRT method. Therefore, in this case, the CMRT has not added to our knowledge of the system, although, as shown in the Results, this simpler classification can be well explained in ecological terms.
Nested Hypotheses in Ecology
Cascade of multivariate regression trees allows users, for the first time, to impose a hierarchy on their causal hypotheses in MRT analysis. Several ecological studies include a natural hierarchical explanatory configuration. For instance, a land use impact study of communities (e.g. fish, phytoplankton and zooplankton) could include explanatory variables about the lake or river morphometry as the main driver and land use impact variables as the subordinate effect. With the CMRT procedure, inherently, for each of the groups identified by the morphometry explanatory data, the subordinate effect of land use impact can be studied and identified in a fully independent manner.
In the analysis of time series, one can use the time sequence as the basis for a primary segmentation (wave 1 analysis) of the data in CMRT. This first step, corresponding to a clustering with chronological constraint, is followed by secondary analyses of each segment using environmental variables, where different explanatory variables may express themselves in different time segments. The same could be carried out for a spatial transect. The Doubs River data, which form a spatial series along the course of the river, could be analysed in that way. Segmentation of the river by MRT using the distance from the source variable, which corresponds to wave 1 of this type of analysis, is shown as an example in Section 4·11·5 of Borcard, Gillet, & Legendre (2011). For surveys conducted on a two-dimensional geographical map, the primary segmentation could be carried out by clustering, spatially constrained by the geographical coordinates of the sites (see e.g. Legendre & Legendre 2012, Chapter 13).
Another possible application of CMRT is for space–time surveys. Legendre, De Cáceres, & Borcard (2010) showed how one can test the space–time interaction in this type of survey for univariate or multivariate response data. (i) If the interaction is not significant, fairly homogeneous space–time blocks of observations can be identified by wave 1 analysis in CMRT, followed by secondary (wave 2) separate analysis of each block using environmental variables. (ii) If a significant interaction between space and time is identified, it indicates that the spatial distribution of the response data, e.g. species community data, has changed through time or, conversely, that the species composition has changed differently through time at the different sampling sites. In that case, CMRT could be used to analyse the multivariate time series from each site separately or the multivariate data across sampling sites from each sampling time separately.
In some applications, the nested structure inherent in CMRT may be imposed by the researcher for heuristic reasons. For space–time studies, time or space can be used as the main set of explanatory variables. (i) Let us explore a hypothetical situation where tree community composition has been collected at n sites in a forest (space) over t time steps, and the study is concerned with the evolution of the distribution of a potentially invasive species. In this case, space will be used as the primary factor. By doing so, we delineate regions of the forest, i.e. groups of geographically contiguous sites, that are the most similar through time. Each of these regions with similar species assemblages may respond differently in time to disturbances: for example, a local drought could boost the invasive ability of a species. The secondary analysis, carried out separately on each region using time as the explanatory variable, would allow the identification of regions where the communities changed most over time, possibly as a consequence of the invasion. (ii) Let us now suppose that our main interest is to study the effect of an unusually long drought affecting the whole forest. In this case, we would use time as the main factor to first focus on the evolution of the overall species composition through time, pointing perhaps at major extinction events caused by this drought. Subsequently, we could study each assemblage identified along the time line and see how it is structured in space, or with respect to environmental factors that may condition the structure of the community through space. The number of sites affected by the invasive species may vary greatly from time to time.
In any space–time study, spatial or temporal correlation between adjacent sites owing to neutral processes of community dynamics (see e.g. Legendre & Legendre 2012) may be present along one or the other sampling axis, or both. Further simulations are needed to fully understand the nature and extent of its effect on the CMRT modelling process, notably on the cross-validated estimate of the prediction error, which is the basis for selecting the size of the tree.
Extensions of the Cascade
The procedure described in this study was based solely on MRT. It is possible to pursue a cascade analysis using other methods. For example, the first drop may come from a partition either constructed with another method or simply known from prior knowledge of the data. A linear model, if the assumptions of such a procedure are met, may also be used to model the subordinate drops. Thus, a mixture of modelling procedures may be used in the framework. The computation of an overall R² is still valid because the subordinate analyses are conducted independently in each drop and the calculation of R² is independent in each analysis. This framework is also applicable to univariate CART classification or regression tree models. Moreover, more than two waves could be used. This would require a large data set, so that the leaves of the second wave contain a sufficient number of sites and some variation is left to be explained in the third wave of the analysis.
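The additivity behind the overall R² can be written out explicitly. The derivation below is a sketch that assumes dispersion is measured as a sum of squares; the symbols are notation introduced here for illustration: SS_T is the total sum of squares of the response matrix, SS_k the sum of squares within group k of the first drop (k = 1, ..., K), and R²_k the coefficient of determination of the subordinate model fitted within group k, whatever its type.

```latex
\begin{align}
SS_T &= SS_B^{(1)} + \sum_{k=1}^{K} SS_k,
  \qquad R^2_{(1)} = \frac{SS_B^{(1)}}{SS_T}, \\
SS_k &= R^2_k \, SS_k + \left(1 - R^2_k\right) SS_k, \\
R^2_{\mathrm{overall}}
  &= \frac{SS_B^{(1)} + \sum_{k=1}^{K} R^2_k \, SS_k}{SS_T}
   = R^2_{(1)} + \sum_{k=1}^{K} \frac{SS_k}{SS_T}\, R^2_k .
\end{align}
```

Because each R²_k is weighted only by the share of dispersion left in its own group, the subordinate models do not need to be of the same kind for the sum to hold.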
Relating CMRT to Nested MANOVA
The CMRT procedure has some fundamental resemblance to nested MANOVA, but users should be aware of important theoretical differences. One of them is that in CMRT, the structure results from splits of the explanatory variables that best explain the response through an MRT analysis. This means that the usual calculation of degrees of freedom, which are needed to compute the F statistics used in MANOVA to test the significance of the main factor, the subordinate factor and their interaction (Legendre & Anderson 1999; Anderson 2001a,b; McArdle & Anderson 2001), is not directly applicable (M.-H. Ouellette & P. Legendre, in preparation). For that reason, these tests are not implemented in CMRT. However, it is possible to subjectively infer from the cascade whether the effect of the subordinate explanatory set on the response data changed as a function of the main set, by examining whether the subordinate explanatory variables chosen or their splitting values changed as a function of the groups defined in the main partition.
Finally, the possibility of fitting trees in an additive, non-nested way has yet to be explored. This approach would, however, be conceptually very different from CMRT: it would answer different questions, and its development would require resolving several mathematical issues about the nature and computation of residuals in a nonlinear context.
The CMRT procedure is a framework where nested ecological hypotheses are privileged. Users must choose in which order two (or more) explanatory sets are considered in an MRT structure. It is also possible to partition the explained variation (R²) among the sets and ultimately obtain a coefficient of determination for the complete cascade of MRT analyses. The final CMRT model may be subjectively assessed for interaction between the explanatory sets, to evaluate whether the effect of the subordinate set changed as a function of the group membership produced by the first wave of analysis. The overall procedure is of interest for fundamental as well as applied ecological studies and may be applied in other fields such as geography, oceanography and soil science, as well as outside the biological domain.
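The partition of R² among the sets can be illustrated with a short sketch. It assumes sum-of-squares dispersion and takes the leaf memberships of both waves as given, however they were obtained. The function names `ss` and `cascade_r2` and the example data are hypothetical; this is not the MVPARTwrap implementation.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def ss(rows: Sequence[Sequence[float]]) -> float:
    """Sum of squared deviations from the column means."""
    if not rows:
        return 0.0
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return sum((r[j] - means[j]) ** 2 for r in rows for j in range(p))


def cascade_r2(
    Y: List[Sequence[float]],
    wave1: Sequence[Hashable],
    wave2: Sequence[Hashable],
) -> Dict[str, object]:
    """Partition the explained dispersion between two nested waves.

    wave1[i] is the first-wave leaf of site i; wave2[i] is its
    second-wave leaf, interpreted within that first-wave leaf only.
    """
    total = ss(Y)
    if total == 0:
        raise ValueError("response matrix has no dispersion to explain")

    # Group site indices by first-wave leaf.
    groups = defaultdict(list)
    for i, g in enumerate(wave1):
        groups[g].append(i)

    ss_group = {g: ss([Y[i] for i in idx]) for g, idx in groups.items()}
    r2_wave1 = 1.0 - sum(ss_group.values()) / total

    # Contribution of each subordinate analysis, as a share of the total SS.
    contrib = {}
    for g, idx in groups.items():
        leaves = defaultdict(list)
        for i in idx:
            leaves[wave2[i]].append(Y[i])
        resid = sum(ss(rows) for rows in leaves.values())
        contrib[g] = (ss_group[g] - resid) / total

    overall = r2_wave1 + sum(contrib.values())
    return {"wave1": r2_wave1, "wave2_by_group": contrib, "overall": overall}


# Illustrative data only: four sites, one response variable.
Y = [[0.0], [2.0], [10.0], [12.0]]
print(cascade_r2(Y, ["a", "a", "b", "b"], [0, 0, 0, 1]))
```

In this toy example group "a" is left unsplit in the second wave, so its contribution is zero, while the split within group "b" adds to the first-wave R².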
This study was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) grant no. 7738 to P. Legendre. We wish to thank Steven Walker for useful comments and suggestions that helped in improving the manuscript.
- 2001a) A new method for non-parametric multivariate analysis of variance. Austral Ecology, 26, 32–46. (
- 2001b) Permutation tests for univariate or multivariate analysis of variance and regression. Canadian Journal of Fisheries and Aquatic Sciences, 58, 626–639. (
- 1998) Partitioning the variation among spatial, temporal and environmental components in a multivariate data set. Austral Ecology, 23, 158–167. & (
- 2010) Global ecological patterns in uncultured Archaea. ISME Journal, 4, 182–190. , & (
- 2011) Numerical Ecology with R. Springer, New York. , & (
- 1994) Environmental control and spatial structure in ecological communities: an example using oribatid mites (Acari, Oribatei). Environmental and Ecological Statistics, 1, 37–61. & (
- 1992) Partialling out the spatial component of ecological variation. Ecology, 73, 1045–1055. , & (
- 1988) Partial canonical correspondence analysis. Classification and Related Methods of Data Analysis (ed. H.-H. Bock), pp. 551–558. North-Holland, Amsterdam. (
- 1984) Classification and Regression Trees. Wadsworth International Group, Belmont, CA, USA. , , & (
- 1998) Macrophyte growth and sediment phosphorus and nitrogen in a Canadian prairie river. Freshwater Biology, 39, 525–536. & (
- 1989) Données nouvelles sur la biologie et l’écologie d’un poisson Cyprinidé peu étudié Chondrostoma toxostoma (Vallot, 1836). Comparaison avec Chondrostoma nasus (L. 1766). Comptes rendus de l’académie des sciences, Paris, série III, 309, 181–186. , & (
- 2010) Community-level consequences of density dependence and habitat association in a subtropical broad-leaved forest. Ecology Letters, 13, 695–704. , , , , & (
- 1996) Taxonomy and distribution. Pike: Biology and Exploitation (ed. J.F. Craig), pp. 1–11. Chapman and Hall, London. (
- 2010) Inferring past zooplanktivorous fish and macrophyte density in a shallow lake: application of a new regression tree model. Freshwater Biology, 55, 584–599. , , , & (
- 1982) Procedures for reduced-rank regression. Journal of the Royal Statistical Society: Series C (Applied Statistics), 31, 244–255. & (
- 2002) Multivariate regression trees: a new technique for modeling species-environment relationships. Ecology, 83, 1105–1117. (
- 2011) mvpart: Multivariate partitioning. R package version 1.4-0. http://CRAN.R-project.org/package=mvpart. (
- 1984) Synopsis of Biological Data on the eel, Anguilla anguilla (Linnaeus, 1758). FAO, Rome, Italy. (
- 2006) Species richness and community structure of reef-building corals on the nearshore Great Barrier Reef. Coral Reefs, 25, 329–340. , , , & (
- DORIS (2010a) Leuciscus leuciscus (Linnaeus, 1758). Available at: http://doris.ffessm.fr/fiche2.asp?fiche_numero=2166 (accessed 21 March 2011).
- DORIS (2010b) Phoxinus phoxinus (Linnaeus, 1758). Available at: http://doris.ffessm.fr/fiche2.asp?fiche_numero=1656 (accessed 23 March 2011).
- 1997) Species assemblages and indicator species: the need for a flexible asymmetrical approach. Ecological Monographs, 67, 345–366. & (
- 1992) Impacts of bleak (Alburnus alburnus) and roach (Rutilus rutilus) on water quality, sedimentation and internal nutrient loading. Hydrobiologia, 243–244, 323–331. & (
- 2005) Highways and forest fragmentation – effects on Carabid Beetles (Coleoptera, Carabidae). Landscape Ecology, 20, 911–926. & (
- 2007) Handbook of European Freshwater Fishes. Publications Kottelat, Cornol, Switzerland. & (
- 2004) Multivariate regression trees for analysis of abundance data. Biometrics, 60, 543–549. & (
- 1999) Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69, 1–24. & (
- 2010) Community surveys through space and time: testing the space-time interaction in the absence of replication. Ecology, 91, 262–272. , & (
- 2001) Ecologically meaningful transformations for ordination of species data. Oecologia, 129, 271–280. & (
- 2012) Numerical Ecology, 3rd English edn. Elsevier Science BV, Amsterdam. & (
- 2011) Testing the significance of canonical axes in redundancy analysis. Methods in Ecology and Evolution, 2, 269–277. , & (
- 2001) Fitting multivariate models to community data: a comment on distance-based redundancy analysis. Ecology, 82, 290–297. & (
- 1987) The decrease in aquatic vegetation in Europe and its consequences for fish populations, EIFAC/CECPI. Occasional paper No. 19. (
- 2011) vegan: Community Ecology Package. R package version 2.0-2. http://CRAN.R-project.org/package=vegan. , , et al. (
- 2005) L’arbre de régression multivariables: classification d’assemblage d’oiseaux fondée sur les caractéristiques de leur habitat. Société Francophone de Classification, Montréal. , , & (
- 2011) MVPARTwrap: Wrap functions of mvpart function providing for instance a more descriptive tree, the discriminant species at each node (table 1 of De’ath (2002)), the cascade MRT analysis and an adjusted R² for an MRT model. R package version 0.1-8. http://CRAN.R-project.org/package=MVPARTwrap. & (
- 1991) A Field Guide to Freshwater Fishes of North America North of Mexico. Houghton Mifflin Company, Boston, MA. & (
- 2006) Variation partitioning of species data matrices: estimation and comparison of fractions. Ecology, 87, 2614–2625. , , & (
- 1996) Fishes, their biology and fisheries management in Lake Peipsi. Hydrobiologia, 338, 163–172. (
- 2010) Bark-dwelling spider assemblages (Araneae) in the boreal forest: dominance, diversity, composition and life-histories. Journal of Insect Conservation, 14, 439–458. & (
- R Development Core Team (2010) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
- 2007) Nursery ground value of an endangered wetland to juvenile shrimps. Wetlands Ecology and Management, 15, 311–327. , & (
- 1973) Cours d’eau de Franche-Comté (Massif du Jura). Recherches écologiques sur le réseau hydrographique du Doubs. Essai de biotypologie. Thèse d’état, Besançon. (
- 2003) Benthic insects and fish of the Doubs river system: typological traits and the development of a species continuum in a theoretically extrapolated watercourse. Hydrobiologia, 490, 63–74. , , & (