In total, 4000 data sets were produced and analysed by 23 different methods. We carried out two pre-analyses on all data sets. The first compared model selection procedures based on the small-sample-size corrected Akaike's information criterion (AICc) and the Bayesian information criterion (BIC); the second compared three different ways to represent clusters (see Supplementary material Appendix 1.2 for details). Based on the results, we represented a cluster by its central variable and kept all four clustering methods. We used BIC-derived models except for CPCA, PLS and PPLS. As references we used three models: firstly, a correctly specified linear model (i.e. a GLM with Gaussian errors, from here on referred to as GLM), in which we only estimated the parameters (ML true); secondly, a backward stepwise simplified GLM (starting with linear and quadratic terms and first-order interactions for all 21 predictors, with BIC as selection criterion); and, thirdly, a GAM with cubic splines and shrinkage (i.e. reduction in spline flexibility; Wood 2006) applied to all predictors (see Fig. 6 for all remaining methods).
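For reference, the two selection criteria compared in the first pre-analysis differ only in their penalty terms. A minimal sketch of the standard formulas (generic textbook definitions, not the authors' code; variable names are our own):

```python
import numpy as np

def aic(log_likelihood: float, k: int) -> float:
    """Akaike's information criterion for a model with k parameters."""
    return 2 * k - 2 * log_likelihood

def aicc(log_likelihood: float, k: int, n: int) -> float:
    """Small-sample-size corrected AIC (requires n > k + 1)."""
    return aic(log_likelihood, k) + (2 * k * (k + 1)) / (n - k - 1)

def bic(log_likelihood: float, k: int, n: int) -> float:
    """Bayesian information criterion; its penalty grows with log(n)."""
    return np.log(n) * k - 2 * log_likelihood
```

The correction term in AICc vanishes as n grows, so AICc and AIC converge for large samples, while BIC penalises extra parameters more heavily whenever log(n) > 2.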
Model validation under collinearity
In the analysis, we focussed on three aspects affecting the performance of a method, as assessed by the root mean square error (RMSE) on different test data: 1) the degree of collinearity present in the data (x-axis in Fig. 4 and 6); 2) the complexity of the functional relationship used for simulation (the five subfigures of Fig. 6); and 3) the change in collinearity structure from training to test data set (the five line types within each panel of Fig. 6). As an absolute reference, we used the RMSE of a correctly specified model (first panel, with the formula used for simulation; here all error is due to the noise imposed in the simulations).
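The RMSE used throughout is the square root of the mean squared difference between observations and predictions; a minimal illustration (variable names are our own):

```python
import numpy as np

def rmse(observed: np.ndarray, predicted: np.ndarray) -> float:
    """Root mean square error between observations and predictions."""
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))

# toy example: predictions off by a constant 2 give an RMSE of exactly 2
print(rmse(np.array([1.0, 2.0, 3.0]), np.array([3.0, 4.0, 5.0])))  # → 2.0
```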
Figure 4. Root Mean Square Errors across all simulations for the eight different levels of collinearity and using different collinearity structures for validation. Small linear changes, both increasing and decreasing absolute correlation (more/less), have little effect and are depicted together. Grey line indicates RMSE of the fit to the training data.
Summarised across all functional relationships and model types, we did not detect a trend of degeneration of model fit on the test-same, test-more or test-less data with increasing collinearity (Fig. 4). When the collinearity structure changed non-linearly or was completely lost, however, model fit decreased substantially and became much more variable as collinearity increased (test non-linear and test none, Fig. 4).
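The contrast between the test-same and test-none settings can be illustrated by drawing correlated training data from a multivariate normal distribution and test data with the same marginals but the correlation structure removed (a simplified two-predictor sketch of the idea, not the authors' simulation code):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 1000, 0.9

# training data: two predictors with correlation r ("test same" shares this)
cov_train = np.array([[1.0, r], [r, 1.0]])
X_train = rng.multivariate_normal([0.0, 0.0], cov_train, size=n)

# "test none": identical marginals, but the correlation structure is lost
X_test_none = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=n)

print(np.corrcoef(X_train.T)[0, 1])      # close to 0.9
print(np.corrcoef(X_test_none.T)[0, 1])  # close to 0.0
```

A model that has exploited the redundancy between the two training predictors can fail badly on the second data set, which is the pattern reported for test none in Fig. 4.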
As a first rough guide to which statistical approaches worked best, we analysed the shortlisted 23 methods plus the reference ‘ML true’ across all functional relationships (Fig. 5). When evaluated using the test data with the same collinearity structure, most methods performed very well in terms of RMSE. A moderate loss of performance was observed for PPLS, PCA-based clustering and BRT when the collinearity structure changed slightly (test-less). This trend was aggravated under non-linear changes of collinearity, where variability also began to increase substantially for several methods (among them GLM and several latent variable approaches). Using the test data without collinearity (test none), however, the verdict came clearly in favour of the select07/04 methods, ridge, lasso, DR, GAM and MARS. Other methods achieved similar median performance but exhibited much larger variability (GLM, seqreg, machine-learning methods). Neither latent variable (except DR) nor clustering approaches could compete. Because this may differ between functional relationships, we subsequently analysed it in more detail.
Figure 5. Root Mean Square Errors across all simulations for the different methods, using different collinearity structures for validation and sorted by median. Top: same correlation structure; bottom: none. Grey lines refer to RMSE on training data. Note that the sequence of models differs between panels. Results for test data ‘more’ were very similar to those for ‘less’; hence only the latter is shown.
Figure 6 shows the effect of increasing collinearity on the prediction accuracy (in terms of RMSE) of all models on the different test data sets. We found that collinearity affected model performance negatively for most methods and functional relationships (increasing RMSE towards the right in the panels of Fig. 6). Collinearity effects were generally non-linear, and almost all methods proved tolerant under weak collinearity (CN below 10). A threshold of CN = 30 (indicated in Fig. 6) was clearly too high for most methods analysed here. Notable exceptions to this pattern are PCA-based clustering and SVM, which improved in performance with increasing collinearity (although PCA-based clustering started from a very poor fit).
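The condition number (CN) quantifying collinearity here can be computed as the square root of the ratio of the largest to the smallest eigenvalue of the predictors' correlation matrix; a minimal sketch (a common definition, stated here as our assumption about the CN used; the example correlation of 0.9 is arbitrary):

```python
import numpy as np

def condition_number(X: np.ndarray) -> float:
    """Collinearity condition number: sqrt(largest / smallest eigenvalue)
    of the correlation matrix of the predictors. CN >= 1, with CN = 1
    corresponding to perfectly orthogonal predictors."""
    eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
    return float(np.sqrt(eigvals.max() / eigvals.min()))

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + np.sqrt(1 - 0.81) * rng.normal(size=500)  # |r| ~ 0.9 with x1
# for two predictors, CN = sqrt((1 + |r|) / (1 - |r|)), so ~ 4.4 here,
# i.e. still well below the CN = 30 rule of thumb
print(condition_number(np.column_stack([x1, x2])))
```

Note that two predictors need a pairwise correlation of roughly 0.998 before this two-variable CN exceeds 30, which is one way to see why CN = 30 is a very permissive threshold.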
Figure 6. Relative prediction accuracy on test data for an ideal model (ML true) and 23 collinearity methods as a function of collinearity in the data set. In each panel, solid/short-dashed/dotted/dash-dotted/long-dashed locally weighted smoothers (lowess) depict model predictions on same/more/less/non-linear/no-correlation data sets, respectively (not discernible in function 5 for select07 and select04 because they yield nearly identical values). The x-axis is log(condition number), depicted logarithmically; x-values are thus double-logged CNs (one log because CN is a ratio, the second because we chose logarithmic scaling of collinearity decay rates when generating the data). The vertical line (at CN = 30) indicates the rule-of-thumb threshold beyond which data set collinearity is deemed problematic.
The results across all collinearity test structures are complex (Fig. 6). They can first be summarised by looking for a general pattern of low and consistent RMSE across all condition numbers, excluding the hardest case, that of prediction to completely changed collinearity (the long-dashed line). Consistently well-performing methods include select04/07, GAM, ridge, lasso, MARS and DR. Some other methods were consistent, but at a higher RMSE level (Hoeffding/Ward and Spearman/average clustering, seqreg, LRR, OSCAR, randomForest, BRT and SVM).
When inspecting, by eye, the performance under severe collinearity (i.e. to the right of the CN = 30 line), we found that most methods outperformed GLM-like approaches. In particular, clustering, penalised and machine-learning approaches yielded lower-error models. However, several of the purpose-built latent variable techniques were only marginally better than the GLM, delaying the degeneration of model performance from a condition number of 10 (for the GLM) to 30 (LRR, DR, CPCA, PLS, PPLS). Two other noteworthy results are that the GAM also did well at high collinearity, while the commonly used principal component regression showed no improvement over the GLM.
The performance of methods changed only slightly across levels of functional complexity (Fig. 6). Trends became more pronounced as the underlying functions became non-linear, and at a level of functional complexity that might be typical for an ecological regression model (two quadratic terms and an interaction), clustering methods in particular suffered from poor model fits. Three of the four penalised approaches were also unable to regularise the model sufficiently, so that only the ridge still performed very well.
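Ridge regression stabilises estimates under collinearity by adding a penalty to the diagonal of the cross-product matrix, which shrinks coefficients towards zero and keeps the system invertible. A minimal closed-form sketch (the penalty value and toy data are arbitrary illustrations, not the settings used in the study):

```python
import numpy as np

def ridge(X: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Ridge estimator: solve (X'X + lam * I) beta = X'y.
    lam = 0 recovers ordinary least squares; lam > 0 shrinks the
    coefficients and regularises near-singular X'X."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)      # nearly collinear predictors
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=200)   # true coefficients: 1 and 1

print(ridge(X, y, 0.0))   # OLS: individual coefficients are unstable
print(ridge(X, y, 10.0))  # ridge: both shrink towards similar values
```

Under such near-collinearity, OLS still estimates the sum of the two coefficients well but splits it between the predictors almost arbitrarily, while the ridge estimate divides it evenly; this is the mechanism behind the stable ridge performance in Fig. 6.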
The most striking pattern we observed was the performance under changing collinearity structure. Since we generally have little idea of how environmental variables are correlated over time or space, this will, however, not help us decide which method to use. Generally, few of the methods were able to predict correctly under the most difficult combination of high collinearity in the training data and complete loss of the collinearity structure in the test data (the right tail of the long-dashed lines in Fig. 6). The methods for which RMSE stayed lowest under this combination were select04/07, ridge and MARS, with GAM, randomForest, BRT, SVM, lasso and OSCAR performing well up to a condition number of approximately 150–200 (2.2 on the log-scale of Fig. 6).
Some methods deserve specific comment. The PCA-based clustering was useful only under the highest collinearity. Under normal circumstances, using the most central variable in a cluster is likely to mislead variable identification; using the principal component of each cluster, however, was even worse (Supplementary material Appendix 1.2, Fig. A3). Select04 and select07 yielded nearly identical results in all runs. This is probably due to the way we generated our data, where correlations within a cluster are very high, so that both thresholds (|r| = 0.4 and |r| = 0.7) led to near-identical selections of predictors. Ridge penalisation failed to converge for the quadratic model (function 3) without collinearity (see also Tricks and tips in the Discussion). PPLS was the most unreliable approach, despite combining the strengths of PLS, GAM and penalisation. Finally, CWR yielded results very similar to those of the GLM, only slightly outperforming GLMs under high collinearity. Again, this is probably due to the way we generated our data: collinearity between variables was modelled as intrinsic rather than incidental due to outliers, which is the main (proposed) application domain of CWR.
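The behaviour of the two thresholds can be illustrated with a greedy pre-selection in the spirit of select04/select07: walk through the predictors in decreasing importance and keep a variable only if it is not too strongly correlated with any variable already kept. This is a hedged sketch of the general idea (the importance ordering is assumed given, e.g. from univariate fits; it is not the authors' implementation):

```python
import numpy as np

def select_by_correlation(X: np.ndarray, order: list[int],
                          threshold: float = 0.7) -> list[int]:
    """Greedy predictor pre-selection: keep column j (visited in the
    importance order given) only if its absolute Pearson correlation
    with every already-kept column stays at or below the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept: list[int] = []
    for j in order:
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return kept

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = 0.9 * x1 + np.sqrt(1 - 0.81) * rng.normal(size=300)  # |r| ~ 0.9 with x1
x3 = rng.normal(size=300)                                  # independent
X = np.column_stack([x1, x2, x3])
print(select_by_correlation(X, order=[0, 1, 2]))  # x2 dropped: [0, 2]
```

With within-cluster correlations around 0.9, any threshold between 0.4 and 0.7 drops the same variables, which is consistent with select04 and select07 behaving near-identically here.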
For each group of approaches, our simulations suggest the following most promising methods: from the control group, GAM; from clustering, Hoeffding/Ward or Spearman/average; from the latent variable approaches, DR; from the penalised approaches, ridge; and from the machine-learning group, MARS, randomForest and BRT.