## Introduction

The last three decades have seen an enormous expansion of the statistical tools available to applied ecologists. A short list of available techniques includes linear regression, generalized linear (mixed) modelling, generalized additive (mixed) modelling, regression and classification trees, survival analysis, neural networks, multivariate analysis with all its many methods such as principal component analysis (PCA), canonical correspondence analysis (CCA), (non-)metric multidimensional scaling (NMDS), various time series and spatial techniques, etc. Although some of these techniques have been around for some time, the development of fast computers and freely available software such as R (R Development Core Team 2009) makes it possible to routinely apply sophisticated statistical techniques on any type of data. This paper is not about these methods. Instead, it is about the vital step that should, but frequently does not, precede their application.

All statistical techniques have in common the problem of ‘rubbish in, rubbish out’. In some methods, for example, a single outlier may determine the final results and conclusions. Heterogeneity (differences in variation) may cause serious trouble in linear regression and analysis of variance models (Fox 2008), and with certain multivariate methods (Huberty 1994).
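A minimal sketch of the outlier problem, using simulated data (not the data from this paper's Appendix S1): a single aberrant observation can substantially change an ordinary least-squares slope, and an influence measure such as Cook's distance flags it.

```r
# Illustrative sketch with simulated data: one outlier can dominate
# an ordinary least-squares fit.
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.5)
fit_clean <- lm(y ~ x)

y_out <- y
y_out[20] <- 40               # a single aberrant observation
fit_out <- lm(y_out ~ x)

coef(fit_clean)["x"]          # slope close to the true value of 0.5
coef(fit_out)["x"]            # slope inflated by the single outlier
cooks.distance(fit_out)[20]   # Cook's distance flags observation 20
```

Because the outlier sits at an extreme covariate value, it has high leverage as well as a large residual, which is exactly the combination that Cook's distance is designed to detect.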

When the underlying question is to determine which covariates are driving a system, then the most difficult aspect of the analysis is probably how to deal with collinearity (correlation between covariates), which increases type II errors (i.e. failure to reject the null hypothesis when it is untrue). In multivariate analysis applied to data on ecological communities, the presence of double zeros (e.g. two species being jointly absent at various sites) contributes towards similarity in some techniques (e.g. PCA), but not others. Yet other multivariate techniques are sensitive to species with clumped distributions and low abundance (e.g. CCA). In univariate analysis techniques like generalized linear modelling (GLM) for count data, zero inflation of the response variable may cause biased parameter estimates (Cameron & Trivedi 1998). When multivariate techniques use permutation methods to obtain *P*-values, for example in CCA and redundancy analysis (RDA, ter Braak & Verdonschot 1995), or the Mantel test (Legendre & Legendre 1998), temporal or spatial correlation between observations can increase type I errors (rejecting the null hypothesis when it is true).

The same holds for regression-type techniques applied to temporally or spatially correlated observations. One of the most used, and misused, techniques is without doubt linear regression. This technique is often associated with linear patterns and normality, and both concepts are often misunderstood. Linear regression is more than capable of fitting nonlinear relationships, e.g. by using interactions or quadratic terms (Montgomery & Peck 1992). The term ‘linear’ in linear regression refers to the way the parameters enter the model, not to the type of relationships that are modelled. Knowing whether we have linear or nonlinear patterns between response and explanatory variables is crucial for how we apply linear regression and related techniques. We also need to know whether the data are balanced before including interactions. For example, Zuur, Ieno & Smith (2007) used the covariates sex, location and month to model the gonadosomatic index (the weight of the gonads relative to total body weight) of squid. However, the sexes were not both measured at every location in each month, and the sampling was in fact so unbalanced that it made more sense to analyse only a subset of the data and to refrain from including certain interactions.
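The point that ‘linear’ refers to the parameters can be made concrete with a short simulated example (again, not from this paper's data): a quadratic relationship is still a linear model, because the model is linear in its coefficients.

```r
# Sketch with simulated data: 'linear' means linear in the parameters,
# so a quadratic relationship is still fitted by linear regression.
set.seed(3)
x <- seq(-3, 3, length.out = 50)
y <- 1 + 2 * x - 0.7 * x^2 + rnorm(50, sd = 0.4)

fit <- lm(y ~ x + I(x^2))   # linear in beta0, beta1 and beta2
coef(fit)                    # estimates close to 1, 2 and -0.7
```

The fitted curve is parabolic, yet every estimation step is ordinary linear least squares; the same logic extends to interactions and other transformed covariates.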

With this wealth of potential pitfalls, ensuring that the scientist does not discover a false covariate effect (type I error), wrongly dismiss a model with a particular covariate (type II error) or produce results determined by only a few influential observations, requires that detailed data exploration be applied before any statistical analysis. The aim of this paper is to provide a protocol for data exploration that identifies potential problems (Fig. 1). In our experience, data exploration can take up to 50% of the time spent on analysis.

Although data exploration is an important part of any analysis, it is important that it be clearly separated from hypothesis testing. Decisions about what models to test should be made *a priori* based on the researcher’s biological understanding of the system (Burnham & Anderson 2002). When that understanding is very limited, data exploration can be used as a hypothesis-generating exercise, but this is fundamentally different from the process that we advocate in this paper. Using aspects of a data exploration to search out patterns (‘data dredging’) can provide guidance for future work, but the results should be viewed very cautiously and inferences about the broader population avoided. Instead, new data should be collected based on the hypotheses generated and independent tests conducted. When data exploration is used in this manner, both the process used and the limitations of any inferences should be clearly stated.

Throughout the paper we focus on the use of graphical tools (Chatfield 1998; Gelman, Pasarica & Dodhia 2002), but in some cases it is also possible to apply tests for normality or homogeneity. The statistical literature, however, warns against certain tests and advocates graphical tools (Montgomery & Peck 1992; Draper & Smith 1998; Quinn & Keough 2002). Läärä (2009) gives seven reasons for not applying preliminary tests for normality, including: most statistical techniques based on normality are robust against violation; for larger data sets the central limit theorem implies approximate normality; for small samples the power of the tests is low; and for larger data sets the tests are sensitive to small deviations (contradicting the central limit theorem).
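The sample-size sensitivity of formal normality tests can be illustrated with a constructed example (our own sketch, not from Läärä 2009): the same mild, deterministic deviation from normality, quantiles of a *t*-distribution with 5 degrees of freedom, goes undetected by the Shapiro–Wilk test at *n* = 20 but is declared highly significant at *n* = 5000, even though the deviation itself has not changed. A QQ-plot, by contrast, shows the heavy tails directly at any sample size.

```r
# Sketch: the same mild deviation from normality (t-distribution
# quantiles, 5 d.f.) evaluated at two sample sizes.
x_small <- qt(ppoints(20),   df = 5)
x_large <- qt(ppoints(5000), df = 5)

shapiro.test(x_small)$p.value   # low power at n = 20
shapiro.test(x_large)$p.value   # far below 0.05 at n = 5000

qqnorm(x_large)                 # the graph shows the heavy tails directly
qqline(x_large)
```

This is exactly Läärä's point: with large samples the test rejects deviations that are practically irrelevant, while with small samples it misses deviations that matter, so the graph is the more informative diagnostic.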

All graphs were produced using the software package R (R Development Core Team 2009). All R code and data used in this paper are available in Appendix S1 (Supporting Information) and from http://www.highstat.com.