## Introduction

Ecologists often use regression to elucidate relationships among variables and to build predictive models, but features of ecological data complicate regression analyses. Ecological data are complex, often including non-normal errors, nonlinear relationships and variables that are spatially or temporally autocorrelated. To address these complexities, ecologists routinely apply flexible modelling approaches. For example, generalized linear models (GLMs; McCullagh & Nelder 1989) allow users to specify appropriate response distributions with link functions and to pre-specify nonlinear relationships, such as logarithmic transformations for positive predictor variables. However, pre-specification of nonlinearities is often intractable when relationships are unknown or when the number of relationships is large. Ecologists have applied generalized additive models (GAMs; Yuan & Norton 2003; Austin 2007) to overcome this limitation of GLMs. GAMs do not require pre-specification of nonlinearities, yet preserve the ability of GLMs to construct complex models (Hastie & Tibshirani 1990; Hastie, Tibshirani & Friedman 2009). In addition, GAMs automatically identify nonlinearities using flexible nonlinear modelling approaches (usually based on spline smoothing) while preserving the easy interpretability of predictor–response relationships that GLMs offer (Hastie & Tibshirani 1990; Wood 2006).

Ecologists should also consider general modelling issues, like overfitting, variable selection and prediction. Overfitting often results from including too many covariates for a given sample size and yields overly complex models that contain spurious effects; it also decreases prediction accuracy (Hastie, Tibshirani & Friedman 2009). Variable selection is the process of identifying the subset of covariates that explain meaningful variation in the response and excluding covariates that add no explanatory value to a model. Methods available to address overfitting and variable selection include penalized estimation (e.g. the lasso or ridge regression), cross-validation, pruning of decision trees, early stopping of boosting algorithms, model selection using criteria like AIC (see Hastie, Tibshirani & Friedman 2009), and Bayesian regularization (O'Hara & Sillanpää 2009).
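As a minimal illustration of penalized estimation, the sketch below applies the lasso to synthetic data in which only two of ten candidate covariates truly affect the response (the data, penalty value and library are our assumptions, not part of this study). The L1 penalty shrinks the coefficients of uninformative covariates towards zero, so estimation and variable selection happen in a single step.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))                            # ten candidate covariates
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 100)   # only the first two matter

# The L1 penalty (alpha) shrinks coefficients of uninformative
# covariates to (near) zero while retaining the true effects.
coef = Lasso(alpha=0.1).fit(X, y).coef_
retained = np.flatnonzero(np.abs(coef) > 0.5)             # clearly retained covariates
```

With these settings, only the two informative covariates keep substantial coefficients; the penalty strength would, in practice, be chosen by cross-validation rather than fixed in advance.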

Prediction accuracy is an especially important requirement of ecological models, and ecologists have applied machine learning algorithms (e.g. bagging, boosting, random forests; Breiman 1996, 2001; Freund & Schapire 1996) to increase prediction accuracy over standard regression methods (e.g. Cutler *et al.* 2007; Elith, Leathwick & Hastie 2008; Maloney *et al.* 2009). Machine learning procedures also incorporate methods to address overfitting and model selection. Unfortunately, some machine learning techniques (e.g. bagging and random forests) produce estimates of predictor–response relationships (marginal functions) that are difficult to interpret because they are based on complex ensembles of decision trees (Cutler *et al.* 2007). Ecologists need modelling approaches that combine the increased prediction accuracy of machine learning algorithms with the interpretability and flexibility of GAMs.
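The early-stopping safeguard mentioned above can be sketched briefly. The hedged example below (scikit-learn and all settings are our illustrative choices) fits a gradient-boosted tree ensemble that is allowed up to 1000 trees but halts as soon as held-out loss stops improving, limiting overfitting without hand-tuning the ensemble size.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(400, 3))
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.2, 400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Early stopping: a fifth of the training data is held out, and boosting
# halts once that held-out loss stops improving for 10 rounds, even
# though up to 1000 trees are permitted.
gbm = GradientBoostingRegressor(
    n_estimators=1000, learning_rate=0.05, max_depth=2,
    validation_fraction=0.2, n_iter_no_change=10, random_state=0,
).fit(X_tr, y_tr)
```

The fitted ensemble (`gbm.n_estimators_` trees) predicts well on the test split, but its marginal predictor–response relationships are spread across many trees, illustrating the interpretability cost discussed above.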

Here, we present a recently developed technique that extends the procedure of gradient boosting to GAMs (Bühlmann & Hothorn 2007), hereafter referred to as boosted GAMs. To illustrate the advantages of this method, we develop boosted GAM models that identify the relationships between watershed-scale environmental and anthropogenic factors and eight measures of small-stream communities within Maryland, USA. All models account for spatial dependencies in the data. We use bootstrapping to compare the prediction accuracies of traditional and boosted GAMs. Although our example focuses on stream data, it demonstrates how boosted GAMs can be used to address basic and applied ecological questions in other systems.