## Introduction

Phylogenetic methods have become *de rigueur* in the analysis of interspecies data (Harvey & Pagel 1991). This is because species are non-independent for the purposes of statistical analysis due to their common history (Felsenstein 1985; Harvey & Pagel 1991). This problem of the statistical dependence of related species has been solved in various different ways for different types of data and scientific questions (e.g. Ridley 1983; Felsenstein 1985, 2005, 2008; Cheverud, Dow & Leutenegger 1985; Grafen 1989; Pagel & Harvey 1989; Maddison 1990; Garland *et al.* 1993; Hansen 1997; Pagel 1999; Garland & Ives 2000; Rohlf 2001; Butler & King 2004; Ives, Midford & Garland 2007; Revell *et al.* 2007; Hansen, Pienaar & Orzack 2008; Lajeunesse 2009; Revell & Collar 2009; reviewed in Harvey & Pagel 1991; Felsenstein 2004). Arguably, the most widely used statistical method for the analysis of interspecific data that accounts for the historical relationships of species is the phylogenetic regression (Felsenstein 1985; Grafen 1989).

Typical linear regression analysis is of the form: **y** = **Xβ** + **ɛ**, with the ordinary least squares (OLS) solution: $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$, in which **y** is an *n* × 1 vector (for *n* species) containing values for the dependent variable, *Y*; **X** is an *n* × (*m* + 1) matrix containing 1·0s in the first column and the *m* independent (explanatory) variables of the model in columns two through *m* + 1; and $\hat{\boldsymbol{\beta}}$ is a vector containing the parameter estimates (including intercept) of the fitted univariate or multivariate linear regression model (Rencher & Schaalje 2008). **ɛ** is an *n* × 1 vector containing the residual error in the model, and under OLS it is assumed that **ɛ** is multivariate normally distributed with a variance–covariance matrix given by $\sigma_\varepsilon^2\mathbf{I}$. Here, **I** is the identity matrix (an *n* × *n* matrix containing 1·0s on the diagonal and zeroes elsewhere), and $\sigma_\varepsilon^2$ is the residual variance of the model (i.e. the variability in *Y* not explained by the regressors).
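As a minimal sketch of the OLS estimator described above, the following NumPy code builds the design matrix with a leading column of ones and solves the normal equations. The function name `ols_fit` and the five-species data set are hypothetical and serve only as illustration.

```python
import numpy as np

def ols_fit(X, y):
    """Return the OLS coefficient estimates and residual variance for y = X beta + eps."""
    # Solve the normal equations (X'X) beta = X'y without forming an explicit inverse.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    residuals = y - X @ beta_hat
    n, p = X.shape
    # Unbiased estimate of the residual variance (n - p degrees of freedom).
    sigma2_hat = residuals @ residuals / (n - p)
    return beta_hat, sigma2_hat

# Hypothetical data: five species and one explanatory variable.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 0.5 * x  # noise-free, so the fit recovers intercept 2.0 and slope 0.5
X = np.column_stack([np.ones_like(x), x])  # first column of 1.0s carries the intercept
beta_hat, sigma2_hat = ols_fit(X, y)
```

Solving the linear system rather than inverting **X**′**X** directly is a numerical choice, not part of the statistical model.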

If the residuals in **ɛ** are not distributed according to $N(\mathbf{0}, \sigma_\varepsilon^2\mathbf{I})$, but instead according to $N(\mathbf{0}, \sigma_\varepsilon^2\mathbf{C})$, in which **C** is known and is not proportional to **I** (i.e. **C** ≠ *k***I** for *k* ∈ **R** and *k* > 0·0), then fitting the regression model becomes a generalized (instead of ordinary) least squares problem (Rohlf 2001; Kariya & Karuta 2004; Rencher & Schaalje 2008). For non-phylogenetic data, **C** ≠ *k***I** might be true, for example, if the sampling variance of *Y* is uneven across data points (i.e. if our data for *Y* have been collected with varying amounts of error). In this situation, **C** would be a diagonal matrix containing the *n* sampling variances of the observations for *Y*. Here, the generalized least squares regression would be the same as a weighted regression in which the weights are proportional to the inverse of the sampling variances for each observation of *Y*. In the phylogenetic case, the problem is not usually that the diagonal of **C** is uneven – all extant taxa in a phylogeny are temporally equidistant from the root of the tree (by definition) so they are frequently assumed to have equivalent variances (given that they are all extant and have been measured with comparable accuracy; but see Ives, Midford & Garland 2007). Rather, in the phylogenetic case, the problem is that the off-diagonals of **C** are non-zero due to the correlated histories of related species (Butler, Schoener & Losos 2000; Garland & Ives 2000).
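The non-phylogenetic case of a diagonal **C** can be illustrated with a short sketch. It assumes hypothetical sampling variances for four observations and shows that weighting by inverse variance gives the same coefficients as rescaling each row by the inverse standard deviation and then fitting by ordinary least squares. The function name `weighted_ls` and all data values are illustrative assumptions.

```python
import numpy as np

def weighted_ls(X, y, sampling_var):
    """Weighted least squares with weights equal to inverse sampling variances."""
    w = 1.0 / sampling_var
    XtW = X.T * w  # broadcasting: identical to X.T @ np.diag(w)
    return np.linalg.solve(XtW @ X, XtW @ y)

# Hypothetical data: the last two observations were measured with more error.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.9, 5.2, 6.8])
sampling_var = np.array([0.1, 0.1, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])

beta_wls = weighted_ls(X, y, sampling_var)

# Equivalent route: divide each row by its standard deviation, then fit by OLS.
scale = 1.0 / np.sqrt(sampling_var)
beta_scaled, *_ = np.linalg.lstsq(X * scale[:, None], y * scale, rcond=None)
```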

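To solve this problem, we can find the minimum variance regression slope and intercept using the generalized least squares estimating equation (or Gauss–Markov estimator; Kariya & Karuta 2004):

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{C}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{C}^{-1}\mathbf{y}$$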

This approach to the regression of interspecies data was first suggested by Grafen (1989), and has since been shown to be exactly equivalent to regression estimated using the contrasts method of Felsenstein (1985; Garland & Ives 2000; Rohlf 2001). The generalized least squares estimating equation is similar to the OLS estimator (given above), except that now we have down-weighted each observation for *Y* (and corresponding row of **X**) depending on the correlation of its residual error with the other observations in our set.
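A minimal sketch of the generalized least squares estimator, $(\mathbf{X}'\mathbf{C}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{C}^{-1}\mathbf{y}$, is given below. The function name `gls_fit`, the four-taxon matrix **C** and the trait values are hypothetical. The matrix corresponds to two pairs of sister species that share half of their evolutionary history.

```python
import numpy as np

def gls_fit(X, y, C):
    """Generalized least squares estimate (X' C^-1 X)^-1 X' C^-1 y."""
    # Solve linear systems instead of inverting C explicitly.
    Cinv_X = np.linalg.solve(C, X)
    Cinv_y = np.linalg.solve(C, y)
    return np.linalg.solve(X.T @ Cinv_X, X.T @ Cinv_y)

# Hypothetical covariance structure for four taxa (two pairs of sister species).
C = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])
x = np.array([0.2, 0.5, 1.9, 2.4])
y = np.array([1.1, 1.3, 2.8, 3.5])
X = np.column_stack([np.ones_like(x), x])

beta_gls = gls_fit(X, y, C)

# With C equal to the identity matrix the estimator reduces to OLS.
beta_identity = gls_fit(X, y, np.eye(len(y)))
```

When **C** is replaced by **I**, the weighting disappears and the result matches the OLS fit, which is a convenient check on any implementation.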

Under a simple Brownian motion model for evolutionary change in *Y* and the *X*s (Cavalli-Sforza & Edwards 1967; Felsenstein 1985, 2004), **y** (or any column of **X**, barring the first) is expected to be distributed as a multivariate normal with variance–covariance matrix given by $\sigma_y^2\mathbf{C}$ (or $\sigma_x^2\mathbf{C}$), in which **C** contains the height of each of the *n* tips of the tree on its diagonal, as well as the heights of the most recent common ancestor of each species pair *i* and *j* in each *i*,*j*th off-diagonal position (Felsenstein 1973; O’Meara *et al.* 2006). $\sigma_y^2$ (or $\sigma_x^2$) gives the phylogenetic variance or ‘evolutionary rate’ for *Y* (or *X*; O’Meara *et al.* 2006; Revell 2008). More importantly, however, **ɛ** = **y** − **Xβ** will also be distributed according to a multivariate normal with variance–covariance matrix given by $\sigma_\varepsilon^2\mathbf{C}$ under this evolutionary scenario. Figure 1(b) shows the computation of **C** from a simplified five taxon tree given in Fig. 1(a).
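The construction of **C** from a tree can be sketched as follows. This is not the five-taxon tree of Fig. 1, which is not reproduced here; instead it assumes a hypothetical four-taxon ultrametric tree, ((sp1:1,sp2:1):1,(sp3:1·5,sp4:1·5):0·5), stored as a mapping from each node to its parent and branch length. The names `edges`, `path_to_root` and `vcv_from_edges` are illustrative.

```python
import numpy as np

# Hypothetical tree: each node maps to (parent, length of the branch leading to it).
edges = {
    "n1": ("root", 1.0),
    "n2": ("root", 0.5),
    "sp1": ("n1", 1.0),
    "sp2": ("n1", 1.0),
    "sp3": ("n2", 1.5),
    "sp4": ("n2", 1.5),
}
tips = ["sp1", "sp2", "sp3", "sp4"]

def path_to_root(node, edges):
    """Return {node: branch length} for every branch between `node` and the root."""
    path = {}
    while node in edges:
        parent, length = edges[node]
        path[node] = length
        node = parent
    return path

def vcv_from_edges(tips, edges):
    """C[i, j] is the total length of the branches shared by tips i and j."""
    paths = [path_to_root(tip, edges) for tip in tips]
    n = len(tips)
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            shared = paths[i].keys() & paths[j].keys()
            C[i, j] = sum(paths[i][node] for node in shared)
    return C

C = vcv_from_edges(tips, edges)
# Diagonal entries are tip heights (2.0 for every tip in this ultrametric tree);
# off-diagonal entries are the heights of the most recent common ancestors.
```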

When data for our dependent and independent variables come from species, it is a common procedure to estimate the degree to which each variable is distributed according to the variance–covariance matrices $\sigma_y^2\mathbf{C}$ and $\sigma_x^2\mathbf{C}$. This measurement, which can be taken in a variety of ways, is usually described as a measure of ‘phylogenetic signal’ for the characters in question (e.g. Blomberg & Garland 2002; Freckleton, Harvey & Pagel 2002; Blomberg, Garland & Ives 2003; Revell, Harmon & Collar 2008). If *X* and *Y* have evolved by Brownian motion, then their phylogenetic signal will be high (i.e. close to 1·0; Revell, Harmon & Collar 2008). Furthermore, if *X* and *Y* have evolved by Brownian motion then **ɛ** = **y** − **Xβ** will be distributed according to $N(\mathbf{0}, \sigma_\varepsilon^2\mathbf{C})$ and the phylogenetic regression is an appropriate method to analyze the relationship between the independent variables contained in **X** and the response variable of our model, *Y*. Thus, it is tempting to use high phylogenetic signal in the dependent and/or independent variables as a justification for the phylogenetic regression. This is, in fact, commonly done (e.g. Ashton 2002; Gustafsson & Lindenfors 2004; Rezende, Bozinovic & Garland 2004; Muñoz-Garcia & Williams 2005; Collen *et al.* 2006; Ebensperger & Blumstein 2006; Ezenwa *et al.* 2006; Duminil *et al.* 2007; Hendrixson, Sterner & Kay 2007; Johnson, Isaac & Fisher 2007; Rönn, Katvala & Arnqvist 2007; Beaulieu *et al.* 2008; Capellini *et al.* 2008; Møller, Neilsen & Garamzegi 2008; Lovegrove 2009; Lindenfors, Revell & Nunn 2010).

However, it does not follow that if phylogenetic signal for *X* and/or *Y* is relatively high then **ɛ** = **y** − **Xβ** will necessarily be distributed according to $N(\mathbf{0}, \sigma_\varepsilon^2\mathbf{C})$. Furthermore, it is also possible that even if phylogenetic signal is very low, **ɛ** = **y** − **Xβ** may still be distributed with variance–covariance matrix $\sigma_\varepsilon^2\mathbf{C}$. Thus, the appropriate test for phylogenetic signal is actually on the residual variability in *Y* given our regression model – a test which is relatively infrequently applied. In this study, I simulate scenarios in which *X* and/or *Y* have relatively high phylogenetic signal, but in which **ɛ** = **y** − **Xβ** is non-phylogenetic and thus the phylogenetic regression is inappropriate. I show that using a phylogenetic regression here will induce increased variance on the regression estimator. I also examine the possibility that *X* and/or *Y* are non-phylogenetic, but that **ɛ** = **y** − **Xβ** is distributed according to $N(\mathbf{0}, \sigma_\varepsilon^2\mathbf{C})$. In this case, the phylogenetic regression is appropriate; however, standard diagnostic tests on *X* and *Y* might be taken to imply that ‘phylogenetic correction’ of the regression is unnecessary. I show that ignoring phylogeny in this case can lead to poor statistical performance of the regression. Finally, I repeat a maximum likelihood procedure using the *λ* statistic of Pagel (1999) in which we simultaneously estimate phylogenetic signal and the regression parameters (e.g. Revell 2009), thus obviating the need for a priori estimation of phylogenetic signal in the regression variables.
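The idea of estimating *λ* and the regression parameters together can be sketched roughly as below. This is an illustrative simplification rather than the procedure used in this study: it assumes that *λ* multiplies the off-diagonal elements of **C**, profiles the multivariate normal likelihood over the regression coefficients and the residual variance, and searches a coarse grid of *λ* values in place of a numerical optimizer. The eight-taxon balanced tree, the simulated data and all function names are hypothetical.

```python
import numpy as np

def lambda_transform(C, lam):
    """Multiply the off-diagonal elements of C by lam, leaving the diagonal unchanged."""
    C_lam = lam * C
    np.fill_diagonal(C_lam, np.diag(C))
    return C_lam

def profile_loglik(lam, X, y, C):
    """Log-likelihood at lam with beta and sigma^2 set to their ML values given lam."""
    n = len(y)
    C_lam = lambda_transform(C, lam)
    Cinv_X = np.linalg.solve(C_lam, X)
    beta = np.linalg.solve(X.T @ Cinv_X, Cinv_X.T @ y)  # GLS estimate given lam
    r = y - X @ beta
    sigma2 = r @ np.linalg.solve(C_lam, r) / n          # ML residual variance given lam
    _, logdet = np.linalg.slogdet(C_lam)
    loglik = -0.5 * (n * np.log(2.0 * np.pi * sigma2) + logdet + n)
    return loglik, beta, sigma2

# Hypothetical balanced eight-taxon tree with unit branch lengths (tip height 3).
idx = np.arange(8)
C = ((idx[:, None] == idx[None, :]).astype(float)
     + (idx[:, None] // 2 == idx[None, :] // 2)
     + (idx[:, None] // 4 == idx[None, :] // 4))

# Simulated data (assumption): residuals drawn with covariance C, i.e. lambda = 1.
rng = np.random.default_rng(1)
x = rng.standard_normal(8)
eps = np.linalg.cholesky(C) @ rng.standard_normal(8)
y = 1.0 + 0.5 * x + eps
X = np.column_stack([np.ones(8), x])

# Coarse grid search over lambda in [0, 1].
lambdas = np.linspace(0.0, 1.0, 101)
fits = [profile_loglik(lam, X, y, C) for lam in lambdas]
best = int(np.argmax([fit[0] for fit in fits]))
lam_hat = lambdas[best]
loglik_hat, beta_hat, sigma2_hat = fits[best]
```

At *λ* = 0 the transformed matrix is diagonal and the fit reduces to OLS for an ultrametric tree; at *λ* = 1 it is the phylogenetic regression described above. With only eight taxa the likelihood surface for *λ* is expected to be flat, so the estimate from this toy example should not be over-interpreted.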