Most authors (e.g. Hugueny & Paugy 1995; Caley & Schluter 1997; Griffiths 1997; Oberdorff *et al*. 1998) have concluded that local species richness is directly proportional to regional species richness—the proportional sampling hypothesis (Cornell & Lawton 1992). Others (see Cornell & Karlson 1997) have found evidence for a levelling off in local richness as regional richness increases, consistent with the saturation of communities with species in some regions. Figure 1a shows the expected relations for the proportional sampling hypothesis and one version of the saturation curve.

Here I consider the effects of inadequate sampling, inadequate pool identification and statistical methodology (polynomial vs logarithmic and unconstrained vs constrained regression) on the shape of the local–regional relation detected, and examine other possible interpretations of a claimed non-interactive assemblage (Oberdorff *et al*. 1998). The position of the intercept and the degree of curvature of the regression line are central to shape identification.

Inadequate sampling can affect where the local–regional relation crosses the ordinate. It is conceivable that local diversity at any given site could be zero when some species occur in the region. Whether or not this is the case will depend, in part, on the sampling effort. Sampling effort that is inadequate but uniform across regions will move the intercept below zero by shifting the regression line to the right but should not affect the shape of the relation. If the degree of undersampling is a function of regional richness, the regression line might not pass through the origin (Fig. 1b).

Changes in species rank-abundance patterns (or dispersion patterns) with regional richness could also affect the type of relation detected; for example, a shift from a geometric to a log-normal series pattern with increasing regional richness will result in a smaller fraction of the assemblage being sampled. A trend of reduced dominance with increasing species richness, which is plausible (Gray 1987; Whittaker 1975), will increase the chance of falsely detecting a saturating relationship.

Correctly identifying regional pools can be problematic; for example, Oberdorff *et al*. (1998) defined their regional pool in ecological terms, noting that ‘a regional but ecologically based species richness would include only those species that are able to maintain populations within the sites studied’. The ecological pools were identified from a more geographically widespread correspondence analysis (Verneaux 1981) which divided a species continuum into nine categories. How the sample sites were related to Verneaux's categories was not explained. This step is important because identifying an assemblage as indicative of, for example, Verneaux's metarhithron rather than the adjacent mesorhithron category would increase the regional pool size from 10 to 15 (i.e. by 50%) or, if only the two most common abundance classes were used, from 5 to 9 (by 80%). Moreover, the 10 species found by Oberdorff *et al*. (1998) do not appear to come from a single habitat because they divide into two ecological assemblages, one characteristic of relatively fast-flowing conditions (salmon, trout, minnow, bullhead, stoneloach) and the other of much slower flows (gudgeon, roach, chub and dace). Misidentifications of pools will shift the intercept up or down, depending on whether they have been under- or overestimated; incorrect inclusion seems more likely than exclusion, favouring subzero intercepts.

Two methods have been used to test the shape of the local–regional relation.

**1.** Fit a second-order polynomial regression, the lower order, linear, model being used to test for linearity if the second-order term is not significant (Hawkins & Compton 1992).

Some authors (Hugueny & Paugy 1995; Caley & Schluter 1997; Cornell & Karlson 1997; Oberdorff *et al*. 1998) have omitted the constant, thereby constraining the line to pass through the origin, because ‘when regional diversity is zero, so too is local diversity’ (Caley & Schluter 1997). It is correct that local richness must, by definition, be equal to or less than regional richness but the line need not pass through the origin, for the reasons outlined above.

Accordingly, when using polynomial regression, the intercept *a* should be ≤ 0 but not necessarily zero. The appropriate procedure (Neter *et al*. 1996) is to fit the unconstrained model and test whether *a* = 0. If *a* is not significantly greater than 0, the estimates from the unconstrained model should be used. Only if *a* is significantly greater than 0 should the constrained model (*a* = 0) be used to estimate the slope coefficient, because positive values of the intercept are not theoretically possible.
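The procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis used here: it assumes NumPy is available, the function names `fit_unconstrained_polynomial` and `choose_model` are hypothetical, the ten data points are invented, and the critical value `t_crit` is left for the analyst to supply from tables.

```python
import numpy as np

def fit_unconstrained_polynomial(regional, local):
    """Least-squares fit of local = a + b1*regional + b2*regional**2.

    Returns (coefficients, standard errors, residual degrees of freedom),
    with coefficients ordered (a, b1, b2).
    """
    x = np.asarray(regional, dtype=float)
    y = np.asarray(local, dtype=float)
    design = np.column_stack([np.ones_like(x), x, x ** 2])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    df = len(y) - design.shape[1]
    sigma2 = float(resid @ resid) / df            # residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return coef, np.sqrt(np.diag(cov)), df

def choose_model(coef, se, t_crit):
    """Constrain the fit (a = 0) only if a is significantly greater than 0.

    t_crit is the one-tailed critical t value for the residual degrees of
    freedom, supplied by the caller.
    """
    t_a = coef[0] / se[0]
    return "constrained" if t_a > t_crit else "unconstrained"

# Invented data: ten regions, local richness always below regional richness.
regional = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
local = [2, 9, 12, 19, 22, 29, 32, 39, 42, 49]

coef, se, df = fit_unconstrained_polynomial(regional, local)
# 1.895 is the tabulated one-tailed 5% point of t with 7 degrees of freedom.
decision = choose_model(coef, se, t_crit=1.895)
print(coef[0], df, decision)
```

With these invented values the fitted intercept is negative, so the unconstrained estimates would be retained.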

Previous workers have used constrained regression when examining local–regional species richness, even when intercepts are not significantly different from 0. This greatly inflates *r*^{2} values, because the regression is no longer centred on the mean of the dependent variable (the total sum of squares about zero is used, rather than about the mean as in unconstrained regression; Wilkinson, Blank & Gruber 1996). Table 1 illustrates the effect for a sample of 20 observations of two independent normally distributed random variables (mean *X* = 25, SD = 4; mean *Y* = 10, SD = 2). More seriously, both slopes for the constrained polynomial are now statistically significant, indicating, as *b*_{2} was negative, that there is a saturating relationship despite the independence of the two variables.

Regression | Model | *P*(*b*_{1} = 0) | *P*(*b*_{2} = 0) | *r*^{2} |
---|---|---|---|---|
Unconstrained | Polynomial | 0·426 | 0·528 | |
Unconstrained | Linear | 0·117 | | 0·131 |
Constrained | Polynomial | < 0·001 | < 0·001 | |
Constrained | Linear | < 0·001 | | 0·907 |
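The inflation of *r*^{2} is easy to reproduce for the linear case. The sketch below uses only the Python standard library and draws 20 independent observations with the means and standard deviations given above. The seed is arbitrary and the function names are hypothetical, so the values will not match Table 1; only the contrast between the two *r*^{2} definitions is of interest.

```python
import random

def r2_unconstrained(x, y):
    """r^2 for y = a + b*x, total sum of squares taken about the mean of y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    a = my - b * mx
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1 - sse / sst

def r2_through_origin(x, y):
    """r^2 for y = b*x, total sum of squares taken about zero."""
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    sse = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))
    sst0 = sum(yi * yi for yi in y)
    return 1 - sse / sst0

rng = random.Random(1)                       # arbitrary seed
x = [rng.gauss(25, 4) for _ in range(20)]    # 'regional' variable
y = [rng.gauss(10, 2) for _ in range(20)]    # independent 'local' variable

r2_u = r2_unconstrained(x, y)
r2_c = r2_through_origin(x, y)
print(f"unconstrained r2 = {r2_u:.3f}, through-origin r2 = {r2_c:.3f}")
```

Because *Y* lies far from zero relative to its spread, most of the sum of squares about zero is accounted for by any line through the origin that reaches the data cloud, whether or not *X* carries information about *Y*.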

**2.** Perform a linear regression on logarithmically transformed data (Griffiths 1997): slopes significantly below 1 indicate curvilinearity. The logarithm of zero is not defined but it is possible to test whether the line passes through the origin with log (*x* + 1) transformed data: when log (regional richness + 1) = 0 then log (local richness + 1) should not differ significantly from 0. However, note that this transformation will introduce an error (usually small) into the slope estimate and tests of the slope (*b* = 1) should be performed on log-log transformed data.
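The log-log slope test can be sketched as follows, using only the Python standard library. The function name `loglog_slope_test` and the six data points are invented for illustration. The returned *t* statistic would be compared with Student's *t* on *n* − 2 degrees of freedom.

```python
from math import log10, sqrt

def loglog_slope_test(regional, local):
    """Regress log10(local) on log10(regional).

    Returns (b, se_b, t), where t = (b - 1) / se_b tests H0: b = 1.
    Richness values must be positive (the logarithm of zero is undefined).
    """
    x = [log10(r) for r in regional]
    y = [log10(v) for v in local]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    se_b = sqrt(sse / (n - 2) / sxx)
    return b, se_b, (b - 1) / se_b

# Invented data: local richness rising roughly as the square root of
# regional richness, i.e. a strongly curvilinear relation.
regional = [5, 10, 20, 40, 80, 160]
local = [2.3, 3.0, 4.6, 6.2, 9.2, 12.3]

b, se_b, t_b1 = loglog_slope_test(regional, local)
print(f"b = {b:.3f} +/- {se_b:.3f}, t(b = 1) = {t_b1:.1f}")
```

A slope near 0·5 with a strongly negative *t* statistic, as these invented data give, would indicate curvilinearity under this method.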

Because there is no agreed form for the saturated local–regional richness curve, there is no reason to prefer either of these methods. However, if polynomial and power models show less curvature than the data this will lead to an overestimate of the intercept. Whether the intercept is significantly greater than zero in saturating relationships will depend both on the slope and the scatter of the observations about that slope and on the relative numbers of observations on the ascending and saturated limbs of the data.

Published data (Terborgh & Faaborg 1980; Cornell 1985a; Ricklefs 1987; Minns 1989; Hawkins & Compton 1992; Aho & Bush 1993; Lawton, Lewinsohn & Compton 1993; Kennedy & Guégan 1994; Stuart & Rex 1994; Dawah, Hawkins & Claridge 1995; Hugueny & Paugy 1995; Belkessam, Oberdorff & Hugueny 1997; Caley & Schluter 1997; Griffiths 1997; Oberdorff *et al*. 1998) were analysed for saturating or linear relations using the decision trees shown in Fig. 2. The regressions detected significant relationships in 20/21 cases. The polynomial and log-log regression methods gave the same conclusion for 18/21 of the data sets: two of the three discrepancies were a result of low local richness estimates at a single site which had a strong influence on log-transformed data, although there was no obvious reason for the third discrepancy. These discrepant cases were omitted from further consideration.

Only two of the nine (log-transformed) linear datasets had intercepts significantly below zero but the values for all the nonsignificant ones were negative, a statistically significant deviation from expectation (binomial test, *P* = 0·008) which cautions against the use of regression through the origin.
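The quoted probability is consistent with a one-tailed binomial test on the seven non-significant intercepts, each assumed equally likely to be positive or negative under the null hypothesis. That reading of the test is an assumption on my part, and the helper name `binomial_upper_tail` is hypothetical. A minimal check:

```python
from math import comb

def binomial_upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_nonsignificant = 9 - 2   # nine linear datasets, two significantly below zero
p_value = binomial_upper_tail(n_nonsignificant, n_nonsignificant)
print(round(p_value, 3))   # 0.008
```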

Nine of the 18 datasets showed saturated curves. Failure to examine the intercept in polynomial regression, an omission made by most authors, would reduce the number of saturated curves detected to 4/18 but have no effect on log-transformed data because all datasets with positive intercepts had slopes significantly less than 1.

To summarize, the results of this analysis indicate that using constrained regression as a matter of course is potentially misleading because it makes incorrect assumptions about the position of the intercept for many datasets and can detect significant relationships where none exist. There is little to choose between using polynomial and log-log regressions to identify the shape of local–regional relationships. However, it is important to note that performing single regressions and/or inspecting slopes alone is insufficient to identify the shape of the relation.

The number of observations (ideally regions, but see below) and the range in regional richness were used as indicators of sample adequacy. Mann–Whitney tests detected no significant differences between the two types of local–regional relationship in either the median number of observations or the median range of regional richness, and hence gave no indication that the observed patterns are a consequence of inadequate sampling. However, 9/18 log-log regressions had intercepts significantly different from zero, which suggests that sampling/pool specification problems are common.

Valid statistical tests require that data points are independent (Harvey 1996): Caley & Schluter (1998) make this point for local–regional patterns. Seven of the 14 sources analysed above (e.g. Cornell 1985a; Hugueny & Paugy 1995; Griffiths 1997; Oberdorff *et al*. 1998) have violated this by including several samples from a given regional pool when analysing for local–regional relations; for example, the number of data points Oberdorff *et al*. (1998) included in their analysis of nine sites varies from three to six per site, these replicates being taken in different years. Consequently their analysis had a total of 43 data points rather than the correct number of nine. Using all data points, unconstrained polynomial regression showed a saturating curve, whereas the log-transformed data indicated linearity. Both methods indicated linearity for the site data. Analysis of Cornell's (1985a, 1985b) cynipid wasp data found no difference in the form of the relationship detected by either method or data type.

There are two consequences of such replication. First it inflates the chance of detecting statistically significant results because standard errors are smaller than they should be; for example the probabilities using Oberdorff *et al*.'s (1998) site data were only marginally significant (0·10 > *P* > 0·05), whereas those using all data points were highly significant (*P* < 0·001). Second, because the number of samples per site is not equal, this procedure weights some sites more heavily than others, thereby biasing parameter estimates and the likelihood of detecting a particular form of the local–regional relation.
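One way to avoid both problems is to reduce the replicates to a single mean per site before fitting any regression. The sketch below uses only the Python standard library. The function name `site_means` and the records are invented for illustration and are not Oberdorff *et al*.'s data.

```python
from collections import defaultdict
from statistics import mean

def site_means(records):
    """Collapse replicate samples to one (regional, local) point per site.

    records: iterable of (site_id, regional_richness, local_richness).
    Returns a dict mapping site_id to (mean regional, mean local).
    """
    by_site = defaultdict(list)
    for site, regional, local in records:
        by_site[site].append((regional, local))
    return {
        site: (mean(r for r, _ in obs), mean(v for _, v in obs))
        for site, obs in by_site.items()
    }

# Invented records: three sites sampled in three, two and four years.
records = [
    ("A", 10, 4), ("A", 10, 5), ("A", 10, 6),
    ("B", 15, 7), ("B", 15, 9),
    ("C", 20, 8), ("C", 20, 10), ("C", 20, 9), ("C", 20, 9),
]

points = site_means(records)
print(len(records), "samples ->", len(points), "independent points")
```

Each site then contributes one point regardless of how many years it was sampled, so the degrees of freedom reflect the number of sites rather than the number of samples.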

Even if data unequivocally demonstrate a particular pattern it can be hard to interpret what this signifies; for example, proportional sampling does not always imply noninteractive communities and the dominance of regional scale processes, while saturating curves can be a result of sampling artefacts (Griffiths 1997; Cornell 1999). In line with their conclusions, my analysis of Oberdorff *et al*.'s (1998) data supported the proportional sampling hypothesis, although, because the regression slopes were only marginally significant (linear *b* = 0·509 ± 0·245, *n* = 9, *P* = 0·077, log-log *b* = 1·171 ± 0·498, *P* = 0·051), this was rather a weak test. A stepwise multiple linear regression of mean local richness against mean regional richness, mean total fish density, and physical stream parameters (discharge, distance from source, width, gradient) for the nine sample sites gave four predictor variables, which explained 99·2% of the variance (Table 2). The standardized coefficients show that fish density explained much more of the variation in local richness than two physical parameters, while regional richness made an even smaller contribution. There was no evidence of collinearity between these variables.

Predictor | Regression coefficient ± SE | Standardized coefficient | *P* |
---|---|---|---|
Intercept | 1·633 ± 0·566 | | 0·045 |
Density | 0·049 ± 0·003 | 0·849 | < 0·001 |
Discharge | –0·351 ± 0·068 | –0·277 | 0·007 |
Slope | 0·083 ± 0·016 | 0·266 | 0·006 |
Regional richness | 0·138 ± 0·044 | 0·167 | 0·035 |

This relation between local species richness and fish density could arise in a number of ways. It might simply be a consequence of variation in sampling effort because, for a given species-abundance pattern, small samples will contain fewer species than large samples: Oberdorff *et al*. (1998) do not provide sampling effort information to test this possibility.

It might also arise from competition. Oberdorff *et al*. (1998) argued that if competition was important, one would expect to see density compensation in those regions with few species. If complete compensation occurred, there should be no relation between density or biomass and local species richness, while if there was no compensation, a linear relationship would be expected. Their analysis, using all 43 (non-independent) data points, detected only linear relationships [although *P*(*b* = 1) = 0·078 for the log-log biomass slope of 0·640 ± 0·199] and they concluded that biotic interactions were unimportant. Using (independent) site data there was a linear relation between density and local richness, consistent with no density compensation, but no relation between biomass and local richness, consistent with complete density compensation. Analysis of covariance showed that the slopes of the log (biomass)–log (density) relation for all species combined did not differ across sites, i.e. with local richness. The common slope over the nine sites was 0·538 ± 0·102 [*P*(*b* = 1) < 0·01], indicating a levelling off in biomass with increasing density, a pattern consistent with competition. This result suggests that biotic interactions were important in the local fish assemblages studied by Oberdorff *et al*. (1998), although these did not necessarily lead to the observed differences in local richness.

Finally, density differences between regions could also be a result of productivity differences. The species-energy hypothesis (Wright, Currie & Maurer 1993) predicts that more productive habitats (such as those with higher total density or biomass) will support more species, in line with the data of Oberdorff *et al*. (1998). Because much of the variation in productivity is determined locally, this would imply an important role for local processes.

The message is clear: ecologists need to exercise considerable caution in collecting, analysing and interpreting local–regional species richness data.