A comparison of the power of categorical and correlational tests applied to community ecology data from gradient studies

Authors


Paul Somerfield, Plymouth Marine Laboratory, Prospect Place, Plymouth PL1 3DH, UK. Fax: 01752 633101; E-mail: pjso@pml.ac.uk

Summary

  1. Where gradients exist it is possible for a global test of community change to fail to achieve significance even though pairwise tests between groups of samples from opposite ends of gradients reveal significant differences. This study examines the power of alternative tests in situations where spatial (or temporal) gradients exist.
  2. Data from an oilfield survey in the Norwegian sector of the North Sea are used to examine the power of two types of test. There is no theoretical statistical framework with which to examine the formal power of the non-parametric multivariate tests. A univariate analogue is used to demonstrate the behaviour and relative power of linear regression and anova to detect different strengths of gradients.
  3. Two multivariate tests are then contrasted using simulations. anosim (Analysis of Similarities) is a categorical analysis and, for an ecological matrix of pairwise dissimilarities among all samples, tests null hypotheses of the anova type: H0: There are no differences between groups of samples. This test is compared with relate, a non-parametric Mantel test which is used here to test the null hypothesis: H0: The dissimilarities among samples in the ecological matrix are not (monotonically) correlated with corresponding ‘model’ distances between samples along the gradient.
  4. It is demonstrated that, where a gradient is present in the data, an appropriate univariate test based on linear regression always has greater power than anova to detect it. This is especially true where the gradient is weak and/or replication is low.
  5. It is shown that the multivariate tests behave in a similar way to their univariate analogues. Where there is a detectable gradient in the data the correlative test (relate) has greater potential to detect it than does the categorical test anosim. The differences in power are especially apparent at low to intermediate strengths of gradient. Distributing replicates among more groups decreases the power of both tests, but especially anosim, to detect a constant gradient effect.
  6. Although demonstrated using a practical application, the findings presented here are general in nature and applicable to any ecological investigation in which a gradient in response may be hypothesized.

Introduction

Field, Clarke & Warwick (1982) described a non-parametric multivariate strategy for analysing multispecies distribution patterns or, in other words, changes in species composition. Although formal tests for differences in species composition between groups of samples did not form part of the original strategy, as pointed out by Clarke & Green (1988) many community data possess some a priori defined structure within a set of samples, for example replicates from a number of different sites (and/or times). A prerequisite to interpreting community differences between sites should be a demonstration that there are statistically significant differences to interpret. Within the non-parametric similarity-based framework for analysis of assemblage data such tests generally involve ‘distance’ matrix comparisons, either explicitly or implicitly.

One method for comparing two (symmetric) similarity or distance matrices, computed for the same objects, is the Mantel test (Mantel 1967). First, a statistic is calculated between corresponding elements in the two lower-triangular matrices. The basic form of the Mantel statistic is the sum of cross-products of corresponding elements in the two similarity or distance matrices. Values may or may not be standardized, or transformed into ranks, before computing the statistic (Legendre & Legendre 1998). Transforming distances into ranks before computing a standardized Mantel statistic is equivalent to computing a Spearman rank correlation (ρ) between corresponding values in the two matrices. As suggested by Dietz (1983) and Hubert (1985), using such a non-parametric rank correlation has several advantages. Primarily, it diminishes the influence of large differences in values and is appropriate for non-linear monotonic relationships (Somerfield & Gage 2000).
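
As a concrete illustration of this rank-based form of the statistic, the following minimal Python sketch (not part of the original study; it assumes numpy and scipy are available, and the function name is ours) computes Spearman's ρ between the corresponding lower-triangular elements of two symmetric distance matrices.

```python
import numpy as np
from scipy.stats import spearmanr

def rank_mantel_rho(d1, d2):
    """Spearman rank correlation between corresponding lower-triangular
    elements of two symmetric distance matrices for the same objects."""
    iu = np.tril_indices_from(d1, k=-1)   # strictly below the diagonal
    rho, _ = spearmanr(d1[iu], d2[iu])    # p-value ignored; see the permutation test below
    return rho
```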

Whichever statistic is calculated, a decision needs to be made whether to accept or reject the null hypothesis. Standard tables for testing correlations such as ρ are invalidated by the lack of independence of elements of a similarity matrix. More usually, the significance of the statistic in a particular test is determined by a permutation procedure (Hope 1968). One such procedure is to reallocate object-labels randomly in one of the matrices a number of times, recalculating the relevant statistic each time, in order to construct the distribution of the statistic under the null hypothesis. The actual value is then compared to this reference distribution and, as with any statistical test, the null hypothesis is rejected for observed values of the test statistic in the upper P tail of this distribution. We adopt the modern practice of quoting the actual value of P, the probability of obtaining a value of the test statistic as large as, or larger than, the observed value if the null hypothesis is true. Note that the test is one-sided because an appropriate alternative hypothesis is that intergroup dissimilarities are greater than intragroup dissimilarities (two-sided alternatives can be of interest in some special cases, see Chapman & Underwood 2000).
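
The permutation procedure itself can be sketched in the same hedged way: object labels in one matrix are shuffled, the statistic is recomputed each time, and the observed value is referred to the resulting null distribution (one-sided, upper tail). The helper below is illustrative only; `stat_fn` could be the `rank_mantel_rho` sketch above.

```python
import numpy as np

def mantel_permutation_p(d1, d2, stat_fn, n_perm=999, seed=None):
    """One-sided permutation test: permute object labels of d2, recompute the
    statistic, and count values >= the observed one (observed value included)."""
    rng = np.random.default_rng(seed)
    observed = stat_fn(d1, d2)
    count = 1                                   # the observed value counts once
    n = d2.shape[0]
    for _ in range(n_perm):
        p = rng.permutation(n)
        if stat_fn(d1, d2[np.ix_(p, p)]) >= observed:
            count += 1
    return observed, count / (n_perm + 1)       # P = Prob(stat >= observed | H0)
```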

The Mantel test may be used to assess how well observational data match an a priori model. The test compares the empirical resemblance matrix to a model matrix, so-called because it is constructed to represent the model to be tested. In other words, it depicts the alternative hypothesis of the test. If the model is a classification of the objects into groups, the Mantel test is then a type of non-parametric multivariate analysis of variance, and addresses a null hypothesis of the one-way anova type:

  • H0: There are no differences between two (or more) groups defined a priori, with the alternative:
  • H1: There are differences between two (or more) groups.

Focusing on problems of analysis of variance that involve community composition, Clarke & Green (1988) and Clarke (1993) developed a parallel approach to model-based Mantel tests, called anosim (ANalysis Of SIMilarities), which encompassed both the one-way layout and the two-way crossed and nested anova-type designs. As demonstrated by Legendre & Legendre (1998), the main difference between the model-based Mantel test and one-way anosim, as test procedures, is one of parametric or non-parametric tradition. Instead of using a Mantel statistic, calculated using actual or standardized distances, for anosim distances are converted into ranks prior to calculating the anosim statistic R, a difference between inter- and intragroup rank dissimilarities. In the simple two-group one-way layout Legendre & Legendre (1998) show that R is a monotonic function of a Mantel statistic computed on ranked distances. It is thus analogous to a Spearman correlation coefficient in this case, although for more complex model structures the precise link is less clear. The anosim statistic R has the advantage of an absolute interpretable value which may be compared across different tests. By contrast, the Mantel statistic is usually of the normalized type, with the statistic divided by its standard deviation, which limits its interpretation to testing the null hypothesis (rather than interpreting the size of group differences when the null is rejected).
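
For reference, the anosim statistic described above can be written down compactly. The sketch below follows the usual definition of R from Clarke (1993), R = (mean between-group rank dissimilarity − mean within-group rank dissimilarity)/(M/2) with M = n(n − 1)/2; it is an illustrative implementation, not the authors' code, and its significance would again be assessed by permuting the group labels.

```python
import numpy as np
from scipy.stats import rankdata

def anosim_R(dissim, groups):
    """Illustrative anosim R: difference between the mean ranks of between- and
    within-group dissimilarities, scaled by M/2 (M = number of dissimilarities)."""
    groups = np.asarray(groups)
    iu = np.triu_indices_from(dissim, k=1)       # each pair once, above the diagonal
    ranks = rankdata(dissim[iu])                 # ranks of all M = n(n-1)/2 values
    between = groups[iu[0]] != groups[iu[1]]     # True where a pair spans two groups
    M = ranks.size
    return (ranks[between].mean() - ranks[~between].mean()) / (M / 2.0)
```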

Often, however, an investigator anticipates more than an undefined difference between groups of samples and may have an a priori expectation of the probable direction and magnitude of differences to be expected. For example, a model matrix may be constructed to represent the hypothesis that a gradient in species composition is present in the data, in which case the model matrix could represent geographical distances along a transect, and the null and alternative hypotheses would then be:

  • H0: The dissimilarities among samples from the ecological matrix are not (non-parametrically) correlated with the corresponding model distances;
  • H1: The sample dissimilarities are correlated with the distances in the model matrix.

Although a Mantel test based on expected differences between samples (for example, samples at opposite ends of a gradient being more dissimilar than samples at intermediate positions along the gradient) might be expected to be commonly employed, this is not the case. The majority of investigations employing the non-parametric rank-similarity based approach rely on anosim to test for significant differences between groups. This is a perfectly valid approach, but can lead to a lack of sensitivity, as we shall show.

One area where the type of multivariate analyses discussed in this paper has become commonplace is in analyses of data resulting from pollution surveys. In the Norwegian sector of the North Sea such analyses were used to demonstrate that the effects of disturbance from drilling and extraction activities are far more extensive than previously realized (Gray et al. 1990), leading to changes in the legislation regulating such activities in the sector. Until recently (Gray 1999) biological monitoring was required around each installation every 3 years. These surveys, in which samples are collected at different distances from a putative point source of pollution (Olsgard & Gray 1995), provide good examples of data in which gradients in species composition are to be expected (Olsgard, Somerfield & Carr 1997, 1998). Macrobenthic assemblage data from surveys at six fields are used in the present study.

anosim tests between groups of samples near (< 1 km), further (1 km) and far (> 1 km) away at five of these fields (Table 1, Fig. 1) could be interpreted as showing that oilfield activities are only affecting macrobenthic community structure at Statfjord C, although there is a suggestion that these activities are having a mild influence on macrobenthic abundances in at least some of the others. Conventional wisdom, however, states that further analyses should only be undertaken if the global test is significant, and the null hypothesis of ‘no difference between groups’ is rejected with confidence. Remembering, however, that in each of these surveys a gradient in community structure may be expected, what happens if we choose to ignore conventional wisdom and continue with pairwise anosim tests between the three distance groups of samples in each survey? The results (Table 1) do indeed indicate that there is more going on than is revealed by a simple global test for differences between groups. In each case R-values for the pairwise test between samples at opposite ends of the gradient (< 1 km and > 1 km) are considerably higher than R-values for pairwise tests between samples from each end of the gradient and samples in the middle. In two of the surveys these pairwise tests have P < 0·05, even though the relevant global test failed to reject the null hypothesis of ‘no differences between groups’. MDS plots (Fig. 1) also confirm that in each of the surveys there is some evidence of a gradient in macrobenthic community structure related to a gradient in distance from the centre of the field.

Table 1.  Results from one-way anosim tests, based on Bray–Curtis similarities in square-root transformed macrofaunal abundances from five fields in the Norwegian sector of the North Sea, for differences between stations < 1 km, 1 km and > 1 km from each platform/field centre. Pairwise anosim tests for differences between station groups. Stations categorized as follows: A: < 1 km from platform/field centre; B: 1 km from platform/field centre; C: > 1 km from platform/field centre. Results from relate ‘seriation with replication’ tests, which take into account the ordering of the groups
                          anosim global   anosim pairwise tests                            relate ‘seriation
Location, year,           test            A v B           A v C           B v C           with replication’
no. of stations           R       P       R       P       R       P       R       P       ρ       P
Veslefrikk 1993, 14       0·174   0·071   0·119   0·246   0·356   0·008   0·031   0·389   0·245   0·005
Gullfaks A 1989, 16       0·114   0·141  −0·156   0·865   0·303   0·027   0·013   0·442   0·200   0·019
Gullfaks B 1993, 13       0·065   0·217   0·019   0·333   0·162   0·143   0·042   0·400   0·218   0·014
Statfjord A 1993, 11      0·294   0·061   0·356   0·063   0·545   0·095  −0·143   0·600   0·310   0·034
Statfjord C 1993, 13      0·349   0·012   0·088   0·238   0·694   0·008   0·302   0·029   0·379   0·002
Figure 1.

MDS plots (derived from Bray–Curtis similarity in square-root transformed abundances) of survey data from five oilfields. Symbols represent station groups as follows: black circles: < 1 km; grey circles: 1 km; white circles: > 1 km from the platform/field centre in each survey.

What happens if we apply a test designed for the detection of a gradient in the data? For each survey a model distance matrix was constructed in which samples were grouped, as before, into samples < 1 km, 1 km and > 1 km from the centre of the field. Samples in the same distance group are considered to be at distance 0 apart, in adjacent groups at distance 1, and at opposite ends of the gradient at distance 2. The absolute numbers used are not important, only the rank order of their differences being used in the non-parametric form of the Mantel test. Following the terminology of Clarke, Warwick & Brown (1993), who examined the impact of dredging activities on the structure of coral communities with ρ as a test statistic which they called an ‘index of seriation’, a relate test of ‘no relationship’ between the resulting distance and biotic matrices may be referred to as a relate ‘seriation with replication’ test (Table 1), and the results are unequivocal. In each of the five surveys the hypothesis ‘there is no gradient in the data’ can be rejected with a high degree of confidence.
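
A sketch of this ‘seriation with replication’ construction, using the illustrative helpers from the Mantel sketches above (`rank_mantel_rho`, `mantel_permutation_p`), is given below; `biotic_dissim` stands for the Bray–Curtis dissimilarity matrix of the survey samples and is assumed, not supplied.

```python
import numpy as np

def seriation_model_matrix(positions):
    """Model distances |x_i - x_j| between samples whose groups lie at the given
    positions along the gradient (e.g. 0, 1, 2 for near/mid/far); only the rank
    order of these distances matters in the rank-based test."""
    x = np.asarray(positions, dtype=float)
    return np.abs(x[:, None] - x[None, :])

# Three distance groups with four replicates each, ordered along the gradient:
model = seriation_model_matrix(np.repeat([0, 1, 2], 4))
# rho, p = mantel_permutation_p(model, biotic_dissim, rank_mantel_rho)  # biotic_dissim assumed
```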

The failure of anosim to detect differences between groups of samples, when there are apparent differences to be interpreted, and the greater ability of relate to detect a gradient when one exists, are directly related to the power of the different tests for specific alternative hypotheses. There is no analytical framework for calculating the power of such multivariate tests, and simulation studies comparing type I error and power of the various methods of matrix comparison (anosim, Mantel tests, etc.) are needed (Legendre & Legendre 1998). To understand better the advantage, in terms of power of the test, that the relate approach may have over the anosim procedure, we look in detail at the direct analogue of this comparison in the univariate case, where a general analytical framework is available. The improved power of univariate gradient (correlational) analyses in comparison with control-impact (categorical) designs, specifically for oil-field studies, was examined by Ellis & Schneider (1997). They, however, calculated only relative power values for a specific data set. In contrast this paper provides general results for power comparisons of correlational and categorical tests in the univariate case. Simulations of specific cases are then used to determine the extent to which the multivariate tests match their univariate analogues in terms of their behaviour and relative power.

Methods and results

TEST DATA

Univariate comparisons are exemplified with, and multivariate simulations based upon, a data set from the Valhall oilfield, sampled in 1991. Multivariate analyses of these data were discussed in detail by Olsgard & Gray (1995) and Olsgard et al. (1997), and they are chosen here because of the strong faunal gradients apparent within them.

UNIVARIATE COMPARISONS

Assume that a single variable is recorded (a diversity index, say) for each of J replicate samples in each of I groups, the groups spanning some anticipated linear gradient of response. If, possibly after transformation, one can make the assumption of normality of the index and constant variance (σ²) across the groups, then appeal can be made to standard general linear model theory (e.g. Scheffé 1959) to derive theoretical expressions for the power of appropriate tests. The two competing analogues are: (a) linear regression and (b) one-way analysis of variance (anova). Figure 2 shows the values of Shannon diversity H′ from pooled replicates at 27 stations, grouped into the three distance classes discussed earlier (‘near’: < 1 km; ‘further’: = 1 km; ‘far’: > 1 km from the centre of drilling activity). For this illustration the distance classes (x) are given values 0, 1, 2 and, as it turns out, the diversity response (y) to this scale is very close to being linear (Fig. 2).

Figure 2.

Shannon diversity (H′) for stations around the Valhall oilfield, sampled in 1991 (replicate grabs at each station pooled). For illustrative purposes, the stations are split into three distance groups (‘near’/‘mid’/‘far’ from the drilling centre), denoted by distances 0, 1, 2, respectively. Crosses denote data points (one per station), the diagonal line is the fitted linear regression, circles denote the sample means of each group and the intervals (displaced to the left, purely for clarity of presentation) are 95% confidence intervals for the underlying group means, resulting from a one-way anova (both linear and anova fits assume a constant variance).

Under model (a), the regression model, the fitted line is y = −0·064 + 1·025x, and the estimated change in mean response across the full gradient is 2·05 (twice the slope). The residual sum of squares is 8·844 on 25 degrees of freedom (d.f.), giving an estimate of the underlying variance of a single replicate as s² = 0·353. The F-test of the null hypothesis that the slope is zero leads to decisive rejection: F = 55·5, to be compared with the F distribution on 1, 25 d.f. (P = 0·0000, to 4 d.p.).

Under model (b), the anova model, the residual sum of squares is 8·616 on 24 d.f., giving a virtually identical estimate of underlying ‘error’ variance of s² = 0·359. The null hypothesis of equality of the mean response for all three groups is again decisively rejected: F = 27·7, to be compared with the F distribution on 2, 24 d.f. (P = 0·0000, to 4 d.p.). Note also that a test of the (two parameter) linear regression model against the more general (three parameter) anova alternative, of separate means for each group, does not lead to rejection: F = (8·844 − 8·616)/0·359 = 0·64 (P = 0·43), giving no evidence at all for a departure from linearity.
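
The three F-tests quoted above (regression slope, one-way anova, and the lack-of-fit test of the linear model against the anova model) can be computed for any such grouped response with a few lines of code. The sketch below is generic and illustrative (it is not the authors' code and does not reproduce the Valhall data, which are not tabulated here).

```python
import numpy as np
from scipy import stats

def gradient_f_tests(x, y):
    """Regression, anova and lack-of-fit F-tests for a response y recorded in
    groups located at positions x along a gradient (x repeated within groups)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n, groups = y.size, np.unique(x)
    I = groups.size
    slope, intercept = np.polyfit(x, y, 1)                       # (a) linear regression fit
    ss_res_lin = np.sum((y - (intercept + slope * x)) ** 2)      # residual SS, d.f. = n - 2
    ss_res_aov = sum(np.sum((y[x == g] - y[x == g].mean()) ** 2) for g in groups)  # d.f. = n - I
    ss_tot = np.sum((y - y.mean()) ** 2)
    F_lin = (ss_tot - ss_res_lin) / (ss_res_lin / (n - 2))                  # 1, n-2 d.f.
    F_aov = ((ss_tot - ss_res_aov) / (I - 1)) / (ss_res_aov / (n - I))      # I-1, n-I d.f.
    F_lof = ((ss_res_lin - ss_res_aov) / (I - 2)) / (ss_res_aov / (n - I))  # lack of fit; needs I > 2
    return ((F_lin, stats.f.sf(F_lin, 1, n - 2)),
            (F_aov, stats.f.sf(F_aov, I - 1, n - I)),
            (F_lof, stats.f.sf(F_lof, I - 2, n - I)))
```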

The message of the example is that, with this number of replicates, while the response gradient between ‘near’ and ‘far’ stations is large enough to be clearly detected by either the linear regression or the anova test, the F-values in the two cases differ substantially (55·5 down to 27·7). This suggests that, in the case of a smaller number of replicates, there will be situations in which the anova test is unable to detect the gradient but the linear regression does, and this can be readily demonstrated by selecting smaller numbers of replicates, at random, from the existing sets. Of course, the greater sensitivity of the linear regression test, in an example where there is genuinely a linear response to a gradient, is not unexpected, but it is helpful to quantify the extent of this gain, and possible to do so with great generality through the following theoretical calculations of power in the two cases.

Linear regression on groups

Figure 3 illustrates the simple case of I = 3 groups, so that the regression model y = α + βx for x = 0, 1, 2 dictates that y has underlying mean values µ1 = α, µ2 = α + β, µ3 = α + 2β for the three groups, there being J replicates in each group, each normally distributed with standard deviation σ. More generally, for I groups, it is assumed (without loss of generality) that the group means (µ1, µ2, …, µI) lie on the same regression line y = α + βx, with the I values of x equally spaced between 0 and 2, so that the mean response difference across the full gradient (µI − µ1) is again 2β. The null hypothesis H0: β = 0 implies that the group means are all equal and the test of H0 is just the standard linear regression F-test for zero slope:

Figure 3.

The five alternative hypotheses for which the power of a test for zero slope in a linear regression (β = 0) is calculated, and compared with that for a test of equality of I (= 3) means in a one-way anova (H0: µ1 = µ2 = µ3). The ± 2σ intervals denote the usual normal distribution range within which approximately 95% of the J replicates for each group will lie.

Reject H0 if F > F0·05; 1, IJ−2   (eqn 1)

where the latter denotes the upper 5% point of the (central) F distribution on (1, IJ − 2) degrees of freedom.

The power of the test is defined as the probability of rejecting H0 given that an alternative hypothesis is true (the means follow a linear regression with non-zero slope β). Thus:

Power = PLIN = Prob(F′1, IJ−2; δ > F0·05; 1, IJ−2)   (eqn 2)

where F′1, IJ−2; δ denotes the non-central F distribution on (1, IJ − 2) degrees of freedom and δ is called the non-centrality parameter. The non-central F distribution function is tabulated or, more conveniently these days, available in a few specialized statistics packages. (Note, however, that some packages refer to λ = δ² as the non-centrality parameter, rather than δ itself.) Scheffé (1959) proves the simple general rule for deriving the non-centrality parameter from a general linear model: in the model sum of squares (SS) from the anova table, substitute for each observation its expected value under the specific alternative hypothesis being considered, and the result is σ²δ². For the special, balanced, form of linear regression dealt with here, some straightforward algebra gives the model SS as:

12J[Σi{i − ½(I + 1)}ȳi•]²/[I(I² − 1)]   (eqn 3)

where ȳi• is the mean of the responses for the ith group and the summation is over i = 1, 2, …, I. Under the alternative (regression) hypothesis,

E(ȳi•) = α + βxi = α + 2β(i − 1)/(I − 1)   (i = 1, 2, …, I)   (eqn 4)

and, applying the above rule, further algebra produces

δ² (≡ λ) = J(2β/σ)²[I(I + 1)]/[12(I − 1)]   (eqn 5)

a simple function of the scaled effect size (2β/σ), the number of groups (I) and the number of replicates in each group (J).

Analysis of variance

For standard one-way analysis of variance (anova), the groups have means µ1, µ2, … , µI. The null hypothesis again specifies equality, H0: µ1 = µ2 = … = µI but the F-test is constructed against a general alternative that the means are not all equal. It will thus have power to detect alternatives which are not at all linear, e.g. µ1 = µ3 << µ2. (The price paid for this is a loss of power in detecting a more constrained alternative, such as a linear trend, and quantifying the power loss is the purpose of this univariate analogy, of course.) The usual F-test for one-way anova specifies:

Reject H0 if F > F0·05; I−1, I(J−1)   (eqn 6)

and its power is again given by a non-central F probability:

Power = PAOV = Prob(F′I−1, I(J−1); δ* > F0·05; I−1, I(J−1))   (eqn 7)

The model SS from the anova table is:

J Σi (ȳi• − ȳ••)²   (eqn 8)

where ȳ•• is the average of the group means {ȳi•}. The conventional power calculation for one-way anova (as in Scheffé 1959 and many subsequent texts) evaluates the power under an alternative that fixes the maximum difference between any two underlying means, and therefore differs from the analysis given here. If the true alternative model is that of linear regression,

E(ȳ••) = α + β   (eqn 9)

then substituting eqns 4 and 9 into eqn 8 gives an expression for σ²δ*², where δ* is the non-centrality parameter in eqn 7. In fact, more simple algebra demonstrates that:

δ*² = J(2β/σ)²[I(I + 1)]/[12(I − 1)] ≡ δ²   (eqn 10)

i.e. the non-centrality parameter is the same in both cases, which makes the comparison of the powers in eqn 2 and eqn 7 concisely expressible as a simple change in the degrees of freedom of a non-central F probability.

Power comparisons of the two univariate tests

From eqns 2, 5, 7 and 10, and with access to a package which calculates non-central F probabilities, it is straightforward to compute the powers of the linear regression and anova tests for any suggested number of groups (I), replicates per group (J) and effect size (2β/σ) across the gradient, scaled in relation to the underlying standard deviation of a single observation. These are general formulae and, with the constraint that the data are normally distributed with constant variance, the conclusions will apply to any data set with this regression design (equally spaced x-values with balanced replication).
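
For readers with access to a package that provides the non-central F distribution, the calculation is a one-liner per test. The sketch below uses scipy, whose `ncf` distribution takes the non-centrality as λ (= δ² here, as noted above); it should reproduce values close to those in Tables 2 and 3.

```python
from scipy.stats import f, ncf

def power_lin_and_aov(I, J, effect, alpha=0.05):
    """effect = 2*beta/sigma, the scaled mean difference across the full gradient."""
    lam = J * effect**2 * I * (I + 1) / (12.0 * (I - 1))                           # eqns 5 and 10
    p_lin = ncf.sf(f.ppf(1 - alpha, 1, I * J - 2), 1, I * J - 2, lam)              # eqn 2
    p_aov = ncf.sf(f.ppf(1 - alpha, I - 1, I * (J - 1)), I - 1, I * (J - 1), lam)  # eqn 7
    return p_lin, p_aov

# e.g. power_lin_and_aov(3, 4, 2.0) should lie close to the (0.72, 0.56)
# entry of Table 2 (I = 3, J = 4, 2*beta = 2*sigma).
```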

When there are only three groups (I = 3), Fig. 3 illustrates a spread of cases for the scaled effect size 2β/σ from the situation in which few, if any, of the replicates at the extremes of the gradient overlap each other (2β/σ = 4) to one where the replication variability appears to dominate the shift in mean value (2β/σ = 1). Table 2 computes the power for each of the five cases, for a range of replication numbers (J). Of course, the power increases with replication number and scaled effect size for both tests (but note the well-known ‘law of diminishing returns’ arising from the way power scales with √J rather than J itself). More pertinently, the improvement in power of the linear regression test over the anova test is seen to be real, though relatively modest except for mid-range power values combined with low numbers of replicates.

Table 2.  Power comparisons for the two competing tests (i.e. probability of rejecting the null hypothesis, in a 0·05 significance level test, when the alternative is true), for the case of I = 3 groups of samples with J replicates in each group (for a range of J). PLIN = power of the test for zero slope in a linear regression of y on x, PAOV = power of the one-way anova test of equal means, both computed under the alternative hypothesis that the means do lie on a straight line. Five cases are considered, for decreasing values of the difference between the largest and smallest means (2β) in proportion to the underlying standard deviation of a single replicate (σ), see Fig. 3
J     2β = 4σ         2β = 3σ         2β = 2σ         2β = 1·5σ       2β = σ
      PLIN    PAOV    PLIN    PAOV    PLIN    PAOV    PLIN    PAOV    PLIN    PAOV
2     0·84    0·54    0·62    0·35    0·34    0·19    0·21    0·13    0·12    0·08
3     0·99    0·92    0·88    0·71    0·56    0·38    0·36    0·23    0·19    0·13
4     1·00    0·99    0·97    0·90    0·72    0·56    0·48    0·34    0·25    0·17
5             1·00    0·99    0·97    0·83    0·70    0·59    0·45    0·31    0·22
6                     1·00    0·99    0·90    0·81    0·68    0·55    0·37    0·27
8                             1·00    0·97    0·92    0·82    0·71    0·48    0·37
10                                    0·99    0·97    0·90    0·82    0·58    0·46
12                                    1·00    0·99    0·95    0·89    0·66    0·54
15                                            1·00    0·98    0·95    0·76    0·65
20                                                    1·00    0·99    0·88    0·79
30                                                            1·00    0·97    0·94
50                                                                    1·00    1·00

Note, in passing, that Table 2 also demonstrates why the real data of Fig. 2 are virtually guaranteed to lead to rejection of the null hypothesis of ‘no effect of distance on diversity’, under either regression or anova tests. There, the point estimate of the (scaled) effect size, 2β/σ, is 2·05/√0·354 = 3·45, falling roughly equidistant between the two left-hand plots of Fig. 3. To be more precise, taking some account of the uncertainty of the estimate of true effect size, one can derive a confidence interval for 2β/σ from the data of Fig. 2. An approximate 95% CI is (2·5, 4·4) and a 70% CI is (3·0, 4·0), implying that the true situation is likely to be bracketed by the first two cases in Table 2. The real data correspond roughly to J = 9 (there are 27 samples in three groups, although divided slightly unequally into 8, 8 and 11 replicates) and it is therefore evident from Table 2 that the power of both tests is effectively one, i.e. both are certain to reject the null hypothesis (as happened). With such a clear effect in this case, the superiority of the regression test will only become apparent for much smaller numbers of replicates, e.g. two or three per group.

As the number of groups increases from three, however, the contrast can be more marked. Table 3 shows a more subtle, but revealing comparison: the total number of samples (IJ) is now fixed and the decision concerns how best to apportion this fixed effort between a lower/higher number of groups and higher/lower numbers of replicates within each group. Three convenient cases are considered, with total number of replicates fixed at IJ = 12, 24 and 60, respectively, these numbers being chosen to have a range of integer factorizations for I and J. The scaled effect size (2β/σ) is also selected arbitrarily for each of the three cases, to generate an illustrative span from low to high power.

Table 3.  Power comparisons (PLIN, PAOV) for the two competing tests (see Table 2 legend) but now also illustrating the consequences of using a larger number of groups (I), again with J replicates in each group. The total number of samples (IJ) is fixed and three cases are considered: IJ = 12, 24, 60, for conveniently chosen effect sizes (2β = 2σ, 1·5σ and σ, respectively; the difference between the group means at the two ends of the regression line always being denoted by 2β, whatever the value of I)
IJ = 12, 2β = 2σ             IJ = 24, 2β = 1·5σ           IJ = 60, 2β = σ
I    J    PLIN    PAOV       I     J    PLIN    PAOV      I     J    PLIN    PAOV
2    6    0·88    0·88       2     12   0·94    0·94      2     30   0·97    0·97
3    4    0·72    0·56       3      8   0·82    0·71      3     20   0·88    0·79
4    3    0·64    0·37       4      6   0·75    0·53      4     15   0·81    0·64
6    2    0·57    0·20       6      4   0·67    0·34      5     12   0·77    0·53
                             8      3   0·63    0·25      6     10   0·74    0·46
                             12     2   0·60    0·15      10     6   0·68    0·29
                                                          12     5   0·67    0·25
                                                          15     4   0·65    0·21
                                                          20     3   0·64    0·16
                                                          30     2   0·62    0·12

Reading down the columns of Table 3, the first clear point to note is that, whichever test is chosen, the power decreases as the number of groups (I) increases and the number of replicates per group (J) correspondingly decreases. Thus, in the first case (IJ = 12) and for the linear regression test, four replicates taken at each of three groups gives power 0·72, whereas three replicates at each of four groups gives a lower power of 0·64. The difference is more marked for the anova test, where the power drops from 0·56 to 0·37 as one moves from four replicates of three groups to three replicates of four groups. This contrast is very clear in all three cases shown here, and will be true generally. Thus, while the linear regression power declines only modestly between the extremes (0·88 down to 0·57 for two groups of six compared with six groups of 2; 0·94 down to 0·60 for two groups of 12 compared with 12 groups of two; 0·97 down to 0·62 for two groups of 30 compared with 30 groups of two), the decline is much more precipitous for the anova test (0·88 to 0·20, 0·94 to 0·15, 0·97 to 0·12 for the same three comparisons). In fact, although any experimenter would be intuitively wary of designing quite such an extreme study as two replicates from 30 different treatments, it is instructive to see that the power of such an anova test can be negligible (PAOV= 0·12) in a case in which, if the two opposite extreme treatments are identifiable a priori, 30 replicates of just these two would give near certainty of detecting the effect (PAOV= 0·97).

Another obvious point to note from Table 3 is that, for the special case of only two groups (I = 2), the powers for the linear regression and anova tests (PLIN and PAOV) are identical. This is not surprising: of course with only two x-values, the F-test of slope in a regression is formally identical to that for equality of the two means. (This has an exact parallel for the multivariate case, considered later, where it is also true that, with only two groups, the non-parametric Mantel-type correlation of biological (dis)similarities with intergroup distance provides a test which is formally equivalent to anosim, e.g. Legendre & Legendre 1998.) The main message of Table 3 for this paper, however, is in the direct comparison of PLIN and PAOV values for a specified number of groups. Many studies are likely to demand a fixed number of intermediate gradient positions, rather than sampling only at the extremes, in order to explore the form of the ‘dose–response’ relationship (which may not be linear). As expected from Table 3, for three or more groups, the anova test always has smaller power than that for the corresponding linear regression (when a linear model is justified), but as noted above, this disparity grows as the number of groups increases. Thus, a practically likely design of four replicates at each of six gradient positions – the type of design typically found for monitoring impacts at different distances from oilfield drilling – can mean that the probability of detecting the gradient changes from two-thirds to one-third if one elects to carry out an anova rather than a linear regression test.

MULTIVARIATE SIMULATIONS

Simulating the gradient

By making assumptions of normality of an index, and constant variance (σ²) across the groups, theoretical expressions for the power of appropriate univariate tests to detect a scaled effect of 2β/σ could be derived. In the multivariate case there is no comparable theory to allow general quantification: there is no simple expression for variance, nor is there a statistical context for defining a scaled effect size. Although we cannot therefore derive relevant theoretical expressions for the power of the different tests, we can examine their performance under simulated conditions to determine whether their relative performance agrees with, or contradicts, the predictions from the univariate analogue. In the univariate analogue, the power of the two tests to detect an effect of 2β/σ was determined for various combinations of J replicates, distributed among I groups. For a multivariate analogue we need to define an effect (2β) that can be scaled and distributed among replicates within groups.

Replicated macrofaunal abundances from the Valhall oilfield, collected during the 1991 survey, were used as a basis for simulations. Similarity percentages analysis (simper; Clarke 1993) was used to determine the taxa contributing up to a cumulative 50% of the average dissimilarities between stations ≥ 4000 m from the platform and stations < 500 m from the platform. A ln(A + 1) transformation was chosen because it places variability in abundance on a common scale of percentage variation. Increases/decreases in the transformed abundances of the 20 selected taxa (Table 4) were defined as 100% of the simulated gradient effect (2β). Effect sizes were then defined as percentages of the total gradient effect. For example, taking a sample containing transformed macrofaunal abundances, a sample differing from it by 2β = 50% is created by adding 50% of the values (= 2b) in the right-hand column of Table 4, and subsequently converting any negative numbers resulting from subtractions to 0.

Table 4.  Average transformed abundances (ȳ) of taxa contributing to the average dissimilarity δ̄ (= 66·01) between stations ≥ 4000 m and stations < 500 m from the platform. Species are listed in order of their contribution (δ̄i) to δ̄, with a cut-off when the cumulative per cent contribution (Σδ̄i%) to δ̄ reaches 50%. The differences in transformed abundances (2b) of these 20 species were then used to construct simulated gradients
Species                    ȳ (≥ 4000 m)   ȳ (< 500 m)   δ̄i      δ̄i%     Σδ̄i%    2b (2β = 100%)
Capitella capitata         0·33           3·52          2·98    4·52     4·52   +3·2
Myriochele oculata         4·03           0·46          2·90    4·39     8·91   −3·6
Amphiura filiformis        3·47           0·22          2·65    4·01    12·91   −3·2
Scoloplos armiger          2·98           0·14          2·31    3·51    16·42   −2·8
Chaetozone setosa          3·48           4·97          2·15    3·26    19·68   +1·4
Sthenelais limicola        2·78           0·26          2·07    3·13    22·82   −2·5
Goniada maculata           3·56           1·23          1·94    2·94    25·76   −2·3
Falcidens spp.             3·09           1·03          1·86    2·81    28·57   −2·0
Paraonis gracilis          2·50           0·68          1·74    2·64    31·21   −1·8
Glycera alba               1·26           3·08          1·70    2·58    33·78   +1·8
Eudorellopsis deformis     2·77           0·74          1·69    2·56    36·35   −2·0
Mysella bidentata          2·16           0·17          1·65    2·49    38·84   −2·0
Antalis entale             1·41           0·00          1·14    1·73    40·58   −1·4
Cirratulus cirratus        1·21           2·09          1·09    1·65    42·23   +0·8
Spiratella retroversa      1·61           1·06          1·04    1·58    43·80   −0·6
Nephtys cf. longosetosa    1·28           0·09          1·03    1·56    45·36   −1·2
Nephtys hombergi           1·38           0·31          1·00    1·51    46·87   −1·0
Platyhelminthes            1·97           1·30          0·92    1·40    48·27   −0·7
Trichobranchus roseus      1·17           0·00          0·91    1·38    49·66   −1·1
Lunatia montagui           0·35           1·38          0·90    1·37    51·03   +1·0

For each simulation a total effect size, 2β, was chosen and the necessary values (2b) to be added to transformed abundances to give that total effect were calculated. J stations, from the total of 11 stations ≥ 4000 m from the oilfield centre, were chosen at random. I replicates were chosen at random from each selected station. One was left unchanged and 2bN/(I − 1), where N = 1, 2, …, I − 1, was added to the others (2b varying across species, taken from Table 4). This process resulted in J replicates in I groups, which retained the interreplicate and interstation variability in the original data-set, but with a total effect size of 2β evenly distributed amongst the groups, in the same way as for the univariate analogue.
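
The construction can be summarized in a short sketch (illustrative names and array shapes; not the authors' code). Each chosen station contributes I replicates, one per group; the first is left unchanged and the Nth has the scaled species-specific differences 2bN/(I − 1) added, with negative abundances reset to zero.

```python
import numpy as np

def simulate_gradient(station_reps, two_b, effect_frac, I):
    """station_reps: list of J arrays, each of shape (I, n_species), holding
    ln(A+1)-transformed abundances of I replicates drawn from one station.
    two_b: length-n_species vector of differences from Table 4 (2b at 2beta = 100%).
    effect_frac: chosen total effect 2beta as a fraction of 100% (e.g. 0.4 for 40%).
    Returns an array of shape (J, I, n_species): J replicates in each of I groups."""
    out = []
    for reps in station_reps:
        sim = np.array(reps, dtype=float)
        for N in range(1, I):                          # first replicate left unchanged
            sim[N] = sim[N] + effect_frac * two_b * N / (I - 1)
        out.append(np.clip(sim, 0.0, None))            # negative values reset to 0
    return np.stack(out)
```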

Comparisons of the two multivariate tests

Bray–Curtis similarity was calculated between each pair of samples to create a similarity matrix, which was then input into one-way anosim and into relate. In relate the simulated biotic similarity matrix was correlated with a distance matrix representing J replicates in I groups arranged linearly and equidistantly. Table 5 illustrates the results from parallel anosim and relate ‘seriation with replication’ tests based on simulations with increasing numbers of replicates (J) in I = 3 groups, with increasing effect sizes (2β). Being simulations based upon real and variable data these results are less smooth and predictable than those in Table 2. That being said, there are some clear messages to be learned from them. For very weak effects (2β = 2%) the intersample variability in the original data is greater than the simulated gradient. With few replicates (J = 3) both tests give results that are essentially random, and to interpret a significant result as providing evidence of the gradient existing would almost certainly be an error. Increasing the number of replicates (J = 4) allows both tests to detect the difference (negative R and ρ-values) consistently, albeit with varying degrees of significance. Note that because the matching of the stations within each effect group is being ignored the construction of these simulations predisposes towards a negative value of R if the group effect, 2β, is negligible (see Chapman & Underwood 2000). This extends the range of situations in which R and ρ are being compared and allows one to observe the group effect size coming into play and pushing the values of R and ρ back through zero to positive values as the effect size and/or the number of replicates increases.
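
Putting the pieces together, a single simulated data set can be pushed through both tests along the following lines, reusing the illustrative helpers sketched earlier (`anosim_R`, `seriation_model_matrix`, `rank_mantel_rho`, `mantel_permutation_p`). The abundance matrix below is random, purely to make the sketch runnable; it is not the survey data.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
I, J, n_species = 3, 4, 20
positions = np.repeat(np.arange(I), J)                      # group position of each sample
abund = rng.gamma(2.0, 2.0, size=(I * J, n_species))        # stand-in for transformed abundances
abund += 0.4 * positions[:, None] * rng.random(n_species)   # weak artificial gradient

dissim = squareform(pdist(abund, metric='braycurtis'))      # Bray-Curtis dissimilarities
R = anosim_R(dissim, positions)                             # categorical test statistic
rho, p = mantel_permutation_p(seriation_model_matrix(positions), dissim,
                              rank_mantel_rho, n_perm=999)  # correlational test
print(f"anosim R = {R:.3f}; relate rho = {rho:.3f}, P = {p:.3f}")
```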

Table 5.  Results from simulations. anosim (R) and relate (ρ) values and their significance (P) from analyses based on ln(A + 1) transformed abundances. J represents the number of stations, from the total of 11 stations ≥ 4000 m from the centre of the Valhall oilfield in the 1991 survey, chosen at random for a particular simulation. β represents the effect size, a percentage of the mean difference in ln(A + 1) abundance of each of the 20 species in Table 4. For a simulation, J stations were selected at random, and one replicate from each station chosen, also at random. β was added to a second, and 2β to a third, randomly chosen replicate from that station. This resulted in each case in J replicates in I = 3 groups, which retained the interreplicate and interstation variability in the original data-set, but which differed by β and 2β
J    β = 1                           β = 5                           β = 10                          β = 20                          β = 40
     R       P      ρ       P       R       P      ρ       P       R       P      ρ       P       R       P      ρ       P       R       P      ρ       P
3   −0·276  0·896  −0·163  0·847   −0·053  0·614   0·028  0·378    0·012  0·461   0·156  0·133    0·226  0·086   0·316  0·017    0·646  0·004   0·692  0·001
3    0·449  0·007   0·368  0·008   −0·004  0·504   0·146  0·164    0·004  0·471   0·208  0·075   −0·012  0·561   0·181  0·093    0·802  0·004   0·761  0·001
3   −0·119  0·729  −0·076  0·682   −0·218  0·846  −0·007  0·511   −0·235  0·843  −0·153  0·842    0·045  0·418   0·202  0·106    0·909  0·004   0·778  0·002
3    0·399  0·007   0·087  0·292   −0·514  1·000  −0·320  0·999    0·276  0·079   0·327  0·010    0·572  0·018   0·577  0·002    0·638  0·007   0·646  0·002
4   −0·139  0·835  −0·148  0·917    0·032  0·375   0·092  0·188    0·067  0·288   0·156  0·080    0·250  0·045   0·355  0·001    0·648  0·000   0·627  0·000
4   −0·150  0·895  −0·066  0·712   −0·157  0·937  −0·090  0·788   −0·028  0·585   0·063  0·243    0·185  0·089   0·285  0·005    0·780  0·000   0·748  0·000
4   −0·264  0·994  −0·243  0·992    0·016  0·412   0·058  0·281    0·056  0·326   0·079  0·228    0·373  0·005   0·440  0·001    0·718  0·000   0·684  0·000
5                                  −0·005  0·484   0·054  0·251    0·047  0·296   0·104  0·117    0·458  0·001   0·508  0·000
5                                  −0·021  0·545  −0·001  0·463    0·074  0·226   0·129  0·081    0·164  0·060   0·222  0·010
6                                                                  0·070  0·177   0·123  0·055    0·258  0·002   0·322  0·000
6                                                                  0·108  0·087   0·178  0·008    0·351  0·000   0·396  0·000
7                                                                  0·033  0·292   0·081  0·106
8                                                                  0·024  0·312   0·074  0·107

With a stronger gradient effect (2β = 10%) anosim results for J = 3 are still dominated by the intersample variability in the original data (negative R), and for higher numbers of replicates global R-values remain negative or tend towards 0, indicating that the test does not yet detect the simulated gradient. relate ρ-values are equally variable for J = 3, although less negative, and move closer to 0 and stabilize for higher values of J. In all cases ρ-values are higher than the equivalent R-values, and estimated probabilities (P) for relate tests are consistently lower. Although neither test can detect the simulated gradient against the background variability (the ‘station effect’), these results provide evidence that the powers of the two tests are not the same and that interstation variability in the original data is not affecting relate results to the same degree as it affects the results of anosim.

The difference between the two tests, in terms of power, becomes more apparent as 2β is increased to 20%, as the separation between groups of samples resulting from the simulated gradient effect now begins to equal or exceed the variation between stations in the original data. Both tests struggle to reject their appropriate hypotheses at the α = 0·05 level, but again the ρ statistic values are associated with lower values of P, providing strong evidence that the relate ‘seriation with replication’ test is the more powerful. It is interesting that increasing J does not appear to increase the ability of either test to detect the 2β = 20% gradient in a consistent fashion. For both tests the values of statistics and their associated P-values drop to a minimum at about J = 6, and then appear to increase again. This is probably a consequence of the fact that there are relationships between groups of replicates in the original data which begin to influence the simulations as more groups are brought into play, and exemplifies the difference between multivariate simulations and the smooth patterns demonstrated in the theoretical univariate analogue.

The simulations for an effect size of 2β = 40%, in which the effect size is now slightly greater than the interstation variability in the original data, best exemplify the difference in power of the two tests. These simulations, of a mild gradient, apparently match the situation occurring in the original surveys which prompted this investigation in the first place. In all cases the relate ‘seriation with replication’ tests perform better than equivalent anosim tests, and in several simulations achieve P < 0·05 when anosim does not. Finally, as J increases for 2β = 40%, or as 2β is increased to 80%, both tests consistently detect the simulated gradient effect, and appropriate null hypotheses may be rejected with confidence.

Thus these simulations tend to confirm the difference in power of the two approaches suggested by the univariate analogue, namely that with weak gradients and low numbers of replicates, the relate ‘seriation with replication’ test, analogous to linear regression, has more power to detect differences between groups of samples than the anova analogue anosim. The simulations also raise interesting questions about drawing conclusions from surveys in which replication is low, as apparently significant relationships have arisen by chance quite often (e.g. twice when J = 3, β = 1%, a gradient effect far too small to have given genuinely positive R and ρ-values). They also cast doubt on the usefulness of a modest increase in numbers of replicates in order to detect weak effects, as this does not appear to improve the ability of the multivariate tests to detect those effects markedly.

Distribution of replicates among groups

A second aspect considered in the univariate analogue was the consequence for power of distributing an equal number of replicates among different numbers of groups. The corresponding multivariate simulations are contained in Table 6. Comparing simulations with 12 replicates distributed in three groups (I = 3, J = 4), or four groups (I = 4, J = 3), both tests appear to perform better when replicates are grouped into fewer levels. This appears to be more important in the case of anosim, again reflecting the findings of the univariate case. To see how this might influence results in real cases, we can return to the five surveys examined earlier. Table 7 shows the results of relate ‘seriation with replication’ tests based on grouping samples into three categories (as before) or into actual distances from the platforms. It is immediately clear that relate based on actual distances fares no better than one-way anosim based on the three distance classes (Table 1), and the advantages of choosing the more appropriate test are lost if the replicates are spread among too many groups.

Table 6.  Results from simulations. anosim (R) and relate (ρ) values and their significance values (P) from analyses based on ln(A + 1) transformed abundances. J represents the number of stations, from the total of 11 stations ≥ 4000 m from the centre of the Valhall oilfield in the 1991 survey, chosen at random for a particular simulation. For details see text. β represents the effect size, the total difference between samples at opposite ends of the simulated gradient being 2β. In these simulations β = 20%. In simulations on the left, each simulated gradient consisted of J replicates in I = 3 treatment levels (values from Table 5). In the simulations on the right, the same treatment effect was distributed over J replicates in each of I = 4 treatment levels
     I = 3                            I = 4
J    R       P      ρ       P       R       P      ρ       P
3    0·226   0·086  0·316   0·017   0·022   0·432  0·134   0·129
3   −0·012   0·561  0·181   0·093   0·046   0·366  0·248   0·033
3    0·045   0·418  0·202   0·106   0·500   0·004  0·581   0·000
3    0·572   0·018  0·577   0·002   0·012   0·446  0·299   0·007
4    0·250   0·045  0·355   0·001   0·122   0·135  0·293   0·001
4    0·185   0·089  0·285   0·005   0·245   0·025  0·408   0·000
4    0·373   0·005  0·440   0·001   0·368   0·004  0·469   0·001
Table 7.  Results from relate ‘seriation with replication’ tests based on three distance categories, as in previous analyses, compared with tests based on I actual distances from the platform/centre of field
Survey               relate on 3 distance classes    relate on I actual distances
                     ρ       P                       I     ρ       P
Veslefrikk 1993      0·245   0·005                   6     0·208   0·070
Gullfaks A 1989      0·200   0·019                   7     0·094   0·219
Gullfaks B 1993      0·218   0·014                   6     0·257   0·107
Statfjord A 1993     0·310   0·034                   5     0·208   0·121
Statfjord C 1993     0·379   0·002                   6     0·349   0·025

Discussion

Although the basics of scientific investigation are apparently straightforward, this does not necessarily imply that designing and conducting a study is straightforward. There is rarely only one correct approach; indeed, it can be argued that there is never only one way to perform an experiment or survey. Many extensive (and expensive) ecological investigations are undertaken, with hours spent working up samples in the laboratory, in order to produce data to which many statistical and numerical techniques are often applied indiscriminately (Elliott 1994). Frequently the hypotheses implicit in the statistical tests being applied do not match completely the stated aims of a particular piece of work. Most tests have an implicit alternative hypothesis which they are designed to have some power to detect in the event of the null hypothesis being false. The test actually employed in a particular investigation is often one that the investigator is familiar with, or one that is available in the software and is easy to apply (whether rightly or wrongly), rather than one chosen explicitly because it best reflects the question being posed. There are two problems with this. First, the investigator should know exactly why a particular test is being applied and what the test results actually mean, as otherwise the relevance of the results to the hypothesis being posed may be called into question. Secondly, attention must be given to the design of sampling programmes to maximize precision and to ensure adequate power to detect possible changes (Underwood 1996), and applying an appropriate test (and hence posing a different alternative hypothesis) may influence the ability of the study to detect an effect, owing to differences in the power of the different tests.

Rather than the simple detection of change, any change, followed by an attempt to determine whether or not that change has any biological or environmental relevance, more thought should go into detecting changes of biological relevance defined a priori and, in applied studies, defining the type and magnitude of change which is likely to be of concern (Underwood 1996). This implies appropriate a priori specification of alternative hypotheses, and the sampling design consequences are discussed in detail by Underwood (1996, 2000). The importance of power in wider issues of environmental management is also discussed by Peterman (1990a,b) and Fairweather (1991). An ‘acceptable’ probability of rejecting a false null hypothesis is usually taken to be greater than 0·8 or, more conservatively, it can be set equal to the complement (1 − α) of the probability (α) of falsely rejecting a true null hypothesis (Peterman 1990a). The power of a test is a function of the size of the effect to be detected (the predicted difference between control and disturbed sites, for example), the sample variance, the number of replicates and the value of α (usually 0·05) adopted for the test. In some cases it may be possible to select in advance some effect size that is biologically or commercially significant. This can be achieved by using information from previous studies, but unfortunately such information is usually scarce in the literature. Information on variability can be gained from pilot, or previous, studies but most pilot studies tend to be done only once and cannot therefore provide any information on temporal variation (Morrisey 1993). Most journals refuse to accept ‘survey’ information, apparently in the belief that information about where animals occur and what they are doing is no longer of interest to the ‘modern’ ecologist. Even when biological information is published, few journals have a mechanism whereby the raw data may be made available to other workers either in the present or in the future (Elliott 1994). Publication of such results would make a significant contribution to providing data which would allow the definition and calculation of relevant effect-sizes (Morrisey 1993).

If all the above information is available, then in the univariate case the amount of replication required to achieve an acceptable probability of detecting the chosen size of effect may be calculated (Underwood 1997). Conversely, when a test has failed to reject a null hypothesis, a posteriori analysis may be used to calculate the probability that the null hypothesis would have been rejected if it really was false, or to calculate the size of effect that the design of sampling used was capable of detecting. Such analysis can be an important tool in situations where prior estimation of an appropriate effect-size is not possible, or where managers have no control over the design of sampling programmes or experiments but are required to make decisions on the basis of the results (Morrisey 1993).

A regression-based statistical test, with replicates grouped into a moderate number of categories, appears to be the most powerful alternative for the detection of a gradient effect in both univariate and multivariate scenarios. Although the data with which this has been demonstrated here come from pollution surveys, this conclusion is entirely general in the univariate case, appears generally applicable to multivariate data, and is therefore of relevance to any ecological investigation in which a gradient is to be expected. This includes natural gradients as well as experimental investigations; experiments examining relationships between species numbers and ecosystem functioning are a topical example. Multivariate techniques have been shown repeatedly to be more ‘sensitive’ (i.e. powerful) than univariate techniques (Warwick & Clarke 1991; Somerfield & Clarke 1997; Clarke & Warwick 2001), and although there is no general framework for determining power in the multivariate context, the repeated demonstration that multivariate techniques produce significant results when univariate techniques do not may be taken as evidence that a survey designed to have adequate power in a univariate context (e.g. for diversity indices) should have adequate power in the multivariate context (of changes in whole community composition). Problems arise, however, when a multivariate test fails to reject the null hypothesis, as there is no clear framework for constructing an estimate of the power of the test a posteriori. How can effect sizes be defined, or the calculations made in order to determine adequate replication? In this paper we demonstrate one way, given adequate prior information, of constructing simulations which begin to provide answers to such questions.

Acknowledgements

Various consultants, institutions and oil companies were involved in gathering the original data used in the analyses and simulations in this paper, and we gratefully acknowledge their work. Parts of this study were supported by grants to F.O. from the Norwegian Research Council and NATO. This study was also funded in part by the UK Department for Environment, Food and Rural Affairs (Projects AE0231, AE1137 and CDEP 84/5/295) and is a contribution to the Plymouth Marine Laboratory's research programme Scaling Biodiversity and the Consequences of Change.
