Keywords: accuracy; error; gold standard; model; prevalence; sample size

ABSTRACT


Aim  To explore the impacts of imperfect reference data on the accuracy of species distribution model predictions. The main focus is on impacts of the quality of reference data (labelling accuracy) and, to a lesser degree, data quantity (sample size) on species presence–absence modelling.

Innovation  The paper challenges the common assumption that some popular measures of model accuracy and model predictions are prevalence independent. It highlights how imperfect reference data may impact on a study and the actions that may be taken to address problems.

Main conclusions  The theoretical independence of prevalence of popular accuracy measures, such as sensitivity, specificity, the true skill statistic (TSS) and the area under the receiver operating characteristic curve (AUC), is unlikely to occur in practice due to reference data error; all of these measures of accuracy, together with estimates of species occurrence, showed prevalence dependency arising through the use of a non-gold-standard reference. The number of cases used also had implications for the ability of a study to meet its objectives. Means to reduce the negative effects of imperfect reference data in study design and interpretation are suggested.


INTRODUCTION


Presence–absence models are used widely in biogeography. For example, species distribution models employing presence–absence data have numerous applications, including the evaluation of variables affecting the spatial distribution of a species and of how that distribution may alter with climate change, as well as roles in the design of field studies, the prioritization of sites for conservation, and the monitoring of declines and expansions in species ranges (Farber & Kadmon, 2003; McPherson et al., 2004; Allouche et al., 2006; Guisan et al., 2006). Despite limitations, often relating to factors such as the effects of spatial autocorrelation, interactions between variables and the limited inclusion of theory (Austin, 2007), these models are well established (Guisan et al., 2006). Modelling with presence–absence data is attractive for a variety of reasons, not least the ability to acquire such data more easily and cheaply than alternatives, such as data on cover or abundance. Additionally, species distribution data are becoming increasingly available through archived resources and data-sharing activities (Guisan et al., 2006; Graham et al., 2008) and many methods are available for analysing such data (e.g. Austin, 2007; Parviainen et al., 2009). Ultimately, however, the value of a model is a function of the accuracy of its outputs. While the quality of model predictions is clearly a function of the quality and quantity of the data used in the modelling (Lobo, 2008; Wisz et al., 2008), the main concern here is with issues associated with the reference data, such as those used in validating model outputs.

The validation or assessment of the accuracy of model output involves comparing model-based predictions with reality. Unfortunately, reality or the ‘truth’ about a species' distribution is rarely known unless a simulated data set is used (Austin et al., 2006; Franklin et al., 2009). When using real data it must be recognized that the acquisition of error-free data would be very expensive and thus few species will ever be represented perfectly in data sets (Turner, 2006). The problem may be particularly severe if using historical data sets, as there may be few meta-data on data quality (Hortal et al., 2008). Nonetheless, there is a clear need for reliable data, coupled with the inescapable fact that such data will often be lacking.

Problems with species distribution data may be particularly apparent for absence data (Jiménez-Valverde & Lobo, 2007; Graham et al., 2008; Lobo, 2008). A key concern with absence data is that it is typically impossible to be confident that a recorded absence is not simply an undetected presence (MacKenzie, 2005; Cronin & Vickers, 2008; Franklin et al., 2009). Such false absences may arise especially for cryptic species that are difficult to detect and may substantially bias modelling activities (Hartel et al., 2009). Even with easy-to-identify species there will be errors in presence–absence data sets. For example, marked differences in data quality linked to surveyor experience and training have been observed. Moreover, the errors may be largest for the least common species, having negative impacts on conservation activities that seek to protect rare species or to detect changes in their occurrence (Ringvall et al., 2005). Thus, while the acquisition of presence–absence data is believed to be less susceptible to measurement and judgement errors than other types of data (Ringvall et al., 2005), there are still many concerns that can result in error and uncertainty. In addition to variations in species detectability (MacKenzie, 2005; Guisan et al., 2006; Hartel et al., 2009), other concerns noted about presence–absence data in species distribution modelling research include incomplete information (Conlisk et al., 2009), locational errors (Freeman & Moisen, 2008; Graham et al., 2008; Johnson & Gillingham, 2008; Osborne & Leitão, 2009), imbalanced proportions of presences and absences (Real et al., 2006) and the choice of threshold used to convert continuous probabilistic outputs into a binary classification for mapping (Jiménez-Valverde & Lobo, 2007; Freeman & Moisen, 2008). Although it may still sometimes be possible to derive useful information from imperfect modelling analyses, limitations in the quality of models and their predictions can have major negative impacts on interpretation (MacKenzie, 2005; Graham et al., 2008; Osborne & Leitão, 2009). Assessment of accuracy is therefore important in helping to evaluate the fitness of a model and its outputs for a specific application. Central to this task is the reference data set against which the model outputs are compared.

Many approaches to validation have been discussed (Fielding & Bell, 1997; Liu et al., 2009). Commonly used approaches are based on a binary or 2 × 2 confusion matrix, which is a cross-tabulation of the class label (i.e. presence or absence) predicted by the model against that contained for the same site in the reference data set for every case in an independent testing set. This matrix summarizes the allocations made in the two classifications (Fig. 1a). The cases upon which the classifications agree in labelling lie on the main diagonal of the matrix, while the off-diagonal elements highlight the two types of error that may occur: omission (false absence) and commission (false presence). The magnitude of these errors clearly impacts on the accuracy of the classification, although the relative importance of the errors of omission and commission may vary between studies (Fielding & Bell, 1997; Jiménez-Valverde & Lobo, 2007).


Figure 1. Accuracy assessment. (a) The binary confusion matrix. The highlighted elements of this matrix show the number of correctly allocated cases of presence (a), the number of correctly allocated cases of absence (d), as well as the number of cases that represent omission (c) and commission (b) errors. (b) Equations for the estimation of prevalence and popular measures of accuracy. Many other measures of accuracy may be derived from this matrix, for example, overall accuracy = (a + d)/n and the user's accuracy for presence = a/(a + b).


Many measures of accuracy may be derived from a confusion matrix (Fielding & Bell, 1997; Liu et al., 2009). Widely used measures include the proportion of correctly allocated cases, sensitivity (S1), specificity (S2), the true skill statistic (TSS), overall accuracy and the kappa coefficient. A further measure that is not directly derived from the confusion matrix, but which is based upon sensitivity and specificity, is the area under the receiver operating characteristic (ROC) curve (AUC). These various measures reflect different aspects of accuracy and may vary in their value to a particular study. Some measures of accuracy may, however, have undesirable characteristics and researchers have been cautioned about their use.

A key attribute of a useful measure of accuracy is that it be independent of the prevalence of the species (Manel et al., 2001), with prevalence typically defined as the proportion of the test sites occupied by the species. Thus, the magnitude of the accuracy measure should not vary with the relative occurrence of the species under study. Failure to correct for the impacts of prevalence can lead to erroneous assessments and misleading interpretations (Valenstein, 1990; Manel et al., 2001). The use of measures of accuracy that are sensitive to variations in prevalence has therefore been discouraged and prevalence-independent measures promoted. For example, popular measures like overall accuracy and user's accuracy (positive predictive value) are widely criticized for being prevalence dependent and their use in ecological applications discouraged (Fielding & Bell, 1997; Manel et al., 2001; Farber & Kadmon, 2003; Freeman & Moisen, 2008). Strangely, the literature has been inconsistent on some other accuracy measures. The kappa coefficient, for example, is widely promoted in ecology even though it is known to be prevalence dependent (Manel et al., 2001; McPherson et al., 2004; Freeman & Moisen, 2008) and prevalence correction may be unsuitable (Hoehler, 2000). Moreover, there are concerns with the kappa coefficient as an accuracy measure that lessen its value for many studies (Foody, 2008a). Of the remaining accuracy measures, the most widely used and promoted are, or are based on, sensitivity and specificity; Fig. 1(b) gives formulae for these measures, yielding the apparent or true value depending on whether an imperfect or gold-standard reference is used.

Sensitivity is the probability of correctly predicting a presence while specificity is the probability of correctly predicting an absence (Valenstein, 1990; Fielding & Bell, 1997; Farber & Kadmon, 2003; McPherson et al., 2004). The magnitude of sensitivity is influenced by omission errors and hence by false absences. Conversely, specificity is influenced by commission errors and so varies as a function of the false presences. Although sensitivity and specificity each have, from a theoretical standpoint, the desirable feature of being unaffected by prevalence, neither is always useful alone as each can only summarize the accuracy of the model in relation to one category (presence or absence) and conveys no information on the other (McPherson et al., 2004; Allouche et al., 2006). Hence, measures that combine sensitivity and specificity, such as the TSS (Allouche et al., 2006; Freeman & Moisen, 2008) and AUC (Fielding & Bell, 1997; Freeman & Moisen, 2008), are widely promoted and, being founded on measures believed to be prevalence independent, are also taken to be unaffected by variations in prevalence. However, these measures are not problem free and some important concerns have been raised. For example, although the TSS has the attractive feature of combining sensitivity and specificity, it does so in a way that gives each equal weighting, which may not always be appropriate (Allouche et al., 2006). Similarly, the AUC has the attractive feature of being based upon the entire spectrum of sensitivity and specificity, but it may sometimes be appropriate to consider the shape of the curve, weight sensitivity and specificity differently and base calculations only upon their meaningful range for the task in hand (Kazmierczak, 1999; Williams & Peterson, 2009). Other limitations, not least a variation in AUC with the extent of the study area, have led to calls for the AUC not to be used alone but in combination with sensitivity and specificity (Lobo et al., 2008). This concurs with other suggestions that sensitivity and specificity be reported when working in areas of differing prevalence (Kazmierczak, 1999).
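As a concrete illustration, the quantities defined in Fig. 1(b) can be computed directly from the four cells of the confusion matrix. The sketch below is a minimal illustration only (the matrix counts used are hypothetical); all measures are apparent values whenever the reference itself contains error.

```python
def confusion_matrix_measures(a, b, c, d):
    """Accuracy measures from the binary confusion matrix of Fig. 1(a).

    a: presences in the reference predicted as presence (correct)
    b: absences predicted as presence (commission errors)
    c: presences predicted as absence (omission errors)
    d: absences predicted as absence (correct)
    """
    n = a + b + c + d
    prevalence = (a + c) / n           # proportion of reference presences
    sensitivity = a / (a + c)          # S1: correctly predicted presences
    specificity = d / (b + d)          # S2: correctly predicted absences
    tss = sensitivity + specificity - 1
    overall = (a + d) / n
    return prevalence, sensitivity, specificity, tss, overall

# Hypothetical counts used purely for illustration
print(confusion_matrix_measures(a=45, b=15, c=5, d=135))
```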

Critically, sensitivity and specificity are key measures in the evaluation of species presence–absence models, either directly or through measures founded upon them such as the TSS and AUC. Central to their use and widespread promotion in ecology is a commonly accepted view that their mathematical basis ensures that they are, in theory, independent of prevalence (McPherson et al., 2004). This paper evaluates the validity of this view in practice given the aforementioned imperfections that exist in species presence–absence data sets used in validation activities. Specifically, the paper: (1) investigates the impacts of reference data error on the apparent accuracy of model predictions; (2) explores the effect of the size of the reference data set on the ability to detect differences between model predictions; and (3) suggests possible means to reduce the negative impacts arising from the use of imperfect reference data.

REFERENCE DATA: QUANTITY AND QUALITY


To effectively use a selected measure of accuracy it is vital that the accuracy assessment is based upon high-quality reference data (Farber & Kadmon, 2003). The fundamental issue of concern in this paper is that the reference data set is often assumed to be a gold standard (i.e. error-free) and that an appropriate sample size is used in the analysis. Indeed, the oft-claimed prevalence independence of some measures of accuracy is based on the assumption of validation relative to a true gold-standard reference. However, a gold standard is unlikely to exist and, where one does, it may apply only to presence cases (Cronin & Vickers, 2008). The absence of a gold standard arises from a variety of imperfections, such as those noted above, which result in uncertainty that is inherent in much ecological data. This paper focuses on two issues connected to the reference data set: the accuracy of the labelling, which is an issue of data quality, and the sample size or number of cases in the testing set, which is an issue of data quantity.

The ideal reference data for the validation of a species presence–absence model and the estimation of species prevalence should have two key attributes. First, the data should be accurate. That is, each case should be correctly labelled as representing the presence or absence of the species. As noted above, however, perfect labelling is often not achieved, especially in relation to absences. Critically, the reference data set is typically not a gold standard (Kazmierczak, 1999). Instead the reference data set is imperfect, but generally believed to be highly accurate and, as a minimum, more accurate than the data set it is being compared against.

The second key attribute of an ideal reference data set is that it should contain a number of cases sufficient to meet the objectives of the study; the sample design used to acquire the data is also important, but it will be assumed that an appropriate design such as simple random sampling is used. The precision with which a property is estimated varies positively with sample size, and sometimes small samples will be of limited value (Wisz et al., 2008). However, the size of the reference data sets used is also an important issue in comparative studies, such as those seeking to determine changes in species prevalence over time. The latter is an important application in conservation studies that often use presence–absence data (Strayer, 1999). The sample sizes used in comparative studies are important, with both unduly small and unduly large sample sizes being cause for concern (Foody, 2009a). This situation arises notably because of two types of error that can arise in popular statistical hypothesis test-based analyses. First is a type I error, in which the null hypothesis, normally of no difference, is incorrectly rejected and a difference declared to exist when in reality it does not. This type of error might lead to a researcher suggesting that a species was declining in a region where the population was actually stable (Strayer, 1999). Second is a type II error, in which the null hypothesis is incorrectly upheld and the existence of a meaningful difference goes undetected. With this type of error, a researcher might conclude that a population was stable when important changes were actually occurring (Strayer, 1999). Linked to the latter is a broader concern with the interpretation of non-significant results (Hoenig & Heisey, 2001; Trout et al., 2007). The probability of making a type I error (α), together with that of making a type II error (β), should be considered in the design and interpretation of a hypothesis test and is a function of sample size; sizes that are too small may fail to detect a meaningful difference that does exist while those that are too large may be used to ascribe statistical significance to trivial and unimportant differences. Consequently, the sample size should be set to meet the study objectives. As context, the sample size used in ecological studies has varied greatly, but is often of the order of tens to thousands of cases (e.g. Stockwell & Peterson, 2002; Farber & Kadmon, 2003; Joseph et al., 2006; Wisz et al., 2008).

The primary aim of this paper is to illustrate some of the impacts arising from the use of an imperfect reference on species distribution modelling, especially the effect of reference data error on the perceived accuracy of a species presence–absence model and its prediction of species prevalence. A secondary aim is to present a concern with the sample size used in modelling studies. Finally, the paper will show that there are methods to address the concerns connected to the quality and quantity of reference data. The latter should help enable studies to be designed and results interpreted in a manner that allows the derivation of useful information.

DATA AND METHODS


The impacts of the quality and quantity of reference data on the accuracy and interpretation of presence–absence model outputs were assessed using real and simulated data. Most attention was focused on the latter since it enabled the effects of other sources of error and uncertainty to be controlled, ensuring that only the imperfections of the reference data set were responsible for any impact observed.

Attention focused on the estimation of selected key measures of accuracy and the prevalence of a species. The accuracy measures used were sensitivity (S1), specificity (S2), TSS and AUC. Of these, all except the AUC may be derived from a confusion matrix (Fig. 1). The AUC is the area under the curve derived when sensitivity is plotted against 1 − specificity for all possible thresholds applied to the probabilities of occurrence derived from a model. The calculation of each measure of accuracy is, however, made relative to a reference data set.

An error in the reference data labelling results in a case being placed into an incorrect element of the confusion matrix. The impact of reference data error will vary with the distribution of error amongst the matrix's elements, depending on the relative balance of errors of omission and commission introduced as well as the relative abundance of presences and absences in the data set. Assuming that the two classifications and their errors are conditionally independent (i.e. there is no tendency for the reference data and model predictions to err on the same cases), the impacts of reference data error on the apparent accuracy of models and their predictions may be modelled using long-established relationships amongst reference and model properties (Gart & Buck, 1966). For example, the apparent accuracy of the model may be evaluated using equations 1 and 2 if the prevalence and the accuracy, expressed in terms of sensitivity and specificity, of both the reference and model classifications are known. The apparent sensitivity is

$$\tilde{S}_1 = \frac{S_1 S_1^R \Delta + (1 - S_2)(1 - S_2^R)(1 - \Delta)}{S_1^R \Delta + (1 - S_2^R)(1 - \Delta)} \qquad (1)$$

and the apparent specificity is

$$\tilde{S}_2 = \frac{(1 - S_1)(1 - S_1^R)\Delta + S_2 S_2^R (1 - \Delta)}{(1 - S_1^R)\Delta + S_2^R (1 - \Delta)} \qquad (2)$$

where the superscript R indicates that the measure relates to the reference data, the ∼ indicates that the measure represented is the apparent value derived with the use of an imperfect reference data set and Δ is the species prevalence. The relationships in equations 1 and 2 allow the apparent sensitivity and specificity to be estimated over the full range of prevalence (as opposed to the single estimate from a confusion matrix) for scenarios when the actual quality of the data sets is known. These equations are, however, based on the assumption of conditionally independent labelling (Gart & Buck, 1966). If the labelling is conditionally dependent (e.g. there is a tendency for the model outputs and the reference data to err on the same cases) the direction and magnitude of biases introduced may be changed (Valenstein, 1990).
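A minimal numerical sketch of equations 1 and 2 (an illustration, not the code used in the study) shows how the apparent sensitivity and specificity can be traced across the full range of prevalence for assumed model and reference accuracies; the values used here correspond to scenario A as reconstructed in Table 1.

```python
import numpy as np

def apparent_accuracy(s1, s2, s1_ref, s2_ref, prev):
    """Apparent sensitivity and specificity of a model assessed against an
    imperfect reference (equations 1 and 2), assuming conditionally
    independent labelling errors."""
    prev = np.asarray(prev, dtype=float)
    # Probabilities that the reference labels a case as presence or absence
    p_ref_pos = s1_ref * prev + (1 - s2_ref) * (1 - prev)
    p_ref_neg = (1 - s1_ref) * prev + s2_ref * (1 - prev)
    app_s1 = (s1 * s1_ref * prev + (1 - s2) * (1 - s2_ref) * (1 - prev)) / p_ref_pos
    app_s2 = ((1 - s1) * (1 - s1_ref) * prev + s2 * s2_ref * (1 - prev)) / p_ref_neg
    return app_s1, app_s2

# Scenario A style values: model S1 = S2 = 0.925, reference S1^R = S2^R = 0.975
prevalence = np.linspace(0.01, 0.99, 99)
s1_app, s2_app = apparent_accuracy(0.925, 0.925, 0.975, 0.975, prevalence)
tss_app = s1_app + s2_app - 1
print(float(apparent_accuracy(0.925, 0.925, 0.975, 0.975, 0.1)[0]))  # ~0.766 at a prevalence of 0.1
```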

The impacts of an imperfect reference data set on the apparent accuracy of species distribution models were evaluated for a diverse set of scenarios (Table 1). Although the parameters used in generating the data sets contain a degree of arbitrariness, the settings used were informed by the literature. Scenarios A and B provided an indication of the impacts of reference data error arising when the classifications were both very accurate and both of low accuracy, respectively. In scenario A, the reference data set was of exceedingly high accuracy (S1^R = S2^R = 0.975) while the model was of very high accuracy (S1 = S2 = 0.925). Consequently, scenario A might be considered as approaching a near ideal state in an imperfect world; highly accurate model outputs evaluated against extremely accurate, but slightly imperfect, reference data. In contrast, scenario B represented a less desirable situation in which both the model (S1 = S2 = 0.700) and reference data (S1^R = S2^R = 0.750) classifications were of a low accuracy, but still within the range of values often encountered in the literature and represented what may be considered a ‘good’ level of accuracy. Scenarios C and D illustrated the situations in which the two classifications compared were both highly accurate but the reference data, although imperfect, had either perfect sensitivity or perfect specificity. Note that in scenario C there was error in the reference data on species absence but certainty on species presence, reflecting a situation, albeit an extreme example, that may be encountered given the previously discussed data acquisition problems. Finally, scenarios E and F were based on real values of sensitivity and specificity reported in the literature. These scenarios used data on the accuracy of presence and absence labelling by two groups of surveyors for a common and a rare species reported by Ringvall et al. (2005). The information on labelling by the set of surveyors with the most experience and training was used to represent the accuracy of the reference data while the labelling from other surveyors provided the information used for model output accuracy. Specifically, scenario E was derived using data for Pleurozium schreberi and Hylocomium splendens while scenario F used the data for the less common Cladina species; the data used were derived from Table 3 of Ringvall et al. (2005, p. 114). For each of the scenarios, equations 1 and 2 were used to illustrate the variation in apparent accuracy of the model outputs, in terms of sensitivity and specificity and their combination in the TSS, over the entire spectrum of prevalence. To help illustrate the results, a confusion matrix for scenario B was derived. The latter required presentation of the results at a specific level of prevalence and a value of 0.1 (10%) was selected.

Table 1. Accuracy of the model outputs and reference data for scenarios A–F.

Scenario   Model output               Reference data
A          S1 = S2 = 0.925            S1^R = S2^R = 0.975
B          S1 = S2 = 0.700            S1^R = S2^R = 0.750
C          S1 = S2 = 0.900            S1^R = 1.000; S2^R < 1.000
D          S1 = S2 = 0.900            S1^R < 1.000; S2^R = 1.000
E          S1 = 0.970, S2 = 0.940     S1^R, S2^R from Ringvall et al. (2005, Table 3)
F          S1 = 0.550, S2 = 0.980     S1^R, S2^R from Ringvall et al. (2005, Table 3)

S1, sensitivity; S2, specificity; the superscript R indicates that the measure relates to the reference data.

Although based upon sensitivity and specificity, the AUC is calculated from continuous probabilistic model outputs. To illustrate the impacts of reference data error on the AUC, a real data set was used, with the impact of reference data quality evaluated through the degradation of the existing reference data set. The model outputs were derived from a study investigating the impacts of climate change on the potential spatial distribution of spotted medick (Medicago arabica) in Great Britain (Foody, 2008b). Specifically, the data used related to a set of model predictions (n = 9757) derived from a geographically weighted regression analysis for the latest period (1960–90) included in the study and an associated reference data set for that period; further details are given in Foody (2008b). For the purposes of this study, it was assumed that the reference data set was perfect. The impact of imperfect reference data may therefore be illustrated through degradation of the reference data set. The degradation was achieved by changing the class label between presence and absence in the reference set for 104 randomly selected cases (1.06% of the total): the labels of 86 cases were changed manually from absence to presence and those of 18 cases from presence to absence. The AUC and its 95% confidence interval were calculated relative to the original and degraded reference data.
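The degradation experiment can be mimicked in a few lines; the sketch below is an illustration using simulated predictions rather than the spotted medick data set, and it flips a random subset of reference labels rather than reproducing the exact 86/18 split described above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Simulated stand-in for the real data set: reference labels and model probabilities
n = 9757
reference = rng.random(n) < 0.3                                          # hypothetical labels
probability = np.clip(reference * 0.5 + rng.normal(0.25, 0.2, n), 0, 1)  # hypothetical model output

auc_original = roc_auc_score(reference, probability)

# Degrade the reference by flipping the labels of ~1% of randomly chosen cases
degraded = reference.copy()
flipped = rng.choice(n, size=104, replace=False)
degraded[flipped] = ~degraded[flipped]
auc_degraded = roc_auc_score(degraded, probability)

print(f"AUC against original reference: {auc_original:.3f}")
print(f"AUC against degraded reference: {auc_degraded:.3f}")
```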

The impact of sample size on the ability to detect changes in species prevalence was modelled using fundamental relationships based on sampling theory. Assuming a scenario focused on the evaluation of change over time using paired data, the McNemar test may be used to test for a difference (Fielding & Bell, 1997; Messam et al., 2008). With this test, the required sample size (n) may be estimated from the selected significance level (α), power (1 − β) and minimum meaningful difference or effect size (δ). Here, the focus was on the relationship between sample size and power for different effect sizes, and initially, the prevalence values used were accurate and determined from use of a gold-standard reference. In keeping with common practice, the significance level α = 0.05 was used. The effect size is a function of study objectives. Some studies seek to detect relatively broad or severe changes in prevalence, perhaps of > 0.5 (Strayer, 1999), while others may wish to detect much smaller changes. Here, three effect sizes were evaluated: 0.50, 0.10 and 0.01. For each, the comparison was undertaken for a relatively rare species with the focus on detection of the differences in prevalence of 0.55 vs. 0.05, 0.15 vs. 0.05 and 0.06 vs. 0.05. The standard equation for the calculation of the required sample size for the McNemar test (e.g. Miettinen, 1968; Connor, 1987; Messam et al., 2008),

$$n = \frac{\left( Z_{\alpha/2}\sqrt{\Psi} + Z_{\beta}\sqrt{\Psi - \delta^{2}} \right)^{2}}{\delta^{2}} \qquad (3)$$

was used, where Zα/2 is the critical value of the normal distribution for the two-tailed significance level α, Zβ is the corresponding value for a test with the power 1 − β and Ψ is the probability of obtaining a discordant pair; Zα replaces Zα/2 for a one-sided test (Miettinen, 1968). Equation 3 may be used directly to estimate the required sample size for given α, β and δ or rewritten to indicate power, via Zβ, at a given sample size. Attention focused on the power to discriminate differences represented by the three selected effect sizes and the sample size required to achieve a particular level of power, following Miettinen's (1968) advice on the estimation of Ψ. Since the probability of making a type I error (α) is often viewed as four times more serious than that for a type II error (β), a power (1 − β) of 0.8 is often used. Sometimes, however, a type II error may be of great importance and a higher power required (Zedaker et al., 1994). If, for example, one wishes to detect a change associated with a rare species it would be important to not allow a change to go undetected, and so a larger than normal power may be desired.
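Equation 3 is straightforward to evaluate; the sketch below is an illustrative implementation in which the discordant-pair probability Ψ is approximated as p1 + p2 − 2p1p2 (an assumption of this sketch that reproduces the sample sizes reported in the Results for a power of 0.8).

```python
from math import ceil, sqrt
from scipy.stats import norm

def mcnemar_sample_size(p1, p2, alpha=0.05, power=0.8):
    """Approximate number of paired cases needed to detect a change in
    prevalence from p1 to p2 with McNemar's test (equation 3). The
    discordant-pair probability is approximated assuming independence,
    psi = p1 + p2 - 2*p1*p2 (an assumption of this sketch)."""
    delta = abs(p1 - p2)                    # effect size
    psi = p1 + p2 - 2 * p1 * p2             # probability of a discordant pair
    z_a = norm.ppf(1 - alpha / 2)           # two-sided critical value
    z_b = norm.ppf(power)
    return ceil((z_a * sqrt(psi) + z_b * sqrt(psi - delta**2))**2 / delta**2)

for p1, p2 in [(0.55, 0.05), (0.15, 0.05), (0.06, 0.05)]:
    print(p1, p2, mcnemar_sample_size(p1, p2))              # 15, 143 and 8161 cases
    print(p1, p2, mcnemar_sample_size(p1, p2, power=0.95))  # larger n at higher power
```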

Other scenarios, such as those arising from the comparison of independent samples or not focused on a test of difference, may occur (Messam et al., 2008; Foody, 2009b) but the set evaluated is of relevance to common ecological investigations. Also, while there are concerns with power analyses (Hoenig & Heisey, 2001), the aim here is to simply illustrate some of the key features related to power, effect size and sample size.

Finally, this paper indicates the potential to address some of the concerns linked to the quality and quantity of reference data. With regard to the latter, it will be stressed that the sample size required for a study may be determined in advance and this may include accommodation for the impacts of imperfect reference data. In relation to quality of the reference data, many methods, ranging from simple algebraic correction if the sensitivity and specificity of the reference data set are known to more complex approaches based typically on multiple classifications (Enøe et al., 2000), are available to correct estimates. As there are often limited meta-data on data quality, one approach to estimate the real accuracy of model outputs and prevalence in the absence of a gold-standard reference will be briefly discussed. This latter approach is based on a latent class model (Espeland & Handelman, 1989; Qu et al., 1996; Enøe et al., 2001). With such a model it is assumed that the classifications derived from a series of species distribution models are imperfect indicators of the unobserved (latent) status of the species and that the associations observed amongst the model outputs, which may be of unknown sensitivity and specificity, can be explained by the latent variable (Yang & Becker, 1997). Additionally, under the assumption of conditional independence, the only parameters of the latent class model are the latent class probabilities and the conditional probabilities. Here, it is important to note that the latter define the sensitivity and specificity of the classifications (Yang & Becker, 1997); sensitivity is, for example, the conditional probability that a presence is predicted for a site where the species does occur. This is evident in the formulation of a standard latent class model for a scenario in which the true status on species occurrence is represented by a single latent variable, O, in which there are two latent classes (presence O = 1 and absence O = 0) which, if based on the outputs of four independent models, may be written as

$$\pi^{GHIJO}_{ghijt} = \pi^{O}_{t}\,\pi^{GHIJ|O}_{ghij|t} = \pi^{O}_{t}\,\pi^{G|O}_{g|t}\,\pi^{H|O}_{h|t}\,\pi^{I|O}_{i|t}\,\pi^{J|O}_{j|t} \qquad (4)$$

where G–J are the models whose outputs are labelled g, h, i, j = 0, 1, $\pi^{GHIJ|O}_{ghij|t}$ is the conditional probability that the pattern of class labels derived from the models is (g, h, i, j) given that the case has a status t (1 or 0) and $\pi^{O}_{t}$ is the probability that a case has the status t (Vermunt, 1997; Yang & Becker, 1997). The conditional probabilities that represent the sensitivity and specificity of each species distribution model are parameters of the latent class model (e.g. $\pi^{G|O}_{1|1}$ is the sensitivity of model G). Here, the latent class model represented by equation 4 was used to illustrate the potential to assess the accuracy of species distribution models and derive estimates of the prevalence of a species. For this, data were simulated with known properties: G (S1 = 0.900, S2 = 0.923), H (S1 = 0.670, S2 = 0.693), I (S1 = 1.000, S2 = 0.951) and J (S1 = 0.970, S2 = 1.000). The output of each model was evaluated against a single reference data set, which was itself generated with imperfect sensitivity and specificity. Note that the data used in models G–J have some similarity with the scenarios used earlier and also included challenging features (e.g. the reference data set is sometimes less accurate than the model outputs).
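As an indication of how such an analysis can be performed, the sketch below gives a minimal expectation–maximization routine for a two-class latent class model with conditionally independent binary indicators; it is an illustrative implementation rather than the lem software used here, and the simulated accuracies follow the values listed above for models G–J.

```python
import numpy as np

def simulate_models(n, prevalence, accuracies, seed=1):
    """Simulate binary presence-absence outputs of several models, each with
    a given (sensitivity, specificity), for a common set of cases."""
    rng = np.random.default_rng(seed)
    truth = rng.random(n) < prevalence
    labels = []
    for s1, s2 in accuracies:
        p_presence = np.where(truth, s1, 1 - s2)   # P(label = presence | truth)
        labels.append(rng.random(n) < p_presence)
    return np.column_stack(labels).astype(int)

def lca_em(X, n_iter=2000, seed=2):
    """EM estimation of a two-class latent class model (equation 4), returning
    the estimated prevalence and each model's sensitivity and specificity."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    prev = 0.5
    sens = rng.uniform(0.7, 0.95, m)   # P(label = 1 | latent presence)
    spec = rng.uniform(0.7, 0.95, m)   # P(label = 0 | latent absence)
    for _ in range(n_iter):
        # E-step: posterior probability that each case is a true presence
        l1 = prev * np.prod(np.where(X == 1, sens, 1 - sens), axis=1)
        l0 = (1 - prev) * np.prod(np.where(X == 1, 1 - spec, spec), axis=1)
        w = l1 / (l1 + l0)
        # M-step: re-estimate the latent class and conditional probabilities
        prev = w.mean()
        sens = (w[:, None] * X).sum(axis=0) / w.sum()
        spec = ((1 - w)[:, None] * (1 - X)).sum(axis=0) / (1 - w).sum()
    if sens.mean() < 0.5:              # latent classes came out swapped; relabel
        prev, sens, spec = 1 - prev, 1 - spec, 1 - sens
    return prev, sens, spec

# Accuracies assumed for models G-J, as listed above
accuracies = [(0.900, 0.923), (0.670, 0.693), (1.000, 0.951), (0.970, 1.000)]
X = simulate_models(n=5000, prevalence=0.1, accuracies=accuracies)
prev_hat, sens_hat, spec_hat = lca_em(X)
print(round(prev_hat, 3), sens_hat.round(3), spec_hat.round(3))
```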

RESULTS AND DISCUSSION


Data quality

Although sensitivity and specificity are widely perceived to be prevalence-independent measures of classification accuracy, the results in Fig. 2 show that they can vary greatly with prevalence. In each scenario, marked systematic variations in sensitivity and/or specificity were observed and arose purely as a consequence of error in the reference data set. Additionally, the general trend was for the measures of accuracy to be underestimated by an amount that was generally highest at extreme values of prevalence. Note also that the magnitude of the bias introduced by the use of an imperfect reference set could be very large even if the reference set was of very high accuracy. For example, with scenario A, in which both classifications were of a very high accuracy, sensitivity and specificity could be substantially underestimated at extreme values of prevalence (e.g. at Δ = 0.1 the S1 of 0.925 was perceived to be 0.765). The exact nature of the relationship between sensitivity or specificity and prevalence was a function of the quality of the two classifications cross-classified, hence the range of shapes in the relationships depicted in Fig. 2. Note that the latter includes scenarios in which the sensitivity (specificity) may be prevalence independent, as the reference test had perfect specificity (sensitivity) and so produced no false presences (absences), but the specificity (sensitivity) varied with prevalence. This is an important feature to note, as for some applications it may be reasonable to use just sensitivity (specificity) and it is possible for such a measure to be accurately estimated and independent of prevalence, even if the reference data set was imperfect. Note, for example, that scenario C represents the situation in which the reference data set contained error in relation to species absences (i.e. it had imperfect specificity) but was correct in relation to species presences (i.e. it had perfect sensitivity). In this scenario, the apparent specificity was independent of prevalence while the apparent sensitivity varied greatly with prevalence. As it is often suggested that information on both sensitivity and specificity be used, perhaps combined into a measure such as TSS, a negative impact of reference data error may still occur even if the reference data is perfect in relation to presences or absences but not both. Critically, for the assumed situation of conditionally independent labelling, the general trend was that the use of an imperfect reference led to a systematic underestimation of sensitivity and specificity, the magnitude of which varied with prevalence.


Figure 2. Relationships between sensitivity (grey solid circle), specificity (open black circle) and the true skill statistic (TSS; black solid triangle) with prevalence: (a)–(f) for scenarios A–F, respectively (see text).


As a ‘sense check’ on the results, it may help to focus on a confusion matrix generated for 1000 cases following scenario B, for which the perceived accuracy values and prevalence were substantially mis-estimated. For example, at a prevalence of 10% (Δ = 0.1), if the reference was a gold standard then 100 of its 1000 cases would have been labelled as presences. However, as S1^R = 0.75, 25% of actual presences (i.e. 25 cases) would be mislabelled as absences and, as S2^R = 0.75, 25% of actual absences (i.e. 225 cases) would have been mislabelled as presences, resulting in a net overestimation of presences by 200 and an apparent prevalence of 0.3 (Fig. 3). Sensitivity and specificity were also mis-estimated, at approximately 0.40 and 0.69, respectively (Figs 2b & 3), relative to the reality of S1 = S2 = 0.70. Reference data error therefore led to potentially substantial mis-estimation of sensitivity, specificity and prevalence. The magnitude of the mis-estimation itself varies as a function of prevalence (Fig. 2b).
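Where the reference sensitivity and specificity are known, this mis-estimation can be reversed algebraically under conditional independence (correction equations of the kind referred to in Fig. 3b; Staquet et al., 1981; Enøe et al., 2000). The sketch below is an illustrative rendering of such a correction applied to cell counts consistent with scenario B at Δ = 0.1; the counts themselves are an assumption of the illustration.

```python
def correct_for_reference_error(a, b, c, d, s1_ref, s2_ref):
    """Recover true prevalence, sensitivity and specificity from a confusion
    matrix compiled against an imperfect reference of known accuracy,
    assuming conditionally independent labelling errors."""
    n = a + b + c + d
    j = s1_ref + s2_ref - 1                                # Youden index of the reference
    apparent_prevalence = (a + c) / n
    prevalence = (apparent_prevalence + s2_ref - 1) / j    # Rogan-Gladen style correction
    s1 = (a * s2_ref - b * (1 - s2_ref)) / (n * prevalence * j)
    s2 = (d * s1_ref - c * (1 - s1_ref)) / (n * (1 - prevalence) * j)
    return prevalence, s1, s2

# Cell counts consistent with scenario B at a prevalence of 0.1 (n = 1000)
print(correct_for_reference_error(a=120, b=220, c=180, d=480,
                                  s1_ref=0.75, s2_ref=0.75))
# -> (0.1, 0.7, 0.7): true prevalence, sensitivity and specificity recovered
```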


Figure 3. Results for scenario B when Δ = 0.1: (a) confusion matrix and (b) equations that may be used to correct for known reference data errors and, given conditional independence, which yield correct estimates of sensitivity, specificity and prevalence. Further details on the equations may be found in Staquet et al. (1981) and Enøe et al. (2000). S1 is sensitivity and S2, specificity.


Variation in the prevalence of a species also had an impact on the estimation of the TSS calculated for scenarios A–F. Given that the TSS is derived from the direct combination of sensitivity and specificity, it should not be a surprise that it too varied with prevalence. As with sensitivity and specificity, there was no simple general trend, with the nature of the relationship between TSS and prevalence a direct function of the quality of the two classifications compared (Fig. 2). In all of the scenarios illustrated, however, TSS was underestimated by an amount that varied with prevalence in a manner determined by the quality of the cross-classified data sets. For example, while the TSS for scenario A was 0.85 its perceived value ranged from 0.0 at the extreme values of prevalence to a peak at c. 0.81 at a prevalence of 0.5.

Unlike sensitivity, specificity and TSS, the AUC cannot be calculated from a basic confusion matrix. However, as the AUC is a function of sensitivity and specificity, the results above suggest that it too may be expected to vary with prevalence through the use of imperfect reference data. The latter was explored with the data on the occurrence of spotted medick in Great Britain using the original and degraded reference data sets. With the original reference data, the AUC was estimated to be 0.883 with a 95% confidence interval of 0.876–0.891. With the degraded reference data, the AUC was estimated to be 0.866. Although this may not seem greatly lower than the original estimate, the difference was large and significant at the 95% level. Note, for example, that the 95% confidence interval derived with the degraded reference was 0.858–0.874, which does not overlap with that derived from the use of the original reference data. Thus, the degradation of just c. 1% of the reference data set resulted in a ROC curve with an AUC that would appear to differ significantly from the AUC calculated from the original data.

Data quantity

Turning to the issue of reference data quantity, the ability to detect a difference in prevalence and ascribe it statistical significance is a function of the size of the data set used. Using equation 3 with each of the three effect sizes considered, it was clear from Fig. 4 that test power increased with sample size and that the sample size required to detect a very small difference (e.g. δ = 0.01) was substantially greater than that required to detect a large difference (e.g. δ = 0.50). Given that the sample sizes used in ecology often vary from tens to thousands of cases, Fig. 4 may suggest that the former may be too small to detect a small change while the latter could be too large and ascribe statistical significance to trivial differences.


Figure 4. Relationship between sample size and power for tests of the difference in prevalence for three scenarios. Note the logarithmic scale on the x-axis.


Researchers need to consider the power of a test in the design and interpretation of investigations. These results confirm that small sample sizes may be used if the desire is to detect large differences. However, the results also highlight that large sample sizes may be required for high power and/or detection of small effects. For example, the sample size required to detect the three effect sizes of 0.50, 0.10 and 0.01 with a power of 0.8 was 15, 143 and 8161, respectively. These values rose, respectively, to 22, 235 and 13,510 if it was decided that the risks of a type I and a type II error were equal at 0.05, and so a power of 0.95 was required. Note that undertaking a test with inadequate power in relation to the study objectives is of little, if any, value as the study may be incapable of yielding useful results.

Implications and remedial actions

The results highlight important concerns about the quantity and quality of data used in species presence–absence modelling. The results call for careful selection of a sample size to meet the study objectives, noting that sizes smaller or larger than required can cause problems. There are also likely to be associations between data quality and quantity. If, for example, a large data set is required, this may result in data acquisition being rushed or resources spread thinly, with a consequent decrease in data quality. Data quality is, however, important, and deviation from the often assumed gold standard can be a cause of major error. Critically, the results highlight, contrary to widely accepted views, that some measures of accuracy such as sensitivity and specificity may in practice be prevalence dependent. Moreover, the results actually concur with those provided in some of the ecological literature, including some that argues for prevalence independence. For example, Allouche et al. (2006), McPherson et al. (2004) and Manel et al. (2001) argue for prevalence independence despite presenting results with trends similar to those shown in Fig. 2, speculating that the trends observed may arise from some other source such as an unspecified ecological process or chance. Here, prevalence dependence of popular measures of accuracy was shown to arise through the use of an imperfect reference and, as a gold standard is unlikely ever to occur, may be a widespread if not near universal problem.

Thus far, the focus has been on negative impacts of imperfect reference data. It should be stressed, however, that the relationships are of a systematic nature which highlights a potential for reducing their effect. Indeed, it is sometimes possible to take constructive action against these negative impacts. It is, for example, possible to calculate the required sample size in a manner that also allows for imperfections in the data. In the situations discussed above, sensitivity and specificity were underestimated and this means a larger sample is required in order to estimate actual prevalence to a given degree of precision (Rahme & Joseph, 1998; Messam et al., 2008). Reference data imperfections also need to be considered in the calculations of power or sample size of a comparative test. Messam et al. (2008) suggest that the effect of interest be defined in terms of real prevalence and then converted into apparent prevalence for use in sample size formulae. As a guide to how this type of recommendation may be implemented, Rahme & Joseph (1998) and Messam et al. (2008) show that the sample size required to estimate prevalence to a given degree of precision is

$$n = \frac{Z_{\alpha/2}^{2}\,\tilde{p}\,(1 - \tilde{p})}{h^{2}\,(S_1 + S_2 - 1)^{2}} \qquad (5)$$

where h is the half-width of the 100(1 − α)% confidence interval and $\tilde{p}$ is an a priori estimate of the apparent prevalence. Note that if the data were perfect, with S1 = S2 = 1, the equation reduces to the standard binomial sample size formula. Imperfections in the data cause the denominator of equation 5 to decrease, and hence a larger sample size to be required, to the extreme of S1 + S2 = 1 at which an infinite sample is indicated (Rahme & Joseph, 1998).
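A short calculation makes the inflation implied by equation 5 concrete; the sketch below is illustrative only, and the apparent prevalence, interval half-width and accuracy values used are arbitrary.

```python
from math import ceil
from scipy.stats import norm

def prevalence_sample_size(p_apparent, half_width, s1, s2, alpha=0.05):
    """Sample size needed to estimate prevalence with a confidence interval of
    the stated half-width when the data have sensitivity s1 and specificity s2
    (equation 5); s1 = s2 = 1 recovers the standard binomial formula."""
    z = norm.ppf(1 - alpha / 2)
    return ceil(z**2 * p_apparent * (1 - p_apparent)
                / (half_width**2 * (s1 + s2 - 1)**2))

# Arbitrary illustration: apparent prevalence 0.3, 95% CI half-width 0.05
print(prevalence_sample_size(0.3, 0.05, s1=1.00, s2=1.00))  # perfect data
print(prevalence_sample_size(0.3, 0.05, s1=0.75, s2=0.75))  # imperfect data: ~4 times larger
```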

It is also possible to derive an accurate estimate of model accuracy and species prevalence in the absence of a gold-standard reference. If the accuracy of the reference data set is known, this may be achieved algebraically using the equations in Fig. 3(b), which yield the true values; equations for other measures may be found in the literature (e.g. Staquet et al., 1981). One alternative approach, illustrated briefly here, is latent class modelling, which may be applied not only when there is no gold standard but, in some situations, when there are no reference data at all. The model represented by equation 4 was solved with the lem software (Vermunt, 1997). A model with a good fit was formed (L2 = 6.36, d.f. = 6; indicating that the data and model did not differ significantly) and the estimated accuracy of each model was derived (Table 2). The latter estimates were close to reality and were derived without a gold-standard reference. The results suggest a considerable potential for latent class modelling, but the method and its suitability for ecological applications require more investigation.

Table 2. The predicted sensitivity and specificity derived from the latent class model.

Model   S1 predicted (actual)   S2 predicted (actual)
G       0.9072 (0.9000)         0.9214 (0.9230)
H       0.6804 (0.6700)         0.6932 (0.6933)
I       1.0000 (1.0000)         0.9480 (0.9511)
J       1.0000 (0.9700)         1.0000 (1.0000)

The estimate of prevalence derived was Δ = 0.097 (actual Δ = 0.100). S1, sensitivity; S2, specificity.

Additional considerations

Finally, this paper has assumed that the data sets are conditionally independent. This may not always be the case, with the reference data and some models tending to err on the same cases. In this situation, the use of an imperfect reference data set will still introduce substantial bias into estimates, but the magnitude and direction of the bias may differ from that in the case of conditionally independent data (Valenstein, 1990). It is possible that dependent errors may result in the overestimation of accuracy leading to misplaced confidence in a model and its outputs. Although the equations in Fig. 3(b) may no longer be used to correct for reference data error when conditional dependence exists, the latent class analysis may sometimes be modified to allow for correlated errors (Yang & Becker, 1997). For example, if the outputs of models I and J were conditionally dependent, the latent class model in equation 4 could be modified to

$$\pi^{GHIJO}_{ghijt} = \pi^{O}_{t}\,\pi^{G|O}_{g|t}\,\pi^{H|O}_{h|t}\,\pi^{IJ|O}_{ij|t} \qquad (6)$$

from which the prevalence as well as the sensitivity and specificity of each model could be derived; note that the final parameter in equation 6 represents the dependence between I and J. An example in a remote sensing context for conditionally independent and correlated errors is given in Foody (2010). Similarly, approaches based on the ROC curve may be used when a gold standard is unavailable and errors are independent or correlated (Choi et al., 2006). There are, of course, limitations with these techniques that require further study. In particular, it would be valuable to more fully assess the potential and limitations of latent class modelling for ecological applications. This paper has highlighted some of the potential of this approach, but there are concerns that require attention and the approach is not a panacea, being based on sometimes difficult-to-define dependence structures and strong assumptions (e.g. Albert et al., 2001; Albert & Dodd, 2004; Pepe & Janes, 2007). Failure to satisfy the assumptions or to correctly represent the dependence structures could lead to error and misinterpretation. Critically, however, this paper has shown that although the use of an imperfect reference may have large negative impacts on a study, these can sometimes be identified and actions taken to reduce them. It is hoped that this paper provides a step towards accurate and useful species distribution modelling.

CONCLUSIONS


The key conclusion of this paper is that the quantity and quality of the reference data used in the assessment of presence–absence models have major impacts on the interpretation of model results. Sample sizes should be determined to meet the objectives of a study. Although some of the literature suggests that small sample sizes may be used, it should be recognized that this gives only the power to detect relatively large differences. Small samples would result in a study having insufficient power to detect small differences that may sometimes be important, yielding potentially non-significant results that are difficult to interpret and perhaps represent wasted effort and resources. The quality of the data used is also important. Here, it has been stressed that the theoretical prevalence independence of accuracy measures such as sensitivity, specificity, TSS and AUC may not be realized in real-world applications. Reference data sets are unlikely to be error-free, and even a small amount of error can introduce a prevalence dependency that is linked to substantial mis-estimation of model accuracy and predictions. The magnitude and direction of the errors introduced varied as a function of the quality of the reference data set. As imperfections in reference data are likely to be commonplace, the results of modelling studies should be interpreted with care and not used unquestioningly. The quality of a model may not be as it seems, with substantial mis-estimation of accuracy (in either direction) possible, and this may have a negative impact on its practical application. Although only discussed briefly here, it should be recognized that the problems noted can often be reduced by careful study design (e.g. determination of the required sample size) and the adoption of appropriate correction methods.

ACKNOWLEDGEMENTS


The data on spotted medick were derived from analyses of data sets provided by the Meteorological Office and National Biodiversity Network gateway. I am very grateful to the editor and three referees who provided helpful reviews on the original manuscript.

REFERENCES


BIOSKETCH


Giles Foody is Professor of Geographical Information Science at the University of Nottingham. His main research interests concern the interface between remote sensing, informatics and biogeography.

Editor: José Alexandre F. Diniz-Filho