Capture-recapture estimates of abundance using photographic identification data are sensitive to the quality of photographs used and distinctiveness of individuals in the population. Here analyses are presented for examining the effects of photographic quality and individual animal distinctiveness scores and for objectively selecting a subset of data to use for capture-recapture analyses using humpback whale (Megaptera novaeangliae) data from a 2-year study in the North Atlantic. Photographs were evaluated for their level of quality and whales for their level of individual distinctiveness. Photographic quality scores had a 0.21 probability of changing by a single-quality level, and there were no changes by two or more levels. Individual distinctiveness scores were not independent of photographic quality scores. Estimates of abundance decreased as poor-quality photographs were removed. An appropriate balance between precision and bias in abundance estimates was achieved by removing the lowest-quality photographs and those of incompletely photographed flukes given our assumptions about the true population abundance. A simulation of the selection process implied that, if the estimates are negatively biased by heterogeneity, the increase in bias produced by decreasing the sample size is not more than 2%. Capture frequencies were independent of individual distinctiveness scores.
Photographs of natural markings are used to identify individual animals in a process known as photo-identification. Photo-identification has been used in a number of capture-recapture studies to estimate animal abundance and survival rates of several species (e.g., Buckland 1990, Hammond et al. 1990, Flatt et al. 1997, Langtimm et al. 1998, Baker 1999, Zeh et al. 2002, Calambokidis and Barlow 2004, Mizroch et al. 2004). When photo-identification is used in capture-recapture studies, the ability to accurately identify individuals from photographs needs to be examined. Accurate identification of individuals will depend on the quality of the photographs used and the distinctiveness of the natural markings of the individuals (Hammond 1986). Errors in identification can occur as false negatives and false positives and affect the accuracy of the resulting estimates.
Photographic quality (hereafter referred to as quality) includes the clarity and contrast of the photograph, the angle of the markings to the plane of the photograph (angle), and the proportion of the animal or identifying feature that is photographed. Natural markings include color patterns, the shape of features, and scars. The distinctiveness of markings can depend on the complexity of a color pattern or a feature's shape and the number and size of scars.
False-positive errors occur when different individuals are incorrectly identified as the same individual, and can negatively bias abundance estimates (Hammond 1986). Assuming that twins do not exist, the occurrence of false positives can be reduced, or even eliminated, by using markings with a high level of variability between individuals. A matching procedure that is conservative in its definition of a match (e.g., requiring multiple features of the markings to match) is also recommended.
False-negative errors occur when previously photographed individuals are not recognized, and can positively bias abundance estimates (Hammond 1986). False negatives can occur because identifying features are obscured in photographs of poor quality or because natural markings lack distinctive features. The difficulty of matching photographs of individuals with less distinctive markings is magnified by the use of poor-quality photographs; markings that are recognizable in high-quality photographs may be obscured in poor-quality ones.
Only Calambokidis et al. (1990) examined the effects of individual animal distinctiveness (hereafter referred to as distinctiveness). Calambokidis et al. (1990) examined the effects of distinctiveness and quality on the capture rates of humpback whales off California. They found that animals with the most heavily scarred flukes were seen more frequently and that animals that were seen more frequently were more likely to have high quality photographs in the database. However, capture rates were not related to the amount of white on the flukes, the distinctiveness of the trailing edge of the flukes, or the distinctiveness of the flukes. In a test of observers' ability to match photographs of fin whales (Balaenoptera physalus) to a catalog of individual whales, Agler (1992) found that the distinctiveness of the markings affected the observers' matching success.
Humpback whales are individually identifiable from variations on the ventral side of the flukes; in particular, the shape of the trailing edge of the flukes varies, and the color pattern of the flukes ranges from mostly white to mostly black (Katona et al. 1979). The natural markings on the ventral flukes are generally stable for adults, but can change in young animals (Carlson et al. 1990, Blackmer et al. 2000). Because markings on the ventral flukes of North Atlantic humpback whales are highly variable between individuals, it is generally assumed that false positives are unlikely to occur in photo-identification studies of this population. This assumption has been supported by the results of a double-tagging study in which whales were double tagged at least twice using photographic and genetic identification each time (Stevick et al. 2001). That study found no false-positive photographic errors and only 14 false-negative errors out of 414 cases of double tagging.
The distribution of most humpback whale populations ranges from low-latitude breeding and calving areas to high-latitude feeding areas (Winn and Reichley 1985). It is believed that the majority of the North Atlantic population currently migrates to the principal breeding and calving area in the West Indies during winter months (Smith et al. 1999). During summer and fall months, this population distributes among several high-latitude feeding areas, which range from the Gulf of Maine to the Arctic pack ice (Smith et al. 1999). During 1992 and 1993, researchers from seven countries conducted a large-scale photo-identification and genetics study of humpback whales across much of their North Atlantic range (Smith et al. 1999). This study, called the Years of the North Atlantic Humpback (YoNAH), was designed to study abundance, genetic structure, exchange among feeding regions, and reproductive behavior and vital rates.
The photo-identification data from YoNAH were reanalyzed here to assess the effects of quality and distinctiveness on estimates of abundance. We present estimates of the variability in evaluating three quality variables, a method to select an appropriate quality level for estimating abundance, an analysis of the effect of evaluation variability on the selection method, and an analysis of the frequency of capture of individuals by their distinctiveness levels.
The YoNAH project collected photo-identification samples in 1992 and 1993 from animals in the primary winter breeding and calving area of the West Indies and in the five principal summer feeding areas: the Gulf of Maine, eastern Canada, West Greenland, Iceland, and Norway (Smith et al. 1999). Search areas were defined to maximize the area covered, but high-density areas were sampled more heavily to avoid giving individuals in these areas lower probabilities of capture. The sampling protocol defined how to select whales, or groups of whales, for sampling and how to end sampling. The protocol was designed to maximize sample sizes and minimize the number of whales that were missed because they were difficult to sample. Full details of the YoNAH field protocols are given in Smith et al. (1999).
Photographs of the ventral side of the flukes were taken with 35-mm cameras equipped with 70–210-mm or 300-mm lenses and ISO 400 black-and-white print film. For each day a whale was photographed, one or more photographs were compared manually to all whales identified to date using a modification of procedures developed previously for the North Atlantic Humpback Whale Catalog (Katona et al. 1979, Katona and Beard 1990). All photographs were compared, in the sequence received, to all previously processed photographs by experienced technicians. When a photograph was recognized as being of a previously identified individual, the photographic laboratory manager confirmed the match. When a new photograph did not match any previously identified whale, it was compared a second time to photographs from the same sampling area. If it was still unmatched, the whale was assigned a new identification number and included in all future comparisons of incoming photographs. This procedure deviates from Katona and Beard (1990) in that all unmatched photographs were assumed to be new whales and given an identification number. Full details of the YoNAH photographic laboratory protocols are given in Smith et al. (1999).
Friday et al. (2000) found that agreement among judges was higher for an overall evaluation of quality than for a quality variable predicted from specific variables such as contrast, clarity, angle, and visible portion of the flukes. Because of these results, the quality of the YoNAH fluke photographs was evaluated for three criteria (overall quality, partial, and half; Fig. 1) using a modification of the methods presented by Friday et al. (2000). Although a single technician (TF) was used for the YoNAH photographs, the agreement results from Friday et al. (2000) were relevant when designing the evaluation procedure because of future comparisons with other photographic data.
For overall quality, the judge was asked to mentally construct a scale from the best photographs he had experience working with to the worst, divide the scale into thirds, and decide to which third each photograph belonged. Because this was a mental scale of best to worst, it was not anticipated that the YoNAH photographs would divide into thirds. Those in the lower third were subdivided into those for which the information content was not obscured by quality (3+) and those for which it was (3−). The half criterion evaluated whether a specific part of the flukes was visible in the photograph and was scored as “left,” “right,” “upper,” or “whole,” referring to the part of the flukes that was visible. The partial criterion evaluated whether or not 80% or more of the flukes in each photograph were obscured. The number of photographs in each overall quality, half, and partial category was examined by year, and the consistency between years was evaluated using a chi-squared test of homogeneity.
In addition to the three quality criteria described above, a subset of the photographs was evaluated randomly during the evaluation process for three additional criteria: the contrast of the photograph, clarity of the photograph, and the angle of the flukes in the photograph as defined by Friday et al. (2000). Evaluating additional criteria reminded the judge to keep all quality criteria in mind when judging overall quality and to better maintain the experimental setting of the previous study.
A subset of 153 photographs was evaluated a second time in a randomized, blind fashion to examine the replicability of the evaluation process. Of these 153 photographs, 33 were the photographs that had been coded for all six quality variables at the time of the first coding. Because overall quality was the primary focus of this experiment, the remaining 120 photographs were chosen such that there were 30 photographs from each of the four overall quality categories. These 120 photographs were selected systematically from those photographs that had initially been assigned to each overall quality level. The systematic selection process ensured that the photographs from each quality level were evenly spread throughout the range of pigmentation levels for humpback flukes, from mostly white to mostly black, such that the subset was representative of the YoNAH data. Pigmentation level may be important because all-black and all-white flukes are likely less distinctive than other flukes. The resulting set was predominantly whole, nonpartial flukes. The order of the 153 photographs was randomized to ensure that they were presented for evaluation in a different order than during the first evaluation. During the second evaluation, the set of 153 photographs was evaluated for all six quality variables.
Two-way tables of the number of photographs in each class from the first and second evaluations were constructed to describe the differences in quality, half, and partial scores between the two evaluations. To determine the variability in the evaluation of overall quality, the probability of changing scores between the two evaluations was estimated using a combination of binomial and trinomial models that estimated the probabilities of changing scores up or down one quality level. The diagonal of the matrix is the probability of not changing score and was calculated from the other probabilities in the row. The best fitting model from a series of four nested models, which assumed various hypotheses about which parameters are equal, was selected using maximum likelihood methods (see Appendix). Pairs of the four models were compared using a likelihood ratio test (LRT) with the null hypothesis that the probabilities estimated from the more complex model with i + j parameters were equivalent and equal to the probability estimated from the less complex model with i parameters (Burnham and Anderson 1998). The LRT is assumed to be distributed as χ² with j degrees of freedom. Confidence intervals (CI) for the estimated probabilities were calculated using the inverse of Fisher's information as an asymptotic estimate of the variance (Burnham and Anderson 1998).
The first model assumed that all probabilities of changing scores by one level were equal (1-probability model). The second model estimated each of the six probabilities of single-quality level changes separately (6-probability model). The third model assumed that the probability of single-quality level changes from quality A to quality B was the same as changing from quality B to quality A (3-probability model). The fourth model assumed that the probabilities of single-quality level changes were equal except changes from 3+ to 3− and from 3− to 3+ that were equal to each other (2-probability model).
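The 1-probability model can be illustrated with a minimal sketch. It assumes that the end categories (1 and 3−) contribute a binomial term, because they can change in one direction only, and that the middle categories (2 and 3+) contribute a trinomial term. The function names are illustrative, and the split of unchanged photographs between end and middle categories is hypothetical rather than taken from Table 2; the Appendix gives the authors' own formulation.

```python
import math

def log_likelihood(p, n_changed, n_same_end, n_same_mid):
    """Log-likelihood kernel of the 1-probability model.

    End categories (1, 3-) can move one way only: P(no change) = 1 - p.
    Middle categories (2, 3+) can move up or down: P(no change) = 1 - 2p.
    """
    return (n_changed * math.log(p)
            + n_same_end * math.log(1.0 - p)
            + n_same_mid * math.log(1.0 - 2.0 * p))

def mle_one_probability(n_changed, n_same_end, n_same_mid, tol=1e-10):
    """Bisection on the score function to find the maximum in (0, 0.5)."""
    def score(p):
        return (n_changed / p
                - n_same_end / (1.0 - p)
                - 2.0 * n_same_mid / (1.0 - 2.0 * p))
    lo, hi = 1e-9, 0.5 - 1e-9   # score is positive near 0, negative near 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def wald_interval(p, n_changed, n_same_end, n_same_mid, z=1.96):
    """Asymptotic CI from the inverse of the observed Fisher information."""
    info = (n_changed / p ** 2
            + n_same_end / (1.0 - p) ** 2
            + 4.0 * n_same_mid / (1.0 - 2.0 * p) ** 2)
    se = math.sqrt(1.0 / info)
    return p - z * se, p + z * se

# Hypothetical split: 49 changed, 57 unchanged in end categories,
# 47 unchanged in middle categories (153 photographs in total).
p_hat = mle_one_probability(49, 57, 47)
ci = wald_interval(p_hat, 49, 57, 47)
```

The more complex models differ only in how many distinct probabilities replace the single p, so the same likelihood terms can be reused with one parameter per group of transitions.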
To determine the variability in the evaluation of half, the probability of changing scores between the two evaluations was estimated using a combination of binomial models that estimated the probabilities of changing scores between whole flukes and a right or left fluke. For this analysis, a single probability was used to model the transition from whole flukes to a right or left fluke. The best fitting model from a series of three nested models was selected using maximum likelihood methods (see Appendix) and by comparing pairs of the three models using an LRT. The first model assumed that all probabilities of changing scores were equal (1-probability model). The second model estimated each of three probabilities of changes separately (3-probability model): whole flukes to right or left fluke, right to whole, and left to whole. The third model assumed that the probability of changing from a right fluke to whole flukes was equal to the probability of changing from left to whole (2-probability model).
For the variability in the evaluation of partial, the probability of changing scores was estimated using a combination of binomial models that estimated the probabilities of changing scores between partial and nonpartial (see Appendix). Two models were compared using an LRT. The first model assumed that all probabilities of changing scores were equal (1-probability model). The second model estimated each of the two probabilities of changes separately (2-probability model): nonpartial to partial and partial to nonpartial.
The whales represented in the YoNAH database were evaluated for the individual distinctiveness of their markings by a single technician (TF) using a modified version of the method presented by Friday et al. (2000). All whales were evaluated for their level of overall distinctiveness on a three-point scale from very distinctive (1) to not distinctive (3) (Fig. 2) or as UNKNOWN. Whales were scored as UNKNOWN if the judge could not assess the information content of the flukes because 80% or more of the flukes were not visible or were obscured because the quality of the best available photograph was too poor. Whales scored as UNKNOWN were removed from all distinctiveness analyses. In addition to overall distinctiveness, a subset of the whales was evaluated for two additional distinctiveness criteria: the distinctiveness of the color pattern and the distinctiveness of the trailing edge of the flukes. Evaluating additional criteria reminded the judge to consider all aspects of distinctiveness when evaluating overall distinctiveness and helped maintain the experimental setting in Friday et al. (2000).
Because the evaluations of photographic quality and of individual distinctiveness are subjective judgments, it is possible that the level of one may affect the evaluation of the other. In particular, poor-quality photographs may be scored as low distinctiveness because the judge cannot see the identifying marks on the flukes. A Pearson chi-squared test of independence, with fixed marginals, was used to evaluate the relationship between the distinctiveness and photographic quality scores. The best photograph of each of the identified whales was used. In an attempt to isolate a subset of whales with unbiased distinctiveness scores, this test was also conducted on the subset of whales formed by removing whales whose best photograph was quality 3− and again on the subset of whales formed by removing whales whose best photograph was quality 3+ or 3−. A likelihood ratio chi-squared test statistic (G²) produced similar results.
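As a minimal sketch of the two test statistics, the function below computes the Pearson chi-squared statistic and the likelihood ratio statistic (G²) for a two-way table of counts. The function name and the example counts are hypothetical and are not the YoNAH data; P-values would be obtained by referring either statistic to a chi-squared distribution with the returned degrees of freedom.

```python
import math

def independence_statistics(table):
    """Return (pearson_chi2, g2, df) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
            if observed > 0:   # a zero cell contributes nothing to G2
                g2 += 2.0 * observed * math.log(observed / expected)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, g2, df

# Hypothetical counts: rows are distinctiveness 1-3,
# columns are overall quality 1, 2, 3+, 3-.
counts = [[30, 40, 15, 5],
          [60, 90, 50, 20],
          [5, 10, 10, 8]]
chi2, g2, df = independence_statistics(counts)
```

A table with three distinctiveness levels and four quality levels has 6 degrees of freedom; dropping a quality level reduces this by 2.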
Effects on Abundance Estimates
To determine which photographs are acceptable for estimating abundance, a series of four nested estimates of abundance was calculated in which photographs were systematically removed in order of increasing quality. The first removal step in the series omitted photographs judged to be of the left fluke or right fluke or judged to be of a partial fluke. In general, whales identified from partial photographs are likely to have a lower probability of reidentification because only a small portion of the flukes is visible and identifying marks are less likely to be present. Whales identified from left- or right-half photographs cannot be reidentified from photographs of the opposite half. The second step omitted quality 3− photographs and the third step omitted quality 3+. Omitting quality 2 photographs resulted in too few matches to be reliable (Seber 1982). The results presented here do not examine upper flukes as a separate group. Upper flukes are missing the midsection of the flukes, which typically has identifying information. However, adding a removal step where upper flukes are removed after left, right, and partials produced comparable results.
Chapman's modification to the Petersen estimator (Seber 1982), which is appropriate for sampling without replacement, was used rather than Bailey's modification for sampling with replacement (Seber 1982) for two reasons: to mirror Smith et al. (1999) and to reduce the effects of unequal probabilities of capture as recommended by Calambokidis et al. (1990). Chapman's modified Petersen estimates of abundance and their variances (Seber 1982) were calculated for 1992 and 1993 separately where n1 is the number of whales identified during the winter breeding and calving season of a given year, n2 is the number identified during the following summer feeding season, and m2 is the number identified in both seasons of a given year.
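For reference, Chapman's modified Petersen estimator and its variance, in the standard form given by Seber (1982), can be sketched as follows. The function name is illustrative and the example counts are hypothetical, not YoNAH values.

```python
def chapman_estimate(n1, n2, m2):
    """Chapman's modified Petersen estimate and its estimated variance.

    n1: number identified in the first (winter) sample
    n2: number identified in the second (summer) sample
    m2: number identified in both samples
    """
    n_hat = (n1 + 1) * (n2 + 1) / (m2 + 1) - 1
    variance = ((n1 + 1) * (n2 + 1) * (n1 - m2) * (n2 - m2)
                / ((m2 + 1) ** 2 * (m2 + 2)))
    return n_hat, variance

# Hypothetical counts for illustration only.
n_hat, var = chapman_estimate(n1=400, n2=500, m2=40)
se = var ** 0.5
```

Because m2 appears in the denominator, removing photographs that lower the number of matches inflates both the estimate's variance and its sensitivity to a few missed matches.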
For each abundance estimate the mean squared error was calculated as the bias squared plus the variance, which gave equal weights to bias and precision. Estimates are expected to be positively biased from false negatives when poor-quality photographs are included, but less precise due to smaller sample sizes when they are removed. The estimate with the minimum mean squared error was believed to be the estimate with the best balance between bias and precision.
The true abundance of the population was not known. The abundance estimates calculated after the third removal step, including only quality 1 and 2 photographs, were assumed to be the true abundance (Ntrue) for the purposes of calculating bias for each year. These abundance estimates were selected because they were calculated from the best-quality photographs. However, because of this assumption, the mean squared errors of the estimates that include only quality 1 and 2 photographs are, by definition, equal to their variances. It should be noted that these estimates of Ntrue might be biased if there are violations of the assumptions of the Chapman's modified Petersen estimator.
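The selection rule can be sketched as below, assuming the estimate from the final removal step stands in for Ntrue. The function name and the estimates and variances in the example are hypothetical and are not the values in Table 4.

```python
def select_subset(estimates):
    """estimates: list of (label, n_hat, variance), ordered by removal step.

    The final entry (best-quality photographs only) is used as N_true,
    so its bias is zero and its mse equals its variance.
    """
    n_true = estimates[-1][1]
    scored = []
    for label, n_hat, variance in estimates:
        bias = n_hat - n_true
        scored.append((bias ** 2 + variance, label))   # mse = bias^2 + variance
    best_label = min(scored)[1]
    return best_label, scored

# Hypothetical estimates and variances for one year.
steps = [("All photographs", 6000.0, 300.0 ** 2),
         ("No halves or partials", 5800.0, 320.0 ** 2),
         ("Quality 1, 2, 3+", 5300.0, 380.0 ** 2),
         ("Quality 1, 2", 5200.0, 520.0 ** 2)]
best, mse_table = select_subset(steps)
```

In this made-up example the third subset is selected: its squared bias is small relative to the reference estimate, and its variance is lower than that of the reference estimate itself.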
The selection procedure depends on the subjective evaluation of photographic quality. The sensitivity of the selection procedure to the variability in quality, half, and partial scores was evaluated by incorporating this variability in a simulation of the selection process as follows. Conditional on the assigned quality, half, and partial scores, each photograph was randomly reassigned quality, half, and partial scores according to the estimated probability of changing scores. For quality and half, the 1-probability models were used, and for partial, the 2-probability model was used (see Appendix). For each of 2,000 replicates, the series of four nested abundance estimates for each year was recalculated and the subset of photographs associated with the minimum mean squared error was determined. Because the subset of photographs used to evaluate the variability in the evaluation process was heavily weighted toward whole and nonpartial flukes, estimated probabilities of changing half and partial scores were imprecise. Because of this imprecision, this analysis was rerun where only overall quality scores were reassigned, and the results were comparable.
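The reassignment step for overall quality can be sketched as follows, assuming the 1-probability model in which each possible single-level change has the same probability p. Half and partial scores would be reassigned analogously and are omitted here; the function names and the seed are illustrative.

```python
import random

QUALITY_LEVELS = ["1", "2", "3+", "3-"]   # ordered from best to worst

def reassign_quality(score, p, rng):
    """Move a score up or down one level, each with probability p."""
    i = QUALITY_LEVELS.index(score)
    targets = []
    if i > 0:
        targets.append(i - 1)                 # one level better
    if i < len(QUALITY_LEVELS) - 1:
        targets.append(i + 1)                 # one level worse
    u = rng.random()
    for k, target in enumerate(targets):
        if u < (k + 1) * p:                   # each target has width p on [0, 1)
            return QUALITY_LEVELS[target]
    return score                              # remaining probability: no change

def reassign_all(scores, p=0.21, seed=0):
    """Reassign every photograph's score for one simulation replicate."""
    rng = random.Random(seed)
    return [reassign_quality(s, p, rng) for s in scores]

# One replicate for a small, made-up set of scores.
new_scores = reassign_all(["1", "2", "2", "3+", "3-"])
```

Each replicate would then repeat the four nested abundance estimates on the reassigned scores and record which subset gave the minimum mean squared error.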
The selection procedure described above does not address potential bias due to heterogeneity of capture probabilities that can negatively bias estimates with the bias being more severe with smaller sample sizes (Gilbert 1973, Hammond 1986). To determine if the decrease in sample size might increase the possible effects of heterogeneity, we simulated this sample size decrease by randomly removing photographs equal to the number removed during the three photographic quality removal steps. For each of 2,000 replicates, the series of four nested abundance estimates for each year was recalculated.
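A single random-removal step can be sketched as below. The sketch assumes each photograph is represented as a (whale identifier, season) pair and that abundance is estimated with Chapman's modified Petersen estimator; the data layout and function names are hypothetical, and only one removal step is shown rather than the nested series of three.

```python
import random
import statistics

def chapman(n1, n2, m2):
    """Chapman's modified Petersen estimate."""
    return (n1 + 1) * (n2 + 1) / (m2 + 1) - 1

def estimate_from_photos(photos):
    """photos: list of (whale_id, season), season in {'winter', 'summer'}."""
    winter = {w for w, s in photos if s == "winter"}
    summer = {w for w, s in photos if s == "summer"}
    return chapman(len(winter), len(summer), len(winter & summer))

def random_removal_median(photos, n_remove, replicates=2000, seed=0):
    """Median estimate after removing n_remove photographs at random."""
    rng = random.Random(seed)
    keep = len(photos) - n_remove
    estimates = [estimate_from_photos(rng.sample(photos, keep))
                 for _ in range(replicates)]
    return statistics.median(estimates)

# Made-up catalog: 100 whales seen in winter, 100 in summer, 50 in both.
toy = ([(w, "winter") for w in range(100)]
       + [(w, "summer") for w in range(50, 150)])
median_after_removal = random_removal_median(toy, n_remove=50, replicates=200)
```

Comparing the median with the estimate from the full set gives the percent decline attributable to sample size alone, since photographs are removed without regard to quality.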
Finally, the potential effect of individual distinctiveness on population estimates was examined by comparing the capture frequency of whales in each distinctiveness category. If less distinctive whales are not being recognized when recaptured, they will have a lower number of captures than more distinctive whales because they have a higher probability of being misidentified as new captures. This assumes that an individual's distinctiveness level does not affect its probability of being photographed. To eliminate or reduce the effects of a correlation between individual distinctiveness and photographic quality, we limited the whales used in this analysis to those for which we had a quality 1 photograph. However, we used all photographs of these whales that had an acceptable quality level, as determined by the selection procedure, to tabulate capture frequencies. Only one photograph per day of each whale was counted as a capture to eliminate multiple photographs from the same sampling event.
Capture frequency was modeled as a function of distinctiveness and as a null model using Poisson regression. Distinctiveness was modeled as an ordered factor. The two regression models were compared using Akaike's information criterion (AIC) to determine if capture frequency was a function of distinctiveness. The relationship between distinctiveness and capture frequency was also examined using Pearson's chi-squared test of independence with fixed marginals. To eliminate empty cells, capture frequencies greater than or equal to 4 were pooled into a single category. A chi-squared test of independence on the data without pooling produced comparable results.
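The AIC comparison can be sketched without a regression library under one assumption: that the ordered factor is coded with a full set of contrasts, in which case the fitted Poisson means equal the group means and the null model's fitted mean equals the overall mean. The function names and example data are hypothetical.

```python
import math

def poisson_loglik(counts, mean):
    """Poisson log-likelihood of the counts at a common mean."""
    return sum(k * math.log(mean) - mean - math.lgamma(k + 1) for k in counts)

def aic_null_vs_factor(groups):
    """groups: dict mapping distinctiveness level -> list of capture counts.

    Returns (aic_null, aic_factor), with AIC = 2 * parameters - 2 * logL.
    """
    all_counts = [k for ks in groups.values() for k in ks]
    overall_mean = sum(all_counts) / len(all_counts)
    ll_null = poisson_loglik(all_counts, overall_mean)
    ll_factor = sum(poisson_loglik(ks, sum(ks) / len(ks))
                    for ks in groups.values())
    aic_null = 2 * 1 - 2 * ll_null                 # intercept only
    aic_factor = 2 * len(groups) - 2 * ll_factor   # one mean per level
    return aic_null, aic_factor

# Made-up capture frequencies for three distinctiveness levels.
example = {1: [1, 1, 2, 3, 1], 2: [1, 2, 2, 4, 1, 1], 3: [1, 1, 1, 2]}
aic_null, aic_factor = aic_null_vs_factor(example)
delta_aic = abs(aic_factor - aic_null)
```

A small difference in AIC between the two models indicates that adding distinctiveness does little to improve the fit.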
The photographic quality of 5,459 YoNAH fluke photographs representing 2,979 individual whales was evaluated. Although the definition of photographic quality implied equal numbers of photographs in each main category (1, 2, and 3, where quality 3+ and 3− are combined into a single-quality level), only 18% (n= 988) were quality 1, compared to 40% (n= 2,180) and 42% (n= 2,291) for quality 2 and 3, respectively. The catalog was 83% whole flukes (n= 4,555) and 98% nonpartial flukes (n= 5,347). There were 482 right or left flukes that were spread across the overall quality scores with the highest number being scored as quality 3+: 6% quality 1 photographs, 24% quality 2, 42% quality 3+, and 27% quality 3−. There were 112 partial flukes that were also spread across the overall quality scores but with the highest number being scored as quality 3−: 2% quality 1 photographs, 3% quality 2, 27% quality 3+, and 68% quality 3−.
Photographic quality improved significantly between 1992 and 1993 (χ² = 44.07, df = 3, P < 0.001, Table 1). In 1993, 62% (n= 1,672) of the photographs were scored as either quality 1 or 2, compared to only 54% (n= 1,496) in 1992. In contrast, there were no significant differences in the distribution of half (χ² = 5.10, df = 3, P= 0.16) and partial (χ² = 3.18, df = 1, P= 0.07) categories between years (Table 1).
Table 1. Proportion of humpback whale photographs in each photographic quality, half, and partial category in 1992 (n= 2,762) and 1993 (n= 2,697).
A subset of 153 photographs was evaluated for the overall quality, half, and partial variables a second time (Table 2). The overall quality scores for 104 of these photographs were identical for both evaluations, and changes for the remaining 49 were by a single-quality level only. Between the first and second evaluations, 3 photographs changed half scores and 4 changed partial scores.
Table 2. Number of humpback whale photographs in each photographic quality, half, and partial category during the first and second evaluations for the set of 153 photographs.
For overall quality, the maximum likelihood estimate for the 1-probability model where a single probability is estimated for all single-quality level changes was 0.21 (95% CI 0.16–0.25). Comparisons of the 1-probability model to the 6-probability (LRT = 5.01, df = 5, P= 0.414), 3-probability (LRT = 2.71, df = 2, P= 0.258), and 2-probability (LRT = 2.71, df = 1, P= 0.100) models found no significant improvement with the more complex models.
For half, the maximum likelihood estimate for the 1-probability model where a single probability is estimated for all changes was 0.020 (95% CI 0.000–0.042). Comparisons of the 1-probability model to the 3-probability (LRT = 3.33, df = 2, P= 0.189) and 2-probability (LRT = 3.17, df = 1, P= 0.075) models found no significant improvement with the more complex models.
For partial, the maximum likelihood estimate for the 1-probability model where a single probability is estimated for all changes was 0.026 (95% CI 0.001–0.051). Comparison of the 1-probability model to the 2-probability model (LRT = 11.99, df = 1, P < 0.001) found significant improvement with the more complex model. Under the 2-probability model, the probability of changing from a nonpartial to a partial score was 0.013 (95% CI 0.000–0.032) and the probability of changing from partial to nonpartial was 0.667 (95% CI 0.13–1.00).
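The comparison for partial can be illustrated with a minimal sketch of a likelihood ratio test between a pooled binomial model and a model with a separate probability per transition. The counts used below are an assumption: they are inferred from the reported probabilities (2 of 150 nonpartial photographs and 2 of 3 partial photographs changing score) and are not read from Table 2. The function names are illustrative.

```python
import math

def binom_loglik(changed, total, p):
    """Binomial log-likelihood kernel; zero counts contribute nothing."""
    ll = 0.0
    if changed > 0:
        ll += changed * math.log(p)
    if total - changed > 0:
        ll += (total - changed) * math.log(1.0 - p)
    return ll

def lrt_pooled_vs_separate(groups):
    """groups: list of (changed, total). Returns (LRT statistic, df)."""
    pooled_changed = sum(c for c, _ in groups)
    pooled_total = sum(n for _, n in groups)
    p_pooled = pooled_changed / pooled_total
    ll_pooled = sum(binom_loglik(c, n, p_pooled) for c, n in groups)
    ll_separate = sum(binom_loglik(c, n, c / n) for c, n in groups)
    return 2.0 * (ll_separate - ll_pooled), len(groups) - 1

# Assumed counts: nonpartial -> partial 2 of 150; partial -> nonpartial 2 of 3.
stat, df = lrt_pooled_vs_separate([(2, 150), (2, 3)])
```

With these assumed counts the statistic is close to the reported value of 11.99 on 1 degree of freedom, and the pooled estimate is 4/153, or about 0.026.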
The individual animal distinctiveness of the 2,979 whales represented in the YoNAH database was evaluated using the best photograph of each whale. The use of UNKNOWN for the distinctiveness score increased with decreasing photographic quality scores (Table 3), with 85 of 284 of the whales represented by quality 3− photographs being scored as UNKNOWN. However, even with the use of the UNKNOWN score, individual distinctiveness scores were not independent of overall quality scores for the entire data set (χ² = 111.41, df = 6, P < 0.001) or for either of the two subsets formed by omitting quality 3− photographs (χ² = 53.79, df = 4, P < 0.001, n= 2,640) and quality 3− and 3+ photographs (χ² = 10.34, df = 2, P= 0.006, n= 2,035). These results indicate that the judge's evaluation of distinctiveness was influenced by the quality of the photograph for quality 2 and 3+ photographs as well as quality 3− photographs. Distinctiveness scores tended to decrease as photographic quality scores decreased (Table 3).
Table 3. Proportion of humpback whales in each individual animal distinctiveness category for each of the four overall photographic quality categories. For whales represented by more than one photograph, the photograph with the best quality score was used.
Category totals. Overall photographic quality: 1 (n= 814), 2 (n= 1,231), 3+ (n= 650), 3− (n= 284). Individual animal distinctiveness: 1 (n= 415), 2 (n= 2,100), 3 (n= 324), U (n= 140).
Effects on Abundance Estimates
For 1992 and 1993, the estimated abundance decreases as the data are limited to photographs with increasingly higher-quality scores (Table 4). In both years, the mean squared error was at a minimum for estimates using only quality 1, 2, and 3+ photographs, excluding images of right fluke, left fluke, and partial flukes. However, these results assume that the bias is accurately estimated using the abundance estimate from quality 1 and 2 photographs as an estimate of the true population abundance, Ntrue.
Table 4. Abundance estimates (N*) and mean squared error (mse) results for subsets of the 1992 and 1993 YoNAH data formed by sequentially omitting photographs scored as poorer quality. Also given are the number of whales captured during each sampling period (n1 for the winter sampling period and n2 for the summer sampling period), the number captured in both sampling periods (m2), the estimated standard error (SE), the estimated bias, and the reassignment proportions (prop). Estimated bias was calculated as (N* − Ntrue), where Ntrue was assumed to be N* for the quality 1, 2 subset of photographs. By definition, the bias for the “quality 1, 2” N* equals zero. The reassignment proportions were the proportions of 2,000 simulations giving a minimum mean squared error for each quality removal step in the selection procedure when quality scores are reassigned.
a Minimum mean squared error for each series.
Subsets listed for each year: No halves or partials; Quality 1, 2, 3+; Quality 1, 2.
The variability in the photographic quality evaluation process had a smaller effect on the selection procedure for 1992 than for 1993 (Table 4). For 1992, the subset of data that included overall quality 1, 2, and 3+ had the minimum mean squared error in 44% (n= 873) of the simulations compared to 22% (n= 438) for the subset including all photographs except right fluke, left fluke, and partial flukes, and 25% (n= 501) for the subset of data including overall quality 1 and 2. However, for the 1993 data, 44% (n= 884) of the simulations resulted in a minimum mean squared error for the subset of data including overall quality 1, 2, and 3+, agreeing with the original analysis, but 45% (n= 911) had a minimum mean squared error for the subset of data including overall quality 1 and 2.
For the simulation using random removals of photographs, the median abundance estimates from 2,000 replicates showed a consistent decreasing pattern at each of the three removal steps (Table 5). However, the estimate with the smallest sample size using only overall quality 1 and 2 photographs was only 2% less than that using all the data.
Table 5. The mean, standard deviation (SD), and median of 2,000 replicate abundance estimates and the percent decline of the median from random removals of photographs without regard to photographic quality. The percent decline is calculated as the percentage difference between the median estimate and the estimate using all the photographs from Table 4. Subsets of data were created by randomly removing photographs equal to the number removed during the three photographic quality removal steps. Thus the subset titles match the selection process titles in Table 4 but do not reflect the quality of the photographs being analyzed.
“No halves or partials”
“Quality 1, 2, 3+”
“Quality 1, 2”
“No halves or partials”
“Quality 1, 2, 3+”
“Quality 1, 2”
There were 795 whales that had a quality 1 photograph in the YoNAH catalog, and they were represented by 1,435 quality 1, 2, and 3+ photographs, excluding right fluke, left fluke, and partial flukes. There was a higher percentage of single captures for whales scored as distinctiveness 3 (67%, n= 38) than for those scored as distinctiveness 1 (58%, n= 91) or 2 (54%, n= 316) (Table 6). Poisson regression was inconclusive regarding whether capture frequency was a function of distinctiveness (ΔAIC = 0.9, Table 7) because the two models were not significantly different from each other (Burnham and Anderson 1998). There was no relationship between the capture frequency of whales and their distinctiveness scores as measured by Pearson's chi-squared test of independence (χ² = 6.79, df = 6, P= 0.34). Similar results were obtained when capture frequencies greater than or equal to 4 were not pooled.
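The chi-squared test of independence can be illustrated with a short sketch. In the table below, the row totals and the single-capture column follow the counts given in the text; every other cell is hypothetical, so the statistic computed from the table is illustrative only. The closed-form tail probability used here is valid only for even degrees of freedom, and the function names are assumptions.

```python
import math

def chi2_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / total
            stat += (obs - exp) ** 2 / exp
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-squared variate; closed form for even df only."""
    assert df % 2 == 0 and df > 0
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

# Rows: distinctiveness 1, 2, 3; columns: capture frequency 1, 2, 3, >=4.
# Only the first column and the row totals come from the text; the rest is hypothetical.
table = [
    [91, 40, 16, 10],
    [316, 160, 65, 40],
    [38, 12, 5, 2],
]
stat, df = chi2_statistic(table)
print(df, round(chi2_sf_even_df(stat, df), 3))
print(round(chi2_sf_even_df(6.79, 6), 2))  # tail probability for the reported statistic
```

Three distinctiveness categories and four pooled capture-frequency classes give (3 − 1)(4 − 1) = 6 degrees of freedom, and a statistic of 6.79 on 6 degrees of freedom corresponds to a tail probability of about 0.34.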
Table 6. Proportion of humpback whales for each capture frequency for each distinctiveness category. Only whales represented by a quality 1 photograph in the YoNAH database were included, and 1,435 photographs of quality 1, 2, and 3+ of these 795 whales were used to tabulate capture frequency. Multiple photographs taken on the same day in the same region were counted as a single capture.
1 (n= 157)
2 (n= 581)
3 (n= 57)
Table 7. Poisson regression models of capture frequency as a function of distinctiveness. Only whales represented by a quality 1 photograph in the YoNAH database and photographs of quality 1, 2, and 3+ were included in this analysis. Multiple photographs taken on the same day in the same region were counted as a single capture.
Capture frequency ∼ Distinctiveness
Capture frequency ∼ .
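The model comparison in Table 7 can be sketched as below. This is an illustration under stated assumptions: the second model is taken to be the intercept-only model, a plain (not zero-truncated) Poisson likelihood is used even though capture frequencies are at least 1, and the capture frequencies are hypothetical. With a single categorical predictor the Poisson maximum likelihood estimates are simply the group means, so no iterative fitting is needed.

```python
import math

def poisson_loglik(counts, mean):
    """Log-likelihood of independent Poisson counts sharing one mean."""
    return sum(c * math.log(mean) - mean - math.lgamma(c + 1) for c in counts)

def aic_by_group(groups):
    """AIC of the group-means model and of the intercept-only model."""
    ll_full = sum(poisson_loglik(g, sum(g) / len(g)) for g in groups)
    pooled = [c for g in groups for c in g]
    ll_null = poisson_loglik(pooled, sum(pooled) / len(pooled))
    aic_full = 2 * len(groups) - 2 * ll_full  # one mean per group
    aic_null = 2 * 1 - 2 * ll_null            # a single common mean
    return aic_full, aic_null

# Hypothetical capture frequencies for distinctiveness categories 1, 2, 3
groups = [[1, 1, 2, 3, 1, 2], [1, 2, 2, 1, 4, 1, 3], [1, 1, 1, 2]]
aic_full, aic_null = aic_by_group(groups)
print(f"delta AIC = {abs(aic_full - aic_null):.2f}")
```

A small difference in AIC between the two models indicates that the data do not clearly favor either one.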
We evaluated the effects of photographic quality and of individual animal distinctiveness on capture-recapture estimates of abundance from the YoNAH photographic identification data. We evaluated photographic quality and individual distinctiveness, measured the variability in the evaluation of quality, and tested for relationships between quality scores, distinctiveness scores, abundance estimates, and capture frequencies. We also attempted to select a subset of photographs that minimized the effects of photographic quality and individual distinctiveness.
For photographic quality, we focused on an evaluation of three variables: overall quality, whether the image was of right, left, whole, or upper flukes (half), and whether the image contained less than 20% of the flukes (partial). Other quality variables, which contribute to overall quality, are the contrast and clarity of the image and the angle of the flukes in the image. We did not explore how contrast, clarity, angle, partial, and half combine to influence scores of overall quality as this was explored in Friday et al. (2000). We also did not use an overall quality score predicted from these specific variables following the recommendation in Friday et al. (2000).
The proportion of photographs scored as quality 1 (0.18, n= 988) in the YoNAH database was less than those scored as quality 2 or 3 (0.40, n= 2,180 and 0.42, n= 2,291, respectively), when quality 3+ and 3− are considered as a single category. Although the photographs were not evaluated in chronological order, the quality scores for photographs from 1993 were higher than from 1992.
No overall quality scores changed by more than a single quality level. The probability of quality scores changing by a single level (0.21) was independent of photographic quality. This independence is surprising given the point estimates. The lack of significance between the 1-probability model and other more complex models may be due to insufficient sample size. In addition, in a larger test set changes by two and three quality levels might have occurred.
The probability of changing half scores was low (0.02) and was independent of the original score. The probability of changing partial scores was dependent on the original score. The probability of changing from a nonpartial fluke to a partial fluke was low (0.013) but the probability of the reverse transition was high (0.667). However, there were only three partial flukes in the subset of photographs that were reevaluated, so the estimate of the latter probability is imprecise.
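The imprecision of an estimate based on three photographs can be made concrete with a short calculation. The sketch assumes that 2 of the 3 partial flukes changed score, which reproduces the reported 0.667; the Wilson score interval is a standard choice for small samples and is not taken from the original analysis.

```python
import math

def proportion_se(x, n):
    """MLE of a binomial probability and its approximate standard error."""
    p = x / n
    return p, math.sqrt(p * (1 - p) / n)

def wilson_interval(x, n, z=1.96):
    """Wilson score interval, which behaves better than p +/- z*SE for small n."""
    p = x / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Assumption: 2 of the 3 partial flukes changed score, which reproduces 0.667
p, se = proportion_se(2, 3)
lo, hi = wilson_interval(2, 3)
print(f"p = {p:.3f}, SE = {se:.3f}, 95% interval = ({lo:.2f}, {hi:.2f})")
```

With a sample of three, the interval spans most of the range from 0 to 1, so the point estimate carries little information.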
Individual distinctiveness was evaluated separately from photographic quality in an attempt to eliminate confusion between the two variables. However, evaluating individual distinctiveness without being influenced by photographic quality was more difficult than anticipated, as demonstrated by a lack of independence between the individual distinctiveness scores and the photographic quality scores. It is also possible that the evaluation of quality was affected by the distinctiveness of the whales. However, the experienced technicians overseeing the evaluation process thought that it was more likely that quality was affecting distinctiveness.
The use of poor-quality photographs from the YoNAH data resulted in positively biased abundance estimates, assuming that bias is accurately estimated from the quality 1 and 2 photographs. Poor-quality photographs have a higher probability of false negatives. This error in matching increases the apparent number of whales sampled during each sampling period and decreases the number sampled in both sampling periods, biasing abundance estimates upward. The selection procedure balances the positive bias from poor-quality photographs and the reduced precision resulting from smaller sample sizes. The selection procedure indicated that using only quality 1, 2, and 3+ photographs, excluding right fluke, left fluke, and partial flukes, provided the best tradeoff. The effect of using a subset of data that only included overall quality 1, 2, and 3+ photographs was to reduce the estimates by 21% for 1992 and 18% for 1993 and to increase their coefficients of variation by 6% and 7%, respectively (Table 4). For this analysis, we weighted variance and bias equally. This weighting could be changed depending on the requirements of the analysis; for example, if an unbiased point estimate is more important than a precise estimate, bias can be weighted more heavily.
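The selection procedure can be sketched as a weighted mean squared error comparison. This is a minimal illustration, not the authors' implementation: Chapman's modified Petersen estimator and its variance are an assumed choice, the capture counts are hypothetical and are not the values in Table 4, and the names `select_subset`, `w_var`, and `w_bias` are introduced here for illustration.

```python
def chapman(n1, n2, m2):
    """Chapman's modified Petersen estimate and its estimated variance (assumed estimator)."""
    n_hat = (n1 + 1) * (n2 + 1) / (m2 + 1) - 1
    var = ((n1 + 1) * (n2 + 1) * (n1 - m2) * (n2 - m2)
           / ((m2 + 1) ** 2 * (m2 + 2)))
    return n_hat, var

def select_subset(subsets, reference, w_var=1.0, w_bias=1.0):
    """Return the subset name with the smallest weighted mean squared error.

    subsets: dict name -> (n1, n2, m2)
    reference: name of the subset whose estimate stands in for N_true
    """
    n_true, _ = chapman(*subsets[reference])
    mse = {}
    for name, counts in subsets.items():
        n_hat, var = chapman(*counts)
        mse[name] = w_var * var + w_bias * (n_hat - n_true) ** 2
    return min(mse, key=mse.get), mse

# Hypothetical capture counts, NOT the values in Table 4
subsets = {
    "All photographs": (900, 1000, 60),
    "No halves or partials": (850, 950, 62),
    "Quality 1, 2, 3+": (700, 800, 58),
    "Quality 1, 2": (450, 500, 25),
}
best, mse = select_subset(subsets, reference="Quality 1, 2")
print(best)
```

Setting `w_bias` above `w_var` corresponds to weighting bias more heavily, as would be appropriate when an unbiased point estimate matters more than a precise one.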
The effect of the evaluation variability on the selection procedure was smaller for 1992 than 1993. For 1993, more of the simulations that incorporated this variability selected the subset including only overall quality 1 and 2 photographs than the subset including overall quality 1, 2, and 3+. This difference in effect may be due to the difference in the distributions of quality levels between 1992 and 1993 (Table 1). There are fewer quality 3+ photographs in 1993, both in terms of numbers and proportions, than in 1992. Thus the reduction in sample size and the increase in variance are less in 1993 than in 1992.
The selection procedure attempts to balance bias due to photographic quality and precision due to decreased sample size. It does not address bias due to heterogeneity of capture probabilities, which can produce negatively biased estimates, with the bias being more severe at smaller sample sizes (Gilbert 1973, Hammond 1986). If the abundance estimates are negatively biased, this bias will be greatest in the estimate using only photographs scored as quality 1 and 2 because of the reduced sample size and the higher probability of frequently captured whales being represented by a high-quality photograph in the data. If an increasing negative bias occurs with decreasing sample size, the mean squared error could be artificially inflated for estimates with greater sample sizes and deflated for estimates with smaller sample sizes, which could result in a greater tendency to choose a smaller, negatively biased estimate.
The median abundance estimates from the simulation of the removal procedure did show a consistent decreasing pattern at each of the three removal steps (Table 5). However, the estimate with the smallest sample size was only 2% less than that using all the data. Therefore, if the estimates are negatively biased from heterogeneity, any increase in that bias due to the decrease in sample size is negligible compared to the decrease in the bias due to photographic quality.
It has generally been assumed that North Atlantic humpback whales possess sufficient variation in the pattern and trailing edge of their flukes to ensure that all animals are individually identifiable given a photograph of sufficient quality. However, this assumption has not been previously tested. The results presented here did not find differences in capture frequencies for whales with different individual distinctiveness scores. This result supports the assumption of equal probability of being reidentified with regard to distinctiveness categories. However, whales scored as distinctiveness 3 had a higher percentage of single captures than whales in the other categories, allowing for the possibility that a portion of the least distinctive whales had a lower probability of recognition. If poorly marked but recognizable whales and unrecognizable whales have been lumped into the distinctiveness 3 category, evaluating distinctiveness on a finer scale, which further divides this category, might provide a stronger test of the assumption. In addition, the power of this test might be low because only 31% (n= 795) of the 2,528 whales identified from overall quality 1, 2, and 3+ photographs in the YoNAH catalog were used in this analysis and only 7% (n= 57) of these were scored as nondistinctive. Because the assumption is that whales that are scored as distinctiveness 3 will be misidentified as new captures and thus have lower recapture frequencies, the low number of whales in the distinctiveness 3 category in the test may reduce its power.
A second marking technique, such as genotypic markers (Palsbøll et al. 1997), can be used to identify any individuals that are not recognized on photographic recapture. The YoNAH project genetically and photographically marked 1,410 individuals with 414 double tagged more than once. Stevick et al. (2001) found no false-positive and 14 false-negative errors in YoNAH's photographic identification data. The error rate was a function of photographic quality, increasing with decreasing quality. No errors occurred in quality 1 photographs, but 5 of the 14 errors were because of photographs that were scored as right fluke, left fluke, or partial flukes, resulting in a 0.125 error rate for this quality category. Of the remaining 9 errors, 2 occurred with whales that were scored as distinctiveness 3 and 7 with whales scored as distinctiveness 2. Stevick et al. (2003) applied the mean squared error method presented here and Stevick et al.'s (2001) modified Petersen estimator, which corrects for false negatives, to the YoNAH data and found that excluding photographs scored as right and left flukes was still necessary to balance bias and precision. However, Larsen and Hammond (2004) applied these methods to humpback photographic identification data from West Greenland and found that the mean squared error technique recommended using photographs of all quality levels with Stevick et al.'s (2001) modified Petersen estimator.
Accurate estimates of abundance and survival rates are critical for the management of any population. For some populations, the most accurate and feasible method currently available for estimating abundance and other population parameters is from capture-recapture techniques using photographic data of individual markings. The use of photographic data introduces the issue of what minimum level of photographic quality is needed for accurate identification in capture-recapture analyses. The use of natural markings introduces the issue of whether all individuals in the population are sufficiently marked to be equally identifiable given a photograph of sufficient quality. Both these issues imply that acceptable data for each analysis need to be identified from the field data to produce estimates useful for management.
Selecting the best data to use for capture-recapture estimates of abundance is a complex issue, particularly if one attempts to minimize the potential effects of errors in matching photographs. The analysis presented here, which systematically examines the effects of photographic quality and individual animal distinctiveness, is in contrast to the intuitive selection of photographs made in many other studies (Whitehead 1982, Baker et al. 1992, Slooten et al. 1992, Childerhouse et al. 1995, Cerchio 1998, Wilson et al. 1999). Surprisingly, however, the YoNAH data selected intuitively by Smith et al. (1999) are the same as the data selected here and, therefore, the estimates of abundance are the same.
Forcada and Aguilar (2000) developed a similar selection procedure for Mediterranean monk seals that used the minimum residual sum of squares to select the best balance between bias due to poor-quality photographs and precision due to sample size. Their selection procedure indicated using excellent and good-quality photographs and removing poor-quality ones. Read et al. (2003) used the mean squared error selection method presented here for bottlenose dolphins (Tursiops truncatus) but modified it to select acceptable quality photographs and acceptable marking levels. They found that the recommended data were limited to individuals with the most and intermediate distinctiveness scores and excellent to good photographs.
The complexity of the potential effects involved implies that a structured and repeatable method, such as the analysis presented here, is warranted; researchers should not rely on intuition alone. In addition, the results of the selection process are likely to change from collection to collection and even among subsets of data from the same collection as details of the photographic collection vary, such as the sample size, the evaluation schemes, the distribution of photographic quality, and the distinctiveness of animals in the population. Although the techniques presented here can be applied to any species, the results are specific to the YoNAH photographic data. For these data, factors that influence the decision of which data set to use include the sample size, the distribution of photographic quality categories, and the variability in the evaluation of photographic quality. The decision of which data to include will be even more complex for populations where the ability to correctly reidentify individuals is reduced because not all individuals are distinctively marked, for example, bowhead whales (Balaena mysticetus) (Rugh et al. 1998, da Silva et al. 2000, Zeh et al. 2002), bottlenose dolphins (Wilson et al. 1999, Read et al. 2003), and Mediterranean monk seals (Forcada and Aguilar 2000).
We thank the international collaborative project Years of the North Atlantic Humpback (YoNAH) for supplying the photographs used in these analyses. We thank Phil Hammond, Colleen Kelly, and Mark Bravington for their guidance with the data and the statistical analysis. We thank Rosie Seton for her help in creating the final, high-resolution figures. The thoughtful reviews of Richard Merrick, Phil Clapham, Sal Cerchio, Per Palsbøll, and two anonymous reviewers are greatly appreciated. The National Oceanic and Atmospheric Administration funded this work through a Cooperative Marine Education and Research Program grant to the University of Rhode Island. Part of this work was conducted while Friday held a National Research Council Research Associateship at the Northeast Fisheries Science Center. The evaluation of the YoNAH catalog for photographic quality and for individual distinctiveness was funded by the National Oceanic and Atmospheric Administration through a National Marine Fisheries Service contract to the College of the Atlantic. The views expressed herein are those of the authors and do not reflect the views of NOAA, NRC, or any of its subagencies.
Maximum likelihood estimates of the probability of changing quality scores using binomial and trinomial models
Overall Quality Variable
The general model structure assumed a binomial distribution with one probability of changing quality scores by one level for original quality levels of 1 and 3− and a trinomial distribution with two probabilities of changing quality scores by one level up or down for original quality levels of 2 or 3+. Changes in quality scores by more than a single level were not modeled, as they did not occur in our data. The data were the number in each quality category from first evaluation, $s_1$, $s_2$, $s_{3+}$, $s_{3-}$; the number of photographs within a category $i$ that changed to category $j$, $x_{i,j}$ (where $j = i \pm 1$); and the number that remained in category $i$, $x_{i,j}$ (where $i = j$). The probabilities of changes, $p_{i,j}$, where $i$ indicates the original quality score, $j$ indicates the new quality score, and $j = i \pm 1$, were modeled as:
Quality 1 score change to quality 2: $P(x_{1,2};\ s_1, p_{1,2})$
Quality 2 score change to quality 1 or 3+: $P(x_{2,j};\ s_2, p_{2,1}, p_{2,3+})$, where $j = 1$ or $3+$
Quality 3+ score change to quality 2 or 3−: $P(x_{3+,j};\ s_{3+}, p_{3+,2}, p_{3+,3-})$, where $j = 2$ or $3-$
Quality 3− score change to quality 3+: $P(x_{3-,3+};\ s_{3-}, p_{3-,3+})$.
The likelihood function was:
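Under the binomial and trinomial assumptions stated above, a likelihood of the following product form would be consistent with the model. It is a reconstruction from the definitions, with $x_{i,i}$ denoting the number of photographs that remained in category $i$, and should be read as a sketch rather than as the authors' typeset expression:

```latex
\begin{aligned}
L ={}& \binom{s_1}{x_{1,2}}\, p_{1,2}^{x_{1,2}}\, (1 - p_{1,2})^{x_{1,1}} \\
 &\times \frac{s_2!}{x_{2,1}!\, x_{2,2}!\, x_{2,3+}!}\,
   p_{2,1}^{x_{2,1}}\, p_{2,3+}^{x_{2,3+}}\, (1 - p_{2,1} - p_{2,3+})^{x_{2,2}} \\
 &\times \frac{s_{3+}!}{x_{3+,2}!\, x_{3+,3+}!\, x_{3+,3-}!}\,
   p_{3+,2}^{x_{3+,2}}\, p_{3+,3-}^{x_{3+,3-}}\, (1 - p_{3+,2} - p_{3+,3-})^{x_{3+,3+}} \\
 &\times \binom{s_{3-}}{x_{3-,3+}}\, p_{3-,3+}^{x_{3-,3+}}\, (1 - p_{3-,3+})^{x_{3-,3-}}
\end{aligned}
```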
To maximize this likelihood, the values of pi,j that minimize the negative of log(L) were determined numerically using the nlminb function in S-PLUS version 2000 (MathSoft 1999).
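The numerical minimization can be illustrated for the simplest case, the 1-probability model, in which every single-level change has the same probability. The sketch below is not the S-PLUS code used in the analysis: it substitutes a golden-section search for nlminb, drops constant terms from the likelihood, and uses hypothetical re-evaluation counts.

```python
import math

def neg_log_lik(p, data):
    """Negative log-likelihood of the 1-probability model, constant terms dropped.

    data: list of (n_changed, n_unchanged, n_directions); n_directions is 1 for
    the end categories (binomial) and 2 for the middle categories (trinomial).
    """
    total = 0.0
    for changed, unchanged, directions in data:
        total -= changed * math.log(p) + unchanged * math.log(1 - directions * p)
    return total

def golden_section(f, lo, hi, tol=1e-9):
    """Minimise a unimodal function on [lo, hi] by golden-section search."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return (a + b) / 2

# Hypothetical re-evaluation counts for quality 1, 2, 3+, 3-
data = [(4, 16, 1), (9, 31, 2), (8, 27, 2), (5, 20, 1)]
p_hat = golden_section(lambda p: neg_log_lik(p, data), 1e-6, 0.5 - 1e-6)
print(round(p_hat, 3))
```

The search interval stops short of 0.5 because, under this parameterization, a middle category can change in two directions and so requires 2p < 1.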
The four nested models each used different assumptions about the relationships between the individual pi,j's. These assumptions are defined as:
1-probability model: $p_1 = p_{i,j}$ for all $j = i \pm 1$
6-probability model: $p_1 = p_{1,2}$, $p_2 = p_{2,1}$
3-probability model: $p_1 = p_{1,2} = p_{2,1}$
2-probability model: $p_1 = p_{1,2} = p_{2,1} = p_{2,3+} = p_{3+,2}$
For the half variable, the general model structure assumed a binomial distribution for all changes. Changes from whole flukes to a right or left fluke were considered equivalent. Changes between left and right flukes were not modeled, as they did not occur in our data. The data were the number in each half category from first evaluation, $s_W$, $s_R$, $s_L$; the number of photographs within a category $i$ that changed to category $j$, $x_{i,j}$ (defined as $x_{W,RL}$, $x_{R,W}$, $x_{L,W}$); and the number that remained in category $i$, $x_{i,j}$ (where $i = j$). The probabilities of changes, $p_{i,j}$, were modeled as:
Whole score change to right or left: $P(x_{W,RL};\ s_W, p_{W,RL})$
Right score change to whole: $P(x_{R,W};\ s_R, p_{R,W})$
Left score change to whole: $P(x_{L,W};\ s_L, p_{L,W})$.
The likelihood function was:
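Given that all changes in the half variable were modeled as binomial, a likelihood of the following form would be consistent with the definitions above. It is a reconstruction under that assumption, not the authors' typeset expression:

```latex
L = \binom{s_W}{x_{W,RL}}\, p_{W,RL}^{x_{W,RL}}\, (1 - p_{W,RL})^{s_W - x_{W,RL}}
    \times \binom{s_R}{x_{R,W}}\, p_{R,W}^{x_{R,W}}\, (1 - p_{R,W})^{s_R - x_{R,W}}
    \times \binom{s_L}{x_{L,W}}\, p_{L,W}^{x_{L,W}}\, (1 - p_{L,W})^{s_L - x_{L,W}}
```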
The likelihood was maximized as described above. The three nested models explored were:
1-probability model: $p_1 = p_{W,RL} = p_{R,W} = p_{L,W}$
3-probability model: $p_1 = p_{W,RL}$
2-probability model: $p_1 = p_{W,RL}$
For the partial variable, the general model structure assumed a binomial distribution for all changes. In the notation below, 0 indicates a nonpartial score and 1 indicates a partial score. The data were the number in each partial category from first evaluation, $s_0$, $s_1$; the number of photographs within a category $i$ that changed to category $j$, $x_{i,j}$ (defined as $x_{0,1}$, $x_{1,0}$); and the number that remained in category $i$, $x_{i,j}$ (where $i = j$). The probabilities of changes, $p_{i,j}$, were modeled as:
Nonpartial change to partial: $P(x_{0,1};\ s_0, p_{0,1})$
Partial change to nonpartial: $P(x_{1,0};\ s_1, p_{1,0})$.
The likelihood function was:
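With both transitions of the partial variable modeled as binomial, a likelihood of the following form would be consistent with the definitions above. As before, this is a reconstruction under the stated assumption rather than the authors' typeset expression:

```latex
L = \binom{s_0}{x_{0,1}}\, p_{0,1}^{x_{0,1}}\, (1 - p_{0,1})^{s_0 - x_{0,1}}
    \times \binom{s_1}{x_{1,0}}\, p_{1,0}^{x_{1,0}}\, (1 - p_{1,0})^{s_1 - x_{1,0}}
```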
The likelihood was maximized as described above. The two nested models explored were: