Image preprocessing and enhancement
Preprocessing consists of two steps. First, the in-plane orientation of the manta ray within the image is normalized: the user performs two mouse clicks to identify the orientation of the medial axis line. Second, a region of interest (ROI) encompassing the characteristic markings on the ventral surface of the ray (Marshall et al. 2011) is selected, which requires a further two mouse clicks on the opposite corners of a rectangle. Preprocessing is currently done manually, although an automated approach using image segmentation is under development. Figure 1 shows an orientation-normalized manta ray image with the rectangular spot pattern region selected.
Figure 1. Image preprocessing: The manta ray orientation is normalized, and a candidate region encompassing the spot pattern is selected.
An automated image enhancement step is then applied to the ROI image. Underwater images exhibit enormous variation in lighting depending upon factors such as depth, clarity of the water, use of flash (which often leads to backscatter caused by suspended particles), and the relative position of the sun (manta rays are frequently photographed swimming overhead with the sun behind them, causing glare and a “corona effect”). As noted in Schettini and Corchs (2010), this leads to limited visibility, low contrast, nonuniform lighting, blurring, lack of coloration, and various noise artifacts.
Given the heterogeneity of photographic conditions and equipment used in acquiring manta ray photo ID images, and the lack of usable calibration information, automated image restoration or illumination correction methods such as those proposed in Schettini and Corchs (2010) are not applicable. We therefore investigated a large number of generic and robust techniques for automated image enhancement that would be universally applicable to deal with the most commonly encountered image degradations without negatively impacting images that were already of a high quality.
Best overall results were achieved by a combination of median filtering and histogram equalization (Gonzalez and Woods 2008). Images are first converted into grayscale and then size normalized such that their maximal dimension does not exceed 800 pixels. We then calculate the standard deviation to assess noise levels and apply a 3 × 3 (for less noisy images) or 5 × 5 (for images with high noise levels) median filter to reduce noise. To improve fidelity and enhance the contrast of characteristic spot patterns, we then apply contrast-limited adaptive histogram equalization (CLAHE). The CLAHE algorithm performs local rather than global contrast adjustment, which is especially important when different parts of a manta ray in a given image exhibit widely different illumination levels, for example, due to rapid attenuation of flash lighting causing significant differences in white balance between areas of the ray that are proximate to the camera and those that are further away.
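The size normalization and noise-dependent median filtering described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the text does not state which standard deviation is measured or where the cut-off between "less noisy" and "noisy" lies, so the use of the pixel-intensity standard deviation and the value of `NOISE_THRESHOLD` are assumptions, as are all function names. CLAHE is omitted here because it is normally taken from an image processing library.

```python
from statistics import median, pstdev

MAX_DIM = 800           # maximal image dimension stated in the text
NOISE_THRESHOLD = 40.0  # hypothetical cut-off; the text gives no value


def target_size(width, height, max_dim=MAX_DIM):
    """Return (width, height) scaled so the larger side does not exceed max_dim."""
    largest = max(width, height)
    if largest <= max_dim:
        return width, height
    scale = max_dim / largest
    return max(1, round(width * scale)), max(1, round(height * scale))


def choose_kernel_size(gray, threshold=NOISE_THRESHOLD):
    """Pick a 3x3 or 5x5 median kernel from the intensity standard deviation.

    Assumption: noise is estimated from the standard deviation of all
    pixel intensities in the grayscale image.
    """
    pixels = [p for row in gray for p in row]
    return 5 if pstdev(pixels) > threshold else 3


def median_filter(gray, ksize):
    """Apply a ksize x ksize median filter, clamping coordinates at the borders."""
    h, w = len(gray), len(gray[0])
    r = ksize // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                gray[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in range(-r, r + 1)
                for dx in range(-r, r + 1)
            ]
            row.append(median(window))
        out.append(row)
    return out
```

A grayscale image is represented here as a list of rows of intensities; calling `median_filter(gray, choose_kernel_size(gray))` applies the smaller kernel to low-variance images and the larger one otherwise.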
An example of the effects of image enhancement is shown in Figure 2.
Figure 2. Top row, left to right: Original image; grayscale candidate region; enhanced image after noise filtering and contrast adjustment. Bottom row: Visualization of features extracted using SIFT (left), SURF (middle), and ORB (right).
Feature extraction and representation
In order to encode the characteristic information contained within the natural body markings of manta rays, we make use of the Scale-Invariant Feature Transform (SIFT) (Lowe 2004). We also implemented and evaluated two other recent feature extraction algorithms, namely SURF (Bay et al. 2006) and ORB (Rublee et al. 2011). Our implementation makes use of the OpenCV computer vision library (Bradski and Kaehler 2008), and we have adapted SIFT code originally written by Hess (2010).
All three algorithms detect distinctive features at keypoints in the image, and then represent those features in terms of a parametric description of the local image variation in the vicinity of the keypoints at a carefully chosen scale of analysis. The algorithms were chosen due to their ability to extract and match features in a way which is robust to changes in size and 2D rotation, and also resilient to changes in 3D viewpoint, addition of noise, and change in illumination. As we achieved best results using SIFT, and as we developed various novel improvements to the original SIFT algorithm to enhance matching performance on manta ray images, we will briefly describe some key aspects of our SIFT implementation.
To find stable features that are invariant to size, SIFT detects features using a scale-space approach. This is achieved by convolving the image with Gaussian filters G at different scales of analysis σ and differencing the resulting blurred images at neighboring scales to find local minima and maxima. Formally, the scale space L of an image is created by convolving the input image I with Gaussian filters G at different scales σ:
$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$$
Neighboring scales (σ and kσ for some constant k) are then subtracted from each other to produce the Difference-of-Gaussian images D:
$$D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma)$$
Only scale-space extrema of D(x, y, σ) that have strong contrast are chosen as keypoints. We also reject keypoints that are closely spaced along an edge as these are unstable and not useful for identification.
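The scale-space construction described above can be sketched in a few lines of pure Python. This is an illustrative sketch rather than the adapted Hess code: the truncation of the Gaussian at about 3σ, the border clamping, and the default k = √2 are assumptions, since the text only refers to "some constant k".

```python
import math


def gaussian_kernel(sigma):
    """Return a normalized 1D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def convolve_rows(image, kernel):
    """Convolve every row with a 1D kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [
        [
            sum(k * row[min(max(x + i - radius, 0), width - 1)]
                for i, k in enumerate(kernel))
            for x in range(width)
        ]
        for row in image
    ]


def transpose(image):
    """Swap rows and columns of a 2D list."""
    return [list(col) for col in zip(*image)]


def gaussian_blur(image, sigma):
    """L(x, y, sigma): separable Gaussian blur (rows first, then columns)."""
    kernel = gaussian_kernel(sigma)
    blurred_rows = convolve_rows(image, kernel)
    return transpose(convolve_rows(transpose(blurred_rows), kernel))


def difference_of_gaussians(image, sigma, k=math.sqrt(2)):
    """D(x, y, sigma) = L(x, y, k * sigma) - L(x, y, sigma)."""
    fine = gaussian_blur(image, sigma)
    coarse = gaussian_blur(image, k * sigma)
    return [[c - f for c, f in zip(row_c, row_f)]
            for row_c, row_f in zip(coarse, fine)]
```

On a uniform image D is zero everywhere, whereas an isolated bright spot produces a strong negative response at its center, which is the kind of extremum that becomes a keypoint candidate.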
In order to achieve invariance to 2D orientation, a keypoint descriptor based on local gradient directions and magnitudes is used. The descriptor is invariant to image rotations as the bins of the orientation histograms are normalized relative to the dominant gradient direction in the vicinity of the keypoint. The scale of analysis, and hence the size of the local region whose features are being represented, corresponds to the scale at which the given keypoint was found to be a stable extremum (subject to constraints on local contrast and contour membership).
In terms of the characteristic patterns present on the ventral surface of manta rays, SIFT keypoints are typically localized at significant spots and other markings. Information on the shape, contrast, and dominant orientation of markings is represented by the feature descriptors.
Figure 2 illustrates image enhancement and feature extraction using an example manta ray image. SIFT features are marked by arrows whose length and direction illustrate the scale and dominant orientation of the given keypoint feature (note that some keypoint features may be localized at or near the same pixel coordinates in the image, but will differ in their scale or orientation). SURF and ORB features are indicated by small circles.
Pattern matching for automated identification in ecological databases
In order to identify a manta ray automatically from an image, it needs to be matched against all images in a database. If other images of the given individual ray are already present in the database, then the software should rank that individual highly in the list of search results. If the image represents an as-yet-unidentified individual, then the matching algorithm should return very low matching scores and indicate a low confidence of having achieved a successful match.
Matching therefore requires the software to efficiently compute all possible pairwise matches between the features representing the query image and each image in the database. Configurations of SIFT keypoints from different images can be compared via a distance metric to find correspondences between instances of objects in different poses. Most approaches to matching of SIFT features are designed for tasks such as image stitching or detection of man-made objects. In these cases it is usually straightforward to identify subsets of features representing similar or identical image structures. To achieve partial pose invariance, a candidate match can be confirmed or rejected by establishing a projective mapping of one (sub)image to another via homographies.
However, our experiments showed that this approach does not yield usable results in the case of manta ray images. The diffuse nature of their natural body markings, especially when photographed underwater, typically results in relatively low numbers of matching features, and the much greater variation in appearance caused by the factors mentioned in previous sections makes it very difficult to automatically recover the 3D pose of the ray.
We therefore developed a novel image-to-image matching method which is more akin to methods for computing similarity of visual textures as opposed to rigid transformations of geometrical or strongly patterned objects. The overall similarity score between the unknown query image I and another image J is computed as
where Fi and Fj represent the sets of SIFT features for I and J, respectively. In Lowe's original algorithm (henceforth referred to as “classic SIFT”), score(I, J) is simply computed as the number of features that are deemed to match divided by the larger of the number of features in the two images. Our MM algorithm refines this by weighting each matched pair of features based on their significance and the strength of the match, as will be explained below. The final score is normalized by its maximum possible value so that scores range from 0 (worst score) to 1 (perfect match).
First, our tool considers all possible pairings of individual features from the unknown query image and all the images it is to be compared against. As each image may have hundreds of features and the database may contain thousands of images, matching feature pairs are identified efficiently using a “Best-Bin-First” (BBF) k-d-tree approximation (Beis and Lowe 1997) to the nearest neighbor Euclidean distance between feature vectors, resulting in a significant (100×) speedup.
In keeping with classic SIFT, we only consider features to be matched if the ratio of the nearest neighbor distance to the second-nearest neighbor distance is below a threshold (in this case, a threshold of 0.75 was empirically chosen). This ensures that chosen features are sufficiently distinctive relative to the overall feature set.
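The distinctiveness criterion can be illustrated with a brute-force sketch that follows Lowe's formulation of the ratio test, in which a match is accepted only when the nearest neighbor is clearly closer than the second-nearest. The exhaustive search stands in for the BBF k-d-tree approximation, and the function names and tuple layout are assumptions made for illustration.

```python
import math

RATIO_THRESHOLD = 0.75  # value reported in the text


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ratio_test_matches(query, candidates, threshold=RATIO_THRESHOLD):
    """Return (query_index, candidate_index, distance) for distinctive matches."""
    matches = []
    if len(candidates) < 2:
        return matches  # a second-nearest neighbor is needed for the ratio
    for qi, q in enumerate(query):
        ranked = sorted((euclidean(q, c), ci) for ci, c in enumerate(candidates))
        (d1, best), (d2, _) = ranked[0], ranked[1]
        # Accept only if the best match is clearly better than the runner-up.
        if d2 > 0 and d1 / d2 < threshold:
            matches.append((qi, best, d1))
    return matches
```

A query descriptor that is almost equally close to two candidates yields a ratio near 1 and is discarded, even though its nearest neighbor may be very close in absolute terms.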
In addition, candidate matching feature pairs are rejected if their absolute scale difference divided by the greater of the two scales exceeds a threshold of 0.5. This ensures that features are only considered to be matched if they differ in size by no more than one scale-space octave. As features at a very fine scale are less likely to be significant, we ignore features with a very low scale (a scale-space value of 1.1 was empirically selected as a good cut-off) and weight the contribution (see wn above) of matching keypoint pairs based on their absolute and relative scales:
where fi ∊ Fi and fj ∊ Fj are the candidate features in images I and J, is their mean scale, and a value of P = 0.10 yielded best results on an evaluation set.
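The two scale-based rejection rules can be sketched as a simple predicate. Only the thresholds 0.5 and 1.1 come from the text; the function names are hypothetical, treating the cut-off as "strictly below 1.1 is ignored" is an assumption, and the weighting of accepted pairs is deliberately left out.

```python
MIN_SCALE = 1.1           # cut-off for very fine-scale features (from the text)
MAX_REL_SCALE_DIFF = 0.5  # threshold on the relative scale difference (from the text)


def relative_scale_difference(scale_a, scale_b):
    """Absolute scale difference divided by the greater of the two scales."""
    return abs(scale_a - scale_b) / max(scale_a, scale_b)


def scales_compatible(scale_a, scale_b):
    """True if both features are coarse enough and within one octave of each other."""
    if scale_a < MIN_SCALE or scale_b < MIN_SCALE:
        return False
    return relative_scale_difference(scale_a, scale_b) <= MAX_REL_SCALE_DIFF
```

Because the difference is divided by the larger scale, a threshold of 0.5 corresponds to the smaller scale being at least half the larger one, that is, a factor of two or one octave.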
The crucial point captured by this algorithm is that the distinctiveness (discriminability) of features, rather than just the number of similar features, is the most important factor in finding good matches. Images with a fairly random distribution of features could lead to many incidentally matching features, but these are not considered useful matches unless they are also relatively distinctive (as determined using the nearest neighbor distance ratio). Even if they pass the test for distinctiveness, their contribution to the overall matching score is normalized based on scale and the overall number of features in both images.
We also ensure that keypoint matches are unique, that is, the same feature in one image can only be “paired” with a feature in another image once. Furthermore, the image-to-image feature comparison is computed bidirectionally, with the lower of the two directional scores being used as the final matching score:
$$\mathrm{score}(I, J) = \min\bigl(\mathrm{score}_{I \to J},\ \mathrm{score}_{J \to I}\bigr)$$
where the subscripts denote the direction of the comparison.
This removes bias that might otherwise result if the two images differ greatly in their feature complexity (due to the BBF approximation when computing feature distances), and also ensures that the final scores are symmetric.
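Uniqueness and bidirectional scoring can be sketched as follows. The text does not say how competing pairings are resolved, so the greedy smallest-distance-first strategy is an assumption, and `directional_score` is a hypothetical callable standing in for the one-directional image-to-image score.

```python
def unique_matches(candidate_pairs):
    """Keep at most one pairing per feature, preferring smaller distances.

    candidate_pairs holds (index_in_I, index_in_J, distance) tuples.
    """
    used_i, used_j, kept = set(), set(), []
    for i, j, dist in sorted(candidate_pairs, key=lambda pair: pair[2]):
        if i not in used_i and j not in used_j:
            used_i.add(i)
            used_j.add(j)
            kept.append((i, j, dist))
    return kept


def bidirectional_score(directional_score, image_i, image_j):
    """Return the lower of the two directional scores, which is symmetric."""
    return min(directional_score(image_i, image_j),
               directional_score(image_j, image_i))
```

Taking the minimum is the conservative choice: a pair of images scores highly only if the match is strong in both directions.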
In practice we are interested in matching an image I depicting an unknown manta ray against a set of labeled images J to establish whether I shows one of the M manta rays in J or whether it is a “new” (as-yet-unidentified) ray. Hence, J is effectively partitioned into subsets Jm for each manta ray m ∊ M, as there are likely to be multiple images of most rays (i.e., each Jm usually contains more than one image) in an ecological database. Our MM algorithm exploits this fact by first computing the pairwise (bidirectional, as above) comparisons between I and all the elements of each set Jm. We then combine the resulting image-to-image scores for Jm by computing their mean, and use this as the overall similarity score between image I and manta ray m:
$$\mathrm{score}(I, m) = \frac{1}{|J_m|} \sum_{J \in J_m} \mathrm{score}(I, J)$$
In our experiments we considered using the mean, median, maximum, and minimum as the score combination criterion, but consistently achieved best results using the mean.
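A minimal sketch of this per-individual combination, which also orders individuals by decreasing combined score, is shown here. The dictionary layout and the names `rank_individuals` and `COMBINERS` are illustrative assumptions; the four combination criteria are those named in the text.

```python
from statistics import mean, median

# Score combination criteria considered in the text; the mean performed best.
COMBINERS = {"mean": mean, "median": median, "max": max, "min": min}


def rank_individuals(scores_by_individual, combine="mean"):
    """Combine image-to-image scores per individual and rank in decreasing order.

    scores_by_individual maps an individual's identifier to the list of
    scores between the query image and each of that individual's images.
    """
    combiner = COMBINERS[combine]
    combined = {
        individual: combiner(scores)
        for individual, scores in scores_by_individual.items()
        if scores  # skip individuals without any database images
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)
```

Switching the `combine` argument makes it straightforward to compare the criteria on the same set of image-to-image scores.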
We then sort these scores and output a ranked list of manta rays in decreasing order of how well they match the query. Consequently, the image I is then deemed to be best matched to the manta ray m with the highest similarity score, but by outputting a ranked list of results we allow the user to make the final decision as to which manta ray (if any) is the “correct” match. Figure 3 shows two examples of this: in each case, an unknown manta ray is used as the “query image” and the system displays a ranked list of the best matching manta ray images from a database (only the top three matches are shown in the figure). In order to give the user some indication of how reliable the system's rankings are likely to be, it also computes a “confidence score” based on the ratio of the scores between the first and second ranked results:
A high confidence score is an indication that the best matching image is significantly more similar to the query than any other image. If the confidence is low, then the user may need to inspect a larger number of matching results to ascertain which (if any) of them actually matches the query image.