Learning features from georeferenced seafloor imagery with location guided autoencoders

Although modern machine learning has the potential to greatly speed up the interpretation of imagery, the varied nature of the seabed and limited availability of expert annotations form barriers to its widespread use in seafloor mapping applications. This motivates research into unsupervised methods that function without large databases of human annotations. This paper develops an unsupervised feature learning method for georeferenced seafloor visual imagery that considers patterns both within the footprint of a single image frame and broader scale spatial characteristics. Features within images are learnt using an autoencoder developed based on the AlexNet deep convolutional neural network. Features larger than each image frame are learnt using a novel loss function that regularises autoencoder training using the Kullback–Leibler divergence function to loosely assume that images captured within a close distance of each other look more similar than those that are far away. The method is used to semantically interpret images taken by an autonomous underwater vehicle at the Southern Hydrates Ridge, an active gas hydrate field and site of a seafloor cabled observatory at a depth of 780 m. The method's performance when applied to clustering and content‐based image retrieval is assessed against a ground truth consisting of more than 18,000 human annotations. The study shows that the location based loss function increases the rate of information retrieval by a factor of two for seafloor mapping applications. The effects of physics‐based colour correction and image rescaling are also investigated, showing that the improved consistency of spatial information achieved by rescaling is beneficial for recognising artificial objects such as cables and infrastructures, but is less effective for natural objects that have greater dimensional variability.

means that seafloor imaging typically requires underwater terrains to be followed at close altitudes of between 2 and 10 m and the use of strobed illumination. Under these conditions, even small fluctuations in altitude strongly affect the colour balance, spatial resolution, and area covered, reducing the consistency between image frames. Furthermore, the footprint of each image is limited to an edge length of just a few metres, which is significantly smaller than many of the geological and ecological features that are of interest for scientific analysis and statutory monitoring. Although efforts to develop shared annotation schemes and datasets exist within the marine imaging community (Bewley, Friedman, et al., 2015; Langenkämper, Zurowietz, Schoening, & Nattkemper, 2017), the variability of seafloor environments and imaging systems, and the limited number of experts with domain specific knowledge, mean that comprehensive annotated training datasets similar to those available on land (e.g., SpaceNet, ImageNet, COCO, Pascal VOC) are unlikely to be developed. This paper investigates the use of unsupervised learning to extract features and perform semantic interpretation of seafloor imagery.
A key advantage of unsupervised methods is that they do not require annotated datasets for training. Although the use of unsupervised learning for clustering seafloor acoustic imagery (Hasan, Ierodiaconou, Laurenson, & Schimel, 2014) and visual imagery (Kaeli & Singh, 2015; Steinberg, 2013; Steinberg, Friedman, Pizarro, & Williams, 2011) has been reported, most previous work has used manually selected features defined based on domain specific knowledge, limiting their ability to generalise across datasets. More recently, unsupervised frameworks that learn suitable features from a dataset have shown great promise as a general tool for semantic interpretation of seafloor imagery (Flaspohler, Roy, & Girdhar, 2017; Rao, De Deuge, Nourani-Vatani, Williams, & Pizarro, 2017). This study aims to further develop this concept and improve the extraction of information about geological and ecological features that exist on spatial scales larger than the footprint of a single image. This is achieved by developing an autoencoder framework that regularises feature learning using georeferencing information. The main contributions of this paper are:
• Development of an autoencoder feature learning framework that can take into account georeferencing information using a novel loss function based on Kullback-Leibler divergence.
• Investigation of the effectiveness of georeference regularisation, physics based colour correction and spatial scale information on learning using an expert labelled ground truth.
• Demonstration of semantic mapping applications of learnt features through clustering and content-based image retrieval.
The autoencoder developed in this study learns features using a deep convolutional neural network based on AlexNet (Krizhevsky, Sutskever, & Hinton, 2012). A novel loss function that uses georeference information is used to regularise learning by minimising the Kullback-Leibler divergence between affinity in the latent feature space and affinity in geographic location. The proposed method is applied to semantically interpret images collected by the AUV ae2000f of the University of Tokyo, Japan. The dataset consists of more than 12,000 images collected from an altitude of 6 m off the seafloor at the Southern Hydrate Ridge, an active gas hydrate field at a depth of 780 m and site of the Ocean Observatories Initiative's seafloor cabled observatory (Cowles, Delaney, Orcutt, & Weller, 2010). The effectiveness of the proposed method is assessed using more than 18,000 expert annotations.
2 | BACKGROUND

2.1 | Semantic interpretation of seafloor imagery

Feature engineering is crucial to effectively interpret visual imagery.
Seafloor images have unique characteristics compared to terrestrial datasets, and several studies have demonstrated semantic interpretation using manually selected features tailored to specific subsea applications (Beijbom, Edmunds, Kline, Mitchell, & Kriegman, 2012; Maki, Kume, Ura, Sakamaki, & Suzuki, 2010; Pizarro, Rigby, Johnson-Roberson, Williams, & Colquhoun, 2008; Thornton, Asada, Bodenmann, Sangekar, & Ura, 2012). These have been used for classification and segmentation within images and mosaiced reconstructions. More recently, an attempt to develop a generic feature extraction method by Steinberg et al. (2011) used Local Binary Pattern (LBP; Ojala, Pietikäinen, & Mäenpää, 2002) features derived from greyscale images, together with three-dimensional (3D) rugosity features and colour features, for unsupervised clustering of seafloor stereo images. In Steinberg (2013), the author proposed Sparse Coding Spatial Pyramid Matching (ScSPM; Yang, Yu, Gong, & Huang, 2009) as a more generic approach. However, this required additional techniques to reduce the dimensionality of the ScSPM outputs to perform classification. In Kaeli and Singh (2015), accumulated histograms of oriented gradients from keypoints were used to describe each image, and this was applied to clustering and anomaly detection. A common characteristic of these methods is that they can preserve multiscale features in the images, which is important as the size of seafloor targets can vary. Although these approaches are intrinsically robust to scale variance, they have many hyperparameters that require manual tuning to optimise performance for each dataset. Moreover, features larger than the footprint of a single frame cannot be captured.
For supervised learning applications, LBP and ScSPM features have been shown to be effective (Bewley, Nourani-Vatani, et al., 2015;Rao et al., 2017). Deep learning techniques can optimise feature learning and classification simultaneously within the same end-to-end training process. In Mahmood et al. (2018)

| Autoencoders
The autoencoder is a variation of the artificial neural network that is useful for unsupervised feature learning. It consists of two parts: an encoder and a decoder. The encoder maps the original data x into a latent representation h of lower dimensionality, and can be expressed as h = f_ϕ(x). The decoder maps h back into the input space, x_r = g_θ(h), and reconstructs x_r to be as similar to the original sample x as possible for a given latent representation. When the values in x are continuous, the difference between x and x_r can be measured as the mean squared error. Given n samples in a dataset, the autoencoder's objective function can be formulated as

L_rec = (1/n) Σ_{i=1}^{n} ‖x_i − g_θ(f_ϕ(x_i))‖²,  (1)

where ϕ and θ denote the parameters of the encoder and decoder, respectively. The biggest advantage of the autoencoder is that the networks can be trained without the need for expert annotations. Since x_r is reconstructed from a latent representation h that preserves key information in x in a lower dimensional space, h can be thought of as the set of features of a given size that best represents the original data. In Rao et al. (2017), autoencoders are applied in part to learn mid-level features in visual imagery after extracting low-level features with ScSPM. Flaspohler et al. (2017) apply convolutional autoencoders for unsupervised feature learning from seafloor imagery and show that they outperform hand-designed features in discovering characteristic patterns.
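As a concrete illustration, the reconstruction objective in Equation (1) can be sketched with a minimal linear encoder/decoder pair. This is an illustrative sketch only: the function names and the linear mappings are assumptions, not the AlexNet-based implementation used in the paper.

```python
import numpy as np

def encode(x, W_enc):
    """Encoder f_phi: map input x (d,) to a lower-dimensional latent h (m,)."""
    return W_enc @ x

def decode(h, W_dec):
    """Decoder g_theta: reconstruct x_r (d,) from the latent h (m,)."""
    return W_dec @ h

def reconstruction_loss(X, W_enc, W_dec):
    """Mean squared reconstruction error over n samples (Equation (1))."""
    X_r = np.array([decode(encode(x, W_enc), W_dec) for x in X])
    return float(np.mean(np.sum((X - X_r) ** 2, axis=1)))
```

With identity weights the reconstruction is perfect and the loss is zero; training adjusts W_enc and W_dec (or, in the paper, the convolutional network weights ϕ and θ) to minimise this quantity.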
To enhance the unsupervised feature learning performance of autoencoders, several studies have demonstrated training of autoencoders with additional loss functions designed to maximise clustering in the latent representation space (Aljalbout, Golkov, Siddiqui, Strobel, & Cremers, 2018; Min et al., 2018). A typical loss function can be formulated as

L = L_rec + λ L_clust,  (2)

where L_clust is a clustering loss and λ is a hyperparameter designed to balance L_rec and L_clust. In Yang, Fu, Sidiropoulos, and Hong (2017), the use of such a loss function for k-means clustering significantly improved clustering performance. In Xie, Girshick, and Farhadi (2016), L_clust is formulated as

L_clust = KL(P‖Q) = Σ_i Σ_k p_ik log(p_ik / q_ik),  (3)

q_ik = (1 + ‖h_i − μ_k‖²)⁻¹ / Σ_{k′} (1 + ‖h_i − μ_{k′}‖²)⁻¹,

p_ik = (q_ik² / f_k) / Σ_{k′} (q_{ik′}² / f_{k′}),

where μ_k is the centroid of cluster k in the latent representation space, p_ik and q_ik are the [i, k]th elements of the probability distributions P and Q, and f_k is the soft cluster frequency, defined as Σ_i q_ik. The summation over k′ means that the values of (1 + ‖h_i − μ_{k′}‖²)⁻¹ are calculated for all clusters and summed for use as a normalisation factor. The element q_ik can be interpreted as the probability of assigning h_i to cluster k, defined with the Student's t-distribution as a kernel following the t-SNE algorithm (Maaten, 2008). The element p_ik is a target value derived from q_ik to maximise the separation between cluster k and the other clusters. After pre-training with L_rec, the network is trained by minimising the Kullback-Leibler (KL) divergence between P and Q. Since L_clust in Equation (3) is derived from a soft cluster assignment and is differentiable, it can be efficiently optimised using back-propagation. For public datasets, the use of a clustering loss was shown to improve clustering accuracy by up to 2.5% for the MNIST dataset. However, both studies require the number of clusters to be manually set, which is not practical for seafloor images or other natural scenes where the appropriate number of clusters is not known.
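The soft assignment q_ik, the sharpened target distribution p_ik, and the KL clustering loss described above can be sketched as follows. This is a minimal NumPy rendering of the formulation in Xie et al. (2016); all names are illustrative.

```python
import numpy as np

def soft_assignments(H, centroids):
    """q_ik: Student's t kernel between latent h_i and centroid mu_k."""
    # squared distances between each sample and each centroid, shape (n, K)
    d2 = ((H[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = 1.0 / (1.0 + d2)
    return q / q.sum(axis=1, keepdims=True)   # normalise over clusters k'

def target_distribution(q):
    """p_ik: sharpen q by squaring and dividing by soft cluster frequency f_k."""
    f = q.sum(axis=0)                          # soft cluster frequencies
    p = (q ** 2) / f
    return p / p.sum(axis=1, keepdims=True)

def kl_divergence(p, q):
    """L_clust = KL(P || Q), summed over samples and clusters (Equation (3))."""
    return float((p * np.log(p / q)).sum())
```

Because p sharpens q towards confident assignments, the KL term pulls samples towards their most likely centroid during training.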
Another important application of autoencoders is anomaly detection, since anomalous data that are rarely observed in the dataset cannot be reconstructed precisely and therefore have a large value of L_rec. Zurowietz, Langenkämper, Hosking, Ruhl, and Nattkemper (2018) apply autoencoders to detect anomalous regions in seafloor images as candidates for living organisms, since these are less frequently observed than backgrounds (i.e., rocks and sand).

| Preprocessing
Images taken underwater are distorted by the water column. Colour and geometry corrections can be applied to improve the consistency of datasets before feature learning. In this study, colour correction parameters are estimated based on the altitude at which each image was taken, to compensate for wavelength dependent attenuation in the water column. Altitude is also used to rescale undistorted images and reduce the scale variance caused by differences in range to targets. Figure 1 shows examples of images captured at different altitudes together with their colour histograms. The methods used for correction are described in the following sections.

| Colour correction
Light attenuation in water differs for each wavelength that constitutes the RGB channels. Since red attenuates more aggressively than green or blue wavelengths, uncorrected underwater images appear blue and green (Jaffe, 1990; Figure 1a). Seafloor images captured at low altitudes (Figure 1a top) are also brighter than the images captured at high altitudes (Figure 1a bottom). Often wide angle lenses are used to maximise the imaged area, and this can cause pixels at the centre of each image to be brighter than those at its edges. Pixel-wise colour correction normalises each pixel by the mean and standard deviation of the same pixel across an entire dataset based on the grey-world assumption (Buchsbaum, 1980). This can improve the imbalance between colour channels and uneven brightness within each image (Figure 1b). However, pixel-wise normalisation cannot correct colour variations caused by altitude differences within a dataset. To compensate for these variations, Bryson, Johnson-Roberson, Pizarro, and Williams (2013) proposed a practical method that improves colour consistency by taking into account the attenuation of the different colour channels. This study applies a similar approach, where the attenuation is approximated as

I_c(u, v) = I_c⁰(u, v) exp(−α_c a),

where I_c is the observed intensity of colour channel c at pixel (u, v), I_c⁰ is the unattenuated intensity, α_c is the channel-dependent attenuation coefficient, and a is the imaging altitude. This method assumes the seafloor is flat, which is reasonable when the vertical profile in each image is small relative to the altitude. The pixel-wise normalisation also corrects for vignetting. Since outliers in the dataset disturb the normalisation, the 10% extreme intensity values for each pixel location are trimmed before determining the model parameters. Figure 1c shows the result of the proposed colour correction. Compared to Figure 1b, the brightness between images taken at different altitudes is more uniform.
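A minimal sketch of the two correction steps is given below, assuming a Beer-Lambert style exponential attenuation model. The per-channel coefficients and function names are illustrative assumptions, not values estimated in this study.

```python
import numpy as np

def attenuation_correct(image, altitude, alpha=(0.40, 0.10, 0.05)):
    """Compensate per-channel attenuation, assuming I_observed = I_true * exp(-alpha_c * a).

    image: float array (H, W, 3), RGB, values in [0, 1]; altitude a in metres.
    alpha: illustrative attenuation coefficients (red attenuates most).
    """
    gain = np.exp(np.asarray(alpha) * altitude)   # invert the exponential decay
    return np.clip(image * gain, 0.0, 1.0)

def pixelwise_normalise(stack, trim=0.10):
    """Grey-world pixel-wise normalisation with 10% extreme values trimmed.

    stack: float array (N, H, W, 3) of co-registered images; returns the stack
    standardised per pixel location and channel.
    """
    lo = np.quantile(stack, trim / 2, axis=0)
    hi = np.quantile(stack, 1 - trim / 2, axis=0)
    clipped = np.clip(stack, lo, hi)               # trim outliers per pixel
    mean = clipped.mean(axis=0)
    std = clipped.std(axis=0) + 1e-8
    return (stack - mean) / std
```

In practice the attenuation coefficients would be fitted to the dataset from the observed intensity-altitude relationship rather than fixed in advance.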

| Geometry correction
The 3D information needed to fully compensate for scale effects within an image frame is not always available. Therefore, this study approximates scale effects from the imaging altitude and the lens field of view on a per-image basis. Geometric distortions are also corrected using lens calibration data. Each image is downsampled to a consistent spatial resolution of 10 mm/pixel, which is considered appropriate for the imaging setup used for the experiments analysed in this paper (see Table 1). In this study, the roll and pitch of the images are not taken into account. This is reasonable for correctly trimmed underwater vehicles with downward-looking imaging systems. The reconstruction loss in Equation (1) is calculated as the difference between the augmented images before noising and the reconstructed images.
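Under the flat-seafloor, downward-looking approximation described above, the rescaling factor can be derived from altitude and field of view alone. The following sketch assumes a simple pinhole geometry; the function names and parameters are illustrative.

```python
import math

def ground_resolution(altitude_m, fov_deg, image_width_px):
    """Approximate per-pixel ground footprint (m/pixel) for a flat seafloor
    imaged by a downward-looking camera at the given altitude."""
    footprint_m = 2.0 * altitude_m * math.tan(math.radians(fov_deg) / 2.0)
    return footprint_m / image_width_px

def rescale_factor(altitude_m, fov_deg, image_width_px, target_res_m=0.01):
    """Scale factor that resamples the image to target_res_m (10 mm/pixel)."""
    return ground_resolution(altitude_m, fov_deg, image_width_px) / target_res_m
```

A factor above 1 means the image must be downsampled to reach the target resolution; images taken at lower altitudes have finer native resolution and are downsampled more.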

| Georeference regularised learning
Geological and ecological features of the seafloor, such as sediments, bacterial mats and seafloor infrastructure, and background substrates, such as sand and rock, exist over spatial scales larger than the footprint of a single image frame. To capture this property, the following assumption is made:
Assumption. Two images captured within a close distance tend to look more similar than two that are far away.
In general, a favourable feature learner should embed h_i and h_j close together in the latent representation space if the original data x_i and x_j are similar. Based on the assumption, the affinity between h_i and h_j in the latent representation space should be modified to account for the affinity between the geographic locations y_i and y_j at which x_i and x_j were measured. For seafloor imagery, it is reasonable to assume that y is known, since images are captured by AUVs or other platforms with navigational sensors, and methods to determine position are well documented (Paull, Saeedi, Seto, & Li, 2013). To implement this idea, the Student's t-distribution is used as a kernel to measure affinity (Maaten, 2008; Xie et al., 2016) in both the latent representation (h) space and the geographic (y) space. Thus q′_ij, the value at index (i, j) of the affinity matrix Q′ in the latent representation space, can be defined as

q′_ij = (1 + ‖h_i − h_j‖²)⁻¹ / Σ_{k≠l} (1 + ‖h_k − h_l‖²)⁻¹.  (5)

Likewise, p′_ij, the element of the affinity matrix P′ in physical space for the georeferenced data, can be defined as

p′_ij = (1 + d(y_i, y_j)²)⁻¹ / Σ_{k≠l} (1 + d(y_k, y_l)²)⁻¹,  (6)

where d(y_i, y_j) = min(‖y_i − y_j‖, d_max) and d_max is the user-defined upper limit of the distance between two locations that are considered correlated. This limit prevents P′ from overfitting to images that are a large distance apart. To capture features larger than a single image, d_max should be sufficiently larger than the physical footprint of each image. A single appropriate value of d_max works for most targets and backgrounds, even though their sizes vary widely. This is because even if two images belonging to a continuous pattern are separated by more than d_max, other images belonging to the same pattern will exist between them, and all of these images will be embedded close together in the latent representation space. The autoencoder is trained so that the KL divergence between the two affinity matrices P′ and Q′ is minimised.
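A minimal NumPy sketch of the georeference loss follows. It assumes symmetric t-kernel affinities normalised over all off-diagonal pairs in the batch; that normalisation choice and all names are assumptions of this sketch, not details confirmed by the paper.

```python
import numpy as np

def t_affinity(D2):
    """Student's t kernel affinity matrix from squared pairwise distances,
    normalised to sum to 1 over all off-diagonal pairs."""
    n = D2.shape[0]
    A = 1.0 / (1.0 + D2)
    A[np.arange(n), np.arange(n)] = 0.0   # exclude self-pairs
    return A / A.sum()

def geo_loss(H, Y, d_max):
    """L_geo: KL(P' || Q') between geographic and latent affinities (Eqs 5-6).

    H: (n, m) latent representations; Y: (n, 2) image locations in metres.
    """
    # latent affinities Q'
    Dh2 = ((H[:, None, :] - H[None, :, :]) ** 2).sum(axis=2)
    Q = t_affinity(Dh2)
    # geographic affinities P', with distances clamped at d_max
    Dy = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2))
    Dy = np.minimum(Dy, d_max)
    P = t_affinity(Dy ** 2)
    mask = P > 0
    return float((P[mask] * np.log(P[mask] / Q[mask])).sum())
```

When the latent layout mirrors the geographic layout the loss is zero, and it grows as nearby images drift apart in the latent space; in training this gradient is back-propagated through the encoder.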
The proposed loss function becomes

L_all = L_rec + λ L_geo,  (7)

where L_geo = KL(P′‖Q′) and λ is a hyperparameter for balancing L_geo and L_rec. This can be optimised iteratively using mini-batches. However, if many of the images in a mini-batch are sampled from locations separated by more than d_max, the elements of P′ become similar and the training is not regularised as intended. To avoid this issue, at each mini-batch sampling, the first image is randomly chosen from the whole dataset and the other images that populate the batch are chosen according to their physical proximity to the first image.
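The proximity-based mini-batch sampling described above can be sketched as follows; the exact neighbour-selection rule is an assumption of this sketch (the paper only states that the remaining images are chosen according to physical proximity to the first).

```python
import numpy as np

def proximity_batch(Y, batch_size, rng=None):
    """Sample a mini-batch: one random seed image plus its nearest neighbours.

    Y: (n, 2) array of image locations in metres. Returns batch_size indices
    whose locations are closest to a randomly chosen first image.
    """
    rng = np.random.default_rng(rng)
    first = rng.integers(len(Y))
    dists = np.linalg.norm(Y - Y[first], axis=1)
    return np.argsort(dists)[:batch_size]   # first element is the seed itself
```

Sampling this way keeps most pairwise distances in the batch below d_max, so the geographic affinity matrix P′ retains contrast and the regularisation remains informative.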
Since the t-distribution is heavy-tailed, L_geo only loosely regularises the latent representation space to follow the assumption.
The appropriate setting of λ is also necessary because if the regularisation is too forceful, L rec is ignored and the latent representations become meaningless as features for semantic interpretation. In addition, to deal with the uncertainty of georeference information, images are randomly shifted within 25% of each image patch size at every sampling step. The similarity assumption has no hypothesis on the rotation of images, and so random rotations are applied to the images to avoid fitting rotation variances in the dataset. Figure 2 gives an overview of the proposed feature learner. The proposed autoencoder learns local features within an image using a convolutional neural network. Learning is also regularised by the georeference loss function to account for patterns larger than each image footprint. Once the autoencoder has been trained to minimise Equation (7), its encoder can be used as a feature extractor.
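The random shift and rotation augmentation applied during training can be sketched as follows. For simplicity this sketch uses 90-degree rotation steps, whereas arbitrary rotations would require interpolation; the crop-from-mosaic formulation and all names are assumptions.

```python
import numpy as np

def augment_patch(mosaic, cx, cy, patch, rng=None):
    """Crop a patch centred near (cx, cy), randomly shifted by up to 25% of the
    patch size to reflect georeferencing uncertainty, then randomly rotated
    (simplified here to multiples of 90 degrees)."""
    rng = np.random.default_rng(rng)
    max_shift = int(0.25 * patch)
    dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
    x0, y0 = cx + dx - patch // 2, cy + dy - patch // 2
    crop = mosaic[y0:y0 + patch, x0:x0 + patch]
    return np.rot90(crop, k=rng.integers(4))
```

The shift prevents the network from overfitting to exact positions within the navigation error budget, while the rotation removes any heading bias in the survey pattern.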
Denoising, random rotation and shifting are not applied when extracting features with the encoder. It should be noted that georeference information y is also not used in the feature extraction phase. This is because the aim of embedding georeference information is not to map the absolute coordinates of the images to the latent representation space, but to control the feature mapping by embedding the assumption into the trained autoencoder. This allows the encoder to extract features from datasets unrelated to the training dataset, and from datasets without georeferencing information. Once the autoencoder has been trained, the similarity between two images can be measured in the latent representation space using either the Euclidean distance or the cosine similarity between their latent representations (Wu et al., 2013). Since the similarities are defined in the latent representation space, georeference information is unnecessary for this application. However, predicting the performance of the two metrics is difficult for features learnt by an autoencoder, since the interpretation of their meaning is nontrivial. Therefore, this study compares their performance experimentally.
F I G U R E 2 Flow diagram for calculating the proposed loss function L_all (Equation (7)). L_rec is the reconstruction loss of the autoencoder (Equation (1)). L_geo is the divergence loss between the two affinity matrices in the latent representation space (Equation (5)) and in the physical space (Equation (6)).

| EXPERIMENT
The methods developed in this study are applied to seafloor imagery obtained at the Southern Hydrate Ridge, a gas hydrate field that is home to a seafloor cabled observatory (Cowles et al., 2010) located 100 km off Oregon, USA (Table 1). Over 12,000 images of the site were collected using the AUV ae2000f of the Institute of Industrial Science, University of Tokyo, Japan, during the Schmidt Ocean Institute's FK180731 #Adaptive Robotics campaign in August 2018. Table 1 gives an overview of the dataset, and Figure 3a shows an ortho-projected mosaic created from the images in the dataset using the stereo SLAM method of Mahon et al. (2008). This has been applied to data collected by the AUV's navigational sensors, consisting of an iXblue Quadrans IMU, RDI 300 kHz DVL, Paroscientific depth sensor and iXblue Gaps USBL, and stereo imagery collected by the SeaXerocks mapping system of the University of Tokyo, Japan (Thornton et al., 2016).
The relative position accuracy using this combination is estimated to be <1 m across the dataset. This is of a similar order to the randomly allocated shifting of images applied for data augmentation when training the autoencoder (25% of 2.27 m). This allows the autoencoder to take localisation uncertainty into consideration and avoids overfitting to the georeference information.

| Ground truth for evaluation
Ground truth annotations were generated using SQUIDLE+ (Bewley, Friedman, et al., 2015) by experts for 18,740 (approx. 30%) image patches randomly selected from the original 62,875 image patches. Figure 3 shows the spatial distributions, the numbers and examples of each category. Boundaries between some categories are ambiguous, especially for natural features such as "Rock," "Sand," and "Carbonate," where the density of the relevant targets varies on a continuum. From the appearance of the ground truth categories shown in Figure 3c, it is noticeable that these categories form patterns larger than the footprint of a single image, and thus the proposed georeference regularisation is expected to be effective. In this experiment, only the dominant label is given to each image patch, based on the individual annotator's judgement. Although this complicates the quantitative evaluation of performance, the relative performance between different conditions of the proposed feature learning can be used to verify how effective the methods developed in this paper are for semantic interpretation.

| Autoencoder training
To evaluate the effectiveness of the novel aspects of the proposed method, the autoencoder is trained to learn features in the dataset with/without colour attenuation correction (Section 3.1.1), rescaling (Section 3.1.2), and the georeference regularisation (Section 3.2.2).
The dimensionality of h is set to 16, since L_rec does not vary significantly even if larger values are used. The weights in the autoencoder are initialised with the original AlexNet trained on the ImageNet dataset. The mini-batch size is fixed at 256, and the Adam optimiser (Kingma & Ba, 2014) is used. The value of d_max, which limits d(y_i, y_j) in Equation (6), is set to 8.0 m. This is approximately 3.5 times the edge length of each image patch, which is appropriate for describing even large scale features because the images in this dataset constitute a dense grid with continuous cover between adjacent image pairs. A value of λ = 1 × 10⁻⁵ is used in Equation (7) for the georeference regularisation, and the number of epochs is set to 2,000. These parameters are empirically determined as values for which both L_rec and L_geo in Equation (7) converge.

T A B L E 1 Overview of the dataset
Total area covered (m²): 118,000
Mapping method: Dense grid with 30% overlap between images
Ground truth categories: 7 categories as shown in Figure 3c
Annotation platform: SQUIDLE+

It can be said that a better feature extractor outputs smaller distances between samples of the same category and larger distances between samples of different categories in the latent representation space.
Since this viewpoint is the same as that of internal evaluation metrics for clustering performance, the proposed feature learning can be evaluated using these metrics by inputting the ground truth instead of clustering results. The Silhouette score (Rousseeuw, 1987), Calinski and Harabasz score (CH; Caliński & Harabasz, 1974) and Davies-Bouldin score (DB; Davies & Bouldin, 1979) are used for the evaluation in this experiment. However, it should be noted that while these are the most widely used metrics to assess clustering performance, it has been reported that they cannot fully take into account imbalances in datasets (Krawczyk, 2016). Although the dataset analysed in this study is highly skewed (see Figure 3c), these metrics are used since no standard methods are available that overcome these limitations.
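As an example of how these internal metrics evaluate the latent space against ground truth, the Calinski and Harabasz score can be computed as the ratio of between-category to within-category dispersion. This is a straightforward NumPy sketch of the standard definition; the names are illustrative.

```python
import numpy as np

def calinski_harabasz(H, labels):
    """CH score: ratio of between- to within-category dispersion, each scaled
    by its degrees of freedom. Higher means categories are more compact and
    better separated in the latent representation space."""
    n, overall = len(H), H.mean(axis=0)
    cats = np.unique(labels)
    k = len(cats)
    between = within = 0.0
    for c in cats:
        Hc = H[labels == c]
        mu = Hc.mean(axis=0)
        between += len(Hc) * ((mu - overall) ** 2).sum()
        within += ((Hc - mu) ** 2).sum()
    return (between / (k - 1)) / (within / (n - k))
```

Feeding ground-truth labels rather than cluster assignments turns the metric into a direct measure of how well the learnt features separate the annotated categories.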

| Results
The internal evaluation metrics corresponding to each training condition, labelled C 1 to C 9 , are shown in Table 2. The latent representations h are normalised in each dimension as standard scores. Table 2 shows that the proposed georeference regularisation improves performance significantly for all metrics. The attenuation correction also increases performance, but the effectiveness of rescaling is less clear from these results alone. Figure 4 illustrates the distribution of expert annotations in the latent representation space h using t-SNE visualisation (Maaten, 2008). Figure 4a,b are for autoencoders trained without/with the georeference regularisation, respectively (corresponding to C 4 and C 8 in Table 2). The most distinguishing characteristic of the resulting representation is that the distribution corresponding to "Cable" forms an obvious cluster at the centre of Figure 4b with clear separation from other categories, while it is widely distributed in Figure 4a without the georeference regularisation. This illustrates how the georeference regularisation allows the autoencoder to prioritise features that are common between images taken in close proximity to each other over features that would be learnt without this regularisation. The other ground truth categories also gather more closely in Figure 4b than in Figure 4a, as reflected by the improved evaluation metrics in Table 2.
An NMI score is bounded between 0 (no mutual information) and 1 (perfectly matching labellings). Table 2 shows the number of clusters and the NMI scores for each autoencoder. The proposed georeference regularisation improves the NMI scores by a factor of 1.6 (C 2 to C 6 ) to 2.2 (C 4 to C 8 ) compared to equivalent analyses without this regularisation. The modification of the loss function in Equation (7) is effective at controlling the training process so that it obtains solutions closer to human interpretation. This can be expected, as it leverages an assumption about the scale of seafloor habitats and features, compensating for the limited image footprints that can be achieved underwater.
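For reference, NMI can be implemented as follows; this sketch normalises by the geometric mean of the two label entropies, which is one common convention (the specific normalisation used in the study is not stated in this excerpt).

```python
import numpy as np

def nmi(a, b):
    """Normalised mutual information between two labellings, bounded in [0, 1].

    Normalised by the geometric mean of the label entropies (one common
    convention). a, b: integer label sequences of equal length.
    """
    a, b = np.asarray(a), np.asarray(b)
    ca, cb = np.unique(a), np.unique(b)
    # joint distribution from the contingency table, plus marginals
    p = np.array([[np.mean((a == i) & (b == j)) for j in cb] for i in ca])
    pa, pb = p.sum(axis=1), p.sum(axis=0)
    nz = p > 0
    mi = (p[nz] * np.log(p[nz] / np.outer(pa, pb)[nz])).sum()
    ha = -(pa * np.log(pa)).sum()
    hb = -(pb * np.log(pb)).sum()
    return float(mi / np.sqrt(ha * hb)) if ha > 0 and hb > 0 else 0.0
```

Because NMI is invariant to label permutation, it scores a clustering against ground truth without requiring cluster identities to be matched first.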

| Results
T A B L E 2 Evaluation results of the proposed feature learning and clustering for conditions C 1 to C 9 . Note: The check (✓) and dash (-) marks indicate whether each preprocessing step or regularisation is applied or not, respectively. Each condition is labelled from C 1 to C 9 , and these labels are referred to in later sections. The best scores (the lowest for DB and the highest for Silhouette, CH and NMI) are shown in bold. Abbreviations: CH, Calinski and Harabasz score; DB, Davies-Bouldin score; NMI, normalised mutual information.
F I G U R E 4 A t-SNE visualisation of the latent representation h for the expert annotations (a) without georeference regularisation (C 4 in Table 2); (b) with georeference regularisation (C 8 in Table 2).

When georeference regularisation is used, the proposed light attenuation correction improves the NMI score by 23% (C 7 to C 9 ) and 38% (C 6 to C 8 ) compared to a simple grey-world assumption. In contrast, no increase in performance is observed when the georeference regularisation is not used. A possible explanation is that when autoencoder training is regularised to the local neighbourhood, colour information is used in the latent space, since adjacent images will tend to show a similar colour of seafloor. Under this assumption, any colour artefacts will degrade clustering performance. With no georeference regularisation, the autoencoder can easily end up being trained using images that are far apart, where the actual seafloor colour tends to be more varied. In this scenario, the autoencoder would not prioritise colour information in the latent representation space, and so would be less sensitive to differences in the colour correction method used. The results for rescaling are inconclusive, with no significant difference observed in the NMI scores compared to equivalent experiments without rescaling. Although rescaling is expected to be effective for images of objects with consistent physical sizes, objects in the natural scenes that dominate the dataset vary widely in size, and so no significant gains in NMI performance could be achieved. The maximum NMI score achieved is not high (0.227), which is in part due to the impact of imbalanced categories as reported by Krawczyk (2016), and therefore a category based evaluation is also necessary.
Representative images from each cluster in the result with the highest NMI score (C 8 in Table 2) are shown in Figure 5. The relationship between the ground truth and this clustering result is shown in Table 3. To obtain a better understanding of each identified cluster, a treemap (Bruls, Huizing, & Van Wijk, 2000) is shown in Figure 6, which allows the relative sizes of each cluster and their representative samples to be visualised simultaneously. To discuss the performance of the clustering result quantitatively, the confusion matrix is shown in Figure 7. Since the non-parametric Bayesian method optimises the number of clusters automatically, some clusters are manually merged based on the appearance of their representative samples so that the number of merged clusters corresponds to the number of ground truth categories. For example, clusters 'A', 'B' and 'F' are merged and regarded as 'Rock', and they appear in the first column of the confusion matrix as a single merged cluster. Since the number of 'Artificial Object' annotations in the ground truth is extremely small compared to other categories, this category is merged with 'Cable' and a 6 × 6 confusion matrix is shown.

| Habitat map
Habitat maps are useful as they summarise the geological and ecological patterns observed in a seafloor region.

Note (Table 3): Rows and columns correspond to the ground truth and the clustering result using C 8 in Table 2, respectively.
F I G U R E 6 Visualisation of the size of each cluster (C 8 in Table 2) using a tree-map representation. The same colours as in Figure 5 are assigned to each cluster, and the areas are proportional to the number of image patches in each cluster.

Figure 8 shows the habitat map obtained by plotting the semantic clusters generated by the proposed method. Figure 8b shows the result with the highest NMI score (C 8 in Table 2), and Figure 8a is the clustering result for the same preprocessing steps but without the georeference regularisation (C 4 in Table 2). Comparison with the distribution of ground truth in Figure 3b illustrates that the habitat map in Figure 8b can identify areas corresponding to categories such as "Bacterial Mat," "Shell Fragment," and "Cable" more effectively than the habitat map in Figure 8a. Since these categories have geographic distribution patterns larger than the footprint of an image, the proposed georeference regularisation is effective at extracting the features that are representative of these categories.

| Image search
The performance of content based image search using Euclidean distance and cosine similarity is quantitatively evaluated by taking the average values of top-10 accuracy, defined as the rate of retrieved images with the same ground truth category as the query image, for each ground truth category (Wu et al., 2013). Feature learning is achieved for the proposed autoencoder trained with/without the georeference regularisation and with/without rescaling. These correspond to autoencoders labelled C4, C5, C8, and C9, respectively, in Table 2.
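The top-10 accuracy metric described above can be sketched as follows. This is an interpretation of the evaluation in the text, not the authors' implementation, and the function name and signature are hypothetical:

```python
import numpy as np

def top_k_accuracy(features, labels, k=10, metric='l2'):
    """Mean rate at which the k nearest images share the query's label.

    features: (N, D) latent vectors; labels: length-N category labels;
    metric: 'l2' (Euclidean distance) or 'cos' (cosine similarity).
    """
    f = np.asarray(features, dtype=float)
    if metric == 'cos':
        f = f / np.linalg.norm(f, axis=1, keepdims=True)
        sim = f @ f.T
        order = np.argsort(-sim, axis=1)          # most similar first
    else:
        d = np.linalg.norm(f[:, None] - f[None, :], axis=2)
        order = np.argsort(d, axis=1)             # closest first
    labels = np.asarray(labels)
    hits = []
    for i, row in enumerate(order):
        nn = [j for j in row if j != i][:k]       # exclude the query itself
        hits.append(np.mean(labels[nn] == labels[i]))
    return float(np.mean(hits))
```

Averaging this quantity within each ground truth category gives the per-category scores reported in Table 5.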

| Results
The results in Table 5 show that the proposed georeference regularisation improves the performance in every category, with an overall increase in accuracy across all categories from 47% to 59%.
The largest improvement is for "Cable," which rises from 10% to 15% accuracy without the georeference regularisation to a maximum value of …

[Figure 7: Confusion matrix between the ground truth categories and the unsupervised clustering result using C8 in Table 2. Some clusters and ground truth categories are manually merged based on the appearance of representative images. The values in the matrix are normalised, and the diagonal elements correspond to the recall values in Table 4.]

[Table 4: Precision, recall, and F1-score for the clustering result using C8 in Table 2. The same cluster merging as in Figure 7 is applied. The total accuracy across all categories is 0.56.]

[Figure 8: Habitat maps based on the unsupervised clustering results. The clusters corresponding to "Bacterial Mat" ("J"), "Shell Fragment" ("H") and "Cable" ("L") appear clearly in (b). The results demonstrate that the proposed georeference regularisation enhances clustering performance over wide spatial distributions. (a) Without georeference regularisation (C4 in Table 2); (b) with georeference regularisation (C8 in Table 2).]

Regarding the similarity metrics, Equation (7) for the proposed georeference regularisation assumes that the similarities of h follow a t-distribution, which is derived from Euclidean distance. However, interpretation of the feature space learnt by the autoencoder is challenging, and the results indicate that Euclidean distance and cosine similarity are almost equivalent for the dataset used in this study.
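Since Equation (7) itself is not reproduced in this section, the idea behind the georeference regulariser can only be sketched: compare a t-distribution similarity over the latent vectors h against one computed from the image positions, using the Kullback-Leibler divergence. The t-SNE-style kernel and function names below are assumptions, not the paper's exact formulation:

```python
import numpy as np

def t_similarity(v):
    """Pairwise t-distribution (Cauchy) kernel over row vectors,
    normalised to a probability distribution (t-SNE-style sketch)."""
    v = np.asarray(v, dtype=float)
    d2 = np.sum((v[:, None] - v[None, :]) ** 2, axis=2)  # squared l2
    q = 1.0 / (1.0 + d2)
    np.fill_diagonal(q, 0.0)          # ignore self-similarity
    return q / q.sum()

def georeference_kl(h, xy):
    """KL(P || Q) between location-based similarities P (from image
    coordinates xy) and latent similarities Q (from features h).
    A sketch of the regulariser's intent, not the paper's Equation (7)."""
    p = t_similarity(xy)
    q = t_similarity(h)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
```

Adding this term to the reconstruction loss loosely encourages images captured close together to lie close together in the latent space, without forcing it: the divergence is zero only when the two similarity structures match exactly.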
Identifying the locations of similar images is important for interpreting the spatial patterns of targets of interest. In comparison to clustering, which interprets the representative patterns in the dataset, content based image search can generate target specific distribution maps using the same unsupervised feature space. This is useful when specific targets within a cluster are of interest, or where the target is rare and so does not form an independent cluster. Since the query target is known, the autoencoder and similarity metric can be tailored to the type of object; for human-made objects such as "Cable" and "Artificial Object," the georeference regularisation with rescaling and cosine similarity provided the best performance.

| Utility map
The utility maps in Figure 9 show some results of image search and the locations of images that have a similar appearance. These utility maps form a useful tool for rapidly understanding complex, multiparameter spatial patterns in georeferenced imagery. An important point is that the distributions in Figure 9 are spread widely and are not limited to the neighbourhood of the query images. This confirms that the proposed georeference loss function in Equation (7) allows meaningful features to be extracted from the images themselves without over-regularising the results of the image search. Looking more closely, Figure 9d shows that some of the search results do not include crabs, but instead contain other types of benthic organisms. To obtain more precise results for these categories, supervised learning based approaches are more suitable.

[Table 5: Mean top-10 accuracy of search in each category (%). 'l2' and 'cos' in the similarity metric correspond to Euclidean distance and cosine similarity, respectively. As with the clustering results in Table 2, the proposed georeference regularisation significantly improves the accuracy scores, especially for 'Cable', which has a characteristic spatial distribution. Columns correspond to conditions C4, C5, C8 and C9 in Table 2; georeference regularisation is applied in C8 and C9 only.]

| CONCLUSION
This paper has described a novel, unsupervised feature learning method for semantic interpretation of seafloor visual imagery and applied it to a seafloor dataset consisting of more than 12,000 images. Although this study has focused on subsea visual mapping applications, the methods are equally applicable to terrestrial georeferenced imaging applications such as drone and satellite imaging. The study has demonstrated that:
• Autoencoders implemented using deep convolutional neural networks form an effective and generic method for learning features in seafloor visual imagery.
• The use of georeference regularisation implemented using the Kullback-Leibler divergence criterion leads to a factor of two improvement in the retrieval of information from the seafloor images analysed in this study. This includes geomorphological and ecological patterns that occur on spatial scales larger than a single image frame.
• Correction of colour information in seafloor imagery using physics based techniques improves information retrieval rates by more than 20% when the georeference regularisation is used.
• Correction for spatial scale and distortion of images before feature learning improves the recognition of artificial structures on the seafloor. However, for natural objects that exhibit significant variability in size and shape, the gains in performance achieved through scale correction are minimal.
• Nonparametric Bayesian unsupervised clustering and content-based image search can be implemented directly on features learnt by the proposed autoencoder for effective semantic interpretation and visualisation of spatial patterns in seafloor visual mapping data.
• No significant difference was found between the performance of content-based retrieval of images when using Euclidean distance and cosine similarity metrics in the latent feature space.
The images used in this study can be accessed via SQUIDLE+.