Leveraging attention-based visual clue extraction for image classification

Deep learning-based approaches have made considerable progress in image classification tasks, but most of them lack interpretability, especially in revealing the decisive information causing the categorization of images. This paper seeks to answer the question of what clues encode the discriminative visual information between image categories and can help improve the classification performance. To this end, an attention-based clue extraction network (ACENet) is introduced to mine the decisive local visual information for image classification. ACENet constructs a clue-attention mechanism, that is, global-local attention, between the image and the visual clue proposals extracted from it, and then introduces a contrastive loss defined over the resulting discrete attention distribution to increase the discriminability of clue proposals. The loss encourages considerable attention to be devoted to discriminative clue proposals, that is, those similar within the same category and dissimilar across categories. The experimental results for the Negative Web Image (NWI) dataset and the public ImageNet2012 dataset demonstrate that ACENet can extract true clues to improve the image classification performance and outperforms the baselines.


INTRODUCTION
Image classification aims to categorize a set of unlabelled images into several predefined classes based on their visual content. It has become a critical task in multiple related areas, such as object detection/recognition [1], visual concept learning [2] and visual knowledge learning [3]. Considerable progress towards high classification performance has been made in recent years; however, important issues remain for the task, such as the interpretability of deep neural network-based approaches. The progress in image classification can be partitioned into two stages: human-designed feature-based methods and deep learning-based methods. Regarding the former, features are designed in advance in terms of colour, texture and gradient, and classifiers are applied to these extracted features. Some representative human-designed features, such as the histogram of oriented gradients (HOG) [4], the local binary pattern (LBP) [5], the scale invariant feature transform (SIFT) [6] and the speeded up robust feature (SURF) [7], have been proposed. In general, these human-designed features have good interpretability but may have limitations in achieving high image classification accuracy, especially for complicated data. Over recent years, convolutional neural networks (CNNs) [8] have achieved great success in a variety of visual tasks including image classification. CNNs consist of a stack of convolutional layers interleaved with non-linearities and downsampling, and are capable of capturing hierarchical patterns with global receptive fields as powerful image descriptors. Since the introduction of AlexNet [8] in 2012, a variety of new CNN architectures, including the visual geometry group network (VGGNet) [9], Inception net [10], the residual network (ResNet) [11], the dense convolutional network (DenseNet) [12] and NASNet [13], have been proposed.
Besides the design of CNN architectures, a number of works have sought to introduce new modules into networks to meet particular requirements [14][15][16]. The attention mechanism, which adaptively learns the effect of elements of the input on the output, is an important module for deep learning applications, such as image classification and image captioning [17,18]. Wang et al. [19] proposed a residual attention network for image classification, in which spatial attention and channel attention on the feature maps are introduced to enhance meaningful contents and suppress irrelevant ones.
In general, most previous approaches solve the image classification problem based on the global visual context or important local visual information. Actually, there are three types of visual information causing classification results: (1) the global content of images, widely used in a number of approaches; (2) the discriminative local visual clues appearing in a category while not existing in the others, such as some important objects; and (3) the common visual information that appears across different categories and is indiscriminative, such as some backgrounds. Figure 1 shows an example illustrating the effect of the last two types of visual information on image classification. In this figure, we show a few region proposals representing different contents of the images from two categories, that is, 'sea lion' and 'red wolf'. In the same category, there are region proposals (marked by red bounding boxes) corresponding to the key objects with the same semantics and reflecting the category information, while there are also some other region proposals (marked by yellow and blue bounding boxes) with different semantics that are irrelevant to the category across different images. In addition, we find that these region proposals irrelevant to the category may appear in images from different categories.
Hence, a well-formed image classification approach needs to capture and enhance the discriminative features effectively while suppressing the common visual information across different categories.
This paper proposes a novel approach called the attention-based clue extraction network (ACENet) to the learning of visual clues in image classification and seeks to answer the question of what clues encode the discriminative visual information between image categories and can help improve classification performance. First, we utilize Faster R-CNN [21] and selective search to obtain the visual clue proposals that consist of objects and backgrounds. Second, we present a clue-attention mechanism, called global-local attention, between images and visual clue proposals to adaptively measure their correlation. To extract the true clues from the proposals and reinforce their discriminability, we introduce a contrastive loss defined over the attention distribution, which encourages those clue proposals that are similar within the same category while being dissimilar across different categories to be given considerable attention. Finally, the image classification is performed based on the combination of the representations of images and the clues. The experimental results on ImageNet2012 and our NWI dataset demonstrate that our approach is effective in extracting the true visual clues and improving the image classification performance.
In summary, our contributions are three-fold: 1. We present the ACENet approach to improve image classification performance by leveraging local visual clues and seek to answer the question of what clues encode the discriminative visual information between image categories. 2. We propose the global-local attention mechanism, which adaptively measures the correlation between visual clue proposals and images in both training and testing and permits the clue proposals to make varying contributions to image classification. 3. A contrastive loss is introduced to reinforce the discriminability of visual clue proposals and sparsify the attention distribution with consideration of cross-sample information for the extraction of true clues.
The remainder of this paper is organized as follows. Section 2 presents a brief overview of related work. Section 3 describes the details of our proposed method, Section 4 reports the experimental results for the two datasets, and Section 5 concludes the paper.

RELATED WORK
At present, CNNs have been proven to be the most effective model for solving the image classification problem. Thus, we review the CNN-based methods for image classification in this section. CNN architectures. Since the introduction of AlexNet [8] in 2012, a variety of representative CNN architectures, including VGGNet [9], Inception net [10], ResNet [11], GoogLeNet [10], and DenseNet [12], have been proposed. These variants seek to explore a better model mainly from the following three aspects: (1) different network structures, (2) deeper networks, and (3) high efficiency. For example, Huang et al. [12] designed a new network structure named DenseNet for image classification, which connects each layer to every other layer in a feed-forward fashion and is different from traditional convolutional networks that generally connect each layer only to its subsequent layer. Regarding the depth of networks, VGGNet and GoogLeNet demonstrated that a deeper network would extract higher-quality image features. He et al. [11] presented ResNet with a depth of up to 152 layers, a substantially deeper network than those proposed previously, and demonstrated its powerful image classification performance on ImageNet. Convolutional networks are computationally expensive as they become increasingly deeper and wider. With the growing complexity of networks, the issue of how network architectures can be made more efficient has received considerable attention. In particular, group convolutions [22] have been proven to be a popular method to decrease the number of model parameters and reduce the computational complexity. MobileNet [23] and ShuffleNet [24], which could be viewed as extensions of the grouping operator, constructed efficient network architectures based on the combination of depthwise convolution and pointwise convolution with 1 × 1 convolutional filters. Tzelepis et al. [25] introduced a generic model compression method that had minimal impact on accuracy while reducing the inference time and memory footprint of a network.
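As an illustration of why the depthwise/pointwise factorization cuts model size, the following sketch (with arbitrary example channel counts, not the configuration of any particular network) counts the weights of a standard convolution against its depthwise-separable counterpart:

```python
# Parameter counts for a standard convolution versus the depthwise-separable
# factorization used in MobileNet-style architectures (illustrative sketch).

def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """k x k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    return c_in * k * k + c_in * c_out

std = conv_params(128, 256, 3)                 # 294,912 weights
sep = depthwise_separable_params(128, 256, 3)  # 33,920 weights
print(std, sep, round(std / sep, 1))           # roughly an 8.7x reduction
```

The saving grows with the kernel size and output channel count, which is why the factorization is attractive for efficiency-oriented designs.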
Attention mechanism. Regarding the progress of deep learning for image classification in terms of different network structures introduced above, the introduction of an attention mechanism module is a significant way to improve classification performance and interpretability. In general, attention can be interpreted as a means of biasing the allocation of available computational resources towards the most informative components of a signal [16,[26][27][28]. In recent years, spatial attention and channel attention have been widely used for CNNs, especially in visual tasks. Woo et al. [26] proposed a convolutional block attention module (CBAM) for CNNs that combined channel attention and spatial attention for adaptive feature refinement. Hu et al. [27] proposed a network called squeeze-and-excitation network (SENet) and introduced an architecture unit named the 'squeeze-and-excitation' block (SE block) to enhance the representational power of deep networks by explicitly modelling the interdependencies between the channels of the convolutional features. The SE block can be regarded as a self-attention function on channels. Attention mechanisms have been widely applied to the image classification and recognition tasks. Jaderberg et al. [16] introduced a new learnable module called the spatial transformer, which allows networks to select the visual regions of an image that are most relevant. Zheng et al. [28] proposed a multi-attention convolutional neural network to learn more discriminative fine-grained features in the fine-grained image classification task. By contrast, our proposed ACENet seeks to find those visual regions from images that encode the most discriminative visual information for image classification with the attention mechanism and enhance the classification performance.
Image classification applications. From the view of applications, image classification has been used in various areas, such as retrieval [29], surveillance/monitoring [30,31] and medical analysis [32]. In these areas, researchers focus on natural images and specific-domain images related to traffic, remote sensing, industry, medical imaging and so forth. A variety of issues have been explored in these image classification applications. For example, as an extension of traditional image classification, fine-grained image classification [29] aims to recognize subcategories under some basic-level categories, where the objects of different subcategories are both semantically and visually similar to each other. Image classification has also been used in visual concept learning to mine a large range of concepts from the vast resources of online data [33]. To accurately implement object detection, which is widely employed in surveillance applications, He et al. [34] presented a novel algorithm called the Mask R-CNN that can generate a high-quality segmentation mask and label for an object. Based on the Mask R-CNN, Wu et al. [1] introduced an improved Mask R-CNN to mitigate the performance deterioration when samples are reduced for object detection.

Framework
In the task of C-class image classification, the training data consist of image-label pairs, where each image is represented by a feature vector x_i ∈ ℝ^l and M_p clue proposals {p_i^1, …, p_i^{M_p}} are extracted from x_i, where each proposal p_i^m is represented by a feature vector p_i^m ∈ ℝ^l. Figure 2 illustrates the framework of the proposed ACENet, which includes four main processes: clue proposal extraction, visual representation, the clue-attention network and classification. ACENet first extracts multiple visual regions from an image and takes them as the clue proposals. A convolutional neural network is then employed to obtain the representations of both the original images and the clue proposals. To learn the clues that have important effects on recognizing the categories of images, an attention network is introduced to learn the distribution of attention over the clue proposals and, via the contrastive loss function, highlight those clue proposals that help improve the classification. Finally, the classification is performed based on the concatenation of the features derived from the image and the clue proposals.
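The four processes above can be sketched as a single forward pass; the function and variable names below are illustrative assumptions, with a placeholder attention function standing in for the clue-attention network described later:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def acenet_forward(x, P, W_cls, b_cls, attn_fn):
    """One forward pass of the framework: x is the global image feature
    (length l), P holds the M clue-proposal features (M x l), and attn_fn
    returns an attention distribution over the proposals."""
    alpha = attn_fn(x, P)                  # discrete attention over proposals
    clue_ctx = alpha @ P                   # attention-weighted clue context
    feat = np.concatenate([x, clue_ctx])   # concatenate global + local features
    logits = W_cls @ feat + b_cls
    return softmax(logits)

l, M, C = 16, 10, 6
x, P = rng.normal(size=l), rng.normal(size=(M, l))
W_cls, b_cls = rng.normal(size=(C, 2 * l)), np.zeros(C)
probs = acenet_forward(x, P, W_cls, b_cls,
                       lambda x, P: softmax(P @ x))  # placeholder attention
print(probs.shape, probs.sum())
```

The placeholder dot-product attention is only a stand-in; the actual clue-attention mechanism is specified in the attention section below.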

Clue proposal extraction
In this work, the clue proposals are considered typical local regions in images that carry fine and rich visual information, such as important objects and backgrounds. We extract the clue proposals through the cooperation of Faster R-CNN [21] and selective search [35]. First, we implement the Faster R-CNN algorithm to produce M_r object region proposals {(p_{r,i}, c_{r,i})}_{i=1}^{M_r}, where p_{r,i} describes the ith proposal with its coordinates and c_{r,i} denotes the confidence (predicted probability) of this proposal containing an object. In this work, the Faster R-CNN is pre-trained on the MSCOCO dataset and is not fine-tuned. Although the datasets used for image classification in this work contain a proportion of categories that do not appear in MSCOCO, the extracted regions still cover many meaningful objects that can be employed as clue proposals. In the implementation, we empirically choose the 5 proposals with the highest probabilities from the Faster R-CNN. Second, we utilize the selective search algorithm, which groups regions in terms of texture, colour, size and overlap, to generate the remaining clue proposals, which mainly correspond to background regions.
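A minimal sketch of this proposal-selection step, assuming detector boxes and confidences are already available as arrays; the (x1, y1, x2, y2) box layout and the helper name are assumptions of this sketch:

```python
import numpy as np

def select_clue_proposals(rcnn_boxes, rcnn_scores, ss_boxes, m_rcnn=5, m_total=10):
    """Keep the m_rcnn highest-confidence detector boxes, then pad with
    selective-search regions up to m_total clue proposals per image."""
    order = np.argsort(rcnn_scores)[::-1][:m_rcnn]   # highest confidence first
    chosen = [tuple(b) for b in np.asarray(rcnn_boxes)[order]]
    for b in ss_boxes:
        if len(chosen) >= m_total:
            break
        if tuple(b) not in chosen:                   # skip duplicate regions
            chosen.append(tuple(b))
    return chosen

# Toy example: 7 detector boxes with confidences, plus selective-search regions.
rcnn_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (2, 2, 12, 12),
              (3, 3, 13, 13), (4, 4, 14, 14), (5, 5, 15, 15), (6, 6, 16, 16)]
rcnn_scores = [0.9, 0.8, 0.95, 0.7, 0.6, 0.5, 0.4]
ss_boxes = [(20 + k, 20, 40, 40) for k in range(8)]
clues = select_clue_proposals(rcnn_boxes, rcnn_scores, ss_boxes)
```

In the toy example the five highest-scoring detector boxes are kept and five selective-search regions fill the remaining slots, matching the 5 + 5 split used with M_p = 10.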

Attention-based visual clue extraction
The determination of visual clues for image classification depends on two factors: (1) a specific image and its category, and (2) the discriminability of clue proposals. For the first factor, we present a global-local attention mechanism between an image and the clue proposals extracted from it. Since the category label is unknown for a test image, we do not actually build an attention mechanism between the clue proposals and class labels. Regarding the discriminability, we consider that the true clues for a specific image should be similar to the clues in the images of the same category and dissimilar to the images from the other categories. First, we introduce the global-local attention mechanism between the original image and clue proposals to generate a discrete attention distribution over clue proposals. Generally, the true clues should be given more attention and lead to a high probability at the corresponding entries of the attention distribution. In general, an attention mechanism can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key [36]. The general attention mechanism can accordingly be formulated as follows:

Attention(Q, K, V) = softmax(sim(Q, K)) V,    (4)

where Q, K, V and Attention(⋅, ⋅, ⋅) refer to the query, key, value and output, respectively; and sim(⋅, ⋅) denotes a certain function measuring the correlation of queries and keys.
In this work, we have the representation of images x_i and the clue proposals {p_i^m}_{m=1}^{M_p}, which correspond to the query and key (here, the key and value are the same) in Equation (4), respectively; and our clue-attention model can be specified as follows:

a_i^m = w_a^T tanh(W_X x_i + W_P p_i^m),
α_i = softmax(a_i),

where a_i^m, corresponding to the function sim(⋅, ⋅) in Equation (4), encodes the correlation of x_i and p_i^m; α_i denotes the discrete attention distribution over all clue proposals for image x_i; W_X, W_P ∈ ℝ^{d×l} are the transformation matrices that map the visual representations into low-dimensional spaces; w_a ∈ ℝ^d; and a_i = (a_i^1, …, a_i^{M_p})^T. The clue context vector associated with the clue proposals is computed by

p̃_i = Σ_{m=1}^{M_p} α_i^m p_i^m,

where α_i^m is the mth entry of α_i, and p̃_i, corresponding to Attention(⋅, ⋅, ⋅) in Equation (4), denotes the output of the global-local attention. The final representation of image x_i for image classification is achieved by the concatenation of x_i and p̃_i. The global-local attention mechanism formulates the relationship between the global information (i.e. images) and the local information (i.e. clue proposals) extracted from it. We consider that the global-local attention is different from the following widely used attention mechanisms: (1) self-attention, which describes the importance of elements in images or sentences by measuring the relationships among these elements, and (2) encoder-decoder attention, which generally builds the relationship between the elements in outputs and inputs and has been widely used in language translation and image captioning. Encoder-decoder attention can be considered as the alignment of outputs and inputs.
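A numerical sketch of the global-local attention described above; the additive tanh scoring function is an assumption consistent with the stated dimensions of W_X, W_P and w_a, not necessarily the authors' exact form:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def global_local_attention(x, P, W_X, W_P, w_a):
    """Global-local attention between image feature x (length l) and clue
    proposals P (M x l). Scoring is additive-style (an assumption):
    a_m = w_a^T tanh(W_X x + W_P p_m), alpha = softmax(a), ctx = sum_m alpha_m p_m."""
    scores = np.array([w_a @ np.tanh(W_X @ x + W_P @ p) for p in P])
    alpha = softmax(scores)   # discrete attention distribution over proposals
    ctx = alpha @ P           # clue context vector (the attention output)
    return alpha, ctx

rng = np.random.default_rng(1)
l, d, M = 12, 8, 10
x, P = rng.normal(size=l), rng.normal(size=(M, l))
W_X, W_P, w_a = rng.normal(size=(d, l)), rng.normal(size=(d, l)), rng.normal(size=d)
alpha, ctx = global_local_attention(x, P, W_X, W_P, w_a)
```

The returned distribution alpha is what the contrastive loss below operates on, and ctx is concatenated with x for classification.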
Second, we introduce a contrastive loss defined over the clue-attention distribution, which encourages considerable attention to be paid to the true visual clues during the training of models. According to the aforementioned analysis of the discriminability of clue proposals, considerable attention should be paid to similar clue proposals in images belonging to the same category and little attention to similar clue proposals across different categories. Thus, the contrastive loss is defined in the following form:

L_ctr = −(1/T_p) Σ_{(i_1, i_2)} α_{i_1}^T W_{i_1 i_2} α_{i_2} + (1/T_n) Σ_{(j_1, j_2)} α_{j_1}^T W_{j_1 j_2} α_{j_2},    (5)

where i_1 and i_2 refer to two images belonging to the same category; j_1 and j_2 refer to two images from different categories; T_p and T_n denote the numbers of image pairs from the same category and from different categories, respectively; and W_{i_1 i_2} (W_{j_1 j_2}) is the similarity matrix of clue proposals from the i_1th (j_1th) image and the i_2th (j_2th) image. We define the (m_1, m_2)-entry of matrix W_{i_1 i_2} as follows:

W_{i_1 i_2}^{m_1 m_2} = (p_{i_1}^{m_1})^T p_{i_2}^{m_2} / (‖p_{i_1}^{m_1}‖_2 ‖p_{i_2}^{m_2}‖_2),    (6)

where ‖ ⋅ ‖_2 denotes the L2-norm. W_{j_1 j_2}^{m_1 m_2} can also be calculated as in Equation (6). Clearly, if two clue proposals from images of different categories are visually similar, the corresponding entry of W_{j_1 j_2} is also large. In this case, little attention will be given to them when we minimize the second term in Equation (5), and thus those visually different clue proposals are more likely to attract considerable attention. Consequently, minimizing the contrastive loss in Equation (5) encourages attention to be paid to the true clues.
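The contrastive term can be sketched numerically as follows, assuming the cosine form of the proposal-similarity matrix given in Equation (6); pair sampling and averaging are simplified relative to the full training procedure:

```python
import numpy as np

def cosine_sim_matrix(P1, P2):
    """W[m1, m2] = cosine similarity between proposal features (Equation (6))."""
    N1 = P1 / np.linalg.norm(P1, axis=1, keepdims=True)
    N2 = P2 / np.linalg.norm(P2, axis=1, keepdims=True)
    return N1 @ N2.T

def contrastive_loss(pos_pairs, neg_pairs):
    """pos_pairs / neg_pairs: lists of (alpha_a, P_a, alpha_b, P_b) tuples for
    same-category and cross-category image pairs, as in Equation (5)."""
    pos = sum(a1 @ cosine_sim_matrix(Pa, Pb) @ a2 for a1, Pa, a2, Pb in pos_pairs)
    neg = sum(a1 @ cosine_sim_matrix(Pa, Pb) @ a2 for a1, Pa, a2, Pb in neg_pairs)
    return -pos / max(len(pos_pairs), 1) + neg / max(len(neg_pairs), 1)

# Toy example: one positive pair of identical images whose attention is
# concentrated on matching proposals attains the minimum of the first term.
P = np.eye(3)                         # three orthogonal proposal features
a_sharp = np.array([1.0, 0.0, 0.0])   # attention concentrated on proposal 0
loss = contrastive_loss(pos_pairs=[(a_sharp, P, a_sharp, P)], neg_pairs=[])
```

With attention fully on a shared, identical proposal the positive term contributes −1, illustrating how the loss rewards attention on clue proposals that recur within a category.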
The final loss function is given by combining the cross-entropy loss and the contrastive loss:

L = L_cls + λ L_ctr,    (7)

where λ is a balance parameter. L_cls denotes the cross-entropy loss defined in the following form:

L_cls = −Σ_i y_i^T log ŷ_i,

where y_i and ŷ_i indicate the true label vector and the predicted label vector, respectively. In essence, the contrastive loss can be considered a regularizer in the regularization framework or, from the viewpoint of Bayesian learning, prior knowledge.
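A minimal sketch of the combined objective in Equation (7), using the λ = 10⁻⁴ value selected later in the experiments; the per-sample cross-entropy here is illustrative:

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """L_cls for one sample: y_true is one-hot, y_pred holds predicted
    probabilities; eps guards against log(0)."""
    return -float(y_true @ np.log(y_pred + eps))

def total_loss(l_cls, l_ctr, lam=1e-4):
    """Combined objective L = L_cls + lambda * L_ctr (Equation (7))."""
    return l_cls + lam * l_ctr

y = np.array([0.0, 1.0, 0.0])   # true class is index 1
p = np.array([0.1, 0.8, 0.1])   # predicted probabilities
print(total_loss(cross_entropy(y, p), l_ctr=0.5))
```

With the small λ the contrastive term acts as a gentle regularizer rather than dominating the classification objective, consistent with the observed L_ctr/L_cls ratio reported in the experiments.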

EXPERIMENTAL RESULTS AND ANALYSIS
This section introduces the experimental details, including the datasets, implementation details and main results. In addition, we compare our approach with multiple representative models for image classification.

Datasets
We conduct the experiments and test the performance of ACENet on a small dataset we built and the public ImageNet2012 dataset. The small dataset, called the negative web image (NWI) dataset, is crawled from social media websites and consists of data that are undesirable and sometimes harmful to young people. The NWI dataset has a total of 10,500 images, comprising approximately equal numbers of images over six negative categories (pornography, luxury watches, luxury cars, luxury bags, cash and jewellery) and a category of natural images.
In the experiment, the dataset is evenly divided into two parts for training and testing. ImageNet2012 comprises 1.28 million training images and 50K validation images from 1000 categories, and has been widely used in the image classification task. We train the networks on the training set and report the performance on the validation set for ImageNet2012. In the training, we follow standard practices and perform data augmentation with random cropping using scale and aspect ratio to a size of 224 × 224 pixels (or 299 × 299 for the models associated with Inception-ResNet-v2) and perform random horizontal flipping. When evaluating the models we apply centre-cropping so that 224 × 224 pixels are cropped from each image, after its shorter edge is first resized to 256 (299 × 299 from each image whose shorter edge is first resized to 352 for the models associated with Inception-ResNet-v2).
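The evaluation-time resizing and centre-cropping arithmetic can be sketched as follows (a plain computation of the crop geometry, independent of any image library):

```python
def eval_crop_box(h, w, resize_short=256, crop=224):
    """Resize an h x w image so the shorter edge equals resize_short (keeping
    aspect ratio), then return the resized size and the centre-crop box as
    (top, left, height, width)."""
    scale = resize_short / min(h, w)
    nh, nw = round(h * scale), round(w * scale)
    top, left = (nh - crop) // 2, (nw - crop) // 2
    return (nh, nw), (top, left, crop, crop)

# A 480 x 640 landscape image: shorter edge 480 -> 256, then a centred 224 crop.
size, box = eval_crop_box(480, 640)
print(size, box)
```

The same arithmetic with resize_short=352 and crop=299 covers the Inception-ResNet-v2 setting described above.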

Implementation details
For each image, we extract 10 visual regions as clue proposals, that is, M_p = 10. We employ the backbones reported in Table 1, including VGG-16, ResNet-50, ResNet-101, Inception-ResNet-v2 and EfficientNet-B7, for classification on ImageNet and NWI. In the training process, we use the stochastic gradient descent (SGD) optimizer with a momentum of 0.9 and a minibatch size of 256 for NWI and 1024 for ImageNet. To reduce the computational load, we randomly choose a fixed number of image pairs in each minibatch when computing the contrastive loss shown in Equation (5). Empirically, we choose T_p = T_n = 4096 for ImageNet and 1024 for NWI. If the number of image pairs belonging to the same category or to different categories in a minibatch is less than the fixed amount, we choose all such image pairs for the computation of the contrastive loss. All the experiments are conducted on a platform with 8 Nvidia Titan V GPUs using PyTorch.
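The per-minibatch pair-sampling rule can be sketched as follows; the helper name and the uniform sampling are assumptions of this sketch:

```python
import numpy as np

def sample_pairs(labels, t_p, t_n, rng):
    """Draw up to t_p same-category and t_n cross-category index pairs from one
    minibatch; when fewer pairs exist, all of them are used."""
    lab = np.asarray(labels)
    n = len(lab)
    all_pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    pos = [p for p in all_pairs if lab[p[0]] == lab[p[1]]]
    neg = [p for p in all_pairs if lab[p[0]] != lab[p[1]]]

    def take(pairs, t):
        if len(pairs) <= t:
            return pairs                         # fewer pairs exist: use them all
        picks = rng.choice(len(pairs), size=t, replace=False)
        return [pairs[k] for k in picks]

    return take(pos, t_p), take(neg, t_n)

# Toy minibatch of 5 images over 3 categories.
labels = [0, 0, 1, 1, 2]
pos, neg = sample_pairs(labels, t_p=3, t_n=3, rng=np.random.default_rng(0))
```

Here only two same-category pairs exist, so both are kept, while three of the eight cross-category pairs are sampled, mirroring the fallback rule described above.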

Experimental results on the NWI dataset

Table 1 shows the classification results of our approach and the compared models for the NWI dataset. From this table, we observe that most of the methods employed here achieve high performance on this small dataset. Compared with the baselines and the state-of-the-art methods, our ACENet achieves the lowest error rates. For example, ACENet obtains a 2.16% top-1 error when using Inception-ResNet-v2 as the backbone, exceeding SENet (2.91%) by 0.75%. ACENet also works well with other backbones: it decreases the top-1 error by 1.48%, 1.22% and 0.88% compared with SENet when using the basic VGG-16, ResNet-50 and ResNet-101 models, respectively. Compared with BoTNet-S1-128, ACENet with the EfficientNet-B7 backbone decreases the top-1 error by 0.22%. Figure 3 illustrates examples of visual clue extraction for the six negative categories, where the visual clue proposals given the most attention (i.e. corresponding to the largest α_i^m for the ith image) are marked by red rectangles. For the NWI dataset, we observe that the category of an image is dominantly determined by the important visual objects in it, and thus visual clue extraction is effective for improving the classification performance.

Experimental results on the ImageNet dataset

Figure 4 shows the change in the top-1 accuracy (i.e. 1 − top-1 error) with the balance parameter λ in Equation (7) for ACENet on the validation set of ImageNet2012. It is observed that the performance reaches a peak at λ = 10⁻⁴ and then begins to degrade. For all the experiments in this work, we choose λ = 10⁻⁴. In this case, we find that L_ctr/L_cls ≈ 0.12 when the training stops, which can be empirically considered reasonable.
We report the classification results of ACENet, the baselines and state-of-the-art models for ImageNet in Table 1. From the table, we observe that our ACENet model largely decreases the error rates relative to the corresponding baselines, including the ResNet-50 baseline (44.7M). As shown in Table 1, ACENet with the EfficientNet-B7 backbone achieves a decrease of 0.83% in terms of the top-1 error rate compared with BoTNet-S1-128 although it uses fewer parameters.
In Table 2, we show the results of the ablation study for three configurations: (1) removing the clue-attention network and contrastive loss (i.e. performing classification based on the combination of original images and clue proposals using backbones), (2) keeping only the clue-attention network, and (3) keeping both the clue-attention network and contrastive loss (i.e. the proposed ACENet). The three configurations correspond to the three rows for each backbone in the table. Compared with the baselines in Table 1 that employ only the global image information, we observe that the introduction of clue proposals to the classification (i.e. configuration 1) can enhance the performance. From Table 2, we find that the clue-attention network and contrastive loss can improve the image classification performance. For example, for the backbone ResNet-101, the top-1 error decreases from 23.27% to 22.58% when introducing the clue-attention network only and reaches 22.09% if we further consider the contrastive loss term. The results show that both clue attention and the contrastive loss play important roles in improving the classification performance by strengthening discriminative features and suppressing common visual information across the different categories. In Figure 5, we show the change in the top-1 error rates in the training and testing of our approach and the ResNet-50 and ResNet-101 baselines. From the figure, we can observe that ACENet achieves lower error rates in training and testing processes than the baselines. Figure 6 illustrates the change in clue-attention distribution during training, where the height of a bar reflects the degree of attention given to a clue proposal in an epoch. 
We observe that, as the training epochs increase, the attention distribution changes and converges to a state in which the clue proposals that look more discriminative are given more attention, e.g. 'cp10' in Figure 6, which shows the change in attention distribution during training for an image of the 'toy terrier' category, where cp1 to cp10 denote the 10 clue proposals extracted from the original image. Clearly, this clue proposal seems to possess good discriminability for classifying the image into the 'toy terrier' category. In Figure 7, we show an example illustration of extracting clues for several categories in the image classification task. The red, green and blue bounding boxes shown in each image represent the top three clue proposals in descending order of attention. From the figure, we observe that these clue proposals, especially the red ones, possess good discriminability in determining the category to which an image belongs. For example, as shown in the second image of the category 'red wolf', the top clue proposal (red) accurately corresponds to the region of a red wolf and excludes the background and other objects independent of this category.

CONCLUSION
This paper proposes an attention-based visual clue extraction approach that seeks to determine what clues encode the category information and can help improve the image classification performance. In this approach, we first construct a clue-attention mechanism called global-local attention between an image and its clue proposals, which permits the clue proposals to make varying contributions to image classification. Then, we introduce a contrastive loss in the training to encourage considerable attention to be devoted to true visual clues that possess discriminative information. The experimental results show that the proposed approach can effectively extract the visual clues that reflect the category information, and can improve the classification performance compared with the baselines and state-of-the-art methods. Visual clue extraction is an important issue in image classification that can make the results more interpretable, and thus it can be used in scenarios requiring high interpretability, such as medical image classification and negative web image recognition. In addition, the current work in the paper is performed on the standard NWI and ImageNet datasets. Actually, there is a considerable amount of data with noisy labels or without labels on the Internet that would be useful for enhancing the image classification performance. In future work, we will incorporate such non-standard data into ACENet under the weakly-supervised or semi-supervised learning paradigm for online intelligence applications.