Hierarchical bilinear convolutional neural network for image classification

Image classification is one of the mainstream tasks of computer vision. However, most existing methods use labels of the same granularity level for training. This ignores the category hierarchy that may help to differentiate visual objects better. Embedding hierarchical information into convolutional neural networks (CNNs) can effectively regularise the semantic space and thus reduce the ambiguity of prediction. To this end, a multi-task learning framework, named Hierarchical Bilinear Convolutional Neural Network (HB-CNN), is developed by seamlessly integrating CNNs with multi-task learning over hierarchical visual concept structures. Specifically, labels with a tree structure are used as supervision to hierarchically train multiple branch networks. In this way, the model can not only learn additional information (e.g. context information) in the coarse-level category features, but also focus the learned fine-level category features on the object properties. To smoothly pass hierarchical conceptual information and encourage feature reuse, a connectivity pattern is proposed to connect features at different levels. Furthermore, a bilinear module is embedded to generalise various orderless texture feature descriptors so that our model can capture more discriminative features. The proposed method is extensively evaluated on the CIFAR-10, CIFAR-100, and 'Orchid' Plant image sets. The experimental results show the effectiveness and superiority of our method.


| INTRODUCTION
Convolutional Neural Networks (CNNs) have demonstrated outstanding capability on the task of image classification and achieved impressive performance [1][2][3][4][5]. However, most state-of-the-art methods share a common defect: the training labels lose the hierarchical structure information of the categories since they are all at the same level of granularity. Ignoring the hierarchical semantic relations implied in an object results in losing useful discriminative information for the classification task.
Psychologists suggest that the human visual system has a hierarchical structure, and that it acquires basic-level visual classification ability (e.g. dog) before fine visual classification ability (e.g. Pomeranian) [6]. A typical hierarchical classification scenario: when identifying a snake, humans commonly follow a coarse-to-fine flow 'animal → serpentiformes → snake'. One big advantage of hierarchical classification is that errors can be restricted to a subcategory. This hierarchy improves classification performance for three reasons: (1) The hierarchical structure encodes rich context information across different levels. For instance, a calamondin orange may not be easy to differentiate from a yellow ping pong ball. However, the yellow ping pong ball belongs to sports equipment while the calamondin orange belongs to fruits, so it is easy to tell them apart because the environments (context information) of fruits and sports equipment are so different. (2) The hierarchy can capture multi-granularity semantic concepts, thus enriching the object features. For instance, a nereid may not be easy to differentiate from a centipede. In biology, the nereid (fine-level category label) belongs to the annelids (coarse-level category label) and the centipede (fine-level category label) belongs to the arthropods (coarse-level category label). The biggest difference between the two is that annelids exhibit homonomous metamerism (each body segment is basically the same) while arthropods exhibit heteronomous metamerism (the body segments are distinctly different). Specific distinguishing features (homonomous vs. heteronomous metamerism) that are ignored by the fine-level category labels (nereid vs. centipede) can be captured from the coarse-level category labels (annelid vs. arthropod).
(3) Hierarchical classification is a multi-task learning method, which can effectively learn coarse-to-fine feature representations by jointly optimising the coarse category classifier and the fine category classifier. In order to better simulate the human visual system, machines should have the ability to attend to more specific information at different levels of granularity. An intuitive approach is to organise the network structure hierarchically according to the divide-and-conquer strategy.
Herein, a multi-task learning framework, named Hierarchical Bilinear Convolutional Neural Network (HB-CNN), is developed by seamlessly integrating CNNs with multi-task learning over hierarchical visual concept structures. Specifically, it contains multiple branch networks along the main convolution workflow (e.g. VGG16 [2]) that make predictions hierarchically. In this way, the model can learn coarse-to-fine concepts at the output stage. To smoothly pass hierarchical conceptual information and encourage feature reuse, a connectivity pattern is proposed to connect the features of the multiple output branches. Furthermore, we embed the bilinear module [7] into this special CNN to better model subtle visual differences, thereby improving classification performance. In addition, inspired by the work in Ref. [8], we use a hierarchical training strategy that enhances model performance and mitigates the vanishing-gradient problem. This training strategy enables lower-level parameters to be activated and trained earlier than higher-level parameters. Our results on the CIFAR-10, CIFAR-100, and 'Orchid' Plant image sets suggest that the proposed method outperforms the corresponding CNN-based baselines.
In Section 2, we briefly review related work on image classification. We then describe the proposed HB-CNN method in Section 3. Experimental results and analysis are presented in Section 4. Finally, we summarise this work in Section 5.

| RELATED WORK
Convolutional neural networks (CNNs) have recently achieved great success in image recognition, and many methods have been proposed to improve classification performance. For example, a bilinear model [7] was proposed to compute pairwise feature interactions with two independent subnetworks. Branson et al. [9] introduced the combination of high- and low-level features to learn more discriminative representations. Wang et al. [10] showed that mid-level representation learning can be enhanced with CNNs; therefore, they proposed discriminative mid-level patches with a CNN to improve classification accuracy. In recent years, visual attention models have been widely applied in computer vision, and various studies have successfully applied them to image classification. Zheng et al. [11] proposed a multi-attention convolutional neural network that learns discriminative part localisation and fine-grained feature representations for image classification. Besides, Liu et al. [12] introduced part-based attributes to guide the learning of more discriminative features for image classification. However, these methods primarily focus on the classification of categories at one particular level, and overlook the hierarchical structure information of the categories.
Embedding the hierarchy of object categories into image recognition is an interesting research direction. An inherent hierarchical data structure can often be seen in image datasets, where the images are at a subordinate level, and the human visual system is already capable of basic-level visual classification before fine visual classification. It is therefore in line with the human way of thinking for a computer to learn the embedded hierarchy of object categories for image recognition. Srivastava et al. [13] proposed a method that augments standard neural networks with tree-based priors over the classification parameters. They exploited the class hierarchy to transfer knowledge among similar classes for transfer learning. Although [13] achieves good performance, it does not use the hierarchical nature within the layers. Yan et al. [14] introduced hierarchical deep CNNs (HD-CNN) by embedding deep CNNs into a two-level category hierarchy. Specifically, the HD-CNN uses a coarse category classifier to separate easy categories, and a fine category classifier to separate difficult categories. One limitation of the HD-CNN is that it requires the coarse and fine category components to be pre-trained, which is quite time-consuming. Zhu et al. [8] proposed a branch convolutional neural network for image recognition. Similar to our approach, they use a hierarchical label tree to train multiple branch networks. Although this approach achieves good results, it does not combine the specific information at different levels, which is not sufficient to represent the object. Liu et al. [15] built a Confusion Visual Tree (CVT) based on confused semantic-level information to identify confused categories, which allows paying more attention to confused categories and thereby effectively boosts the classification accuracy. Chen et al. [16] built a new Hierarchical Semantic Embedding (HSE) network. It predicts a score vector for each level, and leverages the predicted score vector of the coarse level to guide the learning of fine-level feature representations. Some works [17][18][19][20][21][22][23][24] also adopt class hierarchies to combine different models for image classification. Although these existing approaches have achieved substantial improvements on image classification, one potential problem is that they find it difficult to capture more specific information (e.g. context information) at different levels of granularity, which may seriously limit their classification accuracy.
Building on the above works, our HB-CNN integrates the category hierarchy into CNNs to progressively regularise label prediction and guide representation learning. In addition, it exploits a connectivity pattern to smoothly pass hierarchical conceptual information, and embeds a bilinear module to model subtle visual differences.

| ARCHITECTURE OF THE PROPOSED HB-CNN
In this section, we first review Bilinear Convolutional Neural Network (B-CNN). Then, we introduce the Hierarchical Bilinear Convolutional Neural Network (HB-CNN) architecture. Finally, we discuss possible training strategies of HB-CNN.

| Bilinear convolutional neural network
The bilinear convolutional neural network (B-CNN) was proposed by Lin et al. [7]. B-CNN represents an image as a pooled outer product of features derived from two feature extractors. It can model local pairwise feature interactions, and generalises various orderless texture feature descriptors. The overall architecture of B-CNN is illustrated in Figure 1. The B-CNN model is expressed by the following equation:

B = (F_1, F_2, P, C)    (1)

Here, F_1 and F_2 represent the feature functions, P is a pooling function, and C denotes a classification function. The feature outputs are combined at each location using the matrix outer product. The bilinear combination of F_1 and F_2 at a location l can be expressed as:

bilinear(l, I, F_1, F_2) = F_1(l, I)^T F_2(l, I)    (2)

Then, sum pooling aggregates the bilinear combinations of features across all locations (L represents the set of locations) in the image to obtain a global image representation Φ(I):

Φ(I) = Σ_{l ∈ L} bilinear(l, I, F_1, F_2)    (3)
The resulting bilinear vector x = Φ(I) is then passed through a signed square-root step (y = sign(x)·√|x|), followed by l2 normalisation (z = y/‖y‖_2). More details of the B-CNN can be found in Ref. [7].
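As an illustration, the bilinear pooling pipeline above (location-wise outer product, sum pooling, signed square root, l2 normalisation) can be sketched in NumPy. This is a minimal sketch, not the authors' implementation; the feature shapes are illustrative assumptions.

```python
import numpy as np

def bilinear_pool(f1, f2):
    """Sketch of B-CNN bilinear pooling.

    f1, f2: feature maps of shape (L, C1) and (L, C2), where L is the
    number of spatial locations and C1/C2 are channel dimensions
    (shapes are assumptions for illustration).
    """
    # Sum of per-location outer products: Phi(I) = sum_l f1(l)^T f2(l).
    # The single matrix product below is equivalent to that sum.
    phi = f1.T @ f2                      # shape (C1, C2)
    x = phi.reshape(-1)                  # flatten to the bilinear vector
    # Signed square root: y = sign(x) * sqrt(|x|)
    y = np.sign(x) * np.sqrt(np.abs(x))
    # l2 normalisation: z = y / ||y||_2
    z = y / (np.linalg.norm(y) + 1e-12)
    return z
```

The signed square root and l2 normalisation steps mirror the post-processing described in the text, so the returned vector always has (approximately) unit norm.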

| Hierarchical Bilinear Convolutional Neural Network
The overall architecture of Hierarchical Bilinear Convolutional Neural Network (HB-CNN) is illustrated in Figure 2 and a corresponding label tree is in Figure 3. The label tree can usually be generated by unsupervised methods [25] or manually constructed based on visual similarity.
The HB-CNN constructs a network with internal output branches on an existing CNN (e.g. VGG16 [2]). Each level in the label tree of the data is captured by at least one branch, and each branch outputs a prediction at the corresponding level in the label tree. To smoothly pass hierarchical conceptual information and embed multi-granularity semantic features, a connectivity pattern is proposed to connect the features of the multiple output branches. On top of the last branch, a bilinear module [7] is embedded, which can model subtle visual differences to support the recognition of visually similar objects.

| Connectivity pattern
To smoothly pass hierarchical conceptual information and encourage feature reuse, we propose a connectivity pattern (CP): for each output branch layer, the feature-maps of the preceding branch layer and its own feature-maps are used as input to the subsequent branch layer. This connectivity pattern is illustrated in Figure 2. Consequently, the feature-maps received by the k-th level branch are expressed as follows:

x_k = Concat(FC(b_k), x_{k−1})    (4)

where x_{k−1} refers to the concatenation of the feature-maps produced in branch layer k − 1, b_k represents the network output of the k-th block (see Figure 2), and FC is the fully connected layer. We define Concat(⋅) as a concatenation operation. In recent years, some works [26, 27] have also used ideas similar to the connectivity pattern, but they only pass low-level features to the high level. Different from these methods, the proposed connectivity pattern passes coarse-grained features to fine-grained features, thereby enhancing the distinguishability of the features. For instance, for a three-branch HB-CNN, an image of a ship carries the hierarchical label [transport, water, ship]. When the image is fed into the HB-CNN, the network first learns the information at the transport, water and ship levels. Then, the connectivity pattern is used to integrate this multi-granularity information to boost the classification accuracy.
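The connectivity pattern above can be sketched as follows. This is a minimal illustration, assuming each branch applies a single fully connected layer with ReLU to its block output and that features are 1-D vectors; the shapes and activation are our assumptions, not the paper's exact configuration.

```python
import numpy as np

def branch_features(block_outputs, fc_weights):
    """Sketch of the connectivity pattern: each branch concatenates its
    own FC features with the features accumulated from all preceding
    branches, i.e. x_k = Concat(FC(b_k), x_{k-1}).

    block_outputs: list of per-block feature vectors b_k (assumed 1-D).
    fc_weights:    list of FC weight matrices, one per branch (assumed).
    """
    x = None
    features = []
    for b, W in zip(block_outputs, fc_weights):
        fc = np.maximum(b @ W, 0.0)      # FC layer + ReLU (assumed)
        # Coarse-level features flow forward into every finer branch.
        x = fc if x is None else np.concatenate([x, fc], axis=-1)
        features.append(x)
    return features
```

Each entry of the returned list is the input feature vector for one branch classifier; the feature dimension grows level by level as coarse-grained features accumulate.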

| Network details
We implement our framework based on the VGG16 [2].
The HB-CNN model uses VGG16 as the base network to construct a network with internal output branches. The exact network configurations are shown in Table 1. Specifically, we implement block 1 with the first two convolutional layers of VGG16 and a pooling layer, and on top of the first branch we build an additional fully connected layer and a softmax layer to produce a coarse prediction. The following blocks are structured similarly to block 1. It is worth noting that after block 5, we embed the bilinear module (see Table 1). Note that the HB-CNN with the bilinear module (B-CNN [7]) removed is called the Hierarchical Convolutional Neural Network (H-CNN).

| Loss Function
Cross-entropy loss is used as the loss of each branch of HB-CNN. The cross-entropy loss of the i-th sample on the k-th level in the label tree is expressed by the following equation:

L_i^k = −log( e^{f_{y_i}} / Σ_j e^{f_j} )    (5)

where f_j is the j-th element of the class-score vector f from the last layer of the model, and y_i is the ground-truth class of the i-th sample on the k-th level. The final loss is defined in Equation (6):

L = Σ_{k=1}^{K} W_k L^k    (6)

where K represents the number of levels in the label tree, and W_k is the loss weight of the k-th level. The loss function takes into account the loss of each branch layer, ensuring that the structural priors play an internal guiding role for the entire model and making it easier for gradients to flow back to the shallow layers.

| Training strategies
The value W_k in Equation (6) defines how much contribution a level makes to the final loss. We use a hierarchical training strategy to jointly optimise the multiple losses, and each branch uses different supervisory information to train the network. The supervision labels of each branch are determined according to the label tree structure of the dataset. There are three different patterns of this hierarchical training strategy. (1) Branch-by-branch training strategy (BBT-strategy): we use the parameters trained by the previous branch as the initialisation parameters, and then use the supervision labels to fine-tune the module of the latter branch. For example, for a three-branch HB-CNN structure, the loss weights are first set to [1, 0, 0], changed to [0, 1, 0] after 50 epochs, and assigned as [0, 0, 1] after epoch 70. This training strategy allows each branch to be fully trained, but ignores the connection of information between tasks. (2) Multi-branch joint training strategy (MBJ-strategy): according to the importance of each task, a fixed weight is assigned to each loss, and all branches are trained simultaneously. For example, for a three-branch HB-CNN structure, the loss weights are set to [0.1, 0.1, 0.8]. This training strategy can be regarded as a multi-task learning strategy, which jointly trains the related classifiers to enhance object discrimination power. (3) Multi-branch dynamic joint training strategy (MBDJ-strategy): this training strategy is similar to the multi-branch joint training strategy except that it dynamically modifies the loss weight distribution while training the HB-CNN. For example, for a three-branch HB-CNN structure, the loss weights are first set to [0.8, 0.1, 0.1], changed to [0.1, 0.8, 0.1] after 50 epochs, and assigned as [0.1, 0.1, 0.8] after epoch 70.
Throughout the training process, this shifts the loss weight distribution from the coarse level to the fine level and guides the HB-CNN model to devote more effort to learning a particular level at different training stages, thus paying attention to more subtle regions.
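The three loss-weight schedules can be expressed as simple functions of the training epoch. The sketch below uses the three-branch examples and epoch boundaries (50 and 70) given in the text:

```python
def bbt_weights(epoch):
    """BBT-strategy: train one branch at a time."""
    if epoch < 50:
        return [1.0, 0.0, 0.0]
    elif epoch < 70:
        return [0.0, 1.0, 0.0]
    return [0.0, 0.0, 1.0]

def mbj_weights(epoch):
    """MBJ-strategy: fixed weights favouring the fine level."""
    return [0.1, 0.1, 0.8]

def mbdj_weights(epoch):
    """MBDJ-strategy: shift emphasis from coarse to fine during training."""
    if epoch < 50:
        return [0.8, 0.1, 0.1]
    elif epoch < 70:
        return [0.1, 0.8, 0.1]
    return [0.1, 0.1, 0.8]
```

At each epoch the chosen schedule supplies the weights W_k for the final loss, so switching strategies only changes this one function.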

| EXPERIMENTAL RESULTS AND ANALYSIS
This section describes our experimental results and analysis for algorithm evaluation over multiple image sets.
Implementation Details: We utilise VGG16 as the base network, initialise the parameters with parameters pre-trained on ImageNet, and then fine-tune it on CIFAR-10, CIFAR-100, and the 'Orchid' Plant image set. The HB-CNN is constructed based on the classical VGG16 (second column in Table 1) and its corresponding branch networks (third and fourth columns in Table 1). For network training, we use the stochastic gradient descent (SGD) optimiser with momentum 0.9. In all experiments, we step-wise decompose our approach to reveal the effect of each component.

| CIFAR-10
The CIFAR-10 dataset [23] contains 10 classes of 32 × 32 RGB images. The 60,000 images are divided into training and test sets of 50,000 and 10,000 images, respectively. For the CIFAR-10 dataset, we manually construct a label tree (see Figure 3).
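A label tree of this kind can be represented as a simple mapping from each fine class to its coarse ancestors. In the fragment below, only the ship path [transport, water, ship] comes from the text; the other entries are illustrative assumptions, not the paper's actual Figure 3 tree.

```python
# Hypothetical fragment of a CIFAR-10 label tree. Only the ship path
# [transport, water, ship] appears in the text; the other rows are
# illustrative assumptions.
LABEL_TREE = {
    "ship":       ("transport", "water"),
    "automobile": ("transport", "land"),
    "dog":        ("animal", "land"),
}

def hierarchical_labels(fine_label):
    """Return the coarse-to-fine label path used to supervise each branch."""
    return list(LABEL_TREE[fine_label]) + [fine_label]
```

Each element of the returned path supervises the branch at the corresponding level of the HB-CNN.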

| CIFAR-100
The CIFAR-100 dataset [28] has 100 classes containing 600 images each. The 100 classes are divided into 20 superclasses, which we manually grouped into eight coarser classes to provide richer hierarchical information.

| Orchid Plant Image Set
We crawled images covering 51 plant species (object classes) in the 'Orchid' family, of which 32,064 plant images are used for training and the other 7894 images are used for testing. Figure 4 shows some examples from the 'Orchid' Plant image set. The coarse-grained level corresponds to the genus of 'Orchid', which contains eight classes. The fine-grained level is the target level corresponding to the 'Orchid' species, containing 51 target classes. The category information contained in the 'Orchid' Plant image set is shown in Table 2.

| HB-CNN on CIFAR-10 and CIFAR-100
The HB-CNN model is constructed with VGG16 as the baseline model and adds five additional branches to output predictions (see Table 1). In Table 1, A_1, A_2, A_3, A_4 and A_5 refer to the loss weights of the five output branches, ordered from the coarsest to the finest level.

| Comparison of Training Strategies
In the experiments, we use three different strategies to verify the advantages of hierarchical training. Table 3 shows the loss weight distributions. It can be seen from Table 3 that the hierarchical training strategy (MBDJ-strategy) can effectively boost the classification accuracy. One possible reason is that it can be regarded as a multi-task learning strategy, which jointly trains the related classifiers to enhance object discrimination power. Further, it guides the H-CNN model to devote more effort to learning a particular level at different training stages, thus paying attention to more subtle regions.

| Justifications of Number of Branches
In order to study the effect of the branch number on classification accuracy, we compared the three-branch H-CNN structure with the five-branch H-CNN structure. The three-branch H-CNN is constructed based on VGG16 (second column in Table 1) and its corresponding branch networks (fourth column in Table 1). Figures 5 and 6 show the results on CIFAR-10 and CIFAR-100, respectively. Compared with the three-branch H-CNN, the five-branch H-CNN can significantly improve the classification accuracy.

The performance of the models trained on CIFAR-10 is shown in Table 4. The baseline model (VGG16) achieves 88.11% accuracy while its corresponding H-CNN reaches 89.15%. This confirms that using multi-granularity semantic concepts can effectively regularise the semantic space and provide extra guidance to focus on more subtle regions. The B-CNN improves the performance from 88.11% to 88.40%, showing that B-CNN can learn more discriminative representations. The connectivity pattern boosts the classification performance by a margin of 0.37% (from 88.78% to 89.15%). A possible reason is that the connectivity pattern embeds multi-granularity semantic features, which enables the network to capture more differentiated information at different levels. Furthermore, the experimental results show that HB-CNN can further improve the classification performance. It is worth noting that we designed two network structures, H-CNN and HB-CNN, where HB-CNN is the combination of H-CNN and B-CNN [7]. As Table 4 shows, both the proposed HB-CNN and H-CNN achieve better classification performance than B-CNN [7].
The training procedure on CIFAR-100 is very similar to that on CIFAR-10 except for the learning rate. The learning rates of all models trained on CIFAR-100 are initialised to 0.001 and drop to 0.0002 at epoch 60 and to 0.00005 after epoch 70. It can be seen from the experimental results that our proposed method shows similar properties on CIFAR-10 and CIFAR-100. Figure 7 shows some examples where the H-CNN model predicts correctly but VGG16 predicts incorrectly. These results further verify that our method can effectively distinguish visually similar objects. Figure 8 shows some examples of H-CNN prediction errors on the CIFAR-100 dataset. One possible reason is that these images are not only very similar but also very blurry, so that even the human visual system finds them difficult to distinguish, which increases the difficulty of classification for H-CNN.

| HB-CNN on 'Orchid' Plant Image Set
The training procedure on 'Orchid' Plant image set is very similar to the ones on CIFAR-10. Note that we resize the image to 224 � 224 pixels before passing it through the network.
Here, we evaluate the recognition of the categories at the finest level (10 classes on CIFAR-10, 100 classes on CIFAR-100 and 51 subcategories on the 'Orchid' Plant image set), as existing methods primarily report their results at this level. The classification results are quantified and compared in Table 5. With the exception of the results on the 'Orchid' Plant image set, the results of all competing methods are as originally published. Note that the authors of VGG16 and B-CNN did not report results on the CIFAR-10, CIFAR-100 and 'Orchid' Plant image sets, so these were calculated by us. As shown in Table 5, our method obtains classification accuracies of 91.75%, 66.03% and 91.10% on the CIFAR-10, CIFAR-100 and 'Orchid' Plant image sets respectively, outperforming all the previous approaches. This demonstrates that seamlessly integrating CNNs with multi-task learning over hierarchical visual concept structures is superior to the other methods. It is worth noting that our proposed method is inspired by [8] yet achieves better recognition accuracy than [8]. The reason may come from the following three aspects: (1) we use more branches, which can introduce more coarse-grained features into the fine-grained features to help image classification; (2) the proposed connectivity pattern can smoothly pass hierarchical conceptual information and encourage feature reuse; and (3) the embedded bilinear module can capture more distinguishing features to reduce the ambiguity of predictions.
To verify the generality and effectiveness of our proposed method, we also use ResNet-101 as the base network. The HB-CNN (ResNet-101) is constructed based on the classical ResNet-101 and its corresponding branch networks. As shown in Table 5, our proposed method improves the recognition accuracy from 90.61% to 92.85%, 67.61% to 69.23% and 91.68% to 94.29% on CIFAR-10, CIFAR-100 and 'Orchid', respectively.
• HD-CNN: uses a coarse category classifier to separate easy categories, and a fine category classifier to separate difficult categories. One limitation of the HD-CNN is that it requires the coarse and fine category components to be pre-trained, which is quite time-consuming. • Branch-CNN: uses a hierarchical label tree to train a branch network with multiple predictions. Although this approach achieves good results, it does not combine the specific information at different levels, which is not sufficient to represent the object.
• Coarse-to-fine CNN: Fu et al. [20] propose a coarse-to-fine layer by applying Bayesian techniques to the network, enabling the coarse-to-fine network to learn the hierarchical category tree. • Tree-CNN: Roy et al. [24] propose an adaptive hierarchical network structure composed of DCNNs that can grow and learn as new data becomes available.
• VI-CNN B: Liu et al. [15] built a Confusion Visual Tree (CVT) based on confused semantic-level information to identify confused categories, which allows paying more attention to the confused categories. • B-CNN: B-CNN [7] represents an image as a pooled outer product of features derived from two feature extractors.
• Highway Net: Srivastava et al. [26] propose an architecture that uses a learned gating mechanism to regulate information flow. • Capsule Net: in capsule networks, a capsule is a vector whose length represents the probability that an entity exists, and whose orientation denotes the properties of the entity [27].
• Autonomous: Ma et al. [28] built a genetic DCNN designer, which can generate a DCNN architecture automatically based on the data available for a specific image classification problem.

Figure 9 shows the influence of the different models on the classification accuracy across the three datasets. The following conclusions can be drawn from the experimental results. First, the performance of the H-CNN model is always better than that of the corresponding baseline model, which strongly suggests that the hierarchical nature of a CNN can be connected to the structural priors of the object classes to enhance the classification performance. Second, the hierarchical training strategy significantly improves the classification accuracy, which confirms that it is very useful to activate the shallow layers of a CNN model by first learning low-level features with the coarse-level labels. Interestingly, one can easily observe that the B-CNN achieves a significant improvement on the 'Orchid' Plant image set (see Figure 9), while on the other datasets (CIFAR-10, CIFAR-100) the improvements are negligible. One possible reason is that the image size in those datasets is very small, so the bilinear module cannot sufficiently generate orderless texture features. Finally, the seamless integration of H-CNN and the bilinear module can not only relate the hierarchy of the CNN to the structural priors of the object classes, but also extract more discriminative information about visually similar objects, thus significantly improving the classification accuracy.

| CONCLUSIONS
Herein, a multi-task learning framework called Hierarchical Bilinear Convolutional Neural Network (HB-CNN) is developed by seamlessly integrating CNNs with multi-task learning over hierarchical visual concept structures. HB-CNN can integrate the hierarchy of CNNs with the structural priors of the object classes to strengthen classification ability, and it can also leverage the bilinear module to extract more discriminative information. Furthermore, we introduce a hierarchical training strategy which enables the HB-CNN model to utilise the label tree as an internal guide and boosts performance significantly. Our experiments on three image sets demonstrate that our proposed approach outperforms the corresponding baseline CNNs.