Deep Learning to Ternary Hash Codes by Continuation

Recently, it has been observed that {0,1,-1}-ternary codes, which are simply generated from deep features by hard thresholding, tend to outperform {-1,1}-binary codes in image retrieval. To obtain better ternary codes, we propose, for the first time, to jointly learn the features and the codes by appending a smoothed function to the network. During training, the function evolves into a non-smoothed ternary function through a continuation method. The method circumvents the difficulty of directly training discrete functions and reduces the quantization errors of ternary codes. Experiments show that the generated codes indeed achieve higher retrieval accuracy.

Introduction: Existing hashing methods mainly adopt binary codes for image retrieval. The codes are generated by binarizing features learned by data-independent or data-dependent methods [14]. Among these, data-driven deep learning methods tend to perform best [15,11,16,13], thanks to their powerful capability in generating discriminative features. However, the codes generated from deep features are not perfect: they are subject to performance ceilings, and even performance decline [10,2], as the code dimension increases. This is mainly because deep features tend to become sparse as their dimension grows, and their small, ambiguous elements with values close to zero are likely to cause large quantization errors and degraded feature discrimination [7,12] when they are coarsely binarized to the bipolar values +1 or -1. To address the issue, it is natural to introduce a third state 'zero' to denote the ambiguous elements, thus yielding {0, ±1}-ternary codes. Such codes have recently been shown to perform better than binary codes.
To the best of our knowledge, only two methods [6,12] have been proposed to generate ternary codes for image retrieval. However, both are suboptimal because they learn and ternarize the features in two separate steps. Specifically, in [6] the features are generated by an AlexNet [9] trained with a hinge-like loss, and then ternarized by two empirically selected thresholds. To obtain better thresholds, a search algorithm is proposed in [12], based on the principle of minimizing the expected pairwise ternary Hamming distance between similar samples while increasing the distance between dissimilar samples. No matter how well the thresholds are selected, features that were learned separately beforehand cannot be guaranteed to be optimal for the subsequent ternarization.
To alleviate the issue, we propose to jointly learn the features and the ternary codes. An intuitive idea is to append a ternary function to the feature extraction network and train the two as a whole. In this framework, the main challenge comes from optimizing the ternary function, whose gradient is zero almost everywhere and thus makes back-propagation infeasible. To avoid directly training the discrete function, inspired by [3], we replace it with a smoothed function that gradually evolves into the desired ternary function during training. This continuation method is known for optimizing discrete functions with guaranteed convergence [1]. Experiments show that the proposed method indeed outperforms existing ternary hashing methods.
Method: As illustrated in Figure 1, the joint learning forms a network pipeline consisting of four parts: 1) a convolutional neural network (CNN) for learning deep features, 2) a fully connected hash layer for transforming the features into d dimensions, 3) a smoothed ternary function for converting each element of the d-dimensional features to a value close to 1, -1 or 0, and 4) a loss function. As in [6,12], we use AlexNet for feature learning. To obtain convincing results, we use the common cross-entropy loss for network training, although other, more specialized losses, such as the hinge-like loss [6], may lead to better codes.
The smoothed ternary function f(x) is defined in (1), where α is a positive constant and k must be an odd number greater than one, namely k = 3, 5, 7, and so on. An odd k enables f(x) to take both positive and negative values, whereas an even k could only lead to nonnegative values. As the odd k tends to infinity, as shown in Figure 1, f(x) converges to the ternary function (2), which maps x to +1, 0 or −1 according to a threshold α. This threshold parameter is identical to the scale parameter α in (1), and is set to α = 0.5 in our experiments because the CNN feature elements x usually have values normalized to the range (−1, +1).
Given the equivalence between (1) and (2) in the limit of k, we gradually increase the value of k during training, so that the smoothed function (1) finally approaches the desired ternary function (2). In the testing phase, we replace (1) with (2) to generate ternary codes. Further, as in [12], the ternary codes are converted to binary bits via the mapping {−1, 0, 1} → {01, 00, 10}, in order to perform Hamming-distance-based image retrieval on binary machines.
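The behaviour described above can be sketched in a few lines of Python. The surrogate `smooth_ternary` below uses tanh((x/α)^k), which is only an illustrative assumption chosen because it has the stated properties (an odd k gives signed outputs, and the curve approaches a three-level step as k grows); it should not be read as the exact form of (1). The tie-breaking at |x| = α in `hard_ternary` and all function names are likewise assumptions made for this sketch.

```python
import math

ALPHA = 0.5  # threshold / scale value reported in the text

def smooth_ternary(x, k, alpha=ALPHA):
    """Illustrative smooth surrogate (assumed form, NOT the paper's Eq. (1)).

    k is expected to be an odd integer greater than one.
    """
    return math.tanh((x / alpha) ** k)

def hard_ternary(x, alpha=ALPHA):
    """Ternary step: +1 above alpha, -1 below -alpha, 0 otherwise.

    Mapping |x| == alpha to 0 is an assumption of this sketch.
    """
    if x > alpha:
        return 1
    if x < -alpha:
        return -1
    return 0

def ternarize(features, alpha=ALPHA):
    """Apply the hard ternary function element-wise (test-time behaviour)."""
    return [hard_ternary(x, alpha) for x in features]

# Bit mapping stated in the text: {-1, 0, 1} -> {01, 00, 10}
BITS = {-1: "01", 0: "00", 1: "10"}

def to_bits(code):
    """Convert a ternary code into a binary string, two bits per element."""
    return "".join(BITS[t] for t in code)

def hamming(a, b):
    """Hamming distance between two equal-length bit strings."""
    if len(a) != len(b):
        raise ValueError("bit strings must have equal length")
    return sum(ca != cb for ca, cb in zip(a, b))

if __name__ == "__main__":
    feats = [0.8, 0.1, -0.7, -0.2]
    for k in (3, 7, 11):
        print(k, [round(smooth_ternary(x, k), 3) for x in feats])
    print(ternarize(feats), to_bits(ternarize(feats)))
```

Under this bit mapping, the Hamming distance between +1 and −1 is 2, while the distance between 0 and either ±1 is 1, so the 'zero' state sits halfway between the two polar states.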

Experiments:
We compare the proposed method with two other ternary hashing methods, DPN [6] and TH [12], on three typical databases: CIFAR-10 [8], NUS-WIDE [4] and ImageNet100 [5]. Following the settings in DPN and TH, the databases are processed as follows: 1) CIFAR-10 consists of 60,000 color images in 10 classes. We randomly select 1,000 images (100 images per class) as a query set, and the remaining 59,000 images are taken as a retrieval set. From the retrieval set, 5,000 images (500 images per class) are randomly selected for network training; 2) NUS-WIDE comprises 269,648 multi-labeled images in 81 classes. By convention, we select the 21 most frequent classes for experiments. Among them, we randomly select 100 images per class for query and the rest for retrieval. From the retrieval set, 500 images per class are randomly sampled for network training; 3) ImageNet contains 1,000 categories of images, including over 1.2M images in the training set and 50,000 images in the validation set. ImageNet100 collects 100 categories from ImageNet, with the entire validation set for query and the entire training set for retrieval. From the retrieval set, we randomly select 100 images per category for network training.
For generality, as stated before, we simply adopt AlexNet and the cross-entropy loss for feature learning. The network is trained by stochastic gradient descent with a momentum of 0.9. The learning rate is initialized to 10⁻³, with cosine learning rate decay. We set the batch size to 64 and the weight decay parameter to 10⁻⁴. The parameter k of f(x) is gradually increased from 3 to 11 over a total of 150 epochs, with a stride of 2 every 30 epochs.
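The continuation schedule for k and the learning-rate decay can be written down compactly. The sketch below assumes zero-indexed epochs, and assumes that the cosine decay anneals to zero over the 150 epochs without restarts, which the text does not specify; the function names are hypothetical.

```python
import math

def k_schedule(epoch, k_start=3, k_end=11, stride=2, epochs_per_step=30):
    """Odd exponent k for a zero-indexed epoch: 3, 5, 7, 9, 11 in 30-epoch stages."""
    return min(k_start + stride * (epoch // epochs_per_step), k_end)

def cosine_lr(epoch, base_lr=1e-3, total_epochs=150):
    """Cosine decay from base_lr towards zero (annealing to zero is an assumption)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))

if __name__ == "__main__":
    for epoch in (0, 29, 30, 60, 90, 120, 149):
        print(epoch, k_schedule(epoch), cosine_lr(epoch))
```

With five values of k and 30 epochs per value, the schedule covers exactly the 150 training epochs.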
The retrieval accuracy is evaluated with the mean average precision (mAP). As in DPN [6] and TH [12], the mAP is calculated with all retrieval images as returned images for CIFAR-10, with the top 5,000 returned images for NUS-WIDE, and with the top 1,000 returned images for ImageNet100. For fair comparison, we directly compare with the best results reported for DPN and TH in their original papers [6,12]. Note that in TH, the best results are usually obtained with a hinge-like loss rather than the cross-entropy loss. The results are provided in Table 1. Our method achieves consistent performance gains over the other two ternary hashing methods on all three databases and across varying code dimensions. The gains range from 0.1% to 1.4%. This indicates that the proposed joint learning method indeed outperforms existing independent learning methods, owing to its advantage in reducing quantization errors.
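For reference, a minimal sketch of mAP over truncated ranked lists is given below. It assumes the convention that average precision is normalized by the number of relevant items found within the top-n returned images; other normalizations exist, and the text does not state which one DPN and TH use. The function names are hypothetical.

```python
def average_precision(relevance, top_n=None):
    """AP for one query.

    relevance: list of 0/1 flags for the returned images, best match first.
    top_n: keep only the first top_n returned images (None keeps all).
    """
    ranked = relevance if top_n is None else relevance[:top_n]
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    # Assumed convention: normalize by the relevant items found in the top-n.
    return precision_sum / hits if hits else 0.0

def mean_average_precision(relevance_lists, top_n=None):
    """Mean of the per-query average precisions."""
    if not relevance_lists:
        raise ValueError("at least one query is required")
    aps = [average_precision(r, top_n) for r in relevance_lists]
    return sum(aps) / len(aps)

if __name__ == "__main__":
    queries = [[1, 0, 1, 0], [0, 1, 1, 1]]
    print(mean_average_precision(queries))
    print(mean_average_precision(queries, top_n=2))
```

Passing `top_n=5000` or `top_n=1000` would correspond to the NUS-WIDE and ImageNet100 settings, and `top_n=None` to the CIFAR-10 setting in which all retrieval images are returned.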

Conclusion:
In this letter, we have proposed a continuation method to jointly learn the features and ternary codes in an end-to-end manner. The core of the method is to introduce a smoothed function that gradually approaches the desired ternary function during network training, which avoids the difficulty of directly optimizing the discrete ternary function. As expected, the proposed joint learning method generates better ternary codes than existing independent learning methods. For generality, we simply tested the method on the commonly used AlexNet, trained with the cross-entropy loss. We expect that better ternary codes could be obtained if more advanced network structures and loss functions were adopted.