Cross-modal semantic correlation learning by Bi-CNN network

Cross-modal retrieval can retrieve images through a text query and vice versa. In recent years, cross-modal retrieval has attracted extensive attention. The purpose of most existing cross-modal retrieval methods is to find a common subspace and maximize the correlation between different modalities. To generate specific representations consistent with cross-modal tasks, this paper proposes a novel cross-modal retrieval framework that integrates feature learning and latent space embedding. In detail, we propose a deep CNN and a shallow CNN to extract features from the samples. The deep CNN is used to extract the representation of images, and the shallow CNN uses multi-size kernels to extract a multi-level semantic representation of text. Meanwhile, we enhance the semantic manifold by constructing a cross-modal ranking loss and a within-modal discriminant loss to improve the separability of the semantic representations. Moreover, the most representative samples are selected by an online sampling strategy, so that the approach can be applied to large-scale data. This approach not only increases the discriminative ability among different categories, but also maximizes the correlation between different modalities. Experiments on three real-world datasets show that the proposed method is superior to popular methods.


INTRODUCTION
With the rapid growth of the Internet, online multimedia information is expanding at an amazing speed, and content is usually presented in multiple modalities. Consequently, learning representations of these different forms is essential for many practical applications, such as image captioning and video captioning.
In this paper, we focus on cross-modal retrieval, which uses image queries to search for text, and vice versa. The core of cross-media retrieval is to explore the semantic consistency between different modalities. Efficient cross-media retrieval plays an important role in daily applications. Figure 1 gives two examples of cross-modal retrieval. Unlike single-modal retrieval, cross-modal retrieval can provide multi-modal results. The key issue of cross-modal retrieval is that the distributions of different modalities are not consistent, so we cannot directly measure cross-modal similarity.
In order to retrieve cross-media information, many researchers concentrate on building common-space models that bridge different information modalities. Canonical Correlation Analysis (CCA) [1] and its variants [2][3][4][5][6] are the most popular methods. The basic concept of these methods is to obtain a common subspace by learning the maximum correlations among different modalities. For instance, [2] applies CCA to maximize the correlation among modalities. In addition, some works add supervised information to improve CCA performance. In detail, [3] jointly learns multiple kinds of additional information to explore a distinguishable common subspace. Manifold learning [7,8] is another important line of cross-media retrieval. For instance, [7] uses parallel fields to treat cross-media retrieval as a manifold alignment problem, rather than merging the original features. Ref. [8] calculates a common low-dimensional embedding to maximize the cross-media correlations while preserving the local distances. Deep neural networks have been shown to be effective in many tasks [9][10][11] by providing scalable nonlinear transformations for feature representations in unimodal scenarios. An increasing number of researchers have deployed deep neural networks in cross-media retrieval [12][13][14], specifically exploiting nonlinear correlations when learning generic subspace representations. Existing deep neural network based cross-media retrieval models can mainly be divided into two phases. The first learning phase generates a separate representation for each modality, and the second learning phase obtains a generic representation across modalities.
Although each phase can achieve the best results for its own objective, this may produce suboptimal results for the cross-modal task. First, the cross-modal generic representations may not be optimally compatible with the visual and textual representations. Second, the representations of different modalities are extracted independently of each other, thus ignoring the semantic correlation between the modalities. Third, the two-phase learning may lead to cumulative errors, in which the representations of each modality become indistinguishable.
In this paper, we fully exploit category information to generate semantic representations of images and text for cross-media retrieval. We design a two-way CNN model named Cross-modal Semantic correlation learning by Bi-CNN Network (CSBN), which generates multi-level textual features and discriminative image features. To address the problem of cumulative errors, the bidirectional CNN applies probabilistic functions directly to textual and visual features, which produces semantic representations of the different modalities. Based on the two-way CNN model, we propose a novel ranking loss function for modelling semantic relationships across modalities. In contrast to existing cross-media retrieval algorithms, which only focus on the relevance of paired items, the proposed ranking loss function does not require image-text pairs as input. In addition, two label predictors are learned so that they jointly perform label prediction and preserve the underlying cross-modal semantic structure of the data. Therefore, the proposed framework ensures that the learned representations are both distinguishable within modalities and invariant between modalities.
As shown in Figure 2, the proposed framework generates semantic representations of images and texts. In the top half of the figure, we generate semantic representations of images via ResNet50 [9]. Intuitively, words describe objects in an image, phrases describe attributes or activities of objects, and the whole sentence expresses a complete meaning covering the semantic information of the entire image. Therefore, CSBN uses kernels of multiple sizes to extract word and phrase features. In addition, max-pooling is applied to the word and phrase features to generate a semantic representation of the sentences.
The remainder of this paper is organised as follows. In Section 2 related work is discussed. In Section 3, we provide details of the CSBN and its target optimisation. In Section 4, we present experimental results and discussions for three real data sets. Finally, conclusions and future work are given in Section 5.

RELATED WORK
Measuring semantic similarity between visual and textual features is the key problem in cross-media retrieval. The common idea is to learn a joint embedding space where heterogeneous data representations can be directly compared. Traditional methods and deep neural network based methods are the two main streams of cross-media retrieval. Among the traditional approaches, several works [1-4, 12, 15-20] focus on learning correlations between different modalities. These methods aim to learn a common subspace of images and text by maximising the correlation between the transformed feature vectors of the different modalities. Alternative techniques [7,8,21] treat cross-media retrieval as a manifold alignment problem. These methods assume that high-dimensional data are embedded in a low-dimensional intrinsic space. The joint embedding representation can then be achieved by learning the underlying manifold. To transfer images and texts into a common semantic space, a number of ranking-based approaches [22,23] have been proposed in recent years. Ref. [24] projects heterogeneous data into a semantic space by applying multiclass logistic regression to preserve semantic similarity. These methods learn semantic models by using bi-directional training samples, which combine the advantages of both retrieval directions, thus improving generalisation performance. Lastly, to exploit label information, some algorithms [5,6,25-28] directly minimize the cross-entropy loss between the predictions and the class labels. Refs. [29,30], combined with fine-grained information [31], calculate the semantic similarity between instances by using both the label information and the feature information of the instances. The correlation between modalities is formulated via class labels. Therefore, class information helps to learn discriminative latent spaces.
Deep learning has achieved state-of-the-art performance in several single-modal tasks, such as image classification [9], object detection [11] and 3D layout [32]. These excellent works have inspired researchers to model the intrinsic semantic correlations between different modalities through deep neural networks. Ref. [23] seeks an effective common subspace by adopting adversarial learning, in which adversarial learning is implemented as an interaction between two processes. To address large-scale cross-media retrieval, [12] proposes a deep multi-network framework that decomposes large matrices into many sub-matrices and devises novel sampling strategies to select the most representative samples for constructing cross-modal ranking losses. Ref. [33] constructs different semantic fragments from words using a multimodal convolutional neural network, and then the fragments interact with the image at different levels. Although m-CNN considers multi-level matching relations, it treats the different matching levels as separate steps, thus ignoring the intrinsic relation from words to sentences. Refs. [34,35,58] incorporate domain adaptation to transform synthetic data into realistic images. Ref. [36] proposes a framework that does not require any adjustment of parameters or thresholds. Ref. [57] transforms linguistic information into different expressions to improve prediction accuracy. Ref. [37] uses a bi-modal CNN to train a holistic scene classifier on two modalities, and then the semantic relevance of the sub-concepts contained in the images is used to improve overall scene recognition. Ref. [13] seeks an effective common subspace based on adversarial learning, implemented as an interplay between two processes. The first process attempts to generate a modality-invariant representation in the common subspace, while the other attempts to distinguish between different modalities based on the generated representation. Ref. [38] proposes an approach in which modality-specific and modality-shared features are jointly explored and leveraged, such that the complementarity and correlation information is used effectively for the retrieval task.
Generally, cross-modal retrieval tasks require low storage cost and high retrieval efficiency, and hashing has recently received more and more attention. Ref. [39] uses the semantic tag information provided by the dataset, combining classifier learning with a matrix decomposition method. Ref. [40] uses a hybrid deep structure, which constitutes a visual-semantic fusion network, to learn hash functions that generate binary codes for comparison. Furthermore, the architecture advantageously combines joint multi-modal embedding and cross-modal hashing; it is based on a novel combination of convolutional neural networks over images, recurrent neural networks over text, and a structured max-margin objective to learn high-quality hash codes. A triplet-based deep hash network for cross-modal retrieval is proposed in [41], which utilises a triplet table and establishes a multi-view loss function for hash codes. In addition, graph regularisation is introduced to preserve the original semantic similarity between hash codes in Hamming space. Ref. [42] proposed a new hash learning method that combines multi-view and deep learning methods.
Ref. [43] proposed an efficient discrete optimisation algorithm to learn the discrete hash code matrix directly with a closed-form solution instead of learning it bit by bit. Ref. [44] employed adversarial learning, utilising two adversarial networks to maximise the consistency of semantic associations and representations between different modalities. Moreover, high-level semantic information in the form of multi-label annotations is also exploited via a self-supervised semantic network.

THE PROPOSED METHOD
In this section, we introduce CSBN for cross-media retrieval, as shown in Figure 2. CSBN is a bi-directional convolutional neural network: one deep convolutional neural network for images and one shallow convolutional neural network for sentences. Furthermore, a novel ranking loss and label predictors are deployed to learn high-quality consistent representations.

Image model
In this paper, we extract the image feature f_I from an original image I by applying ResNet. Specifically, the first step is to rescale each image to 224×224 pixels. We alter the last fully connected layer to fit our dataset, and then obtain the feature f_I from this layer. Therefore, the dimension of f_I is c, where c is the number of categories in the dataset. Finally, in order to generate the category probabilities of the image, we apply a softmax function to the feature vector.
The image semantic representation is denoted by v = (v_1, …, v_c), where v_i represents the probability of the image belonging to the ith category.
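As a minimal sketch of this step (not the authors' code), the category probabilities can be obtained by applying a numerically stable softmax to the c-dimensional feature f_I from the last fully connected layer; the feature values below are invented for illustration:

```python
import numpy as np

def softmax(f_I):
    """Map the c-dimensional FC feature to category probabilities.
    Subtracting the max keeps the exponentials numerically stable."""
    e = np.exp(f_I - f_I.max())
    return e / e.sum()

# hypothetical feature vector from the last FC layer (c = 5 categories)
f_I = np.array([2.0, 1.0, 0.5, -1.0, 0.3])
v = softmax(f_I)  # v[i] = probability that the image belongs to the ith category
```

The resulting v is the image semantic representation fed to the label predictor and the ranking loss.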

Textual model
The pre-trained word vectors are used as the embedding layer, and each row of the embedding matrix is a word vector. We perform multi-kernel convolution on the tokenised sentences. Different convolution kernel sizes yield feature maps of different sizes, so we apply a pooling function to each feature map so that they have the same dimensionality. Word-level feature. We define W_embedding ∈ R^{n×k} as a word embedding matrix, where n is the size of the vocabulary and k is the dimension of the word embedding. Let w_i ∈ R^k be the k-dimensional word vector corresponding to the ith word in the text. A text of length n is represented as w_{1:n} = w_1 ⊕ w_2 ⊕ … ⊕ w_n, where ⊕ is the concatenation operator. Here, we initialise the word embedding matrix with pre-trained word2vec and fine-tune it by back-propagation.
Phrase-level feature. Generally, let w_{i:i+j} refer to the concatenation of words w_i, w_{i+1}, …, w_{i+j}. A 1-D convolution operation involves a filter W_Conv ∈ R^{h×k}, which is applied to a window of h words to produce an h-gram feature. For instance, a feature p_i is generated from a window of words w_{i:i+h−1} by p_i = f(W_Conv · w_{i:i+h−1} + b). Here b ∈ R is a bias term and f is a non-linear function such as ReLU. This filter is applied to each possible window of words in the text {w_{1:h}, w_{2:h+1}, …, w_{n−h+1:n}} to produce a feature map p = [p_1, p_2, …, p_{n−h+1}], with p ∈ R^{n−h+1}. Then we apply a max-pooling operation over the feature map and take the maximum value p̂ = max{p} as the feature corresponding to this particular filter.
Text-level feature. To make full use of the semantic information of the word sequence W, we concatenate the output feature of each filter as P̂ = [p̂_1, …, p̂_h]. Then, the phrase-level vector P̂ is fed into a single-layer perceptron, which combines the phrase-level features to generate the discriminative feature of the text: t = f(W_T P̂ + b_T), where W_T is a transformation matrix and b_T is a bias term. In a word, our text representation progresses step by step from low-level word features to high-level text features. After obtaining the multi-modal representations, we apply the label predictors to learn discriminative representations within a modality and propose a ranking loss to model the correlation between modalities.
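The word-to-text pipeline above (multi-size 1-D convolutions, ReLU, and max-pooling over each feature map) can be sketched as follows; note the filter weights and the toy sentence matrix are random placeholders, not learned parameters:

```python
import numpy as np

def conv_maxpool(text_emb, h, W_conv, b):
    """Slide an (h x k) filter over windows of h word vectors,
    apply ReLU, and max-pool over the resulting feature map.
    text_emb: (n, k) matrix of word vectors for one text."""
    n = text_emb.shape[0]
    feature_map = []
    for i in range(n - h + 1):
        window = text_emb[i:i + h]                       # w_{i:i+h-1}
        feature_map.append(max(0.0, np.sum(window * W_conv) + b))
    return max(feature_map)                              # p̂ = max{p}

rng = np.random.default_rng(0)
text_emb = rng.standard_normal((7, 4))                   # 7 words, k = 4
# one filter per kernel size (word-, 2-gram- and 3-gram-level)
P_hat = np.array([conv_maxpool(text_emb, h, rng.standard_normal((h, 4)), 0.0)
                  for h in (1, 2, 3)])                   # concatenated multi-level feature
```

In the full model each kernel size has many filters and P̂ is passed through the single-layer perceptron t = f(W_T P̂ + b_T).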

Label prediction
To ensure the discriminability of intra-modal data, two classifiers are deployed to predict the category labels of the samples in the common space. To this end, a cross-entropy loss function is placed on top of each representation output. Each classifier takes the semantic representations of the image or text instances as training data and generates a probability distribution p̄_V (p̄_T). The intra-modal classification loss is formulated as L_* = −(1/n) Σ_{i=1}^{n} y_i · log p̄_{*,i}, where L_* denotes the cross-entropy loss of image or text category classification over all samples. Here, n is the number of samples within each mini-batch, y_i denotes the ground truth of each instance, and p̄_{*,i} is the generated probability distribution for each image or text.
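A small numerical sketch of the intra-modal classification loss, assuming one-hot ground truth so that the cross-entropy reduces to the negative log-probability of the correct class (the probabilities below are illustrative, not model outputs):

```python
import numpy as np

def cross_entropy(p_bar, y):
    """Mean cross-entropy over a mini-batch.
    p_bar: (n, c) predicted class distributions; y: (n,) ground-truth class indices."""
    n = p_bar.shape[0]
    # pick the predicted probability of the true class for each sample
    return float(-np.mean(np.log(p_bar[np.arange(n), y] + 1e-12)))

p_bar = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])  # softmax outputs for 2 samples, 3 classes
y = np.array([0, 1])                 # ground-truth labels
loss = cross_entropy(p_bar, y)
```

The same function serves for both L_V (images) and L_T (texts), applied to the respective modality's predicted distributions.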

FIGURE 3
The anchor and the positive have the same identity information, but the negative has different identity information. The cross-modal ranking loss minimizes the gap between an anchor and a positive, and maximizes the gap between an anchor and a negative. Different shapes represent different modalities (i.e., images and texts); the same colour means the same identity

Cross-modal ranking loss
If two samples are close in the original space, they should also be close to each other in the common subspace. In cross-media retrieval tasks, we usually need to rank the retrieved results so that samples with the same semantics are close to each other. Hence, a ranking loss is naturally suitable for this task, as it can learn to rank the retrieved results by distance or similarity. Inspired by [12,45], our model proposes a novel triplet loss to evaluate the relative ranking distance between a given query and the retrieved samples, so that more differences can be observed by contrasting the retrieved samples.
Here we want to ensure that an image v_i^a (anchor) is closer to all texts t_i^p (positive) from the same category than to any text t_i^n (negative) with a different semantic label. This is visualized in Figure 3. We formulate the relationship as d(v_i^a, t_i^p) + α < d(v_i^a, t_i^n), ∀(v_i^a, t_i^p, t_i^n) ∈ T, where α is a margin enforced between positive and negative pairs, d(·,·) denotes the distance between two representations, and T is the set of all possible triplets in the mini-batch. As mentioned above, the symmetric relationship holds between a text anchor and the corresponding image samples.
Therefore the triplet loss constraint can be written as L_rank = Σ_{(a,p,n)∈T} max(0, d(v^a, t^p) − d(v^a, t^n) + α). Because of the large number of samples, sampling triplets over the whole instance space is impractical. We therefore perform online triplet sampling from the labelled instances in each mini-batch. First, the distance matrix between the cross-modal samples is calculated. Second, the distance matrix is used to select anchors, positive samples and negative samples. Finally, we calculate the triplet loss from the anchors, positive samples and negative samples.
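The online sampling and triplet loss described above can be sketched as follows; it assumes squared Euclidean distance, aligned image/text label arrays, hardest-positive/hardest-negative selection from the in-batch distance matrix, and the paper's margin α = 5.0 (the mining policy is one plausible choice, not necessarily the authors' exact one):

```python
import numpy as np

def triplet_loss(anchors, positives, negatives, alpha=5.0):
    """Mean hinge triplet loss: max(0, d(a,p) - d(a,n) + alpha)."""
    d_pos = np.sum((anchors - positives) ** 2, axis=1)
    d_neg = np.sum((anchors - negatives) ** 2, axis=1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + alpha)))

def mine_triplets(img, txt, labels):
    """Online sampling within a mini-batch: every image is an anchor;
    pick the hardest positive and hardest negative text from the
    cross-modal distance matrix D."""
    D = np.sum((img[:, None, :] - txt[None, :, :]) ** 2, axis=2)
    same = labels[:, None] == labels[None, :]   # image-vs-text label agreement
    a, p, n = [], [], []
    for i in range(len(img)):
        pos = np.where(same[i])[0]
        neg = np.where(~same[i])[0]
        a.append(img[i])
        p.append(txt[pos[np.argmax(D[i, pos])]])  # farthest same-class text
        n.append(txt[neg[np.argmin(D[i, neg])]])  # closest different-class text
    return np.array(a), np.array(p), np.array(n)

# toy mini-batch: 2 image/text pairs with 2-D representations
img = np.array([[0.0, 0.0], [1.0, 1.0]])
txt = np.array([[0.0, 0.0], [1.0, 1.0]])
labels = np.array([0, 1])
a, p, n = mine_triplets(img, txt, labels)
loss = triplet_loss(a, p, n)
```

The symmetric text-anchor loss is obtained by swapping the roles of `img` and `txt`.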

Formulation of CSBN
On the basis of the above, the loss function of CSBN is expressed as the combination of the inter-modal ranking loss and the intra-modal discrimination losses: L(θ) = L_rank + γ(L_V + L_T), where θ denotes the model variables to be optimized and the hyperparameter γ controls the relative contributions of the terms. The variables of the two-way CNN are learned end-to-end. Our approach not only increases the discriminative ability among different categories, but also maximizes the correlation between different modalities.
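A one-line sketch of the joint objective, assuming the ranking loss and the two classification losses have already been computed and that a single γ weights the classification terms (the values below are placeholders):

```python
def csbn_loss(l_rank, l_v, l_t, gamma=1.0):
    """Joint CSBN objective: inter-modal ranking loss plus
    gamma-weighted intra-modal classification losses L_V and L_T."""
    return l_rank + gamma * (l_v + l_t)

total = csbn_loss(3.0, 0.5, 0.5)  # e.g. l_rank=3.0, L_V=L_T=0.5
```

All three terms are differentiable with respect to the two-way CNN parameters θ, so the combined loss can be minimized end-to-end with SGD.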

EXPERIMENTS
In this section, we present the experimental results of our proposed CSBN and the comparison methods on the two cross-media retrieval tasks (image-query-texts and text-query-images). The comparisons are conducted on three public datasets: Wikipedia [2], Pascal Sentence [46] and NUS-WIDE [47]. Although we use all three datasets in our experiments, their properties differ: their text modalities are not the same and their categories do not coincide. Wikipedia provides articles, Pascal Sentence provides captions and NUS-WIDE provides tags, so the features of the text modality are quite different across datasets.

Datasets
The Wikipedia dataset was first proposed by [2] for cross-media tasks. The dataset consists of 2866 text-image pairs, each annotated with a label from a vocabulary of 10 semantic classes. We randomly split the dataset following [48]: 2173 pairs for training, 231 pairs for validation and 462 pairs for testing. The Pascal Sentence dataset [46] derives from the 2008 PASCAL development kit. This dataset consists of 1000 images categorized into 20 categories, and each image has 5 corresponding sentences. For each category, we select 40 image-text pairs for training, five pairs for testing and five pairs for validation, following [48].
The NUS-WIDE dataset [47] consists of 270,000 images with their tags, categorized into 81 classes. We follow [49] and prune the original train-test split of the NUS-WIDE dataset by keeping the pairs that belong to the 10 largest classes. Each image-tags pair belongs to a unique class.

Compared methods
The proposed CSBN is compared with several related methods: CCA [1], 3view-CCA [5], SM [2], SCM [2], GMMFA [3], GMLDA [3] and LGCFL [15]. CCA focuses on learning a common subspace and analysing the relevance between different modalities in that subspace. 3view-CCA uses CCA as a starting point, combined with a third view capturing high-level image semantics that can be represented by a single category or multiple non-mutually exclusive concepts. SM learns two logistic regressors to generate a probability for each sample, and uses the category probabilities to represent samples. SCM applies CCA to learn two maximally correlated subspaces, and then trains logistic regressors in each subspace. GMA constructs a general discriminant subspace and proposes a generalized multi-view analysis; GMMFA and GMLDA both use the GMA framework.
LGCFL learns the consistent features by applying the valuable class information on the cross-media retrieval.

Implementation details
For CSBN, we extract features directly from the original pixels of the image using a residual network. For sentences, we transform the word sequence in each sentence into word indices as input to the text branch. For all the methods compared in our experiments, we extracted feature vectors of the same size from the images by means of the same residual network. Word vectors are obtained by adopting word2vec pre-trained on an external corpus, and the average of all word vectors in a sentence is taken as its final feature vector. We implement our method in PyTorch. For the network, we employ ResNet-50 for images and the convolutional network for text, initializing them with the pre-trained ResNet-50 variables and word2vec word embeddings. Back-propagation is applied to fine-tune the ResNet and text-CNN and to train the new layers. Since the new layers are trained from scratch, we first fix the lower layers and train only the new layers. Then we fine-tune all layers, with the learning rate reduced to one tenth of that used for training the new layers. During training, we use the SGD optimizer with momentum 0.9. The maximum number of words per sentence for Wikipedia, Pascal Sentence and NUS-WIDE is 1200, 200 and 200, respectively.

Experiment settings
We utilize precision-recall (PR), scope-precision (SP) and mean average precision (MAP) to measure the retrieval results. MAP is the mean of the average precision (AP) over all queries in the query set. AP is defined as AP = (1/N) Σ_{k=1}^{n} (R_k / k) · rel_k, where n is the size of the test set, N is the number of relevant retrieval results, R_k is the number of accumulated positive samples among the top k items in the retrieval list, rel_k = 1 if the kth sample in the returned retrieval list is relevant, and rel_k = 0 otherwise.
In particular, we measure retrieval performance by adopting MAP@R [50] with a fixed number of returned samples: if the first 50 samples are retrieved, R is set to 50; if all samples are retrieved, R is set to "all". In addition, the precision-recall curves and scope-precision curves are given for all methods. The scope is specified by the number of top-ranked samples when the retrieved samples are ranked according to their similarities to the query.
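The AP and MAP computations can be sketched as below, operating on binary relevance lists already ranked by similarity; `R=None` corresponds to MAP@All:

```python
import numpy as np

def average_precision(rel, R=None):
    """AP over the top-R items of a ranked retrieval list.
    rel: binary relevance indicators (1 = relevant), in ranked order."""
    rel = np.asarray(rel[:R], dtype=float)
    if rel.sum() == 0:
        return 0.0
    cum_rel = np.cumsum(rel)                           # R_k: positives among top k
    prec_at_k = cum_rel / np.arange(1, len(rel) + 1)   # R_k / k
    return float(np.sum(prec_at_k * rel) / rel.sum())  # normalize by relevant count

def mean_average_precision(rel_lists, R=None):
    """MAP: mean AP over all queries."""
    return float(np.mean([average_precision(r, R) for r in rel_lists]))

ap = average_precision([1, 0, 1])                  # relevant at ranks 1 and 3
m = mean_average_precision([[1, 0, 1], [0, 1]])
```

For example, with relevant items at ranks 1 and 3, AP = (1/2)(1/1 + 2/3) = 5/6.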
The parameters γ and α need to be set for our CSBN framework. We set γ and α according to five-fold cross-validation on the Pascal Sentence dataset, and then use the same parameters for all experiments. In detail, γ is set to 1.0 and α is set to 5.0. As shown in Figure 4, our CSBN is not very sensitive to the α parameter.

Image-text retrieval

Tables 1, 2 and 3 report the MAP@50 and MAP@All of the different methods on the test sets of Wikipedia, Pascal Sentence and NUS-WIDE, respectively. From these tables, we draw the following conclusions. As shown in Table 1, CSBN achieves the best results on both bi-modal retrieval tasks on the Wikipedia dataset. The excellent performance of CSBN is due to the addition of the label predictors and the cross-modal ranking loss, which produce distinctive features and preserve the original data structure.
First of all, CSBN outperforms the relevant comparison methods on both tasks. CCA has the worst performance, because it focuses on common subspace learning and ignores the semantic correlation among different modalities; CSBN utilizes a new ranking loss function to model cross-modal semantic relations, so it works better than CCA. Second, SM adopts two logistic regressors to generate sample category probabilities, which achieves comparable results on both tasks; CSBN instead uses softmax and label prediction, which brings a significant improvement over SM. Third, 3view-CCA, SCM, GMMFA and GMLDA have higher MAP scores than CCA. In fact, these methods are based on CCA and introduce semantic modelling into the CCA framework, which demonstrates the importance of the semantic correlation constraint. Fourth, LGCFL excels over the traditional methods by learning consistent features for cross-modal retrieval.
LGCFL models both paired and unpaired image-text losses, but it still has the disadvantage of not clustering samples of the same class well in the common space. CSBN instead uses a triplet loss for semantic clustering and achieves better results. As a result, the MAP@All of CSBN is higher than that of LGCFL by 1.6% and 7.2% on the text-query-images and image-query-texts retrieval tasks, respectively. Ref. [51] reduces the distribution differences between modalities to enhance similarity. In contrast, we directly optimize the distance between feature vectors to reduce their differences; the average MAP@All score of our method on the NUS-WIDE dataset is about 0.12 higher than that of [51]. Figure 5(a) and (b) show the precision-recall and precision-scope curves of the eight methods. The scope (i.e., the top K retrieved samples) of the precision-scope curves ranges from 100 to 500. We can observe that our method achieves higher precision at almost all levels, for both text-query-images and image-query-texts retrieval. For the precision-scope curves, CSBN outperforms the compared methods at all scope levels. Hence, the superiority of CSBN for cross-modal retrieval is further validated by the curves. Pascal Sentence is widely used to assess the effectiveness of cross-modal retrieval algorithms; the results on Pascal Sentence are shown in Table 2. The trends of the results on Pascal Sentence are consistent with those on the Wikipedia dataset. Figure 6(a,b) show the precision-recall and precision-scope curves of image-query-texts and text-query-images, respectively. From Figure 6, we find that the performance of CSBN is unsatisfactory at low recall levels. The purpose of the cross-modal ranking loss is to learn discriminant semantic representations of the different modalities, which leads to sub-optimal performance at low recall levels.

FIGURE 6
Precision-recall curves and precision-scope curves for the image-query-texts and text-query-images experiments on Pascal Sentence
Moreover, we also tried to run the code of GMA on NUS-WIDE, but our device cannot meet its memory requirements, so we do not report those results. As shown in Table 3, CSBN does not achieve the highest scores on MAP@50, which may result from the anchor of the cross-modal ranking loss not being sensitive to neighbouring samples. However, our CSBN approach stably benefits from the two-way CNN framework and remains nearly the best among all compared methods. Figure 7 shows the precision-recall curves and precision-scope curves of the bi-modal retrieval tasks on the NUS-WIDE dataset, which further verify the effectiveness of our CSBN approach.
In Figure 8, to visually examine the search results, we give two retrieval examples, one for image-query-texts and one for text-query-images. For each example, we display the query on the left and the five most relevant retrieved results on the right, with relevance reflected by the class labels. CSBN finds the most relevant matches at the semantic level. For the image-query-texts task, the query image is a 'dog', and each retrieved result contains 'dog mouth', 'dog face', 'puppy' or 'puppies mouth', all belonging to the class 'dog'. For the text-query-images task, given a textual description about "car" and "road", the top retrieved images of CSBN are also relevant to the query text.

Semantic representations
In this section, we attempt to illustrate the difference in discriminative power of the low-dimensional feature representations learned by different methods. Based on Pascal Sentence, we extracted 50 sample pairs each from the 'bicycle' and 'aeroplane' categories to construct a toy dataset. In Figure 9, we adopt the t-SNE [52] algorithm to project the semantic representations from the two-way network, and the embedding features of the different methods are visualised as a 2D distribution mixing image and text representations. The red circles denote the distribution of the 'aeroplane' class, and the green circles represent the 'bicycle' class. From Figure 9, we first conclude that the proposed CSBN maximizes the correlation between different modalities and simultaneously increases the discriminative ability among different categories. In contrast, the second-best method (i.e., LGCFL) only unifies the semantic representations of same-class samples. Moreover, CSBN also differs from the other methods in the range of the sample distribution, which is spread over the same range of coordinates. These results validate that CSBN ensures consistent structures between the image and text spaces, such that the low-dimensional representation is enhanced with stronger discrimination.

FIGURE 7
Precision-recall curves and precision-scope curves for the image-query-texts and text-query-images experiments on the NUS-WIDE dataset

FIGURE 8
Two examples of image-query-texts (upper half) and text-query-images (bottom half) on the Pascal Sentence dataset. For image-query-texts, the query image depicts the scene about "dog", "mouth" and "face". For text-query-images, the query sentences describe the semantics about "car" and "road". The incorrect retrieved results are shown in red frames

Quality of word embedding
To better understand the quality of the learned word embeddings, we provide an empirical analysis by showing the nearest neighbours of given words in Table 4. CSBN adopts an end-to-end training mode in which the word embedding matrix is treated as part of the model variables and is learned during training. We then measure the similarity between words by cosine distance, as in the word2vec model [53,54]. As shown in Table 4, we find that the closest entities of each given word are clearly relevant to it. For instance, when we input the word "women", related words such as "man", "people" and "men" appear at the top positions of the ranking list. We believe this advantage of CSBN comes from the two-way CNN framework, which coherently combines the deep ResNet and the shallow text-CNN layers for image and word features, respectively. As is well known, ResNet is a highly effective deep image feature extractor [55], and CNN layers show powerful ability in text classification [56]. Unifying the architectures of the two different CNNs enables the semantic information in images and text to guide each other. As a result, the word embeddings are updated in a more informative direction, so that relevant words are likely to be close to each other in the embedding space.
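The nearest-neighbour lookup used for Table 4 can be sketched with cosine similarity over the learned embedding matrix; the toy vocabulary and 2-D vectors below are invented for illustration (the real embeddings are the fine-tuned word2vec rows):

```python
import numpy as np

def nearest_words(query, vocab, emb, top=3):
    """Rank vocabulary words by cosine similarity to the query's embedding.
    emb: (n, k) embedding matrix, row-aligned with vocab."""
    q = emb[vocab.index(query)]
    sims = emb @ q / (np.linalg.norm(emb, axis=1) * np.linalg.norm(q) + 1e-12)
    order = np.argsort(-sims)                      # descending similarity
    return [vocab[i] for i in order if vocab[i] != query][:top]

# hypothetical 2-D embeddings: "men"/"man" point roughly like "women", "car" does not
vocab = ["women", "man", "car", "men"]
emb = np.array([[1.0, 0.1],
                [0.9, 0.2],
                [-0.8, 1.0],
                [0.95, 0.15]])
neighbours = nearest_words("women", vocab, emb)
```

With embeddings learned by CSBN, the same lookup produces the neighbour lists shown in Table 4.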

CONCLUSION
In this paper, we propose a cross-media retrieval framework named CSBN based on a two-way convolutional neural network. In CSBN, image and text representations are generated by the proposed two-way CNN. Specifically, the deep CNN (i.e., ResNet) is applied for image representation, and the shallow CNN with multi-size kernels generates multi-level text features. Further, the cross-modal ranking loss reduces the differences in category distributions across modalities, and two label predictors are adopted to generate discriminative sample representations. Experimental results on three public cross-modal datasets show that CSBN works better than related methods on both retrieval tasks. In future work, we will prioritise improving the effectiveness of our method and plan to make CSBN more practical.