Video captioning via a symmetric bidirectional decoder

The dominant video captioning methods employ the attentional encoder–decoder architecture, where the decoder is an autoregressive structure that generates sentences from left-to-right. However, these methods generally suffer from the exposure bias issue and neglect the guidance of future output contexts obtained from right-to-left decoding. Here, the authors propose a new symmetric bidirectional decoder for video captioning. The authors first integrate self-attentive multi-head attention and a bidirectional gated recurrent unit to capture the long-term semantic dependencies in videos. The authors then apply one single decoder to generate accurate descriptions from left-to-right and right-to-left simultaneously. The decoder in each decoding direction performs two cross-attentive multi-head attention modules to consider both the past hidden states from the same decoding direction and the future hidden states from the reverse decoding direction at each time step. A symmetric semantic-guided gated attention module is specially devised to adaptively suppress the irrelevant or misleading contents in the past or future output contexts and retain the useful ones, avoiding under-description. Experimental evaluations on two widely applied benchmark datasets, Microsoft research video to text and the Microsoft video description corpus, demonstrate that the proposed method achieves state-of-the-art performance, which validates the superiority of the bidirectional decoder.


| INTRODUCTION
The goal of the video captioning task is to automatically describe what is happening in videos with semantically reasonable sentences, which is a crucial and challenging issue in computer vision for closing the gap between vision and language. The generation of accurate and meaningful sentences requires the model to thoroughly understand the rich spatial-temporal semantic information about objects, actions, and their interactions in videos and depict the visual information in a grammatically correct manner. Video captioning has drawn considerable research attention recently for its extensive practical applications in high-level video understanding tasks, including Visual Question Answering (VQA) [1], video retrieval [2], and assisting people with vision problems [3].
Most dominant video captioning methods adopt the classical attentional encoder-decoder framework [4][5][6][7][8][9][10][11][12][13], which has brought significant performance improvements in neural machine translation (NMT) [14,15] and image captioning tasks [16,17]. Typically, 2D or 3D convolution neural networks (CNNs) are exploited to extract visual semantic representations from videos, and recurrent neural networks (RNNs) are leveraged to decode the video representations into sentences sequentially. Meanwhile, the attention mechanism is incorporated into the decoding process to encourage the decoder to selectively concentrate on the relevant visual contents at each time step, which plays a crucial part in generating diverse and fine-grained language sentences. The video representation dramatically influences the quality of generated sentences, especially for videos with rich semantic information. In real scenarios, videos may contain complex activity scenes and thus exhibit long-term semantic dependencies: each video frame is related not only to adjacent frames but also to distant frames. However, it is hard for existing methods [4,5] to utilise a simple video representation to capture detailed temporal dynamics over a long time interval without explicitly considering the direct relationships among frames. Therefore, how to efficiently capture long-term semantic dependencies in videos and extract discriminative video features is a pivotal issue in video captioning.
In addition, another primary concern of most video captioning methods is that the RNN-based decoder is an autoregressive structure which predicts each word based on the previously generated words in a left-to-right manner. This unidirectional decoding structure constrains the utilisation of future output contexts generated from the right-to-left (R2L) decoding. By analysing the human cognition process, it can be concluded that the combination of past and future output contexts can help decide what video contents to describe at the current time and avoid under-description. Furthermore, due to the autoregressive property, most existing methods suffer from the exposure bias issue owing to the discrepancy between training and testing. Concretely, at each time step, the RNN-based decoder is trained to generate each word on the basis of the previous ground-truth words, while in the testing stage, the RNN-based decoder is fed with the previous words predicted by the model itself. Hence, the decoder has never been exposed to the predicted words during training and cannot deal with errors that never occur in the training stage. Once a wrong word is produced at the early decoding steps during testing, the errors quickly accumulate and propagate along the word sequence, which misleads the word prediction process. This issue becomes more severe as the word sequence becomes longer. It can be observed that the left-to-right (L2R) decoding tends to generate high-quality prefixes and inaccurate suffixes, and the R2L decoding shows the reverse preference. Moreover, two separate attention modules are adopted to calculate the past or future output contexts on the basis of the current decoding hidden state and the past or future hidden state vectors. There exists a situation where the past or future output contexts are not what the decoder expects, and thus the decoder could be misled to yield false results.
In other words, it is probable that the decoder hardly determines whether or how well the past or future hidden state vectors are related to the current decoding hidden state. The extreme case is that no worthy information in the past or future hidden state vectors satisfies the demand of current decoding hidden state, but the past or future output contexts can still be calculated by aggregating all attention weighted past or future hidden state vectors and are utterly irrelevant to the current decoding hidden state.
To address the above problems, a new symmetric bidirectional decoding model is proposed for video captioning, which tries to generate a discriminative video representation and employ one single decoder to simultaneously perform decoding from the L2R and R2L directions interactively. At first, the self-attentive multi-head attention (MH-Att) and bidirectional gated recurrent unit (BiGRU) are integrated [18] to model the frame-to-frame interactions directly and capture the long-term semantic dependencies in videos. Next, one single decoder is applied to translate the video representations from two reverse decoding directions at the same time to mitigate the exposure bias problem. The two decoding directions are symmetric and complementary to one another. Taking the L2R decoding as an instance, the generation of each word not only depends on the previously predicted words of the L2R decoding but also relies on the previously generated words of the R2L decoding. This bidirectional decoding structure can facilitate deep and comprehensive information interaction between the two decoding directions and encourage each decoding direction to embed more semantic knowledge from the reverse one. Besides, we exploit a joint training scheme to make the L2R and R2L decoding interactively improve each other, enhancing the robustness and accuracy of the output predictions. Finally, a symmetric semantic-guided gated attention module (SSGGA) is proposed to measure the relevance between each decoding hidden state and the past or future output contexts. Two attention gates are devised to adaptively suppress the irrelevant or misleading contents in the past or future output contexts and retain the useful ones. This module is critical to prevent under-description.
The primary contributions of the proposed method are outlined as follows:

- We introduce a symmetric bidirectional decoder for video captioning, which employs one single decoder to simultaneously describe the video contents from the L2R and R2L decoding directions so as to fully exploit the future contexts and mitigate exposure bias. To the best of our knowledge, our method is the first to investigate the effectiveness of a symmetric bidirectional decoder in video captioning.
- We devise an SSGGA module to determine the relevance between the current decoding hidden state and the past or future output contexts and to avoid misleading the sentence generation.
- Experimental evaluations on two extensively applied benchmark datasets, Microsoft research video to text (MSR-VTT) and the Microsoft video description corpus (MSVD), show that our method achieves state-of-the-art performance compared with several recently proposed methods, which proves the superiority of the bidirectional decoder.

| RELATED WORK
This section provides some brief introductions to several recently proposed video captioning methods as well as the extensively applied attention mechanism.

| Video captioning
Early video captioning methods are mostly template-based methods [19][20][21][22][23], which first define some grammar rules for sentence templates and then align each syntactic component (e.g. subject, verb, object) of the predefined template with semantic words detected from video contents. This type of video captioning is capable of producing grammatically correct video descriptions but loses the flexibility and diversity of sentences. Meanwhile, these methods are insufficient to depict open-domain videos with complex semantic contents. Recently, benefiting from the successful development of deep learning techniques (e.g. CNNs and RNNs) and NMT, sequence learning-based methods have been proposed to depict the visual contents using accurate sentences with flexible syntactic structures. Venugopalan et al. [24] made the first attempt to adopt the encoder-decoder framework to generate video captions. They applied the mean pooling operation to the video features and obtained a global video representation, which neglected the temporal information among frames. Some recent efforts have been made either in the encoding stage, extracting more discriminative video features, or in the decoding stage, generating more robust and meaningful language sentences. The S2VT [25] integrated RGB features and optical flow features for better video representation and leveraged long short-term memory (LSTM) in both the video encoder and sentence decoder. Pan et al. [26] devised a hierarchical RNN, which can excavate the video semantic information with multiple temporal granularities. Ramanishka et al. [27] fused multi-modal source information (i.e. visual, audio, and text) to boost the captioning performance. Baraldi et al. [28] exploited a boundary-aware LSTM cell to discover the discontinuities among the video frames and modified the temporal connectivity of the encoding module accordingly. Wu et al.
[29] leveraged a multi-modal circulant fusion (MCF) module to integrate the visual and textual features and fully capture the cross-modal interactions between the two modalities. Chen et al. [30] designed a plug-and-play PickNet module which tried to pick the most discriminative frames and mitigate the interference of duplicated and redundant frames. Aafaq et al. [11] applied the short Fourier transform operation to the CNN-based video representations hierarchically to incorporate the spatio-temporal semantic dynamics into the visual features. Hou et al. [12] learned the syntactic structure of captions by inferring part-of-speech (POS) tags from videos and adopted a mixture model to translate the video features into linguistic words based on the learned syntactic structure. Zhang et al. [31] designed an object relational graph based encoder which attempted to acquire fine-grained interaction features and incorporated plentiful external language information to guide the caption generation. Moreover, the well-known Transformer structure has been applied to video captioning [32,33], focussing on learning high-level video semantic features. Also, many methods have applied various attention mechanisms to video captioning and obtained significant performance improvements. Gao et al. [34] exploited hierarchical attention to mine temporal dynamic information in videos, and Yan et al. [13] presented a spatio-temporal attention mechanism which can adaptively concentrate on the important regions of the most relevant video frames when predicting each word. Wang et al. [5] put forward a reconstruction network developed from an encoder-decoder-reconstructor framework to exploit the bidirectional flows between videos and sentences. Wang et al. [6] introduced a bidirectional proposal framework that leveraged the past and future visual contexts to localise precise activity segments and generated the corresponding descriptions.
As far as we know, the bidirectional encoding structure has been widely applied in video captioning [6,8], but the bidirectional decoding structure, which has demonstrated its effectiveness in NMT [35][36][37], has never been explored in video captioning. The unidirectional decoder is an autoregressive structure which constrains the utilisation of future output contexts obtained from the R2L decoding. Here, the proposed method is capable of mitigating the exposure bias issue and improving the robustness and accuracy of the output predictions.

| Attention models
The idea of the attention mechanism derives from the operation mode of human visual system. This technique has been extensively explored in lots of computer vision tasks including video classification [38], visual question answering [39], and image captioning [16], etc. The attention mechanism can help identify the important frames of videos or salient local regions of images for extracting more discriminative features. Recently, various types of attention mechanisms have emerged, such as: adaptive attention [40], self-attention [41], multi-level attention [42], and stacked attention [43].
The self-attention mechanism first converts the input features into queries, keys, and values. Then the attention weight matrices between the queries and keys are calculated through dot product. The final attention features are obtained by aggregating all normalised attention weighted values. Recently, a large number of computer vision tasks using the self-attention mechanism have gained superior performances, which encourages us to employ the self-attention mechanism to our video captioning method for exploring the relationships between different components.

| PROPOSED METHOD
The authors propose a symmetric bidirectional decoding architecture for the video captioning task, which is illustrated in Figure 2. At first, a video semantic encoder is introduced, which integrates the self-attentive MH-Att module and the BiGRU network to capture the long-term semantic dependencies in videos. Next, the simultaneous bidirectional decoder translates the video representations from two reverse decoding directions (L2R and R2L) interactively. Finally, the symmetric semantic-guided gated attention module adaptively filters out the irrelevant or misleading information in the past or future output contexts and keeps the useful ones, avoiding under-description.
Notably, gated recurrent unit (GRU) [44] is chosen to be the basic building block of the video encoder and sentence decoder, which is widely used for its relatively fewer parameters required than LSTM [45].

| Preliminary
Here, the authors briefly present the background knowledge of the Transformer framework [41], which consists of two important attention modules, as shown in Figure 1. We begin with the scaled dot-product attention, which is fed with three input features: queries Q = (q_1, …, q_n), keys K = (k_1, …, k_n), and values V = (v_1, …, v_n). Generally, the three input features have the same dimension, denoted as d. The scaled dot-product attention yields the weighted sum of the values V, where the inner product is exploited to compute the attention weights based on each query q_i and the keys K:

Attention(Q, K, V) = softmax(QK^T / √d) V

Moreover, the attention function applied to each query is calculated at the same time.
We call the scaled dot-product attention self-attention when the three inputs are set to the same features. Otherwise, it is called cross-attention.
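The scaled dot-product computation described above can be sketched in a few lines of plain Python (an illustrative toy version operating on lists of vectors, not the authors' implementation, which works on batched tensors):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: lists of d-dimensional vectors (lists of floats).
    Returns one attended output vector per query."""
    d = len(K[0])
    outputs = []
    for q in Q:
        # Attention weights: softmax of the scaled inner products q . k_i.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
    return outputs
```

Setting Q = K = V to the same feature sequence gives self-attention; any other choice of inputs gives cross-attention.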
Furthermore, the MH-Att is put forward to strengthen the capacity of attention representations, which contains H identical scaled dot-product attention modules known as 'heads'. Each head attends to the contexts from a different representation subspace, and the MH-Att can achieve better results than single-head attention (SH-Att). We concatenate the output of each head and obtain the final MH-Att results as follows:

MH-Att(Q, K, V) = Concat(head_1, …, head_H) W^O,  head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

where W^O and W_i^Q, W_i^K, W_i^V denote the learnable projection matrices.
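A minimal sketch of the multi-head variant, under the common simplification of slicing the model dimension into H sub-spaces (the learned per-head projections and the output projection W^O are omitted for brevity):

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def attention(Q, K, V):
    # Scaled dot-product attention over lists of vectors.
    d = len(K[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K])
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def multi_head_attention(Q, K, V, H):
    """Split the model dimension d into H head sub-spaces, attend in each,
    and concatenate the per-head results back to d dimensions."""
    d = len(Q[0])
    assert d % H == 0
    dh = d // H
    heads = []
    for h in range(H):
        lo, hi = h * dh, (h + 1) * dh
        # Each head attends within its own d/H-dimensional sub-space.
        heads.append(attention([q[lo:hi] for q in Q],
                               [k[lo:hi] for k in K],
                               [v[lo:hi] for v in V]))
    # Concatenate the head outputs position by position.
    return [sum((head[i] for head in heads), []) for i in range(len(Q))]
```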

| Video semantic encoder
FIGURE 1: Overview of the structures of the two core attention modules in the Transformer: (a) scaled dot-product attention and (b) multi-head attention.

FIGURE 2: Overview of the symmetric bidirectional decoding architecture. The full model consists of three modules: video semantic encoder, simultaneous bidirectional decoder, and symmetric semantic-guided gated attention (SSGGA).

The video semantic encoder aims to build the frame-to-frame interactions and generate discriminative visual features for the downstream decoder. Recently, CNNs have provided significant performance improvements in various computer vision tasks. Pre-trained CNN networks can be taken as standard feature extractors and applied to different tasks directly owing to their strong power of representation learning. Here, given the video frames X = (x_1, …, x_n) of length n, we feed each frame into the pre-trained ResNet152 network and obtain a 2048-dimensional vector from the pool5 layer.
Moreover, a video may contain complex activity scenes over a long period in real scenarios, which indicates that there are long-term semantic dependencies in the video contexts: each video frame is related not only to the adjacent video frames but also to distant frames. We leverage the self-attentive MH-Att module to build direct relationships among different frames and facilitate adequate comprehension of these complex dependencies. Unlike the traditional RNNs, the self-attentive structure contains no recurrence or convolution, so it is required to incorporate position information into the video feature sequence V. The sinusoidal time signal is adopted as the absolute positional encoding:

PE(e, 2j) = sin(e / 10000^(2j/d)),  PE(e, 2j+1) = cos(e / 10000^(2j/d))

where e and j indicate the position in the video feature sequence and the order along the feature dimension, respectively. The positional encoding is then element-wise added to the video feature sequence V directly.
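The sinusoidal encoding can be computed as follows (a standard formulation; the paper's exact indexing convention may differ slightly):

```python
import math

def sinusoidal_positional_encoding(n, d):
    """PE(e, 2j) = sin(e / 10000^(2j/d)), PE(e, 2j+1) = cos(e / 10000^(2j/d)),
    for frame position e and feature dimension index j."""
    pe = []
    for e in range(n):
        row = []
        for i in range(d):
            j = i // 2
            angle = e / (10000 ** (2 * j / d))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def add_positional_encoding(V):
    # Element-wise addition of the positional encoding to the n x d features V.
    pe = sinusoidal_positional_encoding(len(V), len(V[0]))
    return [[v + p for v, p in zip(vr, pr)] for vr, pr in zip(V, pe)]
```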
We set the three inputs of the MH-Att module to the same video feature sequence V. The self-attentive output is given by:

Ṽ = V + MH-Att(V, V, V)

where a residual connection is employed to retain the important information of the videos. The subsequent layer is a feed-forward network (FFN), which contains a ReLU function between two linear transformations.
FFN(x) = ReLU(x W_1 + b_1) W_2 + b_2

where W_1 ∈ R^(d×d_f) and W_2 ∈ R^(d_f×d) denote the weight matrices of the feed-forward layer, and b_1 and b_2 are the corresponding bias terms. We define the framework mentioned above as a sub-encoder. N such identical sub-encoders are stacked to constitute the complete self-attentive MH-Att encoding module, and the output of the previous sub-encoder is taken as the input to the next sub-encoder.
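The position-wise feed-forward layer amounts to two linear maps with a ReLU in between; a minimal sketch for a single position x (illustrative, operating on plain lists rather than tensors):

```python
def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: FFN(x) = ReLU(x W1 + b1) W2 + b2."""
    # First linear transformation (d -> d_f) followed by ReLU.
    h = [max(0.0, sum(xi * W1[i][k] for i, xi in enumerate(x)) + b1[k])
         for k in range(len(b1))]
    # Second linear transformation (d_f -> d).
    return [sum(hk * W2[k][j] for k, hk in enumerate(h)) + b2[j]
            for j in range(len(b2))]
```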
In addition, adjacent video frames tend to have strong dependencies, which derives from the fact that the video signal is a highly correlated temporal signal. To strengthen the local dependencies, we feed the self-attentive video feature sequence into a BiGRU network and obtain the contextual representation of each frame:

h→_i = GRU_ef(h→_(i−1), v_i),  h←_i = GRU_eb(h←_(i+1), v_i),  h_i = [h→_i; h←_i]

where GRU_ef and GRU_eb indicate the L2R and R2L GRU encoding networks, respectively. The contextual representation of each frame is the concatenation of the hidden state vectors of the L2R and R2L GRU encoding networks at each time step. We denote the resulting sequence of contextual representations as H = (h_1, …, h_n).

| Simultaneous bidirectional decoder
Generally, most video captioning decoders are based on RNNs and generate sentences from left-to-right. These unidirectional decoders are incapable of leveraging the future output contexts obtained from the R2L decoding and always suffer from the exposure bias issue. Intuitively, the future output contexts can provide complementary signals and are crucial for sentence generation. Thus, to fully exploit the past and future output contexts as well as to mitigate the exposure bias issue, one single decoder is employed to simultaneously perform decoding from the L2R and R2L directions interactively. Our bidirectional decoding structure attempts to leverage the complementary nature of the two reverse decoding directions to yield accurate and reasonable language sentences. The decoder in each decoding direction outputs predictions under the semantic guidance of the visual hidden state vectors generated by the video semantic encoder, the past hidden state vectors obtained from the same decoding direction, and the future hidden state vectors obtained from the reverse decoding direction. Because of the symmetrical characteristic of the L2R and R2L decoding, the L2R decoding is considered as an example to explain how the bidirectional decoding model works.
Given the visual hidden state vectors, all past hidden state vectors, and all future hidden state vectors, the bidirectional decoder models how to predict the next word from left-to-right, where GRU_df denotes the L2R GRU decoding network, s→_t denotes its hidden state at time step t, and the remaining weight matrices are the learnable model parameters.
Then a softmax function is applied to normalise the relevance scores (Equation (16)). The future output contexts act as auxiliary knowledge to help predict the current word in the L2R decoding accurately.
In addition, as illustrated in Figure 2, the last hidden state vector h_n of the video semantic encoder is utilised to initialise the first decoding hidden state vector s→_0. The conditional probability of predicting each word from left-to-right is given by a softmax over the vocabulary, where W→_s and b→_s denote the learnable model parameters. Similar to the calculation of sentence generation from left-to-right, the reverse sentence generation process is defined symmetrically, where GRU_db represents the R2L GRU decoding network, and the reverse visual context information, the past output contexts generated from the R2L decoding (Equation (21)), and the future output contexts H←_future_t generated from the L2R decoding (Equation (22)) are obtained analogously.
It can be observed that the bidirectional decoder generates two different sentences at the same time.

| Symmetric semantic-guided gated attention
The attention mechanism has been extensively applied in video captioning; it always yields a weighted average attention vector for any input decoding hidden state, regardless of whether or how well the attended features are relevant to that hidden state. We have calculated the past and future output contexts by cross-attentive MH-Att. The primary concern is that the past or future output contexts may not be what the bidirectional decoder expects at each time step and would then mislead the word prediction process. This happens when no relevant information in the past or future hidden state vectors satisfies the demand of the current decoding hidden state.
Here, we devise a SSGGA module as illustrated in Figure 3, which is designed for measuring the relevance between the past or future output contexts and the current decoding hidden state, and avoiding under-description. Likewise, we introduce this module from the perspective of L2R decoding, and the direction arrow sign will be omitted.
The SSGGA module first generates two symmetric attention gates g_1 and g_2 for the past and future output contexts, respectively. Each attention gate is determined by the semantic guidance factor s_t and the corresponding output contexts H^past_t or H^future_t: if the output contexts are relevant to s_t, the gate receives a high value; otherwise, it receives a low gate value. We then apply the two attention gates to the past and future output contexts by element-wise multiplication to filter out the irrelevant or misleading information, and obtain the gated past and future output contexts (i.e. H_1 and H_2 in Figure 3). Besides, a residual connection is employed to keep the original past or future output contexts.

Ĥ^past_t = H^past_t + g_1 ⊙ H^past_t,  Ĥ^future_t = H^future_t + g_2 ⊙ H^future_t

where ⊙ represents the element-wise multiplication. Ĥ^past_t and Ĥ^future_t denote the final gated attention results, which are refined under the guidance of the semantic factor s_t.
where [⋅] denotes the concatenation operation; W_c, U, V, W_r, b_c, and b_r are the learnable parameters; λ is set to 0.1; and σ denotes the non-linear activation function, for which the sigmoid function is utilised in the implementation.
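A heavily simplified sketch of the gating idea, with one generic gate whose hypothetical parameters W and b stand in for the module's W_c, U, V, W_r, b_c, b_r and the scaling factor λ:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_context(s_t, H_ctx, W, b):
    """Illustrative gate: g = sigmoid(W [s_t; H_ctx] + b) per dimension, then
    element-wise gating with a residual connection: H_hat = H_ctx + g * H_ctx."""
    x = s_t + H_ctx  # list concatenation plays the role of [s_t; H_ctx]
    g = [sigmoid(sum(xi * W[i][k] for i, xi in enumerate(x)) + b[k])
         for k in range(len(H_ctx))]
    H_hat = [h + gi * h for gi, h in zip(g, H_ctx)]
    return H_hat, g
```

A gate value near 1 keeps (and amplifies via the residual path) the corresponding context dimension, while a value near 0 suppresses it down to the residual copy alone.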

| Training and inference
The bidirectional decoding model is trained in two stages. We first optimise the proposed method by minimising the bidirectional cross-entropy (CE) loss defined in Equation (32), which involves both the L2R and R2L likelihoods. Let θ→ and θ← represent the learnable parameters of the L2R and R2L decoding, respectively, and let w*_1:m indicate the ground-truth caption.
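The first-stage objective is a standard sum of two negative log-likelihoods; given the per-step probabilities each decoder assigns to the ground-truth words, it can be sketched as:

```python
import math

def bidirectional_ce_loss(p_l2r, p_r2l):
    """p_l2r[t]: probability the L2R decoder assigns to ground-truth word w*_t;
    p_r2l[t]: the same for the R2L decoder on the reversed caption.
    The joint objective sums the two negative log-likelihoods."""
    l2r = -sum(math.log(p) for p in p_l2r)
    r2l = -sum(math.log(p) for p in p_r2l)
    return l2r + r2l
```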
In the second stage, the non-differentiable evaluation metrics are optimised with self-critical sequence training (SCST) [46], which can further provide significant performance improvement.
where the score of the CIDEr metric is taken as the reward r(⋅) [47]. We approximate the gradients as follows:

∇_θ L(θ) ≈ −(r(w^s) − r(ŵ)) ∇_θ log p_θ(w^s)

FIGURE 3: Overview of the symmetric semantic-guided gated attention architecture.

Here, w^s indicates the video caption sampled from the probability distribution, and ŵ is the video caption obtained by greedy decoding.
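For a single sampled caption, the self-critical policy-gradient update reduces to the following sketch (the standard SCST form; batching and per-word credit assignment are omitted):

```python
def scst_loss(logprob_sampled, r_sampled, r_greedy):
    """Self-critical baseline: the greedy caption's reward r_greedy serves as
    the baseline, so sampled captions that beat the greedy decode are
    reinforced and worse ones are suppressed.
    logprob_sampled: sum of log-probabilities of the sampled caption w^s."""
    advantage = r_sampled - r_greedy
    # Minimising this loss pushes log p(w^s) up exactly when advantage > 0.
    return -advantage * logprob_sampled
```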

| EXPERIMENTS
Now, we evaluate the symmetric bidirectional decoding model on two extensively applied benchmark datasets, including MSR-VTT [48] and MSVD [49]. We first give some brief introductions to the datasets and implementation details. Afterwards, our model will be compared to the unidirectional decoding models (i.e. L2R and R2L) and some recently proposed video captioning models.

| Microsoft Video Description Corpus
MSVD is another extensively applied dataset in video captioning, which consists of 1970 videos collected from the YouTube website. Each video depicts an activity scene and is annotated with 40 English captions by Amazon Mechanical Turk (AMT) workers. Likewise, we follow the standard splits in [5,25,50] to separate the dataset into 1200 videos for training, 100 for validation, and 670 for testing.

| Metrics
There are four standard metrics employed to evaluate the experimental performance, including BLEU@n [51], ROUGE-L [52], METEOR [53], and CIDEr [47]. The BLEU metric estimates the precision of n-grams (n = 1, 2, 3, 4) between the predicted and ground-truth captions, and agrees well with human assessment results. The ROUGE-L metric combines the precision and recall of the longest common subsequence between the compared captions. The METEOR metric measures the word correspondence between the ground-truth and predicted captions by generating an alignment. The CIDEr metric computes the mean cosine similarity of n-grams between the compared captions, weighted by Term Frequency-Inverse Document Frequency (TF-IDF). Generally, METEOR and CIDEr are more semantically accurate, and CIDEr is specially designed for evaluating captioning.
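As an illustration of the BLEU component, the modified (clipped) n-gram precision against a single reference can be computed as follows; the actual scores in the experiments are produced by the COCO evaluation server:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Modified n-gram precision as used in BLEU: clipped n-gram matches
    between candidate and reference token lists, divided by the number of
    candidate n-grams."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matches = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return matches / total if total else 0.0
```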
In addition, these evaluation metrics have been implemented and supplied by the Microsoft COCO evaluation server. Therefore, we can directly make use of these evaluation functions to compare the experimental results of various methods. For all the metrics, higher values mean better results.

| Video processing
We resize the video frames to the standard size 224 × 224 and then feed the video frames into the pre-trained ResNet152 [54], which returns a sequence of 2048-dimensional feature vectors. We sample 40 equally-spaced features from the sequence of feature vectors. If the total number of video features is less than 40, we pad them with zero vectors.
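The sampling and padding step can be sketched as follows (illustrative; `sample_or_pad` and its index rounding are our naming and a plausible choice, not necessarily the authors' exact scheme):

```python
def sample_or_pad(features, target=40):
    """Pick `target` equally-spaced feature vectors; zero-pad short videos."""
    n = len(features)
    if n >= target:
        # Equally-spaced indices spanning the full sequence.
        idx = [round(i * (n - 1) / (target - 1)) for i in range(target)]
        return [features[i] for i in idx]
    dim = len(features[0]) if features else 2048
    return features + [[0.0] * dim] * (target - n)
```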

| Sentence processing
We first convert each ground-truth sentence into lowercase, tokenise it into words using the word_tokenize function in the NLTK toolbox, and eliminate the punctuation. Moreover, we mark the starting position of each sentence with an <SOS> label and the ending position with an <EOS> label. Unknown words are mapped to the <UNK> label. Afterwards, a vocabulary is built with 8488 words for MSVD and 16,848 words for MSR-VTT.
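A simplified version of this preprocessing pipeline (plain whitespace splitting stands in for NLTK's word_tokenize):

```python
def preprocess(sentence, vocab):
    """Lowercase, tokenise, strip punctuation, wrap with <SOS>/<EOS>, and map
    out-of-vocabulary words to <UNK>."""
    words = [w.strip('.,!?') for w in sentence.lower().split()]
    words = [w for w in words if w]
    return ['<SOS>'] + [w if w in vocab else '<UNK>' for w in words] + ['<EOS>']
```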

| Training details
We embed each word into a 512-dimensional word vector. For the video encoder, the self-attentive MH-Att module has N = 3 identical sub-encoders and H = 8 attention heads, and the BiGRU has a 1024-dimensional hidden size. For the bidirectional decoder, the cross-attentive MH-Att module has H = 8 attention heads, and the GRU has 1024 hidden units. Moreover, the regularisation technique Dropout [55] with a rate of 0.5 is applied to the input and output of the GRUs to prevent overfitting. We initialise all the model parameters from a distribution on [−0.1, 0.1] and fix the batch size to 64. Besides, the Adam [56] algorithm is leveraged to optimise the bidirectional decoding model by minimising the CE loss with a learning rate of 10^−4. We train the bidirectional decoder for 500 epochs, and the learning rate is decayed by a factor of 0.5 every 50 epochs. The model gradients are clipped to the range [−5, +5] to avoid gradient explosion. The proposed model is implemented in PyTorch and runs on an NVIDIA RTX 2080Ti GPU with 11 GB of memory. The beam size is set to five for the final sentence generation at testing time.
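The step-decay schedule and gradient clipping described above amount to:

```python
def learning_rate(epoch, base_lr=1e-4, decay=0.5, step=50):
    """Step decay: multiply the learning rate by 0.5 every 50 epochs."""
    return base_lr * decay ** (epoch // step)

def clip_gradient(g, bound=5.0):
    """Clamp each gradient component to the range [-bound, +bound]."""
    return max(-bound, min(bound, g))
```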

| Performance comparison
To validate the effectiveness of the bidirectional decoding structure, the proposed method is compared with the unidirectional decoding methods (L2R and R2L) and several recently proposed video captioning methods.
- MP-LSTM [24]: The mean pooling-LSTM (MP-LSTM) method applies the mean pooling operation to all the sampled features to obtain a global video representation, which is taken as the input to an LSTM-based sentence decoder.
- S2VT [25]: The sequence to sequence video to text (S2VT) method extracts both RGB and optical flow features and applies the LSTM in both the video encoder and sentence decoder.
- SA-LSTM (L2R) [50]: The SA-LSTM method adopts soft attention (SA) to selectively concentrate on the most relevant visual contents at each time step. We take SA-LSTM as the unidirectional decoding model, which generates descriptions from left-to-right; our model is designed on the basis of the SA-LSTM method.
- SA-LSTM (R2L) [50]: The SA-LSTM (R2L) method is a variant of SA-LSTM (L2R), which generates the sentence in a right-to-left manner.
- h-RNN [59]: The h-RNN method introduces a hierarchical-RNN (h-RNN) framework, which leverages the inter-sentence dependency to generate multiple video descriptions.
- M3 [4]: The M3 method designs a shared memory structure that selectively reads/writes both visual contents and sentences for video captioning.
- hLSTMat [57]: The hLSTMat method proposes an adjusted temporal mechanism that determines the usage of visual and textual information.
- RecNet [5]: The RecNet method designs an encoder-decoder-reconstructor structure to make use of the bidirectional flow between videos and sentences.
- BAE [28]: The boundary-aware encoder (BAE) introduces a boundary-aware LSTM cell that recognises the discontinuities of videos by extracting the hierarchical structures of videos.
- aLSTMs [34]: The aLSTMs method utilises the hierarchical attention mechanism to concentrate on the discriminative video frames and then explores the correlation between the words and video contents.
- PickNet [30]: The PickNet method puts forward a plug-and-play reinforcement-learning-based network that selects the most informative video frames for avoiding content noise.
• MCNN [29]: The multi-stage CNN (MCNN) method designs a multi-modal circulant fusion (MCF) module to exploit the interactions among visual-textual features.
• JSRVC [12]: The joint syntax representation and visual cue (JSRVC) method jointly learns the syntactic structure of the caption by inferring POS tags from videos and utilises a mixture model to translate the video features into words based on the learned syntax structure.
• GRU-EVE [11]: The GRU-EVE method devises a novel video encoding module that enriches the video representations with spatio-temporal dynamical information and high-level semantic concepts.
• CAM-RNN [7]: The CAM-RNN method introduces a co-attention-model-based RNN to extract the visual and textual features most relevant to the generated sentences.
• ASM [60]: The ASM method puts forward an attribute selection mechanism (ASM) to automatically focus on the useful attributes.
• STAT [13]: The STAT method designs a spatial-temporal attention (STAT) mechanism that considers the spatial and temporal structures of videos simultaneously and catches the significant regions of the most relevant video frames when predicting each word.
• ConvRS [58]: The ConvRS method presents a convolutional reconstruction-to-sequence framework, which uses inter-frame differences to generate discriminative video representations and then adopts multiple dilated convolution layers to generate sentences.

Table 1 shows the comparison results between the unidirectional decoding (L2R, R2L) and bidirectional decoding models on the MSR-VTT and MSVD datasets, respectively. For a fair comparison, we remove the self-attentive MH-Att module from the full model and extract the video features merely with a BiGRU network, the same as the unidirectional decoding models.
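For reference, the mean-pooling step used by the MP-LSTM baseline listed above can be sketched as follows (a minimal NumPy illustration; the frame count and feature dimension are arbitrary assumptions, not values from the paper):

```python
import numpy as np

def mean_pool_features(frame_features: np.ndarray) -> np.ndarray:
    """Collapse per-frame CNN features of shape (T, D) into one global (D,) vector."""
    return frame_features.mean(axis=0)

# e.g. 26 sampled frames, each represented by a 2048-dim CNN feature
frames = np.random.rand(26, 2048)
video_repr = mean_pool_features(frames)  # shape (2048,)
```

Mean pooling discards the temporal ordering of frames, which is precisely the limitation that attention-based encoders such as SA-LSTM address.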
We can observe that the bidirectional decoding model achieves a significant improvement over the unidirectional decoding (L2R) model of 6.1% and 10.2% (CIDEr) on the MSR-VTT and MSVD datasets when trained only with the CE loss. These results strongly indicate the effectiveness of the bidirectional decoding structure in mitigating the exposure bias issue and regularising the output predictions to be more consistent with the visual semantics. Moreover, we train our model with reinforcement learning (RL), which further provides a noticeable gain in captioning performance on most evaluation metrics, except BLEU@4. In Tables 2 and 3, we report the quantitative comparison results of several recently proposed video captioning methods on the MSR-VTT and MSVD datasets, respectively. It should be noted that not all the compared methods are trained with RL, so our method is compared with the other state-of-the-art methods only under the CE loss for fairness. We can see that our symmetric bidirectional decoding model obtains the best performance in terms of METEOR and CIDEr and ranks second on ROUGE-L. Specifically, compared with JSRVC [12] and GRU-EVE [11], which use the same video features, we do not excavate intrinsic high-level semantic concepts from the videos to enrich the visual features, yet our method still obtains better METEOR and CIDEr scores. The newly published methods CAM-RNN and STAT both attempt to take the spatial and temporal structures of videos into account by employing various attention mechanisms. Another observation is that our method, with no complex techniques, can achieve comparable performance with most state-of-the-art methods, which to some extent indicates the performance bottleneck of using only a unidirectional decoder in video captioning. Besides, it can be observed that our method does not achieve remarkable results on the BLEU@4 metric.
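Training with the CE loss over both decoding directions amounts to summing the L2R and R2L likelihood losses; a minimal NumPy sketch (the function and argument names are ours, and the model may weight the two directions differently than this unweighted sum):

```python
import numpy as np

def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Token-level CE: logits of shape (T, V), targets of shape (T,) with vocab indices."""
    shifted = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def bidirectional_ce_loss(logits_l2r, tgt_l2r, logits_r2l, tgt_r2l):
    # Sum the left-to-right and right-to-left likelihood losses,
    # so one decoder is supervised in both decoding directions.
    return cross_entropy(logits_l2r, tgt_l2r) + cross_entropy(logits_r2l, tgt_r2l)
```

Under RL fine-tuning, the same sentence-level reward (e.g. CIDEr) would replace these token-level likelihoods, which is why the metric-level gains differ between the two training regimes.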
The probable reason is that our method is adept at mining semantic information, whereas the BLEU@4 metric is computed from the precision of 4-gram matches rather than from semantic matching.
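To make this point concrete, modified 4-gram precision only rewards exact 4-gram overlaps, so a semantically faithful paraphrase can still score zero; a minimal sketch (the example captions are illustrative, not drawn from the datasets):

```python
from collections import Counter

def ngram_precision(candidate, reference, n=4):
    """Modified n-gram precision of a candidate caption against one reference."""
    cand_ngrams = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref_counts = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    if not cand_ngrams:
        return 0.0
    # Clip each candidate n-gram's count by its count in the reference.
    matches = sum(min(c, ref_counts[g]) for g, c in Counter(cand_ngrams).items())
    return matches / len(cand_ngrams)

ref = "a man is playing a guitar on stage".split()
exact = "a man is playing a guitar on stage".split()
para = "a person performs music with a guitar".split()

print(ngram_precision(exact, ref))  # 1.0: every 4-gram matches
print(ngram_precision(para, ref))   # 0.0: no shared 4-gram despite similar meaning
```

The full BLEU@4 score additionally combines 1- to 4-gram precisions with a brevity penalty, but the 4-gram term dominates the behaviour described above.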

| Ablation study
This section investigates the impact of the different components of the proposed method. To this end, we devise several ablated models with different settings by removing specific components, as illustrated in Table 4. We first define a 'base' model, which employs the BiGRU network to learn the video contextual features and drops the SSGGA component.

Effect of the self-attentive MH-Att module
We verify the contribution of the self-attentive MH-Att module in capturing long-term semantic dependencies from video contexts and excavating discriminative visual semantic features. Table 5 illustrates the experimental results of various settings, with the number of stacked self-attention layers N varying from one to six and the number of attention heads H chosen from one, four or eight. We can observe that increasing N boosts the performance at first, but the performance (BLEU@4, ROUGE-L, CIDEr) drops when N becomes larger (N ≥ 3). It is probable that stacking more self-attention layers makes the video encoder more prone to losing low-level semantic information. Moreover, the experimental results clearly show that a larger H provides significant gains in captioning performance and helps to generate more discriminative video features. Thus, the 3-layer, 8-head setting is the best choice in our implementation.

Effect of combination methods
We investigate the influence of several fusion schemes on the captioning performance, including concatenation, element-wise add, bilinear pooling, and the gate mechanism. The performance comparison results are illustrated in Table 6. We can observe that element-wise add provides slightly better results than the other combination methods in BLEU@4, ROUGE-L, and CIDEr. Thus, combining the past and future output contexts appropriately can provide robust guidance to the word prediction process.
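The four fusion schemes compared in Table 6 can be sketched, under assumed shapes, as follows (a minimal NumPy illustration; the hidden size `d` and the parameters `W_b`, `W_g`, `b_g` are hypothetical stand-ins for learned weights, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                              # assumed hidden size
past, future = rng.standard_normal(d), rng.standard_normal(d)

# 1) Concatenation: keeps both contexts intact but doubles the dimensionality.
concat = np.concatenate([past, future])            # shape (2d,)

# 2) Element-wise add: keeps the dimensionality and treats both contexts symmetrically.
added = past + future                              # shape (d,)

# 3) Bilinear pooling: models pairwise interactions, projected back to d dims.
W_b = rng.standard_normal((d, d, d)) * 0.1
bilinear = np.einsum('i,ijk,j->k', past, W_b, future)  # shape (d,)

# 4) Gate mechanism: a learned sigmoid gate interpolates the two contexts.
W_g, b_g = rng.standard_normal((d, 2 * d)) * 0.1, np.zeros(d)
g = 1.0 / (1.0 + np.exp(-(W_g @ concat + b_g)))    # sigmoid gate in (0, 1)
gated = g * past + (1.0 - g) * future              # shape (d,)
```

Element-wise add is the cheapest of the four and introduces no extra parameters, which may partly explain its slight edge in Table 6.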

| Qualitative analysis
Although the standard evaluation metrics validate the superiority of the symmetric bidirectional decoding structure, they do not intuitively convey the quality of the generated video descriptions. Figure 4 illustrates qualitative examples of our video captioning method on eight videos, with six frames selected to represent each video. It can be concluded that our method generates descriptions with more accurate semantic meanings than the unidirectional decoding methods (L2R, R2L). Examples (a), (b), (e) and (f) demonstrate that our method can not only express the general meaning of the whole video, but is also capable of identifying more detailed objects ('eyes', 'brush', 'suit', 'audience'), attributes ('red and blue dress') and actions ('playing guitar in stage'). Our method can also ground larger-scale semantic concepts, such as 'A group of people' in example (h). In addition, in examples (d) and (g), our model recognises video game scenes rather than a normal environment. In particular, in example (g), the model is able to discern that a 'man' is playing a video game and 'commentating' on the game without any person appearing on the screen. According to the above observations, we can conclude that the bidirectional decoding structure contributes to generating more robust and fine-grained video captions.

| CONCLUSION
The dominant video captioning methods usually adopt the attentional encoder-decoder framework with a left-to-right autoregressive decoder, so they tend to suffer from the exposure bias problem and leave unexplored the future output contexts produced by R2L decoding.
Here, we introduce a new symmetric bidirectional decoder for video captioning. Unlike previous methods, the past and future output contexts are both taken into account, and our method learns to employ a single decoder to generate video descriptions in the L2R and R2L directions simultaneously. We also devise an effective symmetric semantic-guided gated attention module that suppresses the irrelevant or misleading contents in the past or future output contexts, providing accurate guidance to the word prediction process. The L2R and R2L likelihood losses are integrated to optimise the full model. Extensive experiments conducted on two widely-used datasets verify the superiority of our proposed method over the unidirectional decoding (L2R, R2L) methods and most recently proposed methods. We also provide qualitative examples to intuitively visualise the effectiveness of the symmetric bidirectional decoder. Consequently, our proposed method can generate accurate and robust video captions.