TASTA: Text-Assisted Spatial and Temporal Attention Network for Video Question Answering


DOI: 10.1002/aisy.202200131

Video question answering (VideoQA) is a typical task that integrates language and vision. The key for VideoQA is to extract relevant and effective visual information for answering a specific question. Information selection is believed to be necessary for this task due to the large amount of irrelevant information in the video, and explicitly learning an attention model can be a reasonable and effective solution for the selection. Herein, a novel VideoQA model called Text-Assisted Spatial and Temporal Attention Network (TASTA) is proposed, which shows the great potential of explicitly modeling attention. TASTA is made to be simple, small, clean, and efficient for clear performance justification and possible easy extension. Its success is mainly from two new strategies of better using the textual information. Experimental results on a large and most representative dataset, TGIF-QA, show the significant superiority of TASTA w.r.t. the state-of-the-art and demonstrate the effectiveness of its key components via ablation studies.
Selecting relevant information in both spatial and temporal spaces has already become a consensus in recent studies, [6-8,11] but attention seems not yet well explored, since applying it has been reported to degrade performance in ref. [6]. Therefore, we revisit the attention mechanism and find a more proper way of modeling it.
In this study, we introduce the "Text-Assisted Spatial and Temporal Attention Network (TASTA)," which applies attention spatially and temporally for VideoQA with textual information as an assistant. As shown in Figure 2, with the assistance of textual information, the network pays more attention to the features related to the question. TASTA is made to be simple and efficient: only frame features are adopted, without 3D visual features or optical flow, so the effect of text-assisted attention can be seen more clearly. Encouragingly, without bells and whistles, TASTA outperforms state-of-the-art models by large margins on multiple-choice question tasks.
The novelties and strengths of TASTA mainly come from the following two new strategies. The first one is described as "let the text guide." The "guidance" is realized by using text as an input (in addition to the video data) to the attention module, which is inspired by the ST model. [6] Such a network ensures the contribution of textual information while keeping the main visual information unpolluted. The second one is described as "make text provide contextual information to attention when applicable." Questions in VideoQA can be divided into two categories: open-ended questions and multiple-choice questions. An open-ended question only contains the question itself, and the model has to find an answer in the whole feasible answer space. Differently, a multiple-choice question provides not only the question but also a few answer candidates to choose from. These candidates are generally selected to be relevant, and wrong candidates are chosen to be close or highly related to the correct answer to some extent. As shown in Figure 3, TASTA takes all the answer candidates together with the question as textual input. This augmented text provides a context for TASTA, letting the model focus on finding the key visual information that can tell the correct answer from its confusing distractors most effectively. This is designed for multiple-choice questions because such meaningful contextual information is not available for open-ended questions.

Figure 1. Example of a smart production scene with multiple video surveillance cameras. The system tries to answer a question posed by a person, be it a manager or someone else, about whether the robot arm is working normally. The system is expected to pay attention to the working condition of the robot arm since the keyword of the question is "robot arm" (in red) and to give a proper answer according to the video from the camera "For robot arm" (in red).

Figure 2. Example of the proposed TASTA method (spatial and temporal attention panels). An image sequence with the question "How many times does the man step?" in the TGIF-QA dataset. [6] ResNet is used to extract the basic features, which contain all the visual information in the image. With the assistance of textual information, the network pays more attention to the features related to the question. Photo by Jusdevoyage on Unsplash (https://unsplash.com/photos/0xMY2zYqwsE).
The main contributions of this study can be summarized as follows: 1) A new video question answering model, TASTA, is proposed for making attention work effectively for VideoQA. 2) Two new strategies for enhancing performance are introduced, i.e., a text-assisted attention mechanism on the spatial and temporal dimensions, and a new way of connecting the question and the alternative answers. 3) TASTA is made to be simple and efficient so that further studies can easily be built on it as a backbone network structure. 4) The proposed model is tested on the TGIF-QA dataset, showing significantly superior performance in comparison with state-of-the-art models. Ablation studies are also performed to justify the effectiveness of detailed model designs.

Image Question Answering
Many datasets for the ImageQA task have been proposed, such as DAQUAR, [12] VQA, [13] COCOQA, [14] FM-IQA, [15] and VQA v2. [16] This task requires the network to infer the answer according to the given image and question. In the work of ref. [13], the VGG classification network and an LSTM structure were employed to extract image features and textual features, respectively. The image and textual features were then simply integrated by concatenation. Finally, the fused features were fed into a fully connected layer with softmax to infer the answer. Almost all VQA models followed this basic structure of CNN-RNN fusion. The work in ref. [14] proposed the VIS-LSTM model, which combined the image and textual features through an LSTM structure. The work in ref. [17] proposed bilinear pooling, which fused the two features by an outer product operation. The work in ref. [18] proposed a multimodal residual network to integrate the two features, and introduced the attention mechanism into the network to improve visual feature extraction. The work in ref. [19] employed both compact bilinear pooling and spatial attention. Furthermore, the work in ref. [20] extracted image features with an object detection network (Fast R-CNN), and the attention mechanism was adopted at the object level instead of the pixel level. The work in ref. [21] applied the attention mechanism to both image and textual features. The work in ref. [22] proposed an attention-on-attention mechanism to better extract information. ImageQA research provides some basic modeling ideas for VideoQA, though temporal information is not taken into account.

Video Question Answering
VideoQA is a more challenging task than ImageQA. Current public datasets include MovieQA, [23] TGIF-QA, [6] TVQA, [24] Pano-AVQA, [25] CLEVRER, [26] and STAR. [27] Compared with the other datasets, TGIF-QA is at present the largest dataset for the VideoQA task. It is therefore used as the experimental data to test model performance in most recent research work, which is also the case in this study.
Following models from ImageQA, the basic structure of CNN-RNN fusion is also adopted in the VideoQA task. The extraction of temporal features and the application of the attention mechanism are the main improvements. The work in ref. [28] proposed a semantic attention generator, which generated answers based on the concepts detected from the video. The work in ref. [7] proposed a co-memory network, which took image features as appearance information and optical flow as motion information. The attention mechanism was then applied to both the appearance and motion information. The work in ref. [11] divided the input video into structured segments. Two-stream attention, namely, a temporal stream and a textual stream, was introduced in each segment.
Although some achievements have been made in recent VideoQA research, there are still some points that can be further studied. First, it is unnecessary to use features extracted by 3D-CNN or optical flow as additional input. Compared with the huge computational cost, the improvement in network performance is not significant. Most of the latest models no longer adopt such additional input information. [11,29] Second, the attention mechanism in the spatial dimension should improve network performance, but some experiments are contrary to the expected results. The work in ref. [6] proposed the ST model, which applied the attention mechanism in both the temporal and spatial dimensions. However, the ST model did not achieve the expected performance due to the structure of the network. Some work proposes alternative uses of attention, e.g., the work in ref. [9] used attention for its different subtasks, the work in ref. [30] used an RNN-free attention network to handle intra- and intermodal feature processing, and the work in ref. [31] adapted a transformer structure for jointly handling video and text features in a common feature space. The works in refs. [32,33] proposed new models that learn visual concepts of physical objects from dynamic scenes; they emphasize reasoning and are tested on the CLEVRER dataset. [26] Our proposed model TASTA is inspired by the ST model. Our work achieves state-of-the-art results and solves the problem encountered in the ST model, i.e., deteriorated performance caused by less proper attention.

Figure 3. In the VideoQA task, some questions provide alternative answers. The traditional strategy is to evaluate the confidence of each answer separately, which leads the model to give falsely high confidence to some difficult answers. Our strategy is to let the model evaluate all the alternative answers at the same time. Combined with the contextual information between the answers, the model can better exclude the difficult answers. Photo by Jusdevoyage on Unsplash (https://unsplash.com/photos/0xMY2zYqwsE).

Overall Framework
The overall framework of TASTA is shown in Figure 4. Text and video are fed into their corresponding preprocessing modules: GloVe word embedding [34] and ResNet-50 [35] frame-wise visual feature extraction, respectively. The encoded word sequence then goes through a two-layer LSTM model that extracts the global sequence-level textual information. This textual information is combined with the frame-wise visual information as input to the spatial attention module, which produces a filtered frame-wise visual feature representation. A two-layer LSTM model (similar to the one for textual information) is then adopted for extracting and injecting temporal information into the visual representation. A temporal attention module follows to selectively integrate information from all frames. Finally, the integrated visual information and the sequence-level textual information extracted before are concatenated and mapped to the desired output via two fully connected layers.
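The following is a minimal, shape-level PyTorch sketch of this data flow. The dimensions (300-d GloVe vectors, 7 × 7 × 2048 ResNet-50 feature maps, 32 frames, 512-d LSTM states) follow the text; the stand-in modules (a mean over spatial positions and over frames instead of the actual attention modules, and the intermediate activation) are simplifying assumptions, not the authors' implementation.

```python
# Shape-level sketch of the TASTA pipeline in Figure 4 (stand-ins for attention).
import torch
import torch.nn as nn

B, W, T, P = 2, 12, 32, 49            # batch, words, frames, spatial positions (7x7)
words  = torch.randn(B, W, 300)        # GloVe word embeddings
frames = torch.randn(B, T, P, 2048)    # flattened ResNet-50 feature maps

text_lstm    = nn.LSTM(300, 512, num_layers=2, batch_first=True)
video_lstm   = nn.LSTM(512, 512, num_layers=2, batch_first=True)
spatial_proj = nn.Linear(2048, 512)    # stand-in for text-assisted spatial attention
head = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 1))

_, (h, _) = text_lstm(words)
t_f = h[-1]                                      # (B, 512) global textual feature
v = spatial_proj(frames.mean(dim=2))             # (B, 32, 512) per-frame features
states, _ = video_lstm(v)                        # temporal encoding of the frame sequence
v_f = states.mean(dim=1)                         # stand-in for temporal attention
out = head(torch.cat([t_f, v_f], dim=-1))        # (B, 1) answer output
```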

Spatial and Temporal Attention
The same attention structure is adopted for spatial attention and temporal attention, as shown in Figure 4. Here, the module is introduced in a general way. Suppose the visual feature fed into the attention module is denoted by V ∈ ℝ^{l×D_v}, and the textual feature is T ∈ ℝ^{D_q}. D_v and D_q are the dimensions of the two types of features, respectively, and l is the number of visual features. Specifically, in the spatial attention, l refers to the spatial size of the feature map, while in the temporal attention, l refers to the number of frames. The computational process of the attention module can be formulated as

\[ \alpha = f(V, T), \qquad \hat{\alpha} = \mathrm{softmax}(\alpha), \qquad \tilde{V} = \sum_{i=1}^{l} \hat{\alpha}_i V_i \]

where α = {α_1, α_2, ..., α_l} are the weight coefficients, which are calculated by the function f. As shown in Figure 4, the function f consists of two fully connected layers. The output dimension of the first layer is 512, and the output dimension of the second layer is 1 for each of the l features. Then, the l weights are normalized by the softmax function. Finally, the original input V is weighted and summed to get the output Ṽ.
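Below is a minimal PyTorch sketch of this shared attention module. The two fully connected layers (512 hidden units, one score per position) and the softmax over the l positions follow the description above; concatenating the textual feature with each visual feature before f, and the tanh activation, are our assumptions (the paper also reports a 512-d attended output, which would require an additional projection not shown here).

```python
# Sketch of the text-assisted attention module (spatial or temporal).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextAssistedAttention(nn.Module):
    def __init__(self, d_v, d_q, d_hidden=512):
        super().__init__()
        self.fc1 = nn.Linear(d_v + d_q, d_hidden)  # first FC layer, output dim 512
        self.fc2 = nn.Linear(d_hidden, 1)          # second FC layer, one score per position

    def forward(self, V, T):
        # V: (l, D_v) visual features; T: (D_q,) global textual feature
        l = V.size(0)
        T_rep = T.unsqueeze(0).expand(l, -1)              # broadcast text to every position
        scores = self.fc2(torch.tanh(self.fc1(torch.cat([V, T_rep], dim=-1))))  # (l, 1)
        alpha = F.softmax(scores.squeeze(-1), dim=0)      # normalize the l weights
        V_tilde = (alpha.unsqueeze(-1) * V).sum(dim=0)    # weighted sum over positions
        return V_tilde, alpha
```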

Guidance
Because the feature extraction processes of text and vision are independent of each other, their feature spaces are totally different. It is difficult for the model to extract effective features if textual and visual features are added together or fused by an attention mechanism too early. As the name of the proposed network implies, textual information assists visual feature extraction. The guidance of text is realized by simply taking textual features as one of the two inputs to the attention modules. As shown in Figure 4, the attention module essentially weights the visual features. Textual features are not integrated into the visual features until the final part of the network, where text information supplements the visual representation.

Providing Contextual Information
Questions are open-ended or multiple-choice in VideoQA. The text information for the open-ended case includes only the question, while the multiple-choice counterpart includes several alternative answers besides the question. For an open-ended question, the model outputs the answer directly, formed from the labels recognized or detected by the vision part. For a multiple-choice question, the question and each alternative answer are conventionally concatenated before being used as network input. The network scores each alternative answer in turn and outputs the one with the highest score. In other words, the network only judges whether the current answer is appropriate, but lacks a comprehensive analysis of all the alternative answers. For multiple-choice cases, we propose to augment the question with the answer candidates to provide contextual information to the model, which is inspired by human practice. Humans solve multiple-choice questions by first considering all the alternative answers and then choosing the most suitable one. To show the advantage of doing so, two different approaches are compared, called "question with single-answer" and "question with all-answers." The "question with single-answer" approach appends one answer directly to the question, such as "What does the singer do after jump? stop jump." In this way, the model calculates a confidence by means of regression, representing the probability that the answer is correct, and then selects the one with the highest confidence among the multiple answers as the final output. The "question with all-answers" approach, one of the novel designs of this study, puts all alternative answers after the question and uses the special token "<aa>" (alternative answer) to separate different answers, such as "What does the singer do after jump? <aa> vanish <aa> stop jump <aa> hugs woman <aa> inhale slowly <aa> rim eye with hand." In this way, the model outputs five confidence scores, i.e., it performs a classification task. Comparisons of these methods are provided in the ablation study.
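A minimal sketch of the two text formats is given below; the exact whitespace and token placement are our assumptions based on the examples above.

```python
# Sketch of the two multiple-choice text types described above.
def question_with_single_answer(question, answer):
    # One string per candidate; the model regresses a confidence for each.
    return f"{question} {answer}"

def question_with_all_answers(question, candidates):
    # All candidates follow the question, separated by the special token "<aa>";
    # the model then classifies over the candidates in a single pass.
    return question + " " + " ".join(f"<aa> {c}" for c in candidates)

q = "What does the singer do after jump?"
cands = ["vanish", "stop jump", "hugs woman", "inhale slowly", "rim eye with hand"]
print(question_with_single_answer(q, cands[1]))
print(question_with_all_answers(q, cands))
```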

Initial Feature Extraction
GloVe [34] is pretrained on the Common Crawl dataset and maps each word (e.g., the i-th word) into a 300D feature vector (e.g., t_i ∈ ℝ^{300}). Considering that the length of the video and the image size are not fixed, the video is preprocessed to a uniform size of 32 × 224 × 224 before being taken as the network input. The temporal scaling is done by linear interpolation, while the spatial resizing is done by bilinear interpolation. ResNet-50 [35] is pretrained on ImageNet.
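The resizing step can be sketched in PyTorch as follows; the use of a single trilinear interpolation (which is linear along time and bilinear in space) and the tensor layout are our assumptions.

```python
# Sketch of the video preprocessing: resample to 32 frames of 224x224.
import torch
import torch.nn.functional as F

def preprocess_video(frames):
    # frames: (T, C, H, W) float tensor with arbitrary T, H, W
    x = frames.permute(1, 0, 2, 3).unsqueeze(0)   # (1, C, T, H, W)
    x = F.interpolate(x, size=(32, 224, 224),
                      mode="trilinear", align_corners=False)
    return x.squeeze(0).permute(1, 0, 2, 3)       # (32, C, 224, 224)
```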

LSTM Models
The two adopted two-layer LSTM models are chosen to be similar. For textual information, the output dimensions of both LSTM layers are 512. The calculation process can be expressed as

\[ t^{1}_{i} = \mathrm{LSTM}^{1}\big(t_i, t^{1}_{i-1}\big), \qquad t^{2}_{i} = \mathrm{LSTM}^{2}\big(t^{1}_{i}, t^{2}_{i-1}\big) \]

where t^1_i and t^2_i are the i-th state outputs of the first and second LSTM layers, respectively. Dropout [36] is applied to each LSTM unit but layer normalization [37] is excluded; this exclusion is justified in the experimental section. We take the last state output of the second LSTM layer as the final textual feature t_f. For visual information, the i-th frame feature v_i ∈ ℝ^{7×7×2048} is obtained from the last convolutional layer output before global average pooling in ResNet. Then, the spatial dimensions of v_i are flattened to get v'_i ∈ ℝ^{49×2048} as the input of the spatial attention module. Combined with the textual feature t_f, the spatially weighted feature ṽ_i ∈ ℝ^{512} is obtained. Similar to the feature extraction for the text sequence, the video feature sequence ṽ = {ṽ_1, ṽ_2, ..., ṽ_32} is also fed into a two-layer LSTM structure to further extract features:

\[ v^{1}_{i} = \mathrm{LSTM}^{1}\big(\tilde{v}_i, v^{1}_{i-1}\big), \qquad v^{2}_{i} = \mathrm{LSTM}^{2}\big(v^{1}_{i}, v^{2}_{i-1}\big) \]

where v^1_i and v^2_i are the i-th state outputs of the first and second LSTM layers, respectively. The video sequence feature v^2 is then input to the temporal attention module to obtain the final visual feature v_f.
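A minimal PyTorch sketch of such a two-layer LSTM encoder is shown below. The 512-d hidden size and taking the last state of the second layer follow the text; the dropout rate and the use of PyTorch's built-in between-layer dropout are our assumptions.

```python
# Sketch of the shared two-layer LSTM encoder used for text and video features.
import torch
import torch.nn as nn

class TwoLayerLSTMEncoder(nn.Module):
    def __init__(self, d_in, d_hidden=512, dropout=0.2):  # dropout rate assumed
        super().__init__()
        self.lstm = nn.LSTM(d_in, d_hidden, num_layers=2,
                            batch_first=True, dropout=dropout)

    def forward(self, x):
        # x: (batch, seq_len, d_in); returns per-step states and the final feature
        outputs, (h_n, _) = self.lstm(x)
        return outputs, h_n[-1]   # h_n[-1]: last state of the second layer

# Text: GloVe sequence (batch, n_words, 300) -> t_f (batch, 512).
# Video: attended frame features (batch, 32, 512) -> states for temporal attention.
```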

Feature for Answer Mapping
The textual feature t_f and the visual feature v_f are concatenated to obtain the multimodal feature m_f. Two fully connected layers are then applied to decode the answer. The output dimension of the first fully connected layer is 512, and that of the second varies according to the question form. In this study, the large TGIF-QA dataset used for experiments further divides VideoQA into four tasks, i.e., Count, Frame, Action, and Transition, [6] covering two question forms in total. Therefore, the model is trained and evaluated separately for each task.
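A minimal PyTorch sketch of this answer-mapping head is given below; the intermediate ReLU activation is an assumption, while the 512-d first layer and the task-dependent output size follow the text.

```python
# Sketch of the answer-mapping head: concatenation followed by two FC layers.
import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    def __init__(self, d_text=512, d_visual=512, d_out=1):
        super().__init__()
        self.fc1 = nn.Linear(d_text + d_visual, 512)
        self.fc2 = nn.Linear(512, d_out)  # d_out: 1 (Count), n (Frame), 5 (all-answers)

    def forward(self, t_f, v_f):
        m_f = torch.cat([t_f, v_f], dim=-1)        # multimodal feature
        return self.fc2(torch.relu(self.fc1(m_f)))
```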

Dataset
The TGIF-QA dataset was proposed in ref. [6], including 71,741 GIFs and 165,165 question-answer pairs. Four different types of tasks are introduced, i.e., Count, Frame, Action, and Transition. The Frame task can be answered based on a single frame, as in ImageQA, while the other three tasks need to take the full video into account. The training and test statistics of the four tasks are shown in Table 1.

Implementation Details
When the model is initialized, the pretrained ResNet and GloVe weights are loaded directly, while the other parameters are initialized by the Xavier method. [38] ResNet-50 is pretrained on ImageNet and GloVe is pretrained on Common Crawl; neither of them is updated during training.
The batch size is 8 and the optimizer is Adam. For each task, the model is trained for 10 epochs with an early stop. The learning rate increases linearly from the minimum to the maximum during the first epoch for warm-up and then decreases from the maximum to the minimum by cosine decay over the remaining 9 epochs. The maximum learning rate is 0.001 and the minimum is 0.00001.
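The schedule can be sketched as follows; applying it per optimizer step (rather than per epoch) is our assumption.

```python
# Sketch of the warm-up plus cosine-decay learning-rate schedule described above.
import math

def learning_rate(step, steps_per_epoch, lr_min=1e-5, lr_max=1e-3, epochs=10):
    warmup = steps_per_epoch                       # first epoch: linear warm-up
    total = epochs * steps_per_epoch
    if step < warmup:
        return lr_min + (lr_max - lr_min) * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```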
In the Count task, the question is about the number of repetitions of a particular action in the video, such as "How many times does the man step?". The answer is an integer from 0 to 10, so the output dimension of the last layer is 1 without any activation function. The mean square error (MSE) between the ground truth and the prediction is used as the loss function during training.
In the Frame task, the question is about the attributes of an object in the video, i.e., category, color, location, or number. Following models from the ImageQA task, we treat the Frame task as a multiclass classification problem. Specifically, we collect all n answers that appear in the training set as alternative answers. The model then solves an n-class classification problem, so the output dimension of its last layer is n with softmax. Cross-entropy is taken as the loss function during training. The question in the Action task is about the category of an action that is repeated, such as "What does the man do 5 times?", and the question in the Transition task is about a change in human action, such as "What does the woman do before turn to right side?". Both the Action and Transition tasks provide five alternative answers, so the output form is the same for both tasks from the model's perspective. As mentioned in Section 3.3, the model output form differs between the two text preprocessing methods, "question with single-answer" and "question with all-answers." For "question with single-answer," the output dimension is 1, representing the confidence of the current answer, and the hinge loss max(0, 1 + s_n − s_p) is used during training, where s_n is the score of a negative answer and s_p is the score of the positive answer in the input text. Correspondingly, for "question with all-answers," the output dimension is 5 with softmax, and the cross-entropy loss is applied during training. Note that the questions are open-ended in the first two tasks and multiple-choice in the other two. Therefore, augmenting questions with answer candidates is only done for the latter two tasks.
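The task losses can be sketched as follows; the batched tensor shapes are our assumptions.

```python
# Sketch of the per-task losses: MSE (Count), cross-entropy (Frame / all-answers),
# and the pairwise hinge loss max(0, 1 + s_n - s_p) (single-answer).
import torch
import torch.nn.functional as F

def count_loss(pred, target):
    # pred, target: (batch,) predicted and ground-truth repetition counts
    return F.mse_loss(pred, target)

def classification_loss(logits, label):
    # logits: (batch, n) class scores; label: (batch,) ground-truth indices
    return F.cross_entropy(logits, label)

def single_answer_hinge_loss(s_p, s_n):
    # s_p: (batch, 1) score of the correct candidate; s_n: (batch, k) scores of wrong ones
    return torch.clamp(1.0 + s_n - s_p, min=0.0).mean()
```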

Attention Mechanism Experiment
The model proposed in this study, TASTA, is inspired by the ST model. [6] To make a fair comparison, the attention mechanisms in the temporal and spatial dimensions are removed in turn. When removed, an average over positions replaces the weighted sum. The experimental results are shown in Table 2. In this experiment, layer normalization is employed in all LSTM structures and the text type is "question with single-answer" for the Action and Transition tasks. The evaluation metric of the Count task is the mean square error, and the evaluation metric of the other three tasks is accuracy.
Compared with the ST model, the TASTA model is improved on almost all tasks. Quantitatively, with both spatial and temporal attention enabled, the TASTA model improves over the ST model by 0.26, 6.9%, 1.26%, and 13.53% on the four tasks of Count, Frame, Action, and Transition, respectively.
On the other hand, the ablation experiment shows that in the TASTA model, the attention mechanism works well in both the temporal and spatial dimensions. The failure of the spatial attention mechanism in the ST model is not repeated. The strategy of letting text serve as guidance and supplement is reasonable and effective. Figure 5 shows some representative results with and without the attention mechanism.

Regularization Experiment
Regularization technology, i.e., layer normalization [37] in this study, is employed by default in the LSTM structures of both the textual and visual feature extractors in the TASTA model, because it is widely used and performs well in natural language processing tasks. However, VideoQA is a relatively new task, so we consider whether this regularization technique should be used. On the one hand, in theory, the no free lunch theorem [39] in machine learning makes it necessary to discuss whether layer normalization should be used. On the other hand, in practice, layer normalization was adopted in some work, [6] while it was not employed by other researchers. [11] Our experimental results are shown in Table 3. In this experiment, the text type is "question with single-answer" for the Action and Transition tasks. For the TASTA model with and without the attention mechanism, we conducted a comparative experiment that removes layer normalization from the LSTM structure. The experimental results show that the performance of the model improves on all four tasks, and the improvement on the Frame task is significant. This proves that our consideration is necessary, and the conclusion is that layer normalization is not suitable for the VideoQA task.

Multiple-Choice Experiment
In the multiple-choice tasks, we propose a new text type, "question with all-answers." In this type, all alternative answers are linked to the question with a special token as the separator. In contrast to "question with single-answer," which combines the question with a single answer, our method requires only one inference pass instead of five, which increases computational efficiency. Furthermore, the proposed "question with all-answers" type also provides more textual information. Instead of scoring a single candidate answer, the model can consider all the candidates simultaneously and output the answer.
The structure is designed inspired by the way humans perform multiple-choice tasks. Table 4 compares the "question with single-answer" and "question with all-answers" types. The model is TASTA with the attention mechanism on both the spatial and temporal dimensions. The proposed "question with all-answers" structure gives the model a significant performance improvement, up to 33.87% on the Action task and 23.06% on the Transition task.

Architecture Design
We conduct experiments to verify the effectiveness of our proposed design. LSTMs play an important role in extracting sequential features, and we discuss whether it is possible to substitute the LSTM with other networks. GRU [40] is another popular model for handling sequential data, thus we replace all the LSTMs with GRUs in our model. All the hyperparameters are kept the same as the original, and the multiple-choice tasks (Action and Transition) use the "question with all-answers" strategy. The results are listed in Table 5.

Figure 5. Some representative results in experiments. Please refer to the TGIF-QA dataset [6] for the videos; the corresponding video IDs (e.g., TRANS13919 and TRANS18334) are marked in the blue boxes.
The results remain similar when the LSTMs in our original design are substituted with GRUs. The performance improves on Action and FrameQA and falls on Transition and Count, but does not fluctuate significantly with respect to the original implementation. We argue that this is because GRU internally simplifies LSTM, and this simplification caused the changes. The phenomenon that LSTM and GRU each win on some tasks and lose on others is also reported by Chung et al., [41] and it is hard to decide under which conditions one is better than the other. Therefore, the results indicate that our design is effective and can be compatible with models other than LSTMs.

Number of LSTM Layers
As the LSTM plays a core role in our proposed model, it is necessary to decide a proper number of LSTM layers. We conducted experiments with one and three LSTM layers. The numbers of LSTM layers for text and video are kept the same, i.e., the ablated models have one LSTM layer each for text and video, or three LSTM layers each for text and video. All the hyperparameters are kept the same as the original, and the multiple-choice tasks (Action and Transition) use the "question with all-answers" strategy. The results are listed in Table 6.
The results show that two LSTM layers achieve a balance between performance and computational cost. The two-layer configuration outperforms the other two on almost all tasks and is only slightly lower on Action than the three-layer one. Thus, we argue that a network with two LSTM layers meets our requirements.

Input Features
We conducted experiments with features substituted by C3D [42] and optical flow, respectively, to verify the generality of the proposed method under different feature types. We used the pretrained features provided by Jang et al. [6] following the descriptions below. The C3D features are pretrained on the Sports-1M dataset [43] and follow two rules: with stride 1, the first frame is padded 8 times and the last frame is padded 7 times (SAME padding); each original frame is sampled as a center moment. The optical flow features share the same preprocessing as the original frames and are input to a pretrained ResNet-152 to obtain the features. Jang et al. [6] only uploaded the pooled features ("fc6" for C3D and "pool5" for optical flow) because the file size of the unpooled features is too large for publication.
We chose to use features provided by others based on two considerations: fair comparison and limited experimental conditions. Using features provided by the dataset makers avoids uncontrolled conditions and thus allows a fair comparison. Limited by our experimental conditions, we were not able to pretrain the models ourselves to obtain the C3D and optical flow features, so we turned to ready-made features.
All the hyperparameters are kept the same as the original, and the multiple-choice tasks (Action and Transition) use the "question with all-answers" strategy. The performance with these two features is shown in Table 7.
The performances are basically similar to the original, but with a slight decrease. We ascribe the degraded performance to the lack of unpooled features. Our design depends on the spatial information carried by the unpooled features, especially the spatial attention that focuses on features at different positions with text guidance, so the pooled features could adversely affect our performance. This result can be taken as evidence of the generality of our proposed model.

Comparison with State-of-the-Art Methods on the TGIF-QA
The state-of-the-art models can be divided according to their input visual features. The most basic approach uses ResNet to extract single-frame image features as network input, including CT-SAN [28] and STA. [11] Furthermore, some works consider that ResNet can only extract spatial features, so optical flow or C3D is adopted to extract temporal features as additional input. The ST model [6] takes both ResNet and C3D as feature extractors. The work in ref. [7] proposes a co-memory network, which takes the image as appearance information and optical flow as motion information. The performances of these models are shown in Table 8.
The TASTA model proposed in this study only uses ResNet. First, it is compared with the models that do not employ extra temporal features as input. Compared with the previous best results, the performance of the TASTA model on the Action and Transition tasks improves by 9.9% and 8.9%, respectively, while the performance on Count and Frame falls a little behind but is still comparable. Co-memory [7] gets a better result on the Count task because its co-memory design can store information inside the model, and the Count task potentially needs some memorization, which suits the nature of the co-memory model. ClipBERT [31] outperforms ours on the FrameQA task because ClipBERT adapts a BERT structure, which is better at aligning video and text features and thus demonstrates superiority in capturing video features that have a textual alignment. The FrameQA task focuses on finding the attributes of the object mentioned in the question and can therefore benefit from this alignment, so ClipBERT gets a better performance than ours. It is worth mentioning that we can only use ResNet-50 due to the limitations of our hardware, while they use ResNet-152. Even with a slightly weaker feature extractor, our model still achieves better performance, which further indicates that the model itself is better. If features extracted by ResNet-152 could be used as input, better results would be expected.
The ST model uses ResNet and C3D to extract features together. However, because the ST model pays too much attention to textual information rather than visual information, it does not show good performance. On the other hand, the co-memory method uses optical flow information. It makes a significant improvement on the Count task and is the only model that exceeds ours there. However, its performance on the other three tasks is not as good as TASTA; the lack of temporal attention leads to relatively worse performance for the co-memory network. Improving performance on the counting task is left for future work.

Conclusion
In this study, a novel model called TASTA is proposed for video question answering. The model adopts a text-assisted attention mechanism on both the spatial and temporal dimensions. In the details, we discuss the application of regularization in the VideoQA task and clarify that layer normalization is not applicable. For the model input, we propose a new way of connecting the question and the alternative answers, which significantly improves network performance. Carefully designed ablation studies verify the effectiveness and generality of the proposed network. On the TGIF-QA dataset, the proposed TASTA model achieves better performance than state-of-the-art methods. The proposed method may be beneficial to those who wish to inspect vast amounts of video with the guidance of natural language questions.

Table 8. Comparison with the state-of-the-art methods on the TGIF-QA. R, C, and F refer to ResNet, C3D, and optical flow features, respectively. TASTA is the model proposed in this study. The evaluation metric of the Count task is MSE, and the evaluation metric of the other three tasks is accuracy. The compared models include ST, [6] Co-memory, [7] CT-SAN, [28] STA, [11] PSAC, [30] and ClipBERT, [31] the last of which reported no Count result due to its sparse sampling strategy.