Spatial attention model-modulated bi-directional long short-term memory for unsupervised video summarisation

Compared with surveillance video, user-created videos contain more frequent shot changes, which lead to diversified backgrounds and a wide variety of content. The high redundancy among keyframes is a critical issue for existing summarising methods when dealing with user-created videos. To address this issue, we design a salient-area-size-based spatial attention model (SAM) on the observation that humans tend to focus on sizable and moving objects in videos. Moreover, the SAM is taken as guidance to refine the frame-wise soft selected probability for the bi-directional long short-term memory model. The reinforcement learning framework, trained by the deep deterministic policy gradient algorithm, is adopted for unsupervised training. Extensive experiments on the SumMe and TVSum datasets demonstrate that our method outperforms the state-of-the-art in terms of F-score.

✉ Email: zhongr@mail.ccnu.edu.cn
Introduction: Video summarisation aims to represent an original video through a brief sketch, which can be applied in video preview, storage, retrieval, and management. Ji et al. [1] first used the classical attention-based encoder-decoder structure to emphasise the salient information of images in a supervised video summarisation method. The attention mechanism can efficiently alleviate the limitation of the encoder-decoder framework, in which the scant output length of the encoder leads to the abandonment of salient features [3]. Additionally, Elhamifar proposed the classic dictionary-selection-based unsupervised video summarisation approach [4], and Mahasseni then presented a variational autoencoder-based generative adversarial network (GAN) to summarise videos [5]. Given the excellent performance of reinforcement learning in video representation, Zhou [6] subsequently presented an end-to-end long short-term memory (LSTM)-based unsupervised learning method for video summarisation, the deep reinforcement-deep summarization network (DR-DSN), which comprises a novel feedback reward covering both diversity and representativeness. Afterwards, Li [7] proposed an unsupervised cycle-consistent adversarial LSTM network for video summarisation (Cycle-SUM) by integrating a frame selector and a cycle-consistent learning-based evaluator.
Even with reinforcement learning, the above LSTM- and GAN-based unsupervised methods cannot achieve high accuracy in selecting keyframes. To enhance this accuracy, we propose a spatial attention model (SAM)-modulated bi-directional long short-term memory (Bi-LSTM), within which the SAM is established via salient-area-size-guided attention measuring, based on the fact that humans often concentrate on sizable and moving objects in a video.
Specifically, the SAM adopts the attention mechanism [8] to rank the frame-wise saliency level and then selects keyframes whose saliency level is higher than a threshold. Furthermore, the scant output length of the gate structure in the Bi-LSTM causes salient features to be ignored, which leads to low accuracy in the probability of a frame being selected as a keyframe, named the soft selected probability (SSP) in this letter. Thus, the SAM is stacked upon the Bi-LSTM as a modulation to improve the accuracy of predicting the SSP. We also adopt a reinforcement learning framework, the deep deterministic policy gradient (DDPG) algorithm [9], to backpropagate the loss, which eventually alleviates the convergence problem that occurred in DR-DSN [6].
1. To overcome the issue of high redundancy among keyframes in user-created video summarisation, we propose a SAM-modulated Bi-LSTM model, which better distinguishes differences among images to improve the accuracy of SSP prediction.
2. We propose the salient-area-size-based SAM based on the observation that sizable and moving objects attract more attention. Compared with a video summarisation model built on a single feature [6], the combination of semantic and saliency features within the proposed SAM-modulated Bi-LSTM method improves the capability of filtering redundant frames when selecting keyframes.
Problem formulation: Given a long video $X = \{x_t\}_{t=1}^{T}$, where $x_t \in \mathbb{R}^{w \times h \times c}$; $w$, $h$, and $c$ are the width, height, and number of channels of each frame, and $t$ is the frame index. $Y = \{y_t\}_{t=1}^{T}$, $y_t \in \{0, 1\}$, is the binary label indicating whether the $t$th frame is selected as a keyframe. The collection of the selected frames composes the final video summary $S$, represented by Equation (1):

$$S = \{\, x_t \mid y_t = 1,\ t = 1, \dots, T \,\} \tag{1}$$

where $y_t = 1$ indicates that the $t$th frame, with SSP $\beta_t$, is selected, and vice versa.
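As a minimal illustration of the selection rule in Equation (1), the summary is simply the subset of frames whose binary label equals 1. The function name `summarise` and the toy frame list are our own illustration, not from the letter:

```python
def summarise(frames, labels):
    """Equation (1): the summary is the subset of frames with y_t == 1."""
    return [x for x, y in zip(frames, labels) if y == 1]

frames = ["f0", "f1", "f2", "f3"]    # toy stand-ins for video frames x_t
labels = [1, 0, 0, 1]                # binary keyframe labels y_t
summary = summarise(frames, labels)  # -> ["f0", "f3"]
```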
Main components of our model: As illustrated in Figure 1, we propose a SAM-modulated Bi-LSTM model that better discriminates differences among images by integrating semantic and saliency features. First, a saliency map histogram is rendered from the saliency map generated by the saliency detector [2]. Then, the SAM is modelled by measuring the frame-wise spatial importance scores from the input saliency features. Finally, semantic features are extracted via a GoogLeNet pre-trained on ImageNet [9] and then processed by the Bi-LSTM network to predict the initial SSP.

SAM-modulated Bi-LSTM:
The SSP $\beta_t$ is modelled from the initial probability $p_t$ of the Bi-LSTM, modulated by the spatial attention $l_t$, as computed in Equation (2):

$$\beta_t = \mathcal{N}(W_p\, p_t + W_l\, l_t) \tag{2}$$

where $\mathcal{N}(\cdot) = \exp(\cdot) / \sum \exp(\cdot)$ is the normalisation operator, and $W_p$ and $W_l$ are the weights corresponding to the probability $p_t$ and the attention $l_t$, respectively.
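The modulation in Equation (2) can be sketched in a few lines of NumPy. The scalar weights and toy scores below are illustrative assumptions (in the model, $W_p$ and $W_l$ are learned), not values from the letter:

```python
import numpy as np

def modulated_ssp(p, l, w_p=1.0, w_l=1.0):
    """Equation (2) sketch: softmax-normalised combination of the Bi-LSTM's
    initial probability p_t and the SAM's spatial attention l_t."""
    s = w_p * np.asarray(p, dtype=float) + w_l * np.asarray(l, dtype=float)
    e = np.exp(s - s.max())          # numerically stable softmax over frames
    return e / e.sum()

p = [0.2, 0.8, 0.5]                  # toy initial SSP from the Bi-LSTM
l = [0.1, 0.9, 0.3]                  # toy spatial attention from the SAM
beta = modulated_ssp(p, l)           # frame-wise SSP; sums to 1
```

Note how the softmax makes the modulated SSP a proper distribution over frames, so a frame that is salient in both streams (index 1 here) receives the highest probability.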

Salient-area-size-based SAM:
Inspired by the idea that human beings frequently pay more attention to sizable moving objects in a video, we propose the saliency-feature-based SAM to formulate the spatial attention $l_t$ as Equation (3):

$$l_t = \mathcal{N}(\lambda_t) \tag{3}$$

where $\lambda_t$ is the visual importance score for the $t$th frame and $\mathcal{N}(\cdot)$ is the normalisation operator. In Figure 1, for the frame-wise saliency map $z_t$, we rank the salient regions and set those whose saliency level $r_\partial$ is lower than the threshold $r_{\hat{\partial}}$ as non-salient regions. To calculate the optimal value of $\hat{\partial}$ for determining the core salient regions, we conduct experiments ranging $\hat{\partial}$ from 20 to 25 and randomly sample 150 videos for testing. As Figure 2 shows, SAM achieves the best result in terms of F-score when $\hat{\partial} = 21$. Thus, the visual importance score $\lambda_t$ is computed as the sum of the saliency histogram over the core salient regions by Equation (4):

$$\lambda_t = \sum_{\partial = \hat{\partial}}^{25} \mathrm{bins}(r_\partial) \tag{4}$$

We define the saliency histogram for the frame-wise saliency map $z_t$ with the saliency level $r_\partial$ on the horizontal coordinate and the frequency $\mathrm{bins}(\cdot)$ on the vertical coordinate. The horizontal coordinate covers 26 saliency levels, $\{r_0, r_1, \dots, r_\partial, \dots, r_{25}\}$, $\partial \in [0, 25]$, where the class interval of each level is 10 (except the special case $r_{25}$, whose interval is 5). The frequency on the vertical coordinate is computed by the discrete function $\mathrm{bins}(\cdot)$, written as Equation (5):

$$\mathrm{bins}(r_\partial) = q_{r_\partial} \tag{5}$$

where $q_{r_\partial}$ is the number of pixels in the salient regions at level $r_\partial$. Specifically, a saliency detector [2] is adopted to detect the salient regions of moving objects and to generate a sequence of saliency maps $Z = \{z_t\}_{t=1}^{T}$, $z_t \in \mathbb{R}^{w \times h}$, denoted in Equation (6):

$$z_t = \{\, z_t(i, j) \mid 1 \le i \le w,\ 1 \le j \le h \,\} \tag{6}$$

where $(i, j)$ is the pixel-wise coordinate of $z_t(i, j)$.
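Under the histogram layout described here (26 levels over 0-255: 25 bins of width 10 plus a final bin of width 5, with the level threshold set to 21), the importance score $\lambda_t$ of Equation (4) can be sketched as follows. The toy 2x3 saliency map is our own illustration, not data from the letter:

```python
import numpy as np

def saliency_histogram(z):
    """bins(.): 26 levels over 0-255; 25 bins of width 10, last bin of width 5."""
    edges = list(range(0, 251, 10)) + [256]   # 27 edges -> 26 bins
    hist, _ = np.histogram(z, bins=edges)
    return hist

def importance_score(z, level_threshold=21):
    """lambda_t: sum of histogram frequencies at core levels >= threshold."""
    return int(saliency_histogram(z)[level_threshold:].sum())

z = np.array([[255, 240, 12],
              [  5, 230, 215]])               # toy frame-wise saliency map z_t
lam = importance_score(z)                     # pixels with value >= 210: 4
```

A frame whose saliency map concentrates many pixels in the top levels (large, bright salient regions) thus receives a high $\lambda_t$, matching the intuition that sizable moving objects attract attention.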
Training and optimisation: In this letter, we adopt the DDPG [8] to train the SAM-modulated Bi-LSTM.
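DDPG trains an actor and a critic network; as a hedged toy sketch of the deterministic policy gradient update at its core (not the letter's actual training setup), consider a one-dimensional actor optimised against a known quadratic critic:

```python
import numpy as np

# Toy deterministic policy gradient: actor mu(s) = theta * s, and a known
# critic Q(s, a) = -(a - s)^2, so the optimal policy is a = s (theta = 1).
rng = np.random.default_rng(0)
theta = 0.0
lr = 0.1

for _ in range(200):
    s = rng.uniform(0.5, 1.5)            # sampled state
    a = theta * s                        # deterministic action
    dq_da = -2.0 * (a - s)               # gradient of the critic w.r.t. action
    dmu_dtheta = s                       # gradient of the actor w.r.t. theta
    theta += lr * dq_da * dmu_dtheta     # chain rule: ascend Q through mu

# theta converges toward 1.0, i.e. the actor learns the critic's optimum
```

In the full algorithm the critic is itself a learned network trained from reward signals (here, the unsupervised summarisation reward), but the actor update follows the same chain-rule structure.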
Experimental results: For comparison with other methods, we adopt the commonly used evaluation metric, the F-score [1]. Experiments on the SumMe and TVSum user-created video datasets show that our SAM-modulated Bi-LSTM outperforms the state-of-the-art methods.
Comparison with unsupervised approaches: As Table 1 shows, the F-score of the proposed SAM-modulated Bi-LSTM is 22.95% and 1.56% higher than that of DR-DSN [6] on the SumMe and TVSum datasets, respectively. Furthermore, our method also outperforms the state-of-the-art Cycle-SUM [7], by 21.48% and 1.56% in F-score, respectively.
Interestingly, for most comparison algorithms, such as [4][5][6][7], the F-score on the SumMe dataset is considerably lower than on the TVSum dataset, which shows that these algorithms do not perform well on raw or minimally edited videos such as those in SumMe. However, the gain of 22.95% in F-score on the SumMe dataset demonstrates that our method succeeds in overcoming this drawback.
As shown in Table 2, on the SumMe and TVSum datasets, the SAM (mode 3) yields F-score gains of 21.01% and 3.2%, and the DDPG (mode 2) yields gains of 16.2% and 2.85%, respectively, compared with the baseline (mode 1: the Bi-LSTM model). Moreover, the combination of the SAM and the DDPG (mode 4) rises by 28.86% and 4.09% in F-score against mode 1.

Conclusion:
To accomplish the goal of highly efficient user-created video summarisation, we propose a SAM-modulated Bi-LSTM model that discriminates differences among images by integrating semantic and saliency features. The high redundancy among keyframes stems from the low accuracy of the SSP of the Bi-LSTM, which leads to low efficiency in unsupervised user-created video summarisation. Based on the fact that humans tend to concentrate on sizable moving objects, the SAM is established via salient-area-size-guided saliency attention. Moreover, we innovatively adopt DDPG to perform unsupervised learning under the reinforcement learning framework. Extensive experiments demonstrate that the proposed SAM-modulated Bi-LSTM model outperforms the state-of-the-art methods.