Video super-resolution with non-local alignment network

Video super-resolution (VSR) aims at recovering high-resolution frames from their low-resolution counterparts. Over the past few years, deep neural networks have dominated the video super-resolution task because of their strong non-linear representational ability. To exploit temporal correlations, most deep neural networks have to face two challenges: (1) how to align consecutive frames containing motion, occlusion and blurring, and establish accurate temporal correspondences, and (2) how to effectively fuse the aligned frames and balance their contributions. In this work, a novel video super-resolution network, named NLVSR, is proposed to solve the above problems in an efficient and effective manner. For alignment, a temporal-spatial non-local operation is employed to align each frame to the reference frame. Compared with existing alignment approaches, the proposed temporal-spatial non-local operation is able to integrate the global information of each frame by a weighted sum, leading to better alignment. For fusion, an attention-based progressive fusion framework is designed to integrate the aligned frames gradually. To penalize low-quality points in the aligned features, an attention mechanism is employed for a robust reconstruction. Experimental results demonstrate the superiority of the proposed network in terms of quantitative and qualitative evaluation; it surpasses other state-of-the-art methods by at least 0.33 dB.


INTRODUCTION
In this paper, we focus on resolving video super-resolution (VSR) via a deep learning method. VSR [1,2] refers to the reconstruction of high-resolution (HR) video sequences from their low-resolution (LR) counterparts, and is widely used in high-definition television (HDTV), human face hallucination [3][4][5], remote sensing [6][7][8], etc. Compared with single image super-resolution (SISR), VSR is provided with huge temporal redundancies existing in consecutive frames, which are crucial for the success of SR reconstruction. With the dramatic developments in deep learning (DL), deep neural networks (DNNs) [9,10] have become one of the most popular approaches to solving the VSR problem over the past few years. For SISR, recent studies [11][12][13][14][15][16][17][18][19] mainly paid attention to designing an effective network structure to fully exploit the inherent information in LR images. In [11], Dong et al. first introduced the convolutional neural network (CNN) into SISR to build a direct mapping from LR images to HR images, which achieved a significant improvement. For VSR, most approaches first align the neighbouring frames to ensure temporal consistency with the reference frame (i.e. the central frame to be recovered). Some approaches [27,28] employed off-the-shelf ME&MC (motion estimation and motion compensation) algorithms [37][38][39] before their networks, but these independent motion estimation methods made it difficult to find a globally optimal solution for SR. Besides, these off-the-shelf ME&MC algorithms were generally time-consuming compared to the inference speed of DNNs. Therefore, some end-to-end networks [29][30][31][32][33][34][35][36] were proposed to integrate ME&MC as part of the network, at the cost of introducing extra training difficulty.
In general, ME&MC-based approaches have two main disadvantages: (1) it is highly challenging for them to estimate an accurate motion field, for instance in the case of large-scale motions, and (2) even with a high-quality estimated optical flow, the warped frames still suffer from severe artifacts caused by motion compensation, which would be propagated into the final super-resolved frames.
Therefore, some recent studies [40][41][42][43][44][45] tried to perform alignment in a non-ME&MC or implicit ME&MC manner. DUF [40] utilized learned dynamic up-sampling filters and residual images to super-resolve the LR input frames. The 3D convolution layers used in this network effectively modelled the temporal correlation among neighbouring frames, but at the expense of a high computational complexity. TDAN [41] introduced deformable convolution to replace explicit ME&MC in its alignment module. Inspired by this work, EDVR [42] integrated pyramidal processing and cascading refinement into the deformable convolution operation to tackle complex motions and large parallax. Recently, the non-local block [46] was also adopted in VSR to capture long-range dependencies among frames [43]. This operation calculates the response at one position as the weighted sum of all positions in the feature maps. Yi et al. [43] claimed that the non-local operation shares a similar purpose to ME&MC in VSR, and reported state-of-the-art performance with the help of the non-local block. This proved the non-local block to be a promising method for modelling temporal dependency. However, the non-local block in [43] follows the original idea proposed in [46], which means that it does not aim at aligning each frame to the reference frame, but tries to enhance each frame itself using the temporal correlation. Intuitively, neighbouring frames are expected to provide the reference frame with supporting information rather than information about themselves. Therefore, this may leave some room for improvement in exploiting the non-local operation for alignment.

Fusion
After aligning each neighbouring frame, another crucial problem is how to fuse them properly. A straightforward approach is to use one convolution layer to integrate them directly, but this direct fusion typically results in limited performance. Recently, several studies [33,43,47] showed that fusing multiple frames in a progressive framework provides better restoration quality. However, the approaches mentioned above all treat each frame equally without considering its visual informativeness. This means that locations or frames with low quality make the same contribution to the result as others. To address this, an attention mechanism was applied in [42] to measure how informative each point is and then assign different weights to them. Although this approach succeeded in introducing the attention mechanism to the fusion process, it is essentially realized in a direct fusion manner.
In this work, we propose an end-to-end network for VSR, named NLVSR, to address the problems in existing alignment and fusion modules. For alignment, a temporal-spatial non-local (TSNL) block is designed to capture the temporal dependency among neighbouring frames and the spatial dependency within the reference frame itself. Different from the non-local block used in [43], the proposed TSNL block aims at aligning each frame (including the reference frame itself) to provide supporting information for the reconstruction of the reference frame. All points on one frame are summarized with carefully designed weights to represent the target point on the reference frame. For fusion, an attention-based progressive fusion (APF) module is proposed to integrate the aligned frames gradually. The proposed APF module consists of a series of attention-based progressive fusion blocks (APFBs). Each of them fuses multiple aligned frames with the help of temporal attention, and then returns the fused features back to refine each frame. After several APFBs, the LR fused features are up-scaled to generate the HR output. An ablation study is performed to verify the effectiveness of the TSNL and APF modules. Comparative experiments demonstrate that the proposed NLVSR network yields state-of-the-art SR reconstruction quality.
The main contributions of our work are as follows. (1) Different from the previous non-local operation used in VSR, whose aim is enhancing each frame itself, we design a novel non-local block, i.e. the TSNL block, to align each frame to the reference frame. This non-local operation is capable of capturing the long-range dependency between each neighbouring frame and the reference frame, and of extracting supporting information from the neighbouring frames to help the reconstruction of the central frame. We also provide a detailed comparison between the proposed TSNL and other alignment approaches in Section 3 to explain the superiority of TSNL. (2) We introduce the temporal attention mechanism into a progressive fusion framework. Compared with direct fusion, this progressive framework behaves better in fully extracting the temporal information among multiple aligned features. In addition, the attention mechanism in each APFB offers a high degree of flexibility in dealing with information of different quality, avoiding treating it all equally.
The rest of this paper is organized as follows. In Section 2, we briefly review the related works of SR and non-local operation. Section 3 presents a detailed illustration for NLVSR and a comparative discussion between several state-of-the-art methods. In Section 4, experimental results are provided to verify the superiority of our approach. At last, we draw a conclusion in Section 5.

Single image super-resolution with CNNs
Over the past few years, CNN-based methods have dominated the single image/video super-resolution tasks because of their strong non-linear representational ability. CNNs were first introduced to SISR in the form of a three-layer network [11] to learn an end-to-end mapping from LR images to HR ones; however, the shallow architecture limited the performance of the network. To address this, Kim et al. proposed VDSR [12] and DRCN [13] with 20 layers to use more contextual information from the input images. In addition, residual learning [24] was introduced in these two networks to overcome the training difficulty caused by the deep network structure. Tai et al. took a further step toward deeper architectures by introducing recursive blocks in DRRN [14], achieving improved SR quality. Note that all of these approaches took interpolated HR images as input, which resulted in a huge computational complexity and over-smoothed results. To address this, Shi et al. [16] shifted the up-scaling step to the end of the network with an efficient sub-pixel convolution layer. This specialized convolution layer up-scales the LR features into the final HR output, allowing the feature extraction before it to be performed at the LR spatial size. This efficient method has been widely adopted not only in subsequent SISR networks but also in recent VSR methods. Besides simply stacking convolution layers to reach deeper architectures, some studies achieved significant improvements by making full use of the extracted features. In [17], a densely connected network was employed so that each layer was able to receive information from all previous layers, which means the network could reuse low-level features to provide more information [25]. In [18], Zhang et al. further combined local/global residual learning and dense connections to reach an improved performance. In addition to reusing previous features, Zhang et al. also suggested that high-frequency channel-wise features should receive more attention because they are more informative than low-frequency ones for HR restoration; thus they introduced a channel-attention mechanism in [19], which achieved remarkable SR performance. Refer to [48] for more information about SISR.
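The sub-pixel convolution layer of [16] mentioned above can be summarized as a channel-to-space rearrangement. The following NumPy sketch (an illustration by us, not the authors' code; the function name is ours) shows how an (H, W, C·r²) LR feature map becomes an (Hr, Wr, C) HR output without interpolation:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange an (H, W, C*r^2) feature map into an (H*r, W*r, C) output.

    This mirrors the sub-pixel convolution layer of ESPCN [16]: the last
    convolution produces C*r^2 channels at LR resolution, and this
    reshuffle interleaves them into the HR spatial grid.
    """
    h, w, cr2 = x.shape
    c = cr2 // (r * r)
    x = x.reshape(h, w, r, r, c)       # split channels into r*r sub-positions
    x = x.transpose(0, 2, 1, 3, 4)     # (H, r, W, r, C)
    return x.reshape(h * r, w * r, c)

# 8x8 LR features with 4*2^2 = 16 channels -> 16x16 HR features, 4 channels
lr_feat = np.random.rand(8, 8, 16).astype(np.float32)
hr_feat = pixel_shuffle(lr_feat, r=2)
print(hr_feat.shape)  # (16, 16, 4)
```

Each LR position thus contributes an r × r patch of HR pixels, which is why feature extraction can stay at LR size until the very end of the network.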

Video super-resolution with CNNs
Compared with SISR, the VSR task is provided with multiple consecutive frames which are rich in temporal redundancy. Thus, the main challenge of VSR methods lies in how to exploit the temporal correlation among video frames. One popular solution in the literature is to align and warp consecutive frames and then feed all of them into an SR network. Following this idea, Liao et al. [27] and Kappeler et al. [28] performed ME&MC directly with off-the-shelf optical flow approaches [37,38,39], and then fed the compensated frames to their SR networks. One problem with these two methods is that the motion estimation is independent of the SR network, which makes it difficult to find a globally optimal solution for HR restoration. In addition, compared with the inference speed of DNNs, classical optical flow approaches are usually computationally expensive. Therefore, recent studies [29][30][31][32][33][34][35][36] preferred an end-to-end structure with integrated ME&MC. The first end-to-end VSR network was proposed in [29], which was composed of a spatial transformer network for motion estimation and a spatio-temporal ESPCN [16] for super-resolution. Similarly, Tao et al. [30] exploited sub-pixel information with a sub-pixel motion compensation (SPMC) layer, and then designed an encoder-decoder network with LSTM to recover the centre frame. Liu et al. [31] designed an adaptive architecture to accept different numbers of LR inputs and a robust spatial alignment network to align them. In [32], a task-oriented flow method was proposed to achieve a better recovery quality in different contexts. Sajjadi et al. [33] created a frame-recurrent architecture which utilized the last super-resolved frame to help the restoration of the current frame. In [34], a multi-memory network was proposed to extract temporal information from consecutive frames with several ConvLSTM layers [49]. In order to avoid the resolution conflict between LR optical flows and HR outputs, Wang et al. [35] designed a novel VSR network to estimate HR optical flow in a coarse-to-fine manner. Li et al. [36] proposed a motion compensation network with a pyramid structure and adopted channel and spatial attention mechanisms in the following reconstruction network. Note that the methods above all take explicit ME&MC as part of their networks. However, it is quite difficult to train an integrated and often low-capacity ME&MC layer under the current state-of-the-art schemes. Therefore, some studies tried to perform motion estimation in an implicit but more effective manner. Jo et al. [40] utilized 3D convolution layers to model the temporal correlations among frames. Tian et al. [41] and Wang et al. [42] introduced deformable convolution layers to replace explicit ME&MC in their alignment modules. Wang et al. [42] also proposed a specialized attention module to emphasize important features along the spatial and temporal dimensions during feature fusion. Yi et al. [43] utilized the non-local block to capture temporal dependency and designed a progressive architecture to fuse aligned frames gradually.

Non-local block
Non-local block [46] was designed to capture the long-range dependencies among features, inspired by the classical non-local means algorithm [43]. Figure 1(a) is a basic illustration of the non-local block, where T, H, W and C represent the frame number, height, width and channels of the video signal, respectively. Mathematically, the response at a position in the non-local operation can be defined as

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j),$$

where x and y refer to the input and output data, which have the same size, i is the index of the current position to be computed, and j indexes all possible locations in x.
Examples for different non-local blocks. The feature maps are denoted in the form of their shapes, such as T × H × W × C . Proper reshaping is performed when needed. Each blue block represents one convolutional layer whose kernel size is shown in the block. (a) The original non-local block proposed in [46]. (b) The modified non-local block for VSR in [43]. The input and output are the concatenations of all frames. (c) The proposed TSNL block. The input only includes one neighbouring frame and the reference frame, and the output is the aligned neighbouring frame.
The function g(⋅) is a linear mapping, and C(⋅) normalizes the response. The pairwise function f(⋅) measures the correlation between x_i and x_j. A common choice of f(⋅) is the embedded Gaussian function defined as

$$f(x_i, x_j) = e^{\theta(x_i)^{\mathsf T} \phi(x_j)},$$

where θ(⋅) and φ(⋅) are two embeddings that can be realized by convolutional layers. In summary, the main idea behind the non-local block is to compute the response at one point as the weighted sum of the features at all positions, where the weight of each feature is decided by its correlation with the target point. Yi et al. [43] suggested that the non-local block serves a similar function to ME&MC since both can capture the temporal dependency across frames. Therefore, they introduced an improved non-local block to the VSR task, as shown in Figure 1(b). One problem with the non-local block in VSR is the huge THW × THW matrix generated by θ(x_i)^T φ(x_j) (denoted as R in Figure 1). For example, when five LR frames of size 480 × 270 are taken as input, the dimension of R would be 648,000 × 648,000, which is prohibitive to compute or store on many currently available GPUs. Thus, two measures are introduced in [43] to make the non-local block more lightweight: (1) combine the time dimension with the channel dimension (i.e. reshape the input from T × H × W × C to H × W × CT), which makes the matrix R independent of the frame number, and (2) reduce the spatial dimensions to deepen the channel dimension with a down-scaling factor r (i.e. reshape the input from H × W × CT to (H/r) × (W/r) × CTr²). As a result, R finally becomes a matrix of size (HW/r²) × (HW/r²), which is much smaller than THW × THW.
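The embedded-Gaussian non-local operation above can be sketched in a few lines of NumPy (a non-authoritative illustration; the random weight matrices stand in for the 1 × 1 convolutions of the original block, and the spatial grid is flattened so those convolutions reduce to matrix multiplies):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_block(x, W_theta, W_phi, W_g, W_z):
    """Embedded-Gaussian non-local operation on a flattened feature map.

    x: (P, C) features, one row per spatial position. The response at
    position i is a softmax-weighted sum of g(x_j) over all positions j,
    followed by a linear mapping and a residual connection, as in [46].
    """
    theta, phi, g = x @ W_theta, x @ W_phi, x @ W_g
    R = theta @ phi.T               # (P, P) pairwise correlation matrix
    y = softmax(R, axis=-1) @ g     # normalized weighted sum of g(x_j)
    return y @ W_z + x              # residual connection

P, C = 64, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((P, C)).astype(np.float32)
Ws = [rng.standard_normal((C, C)).astype(np.float32) * 0.1 for _ in range(4)]
out = nonlocal_block(x, *Ws)
print(out.shape)  # (64, 16)
```

The P × P matrix R is exactly the object whose THW × THW size motivates the two lightweight reshaping measures described above.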

METHODOLOGY
We first describe the overall architecture of the proposed NLVSR network. As shown in Figure 2, the input of NLVSR is composed of 2N + 1 consecutive LR frames, where the central frame (i.e. the reference frame) is the target frame to be super-resolved, and the others are neighbouring frames used to provide supporting information. The whole network consists of four parts: the feature extraction module, the temporal-spatial non-local (TSNL) module, the attention-based progressive fusion (APF) module and the up-scaling module. At the beginning of the network, several stacked convolution layers are used to extract features from each input frame, as shown in Figure 3. Then, the TSNL module aligns each frame to the reference frame via a specialized non-local block. After that, the aligned features are fused gradually in the APF module with the help of the temporal attention mechanism. At the end of the network, the fused features are up-scaled to an HR frame using the sub-pixel magnification layer from [16]. The final SR frame is obtained by adding this HR frame to the bicubically up-sampled reference frame. The details of the TSNL and APF modules are presented in the following subsections.

Alignment with temporal-spatial non-local block
In this work, we propose a novel non-local block, named the TSNL block, for alignment in VSR. The TSNL block is applied frame by frame, which means its input only involves the extracted features from the reference frame F_t and one neighbouring frame F_{t+n} (|n| ≤ N).

FIGURE 2
The architecture of the proposed NLVSR. The output is the super-resolved reference frame. The blue blocks represent the main modules in NLVSR, and the green tensors refer to the output of each module.

FIGURE 3
The architecture of the feature extraction module. For each LR frame, several stacked convolutional layers are used to extract features from it. The convolutional layers used for each frame share the same weights.

The non-local operation in the TSNL block can be formulated as

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(r_i, x_j)\, g(x_j),$$

where r_i and x_j represent the points on F_t and F_{t+n}, respectively, and y_i is the response at position i. We choose the embedded Gaussian function for f(⋅) and a linear function for g(⋅). The function C(x) is set as ∑_{∀j} f(r_i, x_j) so that the normalization can be realized by a softmax layer in the practical implementation [46].
As shown in Figure 1(c), we adopt a similar measure to [43] in the TSNL block to avoid a huge matrix R, reducing the spatial dimensions with a down-scaling factor r (i.e. the second measure mentioned in Section 2.3). Thus, as shown in Figure 1(c), F_t and F_{t+n} are first reshaped to the size of (H/r) × (W/r) × Cr². After that, f(r_i, x_j) is used to generate the weight for each point in the neighbouring frame. According to (3), f(⋅) is a pairwise function computing the correlation between r_i and x_j: if x_j is close to r_i, g(x_j) will be encouraged with a large weight f(r_i, x_j). In summary, the non-local operation in TSNL integrates the global information of the neighbouring frame F_{t+n} using a weighted sum of g(x_j), and pushes the response y_i towards the target point r_i on the reference frame. This is consistent with the purpose of most alignment methods in VSR, which align neighbouring frames to establish accurate correspondences with the reference frame. To better understand the behaviour of the proposed TSNL block, a visual illustration is presented in Figure 4(d). At last, the output of the TSNL block, namely the aligned feature F^a_{t+n}, is obtained by

$$F^a_{t+n,i} = w(y_i) + x_i,$$

where w(⋅) is a linear mapping realized by a 1 × 1 convolution layer, and "+ x_i" refers to a residual connection in the non-local block, the same as in [46]. Although the TSNL block is applied to each frame, the parameters are shared across them so that the model size is not related to the number of input frames. In addition to the temporal dependency, the TSNL block is also used to exploit the spatial dependency within the reference frame itself. In other words, we also allow n = 0 in F_{t+n}, in which case the x_j in Equation (3) comes from the reference frame. The resulting features F^a_t provide access to the self-similarity of the reference frame, which is similar to some non-local matching methods [50][51][52] applied in image processing.
The effectiveness of this self-similarity is discussed in Section 4. Due to the participation of F a t , the number of aligned features fed into the following fusion module is 2N + 2 in total, including 2N + 1 aligned frames F a t +n and the original reference frame F t . Here, we also offer detailed comparisons between the proposed TSNL module and several alignment methods.
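Before turning to these comparisons, the TSNL operation described above can be sketched in NumPy (an illustration under our own naming, not the authors' code; random matrices stand in for the embedding convolutions, and the spatial grid is flattened after the down-scaling reshape):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tsnl_align(ref, nbr, W_theta, W_phi, W_g, W_z):
    """Align one neighbouring frame to the reference frame.

    ref, nbr: (P, C) flattened features of F_t and F_{t+n}. Unlike the
    original non-local block, the query r_i comes from the reference
    frame while the keys/values x_j come from the neighbouring frame, so
    each output position is a weighted sum over ALL positions of the
    neighbouring frame, pushed towards the reference content.
    """
    q = ref @ W_theta                 # theta(r_i)
    k = nbr @ W_phi                   # phi(x_j)
    v = nbr @ W_g                     # g(x_j)
    w = softmax(q @ k.T, axis=-1)     # f(r_i, x_j) / C(x), row-normalized
    y = w @ v                         # weighted sum over all x_j
    return y @ W_z + nbr              # w(y_i) plus residual connection

P, C = 64, 16
rng = np.random.default_rng(0)
ref, nbr = rng.standard_normal((2, P, C)) * 0.1
Ws = [rng.standard_normal((C, C)) * 0.1 for _ in range(4)]
aligned = tsnl_align(ref, nbr, *Ws)
print(aligned.shape)  # (64, 16)
```

Setting `nbr = ref` reproduces the n = 0 case, in which the block exploits the self-similarity of the reference frame.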
(1) TSNL vs explicit ME&MC: ME&MC methods [27][28][29][30][31][32][33][34][35][36] assume that the optical flow between two frames is pixel-wise dense, which means each pixel can be displaced to a new position on the reference frame [43]. After displacing each pixel, interpolation is conducted to bring each pixel back onto a regular grid. Figure 4(a) gives a visual illustration of ME&MC; in this figure, the new position of each pixel on the reference frame is assumed to fall on a regular grid for a convenient explanation.

FIGURE 4
Visual illustrations of different alignment approaches. (a) Explicit ME&MC. (b) Deformable convolution; here, we take a 3 × 3 kernel as an example so that K = 9. (c) The non-local block for VSR in [43]. (d) The proposed TSNL block. Only two points on the neighbouring frame are illustrated in (c) and (d) for simplicity; in fact, all points contribute to the generation of y_i.
From Figure 4(a), (d), it can be observed that the main difference between the TSNL block and ME&MC methods is that, in ME&MC, the response at one position on the aligned frame is decided by only one point on the neighbouring frame, whereas in the TSNL block it is decided by a weighted sum of all points on the neighbouring frame. Thus, compared with ME&MC, TSNL has two advantages: (1) ME&MC tries to describe a one-to-one relationship between the points on the neighbouring frame and the reference frame through an estimated optical flow, which is hard to make error-free in the presence of occlusion and blurring. TSNL does not have to figure out this difficult one-to-one relationship because all points on the neighbouring frame contribute to each response on the aligned frame. (2) For 'bad' points affected by noise or blurring on the neighbouring frame, the one-to-one relationship in ME&MC would preserve them during motion compensation and let them survive in the aligned frame. In the TSNL block, by contrast, the response at each position is derived by summarizing all points with carefully designed weights, which weakens the influence of these 'bad' points.
(2) TSNL vs deformable convolution: Deformable convolution is employed in [41,42] as an implicit ME&MC method for alignment. In this layer, each point p_0 on the aligned frame F^a_{t+n} is derived by

$$F^a_{t+n}(p_0) = \sum_{k=1}^{K} w_k \cdot F_{t+n}(p_0 + p_k + \Delta p_k),$$

where p_k ∈ {(−1, −1), (−1, 0), ⋯, (1, 1)} denotes the pre-specified offsets of the K sampling positions, w_k is the corresponding kernel weight, and Δp_k is the learnable offset predicted from the concatenated frames [F_{t+n}, F_t]. Figure 4(b) presents a visual illustration of deformable convolution. In fact, deformable convolution can be viewed as a compromise between explicit ME&MC and the TSNL block.
To represent one point on the reference frame, ME&MC methods involve only one point from the neighbouring frame, deformable convolution involves several points around the target position as shown in Figure 4(b), and the TSNL block involves all points on the neighbouring frame. Therefore, compared with deformable convolution, the TSNL block has the following advantages: (1) deformable convolution obtains information from several points within a local receptive field, whereas the TSNL block aggregates global information from the whole neighbouring frame; (2) deformable convolution must learn to select several suitable points to represent the target point on the reference frame, which is similar to motion estimation (where only one point is selected), and both of these selection problems are challenging, whereas TSNL does not have to solve them.
(3) TSNL vs NLRB: In [43], a modified non-local block (hereinafter referred to as "NLRB") was proposed to replace ME&MC in VSR. It is reported that this non-local operation does not aim at aligning neighbouring frames but at capturing the long-range dependencies across frames. Figure 4(c) presents a visual illustration of NLRB. Although the proposed TSNL block and NLRB [43] both belong to the family of non-local operations, their core ideas are fundamentally different. As shown in Figure 4(d), the main target of the TSNL block is to align each frame to the reference frame. Thus, only the points on the current neighbouring frame (e.g. F_{t+n} in Figure 4(d)) are considered, and the weight of each of them is decided by its correlation with the target point on the reference frame (e.g. the point r_i on the reference frame F_t shown in Figure 4(d)). In contrast, as shown in Figure 4(c), NLRB involves the points from all frames when calculating the response for one frame (e.g. the frame F_{t+n} in Figure 4(c)), and the weight of each point is decided by its correlation with the target point on the current frame (e.g. the point x_i on the frame F_{t+n} shown in Figure 4(c)), not the reference frame. This means NLRB merely utilizes the long-range dependencies across frames to refine each frame. Intuitively, we hope neighbouring frames provide supporting information for the reference frame instead of information about themselves. Therefore, we believe that the TSNL block is more promising for alignment.
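To make the contrast with deformable convolution concrete, the sketch below computes a single output position of a 3 × 3 deformable sampling step (our own simplified illustration: single channel, and nearest-pixel rounding instead of the bilinear interpolation that real deformable layers use for fractional offsets):

```python
import numpy as np

def deform_sample(feat, p0, offsets, weights):
    """One output position of a 3x3 deformable convolution (single channel).

    feat: (H, W) feature map of the neighbouring frame F_{t+n};
    p0: (row, col) target position; offsets: (9, 2) learned offsets dp_k
    added to the fixed grid p_k; weights: (9,) kernel weights w_k.
    Real layers interpolate bilinearly; this sketch rounds to the
    nearest pixel and clips at the border for brevity.
    """
    H, W = feat.shape
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # p_k, K = 9
    out = 0.0
    for (py, px), (oy, ox), wk in zip(grid, offsets, weights):
        r = min(max(int(round(p0[0] + py + oy)), 0), H - 1)
        c = min(max(int(round(p0[1] + px + ox)), 0), W - 1)
        out += wk * feat[r, c]
    return out

feat = np.arange(25, dtype=float).reshape(5, 5)
# Zero offsets and uniform weights reduce to a plain 3x3 average
val = deform_sample(feat, (2, 2), np.zeros((9, 2)), np.ones(9) / 9)
print(round(val, 6))  # 12.0
```

Non-zero offsets shift the nine samples off the regular grid, which is exactly the learned "select a few suitable points" behaviour contrasted with TSNL's all-points weighted sum.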

Progressive fusion with temporal attention
In this section, we introduce an attention-based progressive fusion (APF) module to integrate multiple aligned features. This module is composed of a series of APFBs. Each of them fuses the temporal information from multiple frame features and then, in turn, utilizes the fused information to refine each aligned frame. At last, the features of all frames are concatenated as the output of this module.
The structure of one APFB is shown in Figure 5, which is designed in a multi-channel fashion. The input of this block includes 2N + 2 features (2N + 1 from the aligned frames and one from the reference frame). After the first convolution layer, all of them are integrated through a temporal attention block, which can be described as

$$I_a = FA\big(C_1(I^0_{t-N}), \ldots, C_1(I^0_{t+N}), C_1(I^0_r)\big),$$

where I^0_{t+n} and I^0_r refer to the features from the 2N + 1 aligned frames and the reference frame, respectively, C_1(⋅) represents the first convolution layer in the APFB, FA(⋅) denotes the temporal attention block, and I_a is the integrated feature obtained from it. Then, this integrated information is concatenated with each frame feature and output after the second convolution; a skip connection is applied in each channel for residual learning. The final output of the APFB can be defined as

$$I^{out}_{t+n} = C_2\big(\big[C_1(I^0_{t+n}),\, I_a\big]\big) + I^0_{t+n},$$

where C_2(⋅) represents the second convolution layer and [⋅, ⋅] refers to the concatenation operation. Note that C_1(⋅) and C_2(⋅) in different channels share the same convolution kernels in order to keep the number of parameters independent of the number of input frames.

When integrating the information from multiple frames, we employ a temporal attention mechanism, i.e. FA(⋅) in Equation (8), based on the observation that pixels from different frames are not equally informative. Some of them carry less information for reconstruction or are affected by alignment errors, which may have a bad influence on the following restoration. Thus, the main target of this attention mechanism is to evaluate each pixel from the different frames and penalize the pixels with low quality. Figure 6 illustrates how this temporal attention works. For each frame, we use the reference frame as a criterion, and evaluate the quality of each pixel based on its similarity to the point at the same location on the reference frame. As shown in Figure 6, the similarity is defined as

$$S_{t+n,i} = \mathrm{sigmoid}\big(p(I^1_{r})_i^{\mathsf T}\, q(I^1_{t+n})_i\big),$$

where I^1_{t+n} = C_1(I^0_{t+n}) and I^1_r = C_1(I^0_r) denote the features after the first convolution, p(⋅) and q(⋅) are two embeddings, and the sigmoid(⋅) function is used to limit the outputs to [0, 1].
Then, this similarity is multiplied with the feature I^1_{t+n} in a pixel-wise manner:

$$\tilde{I}_{t+n} = S_{t+n} \odot I^1_{t+n},$$

where ⊙ represents the pixel-wise product. This operation enforces a penalizing weight on each pixel according to its similarity with the corresponding pixel on the reference frame. At last, a convolutional layer is used to integrate all frames into I_a, as shown at the top of Figure 6. Note that the temporal attention has a similar architecture to the non-local block, but their targets are completely different: for the non-local block, the main goal is computing a weighted sum of all points to capture the dependency between the reference frame and one neighbouring frame; for the temporal attention, the main goal is penalizing low-quality points with a weight smaller than 1. Accordingly, there is no normalization in the temporal attention.
In summary, we design a progressive fusion framework by stacking several APFBs. Each block further fuses the information from the multiple aligned frames, and the temporal attention mechanism helps to penalize low-quality pixels, which further improves the reconstruction performance. The effectiveness of this fusion strategy is verified in Section 4.
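The attention-based fusion described in this section can be sketched as follows (a simplified illustration under our own naming: plain matrix multiplies stand in for the convolutions C_1, C_2 and the embeddings p, q, and the spatial grid is flattened):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def temporal_attention(frames, ref, W_p, W_q):
    """FA(.): weight every pixel of every frame by its similarity to the
    reference, then fuse by channel-wise concatenation.

    frames: list of (P, C) features after the first mapping; ref: (P, C)
    reference features. Each pixel's weight is sigmoid(<p(ref), q(frame)>),
    so misaligned or low-quality pixels get weights below 1; there is no
    normalization across frames.
    """
    p_ref = ref @ W_p
    weighted = [sigmoid((p_ref * (f @ W_q)).sum(1, keepdims=True)) * f
                for f in frames]
    return np.concatenate(weighted, axis=1)

def apfb(feats, ref_feat, W1, Wa, W2, W_p, W_q):
    """One attention-based progressive fusion block (sketch).

    feats: 2N + 2 input features of shape (P, C). A shared mapping W1
    stands in for the first convolution, the attended fusion is reduced
    by Wa, concatenated back to each frame, mapped by W2 (standing in for
    the second convolution), and added to the input via a skip connection.
    """
    h = [f @ W1 for f in feats]
    ia = temporal_attention(h, ref_feat @ W1, W_p, W_q) @ Wa  # (P, C)
    return [np.concatenate([hi, ia], axis=1) @ W2 + f
            for hi, f in zip(h, feats)]

P, C, T = 16, 8, 4                      # T = 2N + 2 input features
rng = np.random.default_rng(1)
feats = [rng.standard_normal((P, C)) * 0.1 for _ in range(T)]
W1, W_p, W_q = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
Wa = rng.standard_normal((C * T, C)) * 0.1
W2 = rng.standard_normal((2 * C, C)) * 0.1
out = apfb(feats, feats[-1], W1, Wa, W2, W_p, W_q)
print(len(out), out[0].shape)  # 4 (16, 8)
```

Stacking several such blocks, each refining every frame with the shared fused feature, yields the progressive fusion behaviour; the actual module operates on 2D feature maps with shared convolution kernels across channels.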

Implementation details
There are five stacked convolutional layers in the feature extraction module, whose kernel size is set to 3 × 3. The APF module contains 20 APFBs in total. The following up-scaling module is the same as in [16]. We set the channel size to 64 throughout the network except for the up-scaling module. Leaky ReLU [53] is applied as the activation function after each convolution layer. The number of input frames is set to 7 by default (i.e. N = 3), and the output is only the central frame. During training, we employ the Charbonnier penalty function [54] as the loss function, which can be formulated as

$$\mathcal{L} = \sqrt{\lVert H - SR(I) \rVert^2 + \varepsilon^2},$$

where I refers to the LR input frames, H represents the target reference frame, SR(⋅) denotes the SR network, and ε is a constant set to 1 × 10⁻³. The spatial size of the training frames is set to 32 × 32, and the batch size is 16. We adopt the Adam optimizer [55] with β₁ = 0.9 and β₂ = 0.999 to train the proposed network. The initial learning rate is set to 5 × 10⁻⁴ and finally decays to 1 × 10⁻⁵. We first pre-train our network without the temporal attention in the APF module, and then use the pre-trained parameters as the initialization to train the whole network. Refer to [9,10] for more details about network training and testing. The whole experiment is conducted in TensorFlow with one NVIDIA Titan Xp GPU.
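The Charbonnier penalty is a one-liner; the sketch below (our own illustration, with the global-norm form assumed; some implementations instead average a per-pixel sqrt((x)² + ε²)) shows it in NumPy:

```python
import numpy as np

def charbonnier_loss(sr, hr, eps=1e-3):
    """Charbonnier penalty [54] between the super-resolved frame SR(I)
    and the target H: sqrt(||H - SR(I)||^2 + eps^2), with eps = 1e-3 as
    in the training setup. Unlike the L2 loss, the square root keeps
    gradients well-behaved near zero error, reducing over-smoothing.
    """
    return np.sqrt(np.sum((hr - sr) ** 2) + eps ** 2)

hr = np.random.rand(32, 32)
loss = charbonnier_loss(hr.copy(), hr)
print(loss)  # ~0.001: the loss floors at eps for a perfect prediction
```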

Datasets
For training, we use the video sequences collected in [34], which include 522 HD video clips for training and 20 for validation. Most of them are chosen from documentaries with little postprocessing. For testing, we choose two representative datasets: Vid4 [29] and udm10 [33,56]. The Vid4 dataset is widely used for VSR evaluation. It includes four video sequences whose spatial sizes are smaller than 720 × 576. The udm10 dataset has a larger spatial size (1272 × 720) and is composed of 10 video sequences. Following [40,43], we employ a 13 × 13 Gaussian blur kernel on each HR frame, and set the standard deviation as 1.6. Then, the LR input frames are obtained by down-sampling these blurry HR frames. To enhance the performance of the network, the data augmentation from [18,57] is applied during the training process, which randomly flips and rotates input frames before they are fed into the network.
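The LR frame generation described above (13 × 13 Gaussian blur with σ = 1.6, then ×4 down-sampling, following [40, 43]) can be sketched as follows, assuming SciPy is available; the function names are ours:

```python
import numpy as np
from scipy.signal import convolve2d

def gaussian_kernel(size=13, sigma=1.6):
    """Normalized 2-D Gaussian kernel used to blur each HR frame."""
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def degrade(hr, scale=4):
    """Blur an HR frame with the 13x13 Gaussian and down-sample by taking
    every `scale`-th pixel, producing the LR input."""
    blurred = convolve2d(hr, gaussian_kernel(), mode="same", boundary="symm")
    return blurred[::scale, ::scale]

hr = np.random.rand(64, 64)
lr = degrade(hr)
print(lr.shape)  # (16, 16)
```

Blurring before decimation suppresses aliasing, which is why this degradation is considered closer to real camera acquisition than plain bicubic down-sampling.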

Comparisons with state-of-the-art methods
Here, we compare our network with nine state-of-the-art VSR methods: VESPCN [29], RVSR-LTD [31], MCResNet [58], DRVSR [30], FRVSR [33], DUF_52L [40], RBPN [59], EDVR [42] and PFNL [43]. Note that most of them [29,31,42,58,59] obtain LR frames by bicubically down-sampling HR frames, whereas some recent methods [30,33,40,43] apply a Gaussian blur to the HR images before down-sampling. Clearly, the latter strategy is closer to reality. However, different down-sampling approaches make a fair comparison difficult, since the relationship between LR and HR frames is changed. To tackle this problem, Yi et al. [43] retrained several popular VSR methods [29,30,31,33,40,58] under a unified condition. We present part of the experimental results from [43] in Tables 1 and 2. To make a fair comparison with these methods, we also implement the proposed network under the same condition as [43]. For the other two methods [42,59], we only present the results reported in their publications. All comparisons are conducted with a scaling factor of 4×. The quantitative evaluation metrics include peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), which are computed on the Y channel of the YCbCr colour space.
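For reference, PSNR on the Y channel is typically computed as below (a sketch using the ITU-R BT.601 conversion common in SR evaluation; this is not the authors' exact evaluation script):

```python
import numpy as np

def rgb_to_y(rgb):
    """ITU-R BT.601 luma from 8-bit RGB, as used in most SR papers."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(img1, img2, peak=255.0):
    """PSNR between the Y channels of two RGB images in [0, 255]."""
    mse = np.mean((rgb_to_y(img1) - rgb_to_y(img2)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

Evaluating on Y only is the convention these comparisons follow, since human vision is most sensitive to luminance errors.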
First, we present the results on the Vid4 testing dataset, which is composed of four video sequences: calendar, city, foliage and walk. The quantitative results are presented in Table 1. It can be seen that the proposed NLVSR network achieves the best performance. It outperforms the state-of-the-art method PFNL by 0.18 dB in PSNR and 0.0055 in SSIM on average. Specifically, the proposed NLVSR surpasses PFNL by 0.18 dB/0.0066, 0.19 dB/0.0065, 0.10 dB/0.0053 and 0.22 dB/0.0037 on calendar, city, foliage and walk, respectively. Although EDVR provides a higher PSNR value on walk, the SSIM of NLVSR is still the highest. Several qualitative results of these methods on the calendar sequence are presented in Figure 7. It can be observed that DUF_52L, PFNL, FRVSR and NLVSR significantly outperform the other networks in the zoomed-in region: the characters in their SR results can be easily identified. Among these four methods, NLVSR recovers finer details with the fewest distortions, especially in the characters 'A' and 'R'.

Although the Vid4 dataset is popular in VSR evaluation, it only includes four video sequences with a low resolution (smaller than 720 × 576). Thus, we also conduct a comparison on the udm10 dataset, whose spatial size is 1272 × 720. Quantitative results are presented in Table 2. It can be seen that NLVSR achieves the best performance in both PSNR and SSIM for all video sequences. The average gains over the other methods are at least 0.33 dB in PSNR and 0.0019 in SSIM. We also provide some super-resolved frames in Figure 8 to evaluate the visual quality of these VSR methods. As shown in Figure 8(a), the number '536' can hardly be identified in the super-resolved frames of VESPCN, MCResNet, RVSR-LTD, DRVSR and FRVSR. PFNL and DUF yield a better performance but suffer from some distortions, e.g. the number 6 becomes incomplete and is attached to the neighbouring number 3. Compared with these methods, NLVSR retains more details and presents a more accurate result.
From the visual results of the sequence photography, we find that many VSR methods fail to recover some textures, such as dense parallel lines. For example, as shown in Figure 8(b), the reconstructed frames from all networks except NLVSR have a wrong direction in the texture of the sweater. This mistake appears in all frames from VESPCN, RVSR-LTD, MCResNet, DRVSR and FRVSR. For DUF, only one recovered frame has the correct texture direction (out of 31 frames in total in the sequence photography). PFNL is much better than the above methods but still fails in the reconstruction of ten frames. NLVSR, however, recovers all frames successfully, which means that it has a stronger ability than the other approaches in handling this kind of texture.
In addition to the quality evaluation, we further show the trade-off between accuracy, model size and test time of each network in Figure 9. It can be observed that VESPCN, MCResNet, RVSR-LTD and DRVSR have a relatively small model size and a fast inference speed, but their PSNR performance is not ideal. FRVSR and DUF have a similar recovery accuracy and both require more parameters than the other approaches. NLVSR achieves the best accuracy and has a model size similar to that of PFNL. However, the inference speed of NLVSR is relatively slow compared with the other methods. The reason is that the TSNL and APF modules are both implemented in a frame-by-frame manner in our code; a parallelized implementation would perform better.

Effectiveness of TSNL
FIGURE 9 The model size, accuracy and inference speed of different methods on the udm10 dataset.

In order to evaluate the effectiveness of the TSNL module, we design four networks with different alignment strategies: (1) the SISR method without any alignment, which only takes the LR reference frame as input, (2) STMC from [29], which is an explicit ME&MC method, (3) NLRB from [43], which is an implicit ME&MC method, and (4) the proposed TSNL block. We also intended to include the deformable convolution from [41,42], but its training is too demanding (e.g. 8 GPUs were used in [42]) and, to our knowledge, there is no efficient TensorFlow implementation that runs within 11 GB of GPU memory. To make a fair comparison, we use the same framework after alignment, which includes one convolutional layer with a 5 × 5 kernel, 20 residual blocks without batch normalization, and the same sub-pixel up-scaling module from [16]. The number of input frames is set to 7. The learning rate is set to 1 × 10−4 during the whole training process. We name these four networks SISR, STMC-VSR, NLRB-VSR and TSNL-VSR. Note that the loss functions of SISR, NLRB-VSR and TSNL-VSR are the same as Equation (12), whereas STMC-VSR involves an extra loss term for its motion estimation sub-network (refer to [29] for details). As shown in Figure 10(a), all three alignment methods achieve a significant improvement over the SISR method. STMC-VSR converges early, but NLRB-VSR and TSNL-VSR eventually exceed it. We can observe a performance gap of about 0.15 dB between NLRB-VSR and TSNL-VSR, which implies the superiority of TSNL in alignment.
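The core of non-local alignment can be illustrated as a weighted sum over all positions of a neighbouring frame (a simplified NumPy sketch of the generic non-local operation; the actual TSNL module uses learned embeddings and also attends over the reference frame itself):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_align(ref, neighbor):
    """Align `neighbor` features to `ref` by a non-local weighted sum.

    ref, neighbor: (H*W, C) flattened feature maps. Each output position
    is a similarity-weighted sum over ALL positions of the neighbouring
    frame, so the operation can integrate global information instead of
    relying on a local search window.
    """
    sim = ref @ neighbor.T            # (HW, HW) pairwise similarities
    weights = softmax(sim, axis=1)    # each row sums to 1
    return weights @ neighbor         # aligned features, shape (HW, C)
```

Because every position attends to every other position, large displacements are handled without an explicit motion vector, which is the key difference from explicit ME&MC alignment.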
As discussed in Section 3, the TSNL module exploits not only the temporal correlation across frames but also the self-similarity of the reference frame. To evaluate the contribution of this self-similarity to the SR performance, we present the training processes of NLVSR with and without it in Figure 10(b). The learning rate is set to 5 × 10−4. It can be seen that the PSNR gain from the self-similarity in the TSNL module is about 0.07 dB on average after 1 × 10⁵ steps. In fact, we do not expect too much improvement from the self-similarity of the reference frame, since the six aligned neighbouring frames and the original reference frame contribute to the reconstruction at the same time. However, as shown in Figure 10(b), the participation of self-similarity speeds up the convergence of the network, which is beneficial for shortening the training time.

Effectiveness of APF
In this sub-section, we evaluate the effectiveness of the progressive fusion strategy and the temporal attention mechanism in the APF module. First, we design a baseline model with a direct fusion strategy, which includes 30 residual blocks without batch normalization. The second and third models replace the 30 residual blocks with the 20 APFBs shown in Figure 5; the second model has no temporal attention mechanism. Note that the parameter number of 30 residual blocks is almost the same as that of 20 APFBs. The TSNL module is applied in all models before fusion for a fair comparison. Table 3 presents the PSNR values of these three models on the validation dataset. It shows that the PSNR gain from the progressive fusion strategy is 0.58 dB, which indicates that progressive fusion is much better than direct fusion at extracting temporal information from aligned frames. Comparing models 2 and 3 in Table 3, we observe an improvement of 0.15 dB from the proposed temporal attention mechanism. This implies that the attention mechanism helps the APF module balance the contributions of pixels with different quality.
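The idea of penalizing low-quality positions during fusion can be sketched with a per-pixel temporal softmax over frame-to-reference similarity (a hypothetical illustration: the paper's APF module learns its attention weights with convolutions rather than using raw correlation):

```python
import numpy as np

def attention_fusion(ref, aligned):
    """Fuse aligned frames with per-pixel temporal attention.

    ref:     (H, W, C) reference-frame features.
    aligned: (T, H, W, C) aligned neighbour features.
    Positions whose features deviate from the reference (e.g. badly
    aligned or blurred regions) receive small weights, so low-quality
    points contribute less to the fused result.
    """
    corr = np.sum(aligned * ref[None], axis=-1)    # (T, H, W) similarity
    e = np.exp(corr - corr.max(axis=0, keepdims=True))
    w = e / e.sum(axis=0, keepdims=True)           # softmax over frames T
    return np.sum(w[..., None] * aligned, axis=0)  # fused (H, W, C)
```

In the progressive variant, such a fusion step is applied repeatedly inside each APFB rather than once over all frames, so temporal information is merged gradually.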

Influence of input frames
NLVSR can accept any number of input frames of the form 2N + 1. In the above experiments, we set the frame number to 7 (i.e. N = 3) to be consistent with recent studies. Here, we further explore the influence of different frame numbers on the network performance. We train three networks with 3, 5 and 7 input frames, named NLVSR-3f, NLVSR-5f and NLVSR-7f, respectively. As shown in Figure 10(c), the more input frames are involved, the higher the performance achieved. This is consistent with the intuition that more input frames provide more temporal information beneficial to the reconstruction. However, we find that the performance gap between NLVSR-3f and NLVSR-5f is much larger than that between NLVSR-5f and NLVSR-7f. The reason is that a frame far from the reference frame has a relatively low temporal correlation with it, and thus typically makes a small contribution to the reconstruction. Besides, the inference time is linearly related to the frame number in the TSNL and APF modules because of their multi-channel structures. Therefore, a moderate frame number such as 5 or 7 is a reasonable choice.

CONCLUSION
In this paper, we propose a novel network for VSR, termed NLVSR. The proposed network performs a specialized non-local operation on each frame to capture temporal and spatial long-range dependencies across frames, and aligns each of them to the reference frame. Then, the aligned information from each frame is fused in an attention-based progressive framework, which balances the contribution of each aligned frame according to its quality. Extensive experiments demonstrate that the proposed network recovers accurate and temporally consistent SR frames, and achieves state-of-the-art performance on public benchmark datasets.