Memory‐ and time‐efficient dense network for single‐image super‐resolution

Ahmad R. Naghsh‐Nilchi, Department of Computer Engineering, University of Isfahan, Isfahan, Iran. Email: nilchi@eng.ui.ac.ir

Abstract

Dense connections in convolutional neural networks (CNNs), which connect each layer to every other layer, can compensate for mid/high‐frequency information loss and further enhance high‐frequency signals. However, dense CNNs suffer from high memory usage due to the accumulation of concatenating feature‐maps stored in memory. To overcome this problem, a two‐step approach is proposed that learns the representative concatenating feature‐maps. Specifically, a convolutional layer with many more filters is used before the concatenating layers to learn richer feature‐maps. Therefore, the irrelevant and redundant feature‐maps are discarded in the concatenating layers. The proposed method results in 24% and 6% less memory usage and test time, respectively, in comparison to single‐image super‐resolution (SISR) with the basic dense block. It also improves the peak signal‐to‐noise ratio by 0.24 dB. Moreover, the proposed method, while producing competitive results, decreases the number of filters in the concatenating layers by at least a factor of 2 and reduces memory consumption and test time by 40% and 12%, respectively. These results suggest that the proposed approach is a more practical method for SISR.


| INTRODUCTION
Single-image super-resolution (SISR), which aims to restore rich details and/or pleasant visual quality to an image, is favoured in many fields, including surveillance, remote sensing, and medical imaging. SISR is a classic yet still challenging open research problem in computer vision because of its ill-posed, that is, under-determined, nature. In detail, the low-resolution (LR) image y is formed as [1]:

y = D(x) + v, (1)

where D, x, and v stand for the degradation process, the high-resolution (HR) image, and additive noise, respectively. Degradation operators mostly include blurring and downsampling. The information loss in the degradation process is high; therefore, many different images can be reduced to the observed LR image by applying Equation (1). This is especially problematic at larger up-scaling factors, for which SISR is even more ill-posed.
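For illustration, the degradation in Equation (1) can be simulated as in the following minimal sketch, which assumes bicubic downsampling for D and optional Gaussian noise for v. The function name and the PyTorch implementation are illustrative; the experiments reported later use MATLAB's noise-free bicubic 'imresize', whose interpolation differs slightly from PyTorch's.

```python
# Minimal sketch of the degradation model y = D(x) + v, assuming bicubic
# downsampling for D and optional Gaussian noise for v (illustrative only).
import torch
import torch.nn.functional as F

def degrade(hr: torch.Tensor, scale: int = 2, noise_sigma: float = 0.0) -> torch.Tensor:
    """hr: (N, 3, H, W) tensor in [0, 1]. Returns the simulated LR image."""
    lr = F.interpolate(hr, scale_factor=1.0 / scale, mode="bicubic",
                       align_corners=False)              # D(x)
    if noise_sigma > 0:
        lr = lr + noise_sigma * torch.randn_like(lr)     # + v
    return lr.clamp(0.0, 1.0)
```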
For decades, there has been consistent progress in developing and improving SISR techniques, which are documented in several surveys [1,2]. These techniques fall into three main categories: interpolation-based, reconstruction-based, and learning-based methods [3][4][5][6]. Interpolation- and reconstruction-based methods struggle to preserve information. Owing to their significant learning ability and hierarchical structure, deep convolutional neural networks (CNNs) have recently been widely used for the single-image super-resolution task. CNNs learn an end-to-end mapping between the LR image and its counterpart HR image.
As a pioneer, Dong et al. proposed the first convolutional neural network for the SISR problem [3]. The problem with this network is its slow convergence, which prevents it from increasing in depth. Kim et al. addressed this problem with a skip connection that adds the input and output of the network via element-wise addition and proposed two 20-layer CNNs, obtained by increasing the recursion depth [7] or by stacking weight layers [8]. The residual connection alleviates the vanishing-gradient problem in training deeper networks. In other research, Tai et al. proposed another recursive network with depth 52 [9], applying recursion to the local residual unit. Ledig et al. also proposed a residual CNN that combines local and global residual learning [10]. Their CNN uses a late up-scaling strategy and does not use shared weights, and it reaches promising results. Lim et al. improved Ledig et al.'s method in different ways, including by omitting batch normalisation [11].
With the advent of dense CNNs, recent SISR methods using them have achieved superior results. Dense networks ensure maximum information flow by connecting each layer to every other layer in a feed-forward manner, as shown in Figure 1; the number of output feature-maps of each layer is defined as the growth rate. These networks, using channel-wise concatenations, provide several other advantages: for example, they use the collective knowledge of hierarchical features and avoid learning redundant feature-maps. However, this kind of CNN consumes a large amount of GPU memory due to the dense concatenation.
Tong et al. used dense connections in the whole network, and set the growth rate to 16 to prevent the network from growing too wide [12]. Tai et al. proposed dense connections for image restoration in a global way and between modules named memory blocks [13]. Zhang et al. proposed dense blocks for SISR, which employ dense connections inside blocks of limited depth [14], as shown in Figure 1. They also use a 1×1 convolutional layer to reduce the channel number. Recent SISR methods have used either local [15][16][17] or global dense connections [18] in their proposed CNNs with promising results. Local dense connections are between convolutional layers [15,16] or residual units [17]. Utilising a larger growth rate enriches the concatenating feature-maps and leads to an overall superior discriminative ability. However, existing methods do not handle the memory problem caused by larger growth rates.
Increasing the growth rate does indeed produce richer features, but it also produces some irrelevant feature-maps that increase memory usage and test time. A composite layer for learning the concatenating feature-maps is therefore proposed. In other words, a wider convolutional layer is used before the concatenating layers to learn richer feature-maps. This layer is then followed by a slim layer to extract the relevant information from the input feature-maps. The proposed method, which is shown in Figure 2, is inspired by the concept of dimensionality reduction and aims to reduce the memory usage of dense CNNs. It offers the flexibility to tune the number of filters in the odd and even layers to obtain an efficient trade-off between the representational power and the memory usage of dense CNNs.
Comprehensive experiments justify the efficiency of the proposed method. It has been examined at four depths: 4, 8, 16, 32. On average, the proposed method decreases the number of concatenating feature-maps, resulting in 24% and 6% less GPU memory usage and test time, respectively, compared to the basic dense method. It also improves the peak signal-to-noise ratio by 0.24 dB. Moreover, with the proposed method, the growth rate can be reduced by at least a factor of 2 to lower the memory and time usage by 40% and 12%, respectively, while keeping the results competitive.
FIGURE 1: The basic dense method (Conv4 stands for a convolutional layer with four filters).

FIGURE 2: The proposed dense method (Conv16 stands for a convolutional layer with 16 filters, and Conv1 stands for a convolutional layer with one filter).

In summary, the proposed method has the following advantages:

- It lowers the need for larger growth rates to increase the discriminative capability of dense CNNs, and
- It significantly reduces GPU memory consumption and test time by propagating only the representative concatenating feature-maps.
A review of related works is provided in Section 2. The proposed method is detailed in Section 3. Then, the data sets, training setup, network parameters, and experimental results are described in Section 4, and finally, the paper is concluded in Section 5.

| RELATED WORK
With the ability to automatically learn informative hierarchical features, CNNs have been extensively studied in recent years. In the following subsections, related concepts from dense CNNs and dimensionality reduction, which help to comprehend the proposed method, are presented.

| Dense CNNs
Recent works have shown that shorter connections between layers close to the input and those close to the output allow convolutional networks to be substantially deeper, more accurate, and more efficient to train. Dense CNNs concatenate the feature-maps of each layer into the inputs of all subsequent layers (with matching feature-map sizes) in a feed-forward manner [19], as shown in Figure 1. Dense CNNs have several advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, and reduce redundant feature-maps.
Recent SISR methods have also benefited from these advantages of dense CNNs. Tai et al. [13] used a densely connected structure in a global way, between memory blocks. In each memory block, they used successive residual units to learn multi-level representations of the current state, concatenated with the outputs of previous memory blocks. At the end of each memory block, they applied a pre-activated 1×1 convolutional layer. Their experimental results show the efficiency of long-term dense connections for image restoration tasks. Tong et al. [12] and Zhang et al. [14] proposed dense blocks for SISR similar to that shown in Figure 1. Tong et al. also used dense connections between blocks for improved results, set the growth rate to 16 to prevent the network from growing too wide, and used a 1×1 convolutional layer after all blocks to reduce the number of feature-maps. Zhang et al. used local residual learning between the input and output of each dense block for improved performance. The output of each block has a direct connection to all the layers of the next block to support contiguous memory among blocks. Finally, they concatenated the outputs of all blocks to use the hierarchical features of the input LR image for reconstruction. They also utilised 1×1 convolutional layers to adaptively preserve features and stabilise the training of the wider network.
More recently, researchers have continued to use dense connections in their proposed methods. Shamsolmoali et al. used dense blocks with dilated convolutional layers to increase the receptive field [15]. Anwar et al. proposed a densely residual Laplacian network and used local dense connections between residual units [17]. Qin et al. proposed a multi-resolution space-attended residual dense network with an adaptive fusion block based on channel-wise sub-network attention [16]. Dai et al. used channel-wise attention mechanisms to extract more informative and discriminative representations [18]. The main problem arising from dense CNNs is their memory consumption, especially at larger growth rates.
In a dense block, if each layer produces G feature-maps, then the l-th layer receives G_0 + G × (l − 1) feature-maps from the previous layers, where G_0 is the number of feature-maps received by the first layer and G is referred to as the growth rate. Using a larger depth (L) and growth rate (G) increases the number of feature-maps to be kept in memory. Previous methods avoided the memory problem of dense CNNs by various means, including using fewer concatenating layers, using a smaller growth rate, or using a 1×1 convolutional layer to reduce the channel number. Here, inspired by the concept of dimensionality reduction, a two-step concatenating feature-map learning scheme is proposed to produce reduced and representative concatenating feature-maps. The proposed method, described in the following sections, significantly reduces memory usage without loss of information.
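To make the channel accumulation concrete, the sketch below shows a basic dense block in the spirit of Figure 1, where the input to each layer grows by the growth rate G. This is an illustrative PyTorch rendering (the paper's implementation uses Torch7); the class and parameter names are assumptions made for the example.

```python
# Minimal sketch of a basic dense block: layer l receives G0 + G*(l-1)
# input channels because every preceding output is kept and concatenated.
import torch
import torch.nn as nn

class BasicDenseBlock(nn.Module):
    def __init__(self, in_channels: int = 64, growth_rate: int = 16, depth: int = 8):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels                       # G0
        for _ in range(depth):
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            ))
            channels += growth_rate                  # concatenation grows the input

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # all preceding feature-maps
            features.append(out)                     # kept in memory until the block ends
        return torch.cat(features, dim=1)
```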

| Dimensionality reduction
In general, well-performing features have several characteristics, including (1) being representative, to provide a concise description, and (2) being independent, as dependent features are redundant. Dimensionality reduction is concerned with reducing the number of features to generate more compact and representative features. The main problem with high-dimensional data is that many features are irrelevant or redundant. Such features increase memory usage and test time without contributing useful information.
There are two general approaches for dimensionality reduction: feature selection and feature extraction. The central premise when using a feature selection technique is that the data contain some features that are either redundant or irrelevant and can thus be removed without incurring much loss of information [20]. Feature extraction creates new features based on the original feature set intended to be informative and non-redundant. It usually involves transforms to get relevant information from the input features, so that the desired task is performed by using this reduced representation instead of the complete initial one. The transforms may be linear or non-linear. However, the best transform is most likely a non-linear function.
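As a toy illustration of feature extraction, the snippet below projects features onto their top principal directions (a linear transform). Inside a CNN, a 1×1 convolution plays an analogous role, and adding a non-linearity such as ReLU yields the non-linear transform mentioned above. This example is generic and not part of the proposed method.

```python
# Toy feature extraction: keep only the k most informative linear directions.
import numpy as np

def pca_project(features: np.ndarray, k: int) -> np.ndarray:
    """features: (n_samples, n_dims). Returns an (n_samples, k) reduced representation."""
    centered = features - features.mean(axis=0)
    cov = np.cov(centered, rowvar=False)                 # covariance of the features
    eigvals, eigvecs = np.linalg.eigh(cov)
    top_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]    # k directions of largest variance
    return centered @ top_k
```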
A novel approach is proposed here for reducing the memory consumption and test time of dense CNNs inspired by the dimensionality reduction concepts mentioned above.

| PROPOSED METHOD

| Network architecture
The network architecture used for the experiments is shown in Figure 3. It is the architecture commonly used in most SISR techniques. The network learns an end-to-end mapping from LR images (I_LR) to HR images (I_HR). The output of the network is named I_SR and is an approximation of I_HR.
Low-level feature-maps F_{-1} and F_0 are extracted using two convolutional layers:

F_{-1} = W_{-1} * I_LR,    F_0 = W_0 * F_{-1},

where W_{-1} and W_0 are the weights of these two convolutional layers and * denotes convolution. The bias term is omitted for simplicity. The proposed dense method takes F_0 as input and learns residual multi-level feature-maps. If the number of levels of the proposed method is D, the input and output of the d-th level are denoted by F_{d-1} and F_d, respectively, for d = 1, 2, …, D. Each level applies a non-linear transform H_d(·) consisting of two convolutional layers:

H_d(X) = σ(W_d^2 * σ(W_d^1 * X)),

where W_d^i, i = 1, 2, stands for the weights of the i-th convolutional layer in level d, σ denotes the ReLU activation function [21], and d is the index of the level. The size of W_d^i is 3 × 3 × n_i, in which n_1 is much larger than n_2.
Each level receives the concatenation of the feature-maps of all preceding levels as input:

F_d = [F_{d-1}, H_d(F_{d-1})] = [F_0, H_1(F_0), …, H_d(F_{d-1})],

where [·, ·] refers to the concatenation of feature-maps. The output of the proposed method, F_D, is fed into a 1×1 convolutional layer, namely the feature fusion layer, to control the output information and adaptively fuse the multi-level feature-maps. A 3×3 convolutional layer is then used to extract features for residual learning. The final multi-level feature-maps after residual learning are formulated as

F_RL = W_{L+2} * (W_{L+1} * F_D) + F_{-1},

where W_{L+1} and W_{L+2} represent the weights of the 1×1 and 3×3 convolutional layers, respectively. Up-scaling is performed on these multi-level feature-maps using ESPCNN [22], followed by a convolutional layer that outputs I_SR.
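The pipeline of Figure 3 can be sketched as follows. This is an illustrative PyTorch rendering, not the authors' Torch7 code: the class name SRNetwork, the use of nn.PixelShuffle for the ESPCNN-style up-scaling step, and the choice of F_{-1} as the residual branch are assumptions made for the sketch; filter counts follow Section 4 (64 filters).

```python
# Sketch of the overall network: shallow feature extraction, a dense body,
# 1x1 feature fusion, residual learning, sub-pixel up-scaling, and output.
import torch
import torch.nn as nn

class SRNetwork(nn.Module):
    def __init__(self, body: nn.Module, body_out_channels: int,
                 n_feats: int = 64, scale: int = 2):
        super().__init__()
        self.conv_m1 = nn.Conv2d(3, n_feats, 3, padding=1)         # F_{-1} = W_{-1} * I_LR
        self.conv_0 = nn.Conv2d(n_feats, n_feats, 3, padding=1)    # F_0 = W_0 * F_{-1}
        self.body = body                                           # dense levels -> F_D
        self.fusion = nn.Conv2d(body_out_channels, n_feats, 1)     # 1x1 feature fusion (W_{L+1})
        self.conv_res = nn.Conv2d(n_feats, n_feats, 3, padding=1)  # 3x3 conv for residual learning (W_{L+2})
        self.upscale = nn.Sequential(                              # ESPCNN-style sub-pixel up-scaling
            nn.Conv2d(n_feats, n_feats * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
        )
        self.conv_out = nn.Conv2d(n_feats, 3, 3, padding=1)        # output I_SR

    def forward(self, lr: torch.Tensor) -> torch.Tensor:
        f_m1 = self.conv_m1(lr)
        f_0 = self.conv_0(f_m1)
        f_d = self.body(f_0)
        fused = self.conv_res(self.fusion(f_d)) + f_m1             # residual learning
        return self.conv_out(self.upscale(fused))
```

Any dense body whose output channel count is known, including the proposed block described in the next subsection, can be plugged in as `body`.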

| Representative dense feature learning
In the basic dense block shown in Figure 1, the network's discriminative ability increases with a larger growth rate. However, a larger growth rate is associated with huge memory usage due to the accumulation of concatenating feature-maps stored in memory. The memory problem therefore does not allow the growth rate to be increased very much in these networks.
Increasing the growth rate also produces irrelevant feature-maps that do not affect the network's discriminative ability but increase its GPU memory usage. Therefore, a new dense method is proposed that determines the network's discriminative capability using two hyper-parameters. In other words, the concatenating feature-maps are learnt in two consecutive layers. In the proposed method shown in Figure 2, richer feature-maps are first learnt with a wider layer placed before the concatenating layer. The concise and representative concatenating feature-maps are then extracted from these features using a thin concatenating layer. As a result, the memory usage of the proposed dense block decreases efficiently, as discussed in the following subsection. At the same time, the proposed method does not reduce the discriminative capability of dense CNNs, because the propagated feature-maps remain as representative as before.
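A minimal sketch of the proposed two-step level of Figure 2 is given below, assuming each level is a wide 3×3 layer with G_1 filters followed by a slim 3×3 layer with G_2 filters whose output is concatenated. Whether an activation follows the slim layer is not stated above and is an assumption here; the PyTorch code is illustrative only.

```python
# Sketch of the proposed dense block: a wide layer learns rich feature-maps
# and a slim layer extracts the representative ones; only the G2 maps are
# concatenated and kept in memory.
import torch
import torch.nn as nn

class ProposedDenseBlock(nn.Module):
    def __init__(self, in_channels: int = 64, g1: int = 512, g2: int = 128, depth: int = 8):
        super().__init__()
        self.levels = nn.ModuleList()
        channels = in_channels                              # G0
        for _ in range(depth // 2):                         # each level = wide layer + slim layer
            self.levels.append(nn.Sequential(
                nn.Conv2d(channels, g1, 3, padding=1),      # wide layer: G1 >> G2
                nn.ReLU(inplace=True),
                nn.Conv2d(g1, g2, 3, padding=1),            # slim concatenating layer
                nn.ReLU(inplace=True),
            ))
            channels += g2                                  # only G2 maps accumulate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for level in self.levels:
            features.append(level(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)
```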

| Discussion
Assuming the number of convolutional layers to be even, if the odd and even layers of the proposed dense block produce G_1 and G_2 feature-maps, respectively, then the input to the l-th layer (l odd) has G_0 + G_2 × (l/2) channels, which is almost half of the G_0 + G × (l − 1) channels of the basic dense block when G = G_2. Therefore, with the same depth and growth rate, the proposed block is expected to have a lower memory requirement and a shorter test time than the basic dense block.
With the help of the wider layer used before the concatenating layer, the proposed method learns discriminative feature-maps. Therefore, its growth rate (G_2) can be reduced compared to the growth rate of the basic dense block (G) without loss of information. This produces even more representative concatenating feature-maps and further reduces GPU memory usage and test time.
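The channel counts above can be checked with a few lines of arithmetic; the values of G_0, G, and the layer indices below are example values, not the full experimental grid.

```python
# Input channels to the l-th (odd) layer: basic block vs. proposed block with G = G2.
def basic_channels(g0: int, g: int, l: int) -> int:
    return g0 + g * (l - 1)

def proposed_channels(g0: int, g2: int, l: int) -> int:
    return g0 + g2 * (l // 2)

G0, G = 64, 128
for l in (3, 9, 17, 31):  # odd layer indices at different depths
    print(l, basic_channels(G0, G, l), proposed_channels(G0, G, l))
# e.g. for l = 31: 3904 channels (basic) vs. 1984 channels (proposed)
```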

| EXPERIMENTS
To compare the results, the basic dense block is used in place of the proposed dense block in the network architecture. These models are trained with different numbers of convolutional layers. The results are reported in Tables 1-4 and discussed below. In the first column of these tables, the symbol G indicates the number of filters in each layer of the basic dense block, and G_1-G_2 stands for the number of filters of the proposed method in the layers before the concatenating layers and in the concatenating layers, respectively. The remaining columns report the memory usage, the average test time on the B100 data set, the number of network parameters, the average PSNR/SSIM for each test data set, and the average PSNR/SSIM over all five test data sets.

| Data sets and metrics
The DIVerse 2K resolution high-quality image data set (DIV2K) contains 800 training images, 100 validation images, and 100 test images [23]. The DIV2K data set is used for training and validation; only five validation images are used in the experiments to reduce training time. Set5 [24], Set14 [25], B100 [26], Urban100 [27], and Manga109 [28] are used as the five standard test data sets. The HR images are degraded by bicubic downscaling (using the 'imresize' function of MATLAB) with a scale factor of 2 to form the LR images. The peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [29] metrics are calculated on the Y channel of the images transformed to the YCbCr space, in both the validation and test steps.
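A sketch of this evaluation protocol is shown below: the RGB output is converted to the Y channel of YCbCr (ITU-R BT.601 constants are assumed) and PSNR is computed on it; SSIM can be computed on the same channel, for example with skimage.metrics.structural_similarity. Border cropping and the exact conversion constants used by the authors' MATLAB scripts may differ.

```python
# Evaluation sketch: PSNR on the Y channel of YCbCr.
import numpy as np

def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """img: (H, W, 3) uint8 RGB image. Returns the Y channel (BT.601 convention)."""
    img = img.astype(np.float64)
    return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1] + 24.966 * img[..., 2]) / 255.0

def psnr_y(sr: np.ndarray, hr: np.ndarray) -> float:
    mse = np.mean((rgb_to_y(sr) - rgb_to_y(hr)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```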

| Training setup
In each training batch, 16 LR RGB patches of size 48×48 are randomly cropped as inputs. These patches are randomly augmented by flipping horizontally or vertically and by rotating 90°. The mean RGB value of the DIV2K data set is subtracted from each input patch and added back to the output of the network. The learning rate is initialised to 10^-4 for all layers. The network is implemented with the Torch7 framework. The Adam optimiser [30] is used with β_1 = 0.9, β_2 = 0.999, and ε = 10^-8. An epoch contains 1000 iterations of back-propagation. The results are reported after 200 epochs of training. Two NVIDIA GTX 1080Ti GPUs are used for training, validation, and testing.
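The configuration above translates into roughly the following setup. This is a hedged PyTorch sketch (the paper's implementation is in Torch7); the placeholder model and the augment helper are illustrative, and the patch-cropping data pipeline is assumed to exist elsewhere.

```python
# Sketch of the training setup: joint random flips/rotation of LR-HR patch
# pairs and the Adam optimiser with the hyper-parameters listed above.
import random
import torch

def augment(lr_patch: torch.Tensor, hr_patch: torch.Tensor):
    """Random horizontal/vertical flip and 90-degree rotation, applied jointly."""
    if random.random() < 0.5:
        lr_patch, hr_patch = lr_patch.flip([-1]), hr_patch.flip([-1])   # horizontal flip
    if random.random() < 0.5:
        lr_patch, hr_patch = lr_patch.flip([-2]), hr_patch.flip([-2])   # vertical flip
    if random.random() < 0.5:
        lr_patch = lr_patch.rot90(1, [-2, -1])                          # 90-degree rotation
        hr_patch = hr_patch.rot90(1, [-2, -1])
    return lr_patch, hr_patch

model = torch.nn.Conv2d(3, 3, 3, padding=1)        # placeholder; any SR network goes here
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), eps=1e-8)
```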

| Network parameters
All convolutional kernels have a size of 3×3, except for the feature fusion layer, whose kernel size is 1×1. Zero-padding is used in all 3×3 convolutional layers because a 3×3 kernel would otherwise reduce the feature-map size. The efficiency of zero-padding has been shown by Kim et al. [8].
The number of filters in the first and second convolutional layers, the feature fusion layer, and the following 3×3 convolutional layer is 64. Three filters are used in the last convolutional layer to output a colour image.

| Results
To compare the results, the proposed dense method in Figure 3 is replaced with the basic dense method. The symbols 'G' and 'G_1-G_2' are used in the first column of the tables to denote the basic method and the proposed method, respectively. G stands for the number of filters in each layer of the basic dense block. G_1 and G_2 represent the number of filters in the odd and even layers of the proposed block, respectively. In almost all experiments, the value of G_1 is at least four times the value of G_2, as a larger G_1 boosts the results.

| Memory/time investigation
As formulated in Subsection 3.3, the proposed method is expected to reduce GPU memory usage and test time when the growth rate is the same for both methods. Comprehensive experimental results confirm this hypothesis. With the growth rate set to be the same for both the basic and the proposed dense methods, the models are selected such that their PSNR/SSIM values are competitive with those of the basic dense method. The percentage of memory usage reduction of the proposed method compared to the basic method is calculated at each depth and depicted in Figure 4. Furthermore, the average value over the four depths for the selected models is reported in Table 5. The overall average over all models is 24%. By the same calculation, the proposed method reduces the test time by 6%.

FIGURE 4: The percentage memory usage reduction of the proposed method compared to the basic method, while the growth rate (GR) is the same for both methods. L represents the number of convolutional layers.
At each growth rate for the basic dense method, the percentage of memory improvement achieved by the proposed dense method is illustrated in Figure 5. The horizontal axes show the growth rate of the basic connectivity pattern. The vertical axes represent the percentage of memory improvement achieved by the proposed method with similar PSNR and reduced growth rate. The average memory improvement of all models at all depths is 40%. The proposed method improves the test time by 12% with similar calculations.
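The memory and timing comparisons can be reproduced with utilities along the following lines. The exact measurement procedure behind the tables is not specified beyond the reported memory usage and average test time, so this is only a sketch using standard PyTorch facilities.

```python
# Sketch: measure inference time and peak GPU memory for a single LR image.
import time
import torch

def measure(model: torch.nn.Module, lr_image: torch.Tensor, device: str = "cuda"):
    model = model.to(device).eval()
    lr_image = lr_image.to(device)
    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        torch.cuda.synchronize(device)
        start = time.time()
        _ = model(lr_image)
        torch.cuda.synchronize(device)
    elapsed = time.time() - start
    peak_mb = torch.cuda.max_memory_allocated(device) / 2 ** 20
    return elapsed, peak_mb
```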

| Investigation of the number of filters
The value of G_2 is held constant to investigate the effect of G_1. Each sub-figure in Figure 6 shows the PSNR values, averaged over the four depths, for one value of G_2. A larger G_1 results in better SISR performance at any fixed value of G_2. This is conceivable because a larger G_1 enriches the concatenating feature-maps. The results converge at G_1 = 512; therefore, the '512-X' models have been selected for comparison with the 'X' models.

The PSNR improvement of the proposed method compared to the basic method is shown in Figure 7. The improvement is shown for different growth rates (G_2 in the proposed method and G in the basic dense method) and different depths (L = 4, 8, 16, 32). The proposed method achieves a higher PSNR than the basic method at almost all growth rates and depths.
The PSNR improvement of the '512-X' models compared to the 'X' models, averaged over all four depths, is reported in Table 7. On average, over all values of X, an improvement of 0.24 dB is obtained.
To investigate the growth rate (G in the basic dense block and G_2 in the proposed method), G_1 is fixed to 512 in the proposed method. The PSNR values for different growth rates, averaged over the four depths, are depicted in Figure 8. From this figure, it can be inferred that increasing the growth rate improves the results in both methods.

FIGURE 8: PSNR/SSIM for different growth rate values (G and G_2). For the proposed method, G_1 is fixed to 512. Increasing the growth rate improves the results in both methods.

| The effect of the number of layers
Increasing the number of convolutional layers improves the PSNR/SSIM values in both methods. An example is shown in Figure 9 for models '128' and '512-128'.

FIGURE 9: PSNR/SSIM for different numbers of convolutional layers (L) in models '128' and '512-128'.

| Visual results
A visual comparison is shown in Figure 10 for models '128' and '512-128' on images 'img047' and 'img052' from Urban100. These models are trained with a scale factor of ×2 and L = 32 convolutional layers. The basic dense block produces noticeable artefacts and blurred edges. In contrast, the proposed method recovers sharper and clearer edges.

| Comparison with the state-of-the-art
To compare with the state-of-the-art, the dense blocks of two recent dense CNNs [14,16] are replaced with the proposed dense blocks. All networks are trained for 200 epochs, and the results are reported in Table 8. The blocks of RDN [14] have depth 8 and growth rate 64 and are replaced with '512-128' blocks of depth 8. The blocks of MARDN [16] have depth 4 and growth rate 32 and are replaced with '256-32' blocks of depth 4. The proposed method improves the PSNR/SSIM values of both RDN and MARDN.

| CONCLUSION
A novel dense block is proposed that produces more representative concatenating feature-maps. It uses a convolutional layer with more filters before the concatenating layers. The proposed method keeps the discriminative ability of dense CNNs while significantly reducing GPU memory usage. It improves the PSNR of the basic dense CNN by 0.24 dB, recovers sharper and clearer edges, and reduces memory consumption and test time by 24% and 6%, respectively. It also decreases the need for a larger growth rate, thereby achieving 40% and 12% less memory consumption and test time than the basic dense method. The highest improvements are obtained on the very challenging Urban100 data set. These results highlight a limitation of basic dense CNNs, which rely only on the growth rate to obtain better hierarchical features.

FIGURE 10: Visual results for images 'img047' and 'img052' from Urban100.