A pyramid non-local enhanced residual dense network for single image de-raining

Single image de-raining based on convolutional neural networks (CNNs) has made considerable progress in recent years. However, the de-rained results often contain dark artifacts, and image textures tend to be over-smoothed. In this paper, a pyramid non-local enhanced residual dense network is proposed to reduce such distortion. Firstly, the down-sampled images are fed into a Laplacian pyramid, which extracts overall and partial texture clues and produces a set of images at different scales. Secondly, these images are fed into non-local enhanced residual dense blocks, which not only capture long-distance dependencies of the feature maps but also fully utilize the hierarchical features in every dense block, leading to high accuracy in rain streak extraction and better preservation of image edge detail. Finally, the de-rained image is gradually restored by a Gaussian reconstruction pyramid. Experimental results on both synthetic and real-world data show that artifact distortion is obviously reduced by the proposed network, and the quality of the de-rained images is significantly improved compared with state-of-the-art methods.


INTRODUCTION
Rain affects not only human visual perception but also outdoor vision systems, e.g., self-driving cars and surveillance systems. These systems usually take clear images and video as input, so their robustness is easily degraded on rainy days. In recent years, de-raining has become an important research topic, and many approaches have been proposed. They can be classified into two types: traditional methodology-based approaches [1][2][3][4][5][6][7][8][9] and deep learning-based approaches [10][11][12][13][14][15][16][17][18][19][20][21][22][23]. Traditional methodology-based approaches focus on analysing the appearance of single rain streaks and the global rain distribution in image space, for example supposing that rain streaks follow an ellipse shape [1][2][3][4], a low-rank pattern [5,7,8], or a Gaussian mixture model (GMM) [6]. However, due to complex imaging scenes (e.g., wind and fog) and unknown camera parameters (e.g., focal length, exposure time), rain streak shapes are complex and changeable, and thus do not follow a pre-designed rain model. As a result, traditional methods cannot obtain satisfactory results in real scenes. In addition, such methods require complex optimization and are thus time-consuming. In contrast, deep learning-based methods perform better in real scenes and are more efficient, since the training images are diverse and the pre-trained model can be used directly in the rain removal stage. At present, the following deep learning paradigms are widely used in the de-raining task: supervised learning [10][11][12][13][14][15][16][17], semi-supervised learning [18,19], and unsupervised learning [20][21][22][23]. They have made great progress, and de-raining performance has been significantly improved. However, dark artifacts along rain streaks and over-smoothed texture edges remain thorny issues.
In this paper, we propose a pyramid non-local enhanced residual dense network, which consists of a series of non-local enhanced residual dense blocks and a Gaussian-Laplacian pyramid network, to efficiently learn the mapping function between image pairs, accurately modelling the rain streaks while preserving the texture information of the image. The contributions of this work are twofold: 1. A novel CNN-based model, a pyramid non-local enhanced residual dense network, is proposed, which obtains superior performance on removing rain streaks compared to other methods. 2. We integrate the non-local enhanced residual dense block (NEDB) and the Gaussian-Laplacian pyramid, fully utilizing the de-raining advantages of the two modules. The NEDB helps our network capture long-distance dependencies of feature maps and make full use of the hierarchical features of the convolution layers. Moreover, the multi-scale decomposition of the Gaussian-Laplacian pyramid is adopted, which enables multi-task supervision in the training stage. Thereby, the problems of dark artifacts and over-smoothed edges are better avoided.

RELATED WORK
Image de-raining approaches can be divided into two categories: single image-based methods and video-based methods. Here, only single image-based methods are discussed. Sparse coding and dictionary learning are two popular techniques in traditional methodology-based approaches, which are usually used to divide the rainy image into a background layer and a rain layer [1][2][3][4][5][6][7][8][9]. Chen et al. [5] proposed a low-rank pattern to model the rain distribution. Li et al. [6] used a GMM to model rain streaks with different directions and scales. However, the models constructed by existing traditional methods have limited de-raining capacity. In recent years, many methods based on deep learning have been proposed, including supervised learning [10][11][12][13][14][15][16][17], semi-supervised learning [18,19], and unsupervised learning [20][21][22][23]. Supervised learning approaches learn the mapping between the ground-truth clear image and the synthetic rainy image. Fu et al. [10] first proposed a CNN-based de-raining network, which cannot effectively remove rain streaks because only three convolutional layers are used. Subsequently, the residual block has been widely used in the de-raining task [11][12][13]. To better remove rain streaks, researchers have focused on developing new network architectures. A fully convolutional network based on a region-dependent rain image model was proposed by Yang et al. [14] to jointly detect and remove rain, but the restored images have over-smoothed edges; the authors later proposed an enhanced version, JORDER-E, to recover image details. Considering the fog-like visual effect caused by rain accumulation, an end-to-end neural network was designed by Hu et al. [15] to learn depth-attentional features.
With the improvement of de-raining performance, the learning network becomes more and more complex and requires a lot of parameters. To reduce the number of parameters, Fu et al. [16] proposed a lightweight LPnet. It does not make full use of the hierarchical features of the original image, resulting in unsatisfactory results. A non-local enhanced encoder-decoder network (NLEDN) was proposed by Li et al. [17] to restore detail texture.
However, supervised learning-based methods have weak generalization ability on real rainy images. To solve this problem, a semi-supervised de-raining model was first introduced by Wei et al. [18], realized by computing the residual between the input image and the expected clean image to simulate the rain pattern distribution. Moreover, Yasarla et al. [19] proposed a semi-supervised learning network based on Gaussian processes, which models the latent space vectors of unlabelled images and then uses them to compute pseudo-ground-truths for supervising the network.
Compared with semi-supervised learning, unsupervised learning is more conducive to practical applications. Zhu et al. [20] proposed a Rain Removal GAN (RR-GAN), which mainly defines a novel multiscale attention memory generator and a new multiscale deeply supervised discriminator. Meanwhile, an Unsupervised De-raining GAN (UD-GAN) using a new self-supervised learning method was proposed by Jin et al. [21]; a rain guidance module and a background guidance module are designed to take advantage of rainy image characteristics. To generate better de-raining results, Wei et al. [22] proposed an attention-guided de-raining network by constraining the CycleGAN of [23]; they utilize two cycle-consistency branches to extract rain streaks by attending to the relationship between rainy and clean images. Although these methods achieve significant success, two thorny issues remain: (1) dark artifacts along the rain streaks; (2) image textures tend to be over-smoothed. To reduce such distortion, in this paper a pyramid non-local enhanced residual dense network is proposed.

METHOD
For the first time, a pyramid non-local enhanced residual dense network is adopted in the field of image de-raining. The overall architecture of our proposed network is shown in Figure 1. Notably, the combination of non-local enhanced blocks and a pyramid has been exploited by [24] for de-noising and super-resolution reconstruction, which supports the feasibility of our method. The multi-scale design (i.e., the image pyramid) can obtain a more powerful feature representation while reducing the number of parameters. Moreover, the non-local enhanced residual dense block (NEDB), composed of an NLB and an RDB, is adopted in each layer of the pyramid to improve the de-raining performance of our method. The specific steps are as follows. First, a rainy image is decomposed into its Laplacian pyramid by five down-sampling operations. Second, each of the five layers passes through two convolutional layers to extract shallow features, and the feature maps are then fed into the non-local enhanced block, where a weighted summation is used to enlarge the receptive field of the network and capture long-distance dependencies of the feature maps, thereby providing rich information for the following layer. Subsequently, the feature maps are input into a dense block with residual mapping for learning rain streaks, which makes full use of the hierarchical features of the convolution layers. Finally, the features pass through residual blocks to obtain the de-rained result.

FIGURE 1
The overview architecture of our proposed non-local enhanced residual dense pyramid network. (a) First, a rainy image is decomposed into its Laplacian pyramid by five down-sampling operations, then two convolutional layers are used for extracting shallow features. (b) A non-local enhanced block (NLB) to enhance long-distance dependencies of the feature maps. (c) A residual dense block (RDB) to identify rain streaks and reconstruct a clean background. (d) Residual blocks (RBS): two residual blocks and a convolutional layer used to obtain de-rained results in each layer of the pyramid. (e) For the activation function σ, the leaky ReLU is used in our network
In the following sections, each part of the network architecture is introduced in detail.

Gaussian-Laplacian pyramid
The pyramid network for rain removal was first proposed by Fu et al. [16], which can obtain a satisfactory result with fewer parameters. The sub-network of Lpnet is shown in Figure 2. However, the de-rained result still contains dark artifacts for the following reasons. First, Lpnet does not make full use of hierarchical features, which makes the edge information hard to maintain during de-raining. Second, the sub-network of Lpnet adopts a shallow CNN structure, so the receptive field of the network is limited; the modelling of rain patterns therefore has low accuracy and the de-raining performance is worse. Finally, Resnet [25] is used to remove rain streaks in Lpnet, which is problematic when dealing with dark artifacts in heavy-rain images: Resnet combines feature maps through summation before passing them to the next layer, which restricts the information flow of the feature maps.

FIGURE 2
The structure of the sub-network in Lpnet [16]. (1) Convolution layers for feature extraction. (2) Recursive blocks to share parameters

In this paper, a pyramid non-local enhanced residual dense network is introduced to improve on it in several ways: (1) In the non-local enhanced block, the weighted summation method is used to enlarge the receptive field of the neural network, thereby providing rich information for the following layer.
(2) To accurately model rain streaks, a residual dense block consisting of five densely connected convolutional layers is used to obtain feature maps with rich details.
(3) Compared with Lpnet, the NEDB combines feature maps by concatenating them, so the feature maps are passed on to all subsequent layers. Figure 3 shows the comparison between the results of Lpnet and those of our network.
As described in [16], the background information can be completely extracted at the top of the Laplacian pyramid, while the other layers contain rain streaks and detail information at different spatial scales, so each subnet only needs to deal with high-frequency components at a single scale. The i-th Laplacian pyramid layer is computed as

L_i(r) = G_i(r) − up(G_{i+1}(r)), i = 1, …, n−1, with L_n(r) = G_n(r),

where r is the input rainy image, n is the number of pyramid layers, L_i(r) is the i-th Laplacian pyramid layer, G_i(r) is the i-th Gaussian pyramid layer, G_{i+1}(r) is computed by down-sampling G_i(r) with a Gaussian filter kernel, G_1(r) = r, and up(·) denotes the up-sampling operation. The Gaussian pyramid is then reconstructed from the layers restored by the residual blocks (RBS):

Ĝ_i(r) = σ(L̂_i(r) + up(Ĝ_{i+1}(r))),

where σ is the activation function leaky ReLU (LReLU) [26] and L̂_i(r) is the restored i-th layer; the reconstructed layer Ĝ_{i+1}(r) guides the output Ĝ_i(r) through the up-sampling operation. The final result is the bottom, full-resolution layer Ĝ_1(r) of the Gaussian pyramid.
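The decomposition and reconstruction above can be sketched in a few lines of NumPy. This is a toy sketch only: a 2× average-pool stands in for the Gaussian blur-and-subsample, nearest-neighbour up-sampling stands in for up(·), and the activation applied by the network during reconstruction is omitted, so reconstruction here is exact.

```python
import numpy as np

def down(x):
    # 2x down-sampling by average pooling (stand-in for Gaussian blur + subsample)
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up(x):
    # 2x nearest-neighbour up-sampling (stand-in for the pyramid's up-sampling)
    return x.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(img, n=5):
    # L_i = G_i - up(G_{i+1}), with L_n = G_n (the coarsest Gaussian level)
    gauss = [img]
    for _ in range(n - 1):
        gauss.append(down(gauss[-1]))
    laps = [gauss[i] - up(gauss[i + 1]) for i in range(n - 1)]
    laps.append(gauss[-1])
    return laps

def reconstruct(laps):
    # G_i = L_i + up(G_{i+1}), rebuilt from the top (coarsest) level down
    g = laps[-1]
    for lap in reversed(laps[:-1]):
        g = lap + up(g)
    return g

img = np.random.rand(64, 64)
pyr = laplacian_pyramid(img, n=5)
rec = reconstruct(pyr)
print(np.allclose(rec, img))  # True: the pyramid is exactly invertible
```

Note that the reconstruction is exact regardless of the filter used, as long as the same up(·) appears in both decomposition and reconstruction; in the network, the residual-block outputs L̂_i replace the L_i before this rebuild.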

Non-local enhanced residual dense block
According to [17], a non-local enhanced block is combined with a dense block to obtain a more accurate modelling of rain streaks. To reduce the computational burden caused by the larger spatial size of the feature maps at the lower layers of the pyramid, the feature maps are divided into grids before being input into the non-local enhanced block (NLB); higher layers need fewer divisions. Thus, the numbers of grid divisions from the lowest to the highest level are set to 8, 6, 4, 2, and 1, respectively.
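The grid division can be sketched as a simple partition of the feature map; the non-local operation is then applied inside each region independently. The array layout and region handling below are our own illustrative assumptions (a real implementation would also pad maps whose sides are not divisible by the grid size).

```python
import numpy as np

def grid_regions(feat, g):
    """Split an (H, W, C) feature map into a list of g*g regions.

    Applying attention per region drops the pairwise cost from O((HW)^2)
    to g*g * (HW/g^2)^2 = (HW)^2 / g^2. Assumes H and W are divisible by g.
    """
    h, w, c = feat.shape
    rh, rw = h // g, w // g
    return [feat[i * rh:(i + 1) * rh, j * rw:(j + 1) * rw]
            for i in range(g) for j in range(g)]

feat = np.random.rand(32, 32, 8)
regions = grid_regions(feat, 4)
print(len(regions), regions[0].shape)  # 16 (8, 8, 8)
```

This is why the finest pyramid level, with the largest feature maps, uses the most divisions (8) while the coarsest uses a single global region (1).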
Compared with a traditional convolution layer with only local operations, the region non-local enhanced residual dense block (RNEDB) has a larger receptive field, thereby maintaining edges while effectively removing rain streaks. Before L_n(r) is input into the non-local enhanced residual dense blocks, two convolutional layers are first adopted to extract features, which combine both non-local and local information to build a richer hierarchy in our network. After the first convolutional layer, a skip connection is used to alleviate the problem of information loss.

Non-local enhanced block (NLB)
The principle of a non-local block is to compute the response at a position as a weighted sum of the features at all positions, and it has been applied to the video classification task [27]. A non-local enhanced block is shown in Figure 4. The non-local operation is defined as

O_{k,i} = (1 / c(P_k)) Σ_{∀j} f(P_{k,i}, P_{k,j}) g(P_{k,j}),

where P_{k,i} and P_{k,j} denote the feature map P_k at positions i and j, respectively. The pairwise function f computes a scalar between position i and all positions j, the unary function g computes a representation of the input feature at position j, and the output feature map is normalized by the factor c(P_k). This operation can fit input images of various sizes and can be embedded in many networks.
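The non-local operation above can be sketched on a flattened feature map. The choice of f is not fixed by this paper, so the sketch assumes the embedded dot-product similarity with softmax normalisation, one common instantiation from [27]; the learned 1×1 convolutions are modelled as plain weight matrices with illustrative random values.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def non_local(feat, w_theta, w_phi, w_g):
    """Non-local operation on a flattened (N, C) feature map.

    theta/phi/g play the roles of the embeddings and of g(.) in the equation;
    the softmax over theta @ phi.T implements f together with the
    normalisation factor c(P): each output is a convex combination
    (weighted sum) of the g-features at ALL positions.
    """
    theta, phi, g = feat @ w_theta, feat @ w_phi, feat @ w_g
    attn = softmax(theta @ phi.T, axis=-1)   # (N, N) weights, each row sums to 1
    return attn @ g                          # weighted sum over all positions

rng = np.random.default_rng(0)
n, c = 64, 16                     # e.g. an 8x8 grid region flattened to 64 positions
feat = rng.standard_normal((n, c))
w = [rng.standard_normal((c, c)) * 0.1 for _ in range(3)]
out = non_local(feat, *w)
print(out.shape)  # (64, 16)
```

Because every position attends to every other position in the region, the receptive field of a single NLB spans the whole region, unlike a 3×3 convolution.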

Residual dense block (RDB)
The enhanced feature maps are fed into the residual dense block, which is shown in Figure 5. It consists of five densely connected convolution layers, where the output of the k-th layer is

D_k = H_k([D_0, …, D_{k−1}]),

where [D_0, …, D_{k−1}] denotes the concatenation of the feature maps generated by the preceding densely connected layers, and H_k is a composite function of two consecutive operations: ReLU and a 3 × 3 convolution. Local residual learning is introduced to improve the representation ability of the network, which can learn more abstract feature representations. The residual dense block has two advantages.
1. The Gaussian pyramid, using the up-sampling operation, inherits the advantages of the dense network, namely making full use of hierarchical features and alleviating gradient vanishing. 2. The image at each layer of the pyramid is sparse and does not obey a Gaussian distribution. Thus, by using the image pyramid, the dense blocks can remove the batch normalization (BN) layer [28], reducing GPU memory usage.
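The dense connectivity and local residual described above can be sketched as a forward pass. This is a minimal NumPy sketch under stated simplifications: the 3×3 convolutions are replaced by per-pixel 1×1 ones, and the channel counts and growth rate are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def rdb_forward(x, weights, w_fuse):
    """Residual dense block on an (H, W, C) map.

    Each of the five layers takes the concatenation [D_0, ..., D_{k-1}] of
    all previous outputs (D_k = H_k([D_0, ..., D_{k-1}])); a final 1x1
    fusion maps back to C channels and the input is added (local residual).
    """
    feats = [x]
    for w in weights:                       # five densely connected layers
        cat = np.concatenate(feats, axis=-1)
        feats.append(relu(cat @ w))         # 1x1 conv stand-in + ReLU
    cat = np.concatenate(feats, axis=-1)
    return x + cat @ w_fuse                 # local residual connection

rng = np.random.default_rng(1)
c, growth = 16, 16
x = rng.standard_normal((8, 8, c))
weights = [rng.standard_normal((c + k * growth, growth)) * 0.05 for k in range(5)]
w_fuse = rng.standard_normal((c + 5 * growth, c)) * 0.05
y = rdb_forward(x, weights, w_fuse)
print(y.shape)  # (8, 8, 16)
```

Concatenation, unlike the summation in a plain residual block, lets every later layer see each earlier feature map individually, which is the property the comparison with Lpnet relies on.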

Pixel-wise loss
Given a training set {(R_m, R̂_m)}, m = 1, …, N, where R_m is an input rainy image, G(R_m) and R̂_m denote the de-rained result and the corresponding ground truth, respectively. The L1 loss over N samples can be written as

L_L1 = (1/N) Σ_{m=1}^{N} ||G(R_m) − R̂_m||_1,

which measures the distortion between the de-rained image and the ground truth in image pixel space.

SSIM loss
The SSIM loss is proposed to measure the structural similarity between the de-rained image and the ground truth. SSIM is defined as

SSIM(x, y) = (2 μ_x μ_y + C1)(2 σ_xy + C2) / ((μ_x² + μ_y² + C1)(σ_x² + σ_y² + C2)),

where μ_y and σ_y² represent the mean and variance of y (and likewise for x), and σ_{G(R_m)R̂_m} is the covariance of G(R_m) and R̂_m. The constants C1 and C2 are used to maintain the stability of the SSIM loss. Different from the SSIM evaluation index, the SSIM loss ranges from 0 to 1. Thus, the SSIM loss function is defined as

L_SSIM = 1 − (1/N) Σ_{m=1}^{N} SSIM(G(R_m), R̂_m).
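A simplified version of this loss can be sketched directly from the formula. The standard SSIM [32] averages the expression over local Gaussian windows; for brevity, the sketch below uses whole-image statistics, which is an assumption of the sketch rather than the paper's implementation.

```python
import numpy as np

def ssim_global(x, y, c1=0.01**2, c2=0.03**2):
    """Simplified SSIM using whole-image statistics.

    Inputs are assumed to be scaled to [0, 1]; c1 and c2 are the usual
    stability constants (K1=0.01, K2=0.03 with dynamic range 1).
    """
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))

def ssim_loss(pred, target):
    # L_SSIM = 1 - SSIM: zero when the two images are identical
    return 1.0 - ssim_global(pred, target)

img = np.random.rand(32, 32)
print(round(ssim_loss(img, img), 6))  # 0.0
```

Because SSIM compares local luminance, contrast, and structure, minimising 1 − SSIM pushes the network toward perceptually faithful edges rather than only small pixel-wise error.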

Total loss
L = λ1 Σ_{p∈{1,2}} (L_L1^(p) + L_SSIM^(p)) + λ2 Σ_{p∈{3,4,5}} L_L1^(p),

where λ1 and λ2 are positive weights, p indexes the pyramid levels P = {1, 2, 3, 4, 5}, and both the L1 and SSIM losses defined above are adopted. Specifically, levels {1, 2} contain finer details and more complete rain streaks, so, following [29], L1 + SSIM is adopted as their loss function to better preserve detailed information. Levels {3, 4, 5} mainly contain smooth background areas and coarse structural information, so the L1 loss alone is used there.
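The multi-level supervision can be sketched as follows. The split of loss terms across levels follows the description above (L1 + SSIM on the fine levels {1, 2}, L1 alone on {3, 4, 5}); the exact weighting and the global-statistics SSIM are assumptions of the sketch.

```python
import numpy as np

def l1_loss(pred, target):
    return np.abs(pred - target).mean()

def ssim_global(x, y, c1=0.01**2, c2=0.03**2):
    # simplified whole-image SSIM (the windowed definition is in [32])
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (x.var() + y.var() + c2))

def total_loss(preds, targets, lam1=1.0, lam2=1.0):
    """Multi-level loss over the 5 pyramid levels (level 1 = finest).

    Fine levels {1, 2} get L1 + SSIM to preserve detail; coarse levels
    {3, 4, 5} get L1 only. lam1/lam2 are the positive weights; their
    placement here is an illustrative assumption.
    """
    loss = 0.0
    for p, (pred, target) in enumerate(zip(preds, targets), start=1):
        if p <= 2:  # finer levels: pixel term plus structural term
            loss += lam1 * (l1_loss(pred, target)
                            + (1.0 - ssim_global(pred, target)))
        else:       # coarser levels: pixel term only
            loss += lam2 * l1_loss(pred, target)
    return loss

preds = [np.random.rand(64 // 2**i, 64 // 2**i) for i in range(5)]
loss = total_loss(preds, [p.copy() for p in preds])
print(abs(loss) < 1e-9)  # True for a perfect prediction
```

In training, each pyramid level would contribute its own supervised term, which is the multi-task supervision referred to in the method section.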

Comparison with the state-of-the-art methods
The visual results on each dataset are shown in Figures 6-12, from which the following points can be seen. First, GMM [6] and DSC [9] fail to remove rain streaks from heavy-rain images; although they try to model complex rain streaks by diverse well-designed priors, these priors are only applicable to specific patterns rather than the irregular streaks found across different datasets. Second, under heavy rain, DDN [11] and Lpnet [16] are able to remove rain streaks but tend to generate obvious artifacts. Compared with DDN and Lpnet, our network takes into account the importance of long-distance dependencies and hierarchical features and can learn abstract feature representations; that is, our method can remove more rain streaks while better preserving image detail. Third, under light rain, our method has visual results comparable to NLEDN [17] and outperforms the other methods. Fourth, the region edges produced by the proposed method are clearer than those of the other methods. Moreover, the PSNR and SSIM values of our proposed method are higher than those of the other five methods.
Compared with NLEDN, our method has better edge-preserving capability on heavy-rain images. From Table 1, we can note that our network contains fewer parameters; the number of parameters is reduced by removing the BN layers. Even with fewer parameters, our method still performs better.
Since real-world rainy images contain irregular distributions of rain streaks, the scenarios in these images are very complex. Figure 12 shows the visual comparison with state-of-the-art methods on real-world data. As can be seen, our method has superior performance. This is because the Laplacian pyramid is used to decompose rainy images, so each layer of the pyramid only needs to deal with high-frequency components at a single scale, which helps our network handle complex real-world scenarios.

Datasets and evaluation criteria
The performance of our proposed network is evaluated on both synthetic datasets and real data. For real-world data, some of the images are collected from the Internet and some from the real dataset provided by [16]. For synthetic data, the following four datasets are adopted.
1. The first dataset, Rain12, was proposed by [6]. Figure 6 shows a visual comparison of de-rained results with state-of-the-art methods on the Rain12 dataset. 2. The second dataset is Rain100L, which consists of 200 image pairs for training and another 100 images for testing. All images contain light rain streaks, so the edge information can be easily maintained. Figure 7 shows a visual comparison of de-rained results with state-of-the-art methods on the Rain100L dataset. 3. The third dataset is Rain100H, which contains 1800 images for training and 100 images for testing. There is a big gap between the images in Rain100H and real rain images, as the rain streaks in Rain100H are more like noise; training on this dataset enhances the robustness of our proposed network. Figure 8 shows a visual comparison of de-rained results with state-of-the-art methods on the Rain100H dataset. Both Rain100L and Rain100H were proposed by [14]. 4. The DIDMDN dataset consists of 12,000 image pairs; each image is labelled with its rainfall intensity, and there are only three types of labels in the entire dataset (light, medium, and heavy rain streaks). Following Zhang et al. [30], 9100 image pairs are selected for training and the remaining 2900 image pairs for evaluation. A visual comparison is provided in Figure 9.
The performance on synthetic data is evaluated using three metrics: peak signal-to-noise ratio (PSNR) [31], structural similarity index (SSIM) [32], and natural image quality evaluator (NIQE) [33]. PSNR and SSIM are adopted as full-reference image quality assessments, and NIQE is used as a no-reference image quality assessment.

Parameter settings
Our network is implemented in TensorFlow, and all experiments were performed on a desktop PC with an I7-10700 CPU and an NVIDIA RTX 2080Ti GPU. Following [16] and our own experiments, the kernel number k is set to 16, and leaky ReLU is adopted as the activation function σ. The whole training process is conducted under multi-task supervision. The learning rate is set to 0.001, the input images are resized to 256 × 256, the batch size is 128, and training ends after 10,000 epochs. During training, the Adam optimizer is adopted to update the network parameters.

Different components of our proposed network
In this section, the different components are tested to analyse their impact on de-raining performance. To facilitate experimental comparison, this experiment is conducted on the DIDMDN-data.
Pyramid module: The Gauss-Laplacian pyramid is introduced into our network, with the results shown in Table 4. As can be seen, the PSNR and SSIM values are improved by using the pyramid module, by about 1.61 dB and 6.7%, respectively.
Non-local enhanced block: We then add the non-local enhanced block to the Gauss-Laplacian pyramid network. From Table 4, it can be seen that the non-local enhanced block brings a further improvement in PSNR and SSIM.

Loss function
The loss function of our method is described in Section 3.3. To verify its effectiveness, we test its different components separately. Table 3 shows the PSNR and SSIM comparison for the different loss functions. It can be seen that incorporating SSIM into the loss function clearly improves performance, since SSIM measures local image features, which is consistent with human visual perception.

FIGURE 9
The visual comparison of de-rained results with state-of-the-art methods on DIDMDN [30]

FIGURE 10
Visual comparison of de-rained results with state-of-the-art methods on three synthesized benchmark datasets

FIGURE 11
The comparison results of our proposed method with [16,17] on Rain100H [14]

FIGURE 12
Visual comparison of de-rained results generated from state-of-the-art methods on real-world data

Running time
The comparison of running time in seconds is shown in Table 2. To provide a fair comparison, all methods are executed on the same machine, and we follow the original settings of all released codes. On average, our method takes about 0.19 s, 0.36 s, and 0.72 s to obtain de-rained images of size 250 × 250, 500 × 500, and 750 × 750, respectively.

CONCLUSION
In this paper, a new pyramid non-local enhanced residual dense network is proposed to reduce the following distortions: dark artifacts along the directions of rain streaks and over-smoothed de-rained results. Specifically, in the first step, a rainy image is decomposed into its Laplacian pyramid, using multi-scale decomposition to extract overall and partial information, so that each layer only needs to deal with a single high-frequency component. In the second step, a series of non-local enhanced residual dense blocks (NEDB), each designed as a concatenation of a non-local enhanced block and a residual dense block, is applied to each layer of the Laplacian pyramid. The non-local enhanced block is used to enhance long-distance dependencies of feature maps, and the residual dense block is used to fully exploit the hierarchical features. Finally, the result of our method is reconstructed at the bottom of the Gaussian pyramid. Experimental results on real-world and synthetic data show that our network can effectively remove rain streaks while maintaining edge texture information. Compared with existing methods, the obtained results are more natural. However, the restoration of background details in real rainy images and heavy-rain images is still not clean enough. In future research, we will focus on developing an unsupervised method to obtain optimal de-raining results.