Error feedback denoising network

Recently, deep convolutional neural networks have been successfully used for image denoising due to their favourable performance. This paper applies the error feedback mechanism to image denoising and proposes an error feedback denoising network. Specifically, we use a down-and-up projection sequence to estimate the noise features. Through the residual connection, the clean structures are removed from the noise features. The essential difference between the proposed network and existing feedback networks is the projection sequence: our error feedback projection sequence is down-and-up, which is more suitable for image denoising than the existing up-and-down order. Moreover, we design a compression block to improve the expressive ability of the common 1 × 1 convolutional compression layer. The advantages of our well-designed down-and-up block are that the network has fewer parameters than other feedback networks and an enlarged receptive field. We evaluate our error feedback denoising network on denoising and JPEG image deblocking. Extensive experiments verify the effectiveness of the down-and-up block and demonstrate that our error feedback denoising network is comparable with the state-of-the-art. The source code for reproducing the results is available at: https://github.com/Houruizhi/EFDN.


INTRODUCTION

FIGURE 1 Architecture of the EFDN. The colourful lines are dense connections and the black lines are plain connections. The DU block is our proposed down-and-up block. Cubes with the same colour have the same network structure. (k, k, m) means the kernel size is k and the number of kernels is m; s is the stride of the convolutional layer.

… using the same number of layers. The memory residual network (MemNet) [5] incorporates a recursive architecture and uses gate units to extend the memory further. A model-driven network, the fractional optimal control network (FOCNet) [6], is inspired by differential equations and adopts multi-scale operators on features of different sizes, achieving good performance in AWGN reduction. [7] uses a more complex noise model to estimate real-world noise.
Though denoising deep CNNs have made significant progress, the existing networks have some limitations. First, many existing denoising networks use a plain CNN architecture and operate on features of a single size [3,4], so the receptive field is limited. Second, they estimate the noise only once and remove it by a residual connection, which is not enough to approximate the true noise.
In order to overcome these drawbacks of the existing CNNs, we borrow the idea of the error feedback strategy, which has been successfully used in image super-resolution [8]. In this letter, we propose an error feedback denoising network (EFDN) that introduces the feedback strategy into the image denoising task. The structure of EFDN is shown in Figure 1. Our main contributions are as follows.
Down-and-up projection: Error feedback [8,9] is a strategy that corrects the estimation error by iteration. Motivated by this idea, we design a down-and-up (DU) block as the basic unit of a denoising network that realizes the error feedback strategy. In this way, the receptive field is more than twice as large as with the up-and-down (UD) sequence. The detailed structure is given in Section 3.
A novel compression block: Many networks adopt 1 × 1 convolution to integrate all preceding features [8,10]. However, 1 × 1 convolution fails to exploit the connections between neighbouring regions when integrating features. To solve this problem, we design a special compression block that connects neighbouring regions, which enlarges the receptive field further.
More residual connections: A single residual connection is not enough to estimate the noise effectively [3,4]. Therefore, we add a residual connection in every down-and-up block. These residual connections help the down-and-up blocks remove the clean structure from the estimated noise features.
Extensive experiments on image denoising and JPEG image deblocking show that the proposed EFDN is comparable with the state-of-the-art in both visual quality and quantitative measures.
The paper is structured as follows: In Section 2, we summarize the network design of some current denoising methods and related work on the feedback mechanism. In Section 3, we describe the structure of our proposed EFDN. In Section 4, we verify the effect of our down-and-up feedback mechanism and evaluate the performance of EFDN on denoising and JPEG image deblocking tasks. Finally, we summarize the advantages and some drawbacks of EFDN in Section 5.

RELATED WORKS
Recently, the state-of-the-art technologies for many computer vision problems have been based on deep CNNs, and image denoising is no exception. Early applications of deep learning to the denoising task [11,12] could hardly compete with the state-of-the-art. Since CNNs were successfully applied in the computer vision domain [13-15], many CNN denoisers [3,4] have been developed. A Gaussian denoiser based on CNN, i.e. DnCNN [3], outperforms many traditional methods by a large margin. The basic unit of DnCNN consists of a convolutional layer, a batch normalization (BN) layer [16], and a rectified linear unit (ReLU) [13]. This block has become a typical building block of later denoising networks. The residual connection, namely the skip connection, is also adopted by DnCNN, which forces the network to estimate the noise. The BN layer [16] was first proposed to reduce the distribution fluctuation of intermediate features. Many networks for high-level computer vision tasks apply the BN layer, which enhances performance and accelerates convergence [17-20], and some denoising networks use it successfully [3,5]. In [3], the BN layer and the residual connection benefit from each other and speed up training. In many low-level tasks, however, such as image super-resolution [10,21] and real-world denoising [7], the BN layer sometimes helps little or even degrades the reconstruction quality.
The feedback mechanism allows an algorithm to correct its error by iteration. This procedure has been applied to various architectures in many computer vision settings [9, 22-24]. [8] uses back-projection stages with dense connections, becoming the first feedback network in the image super-resolution area; its up-and-down sampling operator realizes the feedback strategy and consists of three up- or down-sampling layers and two skip connections. [25] plugs the feedback block as an operator into a recurrent neural network (RNN) architecture as in [23]. So far, however, there is no comparable work on the image denoising task, and the implementation of the error feedback strategy in denoising remains to be explored.

PROPOSED METHOD
In this section, we first explain how we realize the feedback mechanism, then present the structure of the compression block, and finally describe the network structure.

Feedback strategy
The dense deep back-projection network (D-DBPN) [8] uses up- and down-projections for super-resolution. We remark that there are some differences between denoising and super-resolution. Super-resolution reconstructs details and structures, which resemble noise in having a high local variance. If we use up-projection first, as in D-DBPN, for image denoising, the noise will be amplified. Another weakness of the up-and-down sequence is its large memory consumption due to the large feature size.
Since a down-sampling layer reduces the image size and extracts cleaner features, it is better to apply the down-sampling layer first. Hence, we design a new down-and-up (DU) block as the basic unit of EFDN, as shown in the top right of Figure 1. The down-projection extracts the noise and removes the structure information. After the up-projection layer and the residual connection, a more accurate noise-estimation feature is generated.
Compared with D-DBPN [8], our sampling layers are simpler and the order of the down- and up-projections is different. Moreover, we utilize the batch normalization (BN) layer to improve denoising performance.
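As a rough PyTorch sketch of this design (the layer shapes follow the Conv(m, m, 6)/DeConv(m, m, 6) with stride 2 described in Section 3; the padding value, the BN/ReLU placement, and the exact form of the residual are our assumptions, not the paper's specification):

```python
import torch
import torch.nn as nn

class DUBlock(nn.Module):
    """Down-and-up projection: estimate noise features at half resolution,
    project back to full resolution, then use a residual connection to
    strip the back-projected clean structure from the input feature."""
    def __init__(self, m: int):
        super().__init__()
        # Down-projection: Conv(m, m, 6) with stride 2 halves the spatial size.
        self.down = nn.Sequential(
            nn.Conv2d(m, m, kernel_size=6, stride=2, padding=2),
            nn.BatchNorm2d(m),
            nn.ReLU(inplace=True),
        )
        # Up-projection: DeConv(m, m, 6) with stride 2 restores the size.
        self.up = nn.Sequential(
            nn.ConvTranspose2d(m, m, kernel_size=6, stride=2, padding=2),
            nn.BatchNorm2d(m),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        d = self.down(x)   # half-resolution feature with structure removed
        u = self.up(d)     # back-projected estimate at full resolution
        return x - u       # residual connection (sign/placement assumed)
```

With padding 2, the down-and-up round trip preserves the feature size for even spatial dimensions, so the residual subtraction is well defined.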

Compression block
Feature fusion layers always account for a significant proportion of all network layers, so it is necessary to improve the efficiency of feature fusion with a compression layer. We design a compression block (CB), shown in the bottom right of Figure 1. The widely used 1 × 1 convolutional fusion layer cannot exploit the information in neighbouring regions. To enlarge the receptive field, we set the kernel size of the middle layer in the compression block to 3 × 3. Moreover, when the number of input features is relatively small, it is hard to extract their information sufficiently. Therefore, we first increase the channel number m to a larger m′ with a 1 × 1 convolution to extend the representation space, then apply the 3 × 3 convolution to the intermediate features and compress them back to m channels with another 1 × 1 convolution. In this way, abstract features can be better extracted.
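A minimal sketch of this 1 × 1 expand, 3 × 3 mix, 1 × 1 compress pattern (the expansion factor m′ = 2m and the activation placement are our assumptions; the paper only fixes the kernel sizes and the m → m′ → m channel flow):

```python
import torch.nn as nn

class CompressionBlock(nn.Module):
    """Fuses concatenated features: lift to m' channels, mix neighbouring
    regions with a 3x3 layer, then compress to m channels. Unlike a single
    1x1 fusion layer, the 3x3 middle layer sees neighbouring pixels."""
    def __init__(self, c_in: int, m: int, expand: int = 2):
        super().__init__()
        m_prime = expand * m  # widened representation space (m' in the text)
        self.body = nn.Sequential(
            nn.Conv2d(c_in, m_prime, kernel_size=1),                # expand
            nn.ReLU(inplace=True),
            nn.Conv2d(m_prime, m_prime, kernel_size=3, padding=1),  # mix 3x3
            nn.ReLU(inplace=True),
            nn.Conv2d(m_prime, m, kernel_size=1),                   # compress
        )

    def forward(self, x):
        return self.body(x)
```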
The topology of our compression block is similar to the basic block of ResNeXt [26], but there are essential differences: ResNeXt aims to reduce the complexity of the bottleneck structure of the residual network (ResNet), while our block compresses the input features; the numbers of input and output channels of the CB differ; and grouped convolution is not used.

Network structure
Denote by Conv(c_in, c_out, k) and DeConv(c_in, c_out, k) the convolution and deconvolution layers, where c_in and c_out are the numbers of input and output channels and k is the kernel size. The feature extraction block consists of two steps, feature mapping and shrinking [27], realized by Conv(c, 4m, 3) and a CB, where c = 1 or 3 is the channel number of the input image. Denote by F the features extracted from the noisy image y.
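In code, the extraction head might look as follows (reusing the CompressionBlock sketch above; padding choices are assumed):

```python
import torch.nn as nn

def make_feature_extractor(c: int, m: int) -> nn.Sequential:
    """Feature mapping Conv(c, 4m, 3) followed by shrinking via a CB,
    producing the initial feature F from the noisy image y."""
    return nn.Sequential(
        nn.Conv2d(c, 4 * m, kernel_size=3, padding=1),  # map to 4m channels
        CompressionBlock(4 * m, m),                      # shrink back to m
    )
```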
The body of EFDN consists of T DU blocks that predict the noise features. Let the t-th DU block be $f_{DU}^{t}(\cdot)$ and $N_0 = F$; then the t-th middle feature $N_t$ is

$N_t = f_{DU}^{t}\big([N_0, N_1, \ldots, N_{t-1}]\big)$,

where $[\cdot]$ refers to the concatenation of the input features. When t > 1, the outputs of all preceding blocks are the input of step t.
The t-th DU block $f_{DU}^{t}(\cdot)$ consists of down- and up-projection layers, denoted $f_D^t$ and $f_U^t$. Specifically, $f_D^t$ and $f_U^t$ are Conv(m, m, 6) and DeConv(m, m, 6) with stride 2. All preceding features of the same size are used as the inputs of $f_D^t(\cdot)$ or $f_U^t(\cdot)$. Let $U_t$ and $D_t$ be the outputs of $f_U^t$ and $f_D^t$. When t > 1, they are represented by

$D_t = f_D^t\big(f_{CB}^t([U_0, U_1, \ldots, U_{t-1}])\big), \quad U_t = f_U^t\big(f_{CB}^t([D_1, D_2, \ldots, D_t])\big)$,

where $U_0 = N_0$ and $f_{CB}^t$ is the CB. There is no CB when t = 1.
Let $N_e$ be the final estimated noise; then

$N_e = f_R\big([N_1, N_2, \ldots, N_T]\big)$,

where $f_R$ is composed of a CB and Conv(m, c, 3). The restoration $\hat{x}$ is obtained by the residual connection [3]: $\hat{x} = y - N_e$. Given the training set $\{(y_i, x_i)\}_{i=1}^{M}$, where $y_i$ is the contaminated image and $x_i$ is the corresponding clean image, the loss function of the whole network $\mathcal{F}(\cdot, \Theta)$ with weights $\Theta$ is

$\mathcal{L}(\Theta) = \frac{1}{2M} \sum_{i=1}^{M} \big\| \mathcal{F}(y_i, \Theta) - x_i \big\|_F^2$,

where $\|\cdot\|_F$ is the Frobenius norm and M is the number of training data pairs.
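Putting the pieces together, a simplified end-to-end sketch might read as below (reusing the DUBlock, CompressionBlock, and make_feature_extractor sketches above). This version routes the dense connections by fusing all preceding features with one CB per step and omits the inner down/up dense wiring and the no-CB special case at t = 1, so it is an approximation of the architecture rather than a faithful reimplementation:

```python
import torch
import torch.nn as nn

class EFDN(nn.Module):
    """Sketch: extract F, run T DU blocks with dense connections,
    fuse the noise features into N_e, and subtract from the input."""
    def __init__(self, c: int = 1, m: int = 64, T: int = 8):
        super().__init__()
        self.head = make_feature_extractor(c, m)
        # One CB per step fuses the concatenation of all preceding features.
        self.fuse = nn.ModuleList(
            [CompressionBlock((t + 1) * m, m) for t in range(T)]
        )
        self.du = nn.ModuleList([DUBlock(m) for _ in range(T)])
        # Reconstruction f_R: CB over the T noise features, then Conv(m, c, 3).
        self.tail = nn.Sequential(
            CompressionBlock(T * m, m),
            nn.Conv2d(m, c, kernel_size=3, padding=1),
        )

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        feats = [self.head(y)]                        # N_0 = F
        for fuse, du in zip(self.fuse, self.du):
            fused = fuse(torch.cat(feats, dim=1))     # [N_0, ..., N_{t-1}]
            feats.append(du(fused))                   # N_t
        n_e = self.tail(torch.cat(feats[1:], dim=1))  # noise estimate N_e
        return y - n_e                                # residual restoration

# MSE training loss, matching the paper's objective up to a constant factor.
loss_fn = nn.MSELoss()
```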

EXPERIMENTS
In this section, we first explain the experimental setting and how the network is trained. Second, we verify the effectiveness of EFDN via an ablation study. We then evaluate EFDN on denoising and JPEG image deblocking.

Experimental setting
We use 800 high-resolution images from DIV2K [28] as the training set. The ADAM optimizer [29] is adopted to optimize the loss function, with hyper-parameters β₁ = 0.9, β₂ = 0.99, and ε = 10⁻⁸. The weights are initialized following DnCNN [3]. We train for 100 epochs with a batch size of 16. The learning rate is initialized to 10⁻³ and multiplied by 0.1 every 30 epochs. We implement our networks in the PyTorch framework and set the same random seed at the start of each training run. Training EFDN takes about 5 hours.
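The stated optimizer and schedule translate directly into PyTorch (the seed value and the EFDN instantiation below are our illustrative choices; the training loop body is elided):

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

torch.manual_seed(0)  # fixed random seed, as in the paper (value is ours)

model = EFDN(c=1, m=64, T=8)  # hypothetical instantiation of the sketch above
optimizer = Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.99), eps=1e-8)
# Multiply the learning rate by 0.1 every 30 of the 100 epochs.
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    # ... iterate mini-batches of size 16, compute the MSE loss,
    #     backpropagate, and step the optimizer ...
    scheduler.step()
```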

Ablation study
In this subsection, we compare the effect of the up-and-down and down-and-up feedback strategies in EFDN and D-DBPN. D-DBPN is originally designed for super-resolution, so we modify it for denoising by replacing its last deconvolutional layer with a convolutional layer; the original projection sequence of D-DBPN [8] is up-and-down. We also verify the effect of our proposed compression block by implementing two compression methods in EFDN: 1 × 1 convolution and the proposed compression block. The test dataset is Set12 [3], whose images are widely used in many image restoration problems, and the test noise level is σ = 50 in this subsection. The results in Table 1 include PSNR, receptive field (RF), the number of parameters (#Params), and computing time (Time). The computing time is the average run-time over 100 runs on a 256 × 256 image. Because UD and DU differ only in the order of the projection layers, they have the same number of parameters. However, the larger feature size of UD leads to roughly four times the computing time, and the receptive field of DU is twice as large as UD's. The receptive field of D-DBPN is larger than EFDN's but more computationally expensive, and the PSNR of our EFDN is comparable with D-DBPN's with fewer parameters. In conclusion, the down-and-up strategy has a larger receptive field, faster inference, and higher PSNR than the up-and-down strategy. Table 1 also shows that the proposed compression block both enlarges the receptive field and improves the reconstruction quality.

Figure 2 shows the PSNR curves during training for networks with different feedback strategies and depths. All curves rise during the first 30 epochs and improve markedly at the 30th epoch; afterwards, the PSNR fluctuates around a relatively high value. The left plot shows the training of different feedback blocks, in which the DU block with CB attains the highest PSNR and the plain block the lowest. In the right plot, the PSNR for T = 4 is relatively low, while the PSNRs for T = 8 and T = 12 are similar.

Experiment results
In this subsection, we test our proposed networks on denoising and JPEG image deblocking. PSNR and SSIM [30] results are reported, and some result images are displayed to compare visual quality. PSNR and SSIM values are cited from the original papers where available; otherwise, we run the authors' open code.
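For reference, a sketch of the evaluation protocol with standard scikit-image metrics (the paper does not specify its metric implementation, so this is an assumed but common setup; images are assumed grayscale in [0, 1]):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(clean: np.ndarray, restored: np.ndarray):
    """PSNR/SSIM of a restored image against its clean reference."""
    psnr = peak_signal_noise_ratio(clean, restored, data_range=1.0)
    ssim = structural_similarity(clean, restored, data_range=1.0)
    return psnr, ssim

def add_awgn(clean: np.ndarray, sigma: float, rng=np.random) -> np.ndarray:
    """Synthesize a noisy test image; e.g. sigma = 50 / 255 corresponds
    to the sigma = 50 setting on 8-bit images rescaled to [0, 1]."""
    return clean + rng.normal(0.0, sigma, size=clean.shape)
```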
Synthetic image denoising: We test our proposed networks on gray images with AWGN from the Set12 [3], BSD68 [32], and Urban100 [33] datasets. To verify the efficiency of our implementation of the feedback mechanism, we compare the PSNR of several methods, including BM3D [1], WNNM [2], TNRD [34], DnCNN [3], FFDNet [4], and N3Net [35]. The PSNR values are reported in Table 2 and show that our proposed EFDN attains the highest PSNR: on Set12, BSD68, and Urban100, EFDN outperforms the other methods by at least 0.19 dB, 0.1 dB, and 0.45 dB, respectively. We also compare visual results for gray image denoising in Figure 3, where the noisy images are corrupted by AWGN with standard deviation σ = 50 and the reference regions are enlarged at the bottom. In Figure 3, the left enlarged image contains trees and a building, and the right one includes the edge of the building. WNNM and FFDNet restore a sharper edge of the building but leave some obvious artefacts, whereas the results of our EFDN are cleaner and have fewer artefacts than the others.
For colour images, we add AWGN to images from Set5, LIVE1, and CBSD100 and report the average PSNR of the results. The compared methods are CDnCNN and FFDNet. As the PSNR results in Table 3 show, the average PSNR of our CEFDN (EFDN for colour images) outperforms CDnCNN and FFDNet by 0.21 dB and 0.33 dB, respectively. The visual results for AWGN denoising on colour images are shown in Figures 4 and 5. In Figure 4, CEFDN restores clear textures in the enlarged beak of the parrot, and the outline of the eye is sharper than in the other results. In the enlarged local image of Figure 5, CEFDN also preserves more details.
Real image blind denoising: We train CEFDN on AWGN with σ ∈ [0, 40] for blind denoising and test it on the NC12 [31] and Nam [36] real-world datasets. The results are shown in Table 4: our CEFDN outperforms the second-best method, MC-WNNM, by 0.53 dB in PSNR and 0.0043 in SSIM. Figures 6 and 7 show visual results on NC12 [31]. Figure 6 is taken in a dark environment. In the left enlarged image, the result of our CEFDN is smoother than CBDNet's: the boundary between the hair and the background is unclear in CBDNet, while ours preserves a rough outline. The right enlarged image is part of the background wall, which should be clean, and the result of CEFDN is cleaner than CBDNet's in the local bottom image. In the bottom-left enlarged image of Figure 7, the result of CBDNet is over-smoothed and the edges are vague, whereas the dog's whiskers are sharper in the result of CEFDN; overall, the result of CEFDN is more natural. In the bottom-right images, the reconstruction of CEFDN is cleaner and has a bolder outline.
JPEG deblocking: In the JPEG deblocking task, we generate low-quality images with quality factors Q ∈ (0, 40] to train EFDN; the trained model is named EFDN-DB. We compare our EFDN-DB with AR-CNN [38] and other methods. Figure 8 shows the results of JPEG image deblocking with quality factor Q = 10. The details are better reconstructed by our proposed EFDN-DB than by the others; see the textures in the enlarged regions.
Computing time: Figure 9 shows the computing time and the corresponding PSNR of the compared CNN-based methods. The vertical axis is the average PSNR on Set12 for images contaminated by AWGN with σ = 50, and the horizontal axis is the average time over 100 runs on the same 256 × 256 image. EFDN is slightly slower than DnCNN and FFDNet but about 0.2 dB higher in PSNR, and it is substantially faster than N3Net.
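A sketch of how such a timing could be measured (the warm-up count and synchronization points are our choices; the protocol of averaging 100 runs on a 256 × 256 image follows the paper):

```python
import time
import torch

@torch.no_grad()
def average_runtime(model: torch.nn.Module, runs: int = 100) -> float:
    """Average forward time on a 256x256 image, as in Figure 9 / Table 1."""
    model.eval()
    x = torch.randn(1, 1, 256, 256)
    if torch.cuda.is_available():
        model, x = model.cuda(), x.cuda()
    for _ in range(5):  # warm-up to exclude one-off allocation costs
        model(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # finish pending GPU work before timing
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs
```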

CONCLUSION
This paper applies the error feedback strategy to the image denoising problem and designs a down-and-up feedback mechanism to denoise effectively. The down-projection extracts more abstract features and removes the noise, and the up-projection reconstructs the clean structure; then, through the residual connection, the clean structure is removed from the estimated noise features step by step. This down-and-up feedback sequence is essentially different from the up-and-down sequence of existing feedback networks, and it saves computing time thanks to the smaller intermediate features. Moreover, the well-designed compression block improves the expressive ability compared with a single convolutional layer. Experimental results on image denoising verify the good visual quality and leading PSNR results of our EFDN. Although its computing speed is slightly slower than DnCNN's, the denoising quality is better, so the loss of speed is acceptable. A weakness of the proposed EFDN is that some details and textures are over-smoothed; to overcome this drawback, we will consider redesigning the basic block of the network, which is left as future work.