MSNet: A novel end‐to‐end single image dehazing network with multiple inter‐scale dense skip‐connections

Funding information: National Natural Science Foundation of China, Grant/Award Numbers: 61966018, 61662030; Provincial Natural Science Foundation of Jiangxi, Grant/Award Number: 20181BAB202013

Abstract: Dehazing is a challenging ill-posed image restoration task. Various prior-based and learning-based methods have been proposed. Among them, end-to-end deep models achieve great success on performance improvement. However, most of them concentrate on feature learning within the same block scale in isolation, and cannot perform associated analysis on feature characteristics of different scales. Inter-scale information reuse, which is especially beneficial to image restoration, is often neglected. Therefore, in this paper, a novel end-to-end network with multiple inter-scale dense skip-connections for image dehazing is proposed. Sufficient complementary information combination is achieved through dense inter-scale skip-connections among encoder and decoder block layers. Besides avoiding gradient vanishing, a kind of bottleneck residual block is proposed to control the importance of local gradients at different scales over the global learning process. Extensive comparisons and ablation studies on public dehazing datasets and real-world images have been conducted. The experimental results demonstrate that the proposed novel elements ensure a more stable training process and superior testing performance, with great improvements on PSNR and SSIM. The authors' haze-removal results consistently comply with real situations, having much higher definition and contrast without colour distortion than those from the state-of-the-art methods compared in this paper.


INTRODUCTION
Haze is a common atmospheric phenomenon produced by small floating particles like dust and smoke in the air. These floating particles absorb and scatter light, which makes the sky dark and gloomy. Images photographed in this situation inevitably suffer from severe vision problems such as colour distortion, low contrast, scene attenuation and so on, as shown in Figure 1.
Obtaining clear images is extremely important. Under severe hazy conditions, most high-level vision applications, such as video surveillance, remote sensing, autonomous driving, object detection etc., cannot work well. In order to overcome the adverse influence of such weather conditions on high-level visual tasks, image dehazing (a.k.a. haze removal) [1,2] arises as one of the desirable solutions, and has been extensively studied in recent years.
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2020 The Authors. IET Image Processing published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.

Being an ill-posed restoration problem, image dehazing is very challenging. Most existing methods [3][4][5][6][7] are basically based on the classic atmospheric scattering model [8], formulated as Equation (1):

I(x) = J(x) · t(x) + A · (1 − t(x)),   (1)
where I(x) is the observed hazy image, J(x) is the desired haze-free image to be recovered, A is the global atmospheric light, and t(x) is the transmission map of the atmospheric light reaching the observer's camera; x represents the image's pixel index. The atmospheric scattering model was first proposed by McCartney [8] and further improved by Narasimhan and Nayar [9]. It describes the imaging mechanism under the joint action of haze and light from the perspective of physical light scattering. As is well known, in the real world, an ideal haze-free image J(x) is formed by sunlight reflections from the surfaces of objects. However, due to the presence of fog and inhalable particulate matter in the hazy air, the light transmission map t(x) is not uniform. As a result, J(x) decays to the hazy image I(x) that we observe. So, according to the atmospheric scattering model, in order to recover the haze-free ground-truth J(x), we need to estimate the transmission map t(x) and the atmospheric light A.

FIGURE 1 A simple example of image dehazing
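The forward process described by Equation (1) can be sketched numerically. The sketch below assumes the common exponential transmission t(x) = exp(−β·d(x)); the depth map, atmospheric light and scattering coefficient are illustrative values only, not taken from any dataset.

```python
import numpy as np

def synthesize_haze(J, d, A, beta):
    """Apply the atmospheric scattering model I = J*t + A*(1 - t),
    with transmission t(x) = exp(-beta * d(x))."""
    t = np.exp(-beta * d)               # transmission map from depth
    t3 = t[..., None]                   # broadcast over colour channels
    return J * t3 + A * (1.0 - t3)

# Illustrative example: a 2x2 "image" with constant colour
J = np.full((2, 2, 3), 0.8)             # haze-free image in [0, 1]
d = np.array([[0.0, 1.0], [2.0, 3.0]])  # synthetic depth map
I = synthesize_haze(J, d, A=1.0, beta=1.0)
# At d = 0 the pixel is unchanged; as depth grows, I(x) tends towards A
```

Recovering J(x) from I(x) therefore requires undoing exactly this mixing, which is why accurate estimates of t(x) and A are so critical for model-based methods.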
Among most existing physical-model-based dehazing methods, the main differences lie in the ways of estimating the transmission map t(x) and the global light A, either from the image's inherent handcrafted features, or by neural networks. Though tremendous improvements have been made, the inaccuracies resulting from both the transmission function and the atmospheric light estimation can potentially amplify each other and hinder the overall dehazing performance, especially when an object's colour is similar to the background.
The classic atmospheric scattering model is an elegant simplified physical model based on the assumptions of single scattering and a homogeneous atmospheric medium. In fact, the formation of a hazy image has a complicated mechanism. There potentially exists some highly nonlinear transformation I = Φ(J; θ) between the hazy image and its haze-free ground-truth. Therefore, with the popularity of deep learning, it has become prevalent to adopt end-to-end neural networks to simulate the formation of hazy images [10][11][12].
Deep models have comfortably achieved success on the image dehazing task. In general, most researchers build dehazing networks based on encoder-decoder strategies, and aggregate information in the way of ResNet [13] or DenseNet [14]. However, these classic deep models cannot perform associated analysis on features of different scales well. The DenseBlock in DenseNet focuses on feature reuse within the same scale. The ResBlock in ResNet aims to solve gradient vanishing during deep network training. A vanilla encoder-decoder like the traditional U-Net [15] also only considers reuse of same-scale information between the encoder and decoder modules. Inter-scale information reuse is often neglected.
According to past experience, feature reuse is especially beneficial to image restoration. A large-scale feature map has higher resolution with more low-level details, while a small-scale feature map contains more significant semantic information. If a deep model only progressively considers features of the same scale in isolation, as the network goes deeper, a large amount of valuable information will be lost irreversibly.
In this paper, we propose an effective end-to-end dehazing network with multiple inter-scale dense skip-connections to take full account of feature reuse and fusion among different scales. The proposed network follows an encoder-decoder structure. Specifically, the encoder convolves the input image into several successive spatial pyramid scales. The decoder then successively recovers image details from the encoded feature maps. In the encoder part, in order to better combine and learn semantic feature information between different scales, auxiliary residual skip-connections are applied between block layers of scale k − 2 and k for feature reuse. In order to better learn global information and make the network more effective for dehazing, in the decoder part, at scale k, both decoder information from its previous two scales k + 1, k + 2 and encoder information from the corresponding scales k, k − 2 are considered for sufficient decoding information fusion. We have evaluated our proposed network on public image dehazing benchmarks and a real-world image dataset. The experimental results show that, compared with state-of-the-art methods, our method achieves great improvements in final restoration quality.
The contributions of this paper are summarised as follows:

• An effective encoder-decoder dehazing network with multiple inter-scale dense skip-connections is proposed to take full account of feature reuse and fusion among different scales. Compared with existing end-to-end dehazing models, more sufficient complementary information combinations are considered among inter-scale block layers, without increasing parameters. This important characteristic empowers our network to preserve much finer and sharper details than many existing dehazing models.

• The proposed inter-scale feature reuse mechanism can better learn the intrinsic relationship between the hazy image I(x) and the haze-free image J(x) in an end-to-end way. Extensive dehazing experiments are conducted on public synthetic datasets and a real-world image dataset. Compared with recent state-of-the-art dehazing methods, our network achieves superior performance with great improvements both on objective evaluation criteria and subjective visual appearance.

• Last but not least, the proposed dense inter-scale feature reuse mechanism is a general structure extending the traditional U-Net. Its multiple inter-scale dense skip connections can better combine and learn semantic feature information between different scales. Besides being able to avoid gradient vanishing, it also ensures a more stable training and testing process. Therefore, it has much potential to be applied to more applications, not limited to dehazing.

RELATED WORK
Single image dehazing is a very challenging ill-posed problem. In the past, various prior-based and learning-based methods have been developed to solve the problem. Most of them depend on the physical atmospheric scattering model, and great efforts have been made to estimate the transmission map and atmospheric light. In recent years, with the rise of deep learning, data-driven end-to-end dehazing methods have become popular, drawing on experience from image generation and image translation to explore new solutions. In this section, we focus on a portion of representative methods and the most recent developments. More related works can be found in the surveys [16,17]. We roughly divide most existing dehazing methods into three categories: 1) prior-based methods depending on the atmospheric scattering model; 2) learning-based methods depending on the atmospheric scattering model; 3) data-driven end-to-end non-physical models.

2.1 Prior-based methods on atmospheric scattering model

Fattal et al. [1] assumed that the surface reflectance of an object was uncorrelated with the transmission map. They removed haze by estimating scene reflectivity via independent component analysis.
Tan et al. [2] observed that haze-free images had higher contrast than hazy images, and proposed a Markov random field model to maximise local contrast for image dehazing. Based on Tan's observation, He et al. [3] proposed a dark channel prior for estimation of haze concentration and transmission map.
Zhu et al. [4] observed that the concentration of haze was positively correlated with the difference between brightness and saturation. They created a linear depth computing model for hazy image, and proposed a colour attenuation prior to estimate transmission map for haze removal.
Berman et al. [5] observed that colours of a haze-free image could be well approximated by a few hundred distinct colour clusters, each of which became a line in the presence of haze. Based on the non-local prior, they proposed a dehazing model to recover both distance map and haze-free image.
Tang et al. [18] observed that different haze-relevant priors contributed complementary information on image dehazing. They systematically investigated existing haze-relevant priors in a regression framework to identify the best prior combination for haze removal.

2.2 Learning-based methods on atmospheric scattering model

With the great success of convolutional neural networks (CNNs) in computer vision, recent dehazing methods proposed to learn the transmission map fully from data, in order to avoid inaccurate estimation of physical parameters from a single image.
Cai et al. [6] proposed DehazeNet, in which a CNN with a novel BReLU unit was constructed to estimate the transmission map. Ren et al. [7] proposed a multi-scale deep neural network to estimate the transmission map. The AOD-Net proposed by Li et al. [19] introduced a newly defined transmission variable to integrate both the classic transmission map and the atmospheric light. Ren et al. [20] proposed a Gated Fusion Network (GFN) to learn intermediate confidence maps through which different handcrafted feature maps were fused to restore the clear image from the corresponding hazy one. Wang et al. [21] proposed to learn the medium transmission map and scattered light based on the assumption that a linear relationship exists in the minimum channel between the hazy image and the haze-free image.
Since the generative adversarial network (GAN) was first proposed by Goodfellow et al. [22], it has been widely used in many computer vision applications and achieved promising results, especially in image synthesis. Due to the similarities between image dehazing and image generation tasks, many researchers have utilised it to generate haze-free images.
Yang et al. [23] cast the image dehazing problem as an unpaired image-to-image translation problem. They proposed a disentangled dehazing network that generated realistic haze-free images using only unpaired supervision. Zhang et al. [24] proposed a densely connected pyramid dehazing network into which the atmospheric scattering model was directly embedded, so as to ensure their proposed method strictly follows the physics-driven scattering model. The transmission map, atmospheric light and dehazing were jointly learned within the GAN framework. Zhu et al. [25] explored the connection among generative adversarial models, image dehazing, and differentiable programming. The atmospheric scattering model was reformulated into a generative adversarial network for simultaneous estimation of the atmospheric light and the transmission map.

2.3 Data-driven end-to-end non-physical models
Traditional physical-model-based methods mainly focused on representing human knowledge as priors for image dehazing. However, it is difficult for these priors to always be satisfied in practice. Many researchers realised that exploring dehazing solutions did not have to be restricted to the physical scattering model.
Considering that there might exist a highly nonlinear transformation between a hazy image and its clear ground-truth, Mei et al. [10] proposed to establish an end-to-end deep network, which uses an encoder-decoder structure to directly learn the formation function. Xu et al. [11] proposed an end-to-end single image dehazing model based on an encoder-decoder structure with skip connections and instance normalisation. Convolutional layers of the pre-trained VGG [26] network were adopted as the encoder.
Qu et al. [12] disentangled image dehazing from the physical scattering model, and regarded dehazing as an image generation problem. They employed a GAN to supervise the generation of the haze-free image and a well-designed enhancer to produce a realistic result on the fine scale. Li et al. [27] proposed a PDR-Net to deal with image dehazing and quality refinement separately by using a dehazing subnetwork and a refinement subnetwork, trained with a multi-term loss function including content loss, colour loss and other terms.

Learning the formation process of hazy images in an end-to-end way is becoming a dominant trend. However, multi-scale information reuse, which is critical for preserving fine details, still has not been paid sufficient attention during end-to-end network construction. Both [10] and [11] adopted the standard U-Net structure as the backbone architecture, in which information among inter-resolution feature maps was neglected. In spite of the two resolution generators in Qu et al. [12], multi-resolution feature aggregations are still not sufficiently utilised: even in its enhancing block, feature maps at different scales were processed in parallel without much interaction. As a result, details could not be recovered well by these existing end-to-end methods, leading to problems such as colour infidelity, weak edges, incomplete dehazing etc. Therefore, as discussed above, it is meaningful to investigate inter-scale information reuse with sufficient interaction and to embed this mechanism into the exploration of a non-physical dehazing model.

PROPOSED METHOD
In this section, we introduce the proposed dehazing network in detail. From the successful experience of our previous work PFFNet [10], a U-Net-like encoder-decoder structure is effective on the image dehazing task. Therefore, in this work, we extend the idea to propose a novel end-to-end dehazing network with multiple inter-scale dense skip-connections, which we name MSNet for short.
The architecture of the proposed network is shown in Figure 2. It consists of three key elements: (1) bottleneck residual blocks in both the encoder and decoder parts; (2) down-sample skip-connections at the encoder stage; (3) up-sample skip-connections at the decoder stage. Efficient feature reuse and information interaction among different scales are the main characteristics.
An overview of the parameters is shown in Table 1. Two instances of the network's structure are demonstrated in the table.

They are MSNet_4_128 and MSNet_5_(256 × n), respectively.

FIGURE 3 Structure of bottleneck residual block
In this work, we take MSNet_5_256 as the basic network. For the other network variants, to keep the network structure unchanged, the channel size of each block layer is set to n times the channel size of the corresponding block layer of MSNet_5_256.

Bottleneck residual block
The bottleneck residual block adopts residual connections similar to the traditional residual block, and likewise does not change the spatial scale of the input feature map. The difference lies in the skip-connection route: a 1 × 1 bottleneck convolution is employed before the element-wise addition with the residual signals. The block details are shown in Figure 3. We formulate the bottleneck residual block at scale k as Equation (2):

z_{k+1} = f(z_k) + g(z_k),  z_k ∈ R^{h×w×C}, z_{k+1} ∈ R^{h×w×M},   (2)

where h, w are the spatial height and width of the feature map z_k, and f represents the residual function of two successive 3 × 3 convolutions. g represents the function of the 1 × 1 bottleneck skip-connection. Specifically, g(z_k) = W z_k, where W ∈ R^{M×C}. Herein, C is the channel size of the input signal z_k and M is the channel size of the output signal z_{k+1}. Assuming a training network with the bottleneck residual blocks is defined as Loss = forward(z_{k+1}), we can calculate the gradients back-propagated at scale k as in Equations (3)-(5):

∂Loss/∂z_k = ∂Loss/∂z_{k+1} · (∂f(z_k)/∂z_k + ∂g(z_k)/∂z_k),   (3)

∂Loss/∂z_k |_residual = ∂Loss/∂z_{k+1} ⊙ ∂f(z_k)/∂z_k,   (4)

∂Loss/∂z_k |_skip = δ⃗,  δ_c = Σ_{m=1}^{M} W_{m,c} (∂Loss/∂z_{k+1})_m,  δ⃗ = [δ_1, …, δ_c, …, δ_C]′,   (5)

where W_{m,c} is the weight parameter of the 1 × 1 bottleneck convolution at the m-th row and c-th column position, and δ_c represents the gradient value of channel c at the spatial position with index (i, j). δ⃗ is the channel gradient vector, whose skip-path weights W_{m,c} are learned values independent of the input signals. ⊙ represents element-wise multiplication. From the mathematical gradient calculations, we can easily find that the bottleneck residual block not only preserves the merit of avoiding gradient vanishing, but also acts like a datapath gate that can control the importance of local gradients at different scales over the global learning process.
In order to further increase feature nonlinearity and avoid gradient vanishing, a ReLU activation is employed before the element-wise addition. The convolution parameters are shown in Table 2.
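A minimal PyTorch sketch of the bottleneck residual block as described above: two 3 × 3 convolutions on the residual path, a 1 × 1 bottleneck convolution on the skip path, and a ReLU applied before the element-wise addition. The channel sizes and the exact activation placement inside the residual path are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BottleneckResBlock(nn.Module):
    """Residual path: two 3x3 convs; skip path: 1x1 bottleneck conv g(z) = Wz.
    Spatial resolution is preserved; channels may change from C to M."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )
        # 1x1 bottleneck skip: per-pixel channel mixing by the matrix W
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, z):
        # ReLU before the element-wise addition, as in the text
        return self.act(self.residual(z)) + self.skip(z)

x = torch.randn(1, 64, 32, 32)
y = BottleneckResBlock(64, 128)(x)   # spatial size kept, channels 64 -> 128
```

Because the skip path is a learned linear map rather than an identity, its backward pass multiplies the upstream gradient by W, which is the "datapath gate" behaviour discussed above.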

Encoder-decoder structure with multiple inter-scale dense skip connections
The proposed network follows an encoder-decoder structure. At the encoder stage, the network zooms the image down to 1/2^K of its original input size and successively encodes intermediate structural information, where K denotes the depth of the network. At the decoder stage, the encoded feature maps are then zoomed back to the original size of the input image by successively doubling the spatial resolution of the feature maps during decoding. For feature learning efficiency and training ease, the bottleneck residual block is proposed to replace the traditional convolution block at all block layers in the encoder and decoder parts.
To avoid ambiguity, in this section, we use "scale" with the same meaning as "spatial resolution". That is to say, a "large-scale feature" means a feature map of large spatial resolution, at a shallow layer with much low-level information, and a "small-scale feature" means a feature map of small spatial resolution, at a deep layer with much high-level information.

Encoder
An initial convolution block Init_Conv is first employed to aggregate informative features over a relatively large local receptive field from the observed hazy image I(x).
The following encoder stage consists of five bottleneck residual blocks. For convenience of description, we denote the k-th residual block as ResConv_k^en, k ∈ {1, 2, 3, 4, 5}. At the feature map of scale k, the channel size doubles compared with the previous feature map at scale k − 1. We denote the output signal of the k-th bottleneck residual block as E_k, and the initial feature map as Ĩ = Init_Conv(I).
As is well known, large-scale features focus more explicitly on low-level texture information, while small-scale features focus on high-level semantic information. The significance of down-sampling is that it not only increases the size of the receptive field and the robustness to small disturbances of the input image, such as image translation, rotation etc., but also reduces the risk of overfitting and the amount of computation. Therefore, considering the complementary information contained in features of different scales, we adopt the idea of the image pyramid to construct feature maps with different spatial resolutions by down-sample convolution operations.
Inter-scale skip connections among encoder modules are employed for feature reuse at high-level semantic scales. Bottleneck residual blocks are in charge of feature learning at the corresponding scales. We formulate the information processing as Equations (6)-(9):

E_k = ResConv_k^en(S_k),   (6)

S_1 = Init_Conv(I),   (7)

S_2 = DownSample2(E_1),   (8)

S_k = DownSample2(E_{k−1}) + DownSample4(E_{k−2}),  k ≥ 3,   (9)

where S_k represents the input of the k-th block. The functions DownSample2 and DownSample4 down-sample their respective feature maps to 1/2 and 1/4 of the original spatial resolution by convolution operations. The convolution kernel size is the same 3 × 3, but with different strides of 2 and 4, respectively.

Decoder
The decoder stage also consists of five block layers. Opposite to the encoder, at each block layer, deconvolution (a.k.a. transposed convolution) operations are sequentially applied to recover the image's structural content details. Bottleneck residual blocks ResConv_k^de, k ∈ {1, 2, 3, 4, 5}, are employed to learn decoded information for feature reconstruction. We denote the decoded output of the k-th residual block as U_k = ResConv_k^de(D_k), k ∈ {1, 2, 3, 4, 5}, and the decoded haze-free image as J = Conv_3×3(U_1), where D_k represents the input signal of the k-th residual block layer.
In order to maximise information flow along multi-level layers and guarantee better convergence, skip connections are employed between corresponding block layers from encoder to decoder. However, since same-scale skip-connections between encoder and decoder cannot sufficiently grasp multi-scale information among the decoder's block layers, we propose to perform multiple inter-scale dense skip connections for full feature reuse. Specifically, we formulate the signal connections as Equation (10):

D_k = UpSample2(U_{k+1}) + UpSample4(U_{k+2}) + E_k + DownSample4(E_{k−2}),   (10)

where U_k = 0 if k > 5 and E_k = 0 if k < 1. The functions UpSample2 and UpSample4 de-convolve their respective feature maps to two times and four times of the original spatial resolution. The deconvolution is performed by kernels of the same size 3 × 3, but with different strides of 2 and 4, respectively.
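At shape level, the dense decoder fusion can be sketched as follows. Average pooling and nearest-neighbour upsampling are used here as stand-ins for the strided convolutions and deconvolutions of the actual network, and the additive fusion of U_{k+1}, U_{k+2}, E_k and E_{k−2} is our simplified reading of Equation (10); the point of the sketch is only that every auxiliary feature map is rescaled to the resolution of scale k before being combined.

```python
import numpy as np

def down2(x):
    """Stand-in for DownSample2: 2x2 average pooling (the network
    uses a strided 3x3 convolution instead)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up2(x):
    """Stand-in for UpSample2: nearest-neighbour upsampling (the
    network uses a 3x3 deconvolution with stride 2)."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def down4(x): return down2(down2(x))
def up4(x): return up2(up2(x))

# Encoder features E1..E5 at resolutions 64, 32, 16, 8, 4 (single channel)
E = {k: np.ones((64 >> (k - 1), 64 >> (k - 1))) for k in range(1, 6)}

# The decoder input at scale k fuses U_{k+1}, U_{k+2} (decoder side)
# and E_k, E_{k-2} (encoder side), all rescaled to the resolution of k
U = {6: None, 7: None}
for k in range(5, 0, -1):
    D_k = E[k].copy()                      # same-scale encoder feature
    if U.get(k + 1) is not None:
        D_k += up2(U[k + 1])               # decoder feature, one scale down
    if U.get(k + 2) is not None:
        D_k += up4(U[k + 2])               # decoder feature, two scales down
    if k - 2 >= 1:
        D_k += down4(E[k - 2])             # encoder feature, two scales up
    U[k] = D_k   # the real network would apply ResConv_k^de here
# U[1] is back at the resolution of scale 1
```

Every addition above only type-checks because each operand has been brought to the spatial resolution of scale k first, which is exactly the role of the UpSample2/UpSample4/DownSample4 operations in the text.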
It is not difficult to find that, at different scale layers, different information is emphasised for feature reuse. For example, as the network goes deeper, at small-scale layers k ≥ 3, since encoded feature maps highlight prominent high-level semantic regions with fewer fine textures, we dedicatedly perform skip-connections from large-scale layers, which contain enough low-level information for decoding. At large-scale feature layers k < 3, extra semantic information is skip-connected from high-level decoded layers to guide the construction of contents with semantic details. As a compromise between computational efficiency and feature scale, we select feature maps at a distance of two scales as the auxiliary information.
The proposed encoder-decoder structure with multiple inter-scale dense skip-connections progressively performs feature reuse and fusion among spatial pyramid mappings, which maximally preserves colour fidelity and structural details, especially the weak edges that are meaningful for object recognition, making it more suitable for the image dehazing task.

EXPERIMENTS
In order to demonstrate the effectiveness of our proposed model, in this section, we conduct comprehensive experiments on two synthetic datasets and typical real-world images. All the experiments are conducted on an NVIDIA GTX Titan XP GPU. The PC configuration is an Intel Core i7-4790 CPU (3.60 GHz, 8M cache) with 16G memory. PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index) metrics are mainly adopted to evaluate dehazing performance. Related source code is available at https://github.com/Joyies/MSNet.

The ITS training set [28] contains 1399 clear indoor images collected from NYU2 [29] and Middlebury [30]. For each clear image, 10 synthetic hazy images are generated according to its corresponding depth-map. Specifically, given a clear image, a random atmospheric light A ∈ [0.7, 1.0] for each channel, and the corresponding depth image d(x), the function t(x) = e^(−β·d(x)) is first applied to synthesise the transmission map; then a hazy image is generated by using the physical model in Equation (1) with a randomly selected scattering coefficient β ∈ [0.6, 1.8]. Therefore, a total of 13990 hazy images are collected for training.
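PSNR, used throughout our evaluation, can be computed as below for images scaled to [0, 1]; SSIM is more involved and is in practice usually taken from a library such as scikit-image rather than re-implemented.

```python
import numpy as np

def psnr(ref, test, peak=1.0):
    """Peak Signal-to-Noise Ratio in dB for images in [0, peak]."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')          # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

a = np.zeros((8, 8))
b = np.full((8, 8), 0.5)             # constant error of 0.5, so MSE = 0.25
print(round(psnr(a, b), 4))          # -> 6.0206
```

Higher PSNR means the restored image is numerically closer to the haze-free ground-truth, which is why it is reported alongside the perceptual SSIM metric.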

Datasets
The OTS dataset is generated in the same way as ITS dataset, with 72135 matched hazy and clear image pairs.
The SOTS dataset contains 500 matched Indoor synthetic test images and 500 matched Outdoor synthetic images.

Training details
At the data preprocessing stage, we perform data augmentation for training. We first perform rotations and mirror flips on the images. The rotation angles are θ ∈ {0, π/2, π, 3π/2}. The mirror flips are {No Flip, Horizontal Flip, Vertical Flip}. As a result, we obtain 12 variants for each image. Then, we use a sliding window to extract image crops of 256 × 256 size. The stride is set to 128 pixels. In consequence, sufficient training patches are augmented. ADAM [31] is used as the optimiser. The training batch-size is 16. The initial learning rate is 0.0001, and is kept constant during training. The Mean Square Error (MSE) between the restored clear image and the haze-free ground-truth is taken as our objective loss.
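The augmentation described above (four rotations × three flip modes, then 256 × 256 crops with stride 128) can be sketched with numpy; note that 4 × 3 = 12 geometric variants are produced per image.

```python
import numpy as np

def augment(img):
    """Four rotations x three flip modes = 12 variants per image."""
    variants = []
    for k in range(4):                       # 0, 90, 180, 270 degrees
        r = np.rot90(img, k)
        variants += [r, np.fliplr(r), np.flipud(r)]
    return variants

def crops(img, size=256, stride=128):
    """Sliding-window patches of size x size with the given stride."""
    h, w = img.shape[:2]
    return [img[i:i + size, j:j + size]
            for i in range(0, h - size + 1, stride)
            for j in range(0, w - size + 1, stride)]

img = np.zeros((512, 512))
print(len(augment(img)))        # 12 variants per image
print(len(crops(img)))          # 3 x 3 = 9 patches from a 512x512 image
```

This simple geometric scheme multiplies the effective training set size without altering the haze statistics of the synthesised pairs.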

Ablation studies
We have conducted three groups of ablation studies to verify the influences of inter-scale skip-connections, number of channels, depth of network, respectively. The comparisons are recorded by test performance curves.

The effects of inter-scale skip-connections
We conduct the ablation experiment on model MSNet_5_256.
The experimental results are shown in Figure 4. In the figure, "MSNet-256" represents the basic network, while "MSNet-non" represents the network with all auxiliary inter-scale skip connections removed, concretely the red datapath and the green datapath in the network architecture shown in Figure 2.
The experimental results show that the model with multiple inter-scale skip-connections, "MSNet-256", achieves better performance. Its testing curve is also more stable and less prone to oscillation.

The effects of channel size
In order to verify the influence of channel size on the model's performance, we conduct an ablation experiment on the model with multiple inter-scale skip-connections. The depths of all compared networks are kept the same at 5 block layers. The experimental results are shown in Figure 5. "MSNet-256" means the network with the parameters of the case n = 1 in Table 1, and "MSNet-512" means the network with the case n = 2. Both models have the same network structure, except that "MSNet-512" has double the channel size at each block layer in comparison with "MSNet-256". From the experimental results, we can easily find that the overall performance of "MSNet-512" is obviously higher than that of "MSNet-256", and its testing curve is still in a consistent uptrend without overfitting. This means that, under an affordable computation burden, a network with larger channel size benefits performance.

The effects of model's depth
In order to verify the influence of the model's depth on performance improvements, we conduct the corresponding ablation experiments; the results are shown in Figure 6. In the figure, "5 layers" represents the basic model "MSNet_5_256", while "4 layers" represents the model with the 5th block layer removed from both the encoder and decoder stages. From Figure 6, we can see that the performance of "5 layers" is better than that of "4 layers", and the testing curve of "5 layers" is more stable as well. The ablation result demonstrates that a deeper model with more block layers and inter-scale skip-connections can learn more important structural and semantic information for image restoration.

Comparison with state-of-the-art methods
In order to further verify the effectiveness of our proposed network, we comprehensively perform comparisons with several state-of-the-art image dehazing methods.

Evaluation on synthetic dataset RESIDE
We evaluate our proposed method on SOTS test set. The comparison results are shown in Table 3.
From the experimental results, we can see that our method shows great superiority over the compared methods on the SOTS dataset in both the Indoor and Outdoor cases. In the Indoor case, compared with the runner-up, our method achieves a PSNR 4.25 higher and an SSIM 2.4 higher. In the Outdoor case, our method achieves a PSNR 5.21 higher and an SSIM 7.91 higher than the second-best method.
We further evaluate our proposed method on the HSTS test set. Our performance evaluation follows the same setup as [28]. The comparison results are shown in Table 4. From the experimental results, we can see that our method achieves great improvements on HSTS too.
In order to better illustrate the advantages of our dehazing model, we have further visualised the results in comparison with EPDN. Some typical examples from both the Indoor and Outdoor cases of RESIDE are shown in Figures 7 and 8. It is not difficult to find that our method removes haze more clearly with higher contrast and more details. As shown in Figure 7, the complex texture details of the image patch in the red box are restored more faithfully by our method than by EPDN. In Figure 8, the sunlight and the colour of the sky are more visually faithful. Haze in the street is also removed more thoroughly.

Evaluation on another synthetic dataset FRIDA2
FRIDA2 [33] comprises 330 synthetic images of 66 diverse road scenes. The viewpoint is close to that of the vehicle's driver. Each fog-free image is associated with four foggy images and a depth-map. A different kind of fog is added to each of the four associated images: uniform fog, heterogeneous fog, cloudy fog, and cloudy heterogeneous fog.
All compared models are directly tested on the FRIDA2 images using models pretrained on the ITS and OTS datasets. The test performances are shown in Table 5. However, though our model achieves the top performance, we are disappointed to find that neither prior-based methods like DCP nor learning-based methods, including our own, achieve satisfactory dehazing results on FRIDA2.
One of the main factors behind the unsatisfactory performance is that the image and fog types are highly different from our training data, because all images in FRIDA2 are synthetic scenes, not camera scenes. We realise that the most practical problem that almost all dehazing models face is the generalisation problem, since most current state-of-the-art methods, including our proposed method, are fully supervised with limited training data. However, this is another topic out of the scope of this paper; we mainly concentrate on solving the problem of insufficient inter-scale feature reuse and fusion in current dehazing networks. In future work, we will devote ourselves to addressing this limitation.

Evaluation on real-world images
To validate the accuracy improvements of our model, we compare our method on a collected real-world hazy dataset containing 37 real-world images. All compared models are first pretrained on the ITS and OTS datasets, and then tested directly on the real-world hazy images. Since real-world hazy images have no ground truths, in order to sufficiently prove the accuracy of the proposed method, we evaluate it with no-reference image quality assessment criteria that are publicly accepted in the PIRM-SR Challenge 2018. These criteria are the Naturalness Image Quality Evaluator (NIQE) [34], the Perception-based Image Quality Evaluator (PIQE) [35], Ma's score [36] and the Perceptual Index (PI) [37]. The PI is defined as in Equation (11):

PI = (1/2) ((10 − mean(Ma's score)) + mean(NIQE)). (11)
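Equation (11) can be sketched as a short function; the per-image scores below are hypothetical values for illustration only, not results from our experiments:

```python
import numpy as np

def perceptual_index(ma_scores, niqe_scores):
    """Perceptual Index (PI) as in Equation (11):
    PI = 0.5 * ((10 - mean(Ma's score)) + mean(NIQE)).
    Lower PI indicates better perceptual quality."""
    return 0.5 * ((10.0 - np.mean(ma_scores)) + np.mean(niqe_scores))

# Hypothetical per-image scores for three dehazed images
ma = [7.2, 6.8, 7.0]     # Ma's score: higher is better
niqe = [3.1, 3.5, 3.3]   # NIQE: lower is better
print(perceptual_index(ma, niqe))  # -> 3.15
```

Note that the two terms pull in opposite directions: a high Ma's score and a low NIQE both drive the PI down, which is why a lower PI is better.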
The performances on real-world images are compared in Table 6. From the comparisons, we can easily find that our proposed method consistently achieves the top performance, demonstrating that it delivers much better dehazing quality on real-world images. More typical results are shown in Figure 9. As demonstrated in Figure 9, the overall colour tone of images from the DCP is greenish, with severe colour distortion in sky regions. However, since the DCP imposes a very strong prior for hazy images without sky regions, its results on non-sky regions are relatively good. For the images in the first and third rows, our method removes the haze cleanly, while the other methods cannot do as well as ours. For the images in the second row, the difficult places are the colour of the clothes and the edge preservation of the white tent; none of the compared methods performs well in both places. For the images in the fourth row, the texture details of the clouds in the sky are perfectly recovered by our method, while the other dehazing results lose details to varying degrees. A similar advantage holds for the naturalness of the rosy clouds, as illustrated in the images of the fifth row. For the image in the sixth row, a woodland scene with almost no sky region, the DCP achieves the most complete haze removal, though our method also achieves a relatively good visual result. For the images in the seventh row, our method obtains results with higher definition and contrast and without colour distortion, such as the faithful colour tone of the building area and the sunlight. In contrast, results from the other methods seem more or less unnatural; in particular, the results from GFN and EPDN show severe colour infidelity. To summarise, our method overall achieves more satisfying visual effects, which are more in line with real situations than those of the compared state-of-the-art methods.

Running time and model parameter size
We have recorded the average running time (in seconds) and the platform of the different methods on images of 620 × 460 × 3 pixels. In addition, the parameter size of each compared model is estimated for reference. These comparisons are shown in Table 7.

CONCLUSION
In this paper, we have proposed an effective end-to-end single image dehazing network with multiple inter-scale dense skip-connections. The proposed model takes full account of the importance of feature reuse in image restoration tasks. Dense inter-scale skip-connections are employed to combine the complementary information contained in feature maps of different scales, and bottleneck residual blocks are used for feature learning at each scale. Comprehensive ablation studies demonstrate that our proposed structure enables a more stable training process and better dehazing performance. We have conducted comprehensive comparisons with several typical state-of-the-art methods on public dehazing datasets and real-world hazy images. The experiment results demonstrate that our proposed model achieves superior performance with great improvements on PSNR and SSIM. The visual comparisons consistently show that our haze-removal results have much higher definition and contrast without colour distortion, complying satisfactorily with real situations.