Multi-scale capsule generative adversarial network for snow removal

Snowflakes captured in photos can severely decrease visual quality and cause difficulties for vision analysis systems. Most noise removal frameworks are designed for de-raining or de-hazing, regarding rain or haze as translucent masks on clean images. However, snowflakes differ from them in size, shape, transparency and floating trajectory, which degrades the performance of de-raining or de-hazing models on snowy images. In this work, we propose an effective multi-scale generative adversarial network framework for single-image snow removal, which is built with a multi-scale structure to identify various scales of snowflakes and a capsule-based structure to fuse the features extracted from the multi-scale encoding branches, so that features at different scales can be summarised and learnt by a joint framework. The overall framework is supervised by a weighted joint loss with an iterative training procedure to keep training stable for the multi-branch structure. The experimental results demonstrate that our model outperforms the state-of-the-art comparisons.


| INTRODUCTION
Vision tasks in surveillance systems are vulnerable to weather conditions such as rain, hail, and snow, whose atmospheric particles may impede normal interpretation and can even lead to disastrous failures. An experiment with Cascade-DilatedNet [1] illustrates this in Figure 1. Due to the snowstorm, the Cascade-DilatedNet fails to label the contents correctly or segment the objects clearly.
To counter this effect, various rain and snow removal techniques have been proposed to obtain clear images [2][3][4][5][6][7][8]. In the early stages of this research, prior-based approaches dominated: the atmospheric particles are detected and removed by well-designed handcrafted features, such as edge orientations [7,9,10], shapes [11] and streak patterns [3]. However, these methods not only depend heavily on the researcher's experience, but also generalise poorly. With the development of deep learning, learning-based techniques [8,[12][13][14] have become the focus of research due to their greater efficiency and better generalisation capabilities.
Some researchers regard snowflakes as one special kind of noise and apply common de-noising models for de-snowing.
However, snowflakes have more complex characteristics, such as various morphological structures, irregular trajectories, diversified distribution and non-uniform transparencies, which make de-snowing more difficult than other noise removal tasks. It is necessary to develop specific de-snowing models taking into account the characteristics of the snowflakes.
One snowy image contains snowflakes of various sizes, which handcrafted feature-based models are too weak to learn accurately, showing quite poor generalisation abilities. As for deep-learning-based frameworks [15], where features are obtained by learning, performance improves with accurate learning of the snow features. However, existing approaches still have difficulty handling snowflakes of various sizes and shapes. Thus, we propose a multi-scale structure so that snowflakes of different sizes can be processed at different scales simultaneously for feature learning. Figure 2 shows the entire framework. The overall structure follows the idea of the image-conditioned generative adversarial network (image-cGAN), in which de-snowing is achieved by generating completely snow-free images from snowy images, which better restores the details of occluded parts. Furthermore, with the help of the capsule, a powerful structure proposed by Hinton et al. [16] that enables feature learning to be carried out in comparison with the global image content, our framework can detect and remove snowflakes of diverse shapes more accurately and efficiently.
Learning-based models, especially convolutional neural network based ones, have shown great power on particle removal tasks, but their performance depends heavily on the quality of the prepared training datasets. For the snow removal task, Liu et al. constructed a snow dataset named Snow100K [15] by adding synthesised snow masks onto clean images using Photoshop [17]. However, they did not examine the image content carefully when synthesising the snowy images, resulting in some inappropriate training samples: it is improper to add snowflakes to pictures where snow is meaningless, such as indoor or underwater images. Having these inappropriate samples in training may introduce unreasonable artefacts into the outputs when dealing with real-world snowy images. Thus, we synthesise another large-scale snowy-image dataset, SnowySet, from which meaningless samples are removed. A real-world snowy image dataset is collected for evaluation as well. All the details of SnowySet are presented in Section 4.2.
Our main contributions are summarised as follows:
• We propose a multi-scale image conditional generative adversarial network (GAN) framework for single-image de-snowing, within which three scaled branches are designed to effectively remove various snowflakes. The experimental results demonstrate that our framework outperforms the state-of-the-art methods.
• We are the first to apply capsule units to combine multi-scale branches in the encoder for image generation, which is proved to be effective for learning snow features.
• To improve the stability and efficiency of the training process, we propose a selective training method, where a pre-check mechanism is applied to the discriminative loss to avoid unstable loss signals.
• We construct a dataset, SnowySet, by adding snowflakes to snow-free images. The snowflakes vary greatly in shapes, sizes, transparencies and densities. To make the dataset more reliable and meaningful, indoor, underwater and other images that are inappropriate for being snowy have been removed.

FIGURE 1 Comparison of segmentation results for clean and snowy images. The segmentation results and the corresponding labels are predicted by [1].

FIGURE 2 The overall framework of our proposed multi-scale de-snowing model. The generator consists of three branches, at scales 512 × 512, 256 × 256 and 128 × 128. Three discriminators are placed to discriminate the synthesised three scales of images. The capsule structure is implemented in both the generator and the discriminators. 'Conv' and 'Deconv' represent the convolutional and deconvolutional layers.
The rest of this article is organised as follows: In Section 2, the related work on de-raining, de-snowing and other noise removal techniques is introduced. The details of our model and training methods are described in Section 3. The datasets and experiments are presented in Section 4, where the evaluation results are discussed as well. Finally, the conclusions are summarised in Section 5.

| RELATED WORK
Besides snow removal, this section also reviews rain removal and other related literature.

| Rain removal
There are two types of rain removal, video-based and image-based. The early research on rain removal was done on videos, benefiting from the temporal information between adjacent frames [18][19][20]. Image-based de-raining is harder, as some image content is covered by the rain completely. In recent years, researchers have moved their attention to image-based de-raining.
Before applying deep learning for rain removal, researchers tried to develop handcrafted features to identify rain streaks. Kang et al. [10] first proposed a single-image de-raining framework, in which rainy images were decomposed into several frequency components to separate the rain streaks from the image content. Chen et al. [9] removed the rain by extracting a hybrid feature set from the rainy images. Besides the decomposition methods, sparse coding approaches also attracted researchers' attention: Sun et al. [21] developed an incremental dictionary learning strategy, Luo et al. [7] proposed a discriminative approach, and Son et al. [11] designed a shrinkage-based algorithm to learn the sparse representation. However, the drawbacks of the sparse coding methods are obvious as well. They assume that the rain noise is 'streak'-like and has a similar gradient orientation over the whole image [15]. Because of these harsh assumptions, such methods cannot identify rain streaks of various directions and easily introduce strip-like artefacts. Other methods not based on deep learning have also been proposed. Kim et al. [22] designed an adaptive non-local means filter to separate the rain streaks. Chen et al. [23] proposed a generalised low-rank model to learn the rain feature. Gaussian mixture models were implemented by Li et al. [3] as a layer-prior de-raining approach. The performance of these conventional generative models is limited and falls short of what deep-learning methods achieve.
Rain removal methods based on deep learning have greatly improved performance and attracted widespread attention. Yang et al. [12] proposed a multi-branch framework to detect and remove rain streaks in rainy images. Fu et al. [8] constructed detail layers to separate the rain streaks from the background. To remove rain streaks of different densities, researchers tried to build complex structures with sub-networks and recurrent modules: Li et al. [14,24] built a recurrent structure to remove rain streaks in multiple stages, and Zhang et al. [25] proposed a density-aware multi-scale framework to identify rain streaks with extra density information. More recently, researchers found that image enhancement helps remove the artefacts introduced by de-raining models, so enhancing layers or blocks are commonly implemented in recent works [25,26].

| Snow removal
Compared with de-raining, de-snowing has not attracted as much attention. Some researchers applied de-raining models to snow removal directly [4,5,27,28], ignoring the unique characteristics of snowflakes, such as their varied shapes and opacity. Nevertheless, these special features make snowflakes more complicated than raindrops. At an earlier stage, researchers believed that snow inherits certain rain characteristics and considered it a special form of rain. Hence, referencing de-raining models, de-snowing models were designed to remove snow particles with similar handcrafted features. Bossu et al. [20] separated the foreground and background using a Gaussian mixture model and constructed the snow features in the foreground with a histogram of orientations of snow streaks. Rajderkar et al. [29] proposed an image decomposition approach based on morphological component analysis, where the image is first decomposed into low- and high-frequency (LF/HF) parts by bilateral filters, after which dictionary learning and sparse coding methods identify the snow components. Xu et al. [30] modelled snow particles by colour assumptions and removed snow with a guidance image; as this approach may lose detailed information in local regions, they improved their framework with a refined guidance image in [31]. However, these prior-based methods can only model limited features and are weak in generalisation. The hyperparameters of the guided filter in [31] may suit low-transparency snowflakes but fail for opaque ones.
Due to the limitations of handcrafted feature-based models, researchers developed learning-based models to learn effective features from existing noisy data distributions. Image translation frameworks have been applied for de-snowing [8,15]. Now, more and more researchers are starting to notice the differences between snowflakes and rain streaks, and realise that these differences may significantly affect the removal results. Frameworks specifically designed for snow removal have begun to attract more attention [15,32].

| Other noise removal/inpainting works
The research of haze/cloud removal has a long history in image processing. He et al. [33] proposed the famous dark channel prior (DCP) method to remove haze from a single hazy image. Li et al. [34] proposed two multi-temporal dictionary learning algorithms to recover image content from large-scale clouds. Cai et al. [35] proposed DehazeNet, which first introduced a fundamental CNN structure for de-hazing. Later, researchers found that image feature learning benefits from multi-scale structures, and Ren et al. [36] built a multi-scale network to estimate the transmission map t. More recently, Li et al. [37] developed a non-negative matrix factorization and error correction method (S-NMF-EC) for cloud removal.
Indeed, image noise removal amounts to image inpainting, which aims to restore the missing or damaged regions of an image; snowflakes and other kinds of noise can be regarded as damaging factors. Bertalmio et al. [38] discussed the problem of image inpainting. Xu et al. [39] introduced an exemplar-based inpainting algorithm by investigating the sparsity of natural image patches and achieved quite good results. Guillemot et al. [40] presented a brief overview of research on image inpainting.
Though a great number of works have been proposed for noise removal, we focus on snow removal in this work. Since snowflakes differ greatly from other kinds of noise in terms of size, shape and density, we design a de-snowing framework specifically targeting the characteristics of snow.

| Overall framework
We build the multi-scale model based on the image-cGAN, which has been successfully applied to image translation [41,42]. The overall framework is shown in Figure 2. The generator (G) is constructed with multiple branches to tackle different sizes of snowflakes, since snowflakes vary greatly in size, shape, density, orientation and trajectory. The discriminating part consists of three separate discriminators (Ds), each targeting one scale of the generated image. Each branch of G takes a certain size of snowy image as its input, down-sampled from the original snowy image. After layers of convolutional filters in each branch, the encoded features are jointly connected to the decoder to synthesise the corresponding sizes of clean images.
Our idea of the multi-scale structure came from the feature pyramid network [43], which enhances feature learning by setting pyramid-like multi-scale layers of filters. Thus, we designed the multi-scale structure for both G and D to improve their ability to learn features of various sizes of snowflakes. We are not the first to apply multi-scale receptive fields to noise removal: [25,32] introduced a similar idea by implementing different sizes of convolutional filters to expand the receptive fields for de-raining, where the branch with larger convolutional filters is designed for the feature learning of bigger raindrops. Compared with their approaches, our method obtains better results and reduces the computational burden, since we down-scale the input images for the small-scale branches.
In addition, inspired by the research on capsules [42,44,45], we implement capsule units in both G and D to improve the feature representation ability of the overall framework. According to previous research, capsule units are able to learn part-to-whole relationships [45], and therefore enhance a model's ability to identify objects from a global view. In our model, the capsule units help to learn the 'hard samples' of snowflakes by checking them against the image background with global information. This implementation benefits both G and D and improves their generating and discriminating abilities.

| Multi-scale branches
The conventional generative adversarial network with a single network-based G is weak at tackling the de-snowing problem, because of the monotonicity of convolutional kernels and the variety of snowflakes. Our solution enlarges the receptive fields by applying multiple scales of filters, as shown in Figure 2.
Taking the model of [25] for reference, we designed a multi-scale structure with three scaled branches, at scales of 512 × 512, 256 × 256 and 128 × 128. The snowy images are resized into the three scales, with each scale processed by one branch of G. Each branch includes a PrimaryCaps layer to transform the layer maps into the capsule structure. All the PrimaryCaps capsules of the three branches are densely connected to an FC_Caps layer, which is responsible for merging features from the three branches. The capsule structure is described in Section 3.3.1. The FC_Caps layer is connected to three decoding branches, aiming to synthesise three scales of images: 512 × 512, 256 × 256 and 128 × 128. Each decoding branch consists of a DePrimaryCaps layer and several deconvolutional layers, with each deconvolutional layer enlarging the feature maps by a 2-stride filter. The 512 × 512 image is the de-snowing result, while the outputs of the 256 × 256 and 128 × 128 branches are used to help the overall training of the framework.
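As a minimal sketch of this input pyramid, the resizing of one snowy image into the three branch scales can be written as repeated 2 × 2 average pooling. The exact resizing method is not specified in this work, so average pooling is an assumption made for illustration only:

```python
import numpy as np

def downscale_half(img):
    """Halve height and width of an H x W x C image by 2x2 average pooling."""
    h, w, c = img.shape
    return img.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def build_input_pyramid(img_512):
    """Return the three input scales (512, 256, 128) fed to the three encoder branches."""
    img_256 = downscale_half(img_512)
    img_128 = downscale_half(img_256)
    return img_512, img_256, img_128
```

The same helper can also produce the down-scaled copies of the generated 512 × 512 output that are sent to the smaller-scale discriminators.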
To distinguish the three scales of synthesised images, we place three separate discriminators, D_512 (512 × 512), D_256 (256 × 256) and D_128 (128 × 128), each constructed from several convolutional layers and a two-branch discriminating part (named two-branch D in [42]). The responsibility of each scale's discriminator is to distinguish the synthesised de-snowed image from the snow-free image (ground truth) at the corresponding scale. The discriminating loss is back-propagated to G to improve G's generating ability at every scale. In particular, we down-sample the generated 512 × 512 images into 256 × 256 and 128 × 128 and send them into D_256 and D_128 for multi-scale discrimination. In this way, the decoding branch of 512 × 512 receives useful supervision signals from all scales of Ds.

| Capsule-based generative adversarial network
Generally, three kinds of capsule layers comprise the capsule implementation of the framework: the PrimaryCaps layer [45], the FC_Caps layer [42,45] and the DePrimaryCaps layer [42]. The PrimaryCaps and DePrimaryCaps layers are responsible for the transformation between conventional convolution-based layer maps and capsule layers. The FC_Caps layer learns and stores features in capsule units. The connections between two capsule layers are weighted by dynamic routing [45], which finds the optimal way for capsule information to flow from the previous layer to the next. In our multi-branch model, the FC_Caps layer within G is responsible for merging the features learnt by the three scale encoding branches and passing them to the different-scale decoders.
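For concreteness, the routing-by-agreement procedure of [45] that weights the connections between two capsule layers can be sketched in plain NumPy as below. The shapes (`num_in`, `num_out`, `dim`) are illustrative placeholders, not the exact capsule counts of our layers:

```python
import numpy as np

def squash(v, axis=-1, eps=1e-8):
    """Capsule non-linearity: shrink vector length into [0, 1), keep orientation."""
    n2 = np.sum(v * v, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * v / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, iterations=3):
    """Routing-by-agreement [45].
    u_hat: (num_in, num_out, dim) prediction vectors from the lower capsule layer.
    Returns (num_out, dim) output capsules."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))  # routing logits
    for _ in range(iterations):
        # coupling coefficients: softmax over the output capsules
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        s = (c[..., None] * u_hat).sum(axis=0)  # weighted sum per output capsule
        v = squash(s)                           # (num_out, dim)
        b = b + (u_hat * v[None]).sum(axis=-1)  # strengthen agreeing connections
    return v
```

The agreement term in the last line is the similarity check mentioned above: connections whose predictions align with the output capsule are strengthened, the rest are weakened.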

| Capsule implemented generator
As shown in Figure 2, the capsule-based multi-scale branch generator is constructed from conventional convolutional/deconvolutional layers and capsule-based layers. On each scale branch of G, the input image is processed by layers of 2-stride convolutional filters, with each layer reducing the feature map size by half. The PrimaryCaps layer applies M × N conventional convolutional filters to output M feature maps with a capsule dimension of N (the length of each capsule vector is N). Three PrimaryCaps layers are placed after the three scale convolutional encoders to transform the scalar neurons into capsules with the same dimension N. Then, all capsules from these three PrimaryCaps layers are concatenated and fully connected to the FC_Caps layer, a set of capsules, via dynamic routing [45]. The dynamic routing calculates the similarity of each pair of capsules to determine whether their connection should be strengthened or weakened. The DePrimaryCaps layer is a capsule layer with the same size as the PrimaryCaps layer, fully connected to the previous FC_Caps layer. The layer maps are transformed by conventional convolutional layers after the DePrimaryCaps layer [42]. Three DePrimaryCaps layers are placed to transform the FC_Caps layer into three decoding branches. After layers of 2-stride deconvolutional filters, de-snowed images are generated. In addition to the capsule layers, skip connections are placed between the corresponding convolutional and deconvolutional layers [41].
As shown in Figure 2, of the three scales of outputs generated by the generator, the largest one (512 × 512) is the expected de-snowed result. The two smaller scales of outputs (256 × 256 and 128 × 128) are used in the calculation of the joint loss and provide training signals to the corresponding branches. The generator is trained branch by branch according to the optimization functions defined in Sections 3.3.2 and 3.4. The overall training procedure is presented in Table 1.

| Capsule implemented discriminator
Three scales of discriminators with the same structure are placed to distinguish the three scales of synthesised images. The images are processed by several convolutional layers and then discriminated by two parts, the capsule part [42] and the conventional patchGAN part [41]. The capsule part consists of a PrimaryCaps layer and an FC_Caps layer followed by two additional capsules as the output. The patchGAN part has three conventional convolutional layers, with the last one producing a single-channel discriminating map (every 'pixel' of it represents a distinguishing result). The two parts focus on discriminating the images from the global and the local view, respectively. A balancing weight is placed to summarise the losses from the two discriminating parts, formulated as

L_disc = L_caps + λ1 L_patchG, (1)

where λ1 is a balancing factor for the losses of the two branches in the discriminator. L_caps and L_patchG are defined as

L_caps = E_{y∼P_clean(y)}[L_M(D_caps(y), 1)] + E_{x∼P_snowy(x)}[L_M(D_caps(G(x)), 0)], (2)

L_patchG = −E_{y∼P_clean(y)}[log D_patchG(y)] − E_{x∼P_snowy(x)}[log(1 − D_patchG(G(x)))], (3)

where P_snowy(x) and P_clean(y) represent the distributions of snowy and clean images in the training set, respectively; G(⋅), D_caps(⋅) and D_patchG(⋅) indicate the forward calculation of the generator, the capsule branch of D, and the patchGAN branch of D, while L_M is the margin loss employed from [45]. We place three discriminators in the overall framework to distinguish the three scales of images generated by the three branches of G. Thus, the L_disc terms for the three discriminators are written as L_disc_512, L_disc_256 and L_disc_128, formulated as Equations (4), (5) and (6):

L_disc_s = L_caps_s + λ1 L_patchG_s,  s ∈ {512, 256, 128}, (4)–(6)
where G_*(⋅) and D_*(⋅) denote the forward calculation of the corresponding branch. The loss from D used to optimise G is written in Equation (7), with training signals from both discriminating parts:

L_genGAN = E_{x∼P_snowy(x)}[L_M(D_caps(G(x)), 1) − λ1 log D_patchG(G(x))]. (7)
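A hedged sketch of how the two discriminating signals could be combined in code is given below. The margin-loss constants (m+ = 0.9, m− = 0.1, λ = 0.5) follow the defaults of [45], and reducing the capsule output to a single length score is a simplification for illustration, not our exact implementation:

```python
import numpy as np

def margin_loss(v_len, target, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss L_M from Sabour et al. [45] on capsule lengths."""
    pos = target * np.maximum(0.0, m_pos - v_len) ** 2
    neg = lam * (1.0 - target) * np.maximum(0.0, v_len - m_neg) ** 2
    return np.mean(pos + neg)

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy on the patchGAN discriminating map."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def disc_loss(caps_len_real, caps_len_fake, patch_real, patch_fake, lam1=0.1):
    """Combine the capsule branch and the patchGAN branch: L_caps + lambda_1 * L_patchG."""
    l_caps = margin_loss(caps_len_real, 1.0) + margin_loss(caps_len_fake, 0.0)
    l_patch = (bce(patch_real, np.ones_like(patch_real))
               + bce(patch_fake, np.zeros_like(patch_fake)))
    return l_caps + lam1 * l_patch
```

A well-trained discriminator (long capsules and high patch scores on real images, the opposite on generated ones) drives this combined loss towards zero.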
As mentioned in Section 3.2, to further enhance synthesis performance, the generated 512 × 512 images are down-sampled into 256 × 256 and 128 × 128 and sent to D_256 and D_128 for down-scaled discrimination. A large snowflake that is hard to identify at D_512 can be discriminated more easily by D_256 and D_128 after down-scaling.
where Down * (⋅) represents the down-sampling process into size * .

| Overall optimization functions
In the training phase, the generator is optimised by a joint loss of pixel loss and Structural SIMilarity index (SSIM) loss [46], along with the supervision signals from multiple discriminators. The pixel loss reduces the overall difference between the synthesised image and the ground truth by calculating the L1-norm distance. The SSIM loss minimises the structural error of the generated outputs with respect to the ground truth. Since SSIM is closer to human visual perception than other evaluation metrics [46], it is commonly used to compare the similarity of two images in experiments; intuitively, SSIM should therefore be an effective loss function for enhancing the visual quality of the synthesised images. The idea of an SSIM loss has also been introduced recently [14,47]. Different losses contribute to the final loss with a series of balancing weights. The supervision signal for the Gs comes from the combination of multiple losses, shown in Equation (11):

L_G = λ2 L_pixel + λ3 L_SSIM + λ4 L_genGAN. (11)
Here, L_pixel is the pixel loss, which is the L1 distance, and L_SSIM is the SSIM loss calculated as 1 − SSIM:

L_pixel = L1(G(x), y), (12)

L_SSIM = 1 − SSIM(G(x), y), (13)

where L1 is the L1 distance and SSIM(⋅) denotes the calculation of the SSIM score for the two images. The three branches within G are optimised in terms of three overall loss functions, shown in Equations (14), (15) and (16):

L_G_s = λ2 L_pixel_s + λ3 L_SSIM_s + λ4 L_genGAN_s,  s ∈ {512, 256, 128}. (14)–(16)

The parameters of the FC_Caps layer within G are updated along with G_512. When one branch is being optimised, the parameters of the other branches remain unchanged.
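The weighted joint loss can be sketched as follows. The single-window SSIM here is a simplification of the windowed SSIM of [46], used only to illustrate how the terms combine; the weights default to the values reported in our experiment settings:

```python
import numpy as np

def ssim_global(x, y, L=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM over the whole image (simplified; no sliding window)."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def generator_loss(fake, real, l_gen_gan, lam2=10.0, lam3=5.0, lam4=0.1):
    """Weighted joint loss: L1 pixel term + (1 - SSIM) term + adversarial term."""
    l_pixel = np.abs(fake - real).mean()    # L1 distance to the ground truth
    l_ssim = 1.0 - ssim_global(fake, real)  # SSIM loss = 1 - SSIM
    return lam2 * l_pixel + lam3 * l_ssim + lam4 * l_gen_gan
```

When the generated image equals the ground truth, both the pixel and SSIM terms vanish and only the adversarial term remains.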

| Selective training
A typical way to train a GAN is to iteratively update the gradients of the weights of G and D. Obviously, neither G nor D is trained well during the early iterations. Indeed, the discriminator loss fluctuates over the whole training procedure: when G generates 'better' images after several optimization steps, the loss of D increases, and at this moment D needs some 'time' (more training iterations) to optimise itself and learn to distinguish the generated 'better' images from the ground truth.
The discriminating signal passed backward to G is not always effective, especially when D is not trained well and has a larger discriminating loss. This gives G a wrong signal, which leads G to learn improper feature representations and produce low-quality images. After training trials, we found that the loss fluctuation of D can sometimes cause training to fail by encouraging G to produce black or meaningless images. To reduce the side effects of these loss fluctuations, one solution is to use extra training signals (L1 loss) to guide the learning of G, as applied in the most recent research [41,42]. Another solution is to set a larger learning rate for D to speed up its optimization, thus shortening the ineffective learning time of G. Both methods help the training but do not solve the problem completely; in addition, the second method has an obvious drawback: it destabilises D and hinders D from reaching the optimum. In the present paper, besides the extra training signals (the L1 and SSIM losses described in Section 3.4), we propose a method, named selective training, that performs a pre-check on the loss of D. The training signal from D is applied to G only when D is verified to have a low loss.
The judgement criterion checks whether the discriminative loss, L_disc, is below a reasonable value. We set an empirical value ξ_dis as the threshold, and L_genGAN is back-propagated only when L_disc is lower than ξ_dis; otherwise, the gradients calculated from L_genGAN are not applied. Therefore, Equations (14), (15) and (16) are rewritten in the form of Equation (17):

L_G_s = λ2 L_pixel_s + λ3 L_SSIM_s + 1[L_disc_s < ξ_dis] · λ4 L_genGAN_s, (17)

where 1[⋅] is the indicator function.
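The pre-check amounts to a simple gate on the adversarial weight, as sketched below. `xi_dis` stands for the empirical threshold ξ_dis, and the default weights follow the experiment settings; this is an illustration of the rule, not our exact training code:

```python
def selective_gan_weight(l_disc, xi_dis, lam4=0.1):
    """Pre-check on the discriminator loss: the adversarial term is back-propagated
    to G only when D is trained well enough (L_disc below the threshold xi_dis)."""
    return lam4 if l_disc < xi_dis else 0.0

def total_generator_loss(l_pixel, l_ssim, l_gen_gan, l_disc, xi_dis,
                         lam2=10.0, lam3=5.0, lam4=0.1):
    """Gated joint objective: the pixel and SSIM terms always apply,
    the adversarial term only when D passes the pre-check."""
    w = selective_gan_weight(l_disc, xi_dis, lam4)
    return lam2 * l_pixel + lam3 * l_ssim + w * l_gen_gan
```

When D's loss spikes, the gate zeroes the unreliable adversarial signal, so G keeps learning from the stable L1 and SSIM terms in the meantime.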

| Experiment setting
The proposed model (MC-DS) is compared with the auto-encoder (AE) [48], Pix2pix [41], CycleGAN [49], DID-MDN [25], DesnowNet [15], LSTM-GAN [50] and ComposGAN [51] on two datasets, Snow100K [15] and the constructed SnowySet. Due to the lack of source code for DesnowNet, LSTM-GAN and ComposGAN, we re-implement their frameworks in TensorFlow with their recommended hyper-parameters. The experiments are conducted on both synthetic and real-world snowy images. Peak Signal-to-Noise Ratio (PSNR) and SSIM [46] are used for quantitative evaluation; the normalised root mean square error (NRMSE) and the absolute error (L1-norm distance) are also listed. Visual de-snowing samples from both the synthetic and the real-world datasets are presented. To check the effectiveness of the multi-scale structure, an ablation study is conducted with a framework built on a single-scale generator.
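For reference, the PSNR metric for images scaled to [0, 1] can be computed as below. This is the standard definition, not code from our evaluation pipeline:

```python
import numpy as np

def psnr(x, y, max_val=1.0):
    """Peak Signal-to-Noise Ratio between a restored image x and ground truth y."""
    mse = np.mean((x - y) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Higher is better: a uniform error of 0.1 on a [0, 1] image gives exactly 20 dB, while a perfect reconstruction is unbounded.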

| Snow100K
Snow100K [15] contains 100,000 synthesised snowy images, 50,000 for training and 50,000 for testing. The snow masks used for synthesising the snowy images are also provided. According to the density of snowflakes, the testing set is separated into three subsets, Snow100K-S, Snow100K-M and Snow100K-L, representing Small, Medium and Large, respectively. Besides the synthesised images, [15] also provides 1329 realistic snowy images downloaded via the Flickr API.

| SnowySet
To increase the diversity of snowflakes, we propose another dataset, SnowySet, which is synthesised with snowflakes of great variation in shape, size, density, transparency and floating trajectory. The clean images are selected from BSDS500 [52], UDIC.v2 and Snow100K [15] by removing images that are meaningless for snowy weather (e.g. indoor, underwater, close-view or water-sports images), with 4236 for training and 1040 for testing. Snow masks synthesised in Photoshop [17] are added onto the clean images to synthesise the snowy images, resulting in 42,360 training samples and 10,400 testing samples. Similar to Snow100K and inspired by [53], we prepare the dataset with three density levels, SnowySet-L (Large), SnowySet-M (Medium) and SnowySet-S (Small), to simulate the different densities of snowflakes in reality. One hundred real-world snowy images with different image contents and various snowflakes are collected from the Internet for testing. Compared with the realistic snowy images provided by Snow100K, those in SnowySet contain snowflakes of more variation. The datasets and the synthesising instructions will be made publicly available.

| Implementation details
The images are resized into 584 × 584 before being randomly cropped into 512 × 512 patches for training. The testing images are resized into 512 × 512 before being forwarded through the well-trained generator.
The kernel sizes and strides of all convolutional/deconvolutional layers are fixed to 5 × 5 and 2, respectively. The height and width of the feature maps are halved after each convolutional layer. The number of convolutional layers in each branch varies according to the input scale, so that 16 × 16 feature maps are obtained before being connected to the PrimaryCaps layer. The filter number (channel count) of each convolutional layer doubles that of the previous layer, starting from 32 channels in the first layer. The deconvolutional layers of the decoder branches of G have the same settings as the corresponding encoder branches. Following [42,45], we design the capsule structure with a dimension of eight for all capsules and set the capsule numbers in the PrimaryCaps, FC_Caps and DePrimaryCaps layers of G to 16, 48 and 16, respectively.
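The encoder schedule implied by these settings (halve the map and double the channels until 16 × 16) can be tabulated with a short helper; the function name and output format are ours, for illustration only:

```python
def encoder_plan(input_size, target_size=16, first_channels=32):
    """Per-branch encoder plan: each 2-stride 5x5 conv halves the map size,
    with the channel count doubling each layer from first_channels.
    Returns a list of (feature_map_size, channels) per layer."""
    plan = []
    size, ch = input_size, first_channels
    while size > target_size:
        size //= 2
        plan.append((size, ch))
        ch *= 2
    return plan
```

For the 512 × 512 branch this yields five layers ending at 16 × 16 with 512 channels, while the 128 × 128 branch needs only three, which is how the branches reach a common PrimaryCaps input size.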
The first several convolutional layers of the Ds are similar to the encoder branches of G, outputting feature maps of size 32 × 32 before they are sent into the two-sub-branch discriminating part. The PrimaryCaps and FC_Caps layers of the Ds contain 32 and 16 capsules, appended by another two capsules as the output. The patchGAN branch of the Ds consists of three convolutional layers, with the last one outputting a 9 × 9 single-channel feature map.
The training is conducted with a batch size of eight for 100 epochs, where the learning rate is decayed by a factor of 0.1 every 20 epochs from an initial value of 0.0002. The loss balancing weights are λ1 = 0.1, λ2 = 10, λ3 = 5 and λ4 = 0.1. The experiments are executed in TensorFlow on an NVIDIA Tesla V100 GPU (32 GB).
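Read as a step schedule, the decay rule can be sketched as follows (our interpretation of "decayed by 0.1 per 20 epochs"; the function name is ours):

```python
def learning_rate(epoch, base_lr=2e-4, decay=0.1, step=20):
    """Step decay: multiply the base learning rate by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)
```

Over the 100 training epochs this gives five plateaus, from 2e-4 down to 2e-8.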

| Quantitative results
The quantitative results of the two synthetic datasets are shown in Tables 2 and 3. The subsets are evaluated separately. The overall results are calculated on the whole datasets, which are equivalent to the average of the subsets. The boldface values in the tables indicate the best results.
Overall, the proposed MC-DS performs best and ComposGAN [51] ranks second. DID-MDN [25] and DesnowNet [15] produce comparable results, which are still better than LSTM-GAN [50]. The differences are more obvious on the two heavy-snow subsets, Snow100K-L and SnowySet-L. The performance of AE [48] and CycleGAN [49] does not reach the state of the art, due to the simplicity of the AE structure and the lack of paired information for CycleGAN [49]. Pix2pix [41] gives quite comparable results on the Medium and Small snow subsets, which shows that a U-Net generator with a patchGAN discriminator [41] is capable of handling the lightweight part of de-snowing.
In Table 2, DID-MDN [25] produces quite good SSIM scores on Snow100K-M and Snow100K-S, benefiting from its refinement layers that improve image quality. However, it degrades on the heavy-snow images of Snow100K-L, scoring worse than DesnowNet [15] and ComposGAN [51]. ComposGAN [51] gives the best PSNR of 29.54 on Snow100K-L, but its SSIM is still lower than ours.
In Table 3, DesnowNet [15] and DID-MDN [25] give results comparable to ComposGAN [51], but all perform worse than ours. SnowySet contains snowflakes with more variations, which makes de-snowing more difficult. DID-MDN [25], with its density estimation, is able to learn features of multi-density snowflakes, but our MC-DS with multi-scale branches shows a stronger ability to learn the variations of snowflakes. On SnowySet-L, LSTM-GAN [50] gives a rather low SSIM of 0.8641, indicating that the LSTM structure may not help much in learning heavy-snow features. The patchGAN-based discriminator encourages Pix2pix [41] to learn more local details and texture features, resulting in a reasonable L1 score on SnowySet-L.
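For reference, the PSNR and L1 metrics used in Tables 2 and 3 follow their standard definitions; the sketch below is an illustrative pure-Python version (the paper's exact evaluation code is not shown, and the images are flattened to pixel lists for simplicity).

```python
import math

def psnr(clean, restored, max_val=255.0):
    """Peak signal-to-noise ratio (dB) between two equal-length pixel lists."""
    mse = sum((c - r) ** 2 for c, r in zip(clean, restored)) / len(clean)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)

def l1_error(clean, restored):
    """Mean absolute error between two equal-length pixel lists."""
    return sum(abs(c - r) for c, r in zip(clean, restored)) / len(clean)

# Toy example: a constant offset of 10 grey levels gives MSE = 100.
clean = [100] * 64
restored = [110] * 64
print(round(psnr(clean, restored), 2))  # 28.13
print(l1_error(clean, restored))        # 10.0
```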

| Visual comparison results
We visually inspect the de-snowing results in Figures 3–6, selecting samples with diverse image contents from both the synthetic and the real-world datasets.

4.5.1 | Synthetic images

Figures 3 and 5 show the de-snowing results on synthetic images from the testing sets of SnowySet and Snow100K [15], respectively. Figure 4 shows enlarged details of the images in Figure 3.
CycleGAN [49] and Pix2pix [41] fail to remove the snowflakes, and CycleGAN even shifts the global colour towards 'Green'. The fourth columns of Figures 3 and 5 present better de-snowing results, but some local details are eliminated as well. The images are over-smoothed by DID-MDN [25], which is quite obvious in the first image of Figure 4; in comparison, ComposGAN [51] and MC-DS restore more details of the 'grass'. The same phenomenon can be seen in the fourth image of Figure 3, and in the first and second images of Figure 5. Compared with DesnowNet [15] and LSTM-GAN [50], the proposed MC-DS removes the snowflakes more completely, which can be verified by checking the 'sky' regions of the third sample in Figure 4. ComposGAN's result is close to ours, but some snowflakes remain; the 'sky' regions of the first sample in Figure 5 support the same conclusion. Snow marks are also left unremoved on the 'bear face' and 'human face' in the second and fourth images of Figures 3 and 4.
Overall, the proposed MC-DS gives the best de-snowing results, closest to the ground truth, by successfully removing snowflakes from various kinds of background and recovering most details of the image content.

4.5.2 | Real-world images

Figure 6 shows some de-snowing results on real-world snowy images. As on the synthetic images, CycleGAN [49] produces blurry images with 'Green' effects, and Pix2pix [41] apparently leaves some snowflakes undetected. Compared with DID-MDN [25], DesnowNet [15], LSTM-GAN [50] and ComposGAN [51] on the first and second images of Figure 6, the proposed MC-DS produces better images by removing more snowflakes and generating a smoother 'sky'. The bottom-left corner of the third sample shows 'two people' getting into the 'black car', who are rendered blurry by DID-MDN [25], DesnowNet [15] and LSTM-GAN [50]; ComposGAN's result is even worse. On the fourth sample, DID-MDN [25] mistakes the 'yellow tie' on the person in black clothes for a snowflake and removes it entirely from the output, whereas our MC-DS recovers it much better. These samples demonstrate the superiority of MC-DS in distinguishing image contents from various snowflakes.

| Ablation study
To check the effectiveness of the multi-scale structure, we conduct an ablation study with a single-scale structure by removing the branches of scale_128 and scale_256, marked as Single_scale.
The proposed MC-DS is marked as Multi_scale here. Since the heavy-snow images contain snowflakes with more variations, we conduct the ablation study on Snow100K-L and SnowySet-L. The results are shown in Table 4. Without the multi-scale structure, the performance of the single-scale model decreases on all evaluation metrics for both subsets: it faces difficulties in processing diverse sizes of snowflakes across different image contents. However, Single_scale still obtains results comparable to DID-MDN [25] and DesnowNet [15] in Tables 2 and 3, benefiting from the SSIM loss and the capsule-based structure.

FIGURE 3 The de-snowing results of different frameworks on testing images from SnowySet. Our model produces the clearest 'sky' and recovers more details, such as the 'plant' and the 'face'. Larger patches are shown in Figure 4. Best viewed on screen. GAN, generative adversarial network.

FIGURE 4 The detail comparison of different frameworks on testing images from SnowySet. Best viewed on screen. GAN, generative adversarial network.

| CONCLUSION
In this work, we build a multi-scale image-cGAN to remove snowflakes from snowy images. Compared with existing de-snowing models, the proposed MC-DS outperforms the state of the art.
The experiments demonstrate the effectiveness of the multi-scale structure in removing various sizes of snowflakes. We implement the capsule layers to fuse the features of different branches and learn the part-to-whole relationship between local regions and the global image content.

FIGURE 5 The de-snowing results of different frameworks on testing images from Snow100K. Our model recovers the best image details, as seen in the 'tree' of the second sample and the 'building' of the third sample. Best viewed on screen. GAN, generative adversarial network.

FIGURE 6 The de-snowing of different frameworks on real-world images. The first two samples show that our model removes the snowflakes best. The 'car' of the third sample and the 'human with bag' of the fourth sample show that our model recovers more image detail. Best viewed on screen. GAN, generative adversarial network.
Although MC-DS performs well in the experiments, there is more to explore and discuss. Firstly, better methods to connect the capsule block with convolutional/deconvolutional layers could be explored by designing new structures for the PrimaryCaps and DePrimaryCaps layers [42,45]; in this work, this connection is achieved by simply reshaping the neurons into different dimensions. Secondly, more effective routing algorithms might be developed for the feature fusion of the three branches if better capsule structures are constructed: DynamicRouting [45] and EMRouting [16] are used for the routing connections between adjacent capsule layers, which might not perform best for feature fusion. Thirdly, more advanced discriminators could be developed by sharing the features learnt from low-level layers, since in this work we place three separate discriminators to distinguish different scales of images; it should be noted, however, that the training procedure becomes more complicated for feature-sharing discriminators than for separate ones. Fourthly, advanced structures, such as ResNet or DenseNet, could be designed with capsule-structure implementations.