Attentive generative adversarial network for removing thin cloud from a single remote sensing image

Correspondence: Rong Chen, The College of Information Science and Technology, Dalian Maritime University, No. 1, Linghai Road, Dalian 116026, China. Email: rchen@dlmu.edu.cn

Abstract: Land-surface observation is easily affected by the light transmission and scattering of semi-transparent high or low clouds, resulting in blurring and reduced contrast of ground objects. To improve the visual appearance of remote sensing images, the authors present a deep learning method for thin cloud removal using a new attentive generative adversarial network, without prior knowledge or assumptions, which copes with thin clouds that are unevenly distributed across different images and learns an attention map with weighted information about spatial features. Such a spatial attention model can endow each pixel with global spatial context information. Consequently, the generative network focuses on the thin cloud regions to produce better local image restoration, and the discriminative network can evaluate the local consistency of the repaired regions. The experimental results show that this method is superior to state-of-the-art methods in recovering detailed texture information.

translate cloudy images to cloudless images. The deep learning approach leverages automatically learned features as opposed to hand-crafted ones. However, image-to-image translation requires a paired dataset. Due to a lack of suitable paired datasets, previous attempts to apply deep learning to the removal of cloud occlusions [10] have relied on synthetically generated image pairs, in which simulated clouds are artificially added to cloudless images [13].
We propose an attentive generative adversarial network to remove thin cloud from a single remote sensing image. Our generative network contains two parts: a spatial attention network and a contextual autoencoder. The spatial attention network attempts to discover cloud regions and their local structures in the input image. The contextual autoencoder attempts to generate a cloudless image using the concatenation of the cloud image and the attention map from the spatial attention network. To ensure that our output appears as a clean image, our discriminative network evaluates the generated cloudless candidates.
As noted above, previous deep learning attempts rely on synthetically generated image pairs. This motivates us to apply an attention mechanism to highlight the cloudy features. Existing methods that are closest to ours assume that each region of the image receives the same attention, but in reality the cloud is unevenly distributed: the area covered by thin cloud is visually different from the cloud-free area. Long-range contextual information is therefore beneficial to thin cloud removal. However, existing spatial attention mechanisms are not flexible and adaptive in aggregating long-range contextual information. To this end, we propose a new spatial attention model that can endow each pixel with global spatial context information, which results in better performance on datasets available in the public domain.
The contributions of this study are as follows: 1) We propose a generative adversarial network for image restoration. An attention map is fed into both the generative and discriminative networks, which enables the generative network to focus better on the structural information of the cloud region and its surrounding regions, and enables the discriminative network to assess the local consistency of the restored region. 2) Extensive experiments show that the proposed method removes thin clouds more effectively, in terms of both evaluation metrics and visual appearance, than the commonly used image enhancement and haze removal methods.
The remainder of this paper is organized as follows. Section 2 presents relevant work on remote sensing thin cloud removal methods. In Sections 3 and 4, the generative and discriminative networks of the generative adversarial network are introduced, respectively. The results of thin cloud removal for state-of-the-art methods and the proposed method are presented in Section 5. Finally, the conclusions of the study are summarized in Section 6.

RELATED WORK
Cloud removal approaches are divided into two categories: model driven and data driven. There are four main model-driven approaches: the HF method, the HOT method, the physical model method, and the retinex algorithm.

HF method
The HF method proposed by Mitchell et al. [14] is a popular thin cloud removal method. In this work, the authors argue that cloud noise is mainly distributed in the low-frequency domain of an image. Thus, thin cloud or fog removal can be conducted using a suitable filter to reduce or eliminate the low-frequency information. However, the cut-off frequency is empirically determined, and the clear regions are easily affected.
To preserve the quality of the clear regions, Shen et al. [15] proposed an adaptive homomorphic filter to eliminate thin clouds in visible remote sensing images. The authors further suggest using the clear-sky regions in the image to determine the cut-off frequency. This method cannot properly resolve the confusion between the spatial frequencies of cloud and of low-frequency land surfaces, which affects the spectral characteristics of pixels in cloudless regions.

Haze optimization transformation (HOT) method
To address images with uneven cloud, Zhang et al. [16] proposed the HOT method, in which the pixels in a clear scene construct a clear skyline. When affected by cloud, the distance of every pixel to the clear skyline is used to assess the relative cloud thickness. However, some sensitive land-surface types, such as snow, bare soil, man-made buildings, and water bodies, can induce spurious HOT responses. To solve these problems, He et al. [17] proposed an atmospheric correction technique, a virtual cloud point (VCP) method, based on the advanced haze optimization transformation (AHOT). However, it still depends on the selection of cloudless samples and requires more manual intervention. Jiang et al. [18] presented a high-fidelity haze removal method based on a semiautomatic HOT transform, but the method only performs well on vegetation-covered scenes. Chen et al. [19] proposed a remote sensing cloud correction method based on cloud detection with the iterative haze-optimized transformation (IHOT) and cloud removal with cloud trajectories. The visual quality of remote sensing images is improved by the relationship between surface reflectance and IHOT. However, the estimation of cloud trajectories is affected by shadow detection errors, which leads to inaccurate cloud trajectory estimation.

Physical model method
In remote sensing images with thin clouds, the spectral signals are mainly composed of two parts [20]. The first part is the solar light reflected by the cloud, and the second part is the solar light that passes through the cloud after being reflected by the ground object. The expression of the model is as follows:

s(x, y) = L (1 − t(x, y)) + L a r(x, y) t(x, y),      (1)

where s(x, y) is the image received by the sensor, L is the solar irradiance, r(x, y) is the reflectance of the ground, t(x, y) is the cloud transmittance in the atmosphere, and a is the sunlight attenuation coefficient. In addition, the values of r(x, y), t(x, y), and a range from zero to one. This model considers the spectral characteristics of clouds. By assuming zero absorption in the cloud, the model regards the sum of the reflectivity and transmittance of the cloud as one. Therefore, this model does not consider the loss of radiation energy as light passes through the cloud.
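For illustration, the following minimal NumPy sketch inverts the model above to recover the ground reflectance, assuming the transmittance map t(x, y) and attenuation coefficient a have already been estimated by some external procedure (not shown); the function name and default values are ours, not from the cited work.

```python
import numpy as np

def invert_thin_cloud_model(s, t, L=1.0, a=1.0, eps=1e-6):
    """Recover the ground reflectance r(x, y) from the sensed image s(x, y)
    under the thin-cloud imaging model s = L(1 - t) + L * a * r * t,
    where t(x, y) is the cloud transmittance and a the sunlight attenuation
    coefficient (both assumed to be estimated beforehand)."""
    r = (s - L * (1.0 - t)) / (L * a * t + eps)   # solve the model for r
    return np.clip(r, 0.0, 1.0)                   # reflectance stays in [0, 1]
```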
Tan et al. [21] proposed, for the first time, a method to remove fog from a single image by maximizing the local contrast of the image. This method improved the visual quality of foggy images; however, it is not suitable for images with discontinuous depth. Fattal et al. [7] proposed a defogging method based on independent component analysis (ICA), which mainly relies on the statistical characteristics of colour images. However, this method is not suitable for greyscale images. He et al. [8] performed feature statistics on a large number of fog-free images and proposed a single-image defogging method based on the dark channel prior (DCP). However, due to the difference in imaging distance between remote sensing images and natural images, this method easily causes colour offset in remote sensing images. The above methods have obtained remarkable results on images with uniformly distributed haze. Liu et al. [22] removed thin clouds by a method based on the cloud physical model, and they used the maximum and minimum radiation correction method to repair the surface colour values. However, the method requires an accurate estimation of the parameters of the cloud physical model.

Retinex algorithm
Most current image enhancement methods were proposed to improve the visual appearance of low-illumination remote sensing images. These methods include the retinex algorithm [23], the single-scale retinex (SSR) algorithm [24], the multi-scale retinex (MSR) algorithm [25], and the multi-scale retinex with colour restoration (MSRCR) algorithm [26]. The retinex algorithm proposed by Land [23] in 1965 was applied to image dynamic range compression and image colour fidelity [27]. The MSR algorithm can be regarded as the result of linear weighted combinations of SSR at multiple scales. Based on the theory of the MSR algorithm, Jobson et al. [26] proposed the MSRCR algorithm, which adds a colour recovery coefficient to the MSR algorithm. Although these retinex algorithms improve the image quality to some degree, they still possess certain shortcomings. For example, the SSR algorithm cannot achieve dynamic range compression and colour constancy at the same time. When the original image does not meet the grey domain hypothesis, the MSR algorithm causes colour distortion. Although the MSRCR is good at avoiding colour distortion, its enhancement of details in darker regions is not obvious.

Data-driven method
Compared to the above model-driven methods, the effective non-linear expression ability of deep learning has been exploited for removing thin cloud. MCGAN [10] is a neural network model trained on near-infrared (NIR) band images and RGB cloud images synthesized with Perlin noise [13]. However, the texture and spectrum of the Perlin noise are different from those of actual clouds, and the output image of the MCGAN model contains artefacts. This method depends on NIR data, which can only partly penetrate clouds, whereas synthetic aperture radar (SAR) data are almost completely independent of atmospheric conditions and solar irradiation. Bermudez et al. [11] suggested the use of SAR data instead of NIR data. They used SAR data and the conditional GAN method to reconstruct cloudless optical images. The conditional GAN model requires paired cloud/cloudless training data. The Cloud-GAN [12] model used unpaired data for network training and realized the style transformation from cloud images to cloudless images. However, during network training, the output was very sensitive to the initialization of variables. Thus, to achieve the best effect of background colour without distortion, the training process must be repeatedly conducted.

CLOUD REMOVAL USING ATTENTIVE GAN
Several previously described methods have achieved remarkable results by assuming a uniformly distributed haze over a region, which causes approximately similar degrees of blurring. This assumption, however, does not conform to reality: thin cloud in a remote sensing image is a non-uniform medium. Therefore, we propose an attention mechanism to capture the probability distribution of cloud thickness and focus on the meaningful cloud regions that degrade the image. As shown in Figure 1, the overall architecture of the proposed network consists of two main parts: the generative network and the discriminative network, inspired by the concept of GAN [28]. Given remote sensing images corrupted by uneven thin clouds, the generative network attempts to generate cloudless candidates, whereas the discriminative network attempts to distinguish generated candidates from real images. The contest operates in terms of data distribution, the objective of which is expressed as:

min_G max_D  E_R[log D(R)] + E_I[log(1 − D(G(I)))],      (2)

where G represents the generative network, D represents the discriminative network, I is the input cloudy image, and R denotes the clean image. Next, we describe the details of each network.
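The following PyTorch sketch illustrates one alternating update for the min-max objective in Equation (2). It assumes a discriminator D that outputs a probability in [0, 1] and uses the common non-saturating generator objective, so it is a generic GAN step rather than the authors' exact training procedure.

```python
import torch
import torch.nn.functional as F

def adversarial_step(G, D, cloudy, clean, opt_G, opt_D):
    """One alternating update of Equation (2): D learns to score clean images
    as real and G(cloudy) as fake; G learns to fool D."""
    fake = G(cloudy)

    # Discriminator update: maximise log D(R) + log(1 - D(G(I)))
    d_real, d_fake = D(clean), D(fake.detach())
    loss_D = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator update (non-saturating form): maximise log D(G(I))
    d_fake = D(fake)
    loss_G = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```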
As shown in Figure 1, we feed the input feature map extracted from three residual blocks (RB) to the spatial attention block (SAB), in which the spatial attention module (SAM) generates an attention map to guide three subsequent spatial attentional residual blocks and highlight the cloudy features. Both the attention map and the cloudy image are fed to the contextual autoencoder network to reconstruct the clean image.

Generative network
As shown in Figure 1, our generative network consists of two sub-networks: a spatial attentive network and a contextual autoencoder network. The spatial attentive network is able to "focus" its "attention" on the interesting regions of the input image, that is, the regions degraded by thin cloud.

Spatial attentive network
To capture the concentration of cloud in the image, we create an attention map guided by the mask M. We use the following loss function to optimize the spatial attention network:

L_ATT = ‖A − M‖_2^2,

where A is the attention map produced by the SAM in the attention network; unlike a binary map, M takes values ranging from 0 to 1, which indicate the concentration of the cloud. The input cloud image is shown in Figure 2(a). We also visualize the attention map produced by the SAM in Figure 2(d).
Red colour indicates pixels that are highly likely covered by thin cloud in the attention map. It is observed that SAM can effectively identify the regions affected by clouds, although the appearance of the clouds considerably varies. The attention map generated by two rounds of direction-aware identity recurrent neural networks (IRNN) can more accurately locate the cloud distribution than that generated by one round of IRNN. The attention map is used to guide the following cloud removal procedure.
There are visual attention models available for visual recognition and classification, but they are not flexible and adaptive in aggregating long-range contextual information. It is generally believed that, in an image, nearby pixels are closely related while distant pixels are weakly correlated [29]; each neuron does not need to perceive the global image but only the local image. To overcome this limitation, we adapt the attentive mechanism based on the two-round four-directional IRNN [30] to explore global contextual information and facilitate the detection of cloudy regions.
One aspect concerns computing the directional feature maps. We summarize the direction-aware IRNN model for calculating the feature f_{i,j} at location (i, j). Taking the information propagated from left to right over the whole feature map as an example,

f_{i,j} = max(α_right f_{i,j−1} + f_{i,j}, 0),

where α_right represents the weight parameter of the convolution operation for that direction. Propagating in this way along all four directions and repeating the procedure for a second round, each pixel can obtain the global context information. Figure 3 shows how two rounds of direction-aware IRNN operations gather global context information.
First, the direction-aware IRNN architecture uses a 1 × 1 convolution to perform an input-to-hidden feature conversion. Afterward, it propagates information independently in four directions (left, right, top, and bottom) to aggregate local spatial context features. We combine the recurrent translation results of the four directions into the intermediate feature map (see Figure 3(b)). By repeating the above process, the global spatial context is finally obtained (see Figure 3(c)). In Figure 3(a), each pixel knows only its local spatial context; after the first round of data conversion, each pixel in Figure 3(b) knows its spatial context in four directions; therefore, after two rounds of data conversion, each pixel obtains the global spatial context information (Figure 3(c)).
The other aspect concerns computing the directional weight maps. To further analyse the spatial context information in a directional way, in the SAM the input feature map is passed through a three-layer convolution operation and transformed into a direction-aware weight map W. We divide W into four attention weight maps, denoted W_left, W_right, W_down, and W_up. The four weight maps are multiplied element-wise with the spatial context features of the corresponding directions. The details of the spatial attentive network are summarized in Algorithm 1.

Algorithm 1: Spatial attentive network
Step 1. Forward propagation: A = Inference(I)
Step 2. Initialize: network weight/bias parameters
Step 3. Evolution: find optimal solutions (for r = 1 to 2 do)
Step 4. Get optimal solution: L = arg min(‖A(x) − M‖_2^2)
Step 5. Back propagation: update the parameters
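A minimal PyTorch sketch of one round of four-directional IRNN propagation is given below. It assumes the recurrence f_{i,j} = max(α f_{i,j−1} + f_{i,j}, 0) from above with one scalar recurrent weight per direction and fuses the four sweeps with a 1 × 1 convolution; the class name, channel handling, and fusion layer are illustrative, not the exact layers of our SAM (which additionally multiplies each direction by its weight map W_left, W_right, W_down, W_up before fusion). Stacking two such rounds gives each pixel the global context described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FourDirectionIRNN(nn.Module):
    """One round of direction-aware IRNN: a 1x1 input-to-hidden conv, four
    recurrent sweeps (left->right, right->left, top->bottom, bottom->top),
    and a 1x1 conv that fuses the concatenated sweeps."""
    def __init__(self, channels):
        super().__init__()
        self.input_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.alpha = nn.Parameter(torch.full((4,), 0.5))   # recurrent weight per direction
        self.fuse = nn.Conv2d(4 * channels, channels, kernel_size=1)

    @staticmethod
    def _sweep(x, alpha):
        # Left-to-right recurrence along the width dimension:
        # h[..., j] = relu(alpha * h[..., j-1] + x[..., j])
        cols = [x[..., 0]]
        for j in range(1, x.size(-1)):
            cols.append(F.relu(alpha * cols[-1] + x[..., j]))
        return torch.stack(cols, dim=-1)

    def forward(self, x):
        x = self.input_conv(x)
        right = self._sweep(x, self.alpha[0])
        left = self._sweep(x.flip(-1), self.alpha[1]).flip(-1)
        down = self._sweep(x.transpose(-1, -2), self.alpha[2]).transpose(-1, -2)
        up = self._sweep(x.flip(-2).transpose(-1, -2),
                         self.alpha[3]).transpose(-1, -2).flip(-2)
        return self.fuse(torch.cat([right, left, down, up], dim=1))

# Two stacked rounds endow every pixel with global spatial context.
two_round_irnn = nn.Sequential(FourDirectionIRNN(32), FourDirectionIRNN(32))
```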

Contextual autoencoder network
The contextual autoencoder, similar to the UNet architecture in Figure 1, is the other subnetwork of the generative network. The purpose of the contextual autoencoder is to produce a cloudless remote sensing image. Its input is the cloud image and the attention map from the spatial attentive network. The contextual autoencoder network contains convolution modules, and skip connections between encoder and decoder are added to avoid blurred outputs [32]. The contextual autoencoder architecture can be seen in Table 1. The contextual autoencoder uses dilated convolutional layers, which sense a larger receptive field around each pixel with the same number of parameters.
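As a concrete illustration of dilated convolution, the snippet below builds two 3 × 3 dilated layers in PyTorch; the channel width of 64 and the dilation rates of 2 and 4 are illustrative choices, not the exact configuration listed in Table 1.

```python
import torch.nn as nn

# A 3x3 kernel with dilation 2 covers a 5x5 neighbourhood and with dilation 4
# a 9x9 neighbourhood, yet the number of weights stays that of a 3x3 kernel.
# Setting padding equal to the dilation keeps the spatial size unchanged.
dilated_block = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, dilation=4, padding=4),
    nn.ReLU(inplace=True),
)
```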
To calculate each pixel value in the restored image, the pixel needs to know the contents of the surrounding image. Compared to standard convolution, dilated convolution helps each point effectively 'see' a larger area of the input image [33]. As shown in Figure 1, our contextual autoencoder has two loss functions: a multi-scale loss and a perceptual loss. Upsampling with deconvolution in the contextual autoencoder network can cause halo artefacts in the decloud image. To address this issue, the multi-scale loss is proposed. The multi-scale loss provides a way to restore the global structural information of images over varying resolutions: it compares the layer-by-layer features of the decoder with the real label, so that more context information is obtained from different scales. We define the multi-scale loss function as:

L_M = Σ_i λ_i ‖S_i − T_i‖_2^2,

where S_i represents the i-th layer image extracted from the decoder and T_i represents the real image at the same scale as S_i. We use the output images of the fifth-, third-, and first-from-last layers, which are 0.25, 0.5, and 1 times the original image size, respectively. As the information contained in the small-scale feature maps is less important, we attach more weight to the larger-scale feature maps: λ_i is set to 0.6, 0.8, and 1.0, respectively. In order to maintain edge fidelity, we use the perceptual loss L_P to measure the difference between low-level features of the generated image and the real image. These features are obtained from a VGG-16 network pretrained on the ImageNet dataset. The formula is as follows:

L_P = ‖VGG(O) − VGG(T)‖_2^2,

where O is the output image of the contextual autoencoder and T is the clear image.
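A sketch of the two autoencoder losses is shown below, using the multi-scale weights 0.6, 0.8, 1.0 from above; the depth of the VGG-16 feature extractor used for the perceptual term is our assumption, as the text only states that low-level VGG-16 features are used.

```python
import torch
import torch.nn.functional as F
from torchvision import models

class PerceptualLoss(torch.nn.Module):
    """L_P: MSE between frozen VGG-16 features of the output and the clear
    image; 'layers' selects how deep the feature extractor goes."""
    def __init__(self, layers=9):
        super().__init__()
        vgg = models.vgg16(pretrained=True).features[:layers].eval()
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg = vgg

    def forward(self, output, target):
        return F.mse_loss(self.vgg(output), self.vgg(target))

def multiscale_loss(decoder_outputs, target, weights=(0.6, 0.8, 1.0)):
    """L_M: weighted MSE between decoder outputs S_i at 1/4, 1/2 and full
    resolution and the ground truth T_i rescaled to the matching size."""
    loss = 0.0
    for s_i, w in zip(decoder_outputs, weights):
        t_i = F.interpolate(target, size=s_i.shape[-2:],
                            mode='bilinear', align_corners=False)
        loss = loss + w * F.mse_loss(s_i, t_i)
    return loss
```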
In order to generate sharp, realistic images, we introduce an adversarial loss L_GAN(O), which is the conventional loss function of the generator network:

L_GAN(O) = log(1 − D(O)),

where O is the image predicted by the generator and D is the discriminative network.
The structural similarity index (SSIM) loss is used to measure the structural similarity between the image predicted by the generator and the clear image:

L_SSIM = 1 − SSIM(O, T),

where O is the image predicted by the generator and T is the clear image.
To extract the edge and detail information of the image, the Sobel operator, a common edge detection operator, is used to measure the detail difference between the image predicted by the generator and the clear image:

L_Sobel = ‖Sobel(O) − Sobel(T)‖_2^2,

where O is the image predicted by the generator and T is the clear image.
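Below is a small PyTorch sketch of the Sobel edge term; the use of per-channel 3 × 3 Sobel kernels and an MSE distance between edge maps is our assumption of how the detail difference is measured. The SSIM term can be obtained analogously as 1 − SSIM(O, T) with any off-the-shelf differentiable SSIM implementation.

```python
import torch
import torch.nn.functional as F

def sobel_edges(img):
    """Per-channel Sobel gradient magnitude of a batch of images (B, C, H, W)."""
    kx = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]], device=img.device).view(1, 1, 3, 3)
    ky = kx.transpose(-1, -2)                      # vertical-gradient kernel
    c = img.size(1)
    gx = F.conv2d(img, kx.repeat(c, 1, 1, 1), padding=1, groups=c)
    gy = F.conv2d(img, ky.repeat(c, 1, 1, 1), padding=1, groups=c)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def sobel_loss(output, target):
    """L_Sobel: distance between edge maps of the generated and clear images."""
    return F.mse_loss(sobel_edges(output), sobel_edges(target))
```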
The total generator loss is the weighted sum of the attention loss L_ATT, the multi-scale loss L_M, the perceptual loss L_P, the adversarial loss L_GAN, the SSIM loss L_SSIM, and the Sobel loss L_Sobel.

DISCRIMINATIVE NETWORK
In the discriminative network, we design a novel attentive structure to effectively learn cloud features in a local-to-global attentive manner. Global and local content consistency is used as the criterion in the discriminative network: the global assessment detects content inconsistency over the whole image, while the local assessment examines small specific areas. The discriminative network is used to distinguish real images from false ones. From the local point of view, it is necessary to check the possibly false parts; however, the discriminative network has no prior information to determine which positions may be false, so it would have to learn this by itself. To address this problem, we introduce the attention heat map generated by the spatial attention network into the discriminative network. The architecture of the discriminative network can be seen in Table 2.
We obtain a feature map from the middle layers of the discriminative network and feed it into a convolution layer, so that the discriminator concentrates on the area indicated by the attention map. We define a loss function based on the convolution layer's output and the attention map generated by the spatial attention network. More importantly, we multiply the convolution layer's output with the original feature of the discriminator and input the result to the next layer. In contrast, no attention map is needed for a real image; hence, the attention mask of a real image is expected to approach 0.
The loss function of the discriminator can be expressed as:

L_D = L_DIS + γ L_map,

where L_DIS is the standard adversarial loss function of the discriminative network:

L_DIS = −log(D(R)) − log(1 − D(O)).

In addition, an attention loss is introduced: L_map measures the difference between the attention mask and the feature generated from the middle layer of the discriminator,

L_map = ‖D_MAP(O) − A‖_2^2 + ‖D_MAP(R) − 0‖_2^2,

where D_MAP represents the two-dimensional attention feature generated by the discriminator, γ is set to 0.05, R is a clean remote sensing image, and 0 represents a map containing only zero values. Therefore, for real images, it is unnecessary to focus on specific regions. Table 3 shows that the model without the attention map in the discriminator is degraded compared to the full model in terms of PSNR, SSIM, and FSIMc. Using the attention map loss in the discriminator network facilitates generating results that are closer to real images. The main reason is that the attention map in the discriminator learns cloud features in a local-to-global manner.
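The following PyTorch sketch shows how a discriminator can expose both a real/fake score and a two-dimensional attention feature D_MAP from its intermediate layers, and how the loss above can be assembled with γ = 0.05; the layer sizes and the sigmoid on the map are illustrative choices, not the exact architecture of Table 2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveDiscriminator(nn.Module):
    """Backbone features -> attention map D_MAP -> map-weighted features ->
    real/fake probability."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 64, 5, padding=2), nn.LeakyReLU(0.2),
        )
        self.map_head = nn.Conv2d(64, 1, 3, padding=1)          # D_MAP branch
        self.classifier = nn.Sequential(
            nn.Conv2d(64, 64, 5, stride=4, padding=2), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        feat = self.backbone(x)
        d_map = torch.sigmoid(self.map_head(feat))
        score = self.classifier(feat * d_map)                   # re-weight by attention
        return score, d_map

def discriminator_loss(D, O, R, A, gamma=0.05):
    """-log D(R) - log(1 - D(O)) + gamma * L_map, with L_map pushing D_MAP(O)
    towards the attention map A and D_MAP(R) towards an all-zero map."""
    p_fake, map_fake = D(O)
    p_real, map_real = D(R)
    A_small = F.interpolate(A, size=map_fake.shape[-2:],
                            mode='bilinear', align_corners=False)
    l_map = F.mse_loss(map_fake, A_small) + F.mse_loss(map_real, torch.zeros_like(map_real))
    l_adv = -torch.log(p_real + 1e-8).mean() - torch.log(1.0 - p_fake + 1e-8).mean()
    return l_adv + gamma * l_map
```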

EXPERIMENTAL RESULTS AND EVALUATION
Cloud-covered regions often appear in optical remote sensing images, which limits the accuracy of the acquired data. Cloud removal is therefore an important pre-processing step in remote sensing image analysis. Lin et al. [34] proposed a remote sensing image dataset (RICE) for cloud removal, whose RICE-I subset contains thin cloud images. The RICE-I dataset used here was collected from Google Earth. We require a set of image pairs (one image degraded by cloud, one cloudless), each of which contains exactly the same ground scene. The image pairs were obtained by setting whether to display the cloud layer. There are a total of 500 image pairs in the RICE-I dataset. To increase the diversity of the training dataset, 49 pairs of image patches are cropped from each pair of cloud and corresponding ground truth images. As a result, there are 20,825 pairs of images in our training dataset. In network training, we set the learning rate to 0.0002 and cropped the image size to 224 × 224.
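As a sketch of how such training pairs can be prepared, the snippet below cuts a 7 × 7 grid of 224 × 224 patches (49 pairs) from each cloud/cloudless image pair; the grid layout and possible patch overlap are our assumption, since the text only specifies the patch count and crop size.

```python
import os
from PIL import Image

def crop_patch_grid(cloud_path, clear_path, out_dir, patch=224, grid=7):
    """Save a grid x grid set of aligned patch pairs from one image pair."""
    cloud, clear = Image.open(cloud_path), Image.open(clear_path)
    w, h = cloud.size
    xs = [round(i * (w - patch) / (grid - 1)) for i in range(grid)]
    ys = [round(j * (h - patch) / (grid - 1)) for j in range(grid)]
    os.makedirs(out_dir, exist_ok=True)
    stem = os.path.splitext(os.path.basename(cloud_path))[0]
    for j, y in enumerate(ys):
        for i, x in enumerate(xs):
            box = (x, y, x + patch, y + patch)
            cloud.crop(box).save(os.path.join(out_dir, f"{stem}_{j}{i}_cloud.png"))
            clear.crop(box).save(os.path.join(out_dir, f"{stem}_{j}{i}_clear.png"))
```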
For quantitative evaluation, three popular metrics are employed in the comparative experiments. The results on synthetic data are measured by two full-reference metrics, the peak signal-to-noise ratio (PSNR) and the SSIM. A higher PSNR indicates better similarity at the pixel level, while an SSIM score close to 1.0 reflects the best image reconstruction with respect to luminance, contrast, and structure. In addition, FSIMc is an image quality assessment metric that takes colour into account and is designed based on the human visual system. An FSIMc score of 1 indicates an undistorted image, and anything lower than 1 indicates some degree of distortion (FSIMc scores below 0.8 correspond to heavily distorted images). We compare the proposed model with six state-of-the-art methods, including CAP [35], DCP [4], DehazeNet [36], IDeRs [37], Cloud-GAN [12], and CGANs [38], on the validation sets respectively.
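PSNR and SSIM can be computed with standard library routines, as shown below (FSIMc is not available in scikit-image and needs a separate implementation); the uint8/255 data range is assumed.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(pred, gt):
    """Full-reference scores for one restored image against its cloud-free
    ground truth; both arrays are H x W x 3 uint8 images."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    return psnr, ssim
```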

Quantitative evaluation
A quantitative comparison of our method with the defogging and cloud removal methods is shown in Table 4. As shown in the table, our method has higher PSNR, SSIM, and FSIMc values than the other methods, which indicates that our method can produce more realistic results.

Qualitative evaluation
The first experiment compares the performance on removing non-uniform thin cloud, as reported in Figure 4. Note that we omit CAP and DCP because they fail completely to produce a usable image when non-uniform clouds occupy most of the visible area. As can be observed, our method is competitive in the visual appearance of the restored images: the clouds are largely removed and the ground objects retain better structures after restoration.
In Figures 5 to 13, we report the performance of the different methods on more cloudy data. The results indicate that the proposed method more effectively removes thin clouds, improves contrast, restores colour information, and retains detailed texture information. It is also observed that CAP and DehazeNet cannot completely remove the cloud and tend to output images with excessive brightness relative to the ground truth. Their performance on high-frequency details, such as textures and edges, is also unsatisfactory. DCP darkens the colours of the image relative to the real image due to its prior assumptions; consequently, it loses details in the deeper regions of the image. Similarly, IDeRs suffers from severe colour distortion. The Cloud-GAN method is suited to image style transfer, and the background colour of its recovered images is easily distorted. Compared to the CGANs results, our outputs have fewer artefacts and better restored structures.
Through these comparative experiments, we observe that our method performs competitively against state-of-the-art networks trained on a benchmark dataset. The main reasons can be attributed to two factors. On the one hand, compared with physical model-based methods, our model is independent of assumptions and priors; the data-driven mechanism lets the proposed model learn features useful for thin cloud removal. On the other hand, the embedded attention network emphasizes the distorted image patches, and noise outside the attentive regions can be ignored, so our model is more robust than other deep learning methods. It can be concluded that our method is more effective in removing thin clouds than the other methods.
Since there are currently no real-world paired datasets for the cloud removal problem, we cannot report results on such data; we therefore select the RICE-I dataset, which was collected from Google Earth and is widely used in the literature. There is, however, a potential threat to validity in the synthetic dataset we use: concrete application domains may have different weather influence factors that could degrade the experimental results. We observe that, even though our model was trained only on the RICE-I dataset, it outperforms the compared methods on all synthetic scenes by a significant margin.

FIGURE 11 The comparison for contrast. Thin cloud removal results of CGANs [38]
FIGURE 12 The comparison for contrast. Thin cloud removal results of our method
FIGURE 13 Thin cloud removal results of the ground truth

CONCLUSIONS
We propose a thin cloud removal method based on deep learning. This method takes advantage of the generative adversarial network. First, the generative network generates an attention map through the attention network, which identifies the cloud area in a local-to-global spatial attention manner. Furthermore, the map is applied along with the input image to generate a cloud-removed image through a contextual autoencoder. Subsequently, the discriminative network performs a comprehensive global and local evaluation of the generated image. The novelty of our study is that, for the first time, an attention network has been applied in the cloud removal field, and a good cloud removal result has been achieved.
Currently, there are no existing metrics that can assess generated cloudless images without clean reference images. It would be interesting to develop such an unsupervised evaluation mechanism in the future.