DuGAN: An effective framework for underwater image enhancement

Underwater image enhancement is an important low-level vision task that has attracted much attention from the community. Clear underwater images are helpful for underwater operations. However, raw underwater images often suffer from different types of distortion caused by the underwater environment. To address these problems, this paper proposes an end-to-end dual generative adversarial network (DuGAN) for underwater image enhancement. Images processed by existing methods are taken as reference training samples, and each is segmented into clear parts and unclear parts. Two discriminators complete adversarial training on the different areas of the images, each with its own training strategy. Benefiting from this framework, the proposed method is able to output images more pleasing than the reference images. Meanwhile, to ensure the authenticity of the enhanced images, content loss, adversarial loss, and style loss are combined as the loss function of our framework. The framework is easy to use, and subjective and objective experiments show that it achieves excellent results compared with the methods mentioned in the literature.


INTRODUCTION
There is no doubt that, for human beings, the development of the ocean has important value in economics, environmental protection, science, and education [1], and this development depends to a great extent on the sensors carried by various submersibles. Among these sensors, visual sensors play an important role because of their mature technology and energy efficiency [2]. Clear underwater images and videos provide sufficient information about the underwater world, which helps workers, researchers, and autonomous underwater vehicles (AUVs) perform their tasks. However, the complex underwater environment causes serious distortion of underwater images, which not only reduces the amount of information they carry but also increases the difficulty of underwater image enhancement. Enhancing underwater images is a challenging task, and the level of underwater image enhancement restricts the exploitation of the ocean. At the same time, more and more underwater entertainment and educational activities, such as diving, also rely on this technology.

Figure 1 shows a typical underwater image; it can be seen that the underwater environment causes two kinds of distortion. First, because red wavelengths are quickly absorbed by water [5], most underwater images suffer colour degradation; depending on the water quality, they often appear bluish or greenish. Second, the low light transmittance and suspended particles of the underwater environment [1] result in low contrast and loss of texture features. Even worse, the influence of these distortions changes with multiple factors such as water quality, camera type, and shooting environment. At the same time, the distortion of each object in an image differs because of the different distances between the objects and the camera [6,7]. These factors increase the difficulty of the task.
Colour degradation, low contrast, and texture-detail loss in underwater images not only hinder human observation but, more importantly, seriously interfere with many existing object recognition and semantic segmentation algorithms [2]. Therefore, the quality of underwater image enhancement methods largely determines the efficiency of automated underwater operations.

FIGURE 1 Underwater image enhancement example. As shown, severe distortion occurs in the distant view of (a), and UCM [3] and RGHS [4] have difficulty restoring this part of the image.

To study the influence of the underwater imaging system on images, a common underwater imaging model derived from the Jaffe-McGlamery model [6,7] can be formulated as

$$I(x) = J(x)\,t(x) + A\,\bigl(1 - t(x)\bigr) \tag{1}$$

where $I(x)$ is the degraded image we observe, $J(x)$ is the clear image to be restored, the global atmospheric light $A$ indicates the intensity of the ambient light, and $t(x) \in [0, 1]$ is the transmission map. The transmission map represents the percentage of the scene radiance reaching the camera and can be defined as

$$t(x) = e^{-\beta d(x)} \tag{2}$$

where $\beta$ is the atmospheric attenuation coefficient and $d(x)$ is the distance from the object to the camera. In this imaging model, because it is difficult to estimate $A$, $\beta$, and $d(x)$ in different underwater environments, restoring $J(x)$ is a challenging task; moreover, real imaging systems are often much more complicated than this model. To address this problem, several non-model-based methods have been proposed, such as relative global histogram stretching (RGHS) [4] and the unsupervised colour correction method (UCM) [3], which combines colour balancing with contrast correction in the RGB and HSI colour models. Other papers, such as [1,5], proposed methods that enhance underwater images by fusing feature maps. However, most of these methods were designed by analyzing a small number of sampled images, so it is difficult for them to cope with diverse underwater environments.
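To make the imaging model of Equations (1) and (2) concrete, the following minimal Python sketch synthesizes a degraded image from a clear one; the per-channel ambient light and attenuation coefficients, and the depth map, are hypothetical placeholders, not values from the paper.

```python
import numpy as np

def degrade(J, d, A=(0.6, 0.7, 0.8), beta=(1.0, 0.4, 0.2)):
    """Apply the Jaffe-McGlamery-style model of Equations (1)-(2).

    J    : clear image, float array in [0, 1], shape (H, W, 3)
    d    : per-pixel distance map in metres, shape (H, W)
    A    : assumed ambient light per RGB channel
    beta : assumed attenuation per RGB channel (red attenuates fastest)
    """
    I = np.empty_like(J)
    for c in range(3):
        t = np.exp(-beta[c] * d)                       # Equation (2)
        I[..., c] = J[..., c] * t + A[c] * (1.0 - t)   # Equation (1)
    return I
```

Because red is attenuated fastest, images synthesized this way turn bluish-green with distance, matching the degradations described above.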
In the past decade, the community has witnessed the excellent performance of deep learning on many vision tasks. However, unlike other computer vision tasks, underwater image enhancement lacks paired samples for supervised training. The research in [8] proposed an end-to-end convolutional neural network (CNN), and the authors built a simulated underwater environment to collect paired training samples. But since it is impossible to simulate every kind of real underwater environment, the robustness of this method is poor. Considering the difficulty of acquiring ground truth, other works used generative adversarial networks (GANs) to train models without paired training samples. Chongyi Li et al. proposed a framework [9] similar to CycleGAN [10] to transfer underwater images into clear in-air images. The authors of UWGAN [11] and UGAN [2] distinguished clear underwater images from distorted ones for adversarial training. Jamadandi and Mudenagudi [12] used style transfer. In general, because of the difficulty of unsupervised training, deep learning has not shown clear advantages over traditional methods in this field.
According to Equations (1) and (2), the distortion of an object in an image is related to its distance from the camera. As shown in Figure 1, most of the above methods can effectively restore close views with slight distortion but have difficulty restoring distant views with serious distortion.
Another idea, motivated by single image dehazing [13-16] and Equations (1) and (2), is to restore underwater images by estimating the transmission map. However, since the underwater environment is more complex than the air environment, estimating the transmission map of underwater images is very difficult. Some works, such as [17,18], restore underwater images using the dark channel prior (DCP), but their results are not ideal. Chongyi Li et al. proposed a method based on blue-green channel dehazing and red channel correction for underwater image restoration [19]. Yan-Tsung Peng and Pamela C. Cosman proposed a depth estimation method [20] based on image blurriness and light absorption. Similarly, Dana Berman et al. proposed a method [21] based on estimating the attenuation ratios of the blue-red and blue-green colour channels. However, the performance of these methods fluctuates with the colour of the water. There are also works [22-24] that estimate the transmission map with deep learning, but most of them are trained on images synthesized from transmission maps; such synthesized images differ significantly from real underwater images, so their performance is very limited.
To overcome these problems, and in contrast to the works above, we propose a user-guided end-to-end dual generative adversarial network (DuGAN) for this task. The proposed network bypasses transmission map estimation and produces pleasing results.

PROPOSED METHOD
Because of the instability of unsupervised training, we need to provide paired training samples for high-quality enhancement. As shown in Figure 1, existing methods [1,3-5] can already restore the slightly distorted parts of images with high quality, but their restoration of the seriously distorted parts is not ideal. Equations (1) and (2) show that the distortion of distant views is greater than that of close views, so most of the areas that existing methods cannot restore are the distant views of images. To solve this problem, we design a dual generative adversarial network comprising three components: a generator, a restoration discriminator, and an enhancement discriminator.

FIGURE 2 Structure of the proposed framework. Green represents the generator, blue the enhancement discriminator, and red the restoration discriminator; solid lines represent forward propagation and dashed lines back propagation of the network. "Real" represents an all-one matrix of the same size as the input image, and "Fake" an all-zero matrix.

The images processed by the existing methods [1,3-5] are used as training samples; unlike ground truth, this training set serves as reference images, so we call it the references R. A corresponding label L segments each reference into clear parts and unclear parts; L is generated by volunteers using image segmentation tools, and the distinction between clear and unclear parts depends on the volunteers' perception of underwater images. Although this segmentation cannot be completely accurate, training on a large number of images lets the network exploit human prior knowledge to learn the characteristics of underwater scenery, so that the network outputs can exceed the quality of the training samples. The restoration discriminator trains the generator in the areas where the reference images R are clear; for the unclear areas, training is performed by the enhancement discriminator, which learns features from the clear areas. Figure 2 shows the architecture of our framework, with different colours representing the generator, restoration discriminator, and enhancement discriminator, and solid and dashed lines representing forward and back propagation. Note that during training we need both G(I_i) and the discriminator outputs for feedback, so the whole process is executed, whereas during testing we only need G(I_i) and therefore complete only the part represented by the green solid line. The role of each component and the loss functions we adopt are described below.

Architecture of DuGAN
This section describes the architectures of all components in DuGAN and the reasons for choosing these structures. The function of the generator G is to enhance the input degraded images and output enhanced images. For this task, we design a 19-layer fully convolutional neural network containing convolutional layers, strided convolutional layers, and deconvolutional layers, with leaky ReLU as the activation function. To balance semantic features and texture features, we adopt a structure similar to U-Net [25]: the generator contains an encoder-decoder process implemented with strided convolutional and deconvolutional layers. Motivated by residual networks [26], we adopt residual blocks (RBs) to extract features of the underwater image. Meanwhile, to make the network focus on more informative features and to help each layer of the generator exploit global information, we adopt a channel attention (CA) module [27] composed of a global average pooling (GAP) layer, a convolutional layer, two fully connected (FC) layers, and a tanh activation function (see Figure 3). Figure 4 shows the architecture of the generator.
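As an illustration, a minimal Keras sketch of a channel attention block of the kind described above (GAP, two FC layers, tanh gate) might look as follows; the reduction ratio is our own assumption, and the convolutional layer the paper mentions is omitted for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers

def channel_attention(x, reduction=4):
    """Channel attention sketch: GAP -> FC -> FC -> tanh gate.

    x: feature map of shape (batch, H, W, C); 'reduction' is assumed.
    """
    c = x.shape[-1]
    w = layers.GlobalAveragePooling2D()(x)                 # (batch, C)
    w = layers.Dense(c // reduction, activation="relu")(w)
    w = layers.Dense(c, activation="tanh")(w)              # per-channel weights
    w = layers.Reshape((1, 1, c))(w)
    return x * w                                           # rescale channels
```

The tanh gate lets the block both amplify and suppress channels, which fits the stated goal of emphasizing informative features using global information.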

Restoration discriminator
In the clear parts of the references, we drive the outputs of the generator G to be as close as possible to the reference samples. If we only ask the CNN to minimize the Euclidean distance between predicted and ground-truth pixels, it tends to produce blurry results [28]. Inspired by pix2pix [28], we therefore design a restoration discriminator D_r that trains the generator together with the Euclidean distance. As shown in Figure 4, the restoration discriminator is a convolutional PatchGAN classifier [28]: a raw image I_i together with the corresponding reference image is input to D_r, which is driven to output "true", whereas I_i together with the output of G is driven to output "false". The restoration discriminator is an 8-layer convolutional neural network with leaky ReLU as the activation function; its architecture is shown in Figure 4.
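The sketch below shows one plausible PatchGAN-style discriminator in Keras, conditioned on the raw image by channel concatenation as in pix2pix; the filter counts and the number of downsampling steps are assumptions, since the paper only states that D_r has 8 layers.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_patch_discriminator():
    """PatchGAN-style D_r sketch: outputs a grid of real/fake scores."""
    raw = layers.Input(shape=(None, None, 3))       # raw image I_i
    img = layers.Input(shape=(None, None, 3))       # reference or G(I_i)
    x = layers.Concatenate()([raw, img])            # condition on I_i
    for filters in (64, 128, 256):                  # assumed widths
        x = layers.Conv2D(filters, 4, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
    x = layers.Conv2D(1, 4, padding="same")(x)      # per-patch score map
    return Model([raw, img], x, name="D_r")
```

Because the output is a score map rather than a single scalar, each patch of the image is judged independently, which is what allows the per-region masking described in the loss section below.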

Enhancement discriminator
The function of the enhancement discriminator D_e is to enhance the unclear parts of the training samples. To perform this task, we design the enhancement discriminator as a pixel-level segmentation network, trained to segment its input image E_i into clear parts and unclear parts according to the label.

FIGURE 4 The network architecture of the proposed method. "5 × 5 × 64 conv" represents a convolutional operation consisting of 64 convolutional kernels of size 5 × 5, and "5 × 5 × 64 deconv" represents the corresponding deconvolutional operation. "stride conv" represents a strided convolution with stride two.

The input E_i of the enhancement discriminator is joined from G(I_i) and the reference R_i:

$$E_i = R_i \odot L_i + G(I_i) \odot (1 - L_i) \tag{3}$$

where G(I_i) is the output of the generator G, I_i is the i-th raw image, R_i represents the i-th reference image of I_i, L_i is the i-th label (equal to one in the clear parts), and the symbol "⊙" stands for the Hadamard product. The enhancement discriminator is a 10-layer convolutional neural network with leaky ReLU as the activation function. As shown in Figure 4, we adopt skip connections to help the gradient backpropagate quickly.
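A one-line sketch of this composition, under our assumption that L_i is 1 in clear regions and 0 elsewhere:

```python
import tensorflow as tf

def compose_input(G_out, R, L):
    """Join generator output and reference per Equation (3).

    G_out, R: tensors of shape (batch, H, W, 3); L: binary mask
    broadcastable to that shape, assumed 1 in clear regions.
    """
    return R * L + G_out * (1.0 - L)   # Hadamard products
```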

Loss function
To generate high-quality images, we design three loss functions for our framework: style loss, content loss, and adversarial loss.

Content loss
The content loss function ℒ_content is the Euclidean distance between the outputs of G and the reference images. We weight this distance, assigning a higher weight to the clear parts of the reference samples and a lower weight to the unclear parts, which drives the generator to preferentially match the clear parts of the references. The content loss is

$$\mathcal{L}_{content} = \left\lVert W(L_i) \odot \bigl(G(I_i) - R_i\bigr) \right\rVert_2 \tag{4}$$

where W(L_i) is the weight function

$$W(L_i) = \lambda_1 L_i + \lambda_2 (1 - L_i) \tag{5}$$

and λ_1, λ_2 are two coefficients for tuning the weights; we suggest setting them to 20 and 0.01, respectively.
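A hedged sketch of this weighted loss in TensorFlow follows; the mean-squared form (rather than a plain L2 norm) is our own choice for numerical convenience, not something the paper specifies.

```python
import tensorflow as tf

def content_loss(G_out, R, L, lam1=20.0, lam2=0.01):
    """Weighted content loss per Equations (4)-(5)."""
    W = lam1 * L + lam2 * (1.0 - L)   # Equation (5): heavy weight on clear parts
    return tf.reduce_mean(tf.square(W * (G_out - R)))  # Equation (4), MSE form
```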

Adversarial loss
The adversarial loss is obtained from D_e and D_r, giving the enhancement adversarial loss ℒ_e and the restoration adversarial loss ℒ_r, respectively. These two losses are applied to the clear and unclear areas of the reference samples by means of the Hadamard product. The restoration adversarial loss ℒ_r can be expressed as

$$\mathcal{L}_r = \mathbb{E}\Bigl[\bigl\lVert L_i \odot \bigl(D_r(I_i, R_i) - \mathbf{1}\bigr) \bigr\rVert_2\Bigr] + \mathbb{E}\Bigl[\bigl\lVert L_i \odot D_r\bigl(I_i, G(I_i)\bigr) \bigr\rVert_2\Bigr] \tag{6}$$

In Equation (6), the Hadamard product ensures that only the error of the parts marked as clear in L_i is fed back to D_r and G; in this way, the network is strengthened to learn from the clear parts of the reference samples. The enhancement adversarial loss ℒ_e can be expressed as

$$\mathcal{L}_e = \mathbb{E}\Bigl[\bigl\lVert (1 - L_i) \odot \bigl(D_e(E_i) - \mathbf{1}\bigr) \bigr\rVert_2\Bigr] \tag{7}$$

Similarly, in Equation (7) the Hadamard product feeds back only the error of the parts marked as unclear in L_i. D_e is trained to perform pixel-wise segmentation of E_i; in this way, G is driven to "deceive" D_e as much as possible, and the unclear parts of the reference samples can be trained without paired samples.
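A minimal sketch of the masked error used by both adversarial terms, assuming the label masks are resized to each discriminator's output resolution:

```python
import tensorflow as tf

def masked_mse(pred, target, mask):
    """Mean squared error restricted to the masked region, so only
    that region's error is fed back (the Hadamard trick above)."""
    err = tf.square(mask * (pred - target))
    return tf.reduce_sum(err) / (tf.reduce_sum(mask) + 1e-8)
```

During discriminator updates the targets flip: D_r sees the reference pair against an all-one target and the generated pair against an all-zero target (the "Real" and "Fake" matrices of Figure 2), masked by L_i in both cases.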

Style loss
Style loss ℒ_style was first applied in neural style transfer [29]. Its function here is to ensure style consistency between the clear areas and the unclear areas of a reference sample. Since the enhancement discriminator already extracts image features while performing pixel-level segmentation, we use the features extracted by the enhancement discriminator for the style-consistency loss, avoiding a pre-trained VGG-net [30] as in [29] and thereby reducing computation. The style loss can be formulated as

$$\mathcal{L}_{style} = f(L_i)\left| \frac{\operatorname{sum}\bigl(\Phi_8(E_i) \odot L_i\bigr)}{\operatorname{sum}(L_i)} - \frac{\operatorname{sum}\bigl(\Phi_8(E_i) \odot (1 - L_i)\bigr)}{\operatorname{sum}(1 - L_i)} \right| \tag{8}$$

where Φ_8 is the feature map output by the 8th layer of the enhancement discriminator, f(x) is a function that sets ℒ_style to 0 when L_i is an all-zero or all-one matrix, and sum(x) represents the sum of the elements in matrix x. Unlike [29], we do not compute the Gram matrix of the feature maps; instead we take the difference between the average feature values of the two parts to train the generator G, which drives the network to generate unclear parts with a style similar to the clear parts of the reference samples.
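A sketch of this mean-activation difference, assuming the mask is resized to the feature map's spatial resolution and broadcast over channels:

```python
import tensorflow as tf

def style_loss(feat, L):
    """Difference of mean activations between clear and unclear regions.

    feat: D_e's 8th-layer feature map, shape (B, H, W, C);
    L: label mask of shape (B, H, W, 1), 1 in clear regions.
    """
    C = tf.cast(tf.shape(feat)[-1], tf.float32)
    n_clear = tf.reduce_sum(L) * C
    n_unclear = tf.reduce_sum(1.0 - L) * C
    mean_clear = tf.reduce_sum(feat * L) / (n_clear + 1e-8)
    mean_unclear = tf.reduce_sum(feat * (1.0 - L)) / (n_unclear + 1e-8)
    # f(.) in Equation (8): zero loss when either region is empty
    gate = tf.cast(n_clear > 0, tf.float32) * tf.cast(n_unclear > 0, tf.float32)
    return gate * tf.abs(mean_clear - mean_unclear)
```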

Total loss
Finally, we combine the content loss, adversarial loss, and style loss to regularize the proposed generative network:

$$\mathcal{L}_{total} = \mathcal{L}_{content} + \mathcal{L}_r + \mathcal{L}_e + \mathcal{L}_{style} \tag{9}$$
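Putting the pieces together, a generator update step might look like the following sketch. It reuses content_loss, masked_mse, and style_loss from the sketches above; feat_8 (a helper returning D_e's 8th-layer activations) is hypothetical, the unweighted sum mirrors Equation (9) as reconstructed here, and masks are assumed pre-resized to each output's resolution.

```python
import tensorflow as tf

def generator_step(G, D_r, D_e, I, R, L, opt):
    """One generator update combining the terms of Equation (9)."""
    with tf.GradientTape() as tape:
        out = G(I, training=True)
        E = R * L + out * (1.0 - L)                   # Equation (3)
        loss = (content_loss(out, R, L)
                + masked_mse(D_r([I, out]), 1.0, L)   # generator side of L_r
                + masked_mse(D_e(E), 1.0, 1.0 - L)    # generator side of L_e
                + style_loss(feat_8(D_e, E), L))      # L_style
    grads = tape.gradient(loss, G.trainable_variables)
    opt.apply_gradients(zip(grads, G.trainable_variables))
    return loss
```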

Datasets
Because of the limitations of synthetic underwater image datasets (e.g. inaccurate formation models, hard assumptions, insufficient images, specific scenes) [31], we used the underwater image enhancement benchmark dataset (UIEBD) [32], a real underwater image dataset. From the 890 images released in UIEBD, volunteers removed 70 samples of poor quality and low reference value. We randomly extracted 120 images as test set A and 36 images as a validation set, and used the remaining 734 images as the training set. For the training set, we processed the training samples with the four existing methods [1,3-5] to obtain references; each picture in our training set therefore has four enhanced pictures as candidates, and the best one is selected by volunteers as the reference sample. We consider only the enhancement quality of the close views in the reference samples, which serve as our ground truth, so we do not use the results of methods based on transmission maps as reference samples. The reference images were segmented by volunteers to distinguish clear from unclear parts. Examples of our training set are shown in Figure 5. Finally, to test the robustness of the proposed method, we randomly extracted 34 images from the RUIE dataset [33] as test set B. Note that the style of the RUIE dataset is far from that of our training set; most samples of test set B are more greenish.

Training details
FIGURE 6 Comparison with several methods mentioned [1,2,5,21]. The images from the first row to the fourth row are instances from test set A, and the bottom two rows are instances from test set B.

We use full-size pictures as the input patches of our framework, converting the training samples into 2046 patches of different sizes no smaller than 320 × 376 by resizing and segmentation. To facilitate the deconvolution operations, the size of these patches is a multiple of eight. To let D_e and D_r train the generator better, we give the two discriminators a higher learning rate than the generator: 0.0008 for the discriminators and 0.0005 for the generator. We use the ADAM optimizer with momentum set to 0.5. TensorFlow [34] is used to build and train the proposed DuGAN framework. The experimental environment is the Windows 10 operating system running on a server with an Intel Core i7-7820X CPU at 3.60 GHz and an Nvidia GeForce GTX 1080Ti.
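For concreteness, the optimizer configuration described above could be set up as follows; the learning rates and momentum (beta_1) are stated in the paper, while the variable names are ours.

```python
import tensorflow as tf

# Generator and discriminator optimizers with the reported settings.
g_opt = tf.keras.optimizers.Adam(learning_rate=0.0005, beta_1=0.5)
d_r_opt = tf.keras.optimizers.Adam(learning_rate=0.0008, beta_1=0.5)
d_e_opt = tf.keras.optimizers.Adam(learning_rate=0.0008, beta_1=0.5)
```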

Objective quality assessment
Underwater image quality measure (UIQM) [35] and underwater colour image quality evaluation (UCIQE) [36] are reference-free image quality assessment metrics, which makes them more suitable for evaluating underwater image enhancement than full-reference metrics. In the past few years they have been widely used to evaluate the effect of various underwater image enhancement methods. However, these two metrics may be biased toward some characteristics and do not take colour shift and artifacts into account [32]. Given these limitations, it is difficult to rely only on reference-free scores to match human perception. We therefore also show results of our method in Figures 6 and 7 and compare them with other results. We compare against existing methods, including non-model-based methods [1,3,5], a transmission-map-based method [21], and deep models [2,9] trained on our dataset. Note that the number of parameters of our framework (28,243) is far smaller than those of the two deep models (53,520 and 88,112). Figure 6 shows results processed by our method alongside the results of other methods, and Figure 7 shows the results of our method in more detail.
As shown in Figure 6, our method has an excellent ability to restore the colour degradation of objects in underwater images and generates satisfactory images. It also enhances test set B with excellent performance, which shows that our model is robust. Meanwhile, as can be seen in Figure 7, our method also enhances the seriously distorted distant views of underwater images. This has been easily overlooked in past research, yet it is very important for this task, because the ability to enhance the distant view determines the viewing distance of submersibles, especially autonomous underwater vehicles (AUVs). The UIQM [35], UCIQE [36], and entropy scores do not fully reflect the quality of enhanced images. Table 1 shows that scores based solely on these metrics are difficult to align with human perception: although the scores of HL [21] and UGAN [2] are high because of high chroma and contrast, they are not very effective methods, and as Figure 6 shows, the images enhanced by our method are closer to the water-free appearance. Therefore, when evaluating the performance of the various methods, the three metrics should be considered comprehensively, and judgments should be made in conjunction with their performance on real images.
For our test sets, Table 2 compares the methods by UIQM [35] and UCIQE [36]; our method scores highly, indicating that its results are better than those of the other methods. Because of the limitations of these metrics, Figures 6 and 7 should be considered jointly when evaluating the effect of these methods.
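Of the three metrics, image entropy is the simplest to reproduce; a standard Shannon-entropy computation over the grayscale histogram looks like this (UIQM and UCIQE involve more elaborate colour and contrast statistics and are not sketched here).

```python
import numpy as np

def image_entropy(gray):
    """Shannon entropy (in bits) of an 8-bit grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]                      # ignore empty histogram bins
    return -np.sum(p * np.log2(p))
```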

Subjective quality assessment
Considering the limitations of objective metrics, we also carried out a subjective quality assessment. The raw image and the results of the different methods were displayed on the screen simultaneously in random order, and we invited 20 volunteers with image processing experience to grade the images. Scores from 0 to 3 represent the worst, general, fine, and the best, respectively. We expect a good result to have high contrast and visibility and abundant details, and especially colour as if the image had been taken without water [9]. To avoid the impact of different scoring styles on the results, we use the following formula to calculate the final score of each method:

$$S_j = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \frac{P_{i,j,k}}{\sum_{j'} P_{i,j',k}} \tag{10}$$

where P_{i,j,k} represents the score given by the k-th volunteer to the j-th method on the i-th image, N is the number of images in the test set, K is the number of volunteers, and S_j is the final score of the j-th method. Table 3 shows the results of the subjective quality assessment; it is obvious that our method achieves excellent performance. To further demonstrate the superiority of our method, we applied it to underwater documentaries with good results, which will be released soon.
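Under the per-volunteer normalization of Equation (10) as reconstructed above, the final scores can be computed from a score array in a few lines:

```python
import numpy as np

def final_scores(P):
    """Volunteer-normalized mean score per method.

    P: array of shape (N_images, N_methods, N_volunteers).
    Returns an array of shape (N_methods,), one S_j per method.
    """
    norm = P.sum(axis=1, keepdims=True) + 1e-8   # each volunteer's total per image
    return (P / norm).sum(axis=2).mean(axis=0)
```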

Application experiments
Both the objective and the subjective experiments above judge quality, directly or indirectly, from human perception of images, but in actual applications we also need to examine how machines perceive them. First, we apply the SIFT [37] operator to an initial pair of underwater images and to the restored versions of the same images, using SIFT in exactly the same way in both cases. As Figure 8 shows, the images enhanced by our method have more matching points, distributed uniformly over the image, while the raw images have fewer, unevenly distributed matching points. Second, we apply YOLOv3 [38], a very effective and widely used object detection network that has been well validated by the community. From test set A we selected 41 images suitable for the object detection task and detected them with YOLOv3 (see Figure 9); the detection accuracy and the number of correctly detected samples are shown in Table 4. Because YOLOv3 uses weights pre-trained on open-air image datasets, the more correct targets detected in the enhanced images, the more effective the enhancement method is. Table 4 shows that our method achieves the best results. Unfortunately, we also find that many otherwise good enhancement methods do not exceed the raw-image baseline in the number of correct detections; these methods are not robust enough and fail on some kinds of images, which affects detection (see Figure 9).
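A minimal OpenCV sketch of the SIFT matching experiment might look as follows; the ratio-test threshold is a conventional value, not one reported in the paper.

```python
import cv2

def count_sift_matches(img1, img2, ratio=0.75):
    """Count SIFT matches between two images using Lowe's ratio test."""
    sift = cv2.SIFT_create()
    _, des1 = sift.detectAndCompute(img1, None)
    _, des2 = sift.detectAndCompute(img2, None)
    matcher = cv2.BFMatcher()
    pairs = matcher.knnMatch(des1, des2, k=2)
    return sum(1 for m, n in pairs if m.distance < ratio * n.distance)
```

Running the same routine on a raw pair and on its enhanced counterpart gives the kind of comparison shown in Figure 8.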

Ablation experiment
To demonstrate the role of each module in our framework, we also performed an ablation experiment. We removed the three loss functions ℒ_r, ℒ_e, ℒ_style and the CA module in turn, named the four simplified models DuGANr, DuGANe, DuGANs, and DuGANc, and compared them on test set A (see Table 5). Note that when the loss function corresponding to a discriminator is removed, that discriminator becomes ineffective, which is equivalent to removing it. Figure 10 makes the function of each module evident: ℒ_r plays the major role in training, while ℒ_e and ℒ_style help to restore detail textures and colour degradation. As for the CA module, without it our model has difficulty perceiving global information and serious artifacts appear in the outputs, even though DuGANc scores higher than the other simplified models.

CONCLUSION
In this paper, we analyzed the shortcomings of existing methods and proposed a user-guided end-to-end dual generative adversarial network (DuGAN) for underwater image enhancement. We take images processed by existing methods as reference training samples and use two discriminators to complete adversarial training on different areas of the images. In comparison with other methods, the proposed framework achieves excellent results, and we designed a variety of experiments to demonstrate this. However, obtaining reference images relies on a user-guided approach, which makes it difficult to train with new images. Enabling the network to generate suitable masks spontaneously will be the focus of our future research.