Generative adversarial network for low-light image enhancement

Low-light image enhancement is rapidly gaining research attention due to the increasing demands of extreme visual tasks in various applications. Although numerous methods exist to enhance image quality in low light, how to trade off between human observation and computer vision processing remains an open question. In this work, an effective generative adversarial network structure comprising both a densely residual block (DRB) and an enhancing block (EB) is proposed for low-light image enhancement. Specifically, the proposed end-to-end image enhancement method, consisting of a generator and a discriminator, is trained using a joint loss function. The DRB adopts residual and dense skip connections to connect and enhance the features extracted from different depths in the network, while the EB receives unique multi-scale features to ensure feature diversity. Additionally, increasing the feature sizes allows the discriminator to further distinguish between fake and real images at the patch level. The merits of the loss function are also studied to recover both contextual and local details. Extensive experimental results show that our method is capable of dealing with extremely low-light scenes, and the realistic feature generator outperforms several state-of-the-art methods in a number of qualitative and quantitative evaluation tests.


INTRODUCTION
Generally, images captured in low-light environments suffer from various visual quality degradations, including poor visibility [1], low contrast [2] and unexpected noise [3]. These interference factors degrade the quality of the obtained pictures and cause failures in most subsequent computer vision tasks, be they low-level or high-level, such as person re-identification [4] in night video surveillance. On the other hand, low-light image analysis is key to the understanding of scenes under extreme vision conditions, for example, by automatic machines [5] and monitoring equipment [6]. Therefore, low-light image enhancement is gradually becoming one of the most useful and urgent research problems to be resolved. In short, it aims at restoring an image captured under low-light conditions to achieve perceptual details similar to those of a natural-light image, with higher contrast, less noise contamination and superior visibility. Generally speaking, enhancement algorithms consist of a denoising and a brightness adjustment step. The enhancement should allow pertinent visual interpretation of these images and is thus key to most computer-vision-based intelligent systems [7][8][9], for example, automated driving and video surveillance.
However, it is non-trivial to enhance low-light images, since noise is easily amplified but hard to remove in this ill-posed inverse problem. Numerous restoration algorithms have been proposed for this task in the past decade, including [10][11][12][13]. These works mostly attempt to use handcrafted features or priors to exploit the hidden information in low-light patches. For example, Cheng et al. [14] were the first to propose the histogram equalisation (HE) approach for image enhancement. The main idea is to stretch the dynamic range of the original low-light image to that of a natural-light image. However, it often introduces undesirable illumination distortions as well as increased noise levels.
Recently, numerous enhancement methods [13,15,16] based on convolutional neural networks (CNNs) have been proposed to improve enhancement performance. There has been growing research interest in end-to-end deep neural network architectures that model the mapping between a low-light image as input and the desired image as output. Specifically, these learning methods extract abstract features and learn the non-linear mapping functions between input and output using a considerable volume of training data. The state-of-the-art RetinexNet [17] and See-in-Dark [15] are typical examples. RetinexNet consists of a Decompose and a Relight network: the former decomposes the input low-light image into illumination and reflectance maps, while the latter adjusts the overall light distribution. Chen et al. [15] focused on the enhancement of RAW images, which retain more detail from the camera sensor, but their method is less efficient on compressed image datasets. Furthermore, the above-mentioned image enhancement methods suffer performance losses in several extreme scenarios and exhibit shortcomings that yield insufficient enhancement quality, for example, noise and unbalanced light distributions. Meanwhile, the disconnection from high-level applications can also hamper enhancement performance. Finally, these studies did not give much attention to the uncertain relationships between spatial features of various sizes.
To address these problems and further improve the enhancement results, we introduce in the present study a densely residual Generative Adversarial Network (DRGAN) that focuses on both feature extraction and practical utility, that is, enhancement serving high-level vision applications such as face detection in the dark [18].
In particular, we propose a novel feature extraction module for low-light enhancement by exploiting the relationships between extracted features of various sizes. Specifically, we feed the GAN with image pairs of synthetic low-light images and their ground truth (GT) counterparts. Additionally, inspired by the feature pyramid network and the multi-scale feature fusion strategy [19], we design the enhancing blocks to extract features of different sizes, which are concatenated for intermediate processing in order to improve the model's feature representation capacity. To further improve the feature representation, we modify the standard discriminator by increasing the last feature size, allowing it to distinguish between synthetic and real images at the patch level. Experimental results show that, thanks to the proposed modules, our enhancements are more accurate and realistic than the outcomes of the reference algorithms.
The main contributions of this paper include: (1) A novel low-light enhancement network comprising the DRB and EB modules, achieving state-of-the-art performances on several widely used low-light datasets. (2) A novel loss function designed for image detail preservation. (3) An extensive experimental validation demonstrating the improvements in both pixel mapping and high-level visual tasks.
The remainder of this paper is organised as follows: In Section 2, we give a brief overview of the background knowledge and related topics. We describe the proposed model in Section 3 and provide the experimental details and result analysis, with performance comparisons with previous works, in Section 4. Finally, we conclude this work in Section 5.

RELATED WORKS
Compared with natural-light images containing higher contrast and more detailed information, low-light images have low illumination, often resulting in poorer performance in high-level vision tasks. Normally, a low-light environment means limited light sources with weak lighting: only target objects close to the light sources are visible, while considerable illumination variations occur within one image. In this section, we briefly review and analyse three groups of approaches: the traditional methods, the Retinex theory and the learning-based methods.
It is generally acknowledged that low-light image enhancement has gradually become a popular research topic in computer vision, and a number of methods have been proposed recently. One typical characteristic of a low-light image is its lower dynamic range, and thus the most common solution consists of raising the contrast by stretching this range. In particular, a series of approaches, such as histogram equalisation (HE) [14,20], aims at recovering the visibility of dark regions by contrast enhancement. Other well-known methods proposed in the past decades are based primarily on improving image contrast, for example, contrast-limited adaptive histogram equalisation (CLAHE) [21] and brightness-preserving bi-histogram equalisation (BBHE) [22]. However, these global enhancement approaches do not target particular regions for enhancement; for example, dark regions should be treated with priority compared to those with sufficient object details.
Unlike the above-mentioned contrast enhancement methods, the Retinex-based method [17] performs joint illumination adjustment and noise removal by decomposing the captured image into reflectance regions and their corresponding illumination components, generating high-quality output by processing reflectance and illumination separately. Other variations include the single-scale Retinex (SSR) [23], the multi-scale Retinex (MSR) [24] and the robust Retinex [25,26], all having the potential to adjust the illumination and remove noise. However, these methods may also yield over- or under-enhancement due to their simple, single constraints, resulting in unnatural outputs with intense noise artifacts.
With the rapid emergence of powerful computing devices and neural network theory, learning-based methods have proven their excellent learning ability in image reconstruction and enhancement. This is primarily due to loss functions more sophisticated than the Euclidean distance, which is prone to producing blurry results. The LLNet [13] was the first to introduce an auto-encoder for low-light image enhancement. Inspired by the Retinex theory, the MSR-net [27] was then proposed to learn an end-to-end mapping between dark and bright images. Motivated by image component decomposition and illumination adjustment, RetinexNet [17] proposed two networks for decomposition and relighting, learning the key constraints between the decomposition and illumination maps; to further remove noise, RetinexNet added a joint denoising module. More recently, Chen et al. [15] introduced a universal pipeline for low-light image processing based on the end-to-end training of a fully convolutional network. Despite its effectiveness on RAW sensor data, this pipeline cannot be applied to more generic and publicly available datasets.

PROPOSED METHODS
In this section, we first discuss the formulation of the low-light image, then the overall architecture of the proposed densely residual generative adversarial network (DRGAN). Finally, we detail the loss function designed to resolve the limitation of simple constraints in the training process.

Low-light image formulation
In order to understand and resolve the low-light image enhancement problem, Guo et al. [28] introduced the following definition of a low-light image:

L(x) = R(x) • T(x)

where L(x) and R(x) denote the degraded and the original image, respectively, T(x) is the illumination map encoding the light intensity condition, and • is the pixel-wise multiplication operator. Hence, we formulate the problem of low-light image enhancement as the estimation of the non-linear degradation function between the normal- and low-light images.
The main goal here is to accurately simulate the mapping function to recover the original image R(x).
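As a toy illustration of this formulation, the sketch below uses a hypothetical 4×4 image and a constant illumination map (values chosen for illustration only): the image is darkened pixel-wise and recovered exactly when T(x) is known. In practice T(x) is unknown and spatially varying, which is why a learned non-linear mapping is required.

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.uniform(0.5, 1.0, size=(4, 4, 3))  # original (normal-light) image R(x)
T = np.full((4, 4, 1), 0.2)                # low illumination map T(x)
L = R * T                                  # observed low-light image L(x) = R(x) . T(x)

# Enhancement amounts to estimating T(x) (or the inverse mapping) to recover R(x).
R_hat = L / np.clip(T, 1e-5, None)
assert np.allclose(R_hat, R)
```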

Overview of network architecture
Classical computer vision algorithms, such as image denoising, de-blurring and super-resolution, all tend to use conventional CNN architecture modules to achieve image enhancement and reconstruction. However, these existing methods usually only consider the pixel mapping between a low-light image and the corresponding ground truth (GT) while ignoring similarities at the feature level. Based on these observations, we attempt to convert these methods into a GAN model, the primary purpose being to generate high-quality and robust features to reconstruct the degraded images. It has been shown that deep learning networks containing such modules perform excellently in image reconstruction tasks. With insufficient training datasets at hand, we adopt the GAN module to increase the volume and diversity of the training images. The whole architecture is similar to the standard GAN: one generator and its corresponding discriminator. Nevertheless, we make several modifications to improve the performance of this architecture and employ a ResNet-based architecture for the generator, whose details are shown in Figure 1.

Densely residual block (DRB)
In light of the huge successes of CNN-based algorithms [19], we adopt the densely connected scheme and the residual strategy to design a novel feature generator that combines the advantages of both standard CNNs and GANs. Specifically, the generator has a modular architecture composed of three DRBs, and each block consists of five convolutional layers with dense skip connections, as shown in Figure 1(a). Each convolutional layer has a 3×3 kernel.
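The dense-plus-residual wiring of such a block can be sketched as follows. This is a toy illustration only: the `conv3x3` stand-in, the summation of skip inputs (a real block concatenates feature maps along channels) and the layer count are assumptions for the sake of a runnable example.

```python
import numpy as np

def conv3x3(x, scale):
    # Stand-in for a trained 3x3 convolutional layer with a non-linearity.
    return np.tanh(scale * x)

def densely_residual_block(x, n_layers=5):
    # Dense skip connections: every layer sees the outputs of all previous
    # layers. (We sum instead of concatenating channels to keep this toy
    # example shape-stable; the real block concatenates feature maps.)
    feats = [x]
    for i in range(n_layers):
        feats.append(conv3x3(sum(feats), scale=0.1 * (i + 1)))
    # Residual connection: the block output is added back to its input.
    return x + feats[-1]

out = densely_residual_block(np.ones((8, 8)))
print(out.shape)  # (8, 8)
```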

Enhancement block (EB)
To further improve the diversity of extracted features, we introduce the enhancement block (EB), illustrated in Figure 1(c), to extract intermediate features at different scales. Specifically, the EB can extract multi-scale features ranging from low-level edge features to high-level semantic features. The initial motivation of this strategy is to establish a connection between local patches and the global content, and we expect to improve the feature representation capacity through the effective fusion of multi-scale features. The block receives five feature maps: the input feature and four versions processed by average pooling layers at scales of 1/2, 1/4, 1/8 and 1/16, respectively. Then, we concatenate these features as the input of a convolutional layer with a 3 × 3 kernel. Afterwards, we alter the filter size and padding to align the input and output matrices, avoiding the overlapping and grid artifacts caused by de-convolution and up-sampling operations.
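The multi-scale pooling and fusion can be sketched as below. Pooling via reshape and nearest-neighbour upsampling are simplifications for illustration; the actual block uses average pooling layers, convolutions and learned upsampling.

```python
import numpy as np

def avg_pool(x, k):
    # Average pooling with stride k (input sides must be divisible by k).
    h, w = x.shape
    return x.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

def upsample(x, k):
    # Nearest-neighbour upsampling back to the original resolution.
    return np.repeat(np.repeat(x, k, axis=0), k, axis=1)

def enhancement_block(x):
    # Five feature maps in total: the input plus versions pooled to
    # 1/2, 1/4, 1/8 and 1/16 of its size, upsampled and stacked.
    pyramid = [x] + [upsample(avg_pool(x, k), k) for k in (2, 4, 8, 16)]
    return np.stack(pyramid, axis=-1)  # concatenation along the channel axis

fused = enhancement_block(np.arange(32.0 * 32.0).reshape(32, 32))
print(fused.shape)  # (32, 32, 5)
```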

Discriminator
Inspired by [29], we propose to remove batch normalisation to improve computing efficiency, as shown in Figure 1(b). Indeed, WGAN-GP [30] penalises the norm of the discriminator's gradient with respect to each individual input, which invalidates batch normalisation. Therefore, the proposed discriminator follows the basic structure of PatchGAN [31] without batch normalisation. Furthermore, instead of one binary value (either real or fake), the discriminator produces a 32×32 feature matrix to represent the result from a higher-level perspective. Consequently, the discriminator can differentiate images at the feature patch level.
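The patch-level idea can be sketched as follows: instead of a single scalar, the discriminator emits a score map with one score per local patch. Here a toy per-patch average stands in for the convolutional score head, and the 256×256 input size is only an example.

```python
import numpy as np

def patch_scores(img, patch=8):
    # One (toy) score per patch: a PatchGAN-style discriminator judges
    # local patches rather than the whole image, so a 256x256 input
    # yields a 32x32 map of real/fake scores.
    h, w = img.shape
    blocks = img.reshape(h // patch, patch, w // patch, patch)
    return blocks.mean(axis=(1, 3))

scores = patch_scores(np.random.default_rng(1).uniform(size=(256, 256)))
print(scores.shape)  # (32, 32)
```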

Loss function
Recovering high-quality images with high contrast and chromatic richness from low-light inputs is a highly ill-posed problem, for which the design of an appropriate loss function is essential. A better loss function should constrain the training process to ensure optimal network training. In the following, we present each component's effect in the joint loss function and illustrate its contribution to producing sharper edges and more detailed textures. In the optimisation process, the proposed joint loss L_DRGAN consists of the GAN loss L_GAN, the perceptual loss L_per and the contextual loss L_CX as follows:

L_DRGAN = L_GAN + λ_p L_per + λ_c L_CX

where λ_p and λ_c are the weights of the perceptual and contextual terms.

GAN loss
Recently, the relativistic discriminator structure has been widely adopted in several studies [32]. It estimates the probability that real data is more realistic than fake data and also directs the generator to synthesise fake images that are more realistic than the real ones. It is defined as:

D_Ra(x_r, x_f) = σ(C(x_r) − E_{x_f}[C(x_f)])
D_Ra(x_f, x_r) = σ(C(x_f) − E_{x_r}[C(x_r)])

where C indicates the discriminator network, x_r and x_f are samples drawn from the real distribution ℙ_real and the fake distribution ℙ_fake, respectively, and σ represents the activation function.
For the discriminator, we employ the relativistic discriminator and adopt the least squares GAN (LSGAN) [33] formulation for the adversarial objective.
Thus, L_GAN is the sum of the generator loss L_G and the discriminator loss L_D, defined by:

L_D = E_{x_r}[(D_Ra(x_r, x_f) − 1)²] + E_{x_f}[(D_Ra(x_f, x_r))²]
L_G = E_{x_r}[(D_Ra(x_r, x_f))²] + E_{x_f}[(D_Ra(x_f, x_r) − 1)²]

where x_r and x_f are samples drawn from the real distribution ℙ_real and the fake distribution ℙ_fake, respectively.
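A minimal sketch of the relativistic least-squares objectives, computed from raw discriminator scores C(x_r) and C(x_f); this is an illustration rather than the authors' code, and the batch values below are hypothetical.

```python
import numpy as np

def relativistic_lsgan_losses(c_real, c_fake):
    # D_Ra compares each score against the mean score of the opposite class.
    d_ra_real = c_real - c_fake.mean()   # D_Ra(x_r, x_f)
    d_ra_fake = c_fake - c_real.mean()   # D_Ra(x_f, x_r)
    # Discriminator pushes D_Ra(real) toward 1 and D_Ra(fake) toward 0;
    # the generator pursues the opposite targets.
    loss_d = np.mean((d_ra_real - 1) ** 2) + np.mean(d_ra_fake ** 2)
    loss_g = np.mean(d_ra_real ** 2) + np.mean((d_ra_fake - 1) ** 2)
    return loss_d, loss_g

ld, lg = relativistic_lsgan_losses(np.array([1.0, 2.0]), np.array([0.0, -1.0]))
```

Note that with the least-squares penalty the sigmoid activation is effectively dropped, so the losses act directly on the raw score differences.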

Perceptual loss
To obtain realistic images and properly preserve semantic details, we introduce the perceptual loss [34] based on pretrained VGG features to constrain the brighter regions with rich structured features. It is defined as:

L_per = (1 / (H W)) Σ_{i=1}^{H} Σ_{j=1}^{W} ‖H(y)_{i,j} − H(ŷ)_{i,j}‖²

where H(⋅) represents the feature extractor, H(⋅)_{i,j} indicates the feature at the i-th column and j-th row of the network feature map of size H × W, y is the ground truth and ŷ the enhanced output. In this study, we adopt the VGG-19 network pre-trained on ImageNet [35] as the feature extractor. The perceptual loss L_per measures the differences between images in the feature space instead of the pixel space and guides the training process at the semantic level.
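A sketch of the loss in feature space; the arrays below stand in for VGG-19 feature maps, which the actual method extracts from a pretrained network.

```python
import numpy as np

def perceptual_loss(feat_gt, feat_gen):
    # Mean squared error computed in a deep feature space of shape (H, W, C)
    # rather than in pixel space.
    return np.mean((feat_gt - feat_gen) ** 2)

# Placeholder "features" standing in for VGG-19 activations.
f_gt = np.ones((4, 4, 2))
f_gen = np.zeros((4, 4, 2))
loss = perceptual_loss(f_gt, f_gen)
print(loss)  # 1.0
```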

Contextual loss
Contextual loss [36,37] has recently been studied to improve the visual quality of generated images in GAN networks, for example in image style transfer and super-resolution, where the main purpose is to establish the similarity between the input and the desired target. The usual strategies include both pixel losses and global content losses, such as the mean square error (MSE) and the perceptual loss. However, the pixel loss constrains the model per paired pixel between the input and the ground truth, probably resulting in over-smoothing, while the content loss is unregulated in the local patch and cannot preserve the details in the generated image. Therefore, the contextual loss focuses on the similarity between features regardless of their spatial positions. We aim at targeting the darker regions with low illumination by spatially weighting the feature maps across multiple channels. Specifically, minimising the differences between the weighted low-level feature maps should improve the perceptual quality of the brightened regions in the enhanced outputs. The loss is defined by:

L_CX(x, y) = −log(CX(Φ^l(x), Φ^l(y)))

where x denotes the input image, y the target image, and CX the similarity measure between the feature maps Φ^l(x) and Φ^l(y) from the l-th layer of the perceptual network VGG-19 Φ(⋅). Note that the similarity is measured over regions containing the same objects, invariant to their spatial locations. Overall, a pair of images is considered similar when most features of one image can also be found in the other. Hence, the contextual similarity function CX is defined as follows:

CX(Φ^l(x), Φ^l(y)) = (1/N) Σ_j max_i CX_{ij}

We now detail the similarity between individual features. The loss relies on the cosine distance d_{ij} between the features x_i and y_j. When d_{ij} ≪ d_{ik} for k ≠ j, we assume that features x_i and y_j have similar contexts. To simplify the calculation, the cosine distance is normalised as follows:

d̃_{ij} = d_{ij} / (min_k d_{ik} + ε)

with ε = 1e−5. Using an exponential operation, we transform the distance into a similarity:

w_{ij} = exp((1 − d̃_{ij}) / h)

where we set h = 0.5. Hence, the normalised similarity defining the contextual similarity between features is:

CX_{ij} = w_{ij} / Σ_k w_{ik}

The main objective of this loss function is to guide the model to generate images with a natural image feature distribution. Hence, the function measures the differences per channel at each spatial location of the feature map.
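The chain above (cosine distance, normalised distance, exponential similarity, row-normalised CX) can be sketched numerically; the feature sets below are hypothetical stand-ins for VGG-19 feature vectors.

```python
import numpy as np

def contextual_similarity(X, Y, h=0.5, eps=1e-5):
    # X: (N, C) features of one image, Y: (M, C) features of the other.
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)
    Yn = Y / (np.linalg.norm(Y, axis=1, keepdims=True) + eps)
    d = 1.0 - Xn @ Yn.T                                 # cosine distance d_ij
    d_tilde = d / (d.min(axis=1, keepdims=True) + eps)  # normalised distance
    w = np.exp((1.0 - d_tilde) / h)                     # distance -> similarity
    cx = w / w.sum(axis=1, keepdims=True)               # normalised similarity CX_ij
    return cx.max(axis=0).mean()                        # CX: mean of best matches

def contextual_loss(X, Y):
    return -np.log(contextual_similarity(X, Y))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))   # hypothetical feature vectors of image x
Y = rng.normal(size=(12, 4))   # hypothetical feature vectors of image y
loss = contextual_loss(X, Y)
```

Identical feature sets yield a similarity near 1 (loss near 0), while unrelated sets score lower, matching the intent of the loss.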

EXPERIMENTAL VALIDATION
In this section, we discuss the dataset of synthetic low-light images and the detailed setup of the proposed method. Then, we compare the performance of our DRGAN with the reference state-of-the-art methods on several image quality evaluation metrics. Finally, we present an ablation study of the losses in this model and compare the performances on face detection, a high-level visual task.

Synthetic low-light image
Our method is trained on 30K paired images, that is, low-light and bright, synthesised from the VOC2007 dataset. Each low-light image is randomly generated from an original image through the non-linear degradation function:

I_low = F(I; γ) + G(σ)

where F(⋅) represents the gamma adjustment function and G(⋅) the noise component with the given standard deviation σ. Random gamma darkening with controlled noise levels allows us to generate a huge variety of synthetic training images and to validate the robustness of the whole model. Specifically, we adopt additive Gaussian noise in the synthetic images to model the noise of the camera shooting process. However, synthetic images cannot completely replace real-life low-light image data. To fully evaluate the performance of the proposed method, we also include images of various scenes from the LOL [17] and ExDark [2] datasets in the comparison experiments. The LOL dataset is used for objective and subjective evaluations since it includes highly degraded images on which most methods cannot achieve promising results, and the ExDark dataset consists of 7363 low-light images with annotations for 12 object classes. Due to the relatively small volumes of other datasets, such as NPE [16] and MEF [42], we need to ensure the robustness and scalability of the compared methods.
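The synthesis step can be sketched as below; the gamma range and noise level shown are illustrative choices, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(42)

def synthesize_low_light(img, gamma_range=(2.0, 5.0), sigma=0.02):
    # F(.): random gamma darkening of a [0, 1] image.
    gamma = rng.uniform(*gamma_range)
    dark = np.power(img, gamma)
    # G(sigma): additive Gaussian noise modelling the camera shooting process.
    noise = rng.normal(0.0, sigma, img.shape)
    return np.clip(dark + noise, 0.0, 1.0)

bright = rng.uniform(0.3, 1.0, size=(64, 64, 3))
low = synthesize_low_light(bright)
assert low.mean() < bright.mean()  # the synthetic image is darker overall
```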

Implementation details
For the hyperparameters λ_p and λ_c in the loss function, we empirically use 0.5 and 0.5 to weight the components of the whole function. All convolutional kernels are set to 3 × 3 in size except in the EB, where 1 × 1, 3 × 3 and 5 × 5 kernels are used to extract multiple features, followed by a concatenation and one 1 × 1 convolutional layer to rebuild the original dimension. Specifically, we trained all models for 200 epochs with a batch size of 16, and the loss was minimised using the Adam [43] optimiser with a learning rate of 10⁻⁴. We adopted the TensorFlow [44] library to implement the proposed network, with two NVIDIA GeForce GTX 1080 Ti GPUs for computing acceleration.

Referenced metrics
Two standard metrics are adopted to assess the enhancement performance, namely the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM). The PSNR approximates the reconstruction quality of a generated image x compared to the corresponding GT y based on the mean squared error (MSE) as follows:

PSNR(x, y) = 10 log₁₀(max(I)² / MSE(x, y))

where max(I) is the maximum possible pixel value of the image. The SSIM, on the other hand, measures image patches based on three properties: luminance, contrast and structure. The metric is formulated as follows:

SSIM(x, y) = ((2 μ_x μ_y + c₁)(2 σ_xy + c₂)) / ((μ_x² + μ_y² + c₁)(σ_x² + σ_y² + c₂))

where μ_x and μ_y denote the means, σ_x² and σ_y² the variances of x and y, respectively, and σ_xy the covariance between x and y. We fix c₁ = (255 × 0.01)² and c₂ = (255 × 0.03)² to ensure numerical stability.
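Both metrics can be computed directly from the formulas above. Note the SSIM below is a single-window (global) version, whereas the standard metric averages it over local windows.

```python
import numpy as np

def psnr(x, y, max_val=255.0):
    # Peak signal-to-noise ratio from the mean squared error.
    mse = np.mean((x.astype(float) - y.astype(float)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(x, y, c1=(255 * 0.01) ** 2, c2=(255 * 0.03) ** 2):
    # Single-window SSIM over the whole image (a simplified sketch).
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

a = np.full((8, 8), 128.0)
assert psnr(a, a) == float('inf')          # identical images
assert abs(ssim_global(a, a) - 1.0) < 1e-9
```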

Non-referenced metric
Since the ExDark dataset lacks ground truth images, non-referenced evaluation methods are also needed. We adopt the Natural Image Quality Evaluator (NIQE) to examine the performance differences among the compared methods. Tables 1 and 2 report the numerical results of the competitors on the LOL and the ExDark datasets, respectively. Since each low-light image in the LOL dataset has a corresponding normal-light image, we investigate both referenced and non-referenced metrics on LOL, while only the non-referenced metric is compared on ExDark. Table 1 shows that the proposed model significantly outperforms all the other reference methods in all the metrics. It can be noticed that the traditional methods, including LIME, LECARM and BIMEF, generate strong random noise in several scenes, while the CNN-based or GAN-based methods, including our DRGAN, RetinexNet, EnlightenGAN and Zero-DCE, effectively overcome this issue.
For the ExDark dataset, with a larger number (7363) of low-light images and a high diversity of lighting conditions, Table 2 reports the non-referenced NIQE scores. The proposed network demonstrates clear advantages over the others, while showing slight weaknesses in several classes, namely Bicycle, Car, Cat, Motorbike and Table. Furthermore, Zero-DCE, EnlightenGAN and BIMEF are comparable in the total average score and have the best performances in certain categories. Overall, our DRGAN performs significantly better than the competitors. Figures 2 and 3 illustrate visual comparisons on selected images from the LOL and ExDark datasets. Most of the methods brighten the low-quality images; however, severe distortions exist due to inappropriate light adjustment, persistent noise artifacts and colour alterations. For instance, the results from RetinexNet exhibit significant noise, while EnlightenGAN and Zero-DCE fail to effectively enhance several extremely dark regions. By contrast, the proposed method performs well in these cases and recovers the darker regions more successfully. The edge preservation and noise rejection results both corroborate the superiority of our method.

Ablation study

Figure 4 presents the ablation study results showing the effect of each component, L_per and L_CX, of the loss function. We can clearly observe that the results without L_per have relatively lower contrast, while the model without L_CX fails to recover the colour variations and contextual details. The results in Figure 2, regulated by all the loss components, contain clearer details and higher contrast, especially in the zoomed regions. With the joint loss function, the network keeps its focus on local patches in order to recover details such as edges and smaller objects. Hence, we conclude that both loss components play a significant role in the proposed model. In addition, Table 1 presents the loss component ablation results from the image metrics point of view.

Analysis: Face detection in the dark
To further analyse the effect brought by low-light enhancement methods, we also investigate face detection as an extra experimental task. Firstly, we take the Dark Face dataset [18], with over 10,000 images captured in low-light conditions, as the testing dataset. Secondly, the Dual Shot Face Detector (DSFD) [45], trained on the Wider Face dataset [46], is used as the baseline model. Finally, to guarantee a fair comparison, we select 1000 images from the Dark Face training set and feed the results enhanced by the above methods to the baseline. We examine the performances by the average precision (AP), shown in the precision-recall (P-R) curves in Figure 5. We also add the AP curve from the standard toolkit provided with the Dark Face dataset [18]. Overall, the precision of DSFD increases considerably compared to using only the original low-light images, which means the enhancement methods play critical roles in improving precision in the high-level task of face detection. The Zero-DCE and the RetinexNet perform the best with the AP metric, but neither achieves high scores in pixel-measured metrics, as computed in Table 3. The major reason is that these enhancing methods introduce noise artifacts during the enhancement, which significantly reduce their performances in image quality evaluation metrics.

FIGURE 5
The face detection performance on the Dark Face dataset [18], including the P-R curves and AP values
By contrast, the EnlightenGAN and the proposed method achieve the best performances in these quality metrics but are less competitive in the face detection task. This can be explained by the fact that both methods are GAN-based and might introduce additional generated features that distort the original ones and thus interfere with the detectors.
As a general rule, higher performances in pixel-wise metrics cannot guarantee better results in high-level visual tasks.

CONCLUSIONS AND FUTURE WORKS
In this work, we proposed a deep network for low-light image enhancement with the objective of information retrieval rather than physical restoration. We made several adaptations in the loss function design and the basic architecture to establish a robust connection between local patches and global contents. Experimental results demonstrate the superiority of the proposed enhancement method and show competitive performance over existing light enhancement methods, both qualitatively and quantitatively. In future work, we intend to explore more effective low-light enhancement frameworks via unsupervised learning, to reduce the dependency on paired training data. Besides, limiting the interference introduced by generated features is an interesting topic, to improve the performance measured by both pixel-wise metrics and high-level visual tasks.