Face illumination processing via dense feature maps and multiple receptive fields

Recently, illumination processing of facial images based on generative adversarial networks has made favourable progress. However, image quality remains unsatisfactory and recognition accuracy is low when face images are captured under extreme illumination conditions. For these reasons, an elaborately designed architecture based on convolutional neural networks and generative adversarial networks for processing face illumination is presented. A novel dense feature maps loss is put forward, which computes the loss using variously sized feature maps extracted from different convolutional layers of a pre-trained feature network. Moreover, a multiple-receptive-fields-based generator that uses multiple encoders during the encoding phase is also proposed; these encoders share the same structure but have different kernel sizes. A variety of experimental results demonstrate that the method is superior to state-of-the-art methods under various illumination challenges. Code will be available soon at https://github.com/ling20cn/IP-GAN

Introduction: The performance of numerous computer vision tasks such as face recognition [1,2] and expression recognition [3] degrades significantly when images suffer from severe illumination. Therefore, illumination processing of images captured under a variety of illumination conditions is highly desired. In this letter, we mainly focus on the illumination processing of facial images.
In the past several decades, experts around the world have put forward various methods to process illumination. Though these traditional methods have promoted the development of illumination processing, their image quality is unsatisfactory and their identification accuracy also needs further improvement. Inspired by the success of deep convolutional neural networks (DCNNs) [4] in computer vision tasks such as image classification [5,6], object detection [7] and image synthesis [8], some researchers have begun to use deep-learning-based techniques to process face illumination. Ma et al. [9] process illumination of facial images with generative adversarial networks (GANs) [8]. Ma et al. [10] then combine a triplet loss and GANs to process face illumination. Han et al. [11] use asymmetric joint GANs to process facial illumination. For the sake of processing face illumination, Zhang et al. put forward the IL-GAN [12] model based on a variational auto-encoder and GANs. Ling et al. [13] normalise face illumination by instance normalisation based on a fixed gamma value and GANs. Ling et al. [14] process face illumination using a multi-stage feature maps loss and residual blocks at down-sampling, and their method obtains good image quality and high recognition accuracy.
Although the aforementioned methods obtain favourable performance in processing various illumination, some problems remain. For instance, the image quality is not satisfactory and the recognition accuracy is low when face images are captured under extreme illumination. For these reasons, and inspired by the success of GANs in image-to-image translation such as CycleGAN [15], EDIT [16], Pix2Pix [17] and [18], we consider illumination processing to be similar to image translation: faces with standard illumination belong to one domain, whereas poorly lighted faces belong to another.

Method: In this letter, $X$ denotes poorly lighted faces and $Y$ denotes standard-illumination faces. Given samples $x_j^i \in X$ and $y^i \in Y$, $i$ denotes identity and $j$ denotes light type. We hope that the synthesised face image $G(x_j^i)$ and the corresponding standard-illumination image $y^i$ have the same identity $i$, namely $H(G(x_j^i)) = H(y^i)$, where $H$ is a feature extractor such as ResNet-50 [19], LightCNN-9 [20] or LightCNN-29v1 [20]. In the following, we write $x_j^i$ and $y^i$ as $x$ and $y$ for short.
Framework: As illustrated in Figure 1, our method mainly includes four parts: an encoding network, a feature fusion network, a decoding network and a loss network. In order to expand the receptive field of the encoding network, multiple encoders are used. Although we only draw two encoders in the chart, more encoders can be used during the encoding phase according to actual performance; the ellipsis denotes an additional encoder. The outputs of the encoders are fused by element-wise summation:

$$fm_{tcwh} = \sum_{n=1}^{N} fm^{ni}_{tcwh}, \quad (1)$$

where $fm \in \mathbb{R}^{T \times C \times W \times H}$ is a tensor (feature map) containing $T$ images with $C$ channels, width $W$ and height $H$; $fm^{1i}_{tcwh}$ denotes the feature maps output by the $i$th layer of encoder 1; and $N$ is the number of encoders.
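The element-wise summation of Eq. (1) can be sketched as follows; the tensor shapes here are illustrative assumptions, not the exact dimensions used in the letter:

```python
import torch

def fuse_encoder_outputs(feature_maps):
    """Element-wise sum of N feature-map tensors, each of shape (T, C, W, H)."""
    fused = feature_maps[0]
    for fm in feature_maps[1:]:
        fused = fused + fm  # summation over the N encoders, as in Eq. (1)
    return fused

# Example: two encoders, a batch of 4 images, 64 channels, 32x32 maps.
fm1 = torch.randn(4, 64, 32, 32)
fm2 = torch.randn(4, 64, 32, 32)
fused = fuse_encoder_outputs([fm1, fm2])
print(fused.shape)  # torch.Size([4, 64, 32, 32])
```

Because the encoders share the same architecture, their same-layer outputs have identical shapes, so the sum is well defined without any resizing.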
The architectures of the generator and discriminator: As shown in Figure 2, the generator is made up of three parts: the encoding network, the feature fusion network and the decoding network. The encoding network contains one encoder with kernel size 3 and one encoder with kernel size 5. These encoders have the same architecture except for their kernel sizes, a design we call multiple receptive fields; it expands the receptive field of the encoding network. After encoding, the feature fusion network fuses the encoding results obtained from the multiple encoders. The feature fusion network includes a convolution layer and four residual blocks. The decoding network contains four up-sampling layers and six convolution layers; each up-sampling layer is followed by a convolution layer. After four up-sampling operations, the feature maps and the original input have the same width and height. The last two convolution layers do not change the spatial size of their feature maps.
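A minimal sketch of this generator layout is given below: two structurally identical encoders differing only in kernel size (3 vs. 5), a fusion network of one convolution plus four residual blocks, and an up-sampling decoder. Channel widths and layer counts are assumptions kept small for brevity (two up-sampling stages instead of four), not the authors' exact configuration:

```python
import torch
import torch.nn as nn

def make_encoder(kernel_size):
    # Same structure for every encoder; only the kernel size differs.
    pad = kernel_size // 2
    return nn.Sequential(
        nn.Conv2d(3, 32, kernel_size, stride=2, padding=pad), nn.ReLU(inplace=True),
        nn.Conv2d(32, 64, kernel_size, stride=2, padding=pad), nn.ReLU(inplace=True),
    )

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class MultiRFGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc3 = make_encoder(3)  # 3x3 kernels -> smaller receptive field
        self.enc5 = make_encoder(5)  # 5x5 kernels -> larger receptive field
        self.fuse = nn.Sequential(
            nn.Conv2d(64, 64, 3, padding=1),
            *[ResidualBlock(64) for _ in range(4)])
        self.dec = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(64, 32, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2), nn.Conv2d(32, 3, 3, padding=1),
            nn.Tanh())
    def forward(self, x):
        fused = self.enc3(x) + self.enc5(x)  # element-wise fusion, Eq. (1)
        return self.dec(self.fuse(fused))

g = MultiRFGenerator()
out = g(torch.randn(1, 3, 128, 128))
print(out.shape)  # torch.Size([1, 3, 128, 128])
```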
The discriminator of our method is inspired by the components of Pix2Pix [17]. ReLU is used as the activation after each convolution layer, and we replace BatchNorm with InstanceNorm [21]. The input of the discriminator is a 128 × 128 three-channel image.
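A Pix2Pix-style patch discriminator with the two modifications mentioned above (ReLU activations, InstanceNorm instead of BatchNorm) could look like the following; the layer widths are assumptions:

```python
import torch
import torch.nn as nn

# Patch discriminator sketch: three strided convolutions down-sample the
# 128x128 input, and the final convolution emits a map of patch-level
# real/fake scores rather than a single scalar.
disc = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, 4, stride=2, padding=1),
    nn.InstanceNorm2d(128), nn.ReLU(inplace=True),
    nn.Conv2d(128, 256, 4, stride=2, padding=1),
    nn.InstanceNorm2d(256), nn.ReLU(inplace=True),
    nn.Conv2d(256, 1, 4, stride=1, padding=1),  # per-patch score map
)

scores = disc(torch.randn(2, 3, 128, 128))
print(scores.shape)  # torch.Size([2, 1, 15, 15])
```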

Objective function:
Adversarial loss: The adversarial process is formed by the training of the generator G and the discriminator D. G strives to generate a lifelike fake image G(x) to fool the discriminator D, whereas D attempts to distinguish the generated fake image G(x) from the ground-truth image y. The adversarial loss is as follows:

$$\mathcal{L}_{adv} = \mathbb{E}_{y}[\log D(y)] + \mathbb{E}_{x}[\log(1 - D(G(x)))], \quad (2)$$

where x is the input image (poorly lighted face) and y is the standard-illumination face (well-lighted face).
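The two sides of Eq. (2) can be sketched as follows, where `d_real` and `d_fake` stand for the discriminator outputs D(y) and D(G(x)); the non-saturating form for the generator is an assumption, chosen because it is the common practical variant:

```python
import torch

def d_loss(d_real, d_fake):
    # Discriminator maximises Eq. (2); minimising the negative is equivalent.
    return -(torch.log(d_real).mean() + torch.log(1 - d_fake).mean())

def g_loss(d_fake):
    # Generator tries to make D(G(x)) large (non-saturating form).
    return -torch.log(d_fake).mean()

# When D is confident (real scored 0.9, fake scored 0.1) its loss is small.
d_real = torch.full((4, 1), 0.9)
d_fake = torch.full((4, 1), 0.1)
print(float(d_loss(d_real, d_fake)))  # ≈ 0.2107
print(float(g_loss(d_fake)))          # ≈ 2.3026
```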
Dense feature maps loss: If two images are very similar in structure and texture, the feature maps obtained from each convolution layer of a feature extraction network are also similar. Inspired by this, we propose the dense feature maps (DFM) loss. Because the higher layers of the feature extraction network cause colour distortion in the generated image, we use the lower and middle convolution layers of VGG-16 [6] to obtain feature maps and then use these maps to compute the DFM loss:

$$\mathcal{L}_{DFM} = \sum_{k=1}^{5} \| F_k(\hat{y}) - F_k(y) \|_1, \quad (3)$$

where F denotes the VGG-16 [6] network that contains 13 convolution layers; $F_1$ denotes the first convolution layer of VGG-16 together with its subsequent activation layer, and $F_2$, $F_3$, $F_4$ and $F_5$ are defined likewise; $\hat{y}$ is the output of our generator and y is the ground truth, namely the face image with standard illumination.
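The DFM loss can be sketched as below. A tiny stand-in network replaces VGG-16 so the example is self-contained, and the L1 distance per layer is an assumption; in practice the feature network would be a fixed, pre-trained VGG-16 and only its lower and middle stages would be tapped:

```python
import torch
import torch.nn as nn

class TinyFeatureNet(nn.Module):
    """Stand-in for a pre-trained feature extractor; each stage plays F_k."""
    def __init__(self):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU())
            for c_in, c_out in [(3, 8), (8, 16), (16, 32)]])
    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # collect one feature map per stage, F_k(x)
        return feats

def dfm_loss(net, y_hat, y):
    # Sum of per-layer distances between feature maps of y_hat and y.
    return sum((f1 - f2).abs().mean()
               for f1, f2 in zip(net(y_hat), net(y)))

net = TinyFeatureNet().eval()
y = torch.rand(1, 3, 64, 64)
loss = dfm_loss(net, y, y)  # identical inputs -> zero loss
print(float(loss))  # 0.0
```

In training, the feature network's parameters would be frozen so the loss only shapes the generator.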
Experiments: The Extended YaleB database, which has 38 subjects under 64 illumination conditions, is widely used to evaluate the performance of illumination processing methods. Following [25] and [26], subjects 1 to 28 (1792 images) are used to train all the deep-learning-based approaches and subjects 29 to 38 (630 images) are used for testing. In the following, we call it YaleB for short.
Comparisons of illumination processing methods: Figure 3 shows face images after illumination processing by five methods on the YaleB database. It is noticeable that the illumination of the original images varies greatly. From the second column of Figure 3, we can see that the face images generated by CycleGAN are unnatural and noisy. In the third column, the face images are also unnatural and have many highlighted areas. Though the images in the fourth column are natural and clear, their identities are not preserved effectively and some faces are distorted. In the fifth and sixth columns of Figure 3, although our method and Ling et al. seem to have similar visual effects, Ling et al. show some colourful artefacts above the eyebrows when these images are enlarged. Moreover, the right eye is darker than the left eye in the third face image of Ling et al. From Table 1, we can see that all the methods improve image quality after processing the face illumination, and the proposed method obtains higher image quality than the other state-of-the-art methods.
From Table 2, we can see that all the methods improve identification accuracy after processing the face illumination, except for Pix2Pix and EDIT. The main reason is that Pix2Pix and EDIT cannot preserve identities well, and some faces are distorted under extreme illumination. From the final row of Table 2, the identification accuracy reaches 96.67% after processing illumination with our method and then identifying with LightCNN-29v1.
Ablation study: As shown in Table 3, our DFM loss obtains higher identification accuracy than the other losses when ResNet-50, LightCNN-9 or LightCNN-29v1 is used to extract features for face identification. Table 4 reports the identification accuracy of our model trained with different receptive fields, as well as with one encoder plus atrous spatial pyramid pooling (ASPP) [29], which is used to expand the receptive field. From the fifth row of Table 4, compared to one encoder, the identification accuracy improves by 3.33%, 2.38% and 1.43% when two encoders are used to train our model and ResNet-50, LightCNN-9 and LightCNN-29v1, respectively, are used to extract features for identification.

Conclusion: We present a novel scheme for processing face illumination in this letter. Experimental results demonstrate that our method surpasses the state-of-the-art methods in image quality and identification accuracy. In the future, we will extend it to other image-to-image translation tasks such as photo to cartoon, photo to sketch, sketch to photo and so forth.