Data augmentation with occluded facial features for age and gender estimation

Funding Information: Ministry of Science and Technology; 106‐2221‐E‐011‐153‐MY2

Abstract: Here, feature occlusion, a data augmentation method that simulates real‐life challenges to the main features of the human face for age and gender recognition, is proposed. Previous methods achieved promising results on constrained data sets with strict environmental settings, but results on unconstrained data sets are still far from perfect. The proposed method adopts three simple occlusion techniques, blackout, random brightness, and blur, each simulating a different kind of challenge that would be encountered in real‐world applications. A modified cross‐entropy loss that gives less penalty to age predictions that land on the classes adjacent to the ground truth class is also proposed. The effectiveness of the proposed method is verified by implementing the augmentation method and modified cross‐entropy loss on two different convolutional neural networks, the slightly modified AdienceNet and the slightly modified VGG16, to perform age and gender classification. The proposed augmentation system improves the age and gender classification accuracy of the slightly modified AdienceNet network by 6.62% and 6.53% on the Adience data set, respectively. It also improves the age and gender classification accuracy of the slightly modified VGG16 network by 6.20% and 6.31% on the Adience data set, respectively.


| INTRODUCTION
In recent years, facial analysis tasks have become a very popular research topic. Among these tasks, age and gender classification provides basic and essential information about the human face. The information obtained from age and gender classification can be valuable to a wide variety of applications, such as automatic surveillance [1], targeted advertising [2] and so on. Some previous studies [3][4][5][6] showed promising results in constrained environment settings. Even so, there is still room for improvement in real-life and unconstrained situations, which pose challenges such as the subject's face being obstructed, images with low resolution, blurry images, extreme lighting conditions, or subjects with missing distinctive features, that even humans cannot identify easily.
Deep learning methods have achieved considerable success compared to traditional machine learning methods in age and gender classification tasks. The most recent state-of-the-art methods [8][9][10][11] all use convolutional neural networks (CNN). When training a CNN, the training data plays a significant role: with more training data, a CNN can achieve better generalization. However, data sets with correctly labelled data are scarce and difficult to collect, and when training data is limited, overfitting becomes a critical issue.
To deal with overfitting, several strategies can be adopted, such as weight regularization [11], dropout layers [12], and data augmentation [13][14][15]. Data augmentation is a way to increase the amount of training data without collecting new data.
The proposed method aims to augment images by simulating the real-life challenges described earlier, obstruction, extreme lighting, and low resolution, in the region around the facial features of the human face. This creates more diverse training data for the network to adapt to real-life environments, thereby improving the generalization and robustness of the network. A modified cross-entropy loss is also proposed; it gives less penalty to predictions that land on the classes adjacent to the ground truth class. By doing so, the network learns to predict closer to the ground truth, yielding better results.
Our augmentation method applies three simple image processing techniques, blackout, random brightness, and blur, to random areas of the input facial image, each simulating a different kind of challenge that could occur in real-life situations. This augmentation method is tested on the Adience data set [18] and implemented on two networks, the simpler AdienceNet [16] and the more complex VGG16 [17], to show that our augmentation method works on both simple and complex networks.
With the outbreak of COVID-19 in 2019, wearing a face mask is required in many public places. Wearing a face mask occludes two of the main features of the human face, the nose and the mouth. Without these two features, the CNN must rely on other features of the human face to perform age and gender classification, which is what the proposed method aims to achieve.
For this study, the main contributions are as follows: (1) We proposed an image augmentation method to improve the generalization and robustness of CNNs by simulating real-life situations. (2) We also proposed a modified cross-entropy loss function, which gives less penalty to age predictions that land on the age classes adjacent to the ground truth age class. (3) Our proposed augmentation method was implemented and tested on two different networks, the slightly modified AdienceNet [16] and the slightly modified VGG16 [17], to show the viability of our augmentation method regardless of the complexity of the network. (4) Our proposed augmentation method uses simple image processing techniques, making it easy to implement while consuming very little computing resources.
Based on our contributions listed above, the proposed augmentation system improves the age and gender classification accuracy of the slightly modified AdienceNet [16] network by 6.62% and 6.53% on the Adience data set [18], respectively. The proposed augmentation system also improves the age and gender classification accuracy of the slightly modified VGG16 [17] network by 6.20% and 6.31% on the Adience data set [18], respectively.
The proposed modified cross-entropy loss improves age and gender classification accuracy of the slightly modified AdienceNet [16] network by 0.98% and 2.53% on the Adience data set [18], respectively. The proposed modified crossentropy loss also improves the age and gender classification accuracy of the slightly modified VGG16 [17] network by 0.96% and 1.37% on the Adience data set [18], respectively.
The rest of this study is organized as follows: Section 2 reviews preceding works on gender and age recognition tasks and data augmentation methods; Section 3 describes the proposed augmentation method; Section 4 presents the test results on different implementations and analyses the effectiveness of the proposed feature occlusion and the proposed modified cross-entropy loss; and Section 5 concludes the study by summarizing the proposed method and discussing potential future work.

| RELATED WORKS
Many feature extraction methods have been used to learn important information from facial images, and these features provide meaningful attributes for machine learning classifiers to learn and perform age and gender estimation tasks.
Tapia and Perez [19] used local binary patterns [20] to extract features and performed gender estimation with feature selection based on mutual information. Luu et al. [3] combined support vector machines (SVMs) [21] and active appearance models [22] to perform age estimation, utilizing the shape and texture of the faces. El Dib and El-Saban [23] combined active shape models [24] and Gabor features [25] on an SVM [21] to perform age estimation. Hu et al. [5] utilized gradient information by extracting local directional pattern features [26] from facial images to perform age and gender estimation. All of the methods mentioned above were tested on constrained data sets with controlled conditions, such as FERET [27], FG-NET [28] and MORPH [29]. Even though these methods showed promising results, there is still room for improvement in unconstrained environments.
The results achieved in constrained environments are promising, but the results in unconstrained environments are far from the requirements for real-life situations. For unconstrained environments, Wei et al. [30] proposed a dynamic image-to-class warping to perform occluded face detection and occluded face recovery. Yang et al. [31] proposed nuclear norm-based matrix regression, which uses the minimal nuclear norm of the residual representation image as a criterion in the matrix regression model. Song et al. [32] proposed a pairwise differential siamese network that uses a mask learning strategy to find and discard corrupted feature elements for facial recognition. All of the methods mentioned above use reconstruction-, dictionary-, filter- and matching-based techniques, and all showed promising results, but they address only facial recognition.
On the other hand, deep learning-based methods have shown better performance in both unconstrained and constrained cases than traditional machine learning methods, but deep learning methods require more computing resources.
With the rise of deep learning, CNNs have been extensively used in recent age and gender estimation tasks. One of the major advantages of CNNs is that the feature extraction processes are managed automatically by the convolution layers. Through the numerous training passes, the convolution layers learn to obtain meaningful information from the images by updating the weights of the network in each training pass.
Levi et al. [16] proposed a simple convolutional network, the AdienceNet [16], to perform age and gender classification tasks. Lapuschkin et al. [33] implemented age and gender classification tasks on multiple popular CNNs, such as Caffe-Net [34], GoogleNet [35], and VGG16 [17], with different facial alignment settings and pre-trained model weights. Chen et al. [36] used a deep convolutional neural network (DCNN) that consists of three modules, the classifier, the regressor, and the error-correcting module, to perform age estimation tasks. Wolfsharr et al. [37] proposed a hybrid machine learning system that extracts features with a pre-trained CNN and classifies the gender with an SVM [21]. Rodriguez et al. [8] approached the age and gender classification tasks by using an additional CNN utilizing attention mechanism [38], feeding small patches of images with higher attention into the classification network instead of the entire face images. Rothe et al. [9] proposed deep expectation, which used VGG16 [17] as the backbone and expected value formulation for age regression to perform age estimation.
Even though the above-mentioned methods produced respectable performance, none of them performs age and gender estimation in a single model. Levi et al. [16], Lapuschkin et al. [33] and Rodriguez et al. [8] perform age and gender estimation with two separate models. Performing age and gender estimation together in one model can save training time and computing resources. The performance can still be improved by implementing different training strategies, such as training losses, regularizations, and data augmentations.
Wang and Perez [14] compared the performance between generating augmented data with generative adversarial networks [39] and using traditional transformations, which consists of shifting, scaling, rotation, flipping, distortion and hue shading of the training images. They also proposed a network that attempts to learn the optimal augmentation techniques to be used for a given data set. Zoph et al. [13] identified 22 data augmentation methods that are beneficial to object detection tasks.
DeVries and Taylor [40] proposed cutout, a simple regularisation technique that randomly masks out square regions of input images during training, and it can be interpreted as an extension of dropout [12] in the input space, and showed performance improvements in object detection tasks.
All of the above-mentioned methods provided positive improvements to the performance of the networks, but most of these methods were implemented and tested on object detection and recognition tasks. The proposed method was designed for facial analysis tasks and aimed to simulate real-life situations that could happen to a human face, such as obstruction, blur and extreme lighting.

| PROPOSED METHOD
In this section, the proposed method, feature occlusion, the slightly modified AdienceNet [16], the slightly modified VGG16 [17], the modified cross-entropy loss and each of their components are discussed in detail.

| Feature detection
The purpose of the proposed method is to improve the generalization and robustness of convolutional neural networks by augmenting facial images to simulate real-life and unconstrained situations. These situations include facial features being blurred, obstructed or missing. For example, wearing sunglasses blocks the eyes and part of the nose, wearing a facial mask blocks the mouth and the nose, and extreme lighting can cause features to disappear. To create real-life and unconstrained situations, we perform occlusions on the three main features of human faces: eyes, nose and mouth. To locate these features, we perform face detection on the input facial images, and face alignment is also performed to improve the quality of the input images. Detailed information on the face detection and face alignment used in the proposed method is described in Sections 3.1.1 and 3.1.2, respectively.

| Face detection and alignment
To perform occlusion on facial features, we have to find the facial features first. We extract the facial region and facial landmarks using the MTCNN [41] face detection tool. The MTCNN [41] is a multitask cascaded CNN unifying facial detection and alignment tasks, proposed by Zhang et al. The MTCNN [41] pipeline consists of three stages covering three different tasks. The first stage uses a proposal network to find candidate facial regions, and these candidates are fed into the second stage's refine network to reject false candidates. Finally, the third stage identifies and locates the facial region and five facial landmarks. The accuracy of the MTCNN [41] is very high: of the 18,622 images in the Adience data set [18], the MTCNN found all but 34 faces. Most of these images are of extremely low quality or have the major facial features heavily obstructed; in some cases, even humans cannot estimate the age of the subject. The heavily obstructed images are still used for training, but they do not participate in the feature occlusion process. Examples of the facial region and landmark detection of MTCNN [41] and of low-quality images are shown in Figure 1. After the features are detected, an affine transformation is performed on the original image using the OpenCV [42] image processing library to align the facial image.

| Feature occlusion
After facial landmarks are obtained and facial images have been aligned, feature occlusion is performed. Section 3.2.1 describes how feature regions are selected, Sections 3.2.2-3.2.4 describe how each type of occlusion is performed on the selected feature regions.

| Region selection
As described in Section 3.1.1, we detected five facial landmark coordinates, left eye, right eye, nose, left mouth corner and right mouth corner, from the input facial image. We use these coordinates to determine the region of occlusion. There are five different regions selected in the proposed method, left eye, right eye, nose, both-eye and mouth regions. The five selected regions are shown in Figure 2.
To determine the region size, we use the distance between the two eyes to calculate o, the offset used to transform landmark coordinates into rectangular regions. The offset is computed with Equation (1):

$o = |Re_{x_0} - Le_{x_0}| / \gamma$  (1)

where γ is the parameter that controls the size of the occluded region, which is set to three in our experiments, and $Le_{x_0}$ and $Re_{x_0}$ are the x coordinates of the left and right eyes, respectively.
For the left eye, right eye and both-eye regions, we calculate the selected region by expanding the landmark coordinates into four different coordinates, creating a square. The expansion of the left eye coordinates into the four corner coordinates of its region is shown in Equation (2):

$Le_0 = (Le_{x_0} - o,\ Le_{y_0} - o),\quad Le_1 = (Le_{x_0} + o,\ Le_{y_0} - o),\quad Le_2 = (Le_{x_0} - o,\ Le_{y_0} + o),\quad Le_3 = (Le_{x_0} + o,\ Le_{y_0} + o)$  (2)

where $Le_0$, $Le_1$, $Le_2$ and $Le_3$ are the four corner coordinates of the selected left-eye region, and $Le_{x_0}$ and $Le_{y_0}$ are the left eye coordinates from the aligned MTCNN [41] facial image. The expansion process is the same for the right eye and nose regions.
For the mouth region, the left and right mouth corner coordinates are used for the expansion; we expand the two corners into four coordinates, creating a rectangle. The expansion process for the mouth region is shown in Equation (3):

$(Lm_{x_0} - o,\ Lm_{y_0} - o),\quad (Rm_{x_0} + o,\ Rm_{y_0} - o),\quad (Lm_{x_0} - o,\ Lm_{y_0} + o),\quad (Rm_{x_0} + o,\ Rm_{y_0} + o)$  (3)

where $Lm$ and $Rm$ denote the left and right mouth corner coordinates, analogous to the notation of Equation (2).
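As an illustration, the region-selection geometry above can be sketched in a few lines of Python (the helper names `eye_offset`, `square_region` and `mouth_region` are our own; the paper specifies only the offset o and the corner expansion):

```python
def eye_offset(left_eye, right_eye, gamma=3):
    """Offset o from Equation (1): inter-eye distance divided by gamma."""
    return abs(right_eye[0] - left_eye[0]) / gamma

def square_region(landmark, o):
    """Expand a landmark (x, y) into the four corners of a square of
    half-width o (used for the left-eye, right-eye and nose regions)."""
    x, y = landmark
    return [(x - o, y - o), (x + o, y - o), (x - o, y + o), (x + o, y + o)]

def mouth_region(left_corner, right_corner, o):
    """Expand the two mouth corners into the four corners of a rectangle."""
    lx, ly = left_corner
    rx, ry = right_corner
    top, bottom = min(ly, ry) - o, max(ly, ry) + o
    return [(lx - o, top), (rx + o, top), (lx - o, bottom), (rx + o, bottom)]
```

With γ = 3, for example, eyes 60 pixels apart give an offset of 20 pixels, so each eye region is a 40 × 40 pixel square.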

| Random brightness
Extreme lighting conditions affect not only the whole input facial image but also certain regions of it. For example, heavy sunlight can cause bright spots on facial images, an inadequate or blocked light source can cast strong shadows, and the angle of the face when the image is taken can cause an uneven light distribution. These cases can produce significant differences in contrast and light intensity between neighbouring regions of the facial image. When extreme lighting affects the main facial features, it can confuse the network and decrease classification accuracy. Figure 3 shows examples of extreme lighting from the Adience data set [18]; the top row shows cases affecting the whole facial image, and the bottom row shows cases affecting only a partial region of the facial image. We adjust the brightness of the selected region by combining two operations: multiplying the region by a scale and adding a bias. The computation of random brightness is shown in Equation (4):

$R'_{h,w} = \mathrm{clip}(\alpha R_{h,w} + \beta,\ 0,\ 255)$  (4)

where $R_{h,w}$ and $R'_{h,w}$ represent the region of height h and width w before and after adjusting brightness; the pixel values of $R'_{h,w}$ are clipped between 0 and 255 to prevent them from going out of bounds. α and β are the randomly generated scale and bias, which control the brightening and darkening of the region (Equation (5)). We set α between 0.8 and 1.0 for darkening because when α is smaller than 0.8 the darkening results do not look natural, and β is set to a larger negative value to compensate for the small range of α.

(Figure 2: Five selected regions of the proposed method on images from the Adience data set [18]: (a) left eye, (b) right eye, (c) both-eye, (d) nose and (e) mouth.)
The results of random brightness performed on an image from the Adience data set [18] with different values of α and β are shown in Figure 4.
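A minimal sketch of the brightness adjustment in Equation (4), assuming NumPy images with pixel values in [0, 255] (the function name and the illustrative α and β values below are ours):

```python
import numpy as np

def random_brightness(region, alpha, beta):
    """Scale the region by alpha, shift it by beta, and clip to [0, 255]
    so pixel values stay in bounds."""
    adjusted = region.astype(np.float32) * alpha + beta
    return np.clip(adjusted, 0, 255).astype(np.uint8)
```

Darkening would use, for instance, α = 0.8 with a negative β, while α > 1 and a positive β brighten the region.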

| Blackout
Blackout is the second occlusion technique used in the proposed method; its goal is to recreate real-life situations in which the facial features of the subject are obstructed. This can happen when the subject is wearing glasses, sunglasses or a mask, making hand movements, or standing behind objects. Figure 5 shows some examples of obstructed faces from the Adience data set [18].
We perform blackout by setting the selected region's pixel values to zero, which is black in colour, similar to the method used in Cutout [40]. Cutout [40] is a regularization technique that applies a fixed-size zero-mask at a random location of each input image, which can be interpreted as applying spatial dropout in the input space. Blacking out facial features forces the network to focus more on the minor details of the face instead of the main facial features.
By doing so, we can improve the robustness of the network when the main features of the face are obstructed. Figure 6 shows the results of blackout on an image from the Adience data set [18].
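The blackout occlusion can be sketched with NumPy (the region is assumed to be an axis-aligned box (x0, y0, x1, y1); the helper name is ours):

```python
import numpy as np

def blackout(image, region):
    """Set the pixels inside the selected region to zero (black),
    leaving the original image untouched."""
    x0, y0, x1, y1 = region
    occluded = image.copy()
    occluded[y0:y1, x0:x1] = 0
    return occluded
```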

| Blur
For the last occlusion, we use the blur occlusion. Its main goal is to simulate low-resolution or blurry images. Low resolution results from upscaling the input image when the face is smaller than the input size of the network. Movement while the image is taken, or improper focus of the camera lens, can also lead to blurry images. Figure 7 shows examples of low-resolution and blurry images from the Adience data set [18].
The blur occlusion is applied to the selected region to achieve the blur effect, and the process of blurring the selected region is shown in Equations (6) and (7):

$R'_{h,w} = R_{h,w} * K$  (6)

$K = \frac{1}{k_h k_w} \mathbf{1}_{k_h \times k_w}$  (7)

where $R_{h,w}$ and $R'_{h,w}$ represent the region of height h and width w before and after applying the blur occlusion, the * symbol denotes the convolution operator, and K is the two-dimensional filter kernel of height $k_h$ and width $k_w$; both $k_h$ and $k_w$ were set to 15 in our experiments. Figure 8 shows the result of blur occlusion on an image from the Adience data set [18].
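The mean-filter convolution of Equations (6) and (7) can be sketched with NumPy alone (in practice a library routine such as OpenCV's blur would be used; the function name and the reflection padding at the region borders are our assumptions):

```python
import numpy as np

def blur_region(image, region, k=15):
    """Convolve the selected region with a k x k mean kernel."""
    x0, y0, x1, y1 = region
    patch = image[y0:y1, x0:x1].astype(np.float32)
    pad = k // 2
    padded = np.pad(patch, pad, mode="reflect")  # handle the region borders
    acc = np.zeros_like(patch)
    for dy in range(k):          # sum the k*k shifted copies ...
        for dx in range(k):
            acc += padded[dy:dy + patch.shape[0], dx:dx + patch.shape[1]]
    blurred = image.copy()
    blurred[y0:y1, x0:x1] = (acc / (k * k)).astype(image.dtype)  # ... and average
    return blurred
```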

| Data augmentation
Due to the limited amount of training data in the Adience data set [18], we performed data augmentation to prevent overfitting. After feature occlusion is performed, we obtain an occluded image with the desired size of 256 × 256 pixels, and we crop from the centre and four corners of the occluded image, yielding five images of 228 × 228 pixels. A horizontal flip is then applied to each cropped image, resulting in ten times the amount of original data in the Adience data set. Figure 9 shows the results of data augmentation on an image from the Adience data set [18].
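The crop-and-flip augmentation can be sketched as follows: a 256 × 256 input yields ten 228 × 228 images (the function name is ours):

```python
import numpy as np

def five_crop_flip(image, crop=228):
    """Crop the four corners and the centre of the image, then add a
    horizontally flipped copy of each crop: 1 image -> 10 images."""
    h, w = image.shape[:2]
    cy, cx = (h - crop) // 2, (w - crop) // 2
    origins = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop), (cy, cx)]
    crops = [image[y:y + crop, x:x + crop] for y, x in origins]
    return crops + [np.fliplr(c) for c in crops]
```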

| Convolutional neural network
To verify the effectiveness of feature occlusion, the modified cross-entropy loss and data augmentation on general CNN architectures, we trained and tested on two different networks. The first network is a slightly modified AdienceNet, originally proposed in [16]. The AdienceNet [16] is a shallow neural network with three convolutional layers, three fully connected layers, and around 8.5 million parameters. The second network is a slightly modified VGG16 [17]. VGG16 [17] is a deep CNN with far more parameters than the AdienceNet [16]. With more parameters, the VGG16 [17] is capable of learning more complex features, but it requires a longer training time. According to [33], VGG16 [17] performs very well in age and gender classification tasks, and it can achieve better scores than other popular CNNs such as CaffeNet [34], GoogLeNet [35] and AdienceNet [16].
Detailed information on the slightly modified AdienceNet [16] and the slightly modified VGG16 [17] is described in Sections 3.4.1 and 3.4.2, respectively.

| Modified AdienceNet

We modified the AdienceNet [16] to best suit our experiment. We changed the last layer of the network from one fully connected layer to two fully connected layers: the first has an output of two for gender prediction and the second has an output of eight for age prediction. This modification was made so the two losses can be calculated separately. The architecture of the slightly modified AdienceNet [16] is shown in Figure 10.

(Figure 7: Examples of low-resolution and blurry images from the Adience data set [18]; the top row shows the upscaled images and the bottom row shows the blurry images.)
The modified AdienceNet [16] takes in input images with a size of 228 × 228 × 3 pixels, and the image is then fed into the following convolutional layers: (1) Conv1 with 96 filters of kernel size 7 × 7 and a stride of 4.
Each convolutional layer is followed by a rectified linear unit (ReLU) [43], a batch normalization layer [44], and a max-pooling layer with a kernel size of 3 × 3 and a stride of 2. The output of Conv3 is then fed into four fully connected layers: (1) FC4 with 512 outputs, followed by the ReLU [43] activation and a dropout layer. (2) FC5 with 512 outputs, also followed by the ReLU [43] activation and a dropout layer. (3) FC6 with 8 outputs, representing the eight age classes. (4) FC7 with 2 outputs, representing the two gender classes.
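The two-head modification can be illustrated with a PyTorch sketch. Only Conv1 and the head sizes follow the text above; the rest of the trunk is collapsed into a placeholder pooling layer, so this is a structural sketch rather than the full AdienceNet:

```python
import torch
import torch.nn as nn

class TwoHeadNet(nn.Module):
    """Shared trunk feeding separate age (8-way) and gender (2-way)
    fully connected heads, so the two losses can be computed separately."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=4),  # Conv1: 96 filters, 7x7, stride 4
            nn.ReLU(),
            nn.BatchNorm2d(96),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.AdaptiveAvgPool2d(1),  # placeholder for the remaining conv layers
            nn.Flatten(),
            nn.Linear(96, 512),
            nn.ReLU(),
        )
        self.age_head = nn.Linear(512, 8)     # eight age classes
        self.gender_head = nn.Linear(512, 2)  # two gender classes

    def forward(self, x):
        feats = self.trunk(x)
        return self.age_head(feats), self.gender_head(feats)
```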
| Modified VGG16

VGG16 [17] is a deep network proposed by Simonyan et al. with 13 convolutional layers and three fully connected layers. We also modified the VGG16 [17] to best suit our experiment, changing its last layer from one fully connected layer to two fully connected layers: the first has an output of two for gender prediction and the second has an output of eight for age prediction.
The network is fed with images of size 224 × 224 × 3 pixels, followed by five groups of convolutional layers: (1) Conv1 and Conv2 with 64 filters of kernel size 3 × 3 and a stride of 1. Each group of layers is followed by a ReLU [43] and a max-pooling layer of kernel size 2 × 2 with a stride of 2. The feature maps from Conv13 are then fed into four fully connected layers: (1) FC14 with 4096 outputs, followed by a dropout layer.
The output of each fully connected layer except FC16 and FC17 is applied with a ReLU [43] function to enhance the nonlinearity of the network. The architecture of the slightly modified VGG16 [17] is shown in Figure 11.

| Training details
The proposed method can be broken down into four parts. The first three parts, feature detection and alignment, region occlusion and augmentation, have been described in the previous sections. This section describes the training details of the proposed method, such as initialization, loss function, optimizer and training procedure, in the following sub-sections.

| Initialization
We trained the slightly modified AdienceNet [16] and the slightly modified VGG16 [17] from scratch; the weights were initialized with the Glorot uniform initialization [45] implemented in the PyTorch library. The Glorot initialization [45] is a weight initialization technique that tries to make the variance of the outputs of a layer equal to the variance of its inputs. The weights of each layer were filled with values sampled from the uniform distribution shown in Equation (8):

$W \sim U\!\left(-gain \cdot \sqrt{\tfrac{6}{fan_{in} + fan_{out}}},\ gain \cdot \sqrt{\tfrac{6}{fan_{in} + fan_{out}}}\right)$  (8)

where U is the uniform distribution from which the values are sampled, gain is the scaling factor, which is set to 1 in our experiment, $fan_{in}$ is the number of input units in the weight tensor and $fan_{out}$ is the number of output units in the weight tensor.

(Figure 10: Architecture of the modified AdienceNet [16]; the red box shows the modifications made.)
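A NumPy sketch of the Glorot uniform sampling in Equation (8) (the function name and seed handling are ours; in PyTorch this corresponds to `torch.nn.init.xavier_uniform_`):

```python
import math
import numpy as np

def glorot_uniform(fan_in, fan_out, gain=1.0, seed=0):
    """Sample a (fan_out, fan_in) weight matrix from U(-a, a),
    with a = gain * sqrt(6 / (fan_in + fan_out))."""
    a = gain * math.sqrt(6.0 / (fan_in + fan_out))
    rng = np.random.default_rng(seed)
    return rng.uniform(-a, a, size=(fan_out, fan_in))
```

With fan_in = 100 and fan_out = 50, for example, the bound is sqrt(6/150) = 0.2.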

| Loss function
For the training loss, the age loss and the gender loss are calculated separately. The gender loss uses cross-entropy as the training loss, and the age loss uses a modified version of the cross-entropy loss. The total loss is shown in Equation (9):

$L_t = L_g + L_a$  (9)

where $L_t$ denotes the total loss, $L_g$ the gender loss and $L_a$ the age loss. $L_g$ is shown in Equation (10):

$L_g = -\sum_{x \in X} p_g(x) \log q_g(x)$  (10)

where X is the set of training samples, $p_g(x)$ is the true gender probability distribution of sample x and $q_g(x)$ is the estimated gender probability distribution of sample x.
For the age loss, there are two different situations: exact/off and one-off. Exact/off is when the predicted age class is either correct or completely wrong, where the wrong predictions do not include the classes adjacent to the ground truth class. One-off is when the predicted age class is incorrect but falls in a class adjacent to the ground truth class. The loss for each situation is calculated differently, as shown in Equation (11):

$L_a = \begin{cases} L_e, & \text{exact/off} \\ L_e - L_o, & \text{one-off} \end{cases}$  (11)

where $L_e$ is the exact/off loss and $L_o$ is the one-off loss; both are described in the next two paragraphs.
In the first situation, where the predicted age class is either correct or completely wrong, excluding predictions adjacent to the ground truth class, the exact/off loss is used as the age loss, which is the same as the cross-entropy loss, shown in Equation (12):

$L_e = -\sum_{x \in X} p_a(x) \log q_a(x)$  (12)

where X is the set of training samples, $p_a(x)$ is the true age probability distribution of sample x and $q_a(x)$ is the estimated age probability distribution of sample x.
For the second situation, where the predicted age class is incorrect but falls in a class adjacent to the ground truth class, we calculate the exact/off loss of the prediction and subtract an $L_o$ term from it. The $L_o$ term is the loss of the adjacent class multiplied by an impact factor of $(1 - q_{oa}(x))^c$. By subtracting $L_o$ from $L_e$, the penalty of the age loss is reduced when the prediction is adjacent to the ground truth class, thus encouraging the network to predict closer to the ground truth. The calculation of $L_o$ is shown in Equation (13):

$L_o = -\sum_{x \in X} (1 - q_{oa}(x))^c\, p_{oa}(x) \log q_{oa}(x)$  (13)

where X is the set of training samples, $p_{oa}(x)$ is the true one-off age probability distribution of sample x, $q_{oa}(x)$ is the estimated one-off age probability distribution of sample x and c is the parameter that controls the impact of $L_o$ on $L_a$, which is set to 2 in our experiments.

(Figure 11: Architecture of the slightly modified VGG16 [17]; the red box shows the modifications made.)
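Our reading of the modified age loss can be sketched for a single sample with one-hot labels (the helper name is ours; the reduction of the exact/off loss by the one-off term follows the description above):

```python
import numpy as np

def modified_ce_age_loss(probs, true_cls, c=2):
    """Modified cross-entropy age loss for one sample.

    probs    : softmax probabilities over the age classes
    true_cls : index of the ground-truth age class
    c        : impact-factor exponent (2 in the experiments)

    When the arg-max prediction lands on a class adjacent to the ground
    truth (one-off), the exact/off loss L_e is reduced by
    L_o = (1 - q_pred)^c * (-log q_pred); otherwise L_e alone is used.
    """
    pred = int(np.argmax(probs))
    loss_e = -np.log(probs[true_cls])
    if abs(pred - true_cls) == 1:  # one-off: reduce the penalty
        loss_o = (1.0 - probs[pred]) ** c * (-np.log(probs[pred]))
        return loss_e - loss_o
    return loss_e
```

One-off predictions therefore incur a smaller loss than equally confident predictions that land further from the ground truth class.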

| Optimizer
Stochastic gradient descent with momentum [46] was used as the optimizer for our training process. The weights of the network are updated with the formula shown in Equation (14):

$W_{t+1} = W_t - \eta \nabla_W L + m (W_t - W_{t-1})$  (14)

where $W_t$ denotes the weights of the network at the given time t, $\nabla_W L$ is the gradient, η is the learning rate, which is set to 0.001 in our experiments, and m is the momentum parameter, which is set to 0.9 in our experiments.
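One update step of SGD with momentum, written with an explicit velocity buffer (a common formulation, e.g. PyTorch's; the paper's Equation (14) may arrange the terms slightly differently):

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.001, momentum=0.9):
    """v <- m * v + grad; w <- w - lr * v (lr and m as in the experiments)."""
    velocity = momentum * velocity + grad
    return w - lr * velocity, velocity
```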

| Training method
In the proposed method, we trained the network with the fivefold cross-validation protocol defined in [16]. The five folds are split according to the five folders of the Adience data set [18]. During training, one folder is held out from the rest of the data set and used as validation data; feature occlusion and data augmentation are performed on the remaining data before it is fed into the network. We repeat this process five times, each time holding out a different folder for validation. The details of each fold are discussed in the next section.
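The leave-one-folder-out protocol can be sketched as a generator over the five Adience folders (the folder names are placeholders):

```python
def five_fold_splits(folders=("fold0", "fold1", "fold2", "fold3", "fold4")):
    """Yield (training folders, validation folder) pairs: each iteration
    holds one folder out for validation and trains on the other four."""
    for i, held_out in enumerate(folders):
        train = [f for j, f in enumerate(folders) if j != i]
        yield train, held_out
```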

| EXPERIMENTAL RESULTS
In this section, the results and analysis of the proposed method are discussed. Section 4.1 describes the experimental environment; Section 4.2 introduces the Adience data set [18]; Section 4.3 describes the performance evaluations of the proposed method, and Section 4.4 describes the evaluation of subjects wearing face masks.

| Experimental environment
The proposed method was implemented using Python 3.6 with PyTorch to perform the deep learning tasks. Both training and testing were performed on the same machine with an NVIDIA GTX1080Ti GPU, an Intel E5-1650 CPU, and 46 GB of memory.

| Adience data set
The Adience data set [18] is an age and gender data set consisting of a variety of unconstrained images gathered from Flickr [47], an image hosting service website. The images on Flickr [47] were uploaded by users directly from their smartphones, so the data set contains a wide variety of real-life unconstrained images, with occlusion, low resolution, blur, multiple faces, non-frontal faces, extreme lighting, infants and so on. Figure 12 shows some examples from the Adience data set [18]. The gender label of the Adience data set [18] contains two classes, male and female, and the age label contains eight classes: 0-2, 4-6, 8-13, 15-20, 25-32, 38-43, 48-53 and 60+. However, not all of the images have both the age and gender labels, so we filtered out the images with missing labels. Table 1 shows the breakdown of the labelled images included in the data set before and after the filtering process.

| Performance evaluation
In this section, we evaluate the results of the proposed methods implemented on the slightly modified AdienceNet [16] and VGG16 [17]. The testing process uses a fivefold cross-validation protocol. After the testing process, the final results are obtained by averaging the results from each fold. The gender accuracy, age accuracy and age one-off accuracy of each fold are evaluated. Section 4.3.1 shows the results of age and gender classification.

| Age and gender classification
In this section, the results of the proposed methods are discussed. Sub-section "modified cross-entropy loss" describes the results of the proposed modified cross-entropy loss and sub-section "feature occlusion" describes the results of the proposed feature occlusion.

FIGURE 12 Examples of the Adience data set [18]

| Modified cross-entropy loss
The results of age and gender classification using the proposed modified cross-entropy loss are shown in Table 2 (ADN. = slightly modified AdienceNet [16], VGG. = slightly modified VGG16 [17], CE = cross-entropy loss, MCE = modified cross-entropy loss; the green numbers indicate the improvements made by the modified cross-entropy loss). We evaluate the gender accuracy, the age accuracy and the one-off age accuracy. The one-off accuracy is calculated by accepting predictions that land on the age groups adjacent to the ground truth as correct predictions.
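The one-off metric can be computed directly from the predicted and ground-truth class indices; the function below is a minimal sketch of that calculation.

```python
# One-off accuracy: a prediction counts as correct when it lands on the
# ground-truth age class or on either adjacent class.

def one_off_accuracy(predictions, labels):
    correct = sum(abs(p, ) if False else abs(p - t) <= 1 for p, t in zip(predictions, labels))
    return correct / len(labels)

# Example with the eight Adience age classes indexed 0..7:
acc = one_off_accuracy([0, 2, 5, 7], [1, 2, 3, 7])
# 0 vs 1 (one off), 2 vs 2 (exact), 5 vs 3 (two off), 7 vs 7 (exact) -> 0.75
```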
From the results of Table 2, the proposed modified cross-entropy loss achieves a 0.98% improvement in the age accuracy with the slightly modified AdienceNet [16] and 0.96% with the slightly modified VGG16 [17]; the one-off age accuracy improves by 0.67% with the slightly modified AdienceNet [16] and 0.77% with the slightly modified VGG16 [17], and the gender accuracy improves by 2.53% with the slightly modified AdienceNet [16] and 1.37% with the slightly modified VGG16 [17].

| Feature occlusion
The results of age and gender classification implemented with the proposed feature occlusion and modified cross-entropy loss are shown in Table 3 (CE = cross-entropy loss, MCE = modified cross-entropy loss, FO. = feature occlusion, ADN. = slightly modified AdienceNet [16], VGG. = slightly modified VGG16 [17]; the green numbers indicate the improvements made by feature occlusion alone and by feature occlusion combined with the modified cross-entropy loss). The gender accuracy, age accuracy and one-off accuracy are evaluated. We compare the baseline, the slightly modified AdienceNet [16] and the slightly modified VGG16 [17] trained without feature occlusion and with the normal cross-entropy loss, against our proposed method, the same two networks trained with feature occlusion and the modified cross-entropy loss.
The proposed method achieves a 6.62% improvement in the age accuracy with the slightly modified AdienceNet [16] and 6.2% with the slightly modified VGG16 [17]; the one-off age accuracy improves by 7.16% with the slightly modified AdienceNet [16] and 6.97% with the slightly modified VGG16 [17], and the gender accuracy improves by 6.53% with the slightly modified AdienceNet [16] and 6.31% with the slightly modified VGG16 [17].
We also evaluate the results of the proposed modified cross-entropy loss implemented on the proposed feature occlusion to show the effectiveness of our proposed modified cross-entropy loss.
The modified cross-entropy loss achieves a 0.8% improvement in the age accuracy with the slightly modified AdienceNet [16] and 0.97% with the slightly modified VGG16 [17]; the one-off age accuracy improves by 0.46% with the slightly modified AdienceNet [16] and 0.55% with the slightly modified VGG16 [17], and the gender accuracy improves by 2.21% with the slightly modified AdienceNet [16] and 0.98% with the slightly modified VGG16 [17].
In summary, both the modified cross-entropy loss and the feature occlusion improved the accuracy of both the age and gender classification. This demonstrates that the training data generated by the proposed feature occlusion method improves both the generalization and robustness of the network. Even though the modified cross-entropy loss only reduces the penalty of the one-off age predictions, it still improved the gender accuracy. We believe this is because when the one-off age loss is reduced, the total loss of the network is also reduced, since the total loss is the sum of the age loss and the gender loss, which helps the network predict both the age and the gender classes better.
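One plausible way to realize such a loss is to move part of the target probability mass onto the adjacent age classes, so a one-off prediction incurs a smaller penalty than a far-off one; the neighbour weighting below is our assumption for illustration, not the paper's exact formulation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def modified_ce(logits, target, neighbour_weight=0.15):
    """Cross-entropy against a soft target that shares mass with the
    classes adjacent to the ground-truth age class (assumed weighting)."""
    probs = softmax(logits)
    n = len(logits)
    soft = [0.0] * n
    neighbours = [c for c in (target - 1, target + 1) if 0 <= c < n]
    soft[target] = 1.0 - neighbour_weight * len(neighbours)
    for c in neighbours:
        soft[c] = neighbour_weight
    return -sum(t * math.log(p) for t, p in zip(soft, probs) if t > 0)

# A one-off prediction is penalized less than a far-off one:
logits_one_off = [0.0, 4.0, 0.0, 0.0]   # peak adjacent to target class 0
logits_far_off = [0.0, 0.0, 0.0, 4.0]   # peak three classes away
assert modified_ce(logits_one_off, 0) < modified_ce(logits_far_off, 0)
```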
We also compared the proposed method with other state-of-the-art methods, as shown in Table 4. In Wolfsharr et al. [37], only gender classification is performed on the Adience data set [18]. In Rothe et al. [9] and Chen et al. [36], only age classification is performed on the Adience data set [18]. Eidinger et al. [7], Levi et al. [16] and Lapuschkin et al. [33] perform both age and gender classification on the Adience data set [18].
TABLE 1 Breakdown of (a) age labelled images and (b) gender labelled images of the Adience data set [18] before and after filtering

| Morph database [48]
To verify the effectiveness of the proposed method on different data sets, we evaluated the proposed method on the academic Morph database [48]. The Morph database [48] contains around 55,000 images of 13,618 subjects in the age range of 16-77. These images were taken under constrained environments, unlike the Adience data set [18]. We convert the age labels of the Morph database [48] to group ranges, matching the labels we used for the Adience data set [18]. Table 5 shows the results of the proposed method trained with the Morph database [48]. From the results of Table 5, the proposed feature occlusion achieved a 4.45% improvement in the age accuracy with the slightly modified AdienceNet [16], and feature occlusion combined with the proposed modified cross-entropy loss achieved a 4.91% improvement with the slightly modified AdienceNet [16]. This shows the versatility of the proposed feature occlusion and the modified cross-entropy loss. On both data sets, the proposed methods achieved significant improvements in both age accuracy and gender accuracy.
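The label conversion can be sketched as follows; how ages falling between two Adience ranges (e.g. 22 or 35) are assigned is our assumption (nearest range), since the text does not state it.

```python
# Convert Morph's exact ages to the eight Adience group ranges.
# Ages between two ranges are mapped to the nearest range (assumption).

ADIENCE_RANGES = [(0, 2), (4, 6), (8, 13), (15, 20),
                  (25, 32), (38, 43), (48, 53), (60, 100)]

def age_to_group(age):
    """Return the index of the Adience range closest to the given age."""
    def distance(rng):
        lo, hi = rng
        if lo <= age <= hi:
            return 0
        return min(abs(age - lo), abs(age - hi))
    return min(range(len(ADIENCE_RANGES)),
               key=lambda i: distance(ADIENCE_RANGES[i]))

assert age_to_group(16) == 3   # falls in 15-20
assert age_to_group(77) == 7   # maps to 60+
```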

| Blackout colour
We also experimented with different colours for the blackout technique; the results are shown in Table 6. From the results of Table 6, black resulted in the best performance for both age and gender classification. We believe this is because when the region is set to black (0), the network is forced to learn from other parts of the image, so it will not rely heavily on the occluded facial features.
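A minimal sketch of the blackout operation, assuming a 2-D greyscale image stored as a nested list; real code would operate on an image tensor, and the region coordinates are illustrative.

```python
# Blackout occlusion: fill the selected facial-feature region with a
# constant colour (black, i.e. 0, performed best in our experiments).

def blackout(image, top, left, height, width, colour=0):
    """Fill a rectangular region of a 2-D greyscale image with `colour`."""
    for r in range(top, top + height):
        for c in range(left, left + width):
            image[r][c] = colour
    return image

img = [[255] * 8 for _ in range(8)]
blackout(img, 2, 2, 3, 3)   # occlude a 3x3 region with black
assert img[3][3] == 0 and img[0][0] == 255
```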

| Mask evaluation
We collected 20 images of human subjects wearing different kinds of face masks and tested them on our trained network. Figure 13 shows some of the images used for testing. The results for subjects wearing face masks are shown in Table 7.
From the results of Table 7, both the age accuracy and the one-off accuracy drop significantly. This is because our feature occlusion method only occludes one feature at a time, but when a subject wears a face mask, both the nose and the mouth are covered, which increases the difficulty of age prediction.
We therefore added another region to our proposed feature occlusion method: during region selection, both the mouth and the nose can be selected together as one region to occlude, as shown in Figure 14. The result of this feature occlusion + method is also shown in Table 7. However, after adding the feature occlusion + method, the age and gender accuracy did not increase; only the one-off accuracy increased. We believe this is due to the limited number and limited variety of our testing data. In some cases, even humans cannot predict the age of the testing subject correctly.

| CONCLUSIONS
Here, we proposed a data augmentation method with feature occlusion and a modified cross-entropy loss to perform age and gender classification at the same time. With data augmentation, we aimed to reduce the overfitting problem that often happens when the training data is limited. With the modified cross-entropy loss, we aimed to give less penalty to the one-off age predictions.
The proposed data augmentation method with feature occlusion augments the input data by simulating challenges that can occur in real-life situations. Given an input facial image, the method first randomly selects a facial feature region and then randomly applies one of three occlusion techniques. The occlusion techniques are random brightness, blackout and blur, which simulate extreme lighting conditions, obstruction and low resolution, respectively. By doing so, we can generate more training samples with occluded facial feature regions, thus improving the robustness and generalization of the network.
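The pipeline restated above can be sketched as follows; the brightness range, blur window and flat pixel representation are illustrative assumptions rather than the paper's exact parameters.

```python
import random

# Apply one randomly chosen occlusion technique to a region of pixels
# (here a flat list of greyscale values standing in for a feature region).

def random_brightness(pixels, rng):
    factor = rng.uniform(0.3, 1.7)            # simulate lighting extremes
    return [min(255, int(p * factor)) for p in pixels]

def blackout(pixels, rng):
    return [0] * len(pixels)                  # simulate obstruction

def blur(pixels, rng):
    # Simple box blur to simulate low resolution.
    out = []
    for i in range(len(pixels)):
        window = pixels[max(0, i - 1):i + 2]
        out.append(sum(window) // len(window))
    return out

def occlude_region(pixels, rng=random):
    """Randomly pick one of the three occlusion techniques and apply it."""
    technique = rng.choice([random_brightness, blackout, blur])
    return technique(pixels, rng)

augmented = occlude_region([10, 200, 10, 200])
```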
The proposed modified cross-entropy loss helps the network learn better by giving less penalty to the total loss when wrong age predictions land in the adjacent classes of the ground truth age class. With the modified cross-entropy loss, we improved the age, one-off and gender classification accuracies on both the slightly modified AdienceNet [16] and the slightly modified VGG16 [17]. The proposed data augmentation method with feature occlusion and modified cross-entropy loss achieves a 6.62% improvement in the age accuracy with the slightly modified AdienceNet [16] and 6.2% with the slightly modified VGG16 [17]; the one-off age accuracy improves by 7.16% with the slightly modified AdienceNet [16] and 6.97% with the slightly modified VGG16 [17], and the gender accuracy improves by 6.53% with the slightly modified AdienceNet [16] and 6.31% with the slightly modified VGG16 [17].