Face recognition based on adaptive margin and diversity regularization constraints

Funding information: National Natural Science Foundation of China, Grant/Award Number: 61876158; Sichuan Science and Technology Program, Grant/Award Number: 2019YFS0432; National Key Research and Development Project, Grant/Award Number: 2016YFC0802209; Key Research and Development Program of Guangzhou, Grant/Award Number: 202007050002

Abstract

In recent years, convolutional neural networks have learned more robust facial features by introducing margins into their loss functions. These methods manually set the same margin for every class to squeeze the intra-class variations of each class equally. However, the internal feature distributions of different persons in the real world are highly unbalanced, and the distance between different identities is not uniform either. As a result, applying the same margin to all classes might not lead to higher inter-class differences. To address this problem, this paper proposes an adaptive margin based on the feature distribution to squeeze the interior feature spaces of different classes. Simultaneously, because the inter-class margin can adequately represent the distribution of different classes in the feature space, this paper proposes a novel diversity regularization method in which the regularization weight of each class is set dynamically depending on its margin. The method proposed in this paper is intuitively interpretable and can easily be applied to other classification scenarios. Experiments on existing benchmarks demonstrate the superiority of our method over state-of-the-art competitors.


INTRODUCTION
As one of the most common computer vision tasks, face recognition has made significant progress in recent years [1][2][3]. Face recognition is essentially a small sample size problem. Therefore, intra-class compactness and inter-class separability are two critical factors for designing a good face feature extraction algorithm. Before deep learning was widely used in face recognition, much work had been done on improving the discrimination of feature extraction or classifiers, such as Fisher linear discriminant analysis (Fisherface) [4] and the support vector machine [5].
The development in this area shows that large-scale training data is significant for improving the accuracy of face recognition in the wild. However, it is difficult for traditional algorithms to handle large-scale face data. In recent years, convolutional neural networks (CNNs) have gradually become the prioritized choice for face feature extraction. However, the internal feature distribution of real-world data is highly unbalanced, as illustrated in Figure 1(a), and the distance between different classes is uneven. Treating all classes uniformly therefore cannot increase the separability between classes. For intra-class feature learning, we propose an adaptive margin, based on the feature distribution, to squeeze the interior feature spaces of different classes.
In this paper, we propose a novel loss function, i.e. the adaptive margin and regularization loss (AMR-Loss), which uses an adaptive margin and diversity regularization to shape the feature distribution. For the output layer, diversity regularization aims to distribute the classifier neurons (the projection bases of the last layer, i.e. the output layer before softmax) as uniformly as possible to improve inter-class feature separability, as shown in Figure 4(b). Based on the feature distribution, the margins of different classes are determined adaptively; the margin of each class is variable and learnable. Formally, m_i denotes the margin trained for class C_i (m_i can be regarded as the shortest distance between class C_i and all other classes), so that the decision boundary is given by s(cos(θ_1 + m_1) − cos θ_2) = 0, whereas the decision boundary of ArcFace is s(cos(θ_1 + m) − cos θ_2) = 0 (s is the scale factor), as shown in Figure 2. On the other hand, we find that the margin of a class fully reflects its classification performance during training: small margins correspond to low classification performance, as shown in Table 1. Because a class's margin reflects its classification performance during training, we weight the diversity regularization by the margin, so that it focuses on the classes with poor discrimination. This weight is called the dynamic diversity regularization weight. Overall, our main contributions are as follows:

1. We introduce an adaptive margin, which ensures that the model can learn a particular margin for each identity to squeeze its intra-class variations adaptively.

2. We present a regularization term based on the adaptive margin, which explicitly enlarges the distance between different identities. Feature discrimination is improved by promoting inter-class separability for face recognition.

3. The experimental results on image classification and existing accessible face datasets show that the proposed method can be applied to other variants of softmax to further improve their performance.

RELATED WORKS
As CNNs and face recognition are well studied, in this section, we mainly discuss two sub-fields closely related to this paper: (1) loss function; (2) diversity regularization.

Evolution of loss function
The loss function is an essential part of deep CNNs, as it points out the direction of optimization for network training. Loss functions serve two purposes: verification and classification. The contrastive loss [10,11] is a verification loss function that optimizes the Euclidean distance between paired features in the feature space. Softmax loss is the essential classification loss function, and it has various evolutions [12]. The center loss proposed by Wen et al. [2] penalizes the Euclidean distance between a feature and its corresponding clustering center to reduce intra-class variations. L2-softmax adds a constraint during training such that the L2-norm of the features remains constant, which gives similar attention to both good- and bad-quality faces. NormFace [13] builds a cosine layer in a standard CNN model by normalizing the features and weights of the last inner-product layer. Recent studies have found that adding a cosine or angular margin between different classes has a noticeable effect on improving feature discrimination. Large-margin softmax and A-softmax [14] add a multiplicative angular margin to squeeze each class. Similar to A-softmax, CosFace and ArcFace add their margins through addition instead of multiplication. Unlike ArcFace, which sets a fixed margin, AdaptiveFace [15] addresses the unbalanced number of samples across classes in face datasets with an adaptive margin tied to the number of training samples in each class.
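As a concrete illustration of the margins discussed above, the following sketch contrasts an ArcFace-style additive angular margin, cos(θ + m), with a CosFace-style additive cosine margin, cos θ − m, applied to a single target-class logit. The function name and this scalar, single-logit formulation are our own simplification for illustration, not code from any of the cited papers.

```python
import math

def margin_logit(cos_theta, m, s=64.0, kind="arc"):
    """Apply an additive margin to a target-class logit.

    kind="arc": ArcFace-style  s * cos(theta + m)
    kind="cos": CosFace-style  s * (cos(theta) - m)
    """
    if kind == "arc":
        # Recover the angle, add the margin in angle space, map back to cosine.
        theta = math.acos(max(-1.0, min(1.0, cos_theta)))
        return s * math.cos(theta + m)
    # Cosine-space margin: subtract m directly from the cosine.
    return s * (cos_theta - m)
```

With m = 0, both variants reduce to the plain scaled cosine logit; a positive m shrinks the target logit, which is what forces the network to pull features closer to their class centre.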

Diversity regularization
Although CNNs have achieved great success, overparameterization [16] can lead to redundant and highly correlated neurons (e.g. weights of the last inner-product layer), as shown in Figure 4(a). In order to reduce the redundancy of neurons and release the generalization ability of the network, current works address the redundancy problem by enforcing relatively large diversity between pairwise projection bases via regularization; such methods are known as diversity regularization [17]. Diversity regularization is widely used in sparse coding [18], ensemble learning [19], metric learning [20], etc. Early studies of sparse coding showed that the generalization ability of the codebook can be improved by diversity regularization. Diversity is usually modelled with the (empirical) covariance matrix. Many recent works improve diversity regularization in neural networks [21], mainly by promoting large-angle orthogonality or reducing the covariance between base vectors. The diversity of neurons can also be promoted by minimizing their hyperspherical energy on the hypersphere [22,23].

INTUITION AND MOTIVATIONS
This section explains our motivation for the proposed diversity regularization based on the adaptive margin.

The relationship between adaptive margin and class separability
We experiment on CIFAR-10 to analyze the relationship between margin values and feature distribution. We use a simple eight-layer CNN and the softmax loss for training. At test time, we measure the accuracy of each class, shown in Table 1; the classification accuracy of bird, cat and dog in CIFAR-10 is significantly lower than that of the other classes.
We randomly selected 20 samples from each class in the test set and used T-SNE [24] to reduce the network's output dimension from 512-D to 2-D for visualization. As shown in Figure 1(a), there is no apparent separation between the features of birds, cats and dogs and the features of the other classes in CIFAR-10, which leads to poor classification results. Then, to verify the relationship between the distribution of class features in feature space and the adaptive margin, we experimented with AMR-Loss on CIFAR-10. The changes of the margins during AMR-Loss training are shown in Figure 3 (see Section 5.4 for experimental settings). Figure 3 suggests that the classes with good separability (such as car and ship), whose feature distributions are separated from the other classes during training, show a much higher margin growth rate than the other classes and finally settle at large values (0.91, 0.90), while the classes with poor separability (such as bird and cat) keep small margins (0.55, 0.51) for a long time. Therefore, the value of the margin should be adapted to the distribution of the features.

Diversity regularization
Inter-class separability and intra-class compactness are two key factors that affect feature discriminability. However, softmax-based CNN methods, such as Center Loss, SphereFace and ArcFace, mainly focus on intra-class compactness.
Let G(⋅, θ) be all the network layers of the model except the output layer, where θ denotes the model parameters. The matrix W ∈ R^{d×n} represents the parameters of the output layer (d is the length of the output features; n is the number of classes in the training set), which maps the image features to class predictions. Given an input image I_i, we obtain its feature representation x_i = G(I_i, θ). The softmax loss is as follows:

L_softmax = −(1/M) Σ_{i=1}^{M} log( exp(W_{y_i}^T x_i + b_{y_i}) / Σ_{j=1}^{n} exp(W_j^T x_i + b_j) )    (2)

where x_i ∈ R^d represents the output feature of the i-th sample in a batch, and its class label is y_i. W_j ∈ R^d represents the j-th column of the weight W ∈ R^{d×n} of the output layer, and b_j is the j-th component of the bias b ∈ R^n. The number of samples in a batch is M, and the number of classes in the training set is n.
Applying l2 normalization to w_j and x_i so that ‖w_j‖ = 1 and ‖x_i‖ = 1, the feature distance can be expressed through the feature angle:

cos θ_j = w_j^T x_i    (3)

where θ_j is the angle between x_i and w_j. Based on Equation (3), and fixing the bias b = 0, Equation (2) becomes:

L = −(1/M) Σ_{i=1}^{M} log( exp(cos θ_{y_i}) / Σ_{j=1}^{n} exp(cos θ_j) )    (4)

Our method improves the distribution of clusters by extending the distance between clusters in the feature space, so that there is an obvious gap between their distributions. In Equation (4), θ_{y_i} is the angle between the feature embedding x_i and the corresponding weight vector w_{y_i}. Minimizing Equation (4) is equivalent to minimizing θ_{y_i}. Therefore, the weight vector w_j can be considered as the cluster centre of all x_i labelled j (y_i = j).
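To make Equation (4) concrete, here is a minimal numerical sketch of the normalized softmax loss for one sample, where the inputs are the cosines cos θ_j already produced by normalized features and weights. The function name is ours, and the scale factor s used later in the paper is omitted here, as in Equation (4).

```python
import math

def cosine_softmax_loss(cos_thetas, target):
    """Normalized softmax loss of Equation (4) for one sample:
    logits are cos(theta_j) because features and weights are l2-normalized
    and the bias is fixed to zero."""
    exps = [math.exp(c) for c in cos_thetas]
    return -math.log(exps[target] / sum(exps))
```

As expected, the loss shrinks as the target-class cosine grows, i.e. as the angle θ_{y_i} shrinks.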
In Section 5.4, we visualize the output-layer weights w_j (the clustering centre of the j-th class) of the CIFAR-10 model trained with softmax on a three-dimensional sphere. As shown in Figure 4(a), the distances between the w_j of poorly separable classes (such as bird, cat, dog) are relatively small. Therefore, we can increase inter-class separability, and thus potentially the discrimination of the features, by focusing on expanding the distances between the w_j of these poorly separable classes. Combining these two observations, we use the feature-distribution adaptive margin to dynamically adjust the weight of our diversity regularization, so that training focuses on expanding the less separable classes and forces the clustering centres w_j away from each other.

MARGIN AND DIVERSITY REGULARIZATION OF ADAPTIVE FEATURE DISTRIBUTION
In this section, we give the details of our approach, i.e. the feature-distribution adaptive margin and its diversity regularization.

Softmax loss with feature-distribution adaptive margin
In order to improve intra-class compactness, a fixed margin is introduced into Equation (4):

L = −(1/M) Σ_{i=1}^{M} log( exp(s(cos(θ_{y_i} + m))) / ( exp(s(cos(θ_{y_i} + m))) + Σ_{j≠y_i} exp(s cos θ_j) ) )    (5)

where m represents the margin value. The m in the above formula is usually set empirically as a fixed hyperparameter. Using a fixed m carries an implicit assumption that all classes are evenly distributed in the feature space, so a constant margin squeezes the feature distribution within each class indiscriminately. However, the distribution of each class in public face datasets is highly uneven. To solve this problem, our goal is to let the training process automatically learn margins that fit the class distributions. We transform Equation (5) into the adaptive margin loss L_AM:

L_AM = −(1/M) Σ_{i=1}^{M} log( exp(s(cos(θ_{y_i} + m_{y_i}))) / ( exp(s(cos(θ_{y_i} + m_{y_i}))) + Σ_{j≠y_i} exp(s cos θ_j) ) )    (6)

where m_{y_i} is the margin of class y_i, defined as a trainable variable that changes adaptively during training. The gradient of Equation (6) with respect to m_{y_i} is given in Equation (7), where 𝟙(condition) = 1 if the condition is satisfied and 0 otherwise. The value of m affects the size of the intra-class feature space: the larger m, the more compact the intra-class distribution. Thus, we use the following dynamic margin loss Ldm to regulate the margin values during training:

L_dm = sigmoid( −(1/M) Σ_{i=1}^{M} m_{y_i} )    (8)

The above formula averages the margins of the classes in a batch, where m_{y_i} is the margin corresponding to class y_i and M is the number of samples in a batch. The adaptive margin value is clipped from below at 0.5 because, without this minimum, m can become negative near the beginning of training, which would harm training; we also limit its maximum value to ensure training stability. The sigmoid function limits the gradient of Ldm, which makes the margin change more smoothly, so that we can use the margin to dynamically adjust the training weights of our diversity regularization.
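A minimal sketch of the adaptive margin loss L_AM of Equation (6) for a single sample follows. The per-class margins are passed in as a plain list standing in for the trainable variables; the function name and this scalar formulation are our own illustration, not the paper's implementation.

```python
import math

def adaptive_margin_loss(cos_thetas, target, margins, s=64.0):
    """L_AM for one sample: ArcFace-like, but with a per-class margin
    margins[target] in place of a single global m."""
    logits = list(cos_thetas)
    # Add the class-specific margin to the target angle, then map back.
    theta = math.acos(max(-1.0, min(1.0, logits[target])))
    logits[target] = math.cos(theta + margins[target])
    exps = [math.exp(s * z) for z in logits]
    return -math.log(exps[target] / sum(exps))
```

A larger class margin shrinks the target logit and therefore raises the loss for the same feature, which is what drives the feature distribution of that class to contract.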
Equation (8) ensures that the margins are as large as possible, provided that the training data can still be fitted. Without this term, the model tends to drive the margins to 0 during training. The gradient of Equation (8) with respect to m_{y_i} is shown in Equation (9).

Diversity regularization
As mentioned in Section 3.1, in our method the separability of different classes can be represented by the feature-distribution adaptive margin. Different regularization weights are therefore set according to the margin values:

k_i = 2.0 − sigmoid( γ (m_{y_i} − 0.5) )    (10)

where m_{y_i} is the margin of class y_i and γ is a scale factor that enlarges the differences between the k_i; γ = 10 is used in this paper. The formula measures the margin's growth during training as the difference between m_{y_i} and its initial value of 0.5, re-scales it by γ, applies the sigmoid function to limit the value to (0, 1), and subtracts it from 2.0 to obtain the regularization weight of each class. The weight k_i obtained by Equation (10) changes with m_{y_i}: the larger m_{y_i}, the smaller the weight k_i, and vice versa, as shown in Figure 5 (which plots the relationship between the regularization weights and the margin).
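The weight computation described above (re-scale the margin's growth beyond its initial value of 0.5, squash with a sigmoid, subtract from 2.0) can be sketched as follows; the function and argument names are ours.

```python
import math

def reg_weight(m_i, gamma=10.0, m0=0.5):
    """Dynamic diversity-regularization weight:
    k_i = 2.0 - sigmoid(gamma * (m_i - m0)).
    Larger margin (better-separated class) -> smaller weight."""
    return 2.0 - 1.0 / (1.0 + math.exp(-gamma * (m_i - m0)))
```

The weight stays in (1, 2), equals exactly 1.5 at the initial margin 0.5, and shrinks toward 1 as the margin grows, so poorly separated classes (small margins) receive the largest regularization pressure.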
As described in Section 3.2, w_j can be considered as the cluster centre of all samples labelled y = j. To increase the distance between the feature distributions of different classes, we introduce a diversity regularization that enlarges the distance between feature cluster centres, weighted by the k_i derived from the feature-distribution adaptive margin. The smaller the margin, the larger k_i, which makes the regularization focus on expanding the distances between the cluster centres of the less separable classes. Based on Equation (10), the regularization term Lrw is defined as:

L_rw = (1/n) Σ_{i=1}^{n} (k_i)_s max_{j≠i}( w_i^T w_j )    (11)

where n is the number of classes, w_i and w_j are the cluster centres of classes i and j, and (k_i)_s is the diversity regularization weight of class i. Here (·)_s stands for stop-gradient: the gradient of Lrw with respect to k_i is not propagated backward (implemented in TensorFlow with tf.stop_gradient(k_i)). Because k_i is the diversity regularization weight set by the adaptive margin, in Equation (11) we only need the adaptive margin to reflect the class's classification performance; we do not update the margin through Equation (11), hence the stop-gradient. The max() in Equation (11) penalizes, for each class, only the similarity between its weight vector and the most similar weight vector of any other class, which reduces the computational cost of backpropagation. The gradient of Equation (11) with respect to w_i is given in Equation (12). By forcing the w_j apart during training, Equation (11) makes them more diverse and reduces their redundancy, as shown in Figure 4(b).
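A pure-Python sketch of this regularization term: for each (l2-normalized) cluster centre, only the cosine to its most similar other centre is penalized, weighted by a k_i that is treated as a constant (the paper's stop-gradient; in TensorFlow one would wrap k_i in tf.stop_gradient). All names here are ours.

```python
def diversity_reg(weights, k):
    """Diversity regularization over l2-normalized class centres `weights`
    (list of vectors) with per-class weights `k` (treated as constants):
    mean over classes of k[i] * max_{j != i} cos(w_i, w_j)."""
    def cos(u, v):
        return sum(a * b for a, b in zip(u, v))
    n = len(weights)
    total = 0.0
    for i in range(n):
        # Penalize only the most similar other centre for class i.
        nearest = max(cos(weights[i], weights[j]) for j in range(n) if j != i)
        total += k[i] * nearest
    return total / n
```

Orthogonal centres incur no penalty, while near-parallel centres incur the maximum penalty, so minimizing this term pushes the cluster centres apart.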

Final loss design
Combining the feature-distribution adaptive margin with the adaptive regularization, the overall loss function (AMR-Loss) is:

L_AMR = L_AM + λ(L_dm + L_rw)

where λ is the balance factor between the two parts. Because Ldm and Lrw reinforce each other (the farther the clustering centres deviate from each other, the larger the margins become), our experiments show that it is better to use the same balance factor λ for both Ldm and Lrw.
In Algorithm 1, we summarize the learning detail in the CNNs with AMR-Loss.

ALGORITHM 1 The Adaptive Margin and Regularization Loss

Input: Training data {x_i}; initialized parameters θ_c of the convolutional layers; parameters W and m_{y_i} of the output layer; hyperparameter λ and learning rate u^t; iteration counter t ← 0.

Output: The parameters θ_c.

3: Compute the joint loss L^t_AMR = L^t_AM + λ(L^t_dm + L^t_rw).

7: Update the parameters θ_c by θ_c^{t+1} ← θ_c^t − u^t ∂L^t_AMR / ∂θ_c^t.

EXPERIMENT AND ANALYSIS
To evaluate AMR-Loss, we conducted experiments on CIFAR-10 and CASIA-WebFace. First, we report the results of AMR-Loss visualization on the CIFAR-10 dataset to validate its effectiveness. Second, we compare the performance of AMR-Loss against existing methods on the four face test sets (LFW, YTF, CFP-FP and AgeDB-30). Finally, we provide visualizations and analysis to illustrate how AMR-Loss achieves its effectiveness.

Experimental settings
All experiments in this paper are implemented in Tensorflow.
We set the batch size to 32 and train the model on an NVIDIA GeForce GTX 1080Ti (11GB) GPU. The learning rate starts from 0.1 and is divided by 10 every 200,000 iterations, for a total of 1.5 million training iterations. We use the Momentum optimizer with momentum 0.9 and weight decay 5e-4.
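The step-decay schedule described above (divide the learning rate by 10 every 200k iterations over a 1.5M-iteration run) can be written as a small helper; the function name is ours.

```python
def lr_at(step, base=0.1, drop_every=200_000):
    """Step-decay learning rate: divide `base` by 10 every `drop_every`
    iterations (training runs for 1.5M iterations in total)."""
    return base / (10 ** (step // drop_every))
```

This gives 0.1 for the first 200k steps, 0.01 for the next 200k, and so on down the run.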

Network structure
This paper uses the ResNet20 architecture from [2,10]; the network structure is shown in Figure 6. The network accepts 112 × 112 RGB images as input; four residual blocks contain a total of 20 convolutional layers, and the output feature map has shape 7 × 7 × 512. The 7 × 7 × 512 feature map is then mapped to a 512-D vector by the first fully connected layer (FC1), and the 512-D vector and the parameters of the last fully connected layer (the output layer) are l2-normalized. After normalization, x and w are dot-multiplied to obtain the angle between the current feature and the target weight, to which our adaptive margin is added. Ldm is used to train the feature-distribution adaptive margin. w is passed to Lrw before normalization; Lrw uses the margin to weight our diversity regularization so that the w_i of different classes are pushed further away from each other.
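The head of the network after FC1 can be sketched as follows: l2-normalize the feature and each class weight vector, then take dot products to obtain cos θ_j for every class (the adaptive margin of Section 4.1 would then be added to the target-class cosine). This is a toy, framework-free sketch with hypothetical names, not the paper's TensorFlow code.

```python
import math

def normalized_cosines(x, W):
    """Return cos(theta_j) for each class: l2-normalize the feature x and
    every class weight vector in W (list of vectors), then dot-multiply."""
    def l2(v):
        n = math.sqrt(sum(a * a for a in v))
        return [a / n for a in v]
    xn = l2(x)
    return [sum(a * b for a, b in zip(xn, l2(w))) for w in W]
```

Because both operands are unit vectors, each output lies in [−1, 1] and equals the cosine of the angle between the feature and that class centre.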

Dataset
As shown in Table 2, we train our CNNs model with CASIA-WebFace, and some samples of this dataset can be found in Figure 7.
We use the standard test protocol on each dataset. We use the LFW and YTF sets. The LFW dataset includes 13,233 face images of 5749 different identities. The YTF dataset includes 3424 videos of 1595 different individuals, an average of 2.15 videos per person; clip durations vary from 48 to 6070 frames, with an average length of 181.3 frames. Both LFW and YTF contain faces with large variations in pose, expression and illumination. Considering that LFW has been well solved, we further evaluate our method on the more challenging LFW BLUFR protocol [30].
We also use CFP-FP, which has large pose variations, and AgeDB-30, which has large age variations. The CFP-FP dataset includes 500 subjects, each with 4 profile and 10 frontal faces. AgeDB-30 is an in-the-wild face dataset with great changes in pose, expression, occlusion and age; it contains 12,240 face images of 440 different subjects, most of them celebrities such as actors, writers and scientists, and each image is annotated with identity, age and gender.

Parameter analysis
The only hyperparameter in AMR-Loss is the balance factor λ. Given a large λ, Ldm and Lrw play the leading role in training; if λ is small, the L_AM loss dominates the learning process. We train with different values of λ and report the accuracy on LFW. When λ = 0, only L_AM is active. As shown in Figure 8, as λ increases, the performance on LFW rises rapidly and peaks at λ = 8, so we use λ = 8 in the following experiments. To intuitively understand how the margins change when training with CASIA-WebFace, we randomly selected five identities and tracked their margin changes during training, as shown in Figure 9.

Test on CIFAR-10 dataset
To intuitively show AMR-Loss's effect, we designed an experiment (network structure shown in Table 3) to demonstrate the feature distributions trained by different loss functions. In our CIFAR-10 experiment, we randomly selected 20 samples per class from the test set, extracted features with networks trained by the different loss functions, and used T-SNE to reduce the features from 512-D to 2-D for visualization.
As shown in Figure 1(a), the features obtained from the model trained by softmax have certain separability, but there is no obvious boundary between the features of some classes (such as bird, cat, dog). By adding a margin, each class's intraclass features are more compact (Figure 1(b)). We propose an adaptive margin to improve intra-class compactness and an adaptive regularization to improve inter-class separability. As shown in Figure 1(c), AMR-Loss gets the features that are more  compact within a class and have obvious boundaries between classes.
Then we also visualize the weight (w j ) of the network output layer. As shown in Figure 4(b), it is evident that the feature clustering centres of each class obtained by our method are more scattered.
Finally, we test accuracy on CIFAR-10. As shown in Table 4, L_AM + L_rw (i.e. L_AMR) performs significantly better than softmax and its variants, which proves the effectiveness of AMR-Loss in the image classification task.

Tests on LFW and YTF
We use the network architecture shown in Figure 6 to train softmax, its variants and our proposed AMR-Loss. The original face image and its horizontally flipped version are fed into the model, and the corresponding outputs are concatenated as the final feature. We evaluate performance on the LFW and YTF datasets under the standard test protocol. As shown in Table 5, our proposed AMR-Loss significantly improves the performance of the original softmax loss, and the L_AM + L_rw combination (L_AMR) is clearly better than L_AM alone. AMR-Loss also outperforms other softmax variants (such as SphereFace and ArcFace). This shows that under the joint supervision of the adaptive margin and the adaptive regularization, the model learns more discriminative features, proving the effectiveness of AMR-Loss. Considering that LFW has been well solved, we further evaluate our method on the more challenging LFW BLUFR protocol, which focuses on low FARs. We report the results in Table 6; our method is superior to all current state-of-the-art methods.

Tests on CFP-FP and AgeDB-30

In Table 7, we show the validation accuracy of the softmax loss, ArcFace, L_AM and L_AM + L_rw (L_AMR) on the CFP-FP and AgeDB-30 datasets. The experiments show that our method achieves a significant improvement over existing methods such as ArcFace on these more difficult test sets, indicating that AMR-Loss performs better than existing methods across a variety of datasets.

Visualizations
Furthermore, to visualize AMR-Loss's effect, we designed a toy experiment to demonstrate the feature distributions trained by different loss functions. We select face images of six identities from CASIA-WebFace to train CNN models (Table 3) that output three-dimensional features. We normalize the obtained three-dimensional features and plot them on a sphere. The compared losses are softmax, ArcFace and the proposed AMR-Loss with different λ. As shown in Figure 10, the poor aggregation and separability of the softmax-loss classes lead to bad decision boundaries. ArcFace reduces intra-class variations but partly ignores the separability between classes; its clustering centres barely change. AMR-Loss both reduces intra-class variations and increases inter-class variations; furthermore, as λ increases, the features of each class almost cluster at a single point. After training with CASIA-WebFace, we also plot the distribution of the adaptive margins, as shown in Figure 11.

CONCLUSION
In this paper, we propose a new loss function, AMR-Loss, for face recognition, which consists of two parts. First, we introduce a feature-distribution adaptive margin for each class, which adaptively increases intra-class compactness. Second, we introduce a diversity regularization with adaptive weights, which focuses on the classes with low separability by adaptively adjusting the regularization weights and forcing the boundaries between classes apart during training. Compared with existing methods, our method both increases intra-class compactness and expands inter-class separability. As described in the experimental part, our method provides significant improvements in accuracy on multiple face datasets.