Sparse representation for face recognition: A review paper

With the increasing use of surveillance cameras, face recognition is being studied by many researchers for security purposes. Although high accuracy has been achieved for frontal faces, existing methods show poor performance on occluded and corrupted images. Recently, sparse representation based classification (SRC) has shown state-of-the-art results in face recognition on corrupted and occluded face images. Several researchers have developed extended SRC methods in the last decade. This paper mainly focuses on SRC and its extended methods for face recognition. SRC methods are compared on the basis of five challenges of face recognition, namely linear variation, non-linear variation, undersampled data, pose variation, and low resolution. A detailed analysis of SRC methods for these challenges is presented based on experimental results and execution time. Finally, the limitations of SRC methods are listed to help researchers extend existing methods and resolve the unsolved issues.


INTRODUCTION
For the last two decades, face recognition (FR) has played a very significant role in authentication and security. Recently, several homes and companies have adopted face recognition based biometric systems for security purposes. With the world under the threat of terrorism, the need of the hour is a good face recognition system that can recognize human faces in images and video. Researchers have been working in the area of face recognition and have achieved state-of-the-art results for frontal face images. Face recognition is mainly a two-step process: the first step is feature extraction and the second is classification. Extracting discriminative features from a face image is a difficult task due to occlusion, noise, disguise, variation, and the similar structure of faces. Several methods have been proposed to extract discriminative features from a face image, such as principal component analysis (PCA) [1], local binary patterns (LBP) [2], Fisher discriminant analysis (FDA) [3], independent component analysis (ICA) [4], locality preserving projections (LPP) [5], and curvelets [6]. The aforementioned feature extraction methods are known to support linear data but not non-linear data. Non-linear data can be made linearly separable by mapping the original space into a high-dimensional space with the help of the kernel trick or a kernel mapping function, as in kernel PCA (KPCA) [7], kernel FDA (KFDA) [7], and the support vector machine (SVM) [8]. After the feature extraction process, the extracted features are used for classification to recognize the label of a probe sample from the training dataset. Classification plays an important role in finding the correct class for the probe sample from the training dataset.
There are several existing classifiers used for face recognition, such as nearest neighbour (NN) [9], nearest feature line (NFL) [10], hidden Markov models (HMM) [11], nearest feature plane (NFP) [12], nearest feature subspace (NFS) [12], SVM [8], linear regression based classification (LRC) [13], and sparse representation based classification (SRC) [14]. There exist numerous challenges in face recognition, such as linear variation, pose variation, rotation, non-linear variation, undersampled data, scaling, and low resolution [15]. These issues degrade the performance of face recognition methods, and the above classifiers fail to solve all of them. Over the last decade, SRC has shown good performance over the existing methods on occluded face images [14,16,17]. Initially, sparse representation was used in signal processing for compression [18] and representation rather than inference and classification. The main purpose of sparse representation was to produce the most compressed signal with the help of a linear combination of atoms in an overcomplete dictionary [19]. Many researchers have shown that sparse representation is an extension of traditional transforms such as Fourier and wavelet, and it is widely used in the field of signal reconstruction. The SRC method has been used in many practical applications of computer vision, such as face recognition [14], image super-resolution [20], denoising [21], clustering [22], image segmentation [23], image deblurring [24], and image classification [25]. Wright et al. [14] first introduced the SRC method for face recognition in 2009, and since then SRC has been used in face recognition applications with various modifications. SRC has shown better performance and more discriminating power than existing classifiers. NN classifies a probe sample into the training class that contains the closest neighbouring sample; it uses a single sample from each class for classification.
The NN classifier degrades the performance of face recognition in the presence of outliers or noise. NFL uses two samples from the same class to express the probe sample; it enhances the expressive capacity of the classifier by passing a line through each pair of samples belonging to the same class. NFP enhances the representation capacity for the probe sample by passing a plane through three samples from the same class. NN, NFL, and NFP are not efficient when the classes have a large number of samples. In NFS, the probe sample is approximated with all samples from the same class, and the predicted class label is the one that produces the minimum reconstruction error. SVM is mostly used for linear classification; for non-linear classification a kernel method is needed. It is difficult to choose the SVM parameters and an appropriate kernel, and SVM also needs a feature extraction method to extract features from the image. SRC represents a probe sample by a linear combination of dictionary atoms, that is, the samples from all classes, and the probe sample is assigned to the class producing the minimum residual. SRC is insensitive to the choice of features of the input data, but it needs a large number of samples for each class. It has achieved state-of-the-art results in face recognition. Wright et al. applied SRC to noisy, occluded, and disguised images. In [14], the authors achieved better performance for face recognition and concluded that SRC tolerates small pose variation. SRC is a global representation that does not consider the locality of features during classification. Its performance depends on the sparse coefficients and the dictionary. Researchers have developed several extended methods of SRC by modifying the dictionary and the sparse coefficient vector. Initially, the dictionary was constructed directly from the training samples, but due to variation in the probe sample, it cannot always be well represented by such dictionary atoms.
There are many popular dictionary learning algorithms [26] that update the dictionary atoms with proper prototypes, such as k-singular value decomposition (KSVD) [27], label consistent KSVD (LC-KSVD) [28], Fisher discrimination dictionary learning (FDDL) [29], D-KSVD, support vector guided dictionary learning (SVGDL) [30], and DLSPC-G. The linear system used in sparse representation can be solved by convex relaxation methods [31] or greedy methods [32].
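As an illustration of the greedy family, the sketch below gives a minimal orthogonal matching pursuit (OMP) routine. It is a generic textbook version, not the specific algorithm of [32], and all function and variable names are our own.

```python
import numpy as np

def omp(A, y, k):
    """Greedy sparse coding: pick the atom most correlated with the current
    residual, refit all selected coefficients by least squares, repeat k times."""
    residual = y.astype(float).copy()
    support = []
    coeffs = np.zeros(0)
    for _ in range(k):
        j = int(np.argmax(np.abs(A.T @ residual)))   # best-matching atom
        if j not in support:
            support.append(j)
        coeffs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coeffs        # orthogonalized residual
    x = np.zeros(A.shape[1])
    x[support] = coeffs
    return x
```

Convex relaxation methods instead replace the ℓ₀ count with the ℓ₁-norm and solve the resulting convex program, as discussed in the next section.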
The main aim of this survey is to identify the strengths and limitations of SRC methods, which helps researchers extend the work of existing methods and resolve the unsolved issues.
The paper is organized as follows. Section 2 introduces the SRC framework. Section 3 discusses five different categories of SRC and its extended methods for face recognition. The different extended SRC (ESRC) methods are compared based on their face recognition performance in Section 4. Section 5 presents experimental results. Section 6 discusses the conclusion and future aspects.

SPARSE REPRESENTATION BASED CLASSIFICATION
SRC finds the sparsest representation of a probe sample over a dictionary constructed from all the training samples and assigns the probe sample to the class that produces the least residual error.
Let A = [A_1, A_2, …, A_k] ∈ ℜ^(m×n) be the set of training samples of all classes, called the dictionary. Each dictionary column is referred to as an atom or a basis; there are n atoms in total. k denotes the number of classes, and A_i = [a_{i,1}, a_{i,2}, …, a_{i,N}] represents the set of samples belonging to the ith class, where N is the number of samples in a class.
x ∈ ℜ^n is a sparse vector, that is, a vector with very few non-zero elements.
In the ideal case, samples of the same class lie in a linear subspace, and the probe sample y can be approximated by the ith class as y = α_{i,1} a_{i,1} + α_{i,2} a_{i,2} + ⋯ + α_{i,N} a_{i,N}. A probe sample y can be linearly represented in terms of all training samples using Equation (1) [14]:

y = Ax, y ∈ ℜ^m. (1)
The sparsest solution is obtained by ℓ₀-minimization, Equation (2):

x̂₀ = arg min ‖x‖₀ subject to Ax = y, (2)

where ‖·‖₀ refers to the ℓ₀-norm, which counts the non-zero entries in a vector. Solving the linear Equation (2) by ℓ₀-minimization is an NP-hard problem for which even an approximation is difficult to find. If the solution x̂₀ is sparse enough, then it can be recovered by ℓ₁-minimization in polynomial time [35]. Equation (2) can therefore be relaxed with the help of ℓ₁-minimization or convex relaxation methods [31], as shown in Equation (3):

x̂₁ = arg min ‖x‖₁ subject to Ax = y. (3)
where ‖·‖₁ refers to the ℓ₁-norm, the sum of the absolute values of the entries of x. When images are captured in an unconstrained environment, they may be affected by noise, and hence Equation (3) is not enough to find a proper approximation for noisy images. The linear Equation (3) can then be extended for noisy or corrupted images with the help of an error or noise vector z, as shown in Equation (4):

y = Ax + z,  x̂₁ = arg min ‖x‖₁ subject to ‖y − Ax‖₂ ≤ ε. (4)

ALGORITHM 1 Sparse representation based classification [14]
Step 1: Input a matrix of training samples (each column is the feature vector of a sample image) A = [A_1, A_2, …, A_k] ∈ ℜ^(m×n) for k classes, and a probe sample y ∈ ℜ^m.
Step 2: Normalize the columns of A to unit ℓ₂-norm.
Step 3: Solve the ℓ₁-minimization problem of Equation (3) (or Equation (4) in the noisy case) to obtain x̂₁.
Step 4: Compute the class residuals r_i(y) of Equation (6) for i = 1, …, k.
Step 5: Output identity(y) = arg min_i r_i(y).
When the probe sample is partially occluded or corrupted, the linear model defined in Equation (2) can be extended by adding an occlusion dictionary I, as shown in Equation (5):

y = Ax + e = [A, I] [x; e], (5)

where e is the error vector and I is the occlusion matrix. The basic idea of SRC is to search for representative atoms in the dictionary, constructed from all the training samples, that can sparsely represent a probe sample y; its sparse coefficients are obtained by solving the convex problem of the linear system using Equations (3) and (5). A probe sample is assigned to the class that minimizes the residual error. SRC assumes that the samples from the same class lie on the same subspace. Hence, the probe sample y is classified based on the minimal residual, or the class that has more non-zero entries in the sparse coefficient vector x ∈ ℜ^n; the residual is computed by Equation (6):

r_i(y) = ‖y − A δ_i(x̂₁)‖₂, (6)
where r_i(y) is the residual of the probe sample y with respect to the ith class, and δ_i(x̂₁) selects the coefficients of x̂₁ associated with the ith class. Non-face images can be validated or rejected separately using SRC: if the non-zero coefficients of the sparse vector x̂₁ are spread across all classes, then the probe sample is neither from the gallery set nor a face image at all, and it can be classified separately as a non-face image. SRC performs well only under specific conditions:
• All samples from a single class lie on the same subspace.
• The subspace of one class should not intersect those of other classes.
• Each class has a sufficient number of samples.
• Training samples are not corrupted.
• Training samples contain sufficient variation, including pose variation.
Several ESRC methods have been proposed to overcome the above-mentioned SRC issues.
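Under the conditions above, Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative sketch, not the reference implementation of [14]: the constrained ℓ₁ problem of Equation (3) is replaced by an unconstrained lasso solved with plain ISTA, and the function names and the regularization value lam are our own choices.

```python
import numpy as np

def ista_lasso(A, y, lam=0.01, iters=500):
    """Solve min_x 0.5*||y - Ax||_2^2 + lam*||x||_1 via ISTA (proximal gradient),
    an unconstrained stand-in for the l1-minimization of Equation (3)."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = A.T @ (A @ x - y)              # gradient of the smooth term
        z = x - g / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return x

def src_classify(A, labels, y, lam=0.01):
    """SRC: sparse-code y over the column-normalized dictionary A, then
    assign the class with the smallest reconstruction residual r_i(y)."""
    A = A / np.linalg.norm(A, axis=0, keepdims=True)   # Step 2: normalize atoms
    x = ista_lasso(A, y, lam)
    classes = np.unique(labels)
    residuals = []
    for c in classes:
        xc = np.where(labels == c, x, 0.0)             # delta_i(x): keep class-i coeffs
        residuals.append(np.linalg.norm(y - A @ xc))
    return classes[int(np.argmin(residuals))]
```

For occluded probes, one would code y over the augmented dictionary [A, I] of Equation (5) before computing the residuals.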

BROAD CATEGORIES OF SRC BASED FACE RECOGNITION ALGORITHMS
When an image is captured in an unconstrained environment, it may exhibit linear variation, non-linear variation, pose variation, and low-resolution issues. Other face recognition issues arise from too few samples per class, a lack of variation in the training samples, and corrupted or noisy training samples. In this survey, SRC and its extended methods are broadly classified according to the different problems of face recognition as follows:

Linear variation for face images
Illumination, occlusion, and corruption degrade the performance of face recognition. SRC [14] has overcome the issues of linear variation and performed better than existing holistic methods like PCA, LDA, and ICA. As SRC is robust to occlusion [14], it uses an occlusion dictionary to handle occlusion in face recognition, but the occlusion matrix is very large, and a big dictionary increases the time complexity. Yang [36] proposed the Gabor feature based SRC (GSRC) method, in which a Gabor feature based occlusion dictionary is used instead of the raw occlusion dictionary in order to reduce its size. Gabor features are invariant to constrained illumination, rotation, expression, and scaling; these features help GSRC improve FR performance. GSRC represents a test sample by modifying Equation (5) into Equation (7):

y_G = G(A) x + G(I) e = [G(A), G(I)] [x; e], (7)
where G(A) is the dictionary containing Gabor features and G(I) is the Gabor feature based occlusion dictionary. The GSRC method uses the Gabor features of the training samples as dictionary atoms, whereas the SRC method uses the training samples themselves. SRC requires the assumption that the training samples of each class follow linearity (are globally linear) or lie on the same subspace. Sometimes the subspace of one class intersects the subspaces of other classes. In such situations, the probe sample can be represented by dictionary atoms of other classes, even though the training samples of the correct class lie on the same subspace. SRC therefore requires the further assumption that the subspaces of different classes do not intersect; that is, SRC does not support globally non-linear data, and if the data is globally non-linear, SRC can misclassify the probe sample. Hui et al. [35] proposed the sparse neighbour representation based classification (SNRC) method, which finds the first k basis samples from each class that are the nearest neighbours of the probe sample y, whereas SRC uses a basis drawn from all samples of all classes. The k nearest neighbour samples of each class form a local dictionary, and SNRC assigns the class that best approximates the probe sample y, Equation (8):

x̂ = arg min ‖x‖₁ subject to y = Ñx, Ñ = [Ñ_1, Ñ_2, …, Ñ_k], (8)
where Ñ_i is the local dictionary that contains the first k nearest samples of y in class i. In SNRC, the authors first use the KNN method to build the local dictionary, but this is not appropriate for complex or corrupted face images, since it cannot reliably select the correct nearest neighbour samples for the probe sample y from each training class. SRC is then applied to these possibly inappropriate atoms to approximate the sparse coefficients of the probe sample y; as a result, some correct training samples may not be part of the local dictionary. An inappropriate local dictionary will not harness sparsity, and it can therefore misclassify the probe sample. The modular approach in SRC [14] divides the face image into different parts or modules and uses majority voting for classification. In this approach, occluded and clean modules take part equally in the final classification decision, but with equal weighting the clean modules lose the correlation between modules that matters most for the final classification. To overcome these issues, Lai et al. [38] proposed the weighted global sparse representation (WGSR) method, which considers only clean modules for classification and utilizes the correlation information of clean modules in the recognition process. The reliability of a module is obtained from its sparsity, and the weight of the module is determined by its residual. The module weights reconstruct a global feature vector that drives the final classification. This method only works for frontal images, and its performance degrades when images have pose or non-linear variation. SRC fails to perform well when the face image data distribution is non-linear. Non-separable data can be converted into linearly separable data using the kernel trick or a non-linear mapping function, which transfers data from the original feature space ℜ^m into a high-dimensional space H: ℜ^m → H, A → Φ(A).
Yin et al. [39] proposed kernel based SRC (KSRC). In this method, a mapping function converts the original feature space into a high-dimensional feature space, and SRC is then applied in that space. If the input images are noiseless, the computational costs of SRC and KSRC are the same. If the dimension of the samples is larger than the number of samples in each class, then KSRC works better than SRC; otherwise, KSRC is less efficient than SRC. Let Φ(A) = [Φ(A_1), Φ(A_2), …, Φ(A_k)] represent the dictionary of high-dimensional training samples. To approximate the sparse coefficients, KSRC uses ℓ₁-minimization to solve the convex problem of Equation (9):

x̂ = arg min ‖x‖₁ subject to Φ(y) = Φ(A) x. (9)
KSRC is an improvement over SRC, but the selection of a kernel function for different images is a difficult task and remains an open challenge for researchers. The KSRC method increases the dimension of the data, so the time complexity increases, and it fails to resolve the issues of SRC. SRC decides the class label of the probe sample y based on the minimum class residual. Li et al. [16] suggested that the residual based decision rule is not efficient for sparsity, and Jiang proposed a new decision rule for SRC based on the sum of the coefficients. The sparse coefficients measure the similarity between a probe sample and the training samples, so the maximum sum of coefficients (SoC) decides the class label of the probe sample by Equation (10):

identity(y) = arg max_i 1_i^T δ_i(x̂₁), (10)
where 1_i is an all-ones (identity) vector and i indicates the class label. Here, the sparse coefficient value is directly proportional to the similarity. SoC based SRC performs better than residual based SRC on ideal images. However, if the probe sample is partially corrupted, occluded, or affected by illumination, the SoC based SRC method can produce smaller coefficients for the correct class or larger coefficients for an incorrect class, and hence less similarity with samples of the correct class. In low-dimensional or downsampled images, the subspaces carry very little discriminative information, so the linearity structure alone is not enough for SRC. When linearity does not support classification, locality can do a better job. Lu et al. [40] proposed the weighted sparse representation based classification (WSRC) method, which enforces a locality constraint on the sparsity regularized reconstruction problem. In the downsampled scenario, SRC can generate different sparse coefficient vectors even for similar probe samples using ℓ₁-minimization, so the sparse coefficients may classify the probe sample using training samples of the incorrect class, leading to unreliable classification results. WSRC unites both data locality and sparsity structure in Equation (11) to overcome the downsampled problem. During the sparse linear approximation, WSRC preserves the similarity between the probe sample and its neighbouring training samples. WSRC approximates the sparse vector by Equation (11) using ℓ₁-minimization:

x̂ = arg min ‖Wx‖₁ subject to y = Ax, (11)
where W is the diagonal locality adaptor that penalizes the distance between the probe sample y and the training samples, Equation (12):

W = diag(‖y − a_{1,1}‖₂^s, …, ‖y − a_{c,n_c}‖₂^s), (12)

where s is the locality adaptor parameter (if s = 0, WSRC works like SRC), c is the number of classes, and n_c is the number of samples per class. Samples of the same person with different variations produce different distances under Equation (12); such a distance measure cannot cope when the samples contain different variations. WKSRC [41] is a combination of KSRC [39] and WSRC [40]; by combining these methods, the authors tried to overcome the issues of locality and non-linearity together. KSRC handles non-linearity but does not carry locality in the feature data; similarly, WSRC supports locality but not non-linearity. Liu et al. [41] proposed WKSRC, which first transfers the data from the low-dimensional space to a high-dimensional space using a non-linear mapping function or the kernel trick, then applies a dimensionality reduction method such as KPCA or KFDA on the high-dimensional feature space to reduce computation and storage costs, and finally applies a weight matrix to sustain locality in the data. Illumination is a big challenge in face recognition, but illumination variation is handled by applying the MSR algorithm. WKSRC obtains an efficient coefficient vector by combining the kernel feature space and locality information, but it does not solve the underlying issues of SRC. According to Hu et al. [42], the dictionary has a significant role in SRC, but finding the relationship between class labels and dictionary atoms is challenging due to the variation between training and probe samples. In plain SRC, the dictionary model performs classification on the basis of the residual and sparse coefficients, which restricts face recognition performance. To overcome this issue, Hu et al. [42] introduced a novel dictionary building method in which the dictionary is constructed from a common basis and a discriminative basis.
PCA is applied to the training data matrix A, and p eigenvectors shared by all individuals of the training set are selected to form the molecular dictionary G, while the training data itself acts as the atom dictionary A, which is discriminative in nature. The extended dictionary is then obtained by merging the atom dictionary and the molecular dictionary, and it replaces the original dictionary in Equation (13):

x̂ = arg min ‖x‖₁ subject to y = [G, A] x. (13)

The non-zero elements of the sparse vector indicate the similarity between the probe sample and the dictionary atoms. Any image is most similar to itself but produces little similarity with the eigenvectors of all training samples; hence, there is no strong justification for merging the molecular dictionary with the atom dictionary.
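Returning to WSRC, its locality adaptor (Equation (12)) is straightforward to sketch. After the change of variable x = W⁻¹x̃, the weighted problem min ‖Wx‖₁ s.t. y = Ax becomes a plain ℓ₁ problem over the rescaled dictionary AW⁻¹, so any SRC solver can be reused. The names and the helper split below are our own, and the rescaling assumes all weights are non-zero.

```python
import numpy as np

def locality_adaptor(A, y, s=1.0):
    """Diagonal weights of Equation (12): w_j = ||y - a_j||_2 ** s.
    With s = 0 every weight is 1 and WSRC reduces to plain SRC."""
    dists = np.linalg.norm(A - y[:, None], axis=0)
    return np.diag(dists ** s)

def rescaled_dictionary(A, W):
    """Change of variable x = W^{-1} x~ turns min ||Wx||_1 s.t. y = Ax
    into a plain l1 problem over the rescaled dictionary A W^{-1}."""
    return A @ np.linalg.inv(W)
```

Nearby atoms get small weights and are therefore cheap to use in the representation, which is exactly the locality preference WSRC adds on top of sparsity.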
Lu et al. [43] proposed the FDLSR method, which replaces the ℓ₁-norm by the ℓ₂,₁-norm and thereby takes advantage of both norms: the ℓ₁-norm is discriminative in nature, while the ℓ₂-norm is used for systematic representation. The ℓ₂,₁-minimization is used to decrease the negative impact of outliers and to select class-specific samples across all the data points with some sparsity. This method shows stronger discriminant power than other representation models because the dictionary is learned by the Fisher discrimination criterion. The ℓ₂,₁-norm is robust to noise, but this method does not tackle unconstrained conditions. The ℓ₂,₁-norm can be generalized as shown in Equation (14):

‖A‖₂,₁ = Σ_i ‖a_i‖₂, (14)
where A is a matrix and a_i is its ith row vector. The WGSR [38] method assigns the same weight to all patches and ignores the different variations among patches, although such variations should change the patch weights. Ma et al. [44] introduced robust SRC in an adaptive weighted spatial pyramid structure (RSRC-ASP), which uses a sparsity-evaluation based self-adaptive weighting strategy for residual aggregation and thus provides different weights according to the variation of the patches. The algorithm needs properly aligned images, which is not always possible in practice. In RSRC-ASP, the final aggregated residual is estimated by Equation (15):

r_i(y) = Σ_k (1 − ‖x̂^k‖₁) ‖y^k − A^k x̂_i^k‖₂, (15)

where (1 − ‖x̂^k‖₁) is the self-adaptive weight corresponding to subpatch k, and x̂_i^k is the sub-sparse coefficient vector over the ith class dictionary atoms of patch k.
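Assuming the per-patch sparse codes and per-class patch residuals have already been computed by some SRC solver, the self-adaptive aggregation of Equation (15) can be sketched as follows; the array shapes and names are our assumptions, not from [44].

```python
import numpy as np

def aggregate_residuals(patch_codes, patch_residuals):
    """patch_codes: list of sub-sparse coefficient vectors x^k, one per patch.
    patch_residuals: array of shape (num_patches, num_classes) holding the
    per-class reconstruction error of each patch. A sparser patch code
    (smaller ||x^k||_1) is treated as cleaner and gets a larger weight."""
    weights = np.array([1.0 - np.linalg.norm(x, 1) for x in patch_codes])
    total = weights @ patch_residuals          # weighted sum over patches
    return int(np.argmin(total))               # class with least aggregated residual
```

Occluded patches tend to have dense, spread-out codes, so their weight 1 − ‖x̂^k‖₁ shrinks and they contribute little to the final decision.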
Nowadays, deep learning has shown tremendous performance in computer vision for extracting features automatically from images. Chang et al. [63] used a stacked convolutional autoencoder (SCAE) to extract features for constructing a dictionary and then applied SRC for classification. The deep learning method extracts complex features that other feature extraction methods cannot. This method is thus able to handle linear variation and illumination, but it does not solve the pose and undersampled issues of FR.
The resolution of an image plays a significant role in FR performance, and changing the resolution can benefit an algorithm. Jitendra proposed multiresolution sparse representation based classification (MRSRC) [67], which uses compressed wavelet features as dictionary atoms, but the method is not effective for non-linear variation.

Non-linear variation for face images
Expression, smile, and local deformation come under non-linear variation. SRC uses holistic features in the dictionary, which cannot handle expression and local deformation effectively. If the probe sample contains non-linear variation, then a single dictionary is not enough for classification: an extra dictionary containing non-linear variation must be added, or some mechanism must find a proper prototype for each class so that the dictionary can also represent non-linear variation. In GSRC [36], the Gabor feature is less sensitive to non-linear variation, hence the Gabor features of the training samples are used as dictionary atoms, and GSRC shows an improvement over SRC. Nevertheless, SRC needs uncorrupted training samples with rich variation to resolve non-linear variation. ESRC [17] overcomes non-linear variation by including an intraclass variation dictionary in Equation (4); this variation dictionary is constructed from the variation among samples of the same class. The intraclass dictionary contains variation information of face images such as illumination, occlusion, and expression, while an interclass dictionary contains natural images. If training samples with all variations are not available, the performance of ESRC is poor. Jiang et al. [45] proposed the sparse- and dense-hybrid representation supervised low-rank (SDR-SLR) method to alleviate the problem of limited variation and corrupted samples in the training classes. SDR-SLR represents a probe sample by the collaboration of a class-specific dictionary and an intraclass dictionary. ESRC, GSRC, and SDR-SLR need rich variation for good classification, but Gao et al. [46] proposed the semi-supervised sparse representation based classification (S³RC) method, which uses the non-linear variation of unlabelled and labelled samples to find a better prototype. S³RC uses a Gaussian mixture model (GMM) to find a single prototype for each class using labelled and unlabelled samples.
The GMM learns the prototype or centroid of each class using a semi-supervised expectation-maximization (EM) algorithm. When pixel values change due to small variations, it is difficult to capture the discriminative information through a threshold value. Yadav et al. proposed the extended interval type-II and kernel based sparse representation method (ExIntTy2KBSRM) [47], which combines fuzzy logic with kernel based sparse representation. The authors use an extended-type membership function to fuzzify the database using Equation (16), which helps measure the hidden information present in the pixels. After fuzzification of the training dataset, the method finds the k nearest atoms to the probe sample in the dictionary and then applies KSRC to estimate the sparse vector. This method handles illumination, occlusion, and expression by fuzzifying the pixel values, but it fails to solve the undersampled and pose problems. A trapezoidal form of the membership function is

μ(x; a, b, c, d) = max(min((x − a)/(b − a), 1, (d − x)/(d − c)), 0), (16)

where a and d locate the feet of the function curve, and b and c denote the shoulders.
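A minimal sketch of such a trapezoidal membership function, under our reading of Equation (16); the parameter names follow the feet/shoulders description above.

```python
import numpy as np

def trapezoid_mf(x, a, b, c, d):
    """Trapezoidal membership: rises from 0 at foot a to 1 at shoulder b,
    stays 1 until shoulder c, falls back to 0 at foot d.
    Used here to fuzzify pixel intensities."""
    rise = (x - a) / (b - a)
    fall = (d - x) / (d - c)
    return np.clip(np.minimum(rise, fall), 0.0, 1.0)
```

Two such functions with different feet and shoulders give the lower and upper bounds needed for the interval type-II membership discussed next.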
The authors use two different extended-type membership functions to generate the extended interval type-II membership function, as shown in Equation (17):

μ̃(x) = [μ_L(x), μ_U(x)], (17)

where the subscripts L and U represent the lower and upper membership functions, respectively. The quality of any algorithm depends on its accuracy and time complexity. The accuracy of FR depends on the image features, and the time complexity is based on the time required to execute the algorithm. The authors of [61] considered these issues and addressed them with a new method that uses Gabor features, which are invariant to illumination, expression, and rotation. Since fewer features can give high accuracy and faster execution, the authors used supervised locality preserving projections (SLPP) to reduce the feature dimension, and a fast matching pursuit (FMP) method to represent a probe sample, which is faster than the homotopy algorithm used in other SRC based methods. Combining all these advantages, they proposed the Gabor-SLPP-FMP (GSF) framework [61], which is more efficient than conventional SRC methods. This method does not solve the undersampled and pose issues of FR.
Existing methods have used global features and local features to resolve FR issues, but it is difficult to solve them with a single type of feature. The authors of [65] used both kinds of features together with a statistical property and a logarithmic operation. They proposed the logarithmic-weighted sum (LWS) feature descriptor and the fast extended sparse-weighted representation classifier (FESWRC) for representing a probe sample. The LWS feature descriptor combines the significance of local features (DOST), global features (Gabor), a statistical property of the image (covariance), and a logarithmic operation. FESWRC uses the primal augmented Lagrangian method (PALM) for fast execution. Finally, FDWLSR [65] is obtained by combining LWS and FESWRC. This method can solve linear and non-linear variation with the help of the aforementioned preprocessing, but it is unable to solve the undersampled, pose, and low-resolution issues. FESWRC can be expressed by Equation (18):

x̂ = arg min_x ‖y − [A, D_I] ([W_A; W_{D_I}] ⊗ x)‖₂² + λ‖x‖₁, (18)

where W_A and W_{D_I} are the weight coefficient vectors of the corresponding dictionaries A and D_I, respectively, and ⊗ represents the Hadamard product operator. The sparsity term ‖x‖₁ and the fidelity term ‖y − Ax‖₂ are the two important terms in Equation (3). The ℓ₂-norm is used as the fidelity term, and this term is strongly affected by large errors caused by heavy variation in the samples: when many pixels are corrupted, the linear Equation (3) may fail. Mi et al. [68] proposed a new method, robust supervised sparse representation (RSSR), that improves the fidelity term; a collaborative representation strategy further helps improve recognition accuracy. This method uses the Huber loss as the fidelity term in the linear equation.
RSSR uses a two-phase representation scheme to implement supervised sparse representation. The first phase, referred to as the coarse representation, uses all training samples to represent the probe, while in the second phase, referred to as the fine supervised sparse representation, only the training samples with a high contribution in the coarse representation are used, and the coefficients of the remaining training samples are set to zero.
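The effect of the Huber fidelity term is easy to see in isolation: it is quadratic for small residual entries and linear for large ones, so a few grossly corrupted pixels no longer dominate the fit as they do under the squared ℓ₂ term. A minimal sketch follows; the threshold name delta is ours, not from [68].

```python
import numpy as np

def huber_fidelity(residual, delta=1.0):
    """Sum of per-entry Huber losses: quadratic inside [-delta, delta],
    linear outside, so outlier pixels contribute far less than under l2."""
    r = np.abs(residual)
    quad = 0.5 * r ** 2
    lin = delta * (r - 0.5 * delta)
    return float(np.sum(np.where(r <= delta, quad, lin)))
```

For example, a single residual entry of 10 with delta = 1 contributes 50 to the squared ℓ₂ fidelity (0.5 · 10²) but only 9.5 to the Huber fidelity.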

Undersampled for face images

SRC has shown prominent results when a large number of samples is available in each class, but in the undersampled case SRC performs poorly. If the training dataset has few samples per class and those samples are noisy, the individual classes lack appropriate prototypes for classification; hence, the probe sample fails to sparsely select the correct class due to the noise in the training samples. To overcome this undersampled issue, Deng [17] proposed the ESRC method, which introduces an extra intraclass dictionary created by subtracting the prototype or natural image of a class from the samples of the same class. The linear representation of Equation (4) is modified by introducing the intraclass dictionary D_I, as shown in Equation (19):

y = A x + D_I β + z, (19)
where D_I is the intraclass variant matrix that represents the local expression, environmental illumination, and occlusion of the face image; these changes cannot be modelled by the small dense noise vector z of Equation (4). The ESRC method represents a probe sample using the two dictionaries A and D_I by Equation (20):

[x̂; β̂] = arg min ‖[x; β]‖₁ subject to ‖y − [A, D_I] [x; β]‖₂ ≤ ε. (20)
In ESRC, the first dictionary takes part in the classification of the probe sample, while the second dictionary is used to generate a more stable sparse vector. The extra variant dictionary is constructed from the variation among samples of the same class; however, when within-class variation is small, the probe sample may be assigned to another class whose samples have more variation. The second dictionary of the ESRC method does not take part in the classification. Each class has a different amount of variation, that is, some classes have little variation in their samples, and the first dictionary contains that same variation; hence ESRC loses efficiency. This problem of limited variation was solved by the same authors with a new method, the superposed SRC (SSRC) method [48]. In this method, the first part of the dictionary atoms of each class is replaced by a single prototype or centroid of the class. However, this does not provide a reliable classification: the single prototype or centroid of each class is used to represent and determine the label of the probe sample, and a single prototype can lead to misclassification in the presence of variation. Zhang et al. [49] proposed the OSPP-SR method, an enhancement of ESRC that improves the intravariant dictionary. The OSPP-SR method requires fewer atoms in the intravariant dictionary, resulting in reduced computational cost. Iliadis et al. [50] combined the sparsity approach of ESRC with least squares and proposed a new method, sparse representation and regularized least squares (SR + RLS), which first finds the sparse coefficient x using the ESRC model of Equation (20), then constructs a small dictionary from the non-zero dictionary atoms, and finally approximates the solution using regularized least squares, Equation (21).
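The second phase of SR + RLS can be sketched as follows (a simplified stand-in for the published method: atoms with non-negligible sparse coefficients from the first phase are re-fitted with ridge regression, and the probe is classified by per-class residual; the tolerance and regularizer are assumed values):

```python
import numpy as np

def rls_refine(A, labels, y, x_sparse, lam=0.01, tol=1e-6):
    """Keep only atoms with non-negligible sparse coefficients, re-fit them with
    regularized least squares, and classify by per-class residual."""
    keep = np.abs(x_sparse) > tol
    A_s, lab_s = A[:, keep], labels[keep]
    # ridge regression on the reduced dictionary: (A_s^T A_s + lam I) x = A_s^T y
    x = np.linalg.solve(A_s.T @ A_s + lam * np.eye(A_s.shape[1]), A_s.T @ y)
    best, best_r = None, np.inf
    for c in np.unique(lab_s):
        m = lab_s == c
        r = np.linalg.norm(y - A_s[:, m] @ x[m])
        if r < best_r:
            best, best_r = c, r
    return best

# Toy setup: the (hypothetical) phase-1 coefficients select atoms 0 and 3.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 6))
A /= np.linalg.norm(A, axis=0)
labels = np.array([0, 0, 0, 1, 1, 1])
y = A[:, 0]
x_sparse = np.array([0.9, 0.0, 0.0, 0.05, 0.0, 0.0])
```

Because most coefficients are suppressed to zero in phase 1, the least-squares re-fit works on a much smaller dictionary, which is why SR + RLS is fast.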
In this method, the authors used two classification mechanisms to find the correct class instead of using any learning mechanism for a proper prototype of the respective classes. The SR + RLS method improves performance over ESRC but fails to overcome the issues of ESRC. ESRC cannot solve the problem of corrupted training samples, or of individual classes containing samples with only a few variations. These issues were resolved by Jiang, who proposed the SDR-SLR [45] method, which represents the probe sample by a collaboration of a class-specific dictionary and a dense combination of a non-class-specific dictionary (intraclass variation dictionary). ESRC uses variation from the same class to represent a probe sample, whereas SDR-SLR takes the collaborative variation of different classes. ESRC represents the probe sample through the training samples using Equation (20). In SDR-SLR, the class-specific components are first extracted from natural images and then classification is performed by SRC, as shown in Equation (22).
where a is the class-specific component, represented by the dictionary A of class-specific components via the linear model a = A + e_a; b is the non-class-specific component, represented by the dictionary B via the linear model b = B + e_b; and s indicates the random sparse noise subspace or noise matrix. SDR-SLR represents the probe sample as shown in Equation (23); it uses two dictionaries and approximates their sparse coefficients using Equation (24). SSRC requires a larger number of samples to obtain a single centroid or prototype of the class. The centroid or prototype of a class is estimated by a linear operation such as averaging, which does not give discriminative information for classification. To overcome the issue of non-linear variation in the single labelled sample per person (SLSPP) setting, Gao et al. [46] proposed the S3RC method, which uses the ESRC framework with a modified gallery dictionary. S3RC uses two dictionaries, a gallery dictionary and a variation dictionary, estimated by Equation (26). The intravariant dictionary helps to rectify linear variation such as illumination and occlusion, while the gallery dictionary is used to reduce non-linear variation such as expression or a smile. The ESRC and SSRC methods do not learn a gallery dictionary to resolve the non-linear issue. S3RC uses a GMM to find a single prototype for each class from labelled and unlabelled samples; the GMM learns the prototype or centroid with a semi-supervised EM algorithm. The performance of FR depends on the result of GMM-EM, but the data needs to follow a normal distribution. S3RC represents a probe sample by Equation (27).
where P is the gallery dictionary and V is the variation dictionary. Image features have complex (non-linear) structure, which simple SRC methods cannot handle, but kernel methods can convert the non-linear structure into a linear one; this is what the KSRC method does in sparse representation. Fan et al. took advantage of this and proposed a method called kernel coordinate descent based on a virtual dictionary (KCDVD) [51], which constructs the virtual dictionary from approximately symmetrical face images (ASFI) using Equations (28) and (29). The virtual dictionary alleviates the undersampled problem, but KCDVD does not solve any issue of KSRC other than undersampling.
where the leading coefficient is the learning rate, t represents the iteration, I1 is the left half of the face image (as a column vector), and I2 is the right half. The main motive of the ASFI method is to reduce the error ‖I1 − I2‖2, which is done by a gradient-descent scheme. The KCDVD method generates virtual samples from the same image but is not able to add different variations to the newly created images. Thus this method can only solve the linear variation problem; it cannot solve the non-linear variation and pose variation issues of FR. The undersampled issue was also addressed in [62] by developing the random filter sparse representation (RFSR) method, which follows a two-step scheme. The first step creates new virtual random-filtered images from the original training samples using Equation (30) and builds a new training database by combining the newly created samples with the original ones. The second step performs CRC on the newly created training dataset. The authors also add noise to the newly created virtual images, but realistic variations are quite different from random-filtered images, so the method cannot solve the realistic variation problem.
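In a much simplified form, symmetry-based virtual atoms can be sketched by plain horizontal mirroring. This is an assumption standing in for the ASFI gradient-descent construction, and row-major vectorization of the images is assumed:

```python
import numpy as np

def virtual_mirror_samples(X, h, w):
    """Append horizontally mirrored virtual atoms to the dictionary.
    X: (h*w, n) vectorized images (row-major); returns (h*w, 2n)."""
    imgs = X.T.reshape(-1, h, w)                 # (n, h, w), one image per slice
    mirrored = imgs[:, :, ::-1].reshape(-1, h * w).T  # flip columns, re-vectorize
    return np.hstack([X, mirrored])

# Two tiny 2x2 "images" as columns: [[1,2],[3,4]] and [[5,6],[7,8]].
X = np.array([[1., 5.], [2., 6.], [3., 7.], [4., 8.]])
V = virtual_mirror_samples(X, 2, 2)
```

Doubling the dictionary in this way gives each class extra atoms without collecting new data, which is exactly how virtual-sample schemes attack the undersampled issue.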
where p and q are the X–Y coordinates of an image A, s is the scale of the filter L = t × t, and t is always an odd number that decides the size of the filter. When training samples are insufficient, many existing methods develop an intravariant dictionary to improve the performance of FR. A different approach, transfer learning based sparse representation and weighted fusion (TLSRWF), was proposed to tackle the undersampled issue; it uses a transfer-learning method and a weighted-fusion scheme for sparse representation. The residual is estimated by Equations (31) and (35).
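Returning to RFSR, its virtual-sample step can be sketched with a numpy-only filter (an assumption: a normalized random t × t kernel, t odd, applied per image with edge padding so that intensities stay in range):

```python
import numpy as np

def conv2_same(img, k):
    """'Same'-size 2-D correlation with edge padding (numpy-only helper)."""
    t = k.shape[0]
    p = t // 2
    padded = np.pad(img, p, mode="edge")
    out = np.empty(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = (padded[i:i + t, j:j + t] * k).sum()
    return out

def random_filter_samples(X, h, w, t=3, seed=0):
    """Sketch of RFSR's first step: append virtual atoms obtained by filtering
    each vectorized training image with a random t x t kernel."""
    rng = np.random.default_rng(seed)
    k = rng.random((t, t))
    k /= k.sum()                                  # normalize so intensities are preserved
    virtual = np.column_stack([conv2_same(X[:, j].reshape(h, w), k).ravel()
                               for j in range(X.shape[1])])
    return np.hstack([X, virtual])

# Three constant 4x4 images: filtering with a normalized kernel leaves them unchanged.
X = np.ones((16, 3))
aug = random_filter_samples(X, 4, 4)
```

As the text notes, such filtered copies enlarge the dictionary but do not mimic realistic variation, which is the method's main limitation.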
TLSRWF combines the weights from two different reconstruction errors and classifies the test sample by evaluating which class leads to the minimum weighted fusion: if k = arg min_c f_c, then the test sample is assigned to the kth class.
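The fusion rule can be sketched in a few lines (an assumption: the fused score f_c is taken as a convex combination of the two per-class reconstruction errors; the mixing weight alpha is hypothetical):

```python
import numpy as np

def weighted_fusion_classify(res_a, res_b, alpha=0.5):
    """Fuse two per-class residual vectors and return k = argmin_c f_c."""
    f = alpha * np.asarray(res_a, dtype=float) + (1 - alpha) * np.asarray(res_b, dtype=float)
    return int(np.argmin(f))

# Hypothetical per-class residuals from the two representations.
print(weighted_fusion_classify([0.9, 0.2, 0.8], [0.7, 0.1, 0.9]))  # prints 1
```

Class 1 has the smallest residual under both representations, so the fused score also selects it.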
SDR-SLR extracts class-specific information using singular value decomposition (SVD), but when the training images are corrupted by non-Gaussian noise it is hard for SDR-SLR to obtain an optimal low-rank matrix. The low-rank component contains important facial features and is beneficial for a better representation. Low-rank recovery is a key technique that can separate corrupted information from the training face images. Yang et al. [52] proposed a new sparse low-rank component based representation (SLCR) method based on low-rank recovery. It constructs the dictionary from a low-rank component and a non-low-rank component, which are approximated using low-rank matrix recovery as shown in Equation (37); hence this method is better than the other SRC methods for undersampled and low-quality images. The linear equation of SLCR is given by Equation (38) and the sparse coefficient is estimated by Equation (39).
where the two constants control the trade-off between the terms.
The residual is estimated by Equation (40), where I is an identity matrix, L is the low-rank dictionary of SLCR, and C_i is a class-label matrix of the training dataset D for class i; its element C_i(k, k) = 1 if the kth training image originates from class i, and all other elements of C_i are zero. The ESRC [17] method extracts variation from the samples of the same class, and the KSRC [39] method transfers low-dimensional data into a high-dimensional space. Fan et al. combined the concept of KSRC with ESRC and proposed the fast kernel ESRC (FKESRC) method, which takes advantage of both methods but does not alleviate the issue of non-linear variation when the samples of a class lack diverse variation. FKESRC achieves high computational efficiency by using the coordinate descent method in the feature space.
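The low-rank recovery idea underlying SLCR can be sketched with a toy alternating scheme that splits a data matrix into a low-rank part plus a sparse part. This is a simplified stand-in for the actual low-rank matrix recovery solver, and the thresholds tau and lam are assumed values:

```python
import numpy as np

def svt(M, tau):
    """Singular-value thresholding: the proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ (np.maximum(s - tau, 0.0)[:, None] * Vt)

def low_rank_sparse_split(D, tau=1.0, lam=0.1, n_iter=50):
    """Alternate between a low-rank update (nuclear-norm shrinkage) and a
    sparse update (entrywise soft threshold) to split D into L + E."""
    L = np.zeros(D.shape)
    E = np.zeros(D.shape)
    for _ in range(n_iter):
        L = svt(D - E, tau)                                 # low-rank component
        R = D - L
        E = np.sign(R) * np.maximum(np.abs(R) - lam, 0.0)   # sparse component
    return L, E

# Rank-1 'clean' data plus a few gross corruptions.
rng = np.random.default_rng(0)
u, v = rng.normal(size=(8, 1)), rng.normal(size=(1, 10))
D = u @ v
D[0, 0] += 6.0
D[3, 7] -= 6.0
L, E = low_rank_sparse_split(D)
```

After the final sparse update, every entry of D − L − E is clipped at the threshold lam, so the two recovered components reconstruct the data up to that tolerance while L absorbs the shared (low-rank) structure.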

Pose variation for face images
Pose variation is a big issue in face recognition: when the probe sample differs in pose from the training samples, face recognition performance degrades rapidly. Pose variation changes the positions of facial features, so these features can correlate strongly with samples of other classes having the same pose rather than with samples of the correct class. To overcome this issue, Feilong et al. [53] proposed the multi-weighted sparse representation (MW-SR) method. In the case of pose variation, a similar pose contributes more, and if the sample is from the same person then it contributes the most. In this method, all training samples are first separated into p groups on the basis of pose, A_i ∈ ℝ^(m × m_i), i = 1, 2, …, p, ordered from the left pose to the right pose. Given the dictionary A and a probe sample y, the sparse vector x is estimated by solving the linear Equation (3), and the residual is then computed using Equation (6). The minimum residual decides the closest pose group of the probe sample y. Once the pose of the probe sample is obtained, the weights of all poses with respect to the probe sample are calculated by Equation (41); the authors assume that the weights follow a Gaussian distribution.
where u is the sub-index of the closest pose determined by the SR method and s is the sampling step. Since there is a chance of selecting a sample of a different pose, weights favouring similar poses are introduced into SRC. The MW-SR model can be extended with ℓ1-minimization as shown in Equation (42).
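The Gaussian pose weighting can be sketched as follows (the bandwidth sigma is an assumed parameter; the published Equation (41) may normalize differently):

```python
import numpy as np

def pose_weights(p, u, s=1.0, sigma=1.0):
    """Gaussian weights over p pose groups, centred on the closest pose u;
    s is the sampling step between neighbouring pose groups."""
    idx = np.arange(p)
    w = np.exp(-((idx - u) * s) ** 2 / (2.0 * sigma ** 2))
    return w / w.sum()                 # normalize so the weights sum to 1

# Seven pose groups (left to right); the probe's closest pose is group 3.
w = pose_weights(7, u=3)
```

The weight is largest for the detected pose group and decays symmetrically for neighbouring poses, which is exactly the "similar pose contributes more" behaviour described above.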
In the MW-SR method, the weights build the relation between the different poses of the face. Even if the training dataset does not have a pose similar to the probe sample, this method is still able to classify the probe sample. However, MW-SR needs more training samples with different poses, and in real-world scenarios it may not be possible to obtain images of different poses for all training classes. In the case of undersampling and non-linear variation, the performance of MW-SR degrades drastically. When the training and testing samples are captured in different environments, the distributions of the images may differ owing to unconstrained variations such as pose, illumination, and expression; for better results this distribution mismatch needs to be reduced. The authors of [64] proposed the subspace extended sparse representation classifier and learning discriminative feature (SESRC-LDF) method to tackle this face recognition problem. It uses two classification strategies based on probe-sample symmetry: if the symmetry of a sample shows small variation it is recognized by SESRC, otherwise by LDF. Before solving the sparse linear objective function, the goal of subspace learning is to find an optimal transformation matrix that maps all testing and training samples into a common subspace, as shown in Equation (43). The transformation matrix is used to reduce the intraclass differences and enlarge the inter-class differences.
where A represents a texture-structure gallery dictionary and B is a texture-structure-invariant dictionary; the transformation matrix is used to learn the discriminative subspace.

Low resolution for face images
Low-resolution (LR) images have fewer discriminative features for image classification; hence LR images degrade the performance of face recognition. When the training dataset has high-resolution images and the probe sample is a low-resolution image, there is a mismatch of resolution and discriminative features. The low-resolution challenge can be approached in three ways: first, convert LR images into high-resolution (HR) images; second, convert the HR training images into LR images; and third, embed both LR and HR images into a common subspace. There are several techniques to tackle the low-resolution problem in face recognition, such as super-resolution (SR) and interpolation. Bilgazyev et al. [54] showed improvement in face recognition using the super-resolution technique. Yang et al. [20] first proposed super-resolution (SR) using sparse representation for the construction of HR images from LR images. In [55, 56], the authors used SR with sparse representation in face recognition and showed improved performance. SLCR constructs its dictionary from a low-rank component and a non-low-rank component, which are approximated using low-rank matrix recovery; hence this method is better for low-quality images.

COMPARISON OF DIFFERENT SRC METHODS
In this section, a comparison of SRC and all its extended methods is presented on five problems of face recognition.
• Linear variation: illumination effects, noise, and occlusion.
• Non-linear variation: local expression or deformation.
• Undersampled: a small number of samples in each class of the training dataset.
• Pose variation: should not be more than 20 degrees.
• Low resolution: very few pixels contain the whole image information.
All SRC methods have been compared on the basis of their significant advantages, disadvantages, and the learning mechanism used. All extended SRC methods are developed by modifying the sparse coefficients, modifying the dictionary, or fusing new methods into SRC. SoC SRC, WSRC, MW-SR, and WKSRC are developed by modifying the sparse coefficient. GSRC, ESRC, SSRC, SR-RLS, SDR-SLR, RFSR, molecular, KCDVD, CSR, S3RC, TLSRWF, FDWLSR, ExIntTy2KBSRM, SESRC-LDF, RSRC-ASP, SLCR, and MRSRC are developed by modifying the dictionary, and SNRC, KSRC, WKSRC, Gabor-SLPP-FMP (GSF), SRC-SCAE, and FKESRC are developed by fusing existing methods into SRC.
From Table 1 it is clear that all SRC methods are able to solve linear variations, but solving non-linear variation requires an extra intraclass variation dictionary or a learning mechanism, such as a GMM or an arithmetic operation, to find a proper centroid or prototype of the class.
The methods that do not use any extra dictionary to estimate the sparse coefficient are able to solve only the linear issue; these include SRC, WGSR, KSRC, SNRC, SoC SRC, WSRC, molecular, SRC-SCAE, WKSRC, and MRSRC. The methods that are able to solve both linear and non-linear issues, such as GSRC, ESRC, SSRC, SR-RLS, SDR-SLR, S3RC, SLCR, CSR, TLSRWF, Gabor-SLPP-FMP (GSF), RFSR, FDWLSR, RSRC-ASP, ExIntTy2KBSRM, and FKESRC, fail to overcome the issues of low resolution and pose variation (> 20°). The methods that create a new dictionary with the help of virtual samples or probe samples are able to solve the undersampled issue; these include ESRC, SSRC, SR-RLS, SDR-SLR, S3RC, KCDVD, RSRC-ASP, TLSRWF, RFSR, SLCR, and FKESRC. The pose variation issue is solved by the MW-SR and SESRC-LDF methods by keeping images of all pose variations among the training samples, but in real-life scenarios gathering all types of pose images is not possible. To date, no SRC method has been developed to overcome the issue of low resolution. SDR-SLR, SLRC, and S3RC are the best-performing methods among all SRC methods and solve the issues of linear variation, non-linear variation, and undersampling.
ESRC and SSRC use an extra intravariant dictionary and extract variation from other samples using an arithmetic operation; hence these methods are able to solve non-linear variation and undersampling.
The S3RC method finds a single prototype for each class using labelled and unlabelled samples. It uses a GMM whose parameters are learned by the EM method, and it is able to solve the linear variation, non-linear variation, and undersampled issues of FR. The SRC-SCAE method can handle linear variation and illumination thanks to deep learning features; however, deep learning techniques require a large number of samples with different variations, which is not available for all classes, hence this method solves only linear variation. The ExIntTy2KBSRM method combines fuzzy logic with KSRC: the membership function fuzzifies the face database and helps to measure the unseen information present in the pixels; this method solves linear and non-linear variations. The Gabor-SLPP-FMP (GSF) framework uses the Gabor feature, which is invariant to non-linear variation, like GSRC, but it is faster than GSRC; it solves linear and non-linear variations. FDWLSR uses global features (Gabor), local features, statistical properties, and a logarithmic operation; these features are invariant to the linear and non-linear issues of FR. The KCDVD method generates virtual samples from the same image and is not able to add different variations to the newly created images; hence it solves the linear and undersampled issues of FR. The RFSR method creates virtual random-filtered images and adds noise to the newly created images, but realistic variations are quite different from random-filtered images; this method also solves the linear and undersampled issues of FR. The SLCR method is based on low-rank recovery; the low-rank component contains important facial features and is beneficial for a better representation. The FKESRC method solves the undersampled and linear variation issues, but ESRC uses an arithmetic operation to extract variation from other samples of the same class, which is not effective for non-linear variation.
The RSSR method uses the Huber loss as a fidelity term and a two-phase representation scheme, which alleviates the non-linear issue of FR. Class-specific component features are better for all variations except pose variation. The MRSRC method uses compressed wavelet features as a dictionary, thus it solves only linear variation. The SESRC-LDF method uses two classification strategies on the basis of probe-sample symmetry: if the symmetry of a sample shows small variation it is recognized by SESRC, otherwise by LDF. Before solving the sparse linear objective function, subspace learning finds an optimal transformation matrix that maps all testing and training samples into a common subspace; the transformation matrix is used to reduce the intraclass differences and enlarge the inter-class differences.

EXPERIMENTAL RESULTS
In this section, different state-of-the-art SRC methods and their extensions for face recognition are evaluated. All SRC methods were tested on the face databases given in Table 2. AR and Extended Yale B (EYB) are the most used databases in experiments on SRC and its extended methods. Here, we have performed experiments on different methods using six databases: AR [50], EYB, Labeled Faces in the Wild (LFW) [51], CMU PIE [59], Georgia Tech (GT), and Olivetti Research Laboratory (ORL).

Experiment setup
In this subsection, the experimental details of the different SRC methods are provided. The regularization parameter was set to 0.001 for all methods, and the homotopy algorithm was used to represent the probe sample as a sparse vector. MATLAB 2016 was used for the experimentation on a machine with 8 GB RAM and a 3.6 GHz Intel i7 processor.

Experimental results on different issues
Here, we have setup experimentation on different issues of face recognition such as illumination, occlusion, expression, undersampled, and low resolution.

AR database
We have taken 100 individuals, each with 12 samples. We chose six neutral samples per individual as the training dataset and six illuminated samples per individual as the testing dataset. The experimental result is shown in Figure 7.

CMU PIE database
We have taken 30 individuals, each with 21 samples varying in illumination. We randomly selected 2, 3, 4, 5, 7, 9, and 11 samples for training and the rest of the samples for testing. The experimental results are shown in Figure 8.

EYB database
We have taken 30 individuals, each with 20 samples varying in illumination. We randomly selected 2, 5, 10, and 15 samples for training and the rest of the samples for testing. Both sets contain illumination-varied samples. The experimental result is shown in Figure 9.

AR database
We have taken 100 individuals, each with 12 samples. We chose six neutral samples per individual as the training dataset and six occluded samples per individual as the testing dataset. The experimental result is shown in Figure 10.
The low-rank approximation performs better for occlusion; thus SDR-SLR performs well in the case of occlusion.

AR database
We have taken 100 individuals, each with 14 samples. As the training dataset we chose (1) four neutral samples or (2) ten neutral-with-illumination samples per individual, and four samples with expression per individual as the testing dataset. The experimental result is shown in Figure 11.

AR database
We have taken 100 individuals, each with 14 samples of size 44 × 44. We randomly chose 2, 3, 4, 5, 6, 7, 8, and 9 samples per individual as the training dataset and the rest of the samples as the testing dataset. The experimental result varies with the number of training samples and is shown in Figure 12.

EYB database
We have taken 38 individuals, each with 16 samples of size 32 × 32. We randomly chose 2, 3, 4, 5, 6, 7, 8, and 9 samples per individual as the training dataset and the rest of the samples as the testing dataset. The experimental result varies with the number of training samples and is shown in Figure 13.

LFW database
We have taken 34 individuals, each with 9 samples of size 50 × 50. We randomly chose 2, 3, 4, 5, 6, 7, and 8 samples per individual as the training dataset and the rest of the samples as the testing dataset. The experimental result varies with the number of training samples and is shown in Table 3.
In the undersampled case, the low-rank component and the intravariant dictionary perform better; hence SDR-SLR and ESRC perform better.
FIGURE 13 Graph of the experimental result of the undersampled issue on the EYB database

GT database
We have taken 50 individuals, each with 15 samples. We randomly chose 2, 3, 4, 5, 6, 7, 8, and 9 samples per individual as the training dataset and used all the others, at size 63 × 42, for testing. The experimental result varies with the number of training samples and is shown in Figure 14.

ORL database
We have taken 40 individuals, each with 10 samples of size 44 × 36. We randomly chose 2, 3, 4, 5, 6, 7, and 8 samples per individual as the training dataset and the rest of the samples as the testing dataset.

LFW database
We have taken 35 individuals, each with 9 samples. We randomly chose five samples per individual for training and the rest of the samples for testing. The images were resized to dimensions ranging from 10 × 10 to 90 × 90. The experimental result is shown in Figure 16.

ORL database
We have taken 40 individuals, each with 10 samples. We randomly chose five samples per individual as the training dataset and used all the others for testing. The images were resized to dimensions ranging from 9 × 11 to 92 × 112. The experimental result varies with the resolution of the samples and is shown in Figure 17.

Analysis of execution time
SRC and its extended methods are expensive with respect to execution time. In this paper, a comparison of the execution times of different methods, namely SRC, GSRC, ESRC, SSRC, SR-RLS, SDR-SLR, and CSR, is provided. In SRC-based methods, the computational cost is directly proportional to the size of the dictionary. The execution times of SRC and its extended methods are given in Table 4. The SR-RLS method consumes the least execution time, followed in order by SRC, SSRC, ESRC, GSRC, SDR-SLR, and CSR. The SR-RLS method constructs a new dictionary on the basis of the sparse coefficients estimated by ESRC, which suppresses many training samples to zero; hence SR-RLS reduces the execution time. ESRC constructs an extra variant dictionary from the samples of the same class and uses two dictionaries to calculate the sparse vector; hence its execution time increases. In SSRC, in contrast, the first part of the dictionary atoms of each class is replaced by a single prototype, which reduces the size of the dictionary and improves the time efficiency of SSRC over ESRC. The GSRC method uses an occlusion dictionary, which increases time complexity. SDR-SLR is the slowest: it consumes time to decompose images into the class-specific and non-class-specific components, and needs a lot of optimization to extract the proper components. The CSR method also requires more time to execute. Table 4 shows the time consumption of SRC and its extended methods on five different databases: AR, EYB, LFW, ORL, and GT. For the experimentation, each dataset was divided randomly into training and testing image sets. All experimental setups and image sizes are the same as mentioned in the undersampled Section 5.3.4.

Analysis of experimental result
The experimental setup has been categorized on the basis of five issues: illumination, occlusion, expression, low resolution, and undersampled. This design helps to analyze the results better. The following observations were made:
1. When the training dataset contains neutral samples and the test dataset contains same-pose samples with illumination and expression variation, ESRC and SSRC perform better.
2. If the training dataset contains neutral samples and the test dataset contains same-pose samples with occlusion, SDR-SLR performs better, because the low-rank approximation is efficient for occlusion.
3. If both training and testing sets have similar variation, SDR-SLR performs better.
4. Low-resolution images reduce the performance of all SRC methods.
5. In the undersampled case, if the extra variance dictionary is not used, the performance of SRC methods decreases rapidly.
6. From Figures 14-18 it can be deduced that performance increases as the number of training samples per individual increases.
7. In the case of one sample per person (OSPP), SDR-SLR performs better on all the experimental databases, but on LFW it gives lower accuracy.
8. SR-RLS is the fastest of all the methods and SDR-SLR is the slowest.
9. The sparse coefficient value indicates the similarity between training and test samples; similar images have high sparse coefficient values, as shown in Figures 18 and 19. From Figures 18 and 19 we can conclude that the class with the highest coefficient value is not by itself enough to identify the correct class. SRC needs a larger number of samples that give high sparse coefficients and produce a minimum residual for a test sample. In some scenarios, SRC may misclassify when the correct class has more dissimilar variation and fewer similar samples.
Figure 18 shows the sparse coefficient values for the LFW dataset, where samples have more variation, and Figure 19 shows the sparse coefficient values for the AR dataset, where samples have less variation. By observing both figures, we conclude that training samples with less variation give a better representation than those with more variation.

CONCLUSION
In this paper, we have provided a review of SRC and its extended methods for face recognition, which are developed by modifying the sparse coefficient and the dictionary. The different variants of SRC have been classified on the basis of their capability to solve the issues of face recognition: linear variation, non-linear variation, undersampled, pose variation, and low resolution. A comparison of all SRC variants, together with their advantages and disadvantages with respect to the issues of face recognition, has been provided. It is observed that some methods are able to solve the linear variation, non-linear variation, and undersampled problems, and also that SRC methods have not fully explored the issues of low resolution and pose variation. We have also provided experimental results of various SRC methods and their extensions for different issues of face recognition, with different numbers of training and testing samples, on face databases such as AR, EYB, LFW, CMU PIE, GT, and ORL. The analysis of SRC and its extended methods has been discussed on the basis of their experimental results. It is observed that SDR-SLR, CSR, ESRC, and SSRC perform better in the non-linear variation and undersampled cases, and that SR + RLS performs better when samples contain less variation. As the number of labelled samples decreases, performance also decreases, but in the OSPP case the S3RC method performs better than all other undersampled methods.
In the future, SRC may be developed further to explore the issues of low resolution and pose variation in face recognition.