Face hallucination based on cluster consistent dictionary learning

Face hallucination is a super-resolution technique specially designed to reconstruct high-resolution faces from low-resolution faces. Most state-of-the-art algorithms leverage position-patch prior knowledge of human faces to better super-resolve face images. However, most of them assume the training face dataset is sufficiently large, well cropped or aligned. This paper proposes a novel example-based face hallucination method based on cluster consistent dictionary learning, under the assumption that human faces share similar facial structures. In this method, the paired face image patches are first labelled as face areas, including eyes, nose, mouth and other parts, or as non-face areas, without requiring the training face images to be cropped and aligned. Then, the training patches are clustered according to their labels and textures. A cluster consistent dictionary is learned to represent the low-resolution patches and the high-resolution patches. Finally, the high-resolution patches of the input low-resolution face image can be efficiently generated by using the adjusted anchored neighbourhood regression. By utilizing the labelled facial parts prior knowledge, the proposed method recovers more details in the reconstruction. Experimental results demonstrate that the authors' algorithm outperforms many state-of-the-art face hallucination techniques on different datasets.


INTRODUCTION
Face image related techniques have been well developed and investigated in recent years. These techniques have been widely used in many applications such as face recognition, video surveillance, facial expression recognition, digital entertainment, 3D face modelling, and so on. However, due to the limitations of capturing systems and changes in the environment, captured human face images are very often of low resolution. The poor quality of face images has an adverse effect on the performance of computer vision and pattern recognition applications. To solve this problem, it is necessary to render a high-resolution (HR) face image from the corresponding low-resolution (LR) one. This technique is named face hallucination (FH) or face super-resolution (SR) [1,2]. The major difference between face hallucination and the general super-resolution problem is that face images have regular structures and textures. Compared with the general super-resolution problem, face hallucination is more challenging. Some methods exploit edge or sketch priors. The major drawback of using edge priors is that they focus on preserving edges, so the performance in relatively smooth regions is mediocre [2,7]. Yang et al. combined a landmark localization method and the gradient map to estimate and align facial features for hallucination [8]. The gradient profile prior was used to enhance the quality of the hallucinated HR image [9]. However, this method is strongly dependent on the results of landmark localization.
Example-based super-resolution schemes have proven to be able to reconstruct significantly finer details from an LR image compared with the interpolation-based schemes [4]. The general idea of example-based approaches is to learn the statistical correlation between pairs of LR and HR images from a face dataset. The learned correlation is then applied to an input LR image to reconstruct the corresponding HR image [3]. Different methods have been studied to learn the mapping relationship between LR and HR images [1,10,11], such as 1) Sparse representation-based approaches [12][13][14]; 2) Subspace learning approaches, including local linear embedding and linear subspace learning-based approaches [15][16][17][18]; 3) Bayesian inference approaches: learning priors from numerous feature vectors to generate a function, mapping features from LR images to HR images [4,16,19].
Performance of learning-based SR methods heavily depends on the similarity between the training and the testing images to query input LR face images. The quality of the edges in a reconstructed HR image can be significantly degraded when the edges in training images cannot be matched or aligned well with the corresponding input image [20].
Many researchers have presented that the structural constraints can be applied to improve the results of face hallucination. For example, Markov random fields can be used to reduce the ambiguity between LR and HR images by learning the statistical relationship between a global face image and its local features [3]. Face image structures like facial components are exploited to transfer the high-frequency details for preserving the structural consistency [8].
Similarly, the position information of face images can be used to improve face hallucination performance. Ma et al. synthesized the high-resolution image patch using the same-position image patches of training image pairs [21]. A similar strategy was also proposed in [22] using convex optimization. Jiang et al. proposed face super-resolution via locality-constrained neighbour representation based on position information [23] and contextual information [24]. Liu et al. presented a robust locality-constrained bi-layer representation model to hallucinate face images [25]. Lu et al. proposed manifold-regularized group locality-constrained representation (MGLR) to exploit the multiple manifold structures rooted in grouped self-similar patches [26].
From these methods, we can see that faces are highly structured, and the position information of patches from facial images is widely exploited to improve face hallucination performance by matching patches at the same facial positions. However, most of the existing methods require the faces to be well cropped and accurately aligned, which are challenging tasks, especially for real-world face images. In addition, they ignore the fact that only using a single patch with local constraints may result in unstable solutions [26].
Recently, deep learning based super resolution methods have been proposed and have achieved state-of-the-art performance. The pioneering work in [27], termed SRCNN, learns an end-to-end mapping between bicubic-interpolated LR images and HR images. To obtain better performance, Dong et al. redesigned the SRCNN structure as FSRCNN by introducing a deconvolution layer at the end of the network [28]. Smaller filter sizes but more mapping layers were adopted to speed up the method. Yamanaka et al. proposed a model with skip connections and network in network (DCSCN) to improve efficiency [29]. Kim et al. used a very deep convolutional network based on VGG-net to improve the accuracy of super resolution [30]. Li et al. proposed a feedback network (SRFBN) to refine low-level representations with high-level information [31]. Similarly, structure information has been considered in the deep learning framework. For example, Lu et al. developed a parallel region based deep residual network (PRDRN) to predict the missing detailed information for accurate face hallucination [32]. Usually, deep learning based methods demand a large training dataset and intensive computation and memory resources [29].
On the other hand, the generative adversarial network (GAN) has recently been applied to super resolution and face hallucination problems. The seminal work in [33] was capable of generating realistic textures during single image super resolution. Wang et al. introduced the residual-in-residual dense block (RRDB) to improve the performance of the original SRGAN; the discriminator predicts relative realness instead of an absolute value for better visual quality [34]. Yu et al. proposed transformative discriminative neural networks to avoid heavy reliance on accurate alignment of low-resolution (LR) faces before upsampling them [20]. However, the face details hallucinated by GAN based algorithms are often accompanied by unpleasant artifacts.
Inspired by the successful application of example-based learning methods to the super resolution problem and by the highly structured nature of face images, in this paper we propose a novel example-based face hallucination method based on cluster consistent dictionary learning. Traditional dictionary learning algorithms [10,14,15] focus on the best sparse representation of the training signals by the learned dictionary, but do not consider the consistency capability of the dictionary [36]. Specifically, in the face hallucination problem, similar image signals (in feature spaces) may be represented by different atoms in the dictionary. However, in the test stage, we strongly expect a test image signal to have a sparse representation similar to that of a training sample if they are close in a feature space from the same face part (or in the same cluster). This is particularly useful for face hallucination, as face images have similar structures including eyes, nose, mouth, etc. Unfortunately, most dictionary learning methods have not considered cluster consistency. In this work, we train a cluster consistent K-SVD (CC K-SVD) dictionary combined with the adjusted anchored neighbourhood regression [18] for the face hallucination problem.
The main contributions are summarized as follows. We study face patch clustering by both texture similarity and face part positions; the generated clusters help construct a consistent sparse dictionary. We develop a novel example-based face hallucination method, based on discriminative cluster consistent dictionary learning, to exploit the facial parts similarity prior without requiring the training face images to be cropped and aligned. Extensive experimental results on several benchmarks indicate that, by utilizing the prior knowledge of labelled facial parts, our proposed method recovers more details in the reconstruction from a small training dataset than many state-of-the-art methods.
The remainder of this paper is organized as follows. Section 2 reviews the related dictionary learning methods for super resolution and face hallucination problems. In Section 3, we present the details of our proposed method. In Section 4, implementation details are described, and experimental results are presented to show the performance of our method. Finally, in Section 5, we draw a conclusion.

RELATED WORK
Sparse coding has been successfully applied to the super-resolution problem. The performance of sparse coding related applications heavily relies on the quality of the over-complete dictionary D. As our proposed method extends the traditional dictionary learning algorithm for the face hallucination problem, in this section we review the related dictionary-based methods for SR.

Sparse coding approaches
Sparse coding approaches try to represent the patches over a trained codebook of dictionary atoms by solving

\min_{\alpha} \|x - D_l \alpha\|_2^2 + \lambda \|\alpha\|_1,  (1)

where D_l is the learned dictionary, x is the low-resolution input patch, \lambda is a weighting factor, and \alpha is the vector of sparse coefficients of the dictionary atoms for x.
In order to reconstruct a high resolution image y, the LR and HR dictionaries are jointly trained so that they can represent HR patches and their corresponding LR counterparts using one sparse representation [13,14,35]:

\min_{D_l, D_h, \alpha} \|X - D_l \alpha\|_2^2 + \gamma \|Y - D_h \alpha\|_2^2 + \lambda \|\alpha\|_1,  (2)

where D_l and D_h are the LR and HR dictionaries respectively, \alpha is the sparse representation shared by X = [x_1, … , x_N] and Y = [y_1, … , y_N], which denote the LR and HR image patch pairs in the training dataset, \lambda is a weighting factor balancing the importance of the sparsity regularisation, and \gamma controls the tradeoff between matching the LR input and finding an HR counterpart. Once the dictionaries are trained, given the optimal solution \alpha^* for an input testing LR patch x, the high-resolution patch can be easily reconstructed as \hat{y} = D_h \alpha^*.
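As an illustration of this coupled-dictionary pipeline, the following is a minimal NumPy sketch (not the authors' implementation): it fits an LR patch against a toy LR dictionary with an l1 penalty via ISTA, then maps the shared sparse code through the HR dictionary. The `ista` helper and all dictionary sizes are our own illustrative assumptions.

```python
import numpy as np

def ista(D, x, lam=0.01, n_iter=300):
    """Solve min_a 0.5*||x - D a||_2^2 + lam*||a||_1 by iterative
    soft-thresholding (ISTA)."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - D.T @ (D @ a - x) / L      # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # shrinkage
    return a

rng = np.random.default_rng(0)
D_l = rng.standard_normal((16, 32))        # toy LR dictionary (atoms as columns)
D_l /= np.linalg.norm(D_l, axis=0)         # unit-norm atoms
D_h = rng.standard_normal((64, 32))        # coupled toy HR dictionary

a_true = np.zeros(32)
a_true[[3, 17]] = [1.0, -0.5]              # a 2-sparse code
x = D_l @ a_true                           # synthetic LR patch
a_star = ista(D_l, x)                      # shared sparse code alpha*
y_hat = D_h @ a_star                       # HR patch reconstructed as D_h alpha*
```

Because the two dictionaries share one sparse code, a good LR fit transfers directly to the HR estimate; that is the essence of the joint training above.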

Adjusted anchored neighbourhood regression
Instead of considering the whole dictionary as the sparse coding approach does, anchored neighbourhood regression (ANR) reformulates the patch representation problem as a least squares regression regularised by the l2-norm over local neighbourhoods:

\min_{\alpha} \|x - N_l \alpha\|_2^2 + \lambda \|\alpha\|_2^2,  (3)

where N_l is the local neighbourhood of the matched dictionary atom. Because this ridge regression has a closed-form solution, a projection matrix can be precalculated for each neighbourhood. Finally, an LR input patch can be projected to the HR space as

\hat{y} = P_j x,  with  P_j = N_h (N_l^T N_l + \lambda I)^{-1} N_l^T,  (4)

where N_h is the corresponding local neighbourhood of the HR dictionary and P_j is the stored projection matrix for dictionary atom d_l^j.
To improve the reconstruction quality, in adjusted anchored neighbourhood regression (A+), the neighbourhood is defined in terms of the dense training samples rather than the sparse dictionary atoms in the ridge regression formulation of ANR. The optimization problem then becomes

\min_{\alpha} \|x - S_l \alpha\|_2^2 + \lambda \|\alpha\|_2^2,  (5)

where the matrix S_l contains the training samples that lie closest to the dictionary atom to which the input patch x is matched, and \alpha is the weight vector for representing x. Similar to Equation (4), the regressor can be defined as

F_j = S_h (S_l^T S_l + \lambda I)^{-1} S_l^T,  (6)

where S_l and S_h contain the training samples most correlated with the corresponding D_l and D_h dictionary atoms, and \lambda is the regularization parameter. Unlike N_l and N_h, S_l and S_h draw on the full set of training samples when learning the regressors on the dictionary, which improves the reconstruction performance.
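Since each regressor is a closed-form ridge regression, it can be precomputed once offline. A minimal sketch with toy data (the sample matrices, sizes and the regularization value here are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
m, M, n = 16, 64, 128               # LR patch dim, HR patch dim, neighbourhood size
S_l = rng.standard_normal((m, n))   # LR training samples nearest one atom
S_h = rng.standard_normal((M, n))   # their HR counterparts
lam = 0.1                           # ridge regularization parameter

# F_j = S_h (S_l^T S_l + lam I)^{-1} S_l^T, precomputed per dictionary atom
F_j = S_h @ np.linalg.solve(S_l.T @ S_l + lam * np.eye(n), S_l.T)

x = rng.standard_normal(m)          # an LR input patch matched to this atom
y_hat = F_j @ x                     # projected HR patch (one matrix multiply)
```

At test time the per-patch cost is a single matrix-vector product, which is what makes the anchored-regression family fast.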

PROPOSED METHOD
It is well known that human face images contain complicated local structures. To represent the underlying face geometric structures well, we advocate dividing the processed image patches into several groups such that each group shares similar geometric structures of face attributes. To this end, we utilize positions of facial landmarks to group the positioned training patches. After that, we cluster the patches in each group based on their texture similarities. Then, inspired by the scheme from [36] for discriminative dictionary learning, we design a cluster consistent dictionary learning for the FH problem.
In the traditional dictionary learning based method for SR, the atoms in the dictionary are learned from the training patches independently. These atoms spanned the whole space are used as anchors in the construction step. However, in the testing stage, the similar patches may be expressed by different anchors, especially, when they are located in the boundary of different areas spanned by different anchors. This inconsistent expression of similar feature patches may cause poor results in the reconstruction step for the FH problem. In this paper, we learn a single over-complete dictionary keeping cluster consistent jointly, which yields dictionaries so that face feature patches with the same class labels have similar sparse codes. Similar to advances in the SR family of dictionary learning models and anchored regressors, we also elaborate on designing a set of simple yet efficient linear regressors for FH reconstruction based on the learned CC dictionary to find the underlying local manifold structures such that the learned anchored points can better approximate the subspace of the training dataset. Figure 1 demonstrates the framework of the proposed face hallucination method. It is comprised of two stages, that is, the training stage and the testing stage. Firstly, we collect training face images with face area label boxes and key feature positions like eyes, noses, mouths etc.. Labelled face datasets [37,38] or classical face detection methods [39] can be utilized here for getting face area label boxes and key feature positions. Then, a large set of LR and HR paired patches are created from the training set. We divide the patches into five areas as eyes, noses, mouths, other face areas and non-face areas based on the distances between the patch centre positions and the key feature positions. Next, the five areas are clustered respectively. A cluster consistent matrix is constructed by the full clustered patches. 
Based on the cluster label consistent matrix, and different from the traditional K-SVD method, a cluster consistent dictionary is learned to represent the LR and HR patches jointly.
In the testing stage, a given LR image is first divided into patches. An LR input patch is projected to the HR space by its neighbourhood, defined in terms of the dense training samples, using the ridge regression formulation. By integrating all of the obtained HR patches according to their positions, the final HR image is generated by averaging pixel values in the overlapping regions.
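The final integration step, averaging pixel values where HR patches overlap, can be sketched as follows (a generic implementation, assuming square patches addressed by their top-left coordinates):

```python
import numpy as np

def assemble_image(patches, positions, patch_size, image_shape):
    """Place HR patches at their positions and average overlapping pixels."""
    acc = np.zeros(image_shape)   # sum of patch values per pixel
    cnt = np.zeros(image_shape)   # how many patches cover each pixel
    for p, (r, c) in zip(patches, positions):
        acc[r:r + patch_size, c:c + patch_size] += p
        cnt[r:r + patch_size, c:c + patch_size] += 1.0
    return acc / np.maximum(cnt, 1.0)   # avoid dividing uncovered pixels by 0

# two overlapping 2x2 patches of ones partially cover a 3x3 image
out = assemble_image([np.ones((2, 2)), np.ones((2, 2))], [(0, 0), (1, 1)], 2, (3, 3))
```

Averaging in the overlap suppresses blocking artifacts at patch seams, which is why overlapping patches are used in the first place.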

Patches clustering
In order to exploit the structure information of face images, we first obtain the face areas (face boxes) from the training dataset. For datasets without labelled face components, classical face detection algorithms can be used, such as those in [39][40][41]. The face components in a face box are grouped as eyes, mouths and noses. The face areas without face components and the background areas are grouped into face areas or non-face areas, as shown in Figure 2. Specifically, for a patch in a face box area, if the minimum distance between the patch centre and the marked position of a face component is less than 1/4 of the width of the face box, we group the patch into the corresponding face component group. Figure 3 shows examples of different groups of patches in facial areas.
Then we cluster the patches separately based on their group labels using traditional clustering algorithms such as K-means. We empirically set the number of clusters in each group for different face datasets. More details are described in Section 4. The total cluster number is equal to the sum of the cluster numbers over all groups of differently labelled patches.
By performing a clustering algorithm like k-means, we can label the training face patches with K clusters. Each cluster is composed of the patches with similar geometric and texture structures. To well adapt to different contents in an image, once these clusters are formed, we can construct a cluster consistent matrix for the cluster consistent dictionary learning.
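The group-then-cluster step can be sketched as below; `kmeans` is our own minimal Lloyd's-algorithm implementation (in practice any K-means routine works), and offsetting the cluster ids keeps them unique across groups:

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Minimal Lloyd's algorithm: returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)              # assign to nearest centroid
        for j in range(k):
            if (labels == j).any():
                C[j] = X[labels == j].mean(axis=0)  # update centroid
    return labels

def cluster_groups(groups, k_per_group):
    """Cluster each labelled group (eyes, nose, ...) separately; offset the
    ids so every cluster stays unique across groups."""
    out, offset = {}, 0
    for name, X in groups.items():
        out[name] = kmeans(X, k_per_group) + offset
        offset += k_per_group
    return out
```

With this convention, the total number of clusters is simply the sum over the groups, matching the counting rule described above.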

Cluster consistent dictionary learning
In this section, we describe the construction of the cluster consistent matrix and the learning of the cluster consistent dictionary. We aim to leverage the facial structural information (i.e. position based labels) of the input training image patches to learn a reconstructive and label consistent dictionary. The dictionary atoms can reveal different image structures in each cluster, and together they span the whole feature space. Each dictionary atom will be chosen so that it represents a subset of the training patches, ideally from a single class (cluster); that is, each dictionary item d_k can be associated with a particular cluster so that it represents the corresponding underlying structure.
Consider a collection of N LR and HR image patch pairs in the training dataset, denoted by X = [x_1, … , x_N] ∈ R^{m×N} and Y = [y_1, … , y_N] ∈ R^{M×N}, where m and M are the dimensions of the LR and HR image patches respectively. To learn a dictionary with K items, D = [d_1, … , d_K], for the sparse representation of X, the consistent matrix H of all the training patches can be defined according to the patch cluster labels.
For example, assume five training patches [x_1, x_2, x_3, x_4, x_5] come from two clusters: x_1 and x_2 from cluster 1, and x_3, x_4 and x_5 from cluster 2. Assuming likewise that dictionary atoms d_1 and d_2 are associated with cluster 1 and d_3, d_4 and d_5 with cluster 2, the label consistent matrix can be constructed as

H = [h_1, … , h_5] =
[1 1 0 0 0
 1 1 0 0 0
 0 0 1 1 1
 0 0 1 1 1
 0 0 1 1 1].

We say that h_i is the cluster label vector of input patch x_i. The non-zero values of h_i indicate at which indices the corresponding patches and dictionary atoms are from the same cluster. To obtain label consistent sparse codes of X with the learned D, an objective function for dictionary construction is defined by

\min_{D, A, \alpha} \|X - D\alpha\|_2^2 + \beta \|H - A\alpha\|_2^2  s.t. \forall i, \|\alpha_i\|_0 \le T,  (9)

where the parameter \beta controls the tradeoff between the reconstruction error and the label consistency regularization, and T is a sparsity constraint factor. H ∈ R^{K×N} is the cluster label consistent matrix, and A denotes a linear transformation matrix. If the non-zero values of h_i occur at certain positions, then the corresponding input patches from X and the dictionary items from D share the same label. The first term represents the reconstruction error, while the second term represents the cluster label consistency error, which enforces that the sparse codes approximate the label consistent matrix H and forces the patches from the same class to have very similar sparse codes.

Considering that only a few atoms that are closely correlated to the input contribute to the representation, it is reasonable to divide the whole feature space into different groups, such that the atoms in each group are closely correlated to each other. Therefore, the dictionary learned in this way will be adaptive to the underlying local face structure of the training data (leading to a good representation for each member in the set under strict sparsity constraints), and will generate consistent sparse codes regardless of the size of the dictionary. The anchored points from the dictionary can better approximate the subspace of the training dataset. In the next section, we will show that the consistency property of the sparse codes benefits the performance of face reconstruction.
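Building such a matrix from cluster labels is mechanical. A minimal sketch (the helper name and the atom-to-cluster assignment are our own illustration):

```python
import numpy as np

def cluster_consistent_matrix(atom_clusters, patch_clusters):
    """H[k, i] = 1 iff dictionary atom k and training patch i belong to
    the same cluster, else 0."""
    H = np.zeros((len(atom_clusters), len(patch_clusters)))
    for k, ck in enumerate(atom_clusters):
        for i, ci in enumerate(patch_clusters):
            H[k, i] = 1.0 if ck == ci else 0.0
    return H

# the five-patch example: atoms d1, d2 in cluster 1, d3..d5 in cluster 2,
# patches x1, x2 in cluster 1, x3..x5 in cluster 2
H = cluster_consistent_matrix([1, 1, 2, 2, 2], [1, 1, 2, 2, 2])
```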

Optimization
Traditional sparse coding methods for super resolution applications can use the K-SVD algorithm to find the optimal solution for all parameters simultaneously in Equation (2) [14]. K-SVD performs dimensionality reduction on the patches through PCA and uses orthogonal matching pursuit (OMP) for the sparse coding. For our face hallucination application, we learn the dictionary D, the transformation A and the coefficients \alpha simultaneously. Combining with Equation (2), Equation (9) can be rewritten as

\min_{D_l, D_h, A, \alpha} \|X - D_l\alpha\|_2^2 + \gamma\|Y - D_h\alpha\|_2^2 + \beta\|H - A\alpha\|_2^2  s.t. \forall i, \|\alpha_i\|_0 \le T,  (10)

where the parameter \gamma controls the tradeoff between matching the LR input and finding an HR patch that is compatible with its neighbours. In all of our experiments, we simply set \gamma = 1. Let

\tilde{X} = (X; \sqrt{\gamma} Y; \sqrt{\beta} H),  \tilde{D} = (D_l; \sqrt{\gamma} D_h; \sqrt{\beta} A),

so that Equation (10) becomes

\min_{\tilde{D}, \alpha} \|\tilde{X} - \tilde{D}\alpha\|_2^2  s.t. \forall i, \|\alpha_i\|_0 \le T.  (11)

This is the classical problem that K-SVD solves [42]. Following K-SVD, we can learn \tilde{D} (i.e. D_l, D_h and A) and \alpha simultaneously. We use the efficient K-SVD algorithm to find the optimal solution for all parameters, which produces a label consistent sparse representation regardless of the size of the dictionary.
In the atom update stage, K-SVD considers each atom in turn. The representation error without the contribution of atom d_k is

E_k = \tilde{X} - \sum_{j \ne k} d_j \alpha^j,  (12)

where \alpha^j is the jth row of the coefficient matrix \alpha and d_j is the jth column of the dictionary \tilde{D}. Let \tilde{E}_k and \tilde{\alpha}^k denote the results of discarding the zero entries in E_k and \alpha^k, respectively. d_k and \tilde{\alpha}^k can then be estimated by solving the following problem [12]:

\min_{d_k, \tilde{\alpha}^k} \|\tilde{E}_k - d_k \tilde{\alpha}^k\|_F^2.  (13)

Applying the SVD decomposition \tilde{E}_k = U \Sigma V^T, d_k and \tilde{\alpha}^k are computed as

d_k = u_1,  \tilde{\alpha}^k = \Sigma(1,1) v_1^T,

where u_1 and v_1 are the first columns of U and V.
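The SVD-based atom update can be sketched as a generic K-SVD step on toy data (not the authors' code; the in-place update convention and names are our own assumptions):

```python
import numpy as np

def ksvd_update_atom(X, D, A, k):
    """Rank-1 K-SVD update of dictionary atom k and its coefficient row."""
    omega = np.nonzero(A[k])[0]                  # samples actually using atom k
    if omega.size == 0:
        return D, A
    E_k = X - D @ A + np.outer(D[:, k], A[k])    # error without atom k
    E_tilde = E_k[:, omega]                      # discard unused columns
    U, s, Vt = np.linalg.svd(E_tilde, full_matrices=False)
    D[:, k] = U[:, 0]                            # d_k <- first left singular vector
    A[k, omega] = s[0] * Vt[0]                   # row  <- sigma_1 * v_1^T
    return D, A

# toy rank-1 data: a single atom should absorb it exactly
u = np.array([1.0, 2.0, 0.0, 1.0])
v = np.array([1.0, -1.0, 2.0])
X = np.outer(u, v)
D = np.eye(4)[:, :3].copy()                      # 3 arbitrary initial atoms
A = np.zeros((3, 3))
A[1] = v                                         # only atom 1 is used
D, A = ksvd_update_atom(X, D, A, 1)
```

Restricting the update to the columns in `omega` is what preserves the sparsity pattern of the codes between iterations.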

Training and test
At the training stage, given a set of HR face images, the corresponding LR images are generated using a bicubic kernel function. We randomly extract a large set of HR and LR patches from the HR and LR image pairs to form a training set. The patches, represented as feature vectors with their mean values subtracted, are assigned to different clusters based on their textures and facial structures as described in Section 3.1. We expect the patches in the same cluster to have a similar distribution, which does not require the face images to be strictly cropped or aligned. Then, we learn a sparse dictionary D_l and its corresponding D_h by enforcing the same coefficients and cluster consistency in the HR and LR patch decompositions over D_l and D_h, as in Equation (11).
Instead of considering the whole dictionary like the sparse encoding approach, local neighbourhoods of the dictionary or neighbourhoods of training samples have been proven to provide better reconstruction quality [18]. Therefore, we follow the A+ strategy to reconstruct HR patches by reusing the training samples. Specifically, for each dictionary atom d l k , k ∈ (1, K ), we define the neighbourhood {S l k , S h k } in terms of the dense training samples rather than the sparse dictionary atoms. Then, the regressor can be defined by Equation (6).
For testing, the input LR image is divided into patches. Each LR patch is represented by the learned cluster label consistent dictionary. Once all of the K regressors corresponding to the K clusters of patches are obtained, for a given testing LR patch x_l, the most correlated dictionary atom d_l^k and its corresponding regressor are applied to estimate the output \hat{y} by

\hat{y} = F_j x_l,  (14)

where F_j is the regressor best matched to x_l, measured by the maximal absolute value of the correlation between x_l and the atoms across the dictionary D_l, that is,

j = \arg\max_k abs((d_l^k)^T x_l),  (15)

where abs(⋅) denotes the absolute value function. Finally, the HR face image is reconstructed by averaging the overlapping areas of the HR patches. The whole process is shown in Algorithm 1. Testing stage: 1: Partition the input test image x_l into overlapping patches x_1^l, … , x_n^l. 2: Find the best matched regressor F_j for each patch, and compute the HR patch via Equations (15) and (14).
Output: HR face image \hat{y}, obtained by integrating all the HR patches according to their positions and averaging the overlapping regions.
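Per patch, the testing stage reduces to a correlation lookup followed by one matrix multiply. A minimal sketch under toy assumptions (unit-norm atoms, one precomputed regressor per atom; the setup below is illustrative, not from the paper):

```python
import numpy as np

def hallucinate_patch(x_l, D_l, regressors):
    """Pick the atom with maximal |correlation| to x_l, then apply its
    precomputed regressor to map the LR patch into HR space."""
    j = int(np.argmax(np.abs(D_l.T @ x_l)))   # best-matched atom index
    return regressors[j] @ x_l                # HR estimate via its regressor

# toy setup: 3 orthonormal LR atoms, each with its own scaling "regressor"
D_l = np.eye(3)
regressors = [np.eye(3) * s for s in (1.0, 2.0, 3.0)]
x_l = np.array([0.0, 5.0, 0.0])               # most correlated with atom 1
y_hat = hallucinate_patch(x_l, D_l, regressors)
```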
The key difference between our method and the A+ method is that our dictionary gives patches in the same cluster similar sparse coefficients, that is, a consistent reconstruction result. As only a few atoms that are closely correlated to the input feature vector contribute to the sparse representation, when we choose suitably correlated neighbours of each LR atom from a learned LR and HR dictionary pair, the atoms in the learned CC dictionary can reveal different face image structures. We find that such a CC dictionary can better represent face details in the face hallucination application. The main reason is that patches from the same face parts have similar sparse representations in our proposed method. Therefore, they can be consistently reconstructed by the anchored training patches, which gives better face hallucination performance. This is particularly useful for key face parts such as the eyes, nose and mouth.

EXPERIMENTS
To illustrate the performance of the proposed method, we evaluate our algorithm on three popular face databases: the FEI face database [43], the CelebA database [37] and the LFPW database [38]. We compare our method with several classical and state-of-the-art face hallucination and related image super resolution methods, including bicubic interpolation, Yang's sparse coding [12], A+ [18], FSRCNN [28], TRNR [44], SRGAN [33], TLcR-RL [24], ESRGAN [34], SRFBN [31], and CCR [47]. For all the compared methods, we have retrained the models using the same training dataset as our proposed method. PSNR (evaluated on the luminance channel in YCbCr color space for color images) and SSIM are used as the objective measurements of image quality.

Implementation and parameters setting
In this section, we describe the implementation details and the main parameters of our proposed method. Since A+ is the closest related method to ours, as a patch-based method, the training image patch size is set to 12 × 12, and the overlap between two adjacent patches is 1 pixel. We set the dictionary to 1024 anchored points and the correlative neighbourhood size to p = 2048 for fast training, as suggested by [18,45]. We set β = 0.001 empirically to control the tradeoff between the sparse dictionary learning and the label consistency regularization. As some deep learning based methods only provide training code for scale factor 4, without loss of generality, we only compare the results at upscale factor 4. Before training, we first detect the face parts by classical detection methods. Without loss of generality, we extract the key part positions, including eyes, noses, mouths, and face boxes, by MTCNN [39], which is an efficient and effective open-source tool for face detection. For databases like CelebA and LFPW, the images have been well labelled, and the key part positions can also be extracted from the ground-truth landmark points.
Then, the patches are grouped into face areas or background (non-face) areas based on their positions relative to the face boxes. Different from methods for preparing traditional dictionary training patches such as K-SVD [46], in this paper the patches are combined with position information for further labelling. In the labelling step, the distances between the position of each patch and the positions of the key parts in each image are computed. We compare the minimum distance with a predefined threshold (e.g., the average distance between the left and right eyes). If the minimum distance is less than the predefined threshold, the patch is labelled as the corresponding face part. Specifically, if the minimum distance between the centre of the patch and a key part is less than 1/4 of the width of the face box, the patch is labelled with the nearest face key part group. Otherwise, the unlabelled patches are grouped into the face area or the background (non-face) area according to whether their positions lie inside or outside the predicted face box.
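The labelling rule above can be sketched as a small function; the box format (x, y, width, height) and the group names are our own illustrative conventions:

```python
def label_patch(centre, keypoints, face_box):
    """Label a patch centre as the nearest key part (if within 1/4 of the
    face-box width), else as a face area or a non-face (background) area."""
    x0, y0, w, h = face_box
    inside = (x0 <= centre[0] <= x0 + w) and (y0 <= centre[1] <= y0 + h)
    best, d_min = None, float("inf")
    for name, (px, py) in keypoints.items():
        d = ((centre[0] - px) ** 2 + (centre[1] - py) ** 2) ** 0.5
        if d < d_min:
            best, d_min = name, d
    if inside and d_min < w / 4.0:
        return best                    # eye / nose / mouth group
    return "face" if inside else "non-face"

kps = {"left_eye": (30, 40), "right_eye": (70, 40), "nose": (50, 60)}
box = (0, 0, 100, 120)                 # (x, y, width, height)
```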
Depending on the labels, we cluster the patches in each group separately based on their textures. The number of clusters in each group is set empirically. More details about the effects of the cluster number K and the consistency controlling parameter will be discussed later.

FEI database
The FEI face database is composed of 400 frontal face images of 200 persons, that is, two images per person, with smiling and neutral expressions respectively. In our experiments, the training set contains 360 face images, randomly selected from the FEI database. For fair comparison, we do not use data augmentation as applied in GAN based methods like [31]. 1,300,000 LR and HR face patch pairs are collected for training. The remaining images are used for testing. Without loss of generality, we magnify the input face image by a factor of 4. In other words, the original images form the HR image dataset, which is down-sampled by a factor of 4 to form the LR image dataset. We set four clusters for each group, and the quantitative results are listed in Table 1. The visual results are shown in Figure 4. From Table 1, we can see that the simple bicubic interpolation method cannot produce many high frequency details. Our algorithm performs better than classical dictionary based methods such as Yang's and A+, as well as the clustering and collaborative representation method, with a margin of improvement of 0.36 in PSNR and 0.01 in SSIM.
Both TRNR and TLcR-RL apply a simple neighbour representation to the face hallucination problem. Concretely, TRNR uses a simple neighbour representation with Tikhonov regularization and position information. TLcR-RL uses context information and reproducing learning by adding the hallucinated HR face image to the training set. TLcR-RL presents competitive results on the FEI dataset; however, it requires the training images to be well cropped (making it unsuitable for the LFPW dataset, as described in the following section), and the reproducing learning in the reconstruction step is time consuming.
As for the deep learning based methods, such as FSRCNN and SRFBN, we retrain them on FEI. Unfortunately, FSRCNN shows blurry results on facial images at an upscaling factor of 4. SRFBN can maintain the face contours well due to its global optimization scheme. However, it fails to capture high frequency details (refer to the eyes, noses and mouths).
Compared with GAN based methods, SRGAN and ESRGAN can be seen as among the currently most popular super resolution and face hallucination methods. Although some results from GAN based methods present better perceptual quality for super resolution of natural images, and ESRGAN in particular achieves relatively sharper face contours, they tend to introduce artifacts on human faces, for example in the eye areas, as shown in columns (f) and (i) of Figure 4.

CelebA database
To further examine our algorithm, we also conduct the same experiment on the large-scale real-world CelebA face dataset [37]. The CelebA dataset consists of more than 200,000 face images, each labelled with five landmarks (two eyes, nose and mouth corners). We randomly select 250 images as the training set. For the testing data, 100 images are randomly chosen from the remaining images. As shown in Table 2 and Figure 5, traditional interpolation, dictionary learning and example based upsampling methods, that is, Yang's, A+, TRNR and TLcR-RL, cannot hallucinate clear facial details. In particular, the sparse coding based super-resolution methods Yang's and A+ may reconstruct similar image signals (in feature spaces) with different atoms in the dictionary, without finding a consistent correspondence between LR and HR patches. CCR uses the local geometry property via clustering to improve the performance; however, the improvement is small. For the TRNR and TLcR-RL methods, we adapt the original public codes for colour images. As the samples in the CelebA dataset, which include more varied poses, are not as well aligned as those in the FEI dataset, the face structure dependent TRNR and TLcR-RL methods cannot present as good a performance as on the FEI dataset. With the help of the discriminator network, the SRGAN and ESRGAN methods achieve good perceptual quality on the CelebA dataset. However, their PSNR and SSIM performances are mediocre, and they may hallucinate distorted facial details, especially in the eye areas. Among deep convolutional network based methods, SRFBN shows better performance than FSRCNN. However, SRFBN's performance on the structured CelebA face images is not as good as on natural images. Besides, both of them have large numbers of parameters and require more capable hardware support (e.g. GPU) and more training samples. This also implies that our up-sampling method is more suitable for the face hallucination task.

LFPW database
To further examine the robustness of our algorithm, we also conduct the same experiments on the Labelled Face Parts in-the-Wild (LFPW) face database without cropping and alignment [38]. The LFPW database contains 1,432 images of different sizes, downloaded from websites such as google.com, yahoo.com and flickr.com, with large variations in pose, expression, illumination and occlusion. Each image is labelled with 35 landmark points. We randomly select and downsample 100 images as the training set. For the testing data, 100 images are randomly chosen from the remaining images. As the TRNR and TLcR-RL methods require the training samples to be well cropped, a requirement the LFPW database cannot meet, we only report the performances of the other methods. As shown in Table 3 and Figure 6, compared with the other algorithms, our method yields better performance in terms of both PSNR and SSIM.

Choice of sensitive parameters
An important parameter is the regularization parameter in the cluster consistent dictionary learning model in Equation (10).
In this experiment, we investigate how this empirical parameter affects the performance of the CCFH method. To obtain an optimal value that balances the reconstruction error and the regularization term, we analyse the PSNR and SSIM performances while varying the parameter in the range from 10^-6 to 1. Figure 7 shows the average PSNR and SSIM scores for different regularization values on the CelebA dataset. Based on the results, we find that the value of the regularization parameter does affect the cluster-consistent-based FH: too small a value is insufficient to maintain the consistency relationship, while too large a value degrades the reconstruction loss. The best regularization value, corresponding to the highest PSNR and SSIM, is around 10^-3. Based on these statistical results, we empirically set the regularization parameter to 10^-3 in our experiments.
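The parameter-sweep protocol described above (log-spaced candidates, pick the value with the best validation score) is generic. The sketch below illustrates it on a toy ridge-regression stand-in, where `lam` plays the role of the regularization weight in Equation (10); the data and model are synthetic assumptions, not the paper's actual objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the regularized learning objective: ridge regression.
X_train = rng.normal(size=(200, 10))
w_true = rng.normal(size=10)
y_train = X_train @ w_true + 0.5 * rng.normal(size=200)
X_val = rng.normal(size=(100, 10))
y_val = X_val @ w_true + 0.5 * rng.normal(size=100)

def fit(lam):
    # Closed-form ridge solution: (X^T X + lam I)^-1 X^T y
    d = X_train.shape[1]
    return np.linalg.solve(X_train.T @ X_train + lam * np.eye(d), X_train.T @ y_train)

def val_error(w):
    return float(np.mean((X_val @ w - y_val) ** 2))

# Log-spaced sweep, mirroring the 1e-6 .. 1 range used in the paper
grid = [10.0 ** e for e in range(-6, 1)]
errors = {lam: val_error(fit(lam)) for lam in grid}
best = min(errors, key=errors.get)
print(best)
```

In the paper the scored quantity is validation PSNR/SSIM rather than mean squared error, but the selection logic is the same.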
Another important parameter is the number of clusters. To choose a reasonable cluster number for good reconstruction, we make an empirical study of how this parameter affects the reconstruction quality by varying the cluster number K within the range from 5 to 75.
The test is conducted on the CelebA training dataset, with a dictionary of 1024 anchored points and a magnification factor of 4. Figure 8 displays the average PSNR and SSIM when different cluster numbers are used to train the cluster consistent dictionary. From the figure, we find that a larger number of clusters yields a lower reconstruction error. However, when K is greater than 40, there is no obvious further decrease in the reconstruction error; instead, too many clusters may degrade the reconstruction results. According to this experiment, we fix K = 40 throughout our experiments.
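The "error drops with K, then flattens" behaviour described above can be observed with any clustering of patch features. The following sketch runs a basic k-means (a plain NumPy implementation, standing in for the paper's clustering step) on random stand-in patch vectors and compares the quantisation error for small and large K.

```python
import numpy as np

rng = np.random.default_rng(1)
patches = rng.normal(size=(500, 16))  # stand-in for vectorised LR patch features

def kmeans_error(X, K, iters=20):
    """Run a basic k-means and return the mean squared distance to the nearest centre."""
    centres = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(K):
            members = X[labels == k]
            if len(members):
                centres[k] = members.mean(0)
    d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    return float(d.min(1).mean())

# Quantisation error drops as K grows, then flattens; the knee of this
# curve is what motivates fixing K (40 in the paper's experiments).
errs = {K: kmeans_error(patches, K) for K in (5, 20, 40)}
print(errs[5] > errs[40])
```

On real patch data, the curve additionally degrades for very large K because some clusters become too small to learn reliable regressors, which is why an intermediate K is preferred.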

Computational complexity analysis
In this section, we discuss the running time of the proposed FH algorithm. Because the running time in the testing stage is the main concern for learning-based FH approaches, we focus only on the computational complexity of the testing stage. Our implementation of CCFH has a computation time similar to that of A+ because of the similar strategy: the time complexity of encoding LR input patches into HR output patches is linear in the number of input image patches and the number of anchoring atoms. The major procedures of our method involve two parts, that is, the transformation from LR features to HR features and the selection of the best regressor. The feature transformation from LR to HR using the precomputed mapping matrix costs O(Nd_l d_h), where N is the number of input patches and d_l and d_h are the LR and HR feature dimensions. Finding the most correlative neighbours of the N inputs takes O(Nd_l Kp) operations, by projecting onto the LR dictionary D_l and choosing the p most correlative neighbours with a nearest-neighbour search. Thus, the total complexity of the proposed CCFH framework is about O(Nd_l Kp + Nd_l d_h). We further compare the computational efficiency of the different methods. All the comparative experiments are performed on a 1.8 GHz Intel Core i7 CPU with 8 GB RAM, and a GTX1080 GPU for the deep-learning-based methods. Figure 9 shows the average PSNR performance versus runtime (in seconds) evaluated on the CelebA dataset for an upscaling factor of 4. As demonstrated, our method can achieve better FH quality with competitive computational time.
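The two per-patch costs discussed above can be made concrete with a minimal anchored-regression inference sketch in the style of A+. The dictionary `D_l`, the per-anchor projection matrices `P`, and all dimensions below are synthetic placeholders (random data), not the trained quantities from the paper; the sketch applies the single most correlated anchor's regressor per patch, rather than the paper's full neighbour-adapted variant.

```python
import numpy as np

rng = np.random.default_rng(2)
d_l, d_h, K = 16, 64, 8             # LR/HR feature dims, number of anchors
D_l = rng.normal(size=(K, d_l))     # anchored LR atoms (rows)
D_l /= np.linalg.norm(D_l, axis=1, keepdims=True)  # l2-normalise each atom
P = rng.normal(size=(K, d_h, d_l))  # one precomputed projection matrix per anchor

def hallucinate_patches(Y):
    """Map N LR patch feature vectors to HR feature vectors.

    Anchor search costs O(N * d_l * K); applying the precomputed
    mappings costs O(N * d_l * d_h), matching the per-stage costs
    discussed in the text.
    """
    corr = Y @ D_l.T               # correlation of each patch with each anchor
    nearest = corr.argmax(axis=1)  # most correlated anchor per patch
    return np.einsum('nij,nj->ni', P[nearest], Y)

Y = rng.normal(size=(100, d_l))    # 100 input LR patch features
X = hallucinate_patches(Y)
print(X.shape)  # (100, 64)
```

Because every step is a dense matrix product with precomputed factors, the runtime grows linearly in the number of input patches, which is what makes this family of methods fast at test time.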

DISCUSSION AND CONCLUSION
In this paper, we have presented a novel example-based face hallucination method based on cluster consistent dictionary learning, with the assumption that all face images have similar local pixel structures and that similar face image patches should be reconstructed consistently over the same training dictionary. We group the face patches according to their positions relative to the key facial parts and the face boxes, and then cluster all the patches within each group separately. A cluster consistence matrix can thus be constructed for learning a constrained dictionary. We have found that such a dictionary can better represent face details in the face hallucination application. The main reason is that, in our method, patches from the same facial parts have similar sparse representations; therefore, they can be consistently reconstructed from the anchored training patches, giving better face hallucination performance. This is particularly useful for key facial areas such as the eyes, nose and mouth. Experimental results show that the proposed method performs well in terms of both reconstruction error and visual quality, and the PSNR and SSIM results demonstrate that our method achieves competitive performance for face hallucination.
Moreover, our label consistent method, which imposes a more flexible constraint to describe the neighbourhood of face image pixels, does not require the training face images to be well cropped and aligned, a requirement that is essential for some traditional methods.