Real-World Super-Resolution of Face-Images from Surveillance Cameras

Most existing face image Super-Resolution (SR) methods assume that the Low-Resolution (LR) images were artificially downsampled from High-Resolution (HR) images with bicubic interpolation. This operation changes the natural image characteristics and reduces noise. Hence, SR methods trained on such data most often fail to produce good results when applied to real LR images. To solve this problem, we propose a novel framework for generation of realistic LR/HR training pairs. Our framework estimates realistic blur kernels, noise distributions, and JPEG compression artifacts to generate LR images with similar image characteristics as the ones in the source domain. This allows us to train a SR model using high quality face images as Ground-Truth (GT). For better perceptual quality we use a Generative Adversarial Network (GAN) based SR model where we have exchanged the commonly used VGG-loss [24] with LPIPS-loss [52]. Experimental results on both real and artificially corrupted face images show that our method results in more detailed reconstructions with less noise compared to existing State-of-the-Art (SoTA) methods. In addition, we show that the traditional non-reference Image Quality Assessment (IQA) methods fail to capture this improvement and demonstrate that the more recent NIMA metric [16] correlates better with human perception via Mean Opinion Rank (MOR).


Introduction
Face Super-Resolution (SR) is a special case of SR which aims to restore High-Resolution (HR) face images from their Low-Resolution (LR) counterparts. This is useful in many different applications such as video surveillance and face enhancement. Current State-of-the-Art (SoTA) face SR methods based on Convolutional Neural Networks (CNNs) are able to reconstruct images with photo-realistic appearance from artificially generated LR images. However, these methods often assume that the LR images were downsampled with bicubic interpolation, and therefore fail to produce good results when applied to real-world LR images. This is mostly because bicubic downscaling changes the natural image characteristics and reduces the amount of artifacts. Hence, when using algorithms trained with supervised learning on such artificial LR/HR image pairs, the reconstructed images usually contain strong artifacts due to the domain gap.
This paper is about SR of real low-resolution, noisy, and corrupted images, also known as Real-World Super-Resolution (RWSR). We apply our proposed method to face images, but the method is also applicable to other image domains. To create a SR model that is robust against the corruptions found in real images, we create a degradation framework that can produce LR images that have the same image characteristics as the images that we want to super-resolve, i.e. the source domain images. Creating LR images from clean high-quality images, i.e. the target domain, allows us to train a SR model that learns to super-resolve images with similar characteristics. This approach is inspired by the work of Ji et al. [22] who propose to perform RWSR via kernel estimation and noise injection. However, we observe that their framework for image degradation is not ideal for SR of LR face images from surveillance cameras, as these are often also corrupted by compression artifacts. Hence, we extend the degradation framework from [22] to include JPEG compression artifacts. We use the ESRGAN [45] model, which is one of the SoTA models for perceptual quality, as our backbone SR model. However, we find that the combination of loss functions for the ESRGAN is not ideal for optimal perceptual quality. To this end, we exchange the VGG-128 [41] discriminator with a PatchGAN [53] discriminator, similar to [22]. Inspired by Jo et al. [23], we additionally exchange the VGG-loss [24] in the generator with Learned Perceptual Image Patch Similarity (LPIPS) loss [52] for better perceptual quality.
Figure 1: ×4 SR of a real low-quality face image (100×128 pixels) from the Chokepoint DB [48]. Panels: Original, ESRGAN [45], Ours. Our method enhances details and removes noise while the ESRGAN [45] amplifies the corruptions.
Different from existing models for face SR [7,12,8], we do not restrict our model to only work for face images of fixed input sizes, which makes our model more useful in practice. To the best of our knowledge, we are the first to propose a method for SR of real LR face images of arbitrary sizes.
We evaluate our method on two different datasets. To enable comparison of the SR performance against Ground-Truth (GT) reference images, we artificially corrupt high-quality face images from the Flickr-Faces-HQ Dataset (FFHQ) [25] and report quantitative results using conventional Image Quality Assessment (IQA) methods and the most recent methods for assessment of the perceptual quality. For evaluation on real LR face images from surveillance cameras we use the Chokepoint DB [48]. In this case, as no GT image is available, we report the results using Mean Opinion Rank (MOR) and several non-reference based IQA methods. In both cases we show the effectiveness of our method via quantitative and qualitative evaluations. Furthermore, our evaluations show that most existing non-reference based IQA methods correlate poorly with human perception, while the recent Neural Image Assessment (NIMA) [16] metric provides a good correlation with human judgment, as confirmed by our MOR results.
In summary, our contributions are:
• We propose a novel framework for generation of LR/HR training pairs that includes the most common image degradation types in real-world face images. Our framework includes blur kernel estimation, noise injection and compression artifacts.
• We also propose an improved ESRGAN [45] based SR model with PatchGAN [53] and LPIPS loss [52] for better perceptual quality, and show the benefit on real LR face images from the Chokepoint DB [48] and artificially corrupted face images from the FFHQ DB [25].
• Quantitatively, we evaluate our method using the most popular non-reference based IQA methods, and find only the recent NIMA [16] metric to correlate with human judgment via MOR.

Related Work
Recent advancements within deep-learning have proven very successful for use within super-resolution, and models of this type often achieve SoTA results. The first deep-learning based method for super-resolution was proposed by Dong et al. [15] who successfully trained a CNN to learn a non-linear mapping from LR to HR images. Later proposals relied on deeper networks and residual learning [27,33], recursive learning [28], multi-path learning [21], and different loss functions [29] to reduce the reconstruction error between the super-resolved image and the GT image. However, while these methods yield high Peak Signal-to-Noise Ratio (PSNR) values, they tend to produce over-smoothed images which lack high-frequency details. To overcome this, Ledig et al. [32] proposed to use Generative Adversarial Networks (GANs) for SR with the SRGAN, to achieve realistic looking images according to human perception. The ESRGAN [45] further improves the SRGAN [32] by several changes to the discriminator and generator. The LR images needed for training the aforementioned deep-learning based super-resolution models are typically created by downsampling HR images with an ideal downscaling kernel, typically bicubic downscaling. However, the images generated by this kernel do not necessarily match real LR images. Additionally, in the downscaling process, important natural image characteristics, such as image sensor noise, are removed, which prevents the super-resolution algorithms from learning them. This results in poor reconstruction results and unwanted artifacts when a real-world noisy LR image is super-resolved [35].
Real-World Super-Resolution One way to address the lack of a proper imaging model for RWSR is to create datasets that consist of real LR/HR image pairs captured using two cameras with different focal lengths [9,43,47]. However, this method is cumbersome and has inherent problems with the alignment of the image pairs. To overcome the problem of missing real-world training data, Shocher et al. [2] propose a zero-shot approach where a small CNN is trained at test time on LR/HR pairs extracted from the LR image itself. Soh et al. [42] extend the work of [2] by using a meta-transfer learning phase to exploit information from an external dataset. Gu et al. [20] train kernel estimator and corrector CNNs under the assumption that the downscaling kernel belongs to a certain family of Gaussian filters and use the estimated kernel as input to a super-resolution model. To super-resolve LR images with arbitrary blur kernels, Zhang et al. [50] propose a deep plug-and-play framework which takes advantage of existing blind deblurring methods for blur kernel estimation. Bell-Kligler et al. [4] train a GAN to estimate blur kernels from LR images and combine it with the ZSSR SR model [2]. Fritsche et al. [17] train a GAN to introduce natural image characteristics to images downsampled with bicubic downscaling, which are then used to train a super-resolution model for improved performance on real-world images. Zhang et al. [49] propose an iterative network for SR of blurry, noisy images for different scaling factors by leveraging both learning and model-based methods. Most recently Ji et al. [22] propose a degradation framework for the creation of LR/HR image pairs for training. The degradation framework estimates blur kernels and noise distributions from real LR images in the source domain which are used to degrade HR images in the target domain. This enables training of a GAN based SR model which is shown to perform better on real LR images.
However, a key limitation of this method is that it does not address the compression artifacts often found in real-world images.
Face Super-Resolution Face SR is a SR technique specialized for reconstruction of face images. One of the first methods for face SR was proposed by Baker and Kanade [3]. This method reconstructed face details by searching for the optimal mapping between LR and HR patches. More recent work relies on deep learning based methods with CNNs and GANs. Dahl et al. [13] use pixel recursive learning with two CNNs to synthesize realistic hair and skin details. Chen et al. [11] combine face SR and face alignment to achieve previously unseen PSNR values. By searching the latent space of a generative model for images that downscale correctly, Menon et al. [37] are able to create face images of high resolution and perceptual quality. However, the problem with this approach is that the generated faces are often far from the true identity of the actual person, as illustrated in Figure 2. Additionally, none of the above-mentioned methods are robust against noise or other corruptions in the input images [19].
There are very few publications available in the literature which address the problem of RWSR of face-images [19]. Furthermore, the few existing face RWSR methods are only compatible with LR images that have been squared to 16 × 16 pixels, meaning that the reconstructed image will be only 64 × 64 or 128 × 128 pixels depending on the scaling factor [7,12,8]. Hence, these models cannot perform true SR directly on the LR images, which limits the actual usefulness of the existing face SR models. In contrast, our work presents one possible solution for ×4 RWSR of face images of arbitrary sizes, which we evaluate on real LR face images from surveillance cameras without any prior re-scaling.

The Proposed Framework
This section describes our two-step framework for RWSR. The first step aims to generate LR images from clean HR images in the target domain Y, such that these have similar image characteristics as the ones in the source domain X. The second step involves training a SR model on the constructed paired data, and optimizing for perceptual quality.
Figure 2: An example of SR of a real low-quality face image from the Chokepoint DB [48]. Panels: Original, PULSE [37], Ours. It can be seen that the PULSE [37] method changes the identity of the person, while our method preserves the identity and enhances details.

Novel Image Degradation
Traditional approaches for SR assume that a LR image $I_{LR}$ is the result of a downscaling operation of the corresponding HR image $I_{HR}$ using some kernel $k$ and scaling factor $s$, namely:
$$I_{LR} = (I_{HR} * k)\downarrow_s \qquad (1)$$
However, real LR images from cameras are influenced by multiple other factors that degrade the image as well. The RealSR [22] framework tries to address this issue by considering realistic noise distributions and blur kernels in the downscaling process. However, we observe that real images from surveillance cameras are often also degraded with compression artifacts, which makes the RealSR framework perform poorly on such images. To this end, we extend the degradation framework from [22] to include JPEG compression artifacts in addition to estimation of realistic noise distributions and blur kernels. Thus, we extend the basic SR formulation from Equation 1, and assume that the following image degradation model was used to create $I_{LR}$:
$$I_{LR} = (I_{HR} * k)\downarrow_s + n + c \qquad (2)$$
where $k$, $s$, $n$, and $c$ denote the blur kernel, scaling factor, noise, and compression artifacts, respectively. $I_{HR}$ is unknown together with $k$, $n$, and $c$. In our degradation framework, we estimate the kernel and noise directly from the images in the source domain X. We build a pool of the estimated kernels and noise patches, which is used to generate corrupted LR images from clean HR images. Finally, we JPEG compress the images in order to create image pairs for training the SR model.
Figure 3: Comparison with SoTA methods for SR of a small face image (56 × 72 pixels) from the Chokepoint DB [48]. As visible, our method hallucinates more realistic face details than the existing methods.

Blur Kernel Estimation
For estimation of realistic blur kernels, we adopt the KernelGAN method by Bell-Kligler et al. [4]. This method estimates an image specific SR kernel $k_i$ using an unsupervised approach. More specifically, a GAN is trained to down-scale the input image in a way that best preserves the image patch distributions across scales. We estimate realistic blur kernels from training images in X to form a pool of kernels that can be used to degrade the HR images in Y.
Downsampling To create the downsampled image $I_D$ we randomly choose a blur kernel $k_i$ from the pool of estimated kernels and perform cross-correlation with images in Y. More formally the process is described as:
$$I_D = (Y_n * k_i)\downarrow_s \qquad (3)$$
where $I_D$ is the downscaled image, $Y_n$ is a HR image, $k_i$ refers to a kernel from the degradation pool $\{k_1, k_2, \cdots, k_m\}$ and $s$ is the scaling factor.
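The downsampling step can be illustrated with a few lines of NumPy. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names `blur_and_downsample` and `degrade_random` are hypothetical, the image is assumed to be a single-channel floating-point array, the reflect padding at the borders is an assumption, and the subsampling simply keeps every s-th pixel.

```python
import numpy as np


def blur_and_downsample(image, kernel, scale=4):
    """Cross-correlate a 2-D image with `kernel`, then keep every `scale`-th pixel."""
    kh, kw = kernel.shape
    height, width = image.shape
    # Reflect-pad so that the filtered image keeps the input size (assumption).
    padded = np.pad(
        image,
        ((kh // 2, (kh - 1) // 2), (kw // 2, (kw - 1) // 2)),
        mode="reflect",
    )
    blurred = np.zeros((height, width), dtype=np.float64)
    # Cross-correlation: the kernel is applied without flipping.
    for dy in range(kh):
        for dx in range(kw):
            blurred += kernel[dy, dx] * padded[dy:dy + height, dx:dx + width]
    return blurred[::scale, ::scale]


def degrade_random(image, kernel_pool, scale=4, rng=None):
    """Pick a random kernel from the pool and apply blur_and_downsample."""
    rng = np.random.default_rng() if rng is None else rng
    kernel = kernel_pool[rng.integers(len(kernel_pool))]
    return blur_and_downsample(image, kernel, scale)
```

For colour images the same function could be applied per channel. In practice the kernel pool would hold the kernels estimated from the source domain images rather than hand-made ones.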

Noise Estimation
For degradation with realistic image noise, we adopt the method from [10] to extract noise patches from the source images X. Here the assumption is that an approximate noise patch can be obtained from a noisy image by extracting an area with weak background and then subtracting the mean. We define two patches $p_i$ and $q_i^j$. We obtain $p_i$ by a sliding window approach across images in X, and similarly $q_i^j$ by scanning $p_i$. $p_i$ is considered a smooth patch if the following constraints are met:
$$|\mathrm{Mean}(q_i^j) - \mathrm{Mean}(p_i)| \leq \mu \cdot \mathrm{Mean}(p_i) \qquad (4)$$
and
$$|\mathrm{Var}(q_i^j) - \mathrm{Var}(p_i)| \leq \gamma \cdot \mathrm{Var}(p_i) \qquad (5)$$
where Mean and Var denote the mean and variance respectively, and $\mu$ and $\gamma$ are scaling factors. Different from [10] we add an additional constraint to ensure that saturated patches are not extracted:
$$\mathrm{Var}(p_i) \geq \phi \qquad (6)$$
where $\phi$ denotes a minimum variance threshold. If all constraints are satisfied, $p_i$ will be considered a smooth patch. We then create a pool of noise patches $n_i$ by subtracting the mean value from all valid $p_i$.
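The smooth-patch test can be sketched as follows. This is an illustration under assumptions rather than the code used in the experiments: the constraints are assumed to take the form used in [10], i.e. the mean and variance of every sub-patch must stay within the relative tolerances µ and γ of the mean and variance of the enclosing patch, and the extra constraint is assumed to be a lower bound φ on the patch variance. The function names, the non-overlapping stride, and the intensity scale on which φ is applied are hypothetical; the default values follow the implementation details reported later in the paper.

```python
import numpy as np


def is_smooth_patch(p, q_size=8, mu=0.1, gamma=0.25, phi=0.5):
    """Return True if patch `p` passes the mean, variance and saturation checks."""
    mean_p, var_p = p.mean(), p.var()
    # Assumed saturation check: reject patches with too little variance.
    if var_p < phi:
        return False
    # Scan p with non-overlapping sub-patches q (the stride is an assumption).
    for y in range(0, p.shape[0] - q_size + 1, q_size):
        for x in range(0, p.shape[1] - q_size + 1, q_size):
            q = p[y:y + q_size, x:x + q_size]
            if abs(q.mean() - mean_p) > mu * mean_p:
                return False
            if abs(q.var() - var_p) > gamma * var_p:
                return False
    return True


def extract_noise_patches(image, p_size=32, stride=32, **thresholds):
    """Slide a window over a 2-D image and return zero-mean noise patches."""
    patches = []
    for y in range(0, image.shape[0] - p_size + 1, stride):
        for x in range(0, image.shape[1] - p_size + 1, stride):
            p = image[y:y + p_size, x:x + p_size].astype(np.float64)
            if is_smooth_patch(p, **thresholds):
                patches.append(p - p.mean())
    return patches
```

A patch that contains an edge fails the mean check because its sub-patches differ strongly from the patch mean, while a perfectly flat (saturated) patch fails the variance check.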

Degradation with Noise
We degrade the LR images by injecting real noise patches from the noise pool. For better regularization of the SR model we randomly pick a noise patch from the noise pool and inject it into the LR image during training. The downscaled and noisy LR image $I_N$ is created as follows:
$$I_N = I_D + n_i \qquad (7)$$
where $I_D$ is a downscaled image, and $n_i$ is a noise patch from the noise pool $\{n_1, n_2, \cdots, n_l\}$.
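Noise injection amounts to adding a randomly chosen zero-mean patch to the downscaled image. The sketch below is a hedged illustration: the name `inject_noise` is hypothetical, the random crop of the noise patch and the clipping to the range [0, 255] are assumptions that are not stated in the text, and the LR patch is assumed to be a single-channel float array no larger than the noise patch.

```python
import numpy as np


def inject_noise(lr_patch, noise_pool, rng=None, max_value=255.0):
    """Add a randomly selected noise patch to a downscaled 2-D LR patch."""
    rng = np.random.default_rng() if rng is None else rng
    noise = noise_pool[rng.integers(len(noise_pool))]
    height, width = lr_patch.shape
    # Randomly crop the noise patch to the LR patch size (assumption).
    top = rng.integers(noise.shape[0] - height + 1)
    left = rng.integers(noise.shape[1] - width + 1)
    noisy = lr_patch + noise[top:top + height, left:left + width]
    # Keep the result inside the valid intensity range (assumption).
    return np.clip(noisy, 0.0, max_value)
```

Because a new patch is drawn on every call, the same LR image can receive different noise at each training iteration.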

Degradation with Compression artifacts
Finally, we introduce compression artifacts to the LR training images to close the domain gap between these and the real JPEG compressed LR images in the source domain X. As there is no way of determining the compression strength of existing JPEG images, we empirically compare images from X to similar images with different JPEG compression strengths applied, and find that a compression strength of 30 results in similar compression artifacts.

Backbone Model
We base our SR model on the ESRGAN [45], which is one of the SoTA networks for perceptual SR with ×4 upscaling, and train it on the paired LR and HR images generated with our degradation framework. Different from the SRGAN [32], the ESRGAN uses Residual-in-Residual Dense Blocks (RRDBs) in the generator network and the discriminator predicts the relative realness instead of an absolute value. Additionally, the ESRGAN removes the batch normalization layers used in SRGAN.
Loss Functions While traditional supervised SR models are trained with pixel loss to minimize the Mean Squared Error (MSE) between the reconstructed HR image and the GT image, we rely on loss functions that maximize the perceptual quality. The original ESRGAN [45] model uses several different loss functions during training. More specifically, the generator uses adversarial loss $L_{adv}$ [18] in combination with VGG perceptual loss $L_{vgg}$ [24] and pixel loss $L_{pix}$, while the discriminator uses a VGG-128 [41] loss. However, we find that this combination of loss functions is not ideal for high perceptual quality. Following the work of [22], we first exchange the VGG-128 [41] discriminator with a PatchGAN discriminator from [53] to reduce the amount of artifacts in the reconstructed images. Different from the VGG-128 discriminator, the PatchGAN discriminator, with loss $L_{patch}$, has a fully convolutional structure and only penalizes structure differences at the scale of patches to determine if an image is real or fake. For optimization of the generator, the loss from all patches is averaged and fed back to the generator. Continuing this track, we seek to also replace the VGG-loss in the generator. Inspired by [23], we find that using the LPIPS perceptual loss $L_{lpips}$ [52] results in less noise and richer textures compared to using VGG-loss for the generator. This is mainly because the VGG network is trained for image classification, while LPIPS is trained to score image patches based on human perceptual similarity judgements. The LPIPS perceptual loss is formulated as:
$$L_{lpips} = \sum_k \tau^k\left(\phi^k(I_{gen}) - \phi^k(I_{gt})\right) \qquad (8)$$
where $I_{gen}$ is a generated image, $I_{gt}$ is the corresponding GT image, $\phi$ is a feature extractor, and $\tau$ is a transformation from embeddings to a scalar LPIPS score. The score is computed from $k$ layers and averaged. In our implementation of LPIPS we use the pre-trained AlexNet model provided by the authors.
In total, our full training loss for the generator is as follows:
$$L_{generator} = \lambda_{pix} \cdot L_{pix} + \lambda_{adv} \cdot L_{adv} + \lambda_{lpips} \cdot L_{lpips} \qquad (9)$$
where $\lambda_{pix}$, $\lambda_{adv}$ and $\lambda_{lpips}$ are scaling parameters.
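As a small worked example, the weighted sum for the generator can be written as a plain function. The function name `generator_loss` is hypothetical, the three inputs are assumed to be scalar loss values that have already been computed elsewhere, and the default weights are the values reported in the implementation details (0.01, 0.005 and 0.001).

```python
def generator_loss(l_pix, l_adv, l_lpips, w_pix=0.01, w_adv=0.005, w_lpips=0.001):
    """Weighted sum of the pixel, adversarial and LPIPS loss terms."""
    return w_pix * l_pix + w_adv * l_adv + w_lpips * l_lpips
```

With these weights, illustrative loss values of 1.0, 2.0 and 3.0 contribute 0.01, 0.01 and 0.003 respectively, which shows how strongly the scaling parameters rebalance the individual terms.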

Datasets
This section describes the datasets used for training and testing. For our experiments on real LR face images from surveillance cameras we use the Chokepoint Dataset [48] as our source domain images X. This dataset contains images of 29 different persons captured with three cameras in a real-world surveillance setting. All images have a resolution of 800 × 600. We use a face detection algorithm to extract the faces from the images, and randomly split the dataset, to obtain 72,282 images for training and 3,805 images for testing. The average resolution of the cropped faces is ≈ 92 × 92. We only use the Chokepoint training images to estimate realistic blur kernels and noise distributions for our degradation framework, and not for direct training of our SR model.
For the target domain of high-quality face images Y, we combine 571 face images from the SiblingsDB [44], 8,040 face images from the Radboud Faces Database [30] and 5,000 randomly selected face images from the FFHQ database [25] for a total of 13,611 images. Both the SiblingsDB and the Radboud Faces Database contain portrait face images professionally captured in a studio setting with controlled lighting. The face images from the FFHQ are more diverse in appearance and in the ethnicity of the subjects. We augment all images in the target domain by downsampling by 25%, 50% and 75% with bicubic downscaling to obtain a more diverse dataset. We then apply our degradation framework described in Section 3.1 on the images in Y to obtain LR/HR image pairs for training of our SR model.
For evaluation on artificially corrupted face images, we use the first 1,000 images from the FFHQ dataset. To generate LR/GT images we introduce three kinds of corruptions, namely, downsampling, sensor noise, and compression artifacts. For downsampling, we randomly choose a kernel from our blur kernel pool. For modeling of sensor noise we follow the protocol from [34] and use pixel-wise independent Gaussian noise, with zero mean and a standard deviation of 8. For compression artifacts, we convert the images to JPEG using a compression strength of 30.

Evaluation Metrics
Figure 4: Comparison with SoTA methods for ×4 SR of real low-quality face images from the Chokepoint DB [48]. Panels: Original, MZSR [42], EDSR [33], ESRGAN [45], USRNet [49], RealSR [22], DPSR [51], Ours. As visible, our method generates superior reconstructions over the existing methods for different faces.
Real-World Images Due to the nature of RWSR, no GT reference image exists, which makes it impossible to compare the different methods using traditional SR IQA methods, e.g. PSNR and Structural Similarity index (SSIM). To this end, we follow the non-reference based IQA evaluation protocol from the NTIRE2020 RWSR challenge [1]. In particular, we assess the image quality using NIQE [39], BRISQUE [38], PIQE [40], NQRM [36] and PI [5], where PI is a weighted score computed as $\frac{1}{2}((10 - \mathrm{NQRM}) + \mathrm{NIQE})$. However, these methods are known to correlate poorly with human ratings [1]. To address this issue, we supplement our evaluation protocol with MOR and NIMA [16], where NIMA is a learned metric based on human opinion scores, which can quantify image quality with high correlation to human perception. We use the pre-trained model for rating of the technical image quality. For the MOR, we ask the participants to rank the overall image quality of the SR results. To simplify the ranking, we only include the predictions of the top-5 methods based on NIMA scores. To avoid bias, the order of the methods is randomly shuffled. We average the assigned rank of each method over all images and participants to compute the MOR.
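The two aggregate scores used in the real-world evaluation, PI and MOR, are simple to compute once the underlying scores and rankings are available. The following sketch uses hypothetical function names and a hypothetical data layout: each ranking is assumed to be a dictionary that maps a method name to the rank given by one participant for one image, with 1 assumed to be the best rank.

```python
def perceptual_index(nqrm, niqe):
    """PI as half the sum of (10 - NQRM) and NIQE."""
    return 0.5 * ((10.0 - nqrm) + niqe)


def mean_opinion_rank(rankings):
    """Average the rank of each method over all (participant, image) rankings."""
    totals, counts = {}, {}
    for ranking in rankings:
        for method, rank in ranking.items():
            totals[method] = totals.get(method, 0) + rank
            counts[method] = counts.get(method, 0) + 1
    return {method: totals[method] / counts[method] for method in totals}
```

For example, a method ranked first in two rankings and second in a third obtains a MOR of 4/3, so under the assumed convention a lower MOR indicates a stronger preference.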
Artificially Corrupted Images For our experiments on artificially corrupted images we evaluate the performance using three conventional IQA methods, PSNR, SSIM, and the later Multi Scale Structural Similarity index (MS-SSIM) [46]. However, these metrics focus more on signal fidelity rather than perceptual quality [6]. As our method is optimized towards perceptual quality, we also include three of the most recent full-reference metrics targeting perceptual quality, namely Normalized Laplacian Pyramid Distance (NLPD) [31], LPIPS [52], and Deep Image Structure and Texture Similarity (DISTS) [14].
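To make the distinction between signal fidelity and perceptual quality concrete, PSNR depends only on the mean squared error between the reference and the reconstruction. The sketch below uses the standard PSNR definition; the function name `psnr` and the assumed peak value of 255 for 8-bit images are illustrative choices and not taken from the evaluation code.

```python
import numpy as np


def psnr(reference, reconstruction, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two images of equal shape."""
    diff = reference.astype(np.float64) - reconstruction.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)
```

Since only the pixel-wise error enters the score, an over-smoothed reconstruction can reach a high PSNR while lacking the texture that the perceptual metrics NLPD, LPIPS and DISTS are designed to capture.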

Experiments and Results
Implementation Details We perform all our experiments with a scaling factor $s = 4$. For our SR model we jointly train the generator and discriminator for 400K iterations with a batch size of 16. We initialize the weights from the PSNR optimized RRDB model from [45]. We use LR patches of size 32 × 32, and empirically set $\lambda_{pix}$, $\lambda_{adv}$ and $\lambda_{lpips}$ to 0.01, 0.005 and 0.001 respectively. For noise estimation we set the size of $p_i$ to match the LR patch size and the size of $q_i^j$ to 8. Similar to [10] we set $\mu$ and $\gamma$ to 0.1 and 0.25 respectively. We empirically set the minimum variance threshold $\phi$ to 0.5. For degradation with compression artifacts we JPEG compress the LR training images with a strength of 30 during training, with a probability of 0.9 for better regularization of the SR model.

Comparison with State-of-the-Art
We did not find any other ×4 face image specific RWSR methods in the literature. Instead, we compare our method to bicubic upscaling, as well as to different groups of SoTA super-resolution methods, including two generic SR models (ESRGAN [45], EDSR [33]), one SR method for arbitrary blur kernels (DPSR [51]), and three real-world SR models (MZSR [42], USRNet [49], and RealSR [22]).
For a fair comparison, we adjust the competing models for optimal performance. For MZSR [42], which is an unsupervised method, we enable back-projection with 10 iterations and set a noise level of 0.5. For DPSR [51], we use the pre-trained DPSRGAN model with settings for real-world images. With USRNet [49] we set the noise value to 15 for best results. The results for RealSR [22] are based on our re-implementation of the framework, as the training code was not available. We adapt the RealSR method to our face data for a fair comparison. For ESRGAN, we use the pre-trained weights provided by the authors to better illustrate the difference from our method.
Real-World Images In this experiment we evaluate the SR performance on LR face images from the Chokepoint testset. Quantitative results can be seen in Table 1. Qualitative results for multiple images are shown in Figure 4 while a close-up view of facial components can be seen in Figure 3. Our method clearly outperforms the other methods in terms of perceptual quality. However, while the traditional non-reference IQA methods (NIQE [39], BRISQUE [38], PIQE [40] and NQRM [36]) fail to capture this, scores from the more recent NIMA [16] method correlate well with human perception, which is also backed by our MOR rankings. This shows that the traditional IQA metrics are not ideal for judgement of the perceptual quality.
Artificially Corrupted Images This experiment evaluates the SR performance on artificially corrupted images from the FFHQ testset. We show quantitative results of all methods in Table 2. Qualitative results for multiple images are shown in Figure 5. Our method produces sharp and detailed images with few artifacts which closely resemble the GT images, which is also reflected in the quantitative results. Most noteworthy are the DISTS results, which are highly correlated with human perception of image quality. The results show that the reconstructed images produced by our method are superior in comparison to the other methods.

Ablation Study
We evaluate the effect of our proposed method for realistic image degradation and our improved ESRGAN based SR model in the same setting as described in Section 4.1. A qualitative comparison can be seen in Figure 6.
Baseline Here, we use kernel estimation and noise injection to generate training data for the ESRGAN with patch discriminator, similar to [22]. This SR model is fine-tuned to our face image dataset, and serves as our baseline. The resulting HR images contain unpleasing noise and lack detail.
Compression Artifacts In this setting, we add JPEG compression artifacts to the LR images during training of the baseline model. This results in more noise-free reconstructions compared to the baseline.
LPIPS loss Here, we use the LPIPS loss function for the generator instead of VGG-loss, combined with the addition of compression artifacts. When the baseline model is retrained under these settings the resulting reconstructions become sharper with better texture and details.
While our method produces reconstructed faces of better visual quality than the compared SoTA methods, it does not solve the problem of RWSR of face images. Figure 7 shows several failure cases of our method. These occur when the input image is severely corrupted, e.g. by motion blur or harsh lighting, or when it is out of focus. In these cases, our method might only super-resolve some parts of the face, e.g. a single eye, or even hallucinate unrealistic facial features.

Conclusion
In this paper, we have presented a novel framework for RWSR, which we have evaluated on low-quality face images from surveillance cameras, and artificially corrupted face images. Our method shows SoTA performance in both cases, which is achieved by using LPIPS-loss and making the SR model robust against the most common degradation types present in real LR images. Moreover, our model is the first to perform SR on real LR face images of arbitrary sizes, which makes it useful for practical applications. In the future, even better reconstructions could possibly be obtained by including more image degradation types in the framework e.g. chromatic aberration.