Continuous digital zooming using generative adversarial networks for dual camera system

This paper presents a generative adversarial network (GAN) combined with a patch match algorithm to realize high-quality digital zooming using two camera modules with different focal lengths. In a dual camera system, the shorter-focal-length module produces a wide-view image with low resolution, while the longer-focal-length module produces a tele-view image via optical zooming. The long-focal image contains more detail than the short-focal image and can therefore guide the short-focal image in reconstructing its high-frequency content. First, a feature extraction block (FEB) is proposed to extract features of the long-focal and short-focal images at different resolutions. Next, a patch match algorithm is integrated into the convolutional neural network (CNN) to fuse the information of the long-focal image with the short-focal image and generate a new fused image. Finally, the fused image and the short-focal image are merged with a feature fusion block (FFB) to predict the high-resolution image. In addition, a generative adversarial network is used to filter the information integrated by the previous network and output the zoomed image. Extensive experiments on benchmark datasets show that our algorithm achieves favorable performance against state-of-the-art methods.


INTRODUCTION
It is well known that a high-resolution (HR) image offers more detail than its low-resolution (LR) counterpart. These details are critical in various fields, such as remote sensing [1], medical diagnostics [2], intelligent surveillance [3], and so on. A continuous zooming system can continuously change the spatial resolution of an acquired image. An optical zooming system changes the spatial resolution by moving the lens back and forth along the optical axis. As an economical alternative, digital zooming has a wide range of application areas in many imaging devices [4]. However, digital image processing algorithms such as image interpolation used in digital zooming systems generate jagging and blurring artifacts that significantly degrade the quality of the resulting images. To solve this problem, a number of modified algorithms have been proposed in the past few decades [5,6].
Existing digital zooming methods increase the spatial resolution of the input image using either interpolation or super-resolution. Interpolation-based restoration approaches [7] aim to find the connections among neighboring pixels and fill in the missing pixels via a base function or an interpolation kernel. Though fast and of low computational complexity, this step-by-step forward approach fails to guarantee the precision of the estimation, especially in the presence of noise.
To break the limitations of interpolation-based approaches, super-resolution (SR) methods [8][9][10][11] develop image priors and learn mapping relationships by sampling from a database containing a large number of HR and LR image pairs. However, they require an additional database and must reconstruct patches to build the HR image. A dual camera system consists of two fixed-focus cameras with a common optical axis or parallel optical axes. The two optical systems have different optical parameters, such as imaging focal length, exposure time, and sensor resolution, so images of different resolutions are acquired in the dual camera system.

FIGURE 1 Dual camera system

Figure 1 shows a classic dual camera system, in which imaging device one and imaging device two share the same optical axis. Light is split by a dichroic prism and imaged onto the two imaging devices.
In a digital zooming system, the long-focal image and the short-focal image are used to enlarge the image. To fuse the details of the long-focal image with the structure of the short-focal image, it is essential to match the two images. Figure 2 shows a conceptually simplified digital zooming system under the assumption that the two images share the same geometric parameters except for the focal length. To register the two images, the short-focal image is first cropped to a suitable size and then up-sampled to the same size as the long-focal image. However, it is difficult for neural networks to register two different images, which is important for multi-image super-resolution. Hence a patch match algorithm is introduced to match the two images before passing them into the neural network. The long-focal image and the up-sampled image are then fed to the digital zooming system to generate a high-resolution image that contains the details of the long-focal image and the structure of the short-focal image.
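The crop-and-upsample registration step described above can be sketched as follows. This is a minimal single-channel sketch under stated assumptions: the function name is ours, nearest-neighbour upsampling stands in for the bicubic interpolation used in the paper, and we assume the two views share one optical axis with the given focal-length ratio.

```python
import numpy as np

def register_short_focal(short_img: np.ndarray, ratio: float) -> np.ndarray:
    """Crop the centre 1/ratio region of the short-focal (wide) image and
    upsample it back to the original size, so it covers the same field of
    view as the long-focal (tele) image."""
    h, w = short_img.shape[:2]
    ch, cw = int(h / ratio), int(w / ratio)
    top, left = (h - ch) // 2, (w - cw) // 2
    crop = short_img[top:top + ch, left:left + cw]
    # nearest-neighbour upsample back to (h, w) for brevity
    rows = np.arange(h) * ch // h
    cols = np.arange(w) * cw // w
    return crop[rows][:, cols]

wide = np.arange(64, dtype=np.float32).reshape(8, 8)
aligned = register_short_focal(wide, 2.0)
print(aligned.shape)  # (8, 8)
```

After this step the two images have the same size and (approximately) the same field of view, which is the precondition for the patch match stage.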
Actually, if the long-focal and short-focal images are simply concatenated and fed into a convolutional neural network (CNN), the network will not use the information of the long-focal image to restore the short-focal image, and the whole network behaves like a single-image super-resolution network, because the two images are not aligned and the network cannot handle this. Previous multi-frame super-resolution CNNs used a registration module to align images. But in a digital zooming system the two images differ in scene coverage, so registration alone cannot use the long-focal image to restore short-focal regions that are absent from the long-focal image. A patch match algorithm is therefore introduced to divide the whole image into many patches and find the most similar patch in the other image. In this way, this paper constructs a reference image, aligned to the short-focal image, from the long-focal image, and achieves good results.
Although the experiment is based on the digital zooming system of a dual camera with a common optical axis, the method can be implemented on any dual camera system. The implementation depends on the patch match algorithm, which registers the images: a similar patch is searched for over the whole image, and the most similar patch is taken as the reference.
Generally, the short focal length is S and the long focal length is L. The target magnification ratio is β, with 1 < β ≤ L∕S. The major contributions of the proposed method include:

• The paper proposes a generative adversarial network with a patch match algorithm (MatchGAN) for high-quality image digital zooming tasks. The patch match algorithm is integrated into the network, which uses the detailed information of the long-focal image to repair the corresponding area of the short-focal image; the structure of the neural network reduces the image size to speed up the matching.

• The paper proposes a feature extraction block (FEB) and a feature fusion block (FFB) to extract image features for patch matching and to fuse hierarchical features at different layers of our network. To prevent image information from being lost in the deep network, feature fusion combines shallow and deep features.

• Extensive experiments on several benchmark datasets demonstrate that MatchGAN achieves superior results compared with other state-of-the-art algorithms.

RELATED WORKS
At present, the continuous zoom scheme of the dual camera system is based on traditional pixel interpolation techniques, such as bicubic, or on single-frame image super-resolution based on machine learning or convolutional neural networks.

FIGURE 2 A digital zooming system. Here the long- and short-focal images are ×2.0 and ×1.0 zoomed, and the result is the zoomed image

Image super-resolution
Image super-resolution aims to reconstruct high-resolution images from low-resolution images. Mathematically, single-image super-resolution reconstruction is a severely ill-posed inverse problem, mainly because the pixel count of the desired HR image is larger than that of the degraded LR image. Inspired by manifold learning methods, Chang et al. [12] proposed a super-resolution method based on an embedding-based scheme. This approach assumes that the LR and HR image patches share similar local geometry in two distinct feature spaces. Pandey et al. [13] classified super-resolution methods based on the priors used and discussed image priors and regularization in detail. But these are traditional methods, and their results are worse than those of deep-learning-based methods on both visual and objective evaluation indicators.
Recently, deep-learning-based methods have achieved dramatic advantages over conventional methods. In 2014, Dong et al. [14] proposed SRCNN, the first end-to-end neural network for SR, which establishes a mapping between LR images and HR images. However, the network has only three layers, making it difficult to learn complex structures; its low convergence rate and single-scale operation are further limitations. In 2016, Kim et al. proposed VDSR [15], which stacks small filters and uses residual learning to accelerate training and improve performance; it is one of the first methods to utilize residual learning for SR. Because it learns the residual image (the difference between the ground truth and the bicubic-interpolated image) and has relatively large depth, VDSR achieves superior results over SRCNN. However, as network depth grows, vanishing and exploding gradients become the main obstacles to performance. This problem was addressed when He et al. [16] proposed the residual network (ResNet). Additionally, the efficient sub-pixel convolution layer was proposed in ESPCN [17] to upscale the final LR feature maps into the HR output, mitigating the over-smoothing and blurring caused by working in the original LR space. However, these methods tend to produce blurry results because of their loss functions.
To solve this problem, Goodfellow [18] proposed the generative adversarial network (GAN), training a generator to produce results and a discriminator to distinguish whether a result is real. As a consequence, image quality improved greatly and image details were enriched. Since then, GANs have been increasingly applied to various image processing fields [19][20][21].

Patch match
In 2003, Criminisi et al. [22] presented a classical method to remove a large object based on the computed priority and similarity of patches while preserving important texture and structure information. Afterward, many patch match methods were proposed. Sun et al. [23] proposed a global structure propagation method that connects the geometric structure of the whole image manually and then propagates the known texture to the unknown region. Based on this, Li and Zhao [24] proposed an automatic structure completion method that avoids manual intervention and improves the efficiency of the algorithm. Wong et al. [25] proposed a non-local means approach utilizing multiple samples in an image to obtain more similar source patches. In 2009, Barnes et al. [26] presented the seminal PatchMatch work, introducing an efficient way to find dense correspondences across images for structural editing; they used a large number of random samples to obtain good initial guesses. Due to its effectiveness in pruning the search space, many researchers [27][28][29][30] extended this approach and applied it to multiple domains. Korman and Avidan [30] incorporated the idea of image coherence into Locality Sensitivity Hashing and significantly improved the speed. He and Sun [29] combined KD-trees with patch match to perform more efficient neighbor matching. In particular, ref. [28] exploits patch match to overcome the infeasibility of searching over a continuous output space.

Application in dual camera system
To obtain a high-quality image, dual camera systems have been developed for many applications such as image denoising, super-resolution, and object tracking. In 2016, Kim et al. [31] proposed an object tracking method using an asymmetric dual camera system. They estimated the depth map between the wide- and tele-view images and used it to calculate the velocity of the object in the region of interest. Hicks et al. [32] presented an image denoising method for low-light conditions with a dual camera system. They used the first camera to acquire a monochrome image with high sensitivity, spatial resolution, and dynamic range. The second camera, composed of RGB and near-infrared pixels (replacing the green pixel of the Bayer pattern with a near-infrared pixel), provides the chrominance information.
In 2017, Yu et al. [33] presented a theoretical basis for realizing high-quality digital zooming using two camera modules with different focal lengths. They fused wide- and tele-view images by performing variational image restoration using the estimated PSF. Moon et al. [34] proposed a super-resolution method using a dual camera system. They fused wide- and tele-view images by using the ideal high-frequency component estimated from the optically zoomed image. However, the boundary region of the zoomed wide-view image, which has no counterpart scene in the tele-view image, cannot be restored. Ma et al. [35] proposed a band-inpainting algorithm to fuse wide- and tele-view images in the frequency domain. They reconstructed band information with super-resolution methods and introduced an image inpainting algorithm to reconstruct poorly super-resolved texture areas to produce photo-realistic digital zooming images. But these methods are based on traditional image restoration algorithms and cannot restore structures as fine as those produced by neural-network-based methods. He et al. [36] proposed a digital zooming method using a patch match algorithm for wide- and tele-view image fusion. They used Markov random field (MRF) models to match features of every layer in a convolutional neural network with continuous iterative optimization. This algorithm spends too much time on feature matching, regions not covered by the long-focal image cannot be repaired this way, and the experimental results improved little on objective evaluation indices and visual quality. In our method, MatchGAN matches image features only once and constructs a reference image for the whole short-focal region.

GENERATIVE ADVERSARIAL NETWORKS WITH PATCH MATCH ALGORITHM
Our task is to fuse the long-focal image with the short-focal image and generate a new high-resolution image of intermediate focal length that keeps the important appearance details of the scene. It is challenging to implement such image fusion in an end-to-end model. In this paper, a generative adversarial network with a patch match algorithm (MatchGAN) is proposed to address this. Firstly, a feature extraction block (FEB) is introduced to extract image features for matching. Secondly, in the middle of the network, a patch match algorithm matches the short-focal features with the long-focal features. Lastly, a feature fusion block (FFB) is proposed for image reconstruction and outputs the final image. In addition, a generative adversarial network (GAN) is used to train the network in order to generate more realistic images. Another reason this paper uses GAN is that it can be interpreted as patch correspondence [37]; thus, in MatchGAN, a discriminator is used to improve the matching of long-focal and short-focal features and to add image detail. The overall framework of the proposed MatchGAN is shown in Figure 3.
It is difficult to both match images and perform super-resolution in a single end-to-end network, so MatchGAN combines a traditional algorithm with a neural network to solve this problem. In this paper, a short-focal image S and a long-focal image L are integrated to generate a result I that matches the structure of S and the details of L.

Network structure
As shown in Figure 3, MatchGAN mainly consists of three parts: the feature extraction, image match, and image restoration stages. Compared with previous CNN-based methods, our CNN must extract both low-level and high-level features and combine them for patch match, so the feature extraction block is designed to aggregate multi-level features of the different images. In addition, the short-focal features are fused with the patch-matched features for the subsequent restoration stage, which differs from single-image super-resolution, and a dedicated image restoration structure is presented. Before being input into the network, the wide-view image is enlarged by β times using bicubic interpolation and cropped to the desired size; the tele-view image is resized to the desired size by interpolation directly. Two FEBs are used to extract the image features and decrease the image size before the patch match is conducted. The two branches of the feature extraction stage extract the features of the long-focal and short-focal images, respectively.

Feature extraction
As shown in Figure 4, our FEB contains local feature extraction, local residual learning, and a down-sampling layer, leading to a contiguous memory mechanism. Feature extraction is applied to the input to extract features. As analyzed above, the first convolution layer extracts feature F_1 from the input by

F_1 = σ(Conv_1(F_pre)),

where Conv_i(⋅) denotes a convolution operation with stride i, σ denotes the ReLU [38] activation, and F_pre is the output of the previous layer. F_1 is used by the further layers to extract features:

F_n = σ(Conv_1(F_{n−1})),

where n denotes the n-th convolution operation with stride one. Local residual learning is applied after feature extraction. As there are several convolution layers before the residual learning, the input is passed through one convolution layer to keep the same dimensions and is then added on, so the final output of the residual learning is

F_{d_n} = F_n + Conv_1(F_pre),

where d_n denotes the n-th level of feature extraction. Down-sampling is used to reduce the image size so that MatchGAN can decrease the running time of the patch match algorithm. Convolution with stride two is used to reduce the image size:

F_d = Conv_2(F_{d_n}),

where Conv_2(⋅) denotes a convolution operation with stride 2 and F_d denotes the output of the FEB. All convolutions in our FEB are 3 × 3. In MatchGAN, image features are extracted by the two FEBs and then used for patch match.
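The FEB computation above can be sketched in plain numpy. This is a minimal single-channel sketch under stated assumptions: the helper names (`conv2d`, `feb`) and the random kernels are ours, and the real model is a multi-channel TensorFlow network.

```python
import numpy as np

def conv2d(x, k, stride=1):
    """'Same'-padded 2-D convolution of a single-channel map x with a
    square odd-sized kernel k, at the given stride."""
    p = k.shape[0] // 2
    xp = np.pad(x, p)
    h, w = x.shape
    out = np.empty(((h + stride - 1) // stride, (w + stride - 1) // stride))
    for i in range(0, h, stride):
        for j in range(0, w, stride):
            out[i // stride, j // stride] = np.sum(
                xp[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

relu = lambda z: np.maximum(z, 0.0)

def feb(x, kernels, k_skip, k_down):
    """Feature extraction block sketch: stacked stride-1 conv+ReLU layers,
    a local residual connection (the input passed through one convolution
    to match dimensions), then a stride-2 convolution that halves the
    spatial size to speed up patch match."""
    f = x
    for k in kernels:                    # F_n = ReLU(Conv_1(F_{n-1}))
        f = relu(conv2d(f, k))
    f = f + conv2d(x, k_skip)            # local residual learning
    return conv2d(f, k_down, stride=2)   # F_d = Conv_2(...)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
ks = [rng.standard_normal((3, 3)) * 0.1 for _ in range(3)]
out = feb(x, ks, rng.standard_normal((3, 3)) * 0.1,
          rng.standard_normal((3, 3)) * 0.1)
print(out.shape)  # (8, 8)
```

Note how the stride-2 convolution at the end halves each spatial dimension, which is what later makes the patch match stage cheaper.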

Image match
With the features of the two images, which may differ in appearance but share a similar semantic structure, a mapping from the long-focal features to the short-focal features is needed to generate a new image with the short-focal structure and the long-focal details. In this way, any image area can be mapped from the long-focal image to the short-focal image, which means the input images can be taken on different axes: as long as the long-focal image has areas similar to the short-focal image, the results will be improved. Based on ref. [39], a nearest-neighbor field (NNF) φ between the two feature maps is defined by

φ(p) = arg min_q Σ_{x∈N(p), y∈N(q)} ‖F_A(x) − F_B(y)‖²,

where N(p) is the patch around p and F(x) is a vector containing all channels of the feature layer at position x. The patch size is set to 8 × 8. For each patch around pixel p in the short-focal features A, its nearest-neighbor position is found in the long-focal features B. Normalized features are used in the patch similarity metric, and random search and propagation are adopted over the feature maps. For images, NNFs are typically constant or linearly varying over large areas, so random initialization followed by random search is used until convergence.
The NNF is initialized at random; most of the initial offsets are useless, but a certain number will be optimal or near-optimal, and these are quickly diffused to the rest of the image in the propagation phase. The image is then raster-scanned top-down and left-to-right. MatchGAN uses the offsets of nearby patches as alternative estimates for the current patch and selects the best one.
To avoid getting trapped in bad local minima, random search is used when seeking the most similar patch. Besides, it is time-consuming to search all image patches, especially for high-spatial-resolution images; therefore, MatchGAN adds a bounded two-dimensional random offset after every search. Experiments in ref. [26] show that patch match typically converges to a near-optimal NNF in fewer than 10 iterations, so the offset threshold is set to 1∕10 of the long-focal image size. After the final NNF is obtained, the image I can be reconstructed by patch-by-patch aggregation:

I(p) = (1/n²) Σ_{q : p∈N(q)} B(φ(q) + p − q),

where φ denotes the final NNF mapping and n is the patch size of N: every pixel averages the matched patches that overlap it. Patch match is time-consuming, so convolution with stride two is used in the feature extraction stage to decrease the image size; the image match stage is then faster because of the smaller image size.
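A minimal sketch of the nearest-neighbor field and the patch-by-patch aggregation may clarify the idea. Brute-force search over raw values stands in for the full PatchMatch loop of random initialization, propagation, and random search, and the feature normalization is omitted; all names are ours.

```python
import numpy as np

def nnf_brute_force(A, B, n=3):
    """For each n*n patch in map A, find the offset of the most similar
    n*n patch in B (L2 distance). The real algorithm replaces this
    exhaustive search with random init, propagation, and random search."""
    ha, wa = A.shape
    hb, wb = B.shape
    nnf = np.zeros((ha - n + 1, wa - n + 1, 2), dtype=int)
    for i in range(ha - n + 1):
        for j in range(wa - n + 1):
            pa = A[i:i + n, j:j + n]
            best, best_d = (0, 0), np.inf
            for y in range(hb - n + 1):
                for x in range(wb - n + 1):
                    d = np.sum((pa - B[y:y + n, x:x + n]) ** 2)
                    if d < best_d:
                        best, best_d = (y, x), d
            nnf[i, j] = best
    return nnf

def reconstruct(B, nnf, n=3):
    """Patch-by-patch aggregation: every pixel averages the matched
    patches that overlap it."""
    h, w = nnf.shape[0] + n - 1, nnf.shape[1] + n - 1
    acc = np.zeros((h, w))
    cnt = np.zeros((h, w))
    for i in range(nnf.shape[0]):
        for j in range(nnf.shape[1]):
            y, x = nnf[i, j]
            acc[i:i + n, j:j + n] += B[y:y + n, x:x + n]
            cnt[i:i + n, j:j + n] += 1
    return acc / cnt

A = np.arange(36, dtype=float).reshape(6, 6)
I = reconstruct(A, nnf_brute_force(A, A))
print(np.allclose(I, A))  # matching an image against itself reproduces it
```

The sanity check at the end illustrates the aggregation formula: when every patch matches itself exactly, averaging the overlapping matched patches reproduces the original map.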

Image restoration
An image filled with long-focal details and short-focal features is received from patch match. But the long-focal details cannot match the short-focal image perfectly, because the short-focal image covers more area than the long-focal image; the long-focal image is only used to repair the areas of the short-focal image that resemble long-focal patches. MatchGAN therefore concatenates the patch match features with the short-focal features from the second FEB for the subsequent network to restore the high-resolution image. In the image restoration stage, a feature fusion block (FFB) is proposed to reconstruct the image. The structure of our FFB is shown in Figure 5. The last filter is a de-convolution layer with stride two; this step can be written as

F_r = DeConv_2(F_in),

where DeConv_2(⋅) denotes a de-convolution operation with stride 2, F_in is the input to the layer, and F_r is the upsampled output.
In the FEB, convolution with stride 2 is used to speed up patch match, but part of the image information essential for restoration is lost while the image size is decreased. To prevent this loss, MatchGAN adds the output of the FEB to the layer after the de-convolution, so that the whole matching network acts as a residual learning network. This operation accelerates network convergence and reduces the difficulty of training:

F_{f_n} = σ(Conv_1(F_{f_{n−1}})) + F_d,

where f denotes the f-th level of feature fusion and n denotes the n-th convolution operation with stride one. In the image restoration stage, two FFBs are used to restore the image. After the FFBs, the zoomed image is generated through three convolutions.
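The skip connection around the stride-2/de-convolution pair can be illustrated as follows. This is a single-channel sketch under stated assumptions: a fixed bilinear-style kernel stands in for the learned de-convolution filter, and we assume the skip comes from the full-resolution features before down-sampling so the shapes agree.

```python
import numpy as np

def deconv2x(x):
    """Stride-2 transposed convolution, reduced to its essential effect:
    zero-insertion upsampling followed by a small smoothing kernel
    (a stand-in for the learned de-convolution filter)."""
    h, w = x.shape
    up = np.zeros((2 * h, 2 * w))
    up[::2, ::2] = x
    k = np.array([[0.25, 0.5, 0.25],
                  [0.5,  1.0, 0.5],
                  [0.25, 0.5, 0.25]])
    pad = np.pad(up, 1)
    out = np.empty_like(up)
    for i in range(2 * h):
        for j in range(2 * w):
            out[i, j] = np.sum(pad[i:i + 3, j:j + 3] * k)
    return out

def ffb(f_matched, f_skip):
    """Feature fusion block sketch: upsample the matched features back to
    the pre-downsampling size and add a skip connection, so information
    lost by the stride-2 convolution can be restored."""
    return deconv2x(f_matched) + f_skip

small = np.ones((4, 4))       # half-resolution matched features
skip = np.zeros((8, 8))       # full-resolution skip features
fused = ffb(small, skip)
print(fused.shape)  # (8, 8)
```

The residual-style addition means the restoration layers only have to learn the correction on top of the skipped features, which is what the text credits for faster convergence.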

Discriminator
The discriminator network is used to distinguish real HR images from generated SR samples. The discriminator is composed of a series of convolution layers, ReLU [38], and batch normalization [40], as shown in Figure 6. Following the architecture of Ledig et al. [41], fewer filters are used, considering that the weight of the GAN loss is small compared with the regular term. The discriminator model contains six convolutional layers, and the resulting feature map is followed by a final sigmoid activation function to obtain a probability for sample classification.

Loss function

The l1 norm is used as one part of the loss function in MatchGAN, because the l1 loss does not over-penalize larger errors and has been shown to be more powerful for performance and convergence [42]:

L_1 = ‖G(S, L) − I_HR‖_1,
where G(S, L) is the generator output given the short-focal image S and the long-focal image L, and I_HR denotes the real HR image. It is well known, however, that the l1 loss produces blurry results on low-level vision problems, so a GAN discriminator is used to make our zoomed image more realistic.
The discriminator has a single output that classifies the zoomed image as real or fake; the adversarial term for the generator is

L_adv = −log D(G(S, L)),

where D denotes the discriminator, which identifies whether the input image is a real image.
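The two loss terms can be sketched as follows. The adversarial term is written in the standard non-saturating form of Goodfellow et al.; the paper does not spell out its exact formulation, so that part is an assumption, and all names are ours.

```python
import numpy as np

def l1_loss(sr, hr):
    """Pixel-wise l1 reconstruction term, L_1 = mean |G(S, L) - I_HR|."""
    return np.mean(np.abs(sr - hr))

def gan_losses(d_real, d_fake, eps=1e-12):
    """Standard non-saturating GAN terms on the discriminator's sigmoid
    outputs: the discriminator maximizes log D(real) + log(1 - D(fake)),
    while the generator minimizes -log D(fake)."""
    d_loss = -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))
    g_loss = -np.mean(np.log(d_fake + eps))
    return d_loss, g_loss

sr = np.array([0.2, 0.8])
hr = np.array([0.0, 1.0])
print(l1_loss(sr, hr))  # 0.2
d, g = gan_losses(np.array([0.9]), np.array([0.1]))
```

In training, the generator's total objective would combine the l1 term with a small weight on the adversarial term, matching the remark that the GAN loss weight is small compared with the regular term.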

Implementation details
In MatchGAN, the size of all convolutional kernels is set to 3 × 3, except the first layer in the FEB, whose kernel size is 5 × 5; the number of filters is shown in the figures. The convolution layers in the feature extraction and image restoration stages have 16 filters, and the filter numbers in Figure 2 are 16, 32, 16, and 3 from left to right. In the discriminator, the number of channels for all filters is 16, except the first, which is 3, and the last, which is 1. The convolutional filters are initialized using the method of He et al. [43]. For all convolution layers, this paper pads zeros on each side of the input to keep the size fixed. The final convolution layer has three output channels to produce a high-resolution color image; the network can also process grayscale images if this is changed to one. The patch size in the image match stage is set to 8 × 8. The experiment is based on a dual camera with a common optical axis. In the training and testing phases, MatchGAN uses the center area of the image as the long-focal image and down-samples the whole image as the short-focal image. For magnifications less than 2, the center 1∕2 area is used as the long-focal image; for magnifications between 2 and 4, the center 1∕4 area is used.
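The long-/short-focal pair construction described above can be sketched as follows. Box-filter down-sampling stands in for bicubic, the linear interpretation of the "1∕2 and 1∕4 area" crop rule is our assumption, and the function name is ours.

```python
import numpy as np

def make_pair(img, scale):
    """Build a (short-focal, long-focal) training pair from one image:
    a centre crop plays the long-focal (tele) view and a box-filtered
    downsample of the whole image plays the short-focal (wide) view,
    following the paper's protocol (centre 1/2 crop for scale < 2,
    centre 1/4 crop for scale in [2, 4])."""
    h, w = img.shape
    frac = 2 if scale < 2 else 4          # linear crop factor (assumed)
    ch, cw = h // frac, w // frac
    top, left = (h - ch) // 2, (w - cw) // 2
    long_focal = img[top:top + ch, left:left + cw]
    s = int(scale)
    short_focal = img[:h - h % s, :w - w % s] \
        .reshape(h // s, s, w // s, s).mean(axis=(1, 3))  # box downsample
    return short_focal, long_focal

img = np.arange(16 * 16, dtype=float).reshape(16, 16)
short, long_ = make_pair(img, 2)
print(short.shape, long_.shape)  # (8, 8) (4, 4)
```

The network then has to zoom `short` back up while borrowing detail from `long_`, whose field of view lies entirely inside it.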
DIV2K [44] is used as our training set. DIV2K consists of 800 training images, 100 validation images, and 100 test images. All of our models are trained on the 800 training images, with 10 validation images used during training. In each training batch, this paper randomly samples 16 patches of size 128 × 128. An epoch has 1000 iterations of back-propagation. LR images are generated by bicubic interpolation. We implement MatchGAN with the TensorFlow framework and update it with the Adam optimizer [45]. The learning rate is initialized to 2e−4 for all layers and decreased by a factor of 0.96 every 100 epochs. Following Ledig et al. [41], we alternate updates of the generator and discriminator networks.

EXPERIMENTS AND ANALYSIS
In this section, we first analyze the contributions of different components of the proposed network. We then compare our MatchGAN with state-of-the-art algorithms on four benchmark datasets.

Patch match

To validate the effect of the patch match structure, we remove the image match stage and double the filters of the last layer of the feature extraction stage to keep the parameter count the same as in MatchGAN. In Figure 7, compared with the second picture, MatchGAN without patch match generates more artifacts, because without patch match it cannot find the corresponding area for the reconstructed patch. In Table 1, the quantitative results show that patch match in the CNN leads to a moderate performance improvement.

GAN
Generally, GAN is associated with more realistic detail generation at the cost of lower PSNR and SSIM [46]. In ref. [37], it is interpreted as patch correspondence and applied to many image inpainting tasks. The results in Table 1 show that GAN can improve the performance of patch match; as a consequence, our network with GAN achieves higher PSNR and SSIM. Based on the results of ref. [37] and our own experiments, the GAN mechanism appears to match similar patches, but its specific principle needs further discussion and experimentation. When a similar area exists in the long-focal image, the network uses it to reconstruct the corresponding area of the short-focal image; when used purely for single-image super-resolution, it uses similar areas from the training set. That is why we achieve higher PSNR in Section 4.2. In Figure 7, we can see that MatchGAN generates more details than MatchGAN without GAN.

FIGURE 10 The "Board" image with an upscaling factor of 4

For digital zooming, we generate the test data by randomly sampling 128 × 128 patches from four different datasets: Set5 [50], Set14 [51], BSD100 [52], and Urban100 [53]. Among these datasets, Set5, Set14, and BSD100 consist of natural scenes; Urban100 contains many images with details at different frequencies. We compare the proposed MatchGAN with several state-of-the-art SR algorithms, including LapSRN [48] and IDN [49], and evaluate the SR images with two commonly used image quality metrics: PSNR and SSIM [46]. All experiments are implemented on a GTX 1080 Ti GPU, and we train all algorithms with the same strategy. Table 2 shows quantitative comparisons for 1.5×, 2×, 3×, and 4× SR. Our MatchGAN performs favorably against existing methods on most datasets. In particular, MatchGAN achieves higher SSIM values even when its PSNR is not the best among the algorithms, which shows that our feature fusion can fuse similar patches from different image regions.
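For reference, the two metrics can be sketched as follows. The SSIM here uses one global window instead of the usual local Gaussian windows, so it is only an approximation of the metric in ref. [46]; function names are ours.

```python
import numpy as np

def psnr(x, y, peak=255.0):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((x.astype(float) - y.astype(float)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def ssim_global(x, y, peak=255.0):
    """Single-window SSIM: the standard metric averages this statistic
    over local Gaussian windows; one global window keeps the sketch
    short."""
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

a = np.tile(np.arange(8.0), (8, 1)) * 30
noisy = a + 5.0   # constant offset of 5 grey levels
print(round(psnr(a, noisy), 2))  # 34.15
```

Note that a constant intensity shift hurts PSNR noticeably while leaving the structural (variance/covariance) part of SSIM untouched, which is why the two metrics can rank methods differently.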
In Figure 8, we show a visual comparison on Urban100 with a scale factor of 1.5; for clarity, we only show the three other best algorithms in the figures. Our method reconstructs clearer patterns on the stone brick. Besides, by matching image patches our approach suppresses the artifacts that usually occur in GANs; as shown in Figure 7, our results are also less artificial than those of the algorithm without patch match. In Figure 9, we can find clearer broken lines on the tiles. Generally, the more similar the scenes in the long-focal and short-focal images, the better the quality of the reconstructed image, because a reference image can be generated to guide the reconstruction. In Figure 10, we achieve higher PSNR and SSIM than the other methods, but in the center area (which we use as the long-focal image) no similar spot can be found, so the visual improvement of the reconstructed image is not obvious.

FIGURE 11 Results on real images (×2) in the red frame

FIGURE 12 The real image inputs. The left column is the wide-view image and the right column is the tele-view image, captured by an iPhone X. The red frame marks the super-resolution result shown next

FIGURE 13 Results on real images (×4) in the red frame

FIGURE 14 The real image inputs. The left column is the wide-view image and the right column is the tele-view image, captured by an iPhone X

Qualitative evaluations on real image
We further qualitatively evaluate the proposed algorithm against state-of-the-art methods on real images. To capture realistic tele- and wide-view image pairs, we use an iPhone X with its two camera lenses, mounted on a tripod. We define the images captured at the 28 mm focal length as the wide-view images and the images captured at the 51 mm focal length as the tele-view images. The two images have a certain displacement and parallax and are not perfectly aligned; as stated in Section 3.1.2, MatchGAN places no strict restrictions on the input images. The results in Figure 11 show that, with the input of Figure 12, the flower in the picture is more distinct in MatchGAN's result. Figure 13 shows that the state-of-the-art methods remain blurry with the input of Figure 14. In contrast, MatchGAN generates images with clearer detailed structures, which demonstrates that the proposed algorithm generalizes well.

LIMITATIONS
While our method is capable of generating high-quality HR images at a given scale, it is time-consuming: patch match is a slow algorithm. We use convolution layers with stride two as the basic structure of our model because they greatly decrease the running time of the patch match algorithm, but compared with other networks ours is still slower, and we aim to address this in future work. Additionally, we use the long-focal image to guide the short-focal reconstruction. When the scale factor is too large, such as 8× or 16×, the long-focal image is too small relative to the short-focal image to guide the reconstruction, and the concatenation structure introduces confusion into the original image, so our method is not suitable for large scale factors.

CONCLUSIONS
In this work, we proposed a deep convolutional network with a patch match algorithm for continuous digital zooming. Our model progressively predicts high-frequency residuals guided by the long-focal image. By exploiting the feature structure, we decrease the running time of the patch match algorithm, and the matched image can be concatenated with the original image for detail reconstruction. Extensive evaluations on benchmark datasets demonstrate that the proposed model performs favorably against state-of-the-art SR algorithms in terms of visual quality.