Image translation with dual-directional generative adversarial networks

Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between input images and output images. However, due to unstable training and limited training samples, many existing GAN-based works have difficulty producing photo-realistic images. Herein, dual-directional generative adversarial networks are proposed, which consist of four adversarial networks, to produce images of high perceptual quality. In this framework, a self-reconstruction strategy is used to construct auxiliary sub-networks, which impose more effective constraints on encoder-generator pairs. Using this idea, the model can increase the use ratio of paired data conditioned on the same dataset and obtain well-trained encoder-generator pairs with the help of the proposed cross-network skip connections. Moreover, the proposed framework not only produces realistic images but also addresses the problem where conditional GAN produces sharp images containing many small, hallucinated objects. Training on


1 | INTRODUCTION
Many problems in image processing, computer graphics, and computer vision can be posed as 'translating' an input image into a corresponding output image. Translating an image from source domain X to target domain Y amounts to learning a data distribution transformation between the domains. We define the problem of transforming an image of one scene into another as image-to-image translation.
In the field of image translation, prior works use convolutional neural networks (CNNs) to achieve image-to-image translation via variational autoencoders (VAEs) [1,2]. They use CNNs to produce images, which are relatively blurry, by minimizing the Euclidean distance between generated images and real images [3,4]. This is because the Euclidean distance is minimized by averaging all plausible outputs, which causes blurring. Nevertheless, cascaded refinement networks [5] propose a new method showing that CNNs can also synthesize images more realistic than those of generative adversarial networks (GANs). Generally speaking, in image processing and graphics tasks, GANs, proposed by Goodfellow et al. [6], are becoming a popular framework for problems where the output is highly detailed or photographic. With the help of GANs, recent research in computer vision, image processing, and graphics has produced powerful translation systems in the supervised setting, where example image pairs {x, y} are available, for example, the studies in [7-11]. Nevertheless, these works still have difficulty producing photo-realistic images like real-world images. Compared with previous works, the study by Isola et al. [8] makes great progress, but the generated images still have a large gap with the ground truth, demonstrating that the parameters of the model have not been well learned. For an ideal encoder-generator pair for image translation, the encoder is supposed to have the ability to extract distilled content information, as the style information of input images is of no use for producing output images and can even affect the performance of generators. The ideal generator should be a good domain-specific renderer that can produce photo-realistic images. In practice, however, the encoder extracts feature representations that contain useless style information and may lose the key content information, resulting in bad results that include many small, hallucinated objects.
On the other hand, even if the ideal generator is available, content information mixed with the redundant style information makes it difficult for generators to have a good rendering effect on generated images, resulting in blurry images or defective images.
To address the issue above, we seek a translation network that can learn the parameters well and produce photo-realistic images. There is no doubt that enlarging datasets is an efficient method. Nevertheless, large paired datasets are difficult to obtain and the resource of paired datasets is extremely limited. Therefore, we think that increasing the use ratio of data is an effective strategy that can produce impressive results conditioned on the same dataset. For the task of supervised image-to-image translation, we assume that there is some underlying relationship between domains. For example, as shown in Figure 1, domain X has a close relation with the corresponding domain Y. Specifically, paired images differ in surface appearance while both are renderings of the same underlying structure and share the content information of the images. To exploit this property, we construct a crossing framework, including two main sub-networks X → Y, Y → X and two auxiliary sub-networks X → X, Y → Y, to facilitate mapping learning between domains. Our framework performs the tasks of image translation and image reconstruction for both domains at the same time. We feed feature representations of both domains to domain-specific generators so that our encoders can extract common feature representations of input images, that is, content information. Meanwhile, our generators tend to ignore the redundant style information and act as domain-specific renderers. Taking advantage of the special cross-domain framework, our model can obtain better encoder-generator pairs using the two main sub-networks X → Y, Y → X together with the two auxiliary sub-networks X → X, Y → Y. Thus the proposed network is named dual-directional generative adversarial networks (dual-directional GANs). Besides, inspired by U-Net [12], cross-network skip connections are proposed in our framework to enhance the performance of generators, since images of both domains share a similar underlying structure.
Unlike the study by Liu and Tuzel [13], our core idea and principle are dedicated to solving the issue of parameter learning in the supervised setting, which is quite different from theirs. Besides, a key point is that our model does not use the weight-sharing strategy between the encoders of different domains, and the same holds for the generators and discriminators. For more details about our model, please see Section 3.
Herein, we explore GANs to achieve better results in the field of image-to-image translation using supervised learning. The main contribution of this work is that we propose a novel framework composed of four adversarial networks, called dual-directional GANs. The proposed framework utilizes cross-network skip connections and self-reconstruction to facilitate mapping learning between domains. Benefitting from the special framework, our encoders have the ability to extract intact content information and our generators have a good rendering effect on generated images. Second, the proposed model can not only produce more realistic images but also resolve the problem where conditional GAN produces sharp images containing many hallucinated objects. Third, extensive experiments are conducted to show that the proposed model is able to yield better results than some existing works. Qualitative and quantitative comparisons against other methods demonstrate the effectiveness and superiority of our idea.

2 | RELATED WORKS
In our work, we achieve image-to-image translation by combining multiple techniques, including GANs, image translation, and multi-network learning. Next, we introduce related works on these three key points in detail.

2.1 | Generative adversarial networks (GANs)
Generating impressive images was a challenging problem until the rapid development of GANs, which were proposed by Goodfellow et al. [6]. A GAN simultaneously trains two models: 1) the generator G tries to learn the mapping from source domain X to target domain Y, and 2) the discriminator D tries to distinguish the generated images from real images.
The key to the success of GANs is the novel framework that utilizes an adversarial loss to force generated images to be indistinguishable from the ground truth. With the rapid advancement of GANs, many works have succeeded in achieving impressive results in image generation. For example, DCGAN [14] utilizes a convolution-deconvolution network to convert random noise into meaningful images. CGAN [15] succeeds in producing more realistic images by feeding a conditional variable (e.g., the class label) to the input of the network. In our work, we also use GANs to achieve image-to-image translation because of their superiority in image generation.

2.2 | Image-to-image translation
Converting one image to another can be considered the task of image-to-image translation. The idea of image-to-image translation goes back at least to Hertzmann et al.'s Image Analogies [16], which uses a nonparametric texture model [17] learned from a single input-output training image pair. Prior works, like Gatys et al. [18], apply CNNs to produce new images of high perceptual quality that combine the content of one natural image with the style of another. More recent approaches use GANs to train their models and achieve impressive results. Isola et al. [8] propose a general network called 'pix2pix' to achieve image translation using paired training datasets. They use an image-conditional GAN to learn a mapping from source images to target images. Image-conditional models have tackled image prediction from a normal map [19], future frame prediction [20], product photo generation [21,22], and image generation from sparse annotations [23,24]. Interestingly, Chen et al. [5] utilize CNNs to make great progress in image translation. Unlike contemporary works, their approach can produce high-definition images by use of cascaded refinement networks without adversarial training. In addition to supervised learning, a large number of works have achieved fruitful results in the unsupervised setting. For example, DiscoGAN [25] proposes a method based on GANs that learns to discover relations between different domains. Yi et al. [26] develop a novel dual-GAN mechanism to learn both mappings between domains. The primal GAN learns to translate images from domain X to domain Y, while the dual GAN learns the inverse task. The closed loop made by the primal and dual tasks allows images from either domain to be translated and then reconstructed. The key idea is similar to the concurrent work CycleGAN [27], and both of them use the closed loop to acquire good results in the unsupervised setting.
Similar to our work, both of them can learn to translate images between domains in dual directions. Different from the cycle-reconstruction strategy used in CycleGAN and DualGAN [26], our self-reconstruction operates in feature space instead of image space. Besides, our framework does not use the closed loop, which makes it quite different from them. Specifically, DualGAN utilizes the cycle-reconstruction strategy to translate images X → Y and then reconstructs the input images with the generated images as input to achieve Y → X, which is a closed-loop network based on RGB images. Contrary to it, our model is an open-loop network, which consists of four adversarial networks to achieve image translation and image reconstruction using common feature representations instead of the RGB images used in DualGAN. It is notable that we do not reuse the generated images to reconstruct the input images. That is to say, we do not apply the cycle-reconstruction strategy used by DualGAN in our model.

2.3 | Multi-network learning
In the field of image generation, several works have already implemented multi-network learning to tackle multi-domain problems. Zhu et al. [27] use the idea of cycle consistency to construct two GANs that can learn both mappings between domains using unpaired training datasets. CoupledGANs [13] uses two parallel networks with a weight-sharing strategy to learn a common representation across domains. Like CoupledGANs, Liu et al. [28] use two partly weight-sharing networks, which combine VAEs and GANs, to learn unsupervised image-to-image translation. Note that the weight-sharing strategy plays an important role in cross-domain unsupervised learning. Different from them, our model does not use the weight-sharing strategy between the encoders of different domains, and the same holds for the generators and discriminators. It is worth noting that multi-network learning is becoming a popular method in recent works. For example, Lin et al. [29] decompose images into two parts: the content and the style of images. To translate an image to the target domain, they recombine its content code with the style code of another image from the target domain by use of an image-reconstruction strategy. MUNIT [30] proposes a similar idea to achieve multi-modal image translation, which implements content code reconstruction as well as style code reconstruction. Like cd-GAN [29], MUNIT allows users to control the style of translation outputs by providing an example style image. Kazemi et al. [31] propose an unsupervised image-to-image translation framework which maximizes a domain-specific variational information bound and learns the target domain-invariant representation of the two domains. Lee et al. [32] present an approach based on disentangled representations for producing diverse outputs without paired training images. Gonzalez et al. [33] propose a framework based on BicycleGAN [34] to achieve multimodal image translation by separating the internal representation into two parts: the shared content part and the exclusive style part. The content part is later concatenated with the style part to achieve multi-modal transformations. Different from it, our model utilizes the proposed framework to force encoders to extract the distilled content information and discard the useless style information. Our generators are domain-specific renderers that only take content representations as input to synthesize images belonging to the corresponding domains. Moreover, we do not impose additional constraints on the feature space with an L1 norm loss, as is done in the study by Gonzalez et al. [33]. Among these works, most use the main idea of cycle consistency to achieve image translation in essence.

3 | METHOD
In this section, we give a general overview of our model and present the formulation of our novel framework. The loss functions of our model are explained at the end.

3.1 | Our network
As shown in Figure 2, our framework is based on autoencoders and GANs. It consists of two main sub-networks M_Y: X → Y, M_X: Y → X and two auxiliary sub-networks F_Y: Y → Y, F_X: X → X. Every sub-network consists of three parts: one domain encoder E, one domain generator G and one domain discriminator D. In our framework, there are six blocks in total: encoder E_X for domain X and E_Y for domain Y, generator G_X for domain X and G_Y for domain Y, and discriminator D_X for domain X and D_Y for domain Y. The encoder-generator pair {E, G} constitutes a sub-network that first encodes source images to a shared code via E and then decodes the shared code to produce target images via G. D_X and D_Y aim to distinguish generated images from real images of the corresponding domain. It is worth noting that main sub-network M_Y has a different encoder from auxiliary sub-network F_Y.
The same holds for M_X and F_X. For example, main mapping M_Y: X → Y has the same generator G_Y and discriminator D_Y as auxiliary mapping F_Y: Y → Y. However, they have different encoders, specifically E_X for mapping M_Y and E_Y for mapping F_Y. To make this easy to understand, we list every part of the sub-networks in Table 1. More specifically, both of our encoders are constructed with an identical network architecture, which is composed of eight convolution blocks to extract common feature representations. Each block has the form Convolution-InstanceNorm-LReLU, except the first block, which omits InstanceNorm. We use 4 × 4 convolutions with stride two to downsample the input image to 512 feature maps of size 1 × 1. The corresponding generators utilize eight fractionally strided convolution blocks with stride one half, which use modules of the form Deconvolution-InstanceNorm-ReLU except the last block, which uses Deconvolution-tanh, so that the input and output have exactly the same size. Our discriminator is an image-conditional convolutional discriminator that takes both the input and output images as the input of the network. Note that our discriminators do not adopt PatchGAN [35]. Instead, we take the entire images as the input of the discriminators. The discriminator first utilizes four convolution blocks with stride two to downsample the input to 512 dimensions and then uses one convolution block with stride one to output the score. But why does our model work? Next, we will explain it in detail.

F I G U R E 2  The framework of our dual-directional adversarial network. Purple blocks represent input and output domains. Green blocks represent encoder E, generator G and discriminator D. Yellow blocks represent the feature representations obtained by encoders. Black arrow lines represent the direction of information flow and red arrow lines represent skip connections between encoder E and generator G. It is assumed that paired images have the same underlying structure, so we take no actions to link feature representations Z_X and Z_Y in our network. Note that the input is real training images, and the output is synthesized images generated by our model
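The block structure described above can be sketched in PyTorch. This is a minimal sketch, not the authors' code: the paper fixes the block count, the 4 × 4 stride-2 convolutions and the 512-dimensional 1 × 1 code, while the intermediate channel widths, the missing normalization on the innermost block, and the sigmoid head of the discriminator are our assumptions.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, norm=True):
    # Convolution-InstanceNorm-LReLU; 4x4 kernels with stride 2 halve the size
    layers = [nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1)]
    if norm:
        layers.append(nn.InstanceNorm2d(c_out))
    layers.append(nn.LeakyReLU(0.2))
    return nn.Sequential(*layers)

class Encoder(nn.Module):
    """Eight downsampling blocks: a 3x256x256 image -> a 512x1x1 content code."""
    def __init__(self):
        super().__init__()
        chans = [3, 64, 128, 256, 512, 512, 512, 512, 512]  # assumed widths
        # the paper omits InstanceNorm on the first block; we also omit it on
        # the innermost block, whose 1x1 output has no spatial statistics
        self.blocks = nn.ModuleList(
            [conv_block(chans[i], chans[i + 1], norm=(0 < i < 7)) for i in range(8)]
        )

    def forward(self, x):
        feats = []  # intermediate maps, reusable for skip connections
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        return x, feats

class Discriminator(nn.Module):
    """Image-conditional: input and output images are concatenated on the channel
    axis; four stride-2 blocks, then one stride-1 block outputs the score."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(6, 64, norm=False),
            conv_block(64, 128),
            conv_block(128, 256),
            conv_block(256, 512),
            nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1),
            nn.Sigmoid(),  # probability that the (input, output) pair is real
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1))

x = torch.randn(1, 3, 256, 256)
code, feats = Encoder()(x)
score = Discriminator()(x, torch.randn(1, 3, 256, 256))
```

Each of the eight encoder blocks halves the resolution (256 → 128 → ... → 1), so the content code is a single 512-channel vector per image.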
A traditional method like 'pix2pix' is a single network, and the model lacks necessary constraints on its encoders and generators. In our work, our goal is to learn mappings between two domains X and Y given paired training samples {x_i} ∈ X and {y_i} ∈ Y. As mentioned in Section 1, a single encoder-generator pair {E_X, G_Y} cannot ensure that E_X and G_Y act as an ideal encoder and generator, respectively.
Considering that paired samples x_i and y_i have the same content information, we take both feature representations E_X(x_i) and E_Y(y_i) as the input of the corresponding generator G_X or G_Y to train our model. For example, as shown in Figure 3, generator G_Y produces G_Y(E_X(x_i)) and G_Y(E_Y(y_i)) under supervision using an L1 norm loss with real images y_i. To minimize the L1 loss, the feature representation E_X(x_i) will gradually approach E_Y(y_i). The final result is that our encoders are forced to extract common feature representations, that is, content information. In the supervised setting, the content information of domain X is similar to that of domain Y. Meanwhile, this training strategy makes G_Y insensitive to the style information in E_X(x_i) or E_Y(y_i) by updating parameters step by step. This is because domain X has different style information from domain Y. If our generators did not ignore the style information in E_X(x_i) or E_Y(y_i), the L1 loss between G_Y(E_X(x_i)), G_Y(E_Y(y_i)) and y_i would be enlarged, which violates the principle of minimizing the loss. Hence, our generators have a good rendering effect on generated images. In a word, the goal of constructing auxiliary sub-networks is to obtain well-trained encoder-generator pairs. To obtain a better model, we implement a similar operation on G_X. On the basis of M_Y and F_Y, we introduce M_X and F_X to facilitate mapping learning and make our encoders E_X, E_Y act as ideal encoders, which have the ability to extract intact content information without style components. As a consequence of constructing four sub-networks, the use ratio of data is increased during training. Benefiting from the special framework, our model can obtain better encoders and generators and produce more realistic images.
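The four-pass training scheme above can be illustrated with a toy sketch. The `Linear` modules below are stand-ins for the convolutional encoders and generators (names follow the paper's notation); only the loss wiring is the point.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# toy stand-ins for the conv encoders/generators E_X, E_Y, G_X, G_Y
E_X, E_Y = torch.nn.Linear(8, 4), torch.nn.Linear(8, 4)
G_X, G_Y = torch.nn.Linear(4, 8), torch.nn.Linear(4, 8)

x, y = torch.randn(2, 8), torch.randn(2, 8)   # a paired mini-batch

z_x, z_y = E_X(x), E_Y(y)                     # content codes of the pair

# main mappings M_Y: X -> Y and M_X: Y -> X
loss_main = F.l1_loss(G_Y(z_x), y) + F.l1_loss(G_X(z_y), x)
# auxiliary self-reconstruction mappings F_Y: Y -> Y and F_X: X -> X
loss_aux = F.l1_loss(G_Y(z_y), y) + F.l1_loss(G_X(z_x), x)

# G_Y must map BOTH z_x and z_y to the same target y, so minimizing the
# total loss pushes E_X(x) and E_Y(y) toward a common content code
loss = loss_main + loss_aux
loss.backward()
```

The adversarial terms of the four discriminators would be added on top of these L1 terms in the full model.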
In addition to auxiliary sub-networks, we also add skip connections to generators. Every sub-network consists of autoencoders and GANs. Input images are passed through a series of layers until they are progressively downsampled to a shared feature, and then GANs decode it to output images that have the same size 256 × 256 as the input images. Considering the small dimension of the shared feature, a great deal of low-level information is discarded when the input is passed through the encoder. Therefore, we add skip connections, following the general shape of a 'U-Net' [12]. Each skip connection simply concatenates all channels at layer i and layer n − i, where n is the total number of layers. In our framework, we design cross-network skip connections, denoted by red arrow lines in Figure 2, to construct symmetric networks, which are beneficial for our network to learn the shared feature representation. As we can see, generator G_Y shares the information from encoder E_X and generator G_X shares the information from encoder E_Y.

F I G U R E 3  As the training goes on, the feature representation extracted by E_X gradually approaches that extracted by E_Y. Finally, E_X and E_Y extract similar features, that is, content information
RUAN ET AL.

For example, sub-networks M_Y: X → Y and F_Y: Y → Y have the same generator G_Y with shared information from encoder E_X. The only difference between them is the input of G_Y, that is, E_X(x_i) for M_Y and E_Y(y_i) for F_Y. Sub-networks M_Y and F_Y have different encoders but the same generator, which produces similar images via the L1 norm loss between the synthesized images G_Y(E_X(x_i)), G_Y(E_Y(y_i)) and the ground truth y_i. This facilitates encoders E_X, E_Y to extract common content information rather than different style information when minimizing the loss. Moreover, it is well known that adding skip connections can reduce the loss of information, which is crucial to producing impressive images. Therefore, adding cross-network skip connections is a beneficial strategy for our model.
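A generator with such skip connections can be sketched as follows. The channel widths are assumptions chosen to be consistent with an encoder that produces 512 feature maps; `skip_feats` stands for the intermediate maps of the other domain's encoder (the cross-network part), and here dummy tensors of the right shapes take their place.

```python
import torch
import torch.nn as nn

def deconv_block(c_in, c_out, last=False):
    layers = [nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1)]
    if last:
        layers.append(nn.Tanh())               # final block: Deconvolution-tanh
    else:
        layers += [nn.InstanceNorm2d(c_out), nn.ReLU()]
    return nn.Sequential(*layers)

class Generator(nn.Module):
    """Eight upsampling blocks; block i concatenates the encoder map from layer
    n-i, U-Net style, so the next block sees doubled channels."""
    def __init__(self):
        super().__init__()
        ins  = [512, 1024, 1024, 1024, 1024, 512, 256, 128]  # assumed widths
        outs = [512,  512,  512,  512,  256, 128,  64,   3]
        self.blocks = nn.ModuleList(
            [deconv_block(ins[i], outs[i], last=(i == 7)) for i in range(8)]
        )

    def forward(self, code, skip_feats):
        # cross-network skips: G_Y always receives the maps of encoder E_X
        # (and G_X those of E_Y), whichever content code it is decoding
        h = code
        for i, block in enumerate(self.blocks):
            h = block(h)
            if i < 7:
                h = torch.cat([h, skip_feats[6 - i]], dim=1)
        return h

# dummy encoder maps with the shapes E_X would produce for a 256x256 image
shapes = [(64, 128), (128, 64), (256, 32), (512, 16), (512, 8), (512, 4), (512, 2)]
feats = [torch.randn(1, c, s, s) for c, s in shapes]
out = Generator()(torch.randn(1, 512, 1, 1), feats)
```

Because the same `skip_feats` list is shared by both the main and the auxiliary sub-network, the generator sees identical low-level detail in both passes; only its `code` input differs.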

3.2 | Objectives
The objective of a conditional GAN can be expressed as

l_cGAN(E, G, D) = log D(x, y) + log[1 − D(x, G(E(x)))]    (1)

where encoder E converts the input x to the latent code E(x). Discriminator D tries to maximize this objective by forcing D(x, y) to approach one and D(x, G(E(x))) to approach zero, that is, D(x, y) ≈ 1 and D(x, G(E(x))) ≈ 0. It tries to distinguish real images y from synthesized images G(E(x)) by playing a min-max game with encoder E and generator G. Generator G tries to produce images G(E(x)) that look similar to images from domain Y, that is, D(x, G(E(x))) ≈ 1. In a word, G tries to minimize this objective against an adversarial D that tries to maximize it, that is, G* = arg min_{E,G} max_D l_cGAN(E, G, D). Our model includes two main mappings M_Y: X → Y, M_X: Y → X and two auxiliary mappings F_Y: Y → Y, F_X: X → X. All of them have an adversarial loss similar to that of the conditional GAN.
Another term is the L1 norm loss, which prior works have found beneficial to combine with the GAN objective. The L1 loss forces the network to learn mappings between domains so that generated images are closer to the ground truth. The L1 objective can be expressed as

l_L1(E, G) = ‖y − G(E(x))‖_1    (2)

In conclusion, our final objective functions consist of two terms: the L1 loss and the adversarial loss. The L1 loss facilitates our model to learn the appropriate mapping between domains and to model the data distribution of the datasets. The adversarial loss is introduced to regularize the parameters so that generated images are as realistic as real-world images. In our model, we utilize a conditional GAN [8], which takes generated images and input images together as the input of the discriminators, to produce images of high perceptual quality. The final cost functions of our encoders E_X, E_Y and generators G_X, G_Y take the form

l_{E,G} = log[1 − D(x, G(E(x)))] + λ‖y − G(E(x))‖_1    (3)

instantiated for each of the four sub-networks as l_{E_Y,G_X}, l_{E_X,G_Y}, l_{E_X,G_X}, l_{E_Y,G_Y}. The hyper-parameter λ is the weight used to balance the L1 loss with the adversarial loss. Our encoders E and generators G try to minimize these loss functions by forcing generated images close to the ground truth. For example, in the case of sub-network F_Y, D_Y(x, G_Y(E_Y(y))) approaches 1 and ‖y − G_Y(E_Y(y))‖_1 approaches 0. The cost functions of our discriminators D_X, D_Y can be expressed as

l_D = −log D(x, y) − log[1 − D(x, G(E(x)))]    (4)

To minimize the cost functions of D_X, D_Y, the discriminators are forced to distinguish generated images from real images through parameter learning.
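The two loss terms can be sketched in code. As an assumption, we use the non-saturating binary cross-entropy form that most GAN implementations adopt (pushing D toward 1 on fakes for the generator, rather than literally minimizing log(1 − D)); whether the paper uses this exact variant is not stated.

```python
import torch
import torch.nn.functional as F

def generator_loss(d_fake, fake, target, lam=100.0):
    # adversarial term: push D(x, G(E(x))) toward 1 (non-saturating form),
    # plus the L1 term pulling G(E(x)) toward the ground truth
    adv = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    return adv + lam * F.l1_loss(fake, target)

def discriminator_loss(d_real, d_fake):
    # force D(x, y) -> 1 on real pairs and D(x, G(E(x))) -> 0 on fakes
    real = F.binary_cross_entropy(d_real, torch.ones_like(d_real))
    fake = F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    return real + fake

torch.manual_seed(0)
d_real = torch.sigmoid(torch.randn(4, 1))   # dummy discriminator outputs
d_fake = torch.sigmoid(torch.randn(4, 1))
fake, target = torch.randn(4, 3), torch.randn(4, 3)
g_loss = generator_loss(d_fake, fake, target)
d_loss = discriminator_loss(d_real, d_fake)
```

In the full model, these two functions would be evaluated once per sub-network (M_Y, M_X, F_Y, F_X) with the appropriate encoder, generator and discriminator.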

4 | EXPERIMENTS
In this section, we divide our experiments into three parts. We start by introducing the various datasets and baselines used in our experiments. Then, qualitative comparisons of perceptual quality are carried out to show that dual-directional GANs can generate more realistic images than other methods. In the end, we use the same metrics as 'pix2pix' [8] to evaluate the performance of the proposed method. Quantitative comparisons against other methods are conducted to show the superiority of our model. Besides, an ablation study is conducted to further prove the effectiveness of dual-directional GANs. To optimize our network, we apply the Adam optimizer [36] with the learning rate set to 0.0002 and momentums set to 0.5 and 0.999. In our experiments, we set the self-reconstruction weight λ to 100 and train our model for 200 epochs with a minibatch size of 1. It is notable that using cross-network skip connections should obey one principle: as shown in Figure 2, generator G_Y shares the information from encoder E_X and generator G_X shares the information from encoder E_Y.
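The optimizer settings above can be reproduced as follows. The module names are placeholders following the paper's notation, and optimizing all encoder and generator parameters jointly in one Adam instance is our assumption.

```python
import itertools
import torch
import torch.nn as nn

# placeholder modules standing in for the paper's encoders/generators/discriminators
E_X, E_Y = nn.Linear(4, 2), nn.Linear(4, 2)
G_X, G_Y = nn.Linear(2, 4), nn.Linear(2, 4)
D_X, D_Y = nn.Linear(4, 1), nn.Linear(4, 1)

# Adam with learning rate 0.0002 and momentums (0.5, 0.999), as in the paper
opt_G = torch.optim.Adam(
    itertools.chain(E_X.parameters(), E_Y.parameters(),
                    G_X.parameters(), G_Y.parameters()),
    lr=2e-4, betas=(0.5, 0.999),
)
opt_D = torch.optim.Adam(
    itertools.chain(D_X.parameters(), D_Y.parameters()),
    lr=2e-4, betas=(0.5, 0.999),
)

LAMBDA = 100      # self-reconstruction weight
EPOCHS = 200
BATCH_SIZE = 1
```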

4.1 | Datasets and baselines
To prove the validity of dual-directional GANs, we test the method on a variety of datasets, including the Cityscapes dataset [37], the GTA5 dataset [38], sketch-photo [39] and Google Maps [8]. Note that all images used in our experiments, including paired training images, val images, generated images and so on, have the same size 256 × 256, except for CRN [5], which produces images of 512 × 256.
The Cityscapes dataset is a challenging real-world dataset. It contains urban street images collected from a moving vehicle in 50 cities in Germany and neighbouring countries. The Cityscapes dataset has 2975 paired training images and 500 val images. We conduct both qualitative and quantitative comparisons on the Cityscapes dataset to prove the effectiveness of dual-directional GANs.
The GTA5 dataset consists of 24,966 densely labelled RGB images (video frames), containing 19 classes that are compatible with the Cityscapes dataset. In our experiments, we select the first 5000 paired images as the experimental dataset for performance comparison. To avoid similarity and continuity between training images, the 2500 even-numbered images are chosen as the training set and the remaining images serve as the val set.
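The even/odd split can be done in a few lines; the zero-padded file naming below is hypothetical.

```python
# hypothetical file list: the first 5000 GTA5 frames, numbered from 1;
# even-numbered frames go to train, odd-numbered frames to val
frames = [f"{i:05d}.png" for i in range(1, 5001)]
train = frames[1::2]   # "00002.png", "00004.png", ...
val = frames[0::2]     # "00001.png", "00003.png", ...
```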
Sketch-photo dataset contains 995 paired training images and 199 val images. It will be mainly used in the experiment of qualitative comparison.
Google Maps dataset consists of 1096 paired training images and 1098 val images that are scraped from Google Maps. It is used on the map → aerial photo setting.
In our experiments, we carry out several performance comparisons against existing works on multiple datasets. The baselines include several compelling works, that is, 'pix2pix' [8], DualGAN [26], CycleGAN [27], MUNIT [30], CRN [5] and PAN [40]. It is notable that MUNIT is a state-of-the-art approach for synthesizing diverse and realistic images. Like cd-GAN [29], MUNIT has the ability to control the style of translation outputs by providing an example style image. Thus it is suitable for the task of image translation. For fair comparisons, we replace the random style code with the style code of given style images to achieve the task of image translation. Besides, according to the instructions in their repository, we use the pre-trained 512 × 256 CRN model to produce images of 512 × 256 for comparison. DualGAN is referred to as Dual in our experiments for convenience.

4.2 | Qualitative comparison
In this experiment, we conduct several qualitative comparisons between our model and other methods. For fair comparisons, we implement the task of image-to-image translation on the same training dataset but with different networks. Figure 4 shows synthesized images produced by different models on the Cityscapes dataset. As we can see, images generated by our model are much better in perceptual quality than images generated by the 'pix2pix' model. In the local parts of images generated by the 'pix2pix' model, there is much noise, so they look blurry. Compared with them, images generated by our model look sharper and more realistic on perceptual intuition. Moreover, our model also outperforms state-of-the-art approaches such as CRN and MUNIT. In our view, CRN is a model based on CNNs, which is the major reason for its lower definition. It is worth emphasizing that CRN relies heavily on cascaded refinements of the pre-trained model at the resolutions of 256 and 512. At the resolution of 1024 × 2048, CRN outperforms models without cascaded refinements, as it is difficult for such models to learn so large a quantity of parameters at a time. In our experiments, although images generated by CRN have a resolution of 512 × 256, which is larger than the 256 × 256 used in other methods, they look inferior to the images of our model in perceptual quality, as shown in Figure 4. Furthermore, it is notable that MUNIT does not perform well on complex paired datasets, for example, driving datasets such as the Cityscapes and GTA5 datasets. This is because urban scenes contain much complex information, so the corresponding images cannot be simply decomposed into a style code and a content code. MUNIT is only suitable for some simple datasets such as edges → shoes [8]. Unlike those works, our model has broad applicability and achieves a better effect on the task of image translation. It implements noise elimination and improves the quality of generated images.
Figure 5 presents example results on the GTA5 dataset. It is observed that the images generated by our model have higher definition and look more like real-world images. More results on labels → photos can be seen in Figures 6 and 7.
As shown in Figure 8, images generated by the 'pix2pix' model have another drawback: they include many small, hallucinated objects on the task of photo → labels. Isola et al. encounter the same problem in [8], and they use quantitative classification accuracies to demonstrate that simply using L1 regression gets better scores than using a conditional GAN (with/without L1 loss) on the task of photo → labels. They argue that for vision problems, the goal (i.e., predicting output close to the ground truth) may be less ambiguous than in graphics tasks, and reconstruction losses like the L1 norm are mostly sufficient. Although a conditional GAN can produce sharper images than those using L1 loss, it brings in hallucinated objects simultaneously. For the translation task on the Cityscapes dataset and Google Maps, we conduct experiments on DualGAN [26] and find that DualGAN easily suffers mode collapse, outputting the same images regardless of the input. Furthermore, we carry out several experiments on driving datasets and find it is difficult for most models to learn well on the task of photo → labels. In our work, we still use a conditional GAN with L1 loss, as in Equations 3 and 4, but with a different framework, to produce realistic images without hallucinated objects, as shown in Figure 8. We use the proposed framework to resolve the problem that conditional GAN produces images containing many small, hallucinated objects. To our knowledge, it is the first time that a conditional GAN solves this issue on the Cityscapes dataset. As we explain in Section 3, our model can utilize the novel framework to obtain much better encoder-generator pairs. Specifically, our encoder has the ability to extract intact content information and our generator has a good rendering effect on generated images. A traditional encoder-generator pair lacks necessary constraints, so the encoder may lose the content information of input images and the generator may have a poor rendering effect on the output, resulting in blurry images that include many small, hallucinated objects. More results on photo → labels can be seen in Figure 9.

F I G U R E 4  Qualitative comparisons on the task of Cityscapes labels → photos. Note that the images generated by our model are better than those generated by other models. Please zoom in for more details, for example, car logos, roads, people and so on
F I G U R E 5  Qualitative comparisons on the task of GTA5 labels → photos. Note that our method can produce sharper images where objects, for example, roads, cars and motorbikes, have clear profiles and colour pixels

F I G U R E 6  Example results for the task of labels → photos on the sketch-photo dataset

F I G U R E 7  Example results on Google Maps. Compared with other methods, our model can produce images of high definition. Please zoom in as much as you can for more details about clarity

F I G U R E 8  Qualitative comparisons on the task of Cityscapes photo → labels. Note that the 'pix2pix' model produces images that include many small, hallucinated objects. In contrast to other methods, our model can produce more realistic and precise images, which are much closer to the ground truth. Please zoom in for more information about image definition

| Quantitative evaluation
To assess performance on image translation tasks, we use quantitative measures to evaluate the models over the test sets: the 'FCN score' [8], Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM) [41] and Visual Information Fidelity (VIF) [42]. In addition, an ablation study is conducted to further prove the effectiveness of dual-directional GANs.

| Evaluation metrics: FCN-score
Although perceptual comparison may be the gold standard for evaluating graphical realism, we seek an automatic quantitative measure that provides concrete evaluation data. To this end, we adopt the 'FCN score' from the study by Isola et al. [8] and use it to evaluate the task of semantic labels → photos on the Cityscapes dataset with the standard metrics. The Cityscapes benchmark includes per-pixel accuracy, per-class accuracy, and mean class Intersection-over-Union (Class IoU) [37]. The FCN predicts a label map for a synthesized photo; the predicted label map is then compared with the ground-truth label and given an FCN score under standard semantic segmentation metrics. If synthesized photos are realistic enough, the corresponding predicted labels will closely approach the ground-truth labels and therefore obtain high scores. Table 2 lists the comparison results on the task of Cityscapes labels → photos. Note that we quote partial data from the studies by Isola and co-workers [8,27] for comparison, and all the baselines share the same implementation details as our method. Note also that CRN relies heavily on cascaded refinements; we use the pre-trained 512 × 256 model provided by Chen and Koltun [5] as the baseline in this experiment. According to the evaluation data shown in Table 2, our model obtains the highest scores and outperforms the baselines by a large margin. Compared with the 'pix2pix' model, our model yields the highest per-pixel accuracy of 0.73, outperforming 'pix2pix' by 7 percentage points. Moreover, we use additional metrics (i.e., PSNR, SSIM, VIF) to evaluate the performance of different models. Tables 3 and 4 show the evaluation data on Cityscapes labels → photos and photos → labels, respectively. The experimental results suggest that our model performs better on the task of image-to-image translation.
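For reference, per-pixel accuracy, per-class IoU, and PSNR can be computed as below. This is a minimal NumPy sketch of the standard metric definitions (function names are illustrative), not the evaluation code used in our experiments:

```python
import numpy as np

def per_pixel_accuracy(pred_labels, gt_labels):
    """Fraction of pixels whose predicted class equals the ground truth."""
    return float(np.mean(pred_labels == gt_labels))

def class_iou(pred_labels, gt_labels, cls):
    """Intersection-over-Union for a single semantic class."""
    inter = np.logical_and(pred_labels == cls, gt_labels == cls).sum()
    union = np.logical_or(pred_labels == cls, gt_labels == cls).sum()
    return float(inter) / float(union) if union > 0 else float("nan")

def psnr(img_a, img_b, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB (higher is better)."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Mean class IoU is then the average of `class_iou` over all classes present in the ground truth.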

FIGURE 9 Example results for the task of photo → labels on the sketch-photo dataset. Note that DualGAN produces images that include small, hallucinated objects on local parts of faces.

The results of the ablation study are listed in Table 5. According to the evaluation data, each sub-network is important to obtaining the full improvement in performance. With the help of the sub-networks F_Y, M_Y, our model obtains a better generator G_Y and produces more realistic images, achieving a higher per-pixel accuracy of 0.707 compared with a single network M_Y. To further improve performance, we add the sub-networks M_X, F_X to train our encoders E_X, E_Y together with the generators G_Y and G_X, so that the encoders can extract intact content information without style components and the generators act as ideal renderers that produce images of high fidelity and definition. The experiment showcases the effectiveness of the proposed framework.
In addition, an ablation study on the cross-network skip connections is conducted to prove their effectiveness. We consider the following cases: (a) Ours (full): the full implementation of cross-network skip connections on the four sub-networks M_X, M_Y, F_X, F_Y. (b) Ours w/o F_X, F_Y: the cross-network skip connections on F_X, F_Y are removed; that is, the two red arrow lines in the middle of Figure 2 are removed from our framework. In our generators, we replace each removed feature with a zero-filled map of the same size. (c) Ours w/o all: on the basis of (b), we also remove the skip connections on M_X, M_Y. The experimental results on the Cityscapes val set are shown in Table 6. As we can see, adding skip connections reduces the loss of information as the input passes through the encoders. On the other hand, the cross-network skip connections on F_X, F_Y are beneficial to obtaining well-trained encoder-generator pairs.
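The zero-map substitution used in case (b) can be sketched as follows. This is an illustrative NumPy fragment (names hypothetical), showing only how an ablated skip feature is replaced so that the generator's layer interface stays unchanged:

```python
import numpy as np

def concat_skip(decoder_feat, skip_feat=None):
    """Concatenate a cross-network skip feature onto decoder features
    along the channel axis. When the skip connection is ablated
    (skip_feat is None), substitute a zero map of the same shape so
    downstream layers still receive an input of the expected size."""
    if skip_feat is None:
        skip_feat = np.zeros_like(decoder_feat)
    assert skip_feat.shape == decoder_feat.shape
    return np.concatenate([decoder_feat, skip_feat], axis=0)  # (2C, H, W)
```

Substituting zeros rather than shrinking the layer keeps the ablated and full models architecturally identical, so any performance gap is attributable to the skip information itself.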
As shown in Table 7, we provide a sensitivity analysis of the hyper-parameter λ with different values. When λ is set to a smaller value, performance drops because the L1 norm loss is under-weighted. When λ is set to a larger value, the model lacks adversarial supervision, resulting in blurry images. Thus the hyper-parameter λ is set to 100 in our model.

| CONCLUSION
Herein, we propose a novel framework composed of four adversarial networks, called dual-directional GANs, to achieve image-to-image translation. The proposed framework consists of two main sub-networks and two auxiliary sub-networks. It utilizes a self-reconstruction strategy to construct the auxiliary sub-networks and makes use of them to facilitate mapping learning between domains. Combined with cross-network skip connections, our model produces more realistic images than some existing works, especially in per-pixel accuracy. Moreover, we use the proposed framework to address the problem where a conditional GAN produces sharp images containing many small, hallucinated objects. In other words, our framework essentially increases the use ratio of paired data conditioned on the same dataset and obtains better encoder-generator pairs at the same time. Extensive experiments demonstrate that our model produces images of high perceptual quality on multiple datasets. Qualitative and quantitative comparison results suggest that dual-directional GANs are a promising method for the task of image-to-image translation.

TABLE 5 Ablation study showing the effect of each sub-network on the final performance of our approach on the task of Cityscapes labels → photos

Model | Per-pixel acc.