Two-Step Training: Adjustable Sketch Colourization via Reference Image and Text Tag

Automatic sketch colourization is a topic of great interest in the image-generation field. However, owing to the absence of texture in sketch images and the scarcity of training data, existing reference-based methods are ineffective in generating visually pleasant results and cannot edit colours using text tags. This paper therefore presents a conditional generative adversarial network (cGAN)-based architecture with a pre-trained convolutional neural network (CNN), reference-based channel-wise attention (RBCA) and a self-adaptive multi-layer perceptron (MLP) to tackle this problem. We propose two-step training and spatial latent manipulation to achieve high-quality, colour-adjustable results using reference images and text tags. The superiority of our approach in reference-based colourization is demonstrated through qualitative/quantitative comparisons and user studies against existing network-based methods. We also validate the controllability of the proposed model and discuss the details of our latent manipulation on the basis of experimental results of multi-label manipulation.


Introduction
Anime illustration is a popular art form worldwide owing to its diverse colour composition and fascinating character design. However, colourizing a sketch image is a time-consuming and tedious process, even for professional artists. It is also challenging for neural networks owing to the absence of colour and texture in sketch images. Given that line art has enormous market demand, both research and industry can benefit from the successful development of a fully or semi-automatic colourization system.
Existing sketch colourization approaches usually require additional hints to synthesize colours [ZLW*18, ZJLL17, LKL*20, HYES19]. In accordance with how hints are given, these methods can be categorized into three types: text-based [ZMG*19, HYES19], user-guided [SDC09, PMC22, SMYA14, FTR18, ZLW*18] and reference-based [ZJLL17, LKL*20]. Text-based methods use the binary attributes of tags, words and sentences to colourize images, but they are insufficient to adjust the degree of colours. User-guided methods require users to specify colours for regions with spots or sprays, so a basic knowledge of line art and an interactive system is necessary. In addition, user-guided approaches are inefficient in colourizing different sketch images as hints must be designated in correspondence to each input image. Although reference-based methods overcome these limitations, developing such a method is more challenging as it requires an additional evaluation of the colour similarity with references and a semantically well-paired training dataset, which is currently unavailable and expensive to build.
To improve the quality and controllability of reference-based results, in this paper we present a generative adversarial network (GAN) that adopts a pre-trained convolutional neural network (CNN) and two-step training. We design a reference-based channel-wise attention (RBCA) block and a self-adaptive multi-layer perceptron (MLP) to enable the proposed model to generate high-quality images from references and text tags, as shown in Figure 1. We develop spatial latent manipulation for our attention-based receiving block in the second training. Qualitative/quantitative comparisons and user studies against other methods demonstrate our advantages in reference-based colourization. Experiments on multi-label manipulation also demonstrate the controllability of our results. The main contributions of this paper are summarized as follows: a cGAN-based colourization architecture built around a pre-trained reference encoder, an RBCA block and a self-adaptive MLP; a two-step training scheme with spatial latent manipulation that makes the results adjustable through reference images and text tags; and comparisons, user studies and multi-label manipulation experiments that validate the quality and controllability of the results. The rest of this paper starts by reviewing related works in Section 2. Section 3 explains the loss functions used in the two-step training and important components of the proposed models. Section 4 includes the implementation details, experimental results and corresponding discussion. Section 5 concludes the paper.

Related work
Image generation with GANs: GANs are one of the most prevalent generative models owing to their effectiveness in synthesizing high-quality images. Goodfellow et al. [GPM*14] first proposed the vanilla GAN to decode random noise into images, whose learning process is extremely unstable. Researchers then designed a series of improvements to resolve this issue from the network structure [IZZE17, KLA19, TMYTCJY19, KLA*20] and loss function [MLX*17, ACB17, GAA*17]. Many works explored the latent space inside a GAN and proposed effective algorithms to edit the outputs by latent manipulation [GAOI19, SGTZ20, YCW*21, VB20, GSZ20]. These methods introduced additional learnable modules to locate specific visual attributes in the latent space Z of the noise input. However, most of them are tailored for StyleGAN-based architectures [KLA19] and real photo images. Inspired by these works, we adopt a GAN-based architecture and propose the second training, which manipulates reference latent codes to adjust the colours in the final outputs.
Style transfer: Many network-based style transfer methods have been proven efficient in learning features from images. Gatys et al. [GEB16] adopted a pre-trained Visual Geometry Group (VGG) network [SZ15, HB17] to transfer style information from a predetermined image. Johnson et al. [JAF16] proposed a perceptual loss for training a real-time feed-forward network. GANs soon outperformed these types of networks in various style transfer tasks [IZZE17, XYH*21, WCZ*22, JYTPA17, CUYH20, HLBK18]. As sketch colourization can be regarded as multimodal style transfer, many related algorithms are applicable. To achieve pixel-level correspondence, we utilize a pixel-level L1 loss instead of the frequently used perceptual loss and cycle consistency loss. A feature-level L1 loss is also adopted for latent manipulation in our second training.
Attention in computer vision: The attention mechanism has long dominated the natural language processing (NLP) field [VSP*17, DCLT19]. Many works have demonstrated the effectiveness of extracting features using spatial and channel-wise attention [HSS18, DBK*21, WPLK18]. We adopt a cross-attention module as our receiving block, which provides latent codes spatially for the GAN to improve the quality of reference-based results.
Sketch colourization: Colourizing is very time-consuming in practice, so researchers have developed many assistance tools to accelerate this process, such as LazyBrush [SDC09]. However, traditional methods [SDC09, PMC22, SMYA14, FTR18] are usually built on user-guided hints, so they are inappropriate for reference-based colourization. As neural networks have been proven effective in object recognition and colour rendering [ZIE16, ZZI*17, XHZ*20], many deep learning models have been proposed to solve the sketch colourization problem by encoding hints such as reference images or text into latent representations.

Method
The pipeline of the proposed architecture is shown in Figure 2. Training includes two steps: first train the GAN for reference-based colourization, and then train the mapping networks for tag-based manipulation. We mark the optimization targets of the first and second training with blue and green dotted rectangles, respectively. To obtain a sufficient number of semantically paired images, we generated sketch images x and reference images r by applying line extraction [lll17, SSISI16] and deformation [SMW06] to colour images y, respectively. We use ResNet-34 to extract latent codes from reference images. ResNet-34 was pre-trained on ImageNet [RDS*15] and Danbooru2020 [AcB21] for image and multi-label classification, respectively. The top 6000 tags by frequency were adopted for the multi-label training. Note that these tags were not cleaned, so most of them are useless for colourization as they are not colour-related, e.g., 'solo' and '1_girl'. A partial list of the effective tags is included in the supplementary materials.
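As a concrete illustration, a minimal PyTorch sketch of a frozen ResNet-34 reference encoder follows; the checkpoint path is hypothetical, and truncating the network before global pooling (so that a spatial grid of latent codes survives) is our reading of the architecture, not the authors' released code.

```python
import torch
import torchvision.models as models

# Start from ImageNet weights, then load the (hypothetical) multi-label
# fine-tuned state dict; strict=False skips the mismatched 6000-tag head.
resnet = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
resnet.load_state_dict(torch.load("resnet34_danbooru.pth"), strict=False)

# Keep everything up to the last residual stage: phi(r) then retains an
# (H/32 x W/32) grid of 512-channel latent codes instead of a single vector.
phi = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool + fc
phi.eval()
for p in phi.parameters():
    p.requires_grad_(False)  # the reference encoder stays fixed in both steps

with torch.no_grad():
    r = torch.randn(1, 3, 384, 384)  # a reference image batch
    codes = phi(r)                   # shape (1, 512, 12, 12)
```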

The first GAN training
We utilize conditional GAN (cGAN), L1 and total variation losses [IZZE17, AD05] to train our colourization GAN. The target of the first GAN training is expressed as
$$\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda_{L1}\,\mathcal{L}_{L1}(G) + \lambda_{tv}\,\mathcal{L}_{tv}(G), \quad (1)$$
where the hyperparameters $\lambda_{L1} = 100$ and $\lambda_{tv} = 0.0001$ follow our pre-experiments and Refs. [IZZE17, ZLW*18, JAF16]. A lower value of $\lambda_{L1}$ was found to decrease the colour diversity of the results, leading to a greyish appearance. The total variation loss is adopted to decrease artifacts, but it is unnecessary for lower-resolution images as the artifacts there are usually not noticeable. The recommended threshold for cancelling the total variation loss is $384^2$.
Conditional adversarial loss: The cGAN adopts sketch images x as the condition for D [IZZE17]. Latent codes obtained from the reference image are introduced as hints for G. Our cGAN loss can be expressed as
$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\big[\log D(x, y)\big] + \mathbb{E}_{x,r}\big[\log\big(1 - D(x, G(x, \phi(r)))\big)\big]. \quad (2)$$
G attempts to generate images that are as real as possible, while D should classify real/fake images correctly.
Pixel-level L1 loss: We adopt a pixel-level L1 loss to penalize the difference between the ground truth y and the generated image G(x, φ(r)). The loss is given as
$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,r}\big[\lVert y - G(x, \phi(r)) \rVert_1\big]. \quad (3)$$
Total variation loss: To encourage smoothness at high resolution, we follow former works [JAF16, ZYZH10] and utilize total variation regularization [AD05]. Using $\hat{x}$ to represent G(x, φ(r)) for brevity, the regularization is formulated as
$$\mathcal{L}_{tv}(G) = \sum_{i,j}\Big((\hat{x}_{i,j+1} - \hat{x}_{i,j})^2 + (\hat{x}_{i+1,j} - \hat{x}_{i,j})^2\Big)^{\frac{\eta}{2}}, \quad (4)$$
where $\hat{x}_{i,j}$ denotes the (i, j)th pixel in the colourized result G(x, φ(r)) and η is set to 1 in accordance with Mahendran and Vedaldi [MV15].
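For concreteness, a minimal PyTorch sketch of this first-training generator objective follows. The generator signature G(x, codes), the conditional discriminator D(x, img) returning logits and the BCE-style adversarial term are our assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

LAMBDA_L1, LAMBDA_TV = 100.0, 1e-4  # hyperparameters from the paper

def total_variation(img, eta=1.0):
    # Isotropic TV regulariser [AD05, MV15]: sum of (dh^2 + dv^2)^(eta/2).
    # The small epsilon keeps the gradient finite at zero difference.
    dh = img[:, :, :-1, 1:] - img[:, :, :-1, :-1]
    dv = img[:, :, 1:, :-1] - img[:, :, :-1, :-1]
    return ((dh ** 2 + dv ** 2 + 1e-8) ** (eta / 2)).sum()

def generator_loss(D, G, x, y, r_codes, high_res=True):
    fake = G(x, r_codes)                 # colourized result G(x, phi(r))
    logits = D(x, fake)                  # D is conditioned on the sketch x
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    l1 = F.l1_loss(fake, y)              # pixel-level L1 against ground truth
    tv = total_variation(fake) if high_res else 0.0  # dropped below 384^2
    return adv + LAMBDA_L1 * l1 + LAMBDA_TV * tv
```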

Receiving block
Reference latent codes are input to the GAN through the receiving block, marked in red in Figure 2. To investigate latent manipulation in the proposed system for the second training, which will be introduced in Section 3.5, we propose two models with different receiving blocks. Details of the receiving blocks are shown in Figure 3, in which the Attention and Add models adopt an attention block and a global average pooling (GAP) block, respectively. Given the sketch-side input of the receiving block $\hat{x}$, its channel size $d_s$, the convolution H and the output of the receiving block $\psi(\cdot)$, the corresponding latent codes of the GAN $\psi(\hat{x}, r)$ can be expressed as
$$\psi(\hat{x}, r) = \begin{cases} \hat{x} + O\Big(\mathrm{softmax}\big(\tfrac{Q(\hat{x})\,K(r)^\top}{\sqrt{d_s}}\big)\,V(r)\Big) & \text{(Attention)} \\[4pt] H(\hat{x}) + \mathrm{GAP}(r) & \text{(Add)} \end{cases} \quad (5)$$
where Q, K, V, O represent the linear transformations in the attention-based receiving block. Note that r denotes φ(r) or δ_b depending on the reference, and δ_b will be introduced in Section 3.5.
Different from the Add model, which directly adds the globally averaged latent code GAP(r), the Attention model indirectly modifies the GAN's latent codes $\psi(\hat{x}, r)$ through the attention, where $\psi(\hat{x}, r)$ is calculated on the basis of the spatial relationship between $\hat{x}$ and r. The Attention model performs better in reference-based colourization as attention preserves more local information, whereas the Add model is easier for latent manipulation, as discussed in Sections 3.5 and 4.5.
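A condensed sketch of the two receiving-block variants follows, assuming flattened feature grids; the layer dimensions and the exact placement of the convolution H are our approximations of Figure 3.

```python
import torch
import torch.nn as nn

class AttentionReceiving(nn.Module):
    # Cross-attention receiving block: Q from the sketch-side features,
    # K/V from the reference latent codes, plus an output projection O.
    def __init__(self, ds, dr):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(ds, ds), nn.Linear(dr, ds), nn.Linear(dr, ds)
        self.o = nn.Linear(ds, ds)
        self.scale = ds ** -0.5

    def forward(self, x, r):  # x: (B, Nx, ds), r: (B, Nr, dr) flattened grids
        attn = torch.softmax(self.q(x) @ self.k(r).transpose(1, 2) * self.scale, -1)
        return x + self.o(attn @ self.v(r))  # indirect, spatially varying update

class AddReceiving(nn.Module):
    # Add model: the globally averaged reference code is added directly.
    def __init__(self, ds, dr):
        super().__init__()
        self.proj = nn.Linear(dr, ds)

    def forward(self, x, r):  # r: (B, Nr, dr)
        return x + self.proj(r.mean(dim=1, keepdim=True))  # GAP, broadcast add
```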

Reference-based channel-wise attention
Our pre-experiments show a deterioration of colour diversity and similarity caused by missing reference latent codes: a number of latent codes are unnecessary in the first decoding layers, so they are discarded before the corresponding visual attributes are synthesized in the middle layers. To solve this problem, we propose the RBCA block to receive hints in the intermediate upsampling layers; its position and flow chart are shown in Figures 2 and 3, respectively.
Note that x is re-weighted by the sigmoid output rather than added to it. We also adopt a residual connection to ensure a straightforward backpropagation path.
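The following is one plausible reading of the RBCA block in Figure 3, written as a minimal PyTorch module; the gate architecture is our assumption.

```python
import torch
import torch.nn as nn

class RBCA(nn.Module):
    # Reference-based channel-wise attention: decoder features are
    # re-weighted channel-by-channel by a sigmoid gate computed from the
    # reference code, with a residual connection for a straightforward
    # backpropagation path.
    def __init__(self, dec_ch, ref_ch):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(ref_ch, dec_ch), nn.Sigmoid())

    def forward(self, x, r_code):  # x: (B, C, H, W), r_code: (B, ref_ch)
        w = self.gate(r_code)[:, :, None, None]  # per-channel weights in (0, 1)
        return x + x * w                         # re-weight, then residual add
```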

Pre-trained reference encoder
The widely used perceptual loss [JAF16] and cycle consistency loss [JYTPA17] are insufficient for generating natural colours in sketch colourization, so a pixel-level restriction is necessary when training a colourization network. However, training with a pixel-level restriction requires pixel-level correspondence between (sketch, colour) pairs and semantic similarity between (reference, colour) pairs. As mentioned at the beginning of Section 3, reference images r were generated by applying a deformation $\mathcal{D}$ to colour images y, so we can re-write Equation (3) as
$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y}\big[\lVert y - G(x, \phi(\mathcal{D}(y))) \rVert_1\big].$$
If G and φ are jointly trained, they can be viewed as a united generator G′, and the optimization becomes
$$\min_{G'} \; \mathbb{E}_{x,y}\big[\lVert y - G'(x, \mathcal{D}(y)) \rVert_1\big].$$
The optimal G′ for this re-organized loss is $\mathcal{D}^{-1}$, the inverse transformation of $\mathcal{D}$, which ignores the sketch input x. This leads to a substantial deterioration in inference.
Let F and E denote the decoder and trainable encoder(s), respectively. When jointly training the encoders, the reference-based colourization can be expressed as y = F(z) ∼ P(y|x, r), where z = E(x, r), and the latent distribution of z is, therefore, P(z|y, x, r). Adopting a pre-trained reference encoder stabilizes this process by dividing it into two steps. First, the sketch encoder generates z′ by encoding the sketch image, expressed as z′ = E′(x), with latent distribution P(z′|y, x). Then, the receiving block obtains the embeddings z by conditioning on the reference information, such that $z = \psi(\hat{x}, r)$ according to Equation (5), where $\hat{x} = z' \sim P(z'|y, x)$. As the reference encoder is fixed, the optimization target changes from the latent distribution P(z|y, x, r) to the distribution P(z′|y, x) and the receiving block. The latent distribution P(z′|y, x) is irrelevant to the reference r and decides the image quality. Therefore, this change significantly improves the generated results, particularly compared with cases in which jointly trained encoders fail to match semantically corresponding regions between x and r.
Choice of reference encoder: We tested a series of frequently used networks for the reference encoder. Contrastive language-image pre-training (CLIP) encoders could colourize sketch images but were ineffective in adjusting colours using tags. CNNs other than ResNet-34 were less sensitive to colours because the number of colour-related channels does not increase as the networks become heavier, even though heavier networks perform better in segmentation and recognition. Considering efficiency and quality, we chose ResNet-34 as our default reference encoder and leave CLIP encoders for future work. All the networks were trained in the same way, which will be explained in Section 4.1.
Implications of poor segmentation: Our ResNet-34 was not well trained according to the pre-experiment, as its recall and precision were unsatisfactory. This poor recognition decreases the GAN's segmentation ability and the controllability of the results through missing reference latent codes. If a reference latent code is mostly missing during training, the GAN can hardly connect it with the corresponding visual attribute. For example, if the pre-trained CNN cannot precisely predict 'red_skirt', its corresponding visual attribute would be controlled by 'red_dress' or 'red_shirt', as they are all recognized as 'red_cloth'. The experiment in Section 4.5 will partially show this disadvantage.

The second mapping training
Motivated by latent manipulation research on StyleGAN [KLA19, KLA*20, GAOI19, YCW*21, WLS21], we propose the second training to manipulate the latent codes using the probabilistic values of text tags. Previous work has demonstrated that probabilities given by a pre-trained CNN contain sufficient latent information to colourize sketch images, so our second training is tailored to connect these probabilities with the visual features used in the first training through a mapping network ϕ that satisfies
$$\phi(r_t) \approx \phi(r_a) + \varphi(cls_t) - \varphi(cls_a),$$
where φ(r_t) and φ(r_a) are the visual features extracted from the target reference image r_t and anchor reference image r_a, respectively, and cls_t and cls_a are their corresponding probabilities given by our pre-trained classifier. Using a neural network to approximate the latent codes makes the manipulation more 'linear', enabling the input probabilities to be larger than 1.
The Attention and Add models that go through the second training and are combined with the mapping network ϕ are called M-Attention and M-Add in the following sections, respectively. The M-Attention model additionally contains a fully connected layer ω.
Training objectives: The second optimization is defined as
$$\min_{\varphi,\, \omega} \; \mathcal{L}_{hL1} + \mathcal{L}_{inv},$$
where $\mathcal{L}_{inv}$ and ω are tailored for the M-Attention model to compensate for the absence of spatial information in ϕ(cls); they are removed when training the M-Add model. Colourization performance is not significantly affected since the second training excludes G from the optimization.
The key idea of the second training is to modify the reference latent codes on the basis of the mapped probabilities. To achieve this, we look back at the identity
$$\phi(r) = \frac{\phi(r)}{\mathrm{GAP}(\phi(r))} \cdot \mathrm{GAP}(\phi(r)).$$
We regard φ(r) as a combination of $(\frac{H}{32} \times \frac{W}{32})$ latent codes, with H and W representing the height and width of the input images, respectively. φ(r) thus separates into a spatial part $\frac{\phi(r)}{\mathrm{GAP}(\phi(r))}$ and a content part GAP(φ(r)). As GAP(φ(r)) can be approximated by ϕ(cls), we use biased latent codes δ_b to approximate the target φ(r_t), formulated as
$$\delta_b = \phi(r_a) + F(r_a) \odot \big(\varphi(cls_t) - \varphi(cls_a)\big),$$
where ϕ(cls) is broadcast to the same shape as φ(r_a) by replicating the channel values, and φ(r_a) is added to ensure consistency between φ(r_a) and δ_b. Figure 4 illustrates how the target latent code φ(r_t) is approximated using δ_b. Here, F(r_a) is calculated for the M-Attention model as
$$F(r_a) = \omega\!\left(\frac{\phi(r_a)}{\mathrm{GAP}(\phi(r_a))}\right), \qquad \omega(u) = Wu + b,$$
as ω is a linear layer in our design. ResNet-34 adopts ReLU as its final layer, so φ(r_a) ≥ 0, and we assign $\frac{\phi(r_a)^{(c)}}{\mathrm{GAP}(\phi(r_a))^{(c)}} = \mathbf{1}$ for any channel c with $\mathrm{GAP}(\phi(r_a))^{(c)} = 0$. F(r_a) provides spatial latent information for the M-Attention model as a position weight matrix (PWM). Accordingly, F(r_a) = I for the M-Add model since the Add model receives globally averaged latent codes, as can be inferred from Figure 3 and Equation (5).
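A minimal sketch of this manipulation follows, assuming the δ_b formulation above; the function and argument names are ours, and the safe handling of zero-mean channels mirrors the convention just described.

```python
import torch

def spatial_part(phi_ra):
    # phi(r_a) / GAP(phi(r_a)), computed per channel; channels whose global
    # average is zero are assigned 1, matching the paper's convention.
    gap = phi_ra.mean(dim=(2, 3), keepdim=True)        # (B, C, 1, 1)
    ratio = phi_ra / gap.clamp(min=1e-8)
    return torch.where(gap == 0, torch.ones_like(phi_ra), ratio)

def biased_codes(phi_ra, cls_t, cls_a, mapper, pwm=None):
    # phi_ra: (B, C, h, w) anchor codes; cls_*: (B, n_tags) probabilities;
    # mapper: the self-adaptive MLP varphi; pwm: omega(spatial part) for
    # M-Attention, or None (identity) for M-Add.
    delta = (mapper(cls_t) - mapper(cls_a))[:, :, None, None]  # broadcast
    if pwm is None:
        pwm = torch.ones_like(phi_ra)                  # F(r_a) = I for M-Add
    return phi_ra + pwm * delta                        # approximates phi(r_t)
```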
We adopt ω to adjust the spatial part of the latent codes on the basis of their channel-wise relationships. We assume that similar latent codes in φ(r), such as those for 'red_hair' and 'green_hair', should receive the same weight when modified by ϕ(cls), as they control the same object 'hair'. ω should be a linear transformation so that the spatial information in $\frac{\phi(r_a)}{\mathrm{GAP}(\phi(r_a))}$ is not destroyed.
Hybrid L1 loss: The mapping networks convert φ(r_a) to φ(r_t) using the mapped probability vectors, as expressed in Equation (10). To achieve this, we tailor a hybrid L1 loss that maintains pixel- and feature-level consistency, written as
$$\mathcal{L}_{hL1} = \mathbb{E}\big[\lVert y_t - G(x, \delta_b) \rVert_1\big] + \mathbb{E}\big[\lVert \psi(\hat{x}, \phi(r_t)) - \psi(\hat{x}, \delta_b) \rVert_1\big].$$
The feature-level L1 is the core component for the M-Add model, as it encourages the generated latent codes to approximate the target φ(r_t), which can be inferred by combining Equations (5), (10) and (12). The pixel-level L1 penalizes differences in global attributes, such as 'sky' and 'theme', that are mostly controlled by the RBCA blocks.
Inversion loss: The feature-level L1 loss cannot propagate effective gradients to the mapping network ϕ in the M-Attention model because of the dot product $Q(\hat{x})K(r)$ in the attention-based receiving block, introduced in Equation (5). To optimize the mapping network ϕ for the M-Attention model, we tailor the inversion loss, formulated as
$$\mathcal{L}_{inv} = \mathbb{E}\big[\lVert \varphi(cls) - \mathrm{GAP}(\phi(r)) \rVert_1\big].$$
The inversion loss directly requires the mapping network to satisfy Equation (8). With the inversion loss, the hybrid L1 loss can optimize ω to modify $\frac{\phi(r_a)}{\mathrm{GAP}(\phi(r_a))}$ on the basis of its channel-wise relationships. However, the inversion loss differs from the feature-level L1 loss, as we will discuss in Section 4.5.
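A compact sketch of both second-training losses follows, consistent with the formulations above; the relative weighting of the two hybrid terms and all names are our assumptions.

```python
import torch.nn.functional as F

def hybrid_l1(G, psi, x, x_hat, y_t, phi_rt, delta_b, lam_feat=1.0):
    # Pixel-level term: keeps global attributes ('sky', 'theme') consistent
    # through the final image; feature-level term pulls the manipulated
    # codes psi(x_hat, delta_b) towards the target psi(x_hat, phi(r_t)).
    pix = F.l1_loss(G(x, delta_b), y_t)
    feat = F.l1_loss(psi(x_hat, delta_b), psi(x_hat, phi_rt))
    return pix + lam_feat * feat

def inversion_loss(mapper, cls, phi_r):
    # Requires varphi(cls) to recover GAP(phi(r)) directly, bypassing the
    # gradient-blocking dot product inside the attention receiving block.
    return F.l1_loss(mapper(cls), phi_r.mean(dim=(2, 3)))
```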
Mapping network ϕ: In accordance with our pre-experiments, the visual attributes are controlled by latent codes generated in different layers. For example, 'hair' and 'eyes' are controlled by the first and second fully connected layers, whereas 'sky' and 'theme' are controlled in the deeper ones, where the 'hair'- and 'eyes'-related codes become entangled. A plain multi-layer MLP consequently loses control of 'hair' and 'eyes' and produces entangled results. To solve this problem, we propose a specialized MLP called the self-adaptive MLP, in which outputs from different layers are concatenated and adaptively weighted, as illustrated in Figure 5 and sketched below.
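This sketch keeps every hidden layer's output and adaptively weights it before concatenation; using static learned weights is our simplification of the adaptive weighting in Figure 5, and the dimensions are assumptions (6000 tags, 512-channel codes).

```python
import torch
import torch.nn as nn

class SelfAdaptiveMLP(nn.Module):
    # Shallow-layer outputs (controlling 'hair', 'eyes') are preserved
    # alongside deeper ones ('sky', 'theme') instead of being overwritten.
    def __init__(self, n_tags=6000, dim=512, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(n_tags if i == 0 else dim, dim) for i in range(n_layers)])
        self.gate = nn.Parameter(torch.ones(n_layers))  # adaptive layer weights
        self.out = nn.Linear(n_layers * dim, dim)

    def forward(self, cls):                 # cls: (B, n_tags) probabilities
        outs, h = [], cls
        for layer in self.layers:
            h = torch.relu(layer(h))
            outs.append(h)
        w = torch.softmax(self.gate, dim=0)
        mixed = torch.cat([wi * oi for wi, oi in zip(w, outs)], dim=-1)
        return self.out(mixed)              # mapped latent code varphi(cls)
```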

Experiments
In this section, we first introduce the implementation details of the proposed method in Section 4.1 and then justify our network design through an ablation study in Section 4.2. In Section 4.3, qualitative and quantitative comparisons with baselines [ZLW*18, SLWW19, CUYH20, LKL*20, HLBK18] are conducted to demonstrate the superiority of our method in reference-based colourization. We also conducted two user studies, introduced in the same subsection, to investigate users' preferences and subjectively evaluate the similarity of colours. Finally, we validate the controllability of the proposed M-Attention and M-Add models through experiments on multi-label manipulation and discuss the differences in latent manipulation between the two models using the experimental results.
We quantitatively evaluate the quality of the generated images using the Fréchet inception distance (FID) [HRU*17, Sei20]; a lower FID indicates better image quality. Using the official PyTorch implementation, we computed the FID over the validation dataset 10 times and averaged the results. The reference images were shuffled for each evaluation.
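A hedged sketch of this protocol follows, using the pytorch-fid package; the directory names and the image-generation callable are placeholders for our setup.

```python
import random
import numpy as np
from pytorch_fid.fid_score import calculate_fid_given_paths

def evaluate_fid(generate_validation_images, runs=10):
    # generate_validation_images(seed, out_dir) is a user-supplied callable
    # (hypothetical) that colourizes the validation sketches with
    # re-shuffled references and writes the results to out_dir.
    scores = []
    for seed in range(runs):
        random.seed(seed)                       # reshuffle references per run
        generate_validation_images(seed, "fid_fake")
        scores.append(calculate_fid_given_paths(
            ["fid_real", "fid_fake"], batch_size=50, device="cuda", dims=2048))
    return float(np.mean(scores)), float(np.std(scores))
```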

Implementation details
We retrained the CNNs on the multi-label classification dataset Danbooru2020 for two epochs. As source data, 855,876 images were collected from Danbooru2019 Figures, a subset of Danbooru2020 [AcB21] that only contains figure images. Colour images were resized to $512^2$ and used to generate 766,454 training and 89,422 validation triples. We implemented the framework in PyTorch and trained the proposed models on four Tesla P100s at a batch size of 64 and on an NVIDIA GeForce 3090 at a batch size of 32 for nine epochs. Baselines were trained on the 3090 for the same number of epochs, but their batch sizes were lowered accordingly owing to their higher GPU memory cost. We adopted the Adam optimizer [KB15] with learning_rate = 0.0001 and betas = (0.5, 0.99). Input images were randomly rotated, flipped and resized to $384^2$ before each iteration, with identical transformations applied to each (sketch, colour) pair. We excluded validation images from all training sets to ensure that they were only used for evaluation. The second training was conducted with the same settings for three epochs.
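The paired augmentation is the detail most easily got wrong, so a minimal sketch follows; the optimizer settings are from the paper, while the rotation range is our assumption.

```python
import random
import torch
import torchvision.transforms.functional as TF

def make_optimizers(G, D):
    # Adam settings from Section 4.1.
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.99))
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.99))
    return opt_g, opt_d

def paired_augment(sketch, colour, size=384):
    # Identical random rotation / flip / resize for a (sketch, colour) pair,
    # preserving pixel-level correspondence between the two images.
    angle = random.uniform(-15.0, 15.0)   # rotation range is our guess
    flip = random.random() < 0.5
    out = []
    for img in (sketch, colour):
        img = TF.rotate(img, angle)
        if flip:
            img = TF.hflip(img)
        out.append(TF.resize(img, [size, size]))
    return out
```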

Ablation study
Reference encoder: To justify the adoption of a pre-trained CNN, we trained three models in which ResNet-34 was (a) jointly trained with the GAN, (b) fixed and pre-trained on ImageNet and (c) fixed and pre-trained on Danbooru2020 and ImageNet. The samples in Figure 6 show that (a) generated images with better quality and similarity than (b) and (c) during training but deteriorated strongly in evaluation, whereas the results of (c) are much better than those of (a) and (b) in both diversity and similarity of colours. In addition to this qualitative experiment, an FID evaluation was conducted for an objective comparison; the results in Table 1 show a significant improvement.
RBCA block: To demonstrate the effectiveness of the RBCA blocks, we performed a quantitative evaluation using the FID. As shown in Table 1, the proposed models achieved better scores when adopting RBCA blocks.
Mapping network: We objectively show the advantage of the self-adaptive MLP by comparing the feature-level L1 loss labelled in Equation (6). A lower loss indicates better controllability of the results, as it measures the distance between the target codes $\psi(\hat{x}, \phi(r))$ and the modified codes $\psi(\hat{x}, \delta_b)$. For a fuller comparison of the M-Attention models, we also show the inversion loss used in the second training. As shown in Figure 7, the models with the self-adaptive MLP were better optimized.

Comparison with baselines
To justify our method for reference-based colourization, we compare our results with those generated by the baselines in this subsection. Our baselines include StarGAN, MUNIT, IconGAN, SCFT and, most importantly, Style2paints.
StarGAN [CUYH20] and MUNIT [HLBK18] encode the input images and decode their latent representations with style codes in accordance with the references. IconGAN [SLWW19] adopts separate discriminators for colour and structure. SCFT [LKL*20] obtains the references by deforming the input colour images and records the deformation before each iteration so that the networks can be trained to find corresponding regions. These baselines jointly train multiple encoders for different styles of input images. Different from these methods, Style2paints [ZLW*18, ZJL20, ZLS*21] adopts two-stage training and a pre-trained InceptionNet. It is an integrated application that requires users to provide hints manually for each input and to go through post-processing, so preparing sufficient results from Style2paints for FID evaluation is difficult. We instead conducted two user studies to explore users' preferences for comparison.
Qualitative comparison: Colourized images generated by the proposed models and the baselines are shown in Figure 8 to demonstrate our advantage in reference-based colourization. Only ours and Style2paints produce visually pleasant textures in the results. However, the colours synthesized by Style2paints are less similar to the references than ours, especially for the eyes and skin. More samples are included in the supplementary materials.
Quantitative comparison: In addition, we conducted a quantitative experiment using the FID. As shown in Table 1, our method achieved a much lower FID than the baselines. StarGAN v2 [CUYH20], IconGAN [SLWW19], SCFT [LKL*20] and MUNIT [HLBK18] rank from second to lowest, as they are ineffective in generating visually pleasant colours and maintaining the structure of objects.

Computational cost: It took 7 h to retrain ResNet-34 on Danbooru2020 for multi-label classification, and the proposed first training cost 45 h on an NVIDIA GeForce 3090. As shown in Table 2, our training time is much lower than that of most baselines owing to a simpler architecture compared with Refs. [CUYH20, HLBK18] and less pre-processing compared with Lee et al. [LKL*20]. Although training IconGAN [SLWW19] is faster than our CNN pre-training plus first training, IconGAN cannot generate images of as high quality as our model, in accordance with the following comparisons. As the other methods are not capable of tag-based control, we excluded the time of our second training, which took another 8 h, from this comparison.
User study: To investigate users' preferences, we conducted two user studies using the Attention model. Since only ours and Style2paints produce visually pleasant results, the first user study compares these two methods directly. The result, summarized in Figure 9, indicates that our results are preferred by most participants while also achieving a higher score in colourization performance.
Another user study was conducted to compare the proposed method with all baselines. We arranged four questionnaires in the second user study, each with 20 (sketch, reference, generated) image triples. For example, in questionnaire #1, the results used in groups [1-5], [6-10], [11-15] and [16-20] were from ours, Style2paints, StarGAN and IconGAN, respectively. We invited 17 participants to rate the quality and similarity of the result for each triple. The average scores and total counts are shown in Table 3 and Figure 10, respectively, showing that our Attention model achieves the highest scores in both quality and similarity.

Multi-label manipulation
To investigate the controllability of the proposed models, we performed multi-attribute manipulation by changing the values of hair-related tags, as shown in Figure 11. The values increase along the axes, and a progressive change of hair colour can be observed in the results. We then tested the disentanglement and effectiveness for global attributes, as shown in Figure 12, where the global hue of the images is modified on the basis of the manipulation of 'sky' labels without influencing the eyes and hair. To explore the linearity of the manipulation, we generated two sets of results using the M-Attention and M-Add models, respectively, as shown in Figure 13. These samples demonstrate effective control for values larger than 1; according to our experiments, the input values can range over approximately [0, 5]. A minimal sweep is sketched below.
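This sketch sweeps one tag's probability over [0, 5] in the style of Figures 11 and 13, reusing the biased_codes() helper sketched in Section 3.5; the tag index and all names are ours.

```python
import torch

def tag_sweep(G, mapper, x, phi_ra, cls_a, tag_index, steps=6):
    # Raise a single colour tag (e.g. 'green_hair') from 0 to 5 and
    # recolourize; values beyond 1 remain effective thanks to the
    # mapping network. M-Add case shown (pwm=None, i.e. F(r_a) = I).
    frames = []
    for v in torch.linspace(0.0, 5.0, steps):
        cls_t = cls_a.clone()
        cls_t[:, tag_index] = v
        d_b = biased_codes(phi_ra, cls_t, cls_a, mapper)
        frames.append(G(x, d_b))  # G applies its receiving blocks internally
    return frames
```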
These experimental results qualitatively prove the controllability of our models; however, there are a number of differences between the M-Attention and M-Add models, which we will discuss in the next subsection.

Difference between two mapping models
As introduced in Equation (5), the M-Attention model modifies $\psi(\hat{x}, \delta_b)$ on the basis of the dot product $Q(\hat{x})K(\delta_b)$. To investigate the information included in $\hat{x}$, we show a result generated without the reference r in the first column of Figure 14, which indicates that texture and identity are synthesized before the references are received. Therefore, if a number of latent codes are missing in $\hat{x}$, δ_b is ineffective in manipulating the related visual attributes.
While the Add model ignores the spatial similarity computed by the dot product in attention, the feature-level L1 loss in Equation (6) can map the probabilities cls directly into the latent space of the GAN, which makes the M-Add model easier to manipulate.

Conclusions
We have presented a novel system for reference-based sketch colourization that generates visually pleasant and adjustable results by adopting a pre-trained CNN as the reference encoder. We demonstrated the effectiveness of the proposed RBCA block and self-adaptive MLP in colourization through an ablation study. Qualitative/quantitative comparisons and user studies against the baselines show our advantages in reference-based colourization. We also presented experimental results on multi-label manipulation to demonstrate the controllability of our models and to investigate spatial latent manipulation.
However, a number of limitations remain. First, colourization performance deteriorates as the line density of the input sketch images decreases, resulting in a loss of texture information. While most generative models rely on noise inputs to compensate for this missing information in single-condition generation, our model is designed for dual-condition generation (sketch + reference image or sketch + text tags) and eschews this approach owing to its negative impact on training stability and image quality. Second, our ResNet-34 is inefficient in multi-label classification: we found it performs much worse than DeepDanbooru [Kic], which is too heavy for our research, and this drawback decreases the GAN's segmentation ability. Finally, the proposed M-Attention model cannot directly manipulate the latent codes in the GAN, which may degrade the controllability of the results, as discussed in Section 4.5. Improving the generative model and adopting well-trained CLIP encoders will be the focus of our future work.

Figure 1 :
Figure 1: Multi-label manipulated results generated by the proposed Attention and M-Attention models. Our method can colourize sketch images using reference images and text tags.

Figure 2 :
Figure 2: Illustration of the proposed network architecture and training flows. ω is a single fully connected layer, and the orange data flow additionally computes the position weight matrix (PWM) for the M-Attention model. x, y and r denote the sketch, original and reference images, respectively; H and W represent the height and width of the input images, respectively. We adopt ReLU as the activation function in the intermediate layers of the GAN and Tanh in its final layer.

Figure 3 :
Figure 3: Illustration of the receiving blocks and RBCA blocks. The GAP and attention blocks are used in the Add and Attention models, respectively. We label the shape of the feature maps at the top of the corresponding grey rectangle, where n = h × w and c, h, w denote the channel, height and width at the corresponding layer, respectively. GAP, global average pooling; RBCA, reference-based channel-wise attention.

Figure 4 :
Figure 4: Illustration of how to approximate φ(r_t) using δ_b. Converting φ(r_a) to φ(r_t) on the basis of the vector distance is better than directly mapping ϕ(cls_t) to φ(r_t), as the latter ignores the difference between the latent spaces.

Figure 5 :
Figure 5: The self-adaptive MLP used to generate latent codes, taking classification probabilities as input. MLP, multi-layer perceptron.

Figure 6 :
Figure 6: Comparison of results generated by models using different reference encoders. The reference encoder was (a) jointly trained with the GAN, (b) fixed and pre-trained on ImageNet and (c) fixed and pre-trained on ImageNet and Danbooru2020. GAN, generative adversarial network.

Figure 7 :
Figure 7: Comparison of feature-level L1 and inversion losses during training. The losses are smoothed by an exponential moving average with the smoothing weight set to 0.9.

Figure 8 :
Figure 8: Qualitative comparison with the Add model and baseline methods. The sketch images used in the third and fourth rows were manually drawn by a human artist.

Figure 9 :
Figure 9: User study results. Participants were invited to rate the quality of their preferred result from 1 to 5, with 5 as the best. The average colourization score is calculated as score / pt, where pt denotes the number of times a method was preferred.

Figure 10 :
Figure 10: Rating score distribution in the second user study. A higher score indicates better performance.

Figure 13 :
Figure 13: Multi-attribute results generated by the proposed M-Attention and M-Add models, where the baseline columns show the respective reference-based results. The manipulated tags are 'blue_shirt', 'green_hair' and 'yellow_eyes'.

Table 1 :
FID score evaluation for the ablation study and comparison with baseline methods. A lower score indicates better quality of the generated images. 'Fix' and 'Train' indicate that the reference encoder is fixed or trained during the colourization training, respectively, and 'D' and 'I' indicate that the reference encoder is pre-trained on Danbooru2020 [AcB21] + ImageNet [RDS*15] or on ImageNet only, respectively. Bold values highlight the best scores.

Table 2 :
Comparison of training times (days) on an NVIDIA 3090 GPU for reference-based colourization. We spent 7 h retraining ResNet-34 for multi-label classification, which is shown in brackets in the table.

Table 3 :
Average scores in the second user study.