Disentangled and controllable sketch creation based on disentangling the structure and color enhancement

Existing sketch-based image processing methods include sketch recognition, sketch synthesis and sketch-based image retrieval. For sketch creation, a meaningful task is proposed, namely disentangled and controllable sketch creation (DCSC), based on disentangling the structure and color enhancement. Specifically, as the first subtask, sketch structure enhancement (SSE) is used to enhance a non-professional sketch (NPS) and obtain a professional sketch (PS), a process denoted as NPS2PS. A data set named SketchMan is first provided, consisting of NPSs and PSs with various postures in different scenes. SSE is trained as a conditional image-to-image translation problem, and there are three models: direct sketch-to-sketch (SS), grayscale guided SS and contour guided SS. Multiple IoU metrics are proposed based on the Corner Point Map (CPM), Straight Line Map (SLM) and Segmented Area Map (SAM). As the second subtask, sketch color enhancement (SCE) is trained as a two-stage framework containing a topology enhancement network (TE-Net) that maps a sketch to the corresponding grayscale domain and a color injection network (CI-Net) that injects the global color feature into the AdaIN residual blocks to perform adaptive sketch colorization. The TE-Net and CI-Net disentangle the topological and color features to obtain more controllable and diverse SCE results. Experimental results demonstrate that our proposed methods effectively address the challenging and meaningful DCSC task compared with other state-of-the-art methods.


FIGURE 1
Sketch structure enhancement (SSE) is an NPS2PS task conducted to synthesize many delicate works. We show some NPS and PS images drawn by human and computer

FIGURE 2
Sketch color enhancement (SCE) results based on various reference images. We show 6 synthesized colored sketches with respect to the different color styles

The data distribution of the enhanced results of the challenging SSE task is still far from that of the original sketches. We verify the performance of our model using the original sketches in the SCE subtask, as shown in Figure 2.
Edge information is vital for many image translation tasks, and the SSE task is particularly distinctive in this regard. First, many color or semantic rendering tasks [11,14] need to generate more details based on a large number of object edges. In inpainting tasks [5,15] with removed content, the edges of the input image are only slightly distorted, which is similar to the super-resolution task [16]. Moreover, scene generation tasks [17,18] generally translate a new scene from a caption, scene graph or detection box, which may involve many structural distortions rather than precise edge features.
In an SSE task, the edge information has extensive structural distortions as well, which need to be completed and optimized. Furthermore, the flaws of SSE results are more easily exposed: after NPS2PS, unreasonable strokes on a white canvas cause an uncomfortable visual effect for human eyes. To benefit research on SSE and other related works, the established data set SketchMan provides NPSs covering diverse appearances and postures in both simple and complex scenes, as shown in Figure 3. Note that SketchMan contains both simple characters and multiperson scenes, and the sketch structures are more complex than those of other databases [6,8,19]. For each sample, there are two kinds of NPS with different degrees of alignment to the PS strokes. As shown in Figure 4, the freehand NPS is sparser than the approximate PS. In SketchMan, the high-level feature maps contain the background mask maps, grayscale maps and color maps. As for low-level features, three kinds of maps are obtained by means of corner detection, straight line detection and segmentation based on superpixel clustering.
For the SCE task, inspired by recently proposed works [20][21][22][23][24], we adopt adaptive instance normalization (AdaIN) [25] to control the color style transformation of the sketch. Specifically, we conduct controllable sketch colorization by disentangling the topological and color factors; that is, the first stage TE-Net is trained to generate a grayscale map of the sketch that represents the topology completion result, on the basis of which the second-stage CI-Net performs diverse color modifications based on a specific reference map. In this way, both the topological and color features are enhanced to generate more controllable and detailed color maps.
In summary, the main contributions are as follows:

FIGURE 3
An overview of the proposed dataset SketchMan. There are four parts, that is, non-professional sketches, professional sketches, high-level feature maps and low-level feature maps

FIGURE 4
Examples of manga creation in our database. The first row is a PS, the second row is an approximate professional NPS, and the last row is a free-hand NPS

FIGURE 5
The statistics of character attributes in SketchMan

• We propose disentangled and controllable sketch creation (DCSC) based on disentangling sketch structure enhancement (SSE) and sketch color enhancement (SCE).
• We establish SketchMan, which covers various high-resolution anime characters. The statistics of the character attributes in SketchMan are shown in Figure 5. Furthermore, feature maps in the high-level and low-level domains are provided for comprehensive analysis.
• We propose SS, grayscale guided SS and contour guided SS to address the SSE task. In addition to ODS (Optimal Dataset Scale), OIS (Optimal Image Scale) and AP (Average Precision), we propose intersection over union (IoU) evaluations of the NPS and PS with respect to points, straight lines, areas and strokes, which establish an important benchmark.
• We propose a disentangled and controllable sketch colorization method that includes a topology enhancement network (TE-Net) and a color injection network (CI-Net). Compared with state-of-the-art methods, including pix2pix [10], PaintsChainer [26], Style2Paints [14] and DeepColor [27], the proposed SCE model achieves higher-quality automatic sketch colorization.
This paper is an extension of our previous conference version [28]. In addition to providing more extensive experiments and in-depth analysis, there are three major differences between this paper and its previous version: 1) DCSC is proposed based on disentangling SSE and SCE. 2) A new variant of the SS task with self-attention is proposed, and the corresponding experiments are conducted. 3) An efficient SCE framework is proposed to achieve controllable sketch colorization based on arbitrary reference maps.

Sketch generation
Sketch generation based on a variational model [6] can automatically produce novel stroke sequences. Sketch abstraction [4,29] can generate sparser sketches whose semantics can still be recognized correctly. The sketch completion task [5] is a sketch inpainting task on a white canvas. Casually drawn strokes are refined by color rendering in SketchyGAN [2]. SSE for animated characters is far more complex and challenging than these works. Sketch colorization is usually based on edges whose distribution approximates that of real professional sketches. Examples include PaintsChainer [26], Scribbler [30], Style2Paints [14] and Comicolorization [31].

Generative models based on adversarial learning
Generative adversarial networks (GANs) [32,33] have been widely used to improve many tasks, such as neural rendering [2], content completion [5], attribute editing [11] and style transfer [12,13]. Compared with other generative models, such as variational autoencoders (VAEs) and flow-based models, GANs achieve higher-quality results by means of adversarial learning.

Styled image synthesis
There are some related works on styled image synthesis [13, 20-24, 34-36]. StyleGAN [34] can generate photorealistic faces from noise after disentangling the attribute factors using a feature-mapping network. U-GAT-IT [13] conducts unsupervised image-to-image translation with adaptive layer-instance normalization and a class activation map (CAM) loss. SimSwap [35] proposes a simple face-swapping framework using AdaIN [25] to inject the identity feature of the source face. FaceShifter [36] adopts SPADE [37] to combine the attributes of the target and the identity of the source to adaptively generate the swapped face.

APPROACH
In this section, we introduce DCSC based on disentangling the structure and color enhancement, as shown in Figure 7. For SSE, we first present our dataset. Then, we explore three SSE routines considering different pixel distributions. As the second subtask, we introduce a novel approach to perform disentangled and controllable SCE, described below.

Dataset
SketchMan selects 2120 high-quality PS samples from pixiv [40], covering approximately 2690 animation characters. The attribute distributions of SketchMan are shown in Figure 5. We invited students to mimic these professional sketches, obtaining corresponding free-hand NPS images and approximate professional sketch (APS) images. Free-hand NPS aims to simulate the random and ambiguous semantic layout. APS is drawn using a fixed brush size. We study the challenging and meaningful SSE based on free-hand NPS. In the high-level sketch domain, we filter the background content using the corresponding mask based on PS and maintain the main body areas of the anime characters, as shown in Figure 8.

Sketch abstraction based on SLIC
The low-frequency information of the original sketch constitutes the abstract sketch. Unlike previous works [4,29], we utilize simple linear iterative clustering (SLIC) [41] based on superpixel clustering of the real PS, rather than a color image of the PS, to obtain the shape parsing map in the low-level sketch domain, as shown in Figure 9. Specifically, the SLIC distance is formulated as

D = sqrt((D_c / N_c)^2 + (D_s / N_s)^2),

where the distances in the CIELAB color space and the pixel space are denoted as D_c and D_s, respectively, and N_c and N_s are used to normalize D_c and D_s. The final clustering considers both of them. Additionally, as shown in Figure 10, we find that sketch abstraction is not ideal when a single SLIC operation is directly applied to the original PS; thus, we apply two additional SLIC operations after flipping the PS horizontally or rotating the PS 90 degrees anticlockwise, and then rotate the result back to its original orientation. Finally, all SLIC maps are combined element-wise to obtain the integral regional segmentation map. We show more examples of sketch abstraction in Figure 11.
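The combined distance above can be sketched in a few lines. This is a minimal illustration of the normalized color-plus-spatial metric only, not the full SLIC clustering loop, and the normalization constants n_c and n_s below are arbitrary demo values rather than the paper's settings:

```python
import numpy as np

def slic_distance(pixel_lab, pixel_xy, center_lab, center_xy,
                  n_c=10.0, n_s=20.0):
    """Combined SLIC distance: CIELAB color distance plus spatial
    distance, each normalized by its own scale constant."""
    d_c = np.linalg.norm(np.asarray(pixel_lab, float) - np.asarray(center_lab, float))
    d_s = np.linalg.norm(np.asarray(pixel_xy, float) - np.asarray(center_xy, float))
    return np.sqrt((d_c / n_c) ** 2 + (d_s / n_s) ** 2)
```

In the full algorithm this distance is evaluated between every pixel and nearby cluster centers, and pixels are assigned to the center minimizing D.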

NPS augmentation
We use Laplacian mesh editing [42] to generate more NPS images. In Figure 12, the first row shows a PS and four movement situations of 9 feature points, and the second row shows the initial NPS and the deformed NPS images.
Specifically, vertex i is denoted as v_i = (x_i, y_i, z_i) in Cartesian coordinates. The differential coordinates based on the Laplacian operator are defined as

delta = L_s v, with L_s = I − D^{-1} E_a,

where L_s is the Laplacian matrix, E_a is the adjacency matrix indicating whether two vertices share an edge, and D is a diagonal matrix consisting of the values d_i, the number of vertices adjacent to vertex i. The large linear system for NPS deformation is then formulated as a least-squares problem that preserves the differential coordinates while satisfying the position constraints of the moved feature points:

v' = argmin_{v'} ( ||L_s v' − delta||^2 + Σ_{i∈C} w^2 ||v'_i − c_i||^2 ),

where C is the set of constrained vertices, c_i are their target positions, and w weights the constraints.
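The least-squares deformation step can be illustrated with a toy 2D solve. The path-graph contour, the anchor choice and the constraint weight w below are hypothetical demo values for illustration, not the settings used in the paper:

```python
import numpy as np

# Toy open contour: 5 vertices on a horizontal line in 2D.
V = np.array([[0., 0.], [1., 0.], [2., 0.], [3., 0.], [4., 0.]])
n = len(V)

# Adjacency matrix E_a of a path graph and its degree matrix D.
E = np.zeros((n, n))
for i in range(n - 1):
    E[i, i + 1] = E[i + 1, i] = 1.0
Dm = np.diag(E.sum(axis=1))

# Uniform Laplacian L_s = I - D^{-1} E_a and differential coordinates.
Ls = np.eye(n) - np.linalg.inv(Dm) @ E
delta = Ls @ V

# Soft position constraints: pin both endpoints, lift the middle vertex.
w = 10.0  # constraint weight (arbitrary choice for this demo)
anchors = {0: V[0], 2: V[2] + np.array([0., 1.]), 4: V[4]}
rows, rhs = [Ls], [delta]
for i, target in anchors.items():
    row = np.zeros((1, n))
    row[0, i] = w
    rows.append(row)
    rhs.append(w * target[None, :])

# Least-squares solve: keep differential coords while meeting anchors.
V_new, *_ = np.linalg.lstsq(np.vstack(rows), np.vstack(rhs), rcond=None)
```

With a large w the anchored vertices land near their targets while the Laplacian term keeps the rest of the contour smooth.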

Method
In this subsection, we first present the three pipelines of the SSE task and then introduce the network architecture of SCE, described below. For the SSE task, we propose three NPS2PS pipelines, that is, SS, grayscale guided SS and contour guided SS, as shown in Figure 7a.

SS task
A conditional GAN (cGAN) learns a mapping from an observed image x and a random noise vector z to a target-domain image y, G: {x, z} → y. Here, p_z and p_data represent the prior distributions of z and the target domain, respectively, and p_G represents the distribution of the synthesized domain. The cGAN loss is formulated as

L_cGAN(G, D) = E_{x,y}[log D(x, y)] + E_{x,z}[log(1 − D(x, G(x, z)))],

and the generator is optimized via G* = arg min_G max_D L_cGAN(G, D). We train the SS model to directly translate an NPS to a PS in a supervised manner, which serves as a naive benchmark for the SSE task. We use the L1 distance and the SSIM [43] loss together with the GAN objective; the former decreases blurring, and the latter enhances the structural similarity between G_SS(x) and the real PS y.
The min-max objective of the SS task is

min_{G_SS} max_D L_cGAN(G_SS, D) + λ_L1 E_{x,y}[||y − G_SS(x)||_1] + λ_SSIM L_SSIM(G_SS(x), y).

We also use the Huber loss to improve the SS model,

L_δ(a) = 0.5 a^2 if |a| ≤ δ, and δ(|a| − 0.5 δ) otherwise,

where the threshold δ selects between the ℓ2 and ℓ1 behavior. We denote this variant of SS as SS+Huber.
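The Huber switch between the quadratic and linear regimes can be written as a small helper, a direct transcription of the piecewise definition above:

```python
import numpy as np

def huber(residual, delta=1.0):
    """Huber loss: quadratic (l2-like) for |r| <= delta,
    linear (l1-like) beyond it, continuous at the switch point."""
    r = np.abs(np.asarray(residual, float))
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))
```

For example, huber(0.5) falls in the quadratic regime (0.5 · 0.25 = 0.125), while huber(2.0) falls in the linear regime (1 · (2 − 0.5) = 1.5).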
To improve the effect of discrimination, we train another advanced SS model. In addition to the L1 loss at the image level, the L1 distances in the intermediate feature spaces of the discriminator are considered via a feature-matching loss,

L_FM = Σ_i (1/N_i) E[||D^(i)(x, y) − D^(i)(x, G_SS(x))||_1],

where D^(i) denotes the i-th layer of the discriminator with N_i elements. We denote this variant of SS as SS+L1+FM.
Additionally, inspired by the Transformer [45], we add a self-attention layer to the encoder and decoder of the global generator of Pix2pixHD [11]. Specifically, following the conv, batch-norm and ReLU layers, the attentional channels are selected during propagation by learning query, key and value features. We denote this variant of SS as SS+L1+FM+Att.
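A minimal single-head self-attention over flattened spatial features might look as follows. The projection matrices here are plain NumPy stand-ins for the learned query/key/value layers, so this is a sketch of the mechanism rather than the generator's actual module:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(feat, Wq, Wk, Wv):
    """feat: (N, C) flattened spatial features; Wq/Wk/Wv: (C, d) projections.
    Every position attends to every other position, so long-range stroke
    dependencies can influence the local response."""
    q, k, v = feat @ Wq, feat @ Wk, feat @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]), axis=-1)
    return attn @ v

rng = np.random.default_rng(0)
feat = rng.normal(size=(16, 8))   # 16 spatial positions, 8 channels
W = rng.normal(size=(8, 4))
out = self_attention(feat, W, W, W)
```

In the generator this output is typically added back to the input features through a residual connection.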

Grayscale guided SS task
In the grayscale guided SS model, the contextual information of the complete topological structure of the intermediate grayscale map is more abundant and diverse than the sparse black-and-white content of the original NPS. Concretely, the grayscale guided SS task includes two stages.
(1) The sketch-to-grayscale (SG) module, which predicts the topology distributions of the dense pixels under the supervision of the grayscale PS; that is, the SG stage learns the mapping from the NPS x to the corresponding grayscale PS image y_G, G_SG: {x} → y_G. (2) The sketch extraction module, which learns the mapping from the predicted ŷ_G of the first stage to the real PS y. Since ŷ_G is an improved intermediate feature map whose boundary distribution is closer to the PS domain, we input ŷ_G to G_GS to implement sketch extraction, with the real PS as the ground truth: G_GS: {ŷ_G} → y. The generator learns to remove the color pixels around the sparse sketch.
The second stage is formulated analogously to the SS objective, with ŷ_G as the conditional input:

min_{G_GS} max_D L_cGAN(G_GS, D) + λ E[||y − G_GS(ŷ_G)||_1].

Contour guided SS task
Sketch abstraction removes most internal details and maintains the overall outline of the object, which makes it easier to fit the data distributions of the abstract sketches in the sketch-to-contour (SC) stage. Moreover, in the contour-to-sketch (CS) stage, the details of the PS object are mainly synthesized, and the fitting difficulty of NPS2PS is reduced in this coarse-to-fine way. Specifically, the contour guided SS model consists of two stages.
(1) Sketch abstraction, which is trained to guide SSE optimization with the object contour as a shape parsing clue of the PS. We input the NPS x to the SC stage to predict the corresponding PS contour y_C, G_SC: {x, z} → y_C, where y_C is extracted by means of SLIC.
(2) Sketch refinement, where the NPS and the predicted ŷ_C of the SC stage are concatenated as the input of the CS stage to generate the real PS. Since the NPS usually contains richer edge details than ŷ_C, we utilize both ŷ_C and the NPS x as the inputs of the CS network to conduct NPS2PS, denoted as G_CS: {ŷ_C, x} → y.

SCE task
As shown in Figure 7b, the SCE model contains a TE-Net and a CI-Net. The TE-Net has a U-Net structure [46]. Let X_s be the sketch obtained with SketchKeras [47]. The output of the TE-Net is the generated grayscale map X̂_g, and the L1 reconstruction loss is

L_rec = ||X̂_g − X_g||_1,

where X_g is the target grayscale map. We further add a perceptual loss to improve the feature matching between X̂_g and X_g.
We utilize the contextual loss [48] to measure the feature similarity between X̂_g and X_g; this loss reduces texture distortions after sketch colorization. It is formulated as

L_cx = −log(CX(Φ^l(X̂_g), Φ^l(X_g))),

where CX denotes the contextual similarity and Φ^l denotes the ReLU{3_2, 4_2} layers of the pretrained VGG19 network.
In the second stage, consisting of the CI-Net, the color feature extracted by the VGG19 model is injected into the grayscale code to control the affine transformation parameters of the AdaIN layers. Similar to the losses of the TE-Net, the reconstruction, perceptual and contextual losses are computed between the generated color image X̂_c and the target color image X_c. Let L_GAN be the adversarial loss used to discriminate the generated triplet {X_s, X̂_g, X̂_c} from the real triplet {X_s, X_g, X_c}. The total loss of our SCE model is the weighted sum of these reconstruction, perceptual, contextual and adversarial terms.
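The AdaIN operation at the core of the CI-Net can be sketched as follows. Here the per-channel style statistics stand in for the global color feature extracted by VGG19, and the tensor shapes are illustrative assumptions:

```python
import numpy as np

def adain(content, style_mean, style_std, eps=1e-5):
    """Adaptive instance normalization: normalize each channel of the
    content feature map, then re-scale and re-shift it with the style's
    per-channel statistics.
    content: (C, H, W); style_mean, style_std: (C,)."""
    mu = content.mean(axis=(1, 2), keepdims=True)
    sigma = content.std(axis=(1, 2), keepdims=True)
    normalized = (content - mu) / (sigma + eps)
    return style_std[:, None, None] * normalized + style_mean[:, None, None]

rng = np.random.default_rng(1)
gray_feat = rng.normal(size=(3, 8, 8))          # grayscale-branch features
styled = adain(gray_feat, np.array([1., 2., 3.]),  # injected color stats
               np.array([1., 1., 1.]))
```

After AdaIN, each channel of the output carries the injected style's mean and variance, which is how a reference map's color statistics steer the colorization.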

Implementation details
We uniformly resize the sketch images in the training set of SketchMan. Specifically, the short edge is set to 256, and the long edge is adaptively resized according to the width-to-height ratio. The NPS and corresponding PS are randomly cropped to 256×256 in the training stage. Generally, when dealing with a higher resolution such as 512×512, SSE models easily fit more short and noisy strokes; conversely, at a lower resolution such as 128×128, the synthesized sketches are blurry and low-quality. Our SSE and SCE models use the Adam [49] optimizer with β_1 = 0 and β_2 = 0.999.
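The short-edge resizing rule can be expressed as a small helper; rounding the long edge to the nearest integer is our assumption, since the text does not specify it:

```python
def resize_shape(width, height, short_edge=256):
    """Target (width, height) that sets the short edge to `short_edge`
    and scales the long edge by the same ratio."""
    if width <= height:
        scale = short_edge / width
        return short_edge, round(height * scale)
    scale = short_edge / height
    return round(width * scale), short_edge
```

For example, a 512×1024 image maps to 256×512, and a 300×200 image maps to 384×256.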

Quantitative metric of SSE
The performance of the SSE models is evaluated by three criteria: ODS, OIS and AP [50]. The precision/recall curves for the original and refined NPS are shown in Figure 13. Different from the edge detection task [51-54], whose edges are aligned more closely to the object boundaries, a hand-drawn sketch has more offset near the boundaries; therefore, the recall of SSE results is relatively low. As shown in Table 1, compared with the other SSE models, grayscale guided SS has a better F-score (ODS=.56, OIS=.56, AP=.34). A sketch has three important structural elements: points, lines and areas. Therefore, we consider their corresponding feature maps, that is, corner point maps (CPM), straight line maps (SLM) and segmented area maps (SAM), described below.

FIGURE 13
Precision/recall curves for NPS and our SSE approaches. Note that we report the indicators of our previous work [28]

We use pix2pixHD [11] as our backbone. To deal with images of any resolution, our SSE model improves pix2pixHD into a fully convolutional network by using a group of upsampling operations and a scale-invariant convolutional layer as the decoder architecture rather than deconvolutional layers. The training set contains 10,600 NPSs, where the 2120 initial NPSs are deformed into four groups of augmented NPSs, as illustrated in Figure 12. Note that in our protocol, different deformed NPSs simulate sketches drawn by different drawers, all of which are supposed to be mapped to one professional sketch image. There are 132 images in our test set, serving as the first benchmark for quantitative evaluation of the SSE task.
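For intuition, ODS and OIS can be sketched from per-image precision/recall curves. Note that this simplified version averages per-image F-scores rather than pooling TP/FP counts across the dataset as standard boundary-benchmark code does, so it is an illustration of the two aggregation strategies only:

```python
import numpy as np

def f_measure(p, r):
    """F-score 2PR/(P+R), with the empty case defined as 0."""
    p, r = np.asarray(p, float), np.asarray(r, float)
    return np.where(p + r > 0, 2 * p * r / (p + r + 1e-12), 0.0)

def ods_ois(precisions, recalls):
    """precisions/recalls: (n_images, n_thresholds) per-image P/R curves.
    ODS picks the single threshold maximizing the dataset-mean F-score;
    OIS picks the best threshold per image and averages those F-scores."""
    f = f_measure(precisions, recalls)
    ods = f.mean(axis=0).max()   # one optimal threshold for the dataset
    ois = f.max(axis=1).mean()   # optimal threshold per image
    return ods, ois
```

OIS is always at least as large as ODS, since the per-image optimum can only improve on a shared threshold.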
While drawing a sketch, changing the drawing direction at the corner points has a great impact on the professional level of the sketch. We use [55] to detect the corner points according to the multi-scale distance D, where D10 means that the minimum Euclidean distance between adjacent corner points is 10 pixels. As shown in Figure 14, D20 indicates the combination of the CPM results of D10 and D20, and the other notation is similar. We locate and extract the straight lines by means of the probabilistic Hough transform [56]. As mentioned in Section 3.1.1, the high-resolution original PS images are segmented into 20 superpixel regions by implementing SLIC to obtain the segmented area map. Note that we filter out the redundant segmentation edges of the SLIC results that do not belong to the sketch abstraction, using the background mask mentioned in Section 3.1.
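The multi-scale corner selection (e.g. D10) can be approximated by a greedy minimum-distance filter. This is an illustrative stand-in for the behavior of the detector in [55], not its actual algorithm:

```python
import numpy as np

def filter_corners(points, min_dist=10.0):
    """Greedily keep corner points so that every kept pair is at least
    `min_dist` pixels apart (the D10/D20 notion of corner-map scale)."""
    kept = []
    for p in np.asarray(points, float):
        if all(np.linalg.norm(p - q) >= min_dist for q in kept):
            kept.append(p)
    return np.array(kept)
```

With min_dist=10, a candidate 5 pixels from an already-kept corner is discarded, while one 20 pixels away is kept.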
Inspired by [57], which classifies different strokes into corresponding categories, we use the segmentation metric IoU to conduct a quantitative evaluation based on the semantic discrepancy of pixels. With k categories, where p_ij denotes the number of pixels with a ground-truth label of category i but a predicted label of category j, the IoU of category i is

IoU_i = p_ii / (Σ_j p_ij + Σ_j p_ji − p_ii),

so that TP, FN and FP correspond to p_ii, p_ij and p_ji, respectively. We set k to 2.
The IoU is calculated for both the foreground and background, that is, the sketch and canvas in the SSE task, to represent the proportion of pixels successfully enhanced in the NPS. Note that the mIoU is the mean of these two IoUs. We separate the sketch image into a background part and a foreground part using two thresholds, 225 and 250; a lower threshold means that fewer details are kept in the sketch. The IoU-based quantitative evaluation results are shown in Table 2, and the qualitative evaluation results are shown in Figure 15.
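The foreground/background IoU computation described above can be sketched directly; the binarization treats pixels darker than the threshold as sketch strokes, following the 225/250 convention in the text:

```python
import numpy as np

def sketch_iou(pred, target, threshold=250):
    """Binarize two grayscale sketch images (white canvas, dark strokes)
    at `threshold`, then compute IoU for the foreground (sketch), the
    background (canvas), and their mean (mIoU)."""
    p_fg = np.asarray(pred) < threshold
    t_fg = np.asarray(target) < threshold
    ious = []
    for p, t in ((p_fg, t_fg), (~p_fg, ~t_fg)):
        inter = np.logical_and(p, t).sum()
        union = np.logical_or(p, t).sum()
        ious.append(inter / union if union else 1.0)
    return ious[0], ious[1], sum(ious) / 2
```

For example, if the prediction darkens one row of a 4×4 canvas and the target darkens two rows, the foreground IoU is 4/8 = 0.5 and the background IoU is 8/12.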

Quantitative metric of SCE
We apply two major quantitative metrics to evaluate the performance of the SCE subtask.

Light sensitivity map
We use the light sensitivity map proposed in [58] to evaluate the colorization performance for the SCE subtask. This score reflects the autopainting ability and the degree of overfitting to the color hint.

FIGURE 16
The color-coded LBP, B3 and R1 share the same binary value

Color-coded local binary patterns (CCLBP) map
Furthermore, we use the color-coded local binary patterns (CCLBP) map proposed in [58] to evaluate the colorization performance. It reflects the rendering effect, for example, the smoothness and cleanness of the anime paintings. For each channel, the binary value of each neighbor j of pixel i is computed as

R_j = 1 if D_R^{ij} > T_th, and R_j = 0 otherwise,

where D_R^{ij} is the distance between the current pixel i and the adjacent pixel j in the R channel, and T represents the color difference used to determine the binary value at the corresponding location of pixel j. For instance, if the distance T_{R_2} between I(i, j) and I(i − 1, j + 1) is larger than T_th, the corresponding binary value R_2 is 1, as shown in Figure 16.
After obtaining the coded binary values, we transfer them to the color space to obtain the CCLBP map based on Equation (33).
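The per-channel binary coding can be illustrated as follows. The neighbor ordering and bit weights here are assumptions for the demo, since [58] defines the exact layout:

```python
import numpy as np

# Hypothetical 8-neighbor ordering, clockwise from the top-left.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
           (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(channel, y, x, t_th=10):
    """8-bit LBP-style code for one channel at pixel (y, x): each bit is 1
    when the absolute difference to that neighbor exceeds t_th."""
    c = float(channel[y, x])
    bits = [int(abs(float(channel[y + dy, x + dx]) - c) > t_th)
            for dy, dx in OFFSETS]
    return sum(b << i for i, b in enumerate(bits))
```

A perfectly flat patch yields code 0 (a smooth rendering), while noisy neighborhoods set many bits, which is what the CCLBP map visualizes in color space.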

Experimental results
As shown in Table 2, compared with the NPS, the sk-IoU value at the 250 threshold for the SS task shows an improvement of around 6.5%. Since a sketch image naturally has large white areas, it is challenging to enhance the sparse strokes. The contour guided SS approach is also superior to the grayscale guided SS approach, because the grayscale guided SS model usually produces results with more randomly distributed strokes and noise. Overall, SS+L1+FM achieves the best IoU performance.
In the test set, the SSE models have completed some critical areas of the NPS to some extent, and the structure has been optimized, for example, in the head area. However, the overall cleanliness, sketch aesthetics and stroke distribution are still relatively inferior to those of the real PS, especially in challenging and complex scenes, for example, Figure 17. Moreover, as shown in Figure 18, the overlaps of the CPM, SLM and SAM between the SSE results and the original PS are still sparse. Generally, the more overlap there is, the more professional the predicted PS is. For the SCE task, a quantitative comparison with PaintsChainer [26], Style2Paints [14], DeepColor [27] and Pix2pix [10] is shown in Table 3. Our model exhibits competitive performance compared with other state-of-the-art methods. We use the test set in [58] to evaluate autonomous sketch colorization in this paper, and some comparative examples are shown in Figure 19. More SCE results based on various reference images are shown in Figure 20.

User study of the SSE task
We conducted a subjective evaluation of the SSE task considering sketch professionalism and AI forgery detection, described as follows. We invited 20 people, half of whom are anime amateurs and the others artists. After briefly introducing the SSE task, these 20 users were asked to judge the sketch images in terms of (a) drawing aesthetics, (b) sketch completeness, (c) line smoothness and (d) noise amount. There were 924 images to be displayed, and each participant observed 500 random images. We randomly show an SSE image, which is scored 1-5 on the basis of the above four metrics. We collected a total of 10,000 human judgments, and the average subjective evaluation results are shown in Table 4, where V_Overall denotes the average of the four scores. Compared with the other proposed baselines, contour guided SS has better perceptual performance.
As for AI forgery detection, the 20 users are invited to discriminate the generated fake sketch and real sketch in Pixiv. We found that human observation could detect only 5% of the forgery drawings, which demonstrates that SSE algorithms need to be improved further.

User study of the SCE task

We implemented a user study of the SCE task based on four criteria: (a) the overall color visual effect, (b) regional obedience, (c) local rendering purity, and (d) the colorization completion degree [58]. There were a total of 1400 generated color images, and each participant randomly observed 500 images. The results are shown in Table 5; our method performs better than the others.

CONCLUSION
We performed DCSC based on disentangling the structure and color enhancement. The SSE task mainly involves three difficulties: (1) randomly distributed lines are difficult to optimize based on model priors, (2) the large amount of blank background makes the semantics of large areas ambiguous, and (3) the structural features of an NPS have serious distortions. To perform this challenging task, we collected plenty of anime characters and drew the corresponding NPSs in SketchMan. Moreover, we explored three different pipelines, that is, SS, grayscale guided SS and contour guided SS, and established an important benchmark for the SSE task. We recommend this challenging and meaningful issue to the research community, hoping to attract more attention. Since the quality of SSE results is still low, SSE and SCE are trained independently in this paper; SSE should be improved in the future. For the SCE task, the first-stage TE-Net is trained to generate the grayscale map of the sketch representing the topology completion result, on the basis of which the second-stage CI-Net achieves diverse color modification based on a specific reference map. In this way, both the topological and color features are enhanced to generate more controllable and detailed color maps. Furthermore, the color diversity and regional obedience of the created sketches need to be improved. The data that support the findings of this study are openly available in SketchMan2020 at [59].