Multiscale fully convolutional network‐based approach for multilingual character segmentation

National Natural Science Foundation of China, Grant/Award Numbers: 61872231, 61701297

Abstract

Character segmentation is a challenging task for optical character recognition systems. Traditional methods usually rely on rule‐based algorithms, but most of them are not applicable in modern intelligent recognition applications that require high accuracy. This is especially the case for text containing Eastern Asian characters with complex pictographic structures, such as Chinese. To alleviate this problem, this study proposes an encoder–decoder‐based multiscale fully convolutional network (MSFCN) model for optical character segmentation. Compared with other methods, MSFCN can not only effectively extract semantic details from images but also exploit the boundary information of the intervals between characters, thereby distinguishing characters from the background at the pixel level. Extensive experiments have been conducted on two benchmark data sets, ICDAR2013 and MLCS. The obtained results show that MSFCN achieves state‐of‐the‐art segmentation performance and indicate its practical application value.


| INTRODUCTION
Computer vision has been extensively studied for many years [1,2]. Nevertheless, there are still challenging unsolved problems, such as recognizing text with a large-scale vocabulary spanning multiple languages, or handwritten text. For optical character recognition applications, character segmentation is a crucial step in the recognition process [3]. The accuracy of segmentation has a significant impact on downstream tasks, such as the precision of single-character recognition. Among the various writing systems, Chinese has more than 3000 frequently used characters, so there is no standard or universal methodology for extracting features for Chinese character segmentation. Besides, a single Chinese character may consist of multiple subcomponents, so its structure can be very complicated. Thus, Chinese character segmentation is much more difficult than that of the Latin alphabet and Arabic numerals. Furthermore, the length of the intervals between two characters varies irregularly across languages. As shown in Figure 1, multilingual text can include three types of character intervals with distinct features, which makes segmenting such text a challenging task.
To date, the mainstream character segmentation methods are morphology-based, clustering-based and recognition-based segmentation [4,5]. Morphology-based methods cannot handle noise well. Recognition-based methods require recognizing different parts of the image multiple times and are therefore time-consuming. Clustering-based methods have inferior robustness to images with uneven greyscale. Recently, deep neural networks, which perform automatic feature learning on raw data, have significantly improved machines' learning ability in various fields [6,7]. Nevertheless, previous deep-learning-based methods for character segmentation tend to overcut characters into their subcomponents [8-10]. Therefore, the accuracy and efficiency of such neural-network-based methods are still far from satisfactory in real-world applications.
In this article, we propose a novel multiscale fully convolutional network (MSFCN) to segment multilingual text containing Chinese and other types of characters. By scaling the original image to multiple sizes, we can feed our model with more detailed feature information from the raw inputs. Besides, by adopting an encoder-decoder structure and combining the scaled features with the intermediate features generated by the model, the accuracy of the model is further improved. Furthermore, we use a connected-component search algorithm to refine the results generated by the deep network and obtain the final segmentation results.
The experimental results show that our model has better performance than others.
The rest of this article is organized as follows: Section 2 introduces related work on character segmentation. In Section 3, we describe the MSFCN method and present our novel improvements. Then we present the experimental environment and results on the multilingual character segmentation (MLCS) dataset in Section 4. In Section 5, we evaluate our model using the ICDAR 2013 handwriting dataset. Finally, we draw conclusions in Section 6.

| RELATED WORK
| Rule-based methods
Research on character segmentation can be traced back to the 1960s. At that time, researchers from the United States focused mainly on printed fonts in newspapers or on bank cheques. In 1996, Casey and Lecolinet published a survey of character segmentation methods [3] and introduced some outstanding research of that time. The article mentions three kinds of character segmentation strategies: the first attempts to partition the input image into subimages to be classified later. The second attempts to avoid premature image segmentation and segments the image either explicitly, by classification of specified windows, or implicitly, by classification of subsets of spatial features collected from the whole image. The third is a hybrid of the first two, using dissection together with recombination rules to define potential segments. After that, rule-based methods such as the connected-domain method [11] and the watershed segmentation method [12] were gradually proposed. The latest rule-based method was proposed by Inkeaw et al. in 2018: a conventional method based on oversegmentation for segmenting Lanna Dhamma characters with a multilevel writing style [5].

| Machine-learning-based methods
However, rule-based methods cannot achieve satisfactory performance in various real-world applications. In the meantime, with the development of machine learning, researchers turned to this new approach. In 1994, Seni et al. gave a relatively complete definition of distance metrics for clustering-based segmentation methods [13]: (1) the bounding box method, (2) the Euclidean method, (3) the minimum run-length method, (4) the average run-length method and (5) the run-length with Euclidean heuristic method. In 2009, Louloudis et al. [14] used methods 1 and 5, together with the convex hull distance method. They divided text lines into two categories: segmentation between two words and segmentation within a single word. After that, Louloudis used a hierarchical clustering method, and their methodology achieved a detection rate (DR) of 90.4% and a recognition accuracy (RA) of 90.6%.
In 2015, Ryu et al. found that some traditional segmentation methods often ignore the connections between segmentation lines [4]. Thus, they turned the problem into a mixed integer quadratic programming problem. By using the cutting-plane method to find the optimal solution, they achieved about 90% accuracy on the experimental data provided by the international conference on document analysis and recognition (ICDAR) in 2009 and 2013. In this way, many parts are systematically segmented, but the contents of each part are not considered. Xu et al. proposed a recognition algorithm based on the hidden Markov model (HMM) [8]. After preprocessing the picture, they use an HMM recognizer to process the image into a sequence of field images. Then a dynamic programming method (such as the Viterbi algorithm) is used to find the optimal recognition result. Although the segmentation points can be selected dynamically, segmentation takes too long while the accuracy is not as good as that of previous methods.

| Deep-learning-based methods
As deep-learning-based methods became popular, they were applied to the task of text segmentation too. Srihari et al. used a three-layer neural network [15] for Arabic character segmentation. They used a connected-domain method for rough segmentation; since a word in Arabic can appear in two parts, connected domains with a closer distance can be combined into a cluster. In 2012, Wang et al. leveraged the representational power of multilayer neural networks [16] and a common framework to train highly accurate text detector and character recognizer modules. They integrated these two modules into a complete end-to-end, lexicon-driven, scene text recognition system that achieves state-of-the-art performance on standard benchmarks. Chen et al. addressed the task of semantic image segmentation with deep learning [17]. They highlighted convolution with upsampled filters (atrous convolution) as a powerful tool in dense prediction tasks and proposed atrous spatial pyramid pooling to segment objects at multiple scales.
Around the same time, the Adam optimizer was proposed [18]. Dai et al. [19] presented region-based fully convolutional networks (FCN) for accurate and efficient object detection. Their region-based detector is an FCN with almost all computations shared over the entire image. They proposed position-sensitive score maps to address the dilemma between translation invariance in image classification and translation variance in object detection. FCN remains an effective segmentation method [20]. Yang et al. developed a deep learning algorithm for contour detection with an FCN [21]. Different from previous low-level edge detection, their algorithm focuses on detecting higher-level object contours. Zhang et al. presented an FCN model [22] to predict the salient map of text regions holistically. Noh et al. presented a novel semantic segmentation algorithm by training a deep deconvolution network on top of the convolutional layers adopted from the VGG 16-layer net [23]. The deconvolution network is composed of deconvolution and unpooling layers; it identifies pixel-wise class labels and predicts segmentation masks. Long et al. built an FCN that takes an input of arbitrary size and produces a correspondingly sized output with efficient inference and learning [24]. They adapted deeper classification networks (AlexNet, the VGG nets and GoogLeNet) into FCNs and transferred their learnt representations by fine-tuning to the segmentation task. In 2018, Yang et al. proposed a useful attention mechanism for convolutional neural network (CNN) models in image classification tasks [25]. In 2019, Li et al. [26] proposed a progressive scale expansion algorithm and combined it with a feature pyramid network (FPN) [27] to obtain a novel framework for shape-robust text detection. In the same year, Baek et al. [9] proposed Character Region Awareness for Text Detection (CRAFT), which focusses on combining characters and the affinity between characters into text lines. They used this method to detect curved text lines in scene images and achieved good performance. Also in 2019, Zhang et al. [10] proposed another text detection method named look more than once (LOMO).
LOMO consists of a direct regressor, an iterative refinement module and a shape expression module, responsible for text proposal localization, text refinement and flexible shape expression, respectively.
However, there are still no examples of applying semantic segmentation to character segmentation tasks. That is because the result of semantic segmentation cannot directly represent the result of character segmentation; further processing of the predicted results is required. Besides, these existing methods use convolution at a single scale in feature extraction and ignore the interval information between characters, which leads to unsatisfactory cutting results. The problem of MLCS thus remains challenging, and neural network models need to be further optimized.

| Problem definition
Most character segmentation tasks use rule-based algorithms such as greyscale histograms to extract spatial information from images. Although these methods can effectively capture morphological features, the semantic information of characters is always ignored, which leads to errors in the segmentation results. Thus, we regard character segmentation as a supervised pixel-wise classification task, which can be defined as follows:

H = ζ(I)

where I denotes the input RGB image with height h and width w, ζ denotes the semantic segmentation model and H denotes the two-channel heat map produced as the final output. Our final goal is to build a semantic segmentation model to predict the heat map H. I and H are defined as follows:

I ∈ R^(h×w×3), H = {p_(h,w)⟨t, f⟩}

where p_(h,w)⟨t, f⟩ denotes the value of the pixel at (h, w). If the pixel belongs to a text region, t = 1 and f = 0, and vice versa. Next, the heat map H is converted into a region map R, where the value of R at each pixel is a natural number denoting which character region that pixel belongs to.
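As a concrete illustration of this formulation, the sketch below builds a toy two-channel heat map H and recovers the text mask by taking the channel-wise argmax; the sizes and the marked region are invented for illustration, not values from this study.

```python
import numpy as np

# Toy illustration: the model zeta maps an RGB image of shape (h, w, 3)
# to a two-channel heat map H of shape (h, w, 2), channel 0 carrying
# the text score t and channel 1 the background score f of each pixel.
h, w = 4, 6
H = np.zeros((h, w, 2))
H[..., 1] = 1.0          # start with every pixel as background (f = 1)
H[1:3, 1:3, 0] = 1.0     # mark a small 2 x 2 text region (t = 1)
H[1:3, 1:3, 1] = 0.0

# A pixel is classified as text when t > f, i.e. by a channel argmax.
text_mask = np.argmax(H, axis=-1) == 0
print(text_mask.sum())   # -> 4 text pixels
```

The connected-component search described later is what turns this binary mask into the per-character region map R.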

| MSFCN
Typical CNN models consist of convolutional layers, pooling layers, activation layers, dropout layers and fully connected layers. Convolutional layers convolve input feature maps or input images with a set of linear filters. The procedure can be denoted as follows:

r_k = F_k * M_k + b_k

where r_k denotes the result of the convolution, F_k denotes the k-th filter, M_k denotes the k-th feature map and b_k denotes the bias of the k-th filter. The activation function applies a nonlinear transformation to its input so that the network can learn more complex patterns. As a typical representative, rectified linear units (ReLU) are widely used in neural networks such as CNNs [21,25,28,29]; ReLU is defined as follows:

ReLU(x) = max(0, x)

Dropout layers randomly ignore some units in the model in order to avoid overfitting and reduce training time. Fully connected layers calculate their outputs as follows:

y_k = Σ_i W_ki x_i + b_k

where y_k denotes the k-th output neuron, x_i denotes the i-th input neuron, W_ki denotes the weight connecting x_i with y_k and b_k denotes the bias. Usually, a CNN model has several fully connected layers at the end to output labels, such as one-hot codes in classification tasks. In the CNN model, spatial information is lost by the flattening operation, so it is not suitable for segmentation tasks. Thus, the FCN architecture was proposed, eliminating fully connected layers and adding convolutional layers with 1 × 1 filters as an alternative. After feature extraction, deconvolutional layers are used to make the output have the same spatial dimensions as the input. The inputs of traditional FCN models are images that can be regarded as matrices, and the outputs keep the same shape; to ensure this, the FCN needs to be properly designed. Usually, the segmentation maps have a size of k × w × h for k different object classes, where w and h are the width and height of the input image, respectively.
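To illustrate why a 1 × 1 convolution can replace a fully connected layer without flattening, here is a minimal numpy sketch (shapes chosen arbitrarily): a 1 × 1 filter bank is exactly a per-pixel fully connected layer, so the spatial grid survives.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, c_in, c_out = 8, 8, 16, 2      # hypothetical feature-map shape
X = rng.normal(size=(h, w, c_in))    # input feature maps
W = rng.normal(size=(c_in, c_out))   # a bank of 1 x 1 filters
b = rng.normal(size=(c_out,))

# A 1 x 1 convolution applies the same linear map at every spatial
# position, so no flattening is needed and the h x w grid survives.
Y = np.einsum('hwc,cd->hwd', X, W) + b
print(Y.shape)                       # (8, 8, 2)

# Per-pixel it is exactly a fully connected layer:
assert np.allclose(Y[3, 5], X[3, 5] @ W + b)
```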
A typical FCN uses cross-entropy as the loss function, shown as follows:

C = -(1/n) Σ_i [a_i ln y_i + (1 − a_i) ln(1 − y_i)]

where n denotes the number of output neurons, y denotes the model output and a denotes the label. The FCN architecture can be used for semantic segmentation: by using whole images as input and heat maps as labels, FCN models can be trained end-to-end and produce full-pixel segmentation results. As mentioned above, FCN solves the problems of fixed input image size and restoration of the feature map size from which CNN suffers. However, FCN can ignore detailed features as its receptive field is increased through the pooling layers. Such a defect results in the loss of a lot of edge information in the final output of the model, which degrades its performance. U-net fuses the features in the encoder with the corresponding features of the decoder to reduce the impact of the down-sampling operation. However, some image details have already been lost in the deep layers of the encoder. Therefore, we propose a novel network architecture, MSFCN. MSFCN can be divided into two parts: an encoder and a decoder. The former is mostly composed of convolutional layers for feature extraction; the latter is composed of deconvolutional layers and up-sampling layers for feature restoration. Figure 2 shows our model structure. It can be seen that our model is an image-to-image conversion model. In the convolutional and deconvolutional layers, ReLU is selected as the activation function; compared with sigmoid-like activation functions, ReLU is a nonlinear function that is easy to compute. Figure 3 shows a simplified structure of MSFCN. The feed-forward process can be divided into three branches: the main branch (marked as blue), the intermediate feature fusion branch (marked as green) and the multiscale input feature extraction branch.
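A minimal numpy version of this pixel-wise binary cross-entropy, following the text's naming (y for outputs, a for labels), might look as follows; the sample vectors are illustrative only.

```python
import numpy as np

def cross_entropy(y, a, eps=1e-12):
    """Pixel-wise binary cross-entropy, following the text's naming:
    y is the model output in (0, 1) and a is the {0, 1} label."""
    y = np.clip(y, eps, 1.0 - eps)   # guard against log(0)
    return -np.mean(a * np.log(y) + (1 - a) * np.log(1 - y))

a = np.array([1.0, 0.0, 1.0, 0.0])              # labels
confident = np.array([0.99, 0.01, 0.99, 0.01])  # confident, correct
uncertain = np.array([0.6, 0.4, 0.6, 0.4])      # hesitant output

# The loss rewards confident, correct predictions.
assert cross_entropy(confident, a) < cross_entropy(uncertain, a)
```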
The main branch is an encoder-decoder structure composed of a stack of sampling units and up-sampling units. All the sampling units contain a pooling layer and several convolutional layers. Correspondingly, the up-sampling units contain an unpooling layer and several transpose convolutional layers. In this branch, the input features for each unit come from the previous unit, so using this branch alone leads to a large loss of character edge information from the input image. The intermediate feature fusion branch, which combines the features from the encoder with the input features of the decoder, is added to compensate for this feature loss. To make our model more sensitive to the features of intervals, an input feature fusion branch is used to concatenate features directly from the input image. The scaled input is concatenated to feature maps from both the convolutional and deconvolutional layers. Thus, all sampling and up-sampling units receive the low-level edge features of the intervals between characters.

F I G U R E 2  Overall architecture of the proposed model. Each type of layer is represented by a different colour according to the function it performs; the legend is shown at the lower right. M, A, S and C are four pooling operations used to scale the input image and are introduced in Section 3.3
Our original inputs are images of size 256 × 256 × 3, because an image of this size can be down-scaled three times while keeping integer dimensions, and the scaled images retain a reasonable resolution. We then use an algorithm to produce the scaled feature maps, denoted by Scale n. Table 1 shows the sizes of these feature maps.
To fuse the semantic features from different levels, we combine the feature maps from the encoder, the decoder and the scaled input in a concatenate layer Concat(·), which can be represented as:

Concat(F_enc, F_dec, Input) = F_enc ⊕ F_dec ⊕ P(Input)

where ⊕ denotes the channel-wise concatenate operation and P(Input) refers to a max pooling operation. The kernel size of the pooling layer is not always the same; it is given in Table 2.
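The fusion step can be sketched in numpy as follows. The feature shapes here are hypothetical stand-ins for one fusion point (the actual shapes are given in Table 2), and P(·) is implemented as non-overlapping max pooling.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical shapes for one fusion point: encoder and decoder
# features at 64 x 64 resolution, plus the 256 x 256 input pooled
# down to the same spatial size.
f_enc = rng.normal(size=(64, 64, 128))
f_dec = rng.normal(size=(64, 64, 128))
inp = rng.normal(size=(256, 256, 3))

def max_pool(x, k):
    """k x k non-overlapping max pooling over the spatial axes."""
    h, w, c = x.shape
    return x.reshape(h // k, k, w // k, k, c).max(axis=(1, 3))

# Concat(.) = f_enc (+) f_dec (+) P(input), concatenated channel-wise.
fused = np.concatenate([f_enc, f_dec, max_pool(inp, 4)], axis=-1)
print(fused.shape)   # (64, 64, 259): 128 + 128 + 3 channels
```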
There are two concatenate layers; Table 2 shows the shapes of their inputs and outputs.
After these two concatenate layers, we add two convolutional layers and a max pooling layer to turn the feature maps into size 128 × 128 × 128; after that, two more convolutional layers and a max pooling layer are added, and the feature maps finally become size 64 × 64 × 256. Table 3 shows this process in detail. Table 4 shows the detailed structure of the decoder module. Finally, the output has the same size as the label, namely original width × original height × number of classes in our model. We calculate the loss and backpropagate it through the whole model.
The FCN model we used is not the original one [24]. It is closer to SegNet, based on the encoder-decoder structure proposed by Badrinarayanan [28], or to the U-net proposed by Ronneberger [29]. These FCN models keep two-dimensional feature maps through a deconvolutional process. In MSFCN, based on the FCN model, we concatenate detailed texture features with semantic features in both the encoder and decoder parts to fuse more features of the intervals. In this way, the irregular intervals between multilingual characters can be captured by multiple receptive fields, making MSFCN more sensitive to intercharacter intervals such as the distance between two characters.

| Multiscale feature extraction
The essential problem of character segmentation is that it is very difficult to define an appropriate length threshold for the intervals between two neighbouring characters in the same row. Moreover, compared with characters of Western languages such as English, Chinese characters have unique structures in which an interval can exist inside one character. To train a model that is sensitive to the boundary of one whole character, we need to let the model learn more about the intervals between two characters rather than the intervals inside a character. To find the pattern of intercharacter intervals, we collected a large number of text images and scaled them. MSFCN extracts the interval features by scaling down the images to different sizes. By feeding these feature maps into the network, MSFCN absorbs more intercharacter interval information. Figure 4 shows images scaled down to three different sizes: from left to right, 256 × 256 × 3, 128 × 128 × 3 and 64 × 64 × 3. The first and second numbers denote the width and height of the images, and the last number denotes the number of channels. As shown in Figure 4, some intervals inside a character disappear in the down-scaled images, whereas the intervals between two characters are retained. This means that using input images of different scales enables our model to extract more useful feature information. So we propose a novel scaling method called adaptive multiscale representation (ASMR), implemented by combining max pooling M(x), average pooling A(x), stack-pooling S(x) and two-stride convolution C(x), where x denotes the input feature maps. In this module we add two novel operations to scale the original input image: stack-pooling pools the original image without losing any pixel information, and two-stride convolution pools information from the input image adaptively with its learnable kernels. With the ASMR module, the down-sampling module is more adaptable to various input images.
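A hedged numpy sketch of our reading of the two novel operations follows: stack-pooling as a space-to-depth rearrangement (Figure 5a) and two-stride convolution as a stride-2 learnable pooling (Figure 5b). A single output filter is used for brevity, and the helper names are our own.

```python
import numpy as np

def stack_pooling(x):
    """Space-to-depth rearrangement: the spatial size is halved and
    the channel count quadruples, so no pixel value is dropped."""
    h, w, c = x.shape
    x = x.reshape(h // 2, 2, w // 2, 2, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h // 2, w // 2, 4 * c)

def two_stride_conv(x, kernel):
    """A 2 x 2 convolution with stride 2 (single output channel for
    brevity): the output size matches max/average pooling, but every
    input pixel contributes through the learnable kernel."""
    h, w, c = x.shape
    patches = x.reshape(h // 2, 2, w // 2, 2, c).transpose(0, 2, 1, 3, 4)
    return np.einsum('HWijc,ijc->HW', patches, kernel)

x = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)
assert stack_pooling(x).shape == (2, 2, 12)              # channels x4
assert stack_pooling(x).sum() == x.sum()                 # lossless
assert two_stride_conv(x, np.ones((2, 2, 3))).shape == (2, 2)
```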
Figure 5 shows the details of two sampling modules.

| Searching connected component
The pixel-wise classification result indicates whether each pixel belongs to text or to the background. Adjacent text pixels are grouped to form a heat map of characters. However, such a result cannot directly represent the final character segmentation result: to obtain the specific spatial information of each character, the pixels need to be grouped into components. Image regions consisting of foreground pixels with the same pixel value and adjacent positions are called connected components. We can obtain all character candidates by searching all connected components in an image. The details of the two-pass algorithm are given in Table 5. After all connected parts are labelled as groups, we can colour each group to observe the experimental results. The result of the connected-component search algorithm is shown in Figure 6. From Figure 6, we notice that most characters are segmented well except for the last two. The reason is that the pixel-wise classification results from the MSFCN model are not accurate enough. However, these errors can be compensated for by improving the connected-domain search algorithm. To solve this problem, we propose a novel method named the boundary expansion algorithm.

F I G U R E 5  The effect of stack-pooling and two-stride convolution: (a) in the stack-pooling module, the size of the output is halved while the number of channels quadruples; (b) in the two-stride convolution module, the output size is the same as with max-pooling or average-pooling, but the pooling result takes every pixel of the input image into account
The boundary expansion algorithm contains three parts: image erosion, connected-component search and boundary expansion. To demonstrate the whole procedure, we use a simple example in Figure 7. In this example, the input is a pixel-wise segmentation result with eight character regions, and our goal is to find all character regions without being affected by adhesive pixels. First, we use a 5 × 5 kernel for image erosion, so that the small adhesive part between the last two characters is separated. Then we use the connected-component search algorithm to define the character regions; the regions labelled with different colours in Figure 7 denote the connected components of different characters. After that, we use the boundary expansion algorithm to expand the region of each connected component and restore it to its former shape. Finally, the output image can be regarded as the result of the character segmentation task.
The reason why we chose the 5 × 5 kernel is that, in the semantic segmentation result, the only part we are interested in is the character region, while the connections between character regions are noise that seriously affects the segmentation results. Therefore, we chose the 5 × 5 kernel instead of the 3 × 3 kernel to remove as much noise as possible.
The main idea of this algorithm is to collect pixels adjacent to the original segmentation results and expand the boundary of the current region. The algorithm is based on breadth-first search, so no region can expand too quickly. We describe the details of the boundary expansion algorithm in Table 6. In this description, M is an intermediate variable used to store the labelled result in each iteration, and the function adjacent() returns the four pixels adjacent to the target pixel.
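A simplified, self-contained sketch of this section's pipeline (5 × 5 erosion, two-pass connected-component labelling and BFS boundary expansion) might look as follows. The toy mask, with two blobs joined by a thin adhesive bridge, mimics the situation of Figure 7; all function names are our own.

```python
import numpy as np
from collections import deque

def erode(mask, k=5):
    """Binary erosion with a k x k all-ones kernel (k odd)."""
    p = k // 2
    padded = np.pad(mask, p, constant_values=0)
    out = np.ones_like(mask)
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            out &= padded[p + dy:p + dy + mask.shape[0],
                          p + dx:p + dx + mask.shape[1]]
    return out

def two_pass_label(mask):
    """Classic two-pass connected-component labelling (4-adjacency)."""
    labels = np.zeros(mask.shape, dtype=int)
    parent = {}                       # union-find over provisional labels
    def find(a):
        while parent[a] != a:
            a = parent[a]
        return a
    nxt = 1
    h, w = mask.shape
    for y in range(h):                # first pass: provisional labels
        for x in range(w):
            if not mask[y, x]:
                continue
            up = labels[y - 1, x] if y else 0
            left = labels[y, x - 1] if x else 0
            if up == 0 and left == 0:
                parent[nxt] = nxt
                labels[y, x] = nxt
                nxt += 1
            else:
                cand = [l for l in (up, left) if l]
                labels[y, x] = min(cand)
                if len(cand) == 2:    # record label equivalence
                    a, b = find(up), find(left)
                    parent[max(a, b)] = min(a, b)
    for y in range(h):                # second pass: resolve equivalences
        for x in range(w):
            if labels[y, x]:
                labels[y, x] = find(labels[y, x])
    return labels

def boundary_expand(labels, mask):
    """Grow labelled regions back over the eroded foreground pixels by
    breadth-first search, so all regions expand at the same pace."""
    q = deque(zip(*np.nonzero(labels)))
    h, w = labels.shape
    while q:
        y, x = q.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] \
                    and labels[ny, nx] == 0:
                labels[ny, nx] = labels[y, x]
                q.append((ny, nx))
    return labels

mask = np.zeros((20, 40), dtype=bool)
mask[4:16, 4:16] = True       # first character blob
mask[4:16, 24:36] = True      # second character blob
mask[9, 16:24] = True         # thin adhesive bridge between them

core = erode(mask, 5)                 # bridge disappears, cores remain
full = boundary_expand(two_pass_label(core), mask)
n_regions = len(np.unique(full[full > 0]))
print(n_regions)   # -> 2 separated characters
```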

| MLCS dataset
Due to the lack of a publicly available evaluation data set for the MLCS task, we create an image dataset, the MLCS dataset, generated by a Python script. Figure 8(a) shows the training data of the MLCS dataset and Figure 8(b) shows the labels mapping to the training data. The training images contain English letters, Chinese characters and alphanumeric characters. Each image has four lines of characters comprising random numbers of the three character types, with a random interval between characters. We also add random noise to the images to increase the practicability of the trained model. The labels are rectangles, each covering one character in the training data. Though we show our labels in red, in the training process we convert the labels into matrices of size 256 × 256 × 2; the reasons for choosing this size are described in Section 3.2. The first and second numbers denote the width and height of the labels, and the last number denotes the number of categories in the segmentation task: the character class and the background class. We create 3000 images, half for training and half for testing.
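The conversion from rectangle labels to the 256 × 256 × 2 training matrices can be sketched as below; the box coordinates are invented for illustration (the real MLCS boxes come from the generation script).

```python
import numpy as np

# Invented rectangle coordinates (y0, x0, y1, x1), one per character.
boxes = [(10, 10, 30, 40), (10, 50, 30, 80)]

label = np.zeros((256, 256, 2), dtype=np.float32)
label[..., 1] = 1.0                  # everything starts as background
for y0, x0, y1, x1 in boxes:
    label[y0:y1, x0:x1, 0] = 1.0     # character class
    label[y0:y1, x0:x1, 1] = 0.0

# Each pixel belongs to exactly one of the two classes.
assert np.all(label.sum(axis=-1) == 1.0)
print(int(label[..., 0].sum()))      # -> 1200 character pixels
```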

| Results and analysis
The loss function we used in our model is cross-entropy. Since each pixel of an image has a class label, we can compute the loss between the outputs and the labels, obtaining a value between 0 and 1, and then update all parameters based on the loss value.

Usually, the optimizer of CNN-like models is stochastic gradient descent (SGD). Recently, Adam [18] has proved to be a well-performing optimizer, and its convergence speed is better than that of SGD. In our experiments, we utilize Adam and set the learning rate to 0.001.
An HP Z840 workstation is used to train the model. The workstation is equipped with 64 GB of memory and two Nvidia Tesla K40c graphics cards with 12 GB of video memory each. The number of training iterations is set to 300, and after 8 h we obtain the model on MLCS. Figure 9 shows the loss curve during the training procedure.
We use the area under the receiver operating characteristic curve (AUC) as a quantitative standard to compare the experimental results. The receiver operating characteristic (ROC) is a widely used indicator for evaluating binary tasks; however, the ROC curve itself cannot serve as a single quantitative standard, so we calculate the AUC instead. For the comparison experiments, we choose four segmentation methods, FCNs, GMM, HOG + SVM [24,30,31] and 2D-DWT + GMM (the state-of-the-art in character segmentation tasks), alongside our model, MSFCN. Figure 10 and Table 7 show the AUC results of the different methods.
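AUC can be computed without plotting the ROC curve at all, via the rank-sum (Mann-Whitney) formulation: it equals the probability that a randomly chosen positive pixel scores higher than a randomly chosen negative one. The following numpy sketch illustrates the idea (it is not the paper's evaluation code).

```python
import numpy as np

def auc(scores, labels):
    """ROC AUC via the rank-sum statistic: the probability that a
    random positive sample outscores a random negative sample."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):      # average the ranks of ties
        tie = scores == s
        ranks[tie] = ranks[tie].mean()
    n_pos, n_neg = labels.sum(), (~labels).sum()
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# A perfect ranking gives AUC = 1; mixing positives and negatives
# in the ranking lowers it.
assert auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]) == 1.0
assert auc([0.9, 0.1, 0.8, 0.3], [1, 0, 0, 1]) == 0.75
```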
It can be seen from Figure 10 that the proposed method performs better than FCNs such as U-net and SegNet. This indicates that keeping as much of the character spacing and character edges of the input image as possible helps improve the performance of the network model. We can also see that the MSFCN model achieves the best results among all compared methods on the test data.
Besides, to judge the complexity of our model, we used smaller FCNs for a fair comparison: we force all FCNs to perform down-sampling only twice and then count the trainable parameters of each model. From the results in Table 8, we can see that the trainable parameters of our method and U-net are higher than those of the other FCNs, and our method has slightly more parameters than U-net in the decoder. This is because the accuracy of the character segmentation results depends on the performance of the decoder, so a slight increase in trainable parameters is inevitable. Figure 11 shows some sample segmentation results produced by MSFCN. To gain insight into what features are extracted by the convolutional filters in our model, we randomly select four activation feature maps after each convolutional layer for observation. As shown in Figure 11, MSFCN is not only capable of locating multilingual character instances with arbitrary shapes but is also sensitive to text with different character intervals. This shows that MSFCN is robust for MLCS. Figure 12 shows these feature maps.

| Performance evaluation methodology
F I G U R E 9  The loss curve of the training procedure

F I G U R E 1 0  The comparison between these models on the test data

In previous studies [32,33], the metrics used to evaluate the performance of segmentation tasks are based on visual criteria. However, simply observing these results is time-consuming and cannot be applied in all cases. To avoid user intervention, we use an automatic performance evaluation technique that compares the detected segmentation result with preannotated ground truth (GT) [32]. The metric is based on the count of matched pixels between the detected areas and those in the label. To compare the performance of detection results by quantitative criteria, we use a table called MatchScore [33] to calculate the intersection-over-union (IoU) of the pixel sets of the result and the GT. The definition of MatchScore can be represented as:

MatchScore(i, j) = T(G_j ∩ R_i ∩ I) / T((G_j ∪ R_i) ∩ I)

where I denotes the set of pixels of the image, G_j denotes the pixels inside the j-th GT region, R_i denotes the pixels inside the i-th result region and T(S) is a function that counts the pixels in region S. The entry MatchScore(i, j) represents the matching result of the j-th GT region and the i-th result region.
The evaluation metric builds the MatchScore table and searches for pairs of one-to-one, one-to-many or many-to-one matches. A one-to-one match (o2o) is a pair with a MatchScore higher than a threshold th. A g_one-to-many match (g_o2m) denotes one GT word matching more than one detected word, and a g_many-to-one match (g_m2o) denotes more than one GT word matching one detected word. In the same way, a d_one-to-many match (d_o2m) denotes one detected word matching more than one GT word, and a d_many-to-one match (d_m2o) denotes more than one detected word matching one GT word. Let N be the number of segments in the GT, M the number of segments in the result and ω_1, ω_2, ω_3, ω_4, ω_5, ω_6 predetermined weights; the DR and RA are then defined as follows:

DR = (ω_1 · o2o + ω_2 · g_o2m + ω_3 · g_m2o) / N

RA = (ω_4 · o2o + ω_5 · d_o2m + ω_6 · d_m2o) / M

Now, the F-Measure (FM) can be defined by combining DR and RA:

FM = 2 · DR · RA / (DR + RA)
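With ω_1 = ω_4 = 1 and the remaining weights set to 0 (the setting used later for the contest comparison), the metrics reduce to counting one-to-one matches. A simplified numpy sketch over toy label maps follows; for brevity it ignores the intersection with the image pixel set I, and the names and toy data are our own.

```python
import numpy as np

def match_score(gt, res, i, j):
    """MatchScore(i, j): IoU of the pixel sets of the j-th GT region
    and the i-th result region (gt, res are integer label maps)."""
    g, r = gt == j, res == i
    return np.logical_and(g, r).sum() / np.logical_or(g, r).sum()

def evaluate(gt, res, th=0.95):
    """DR, RA and FM with w1 = w4 = 1 and all other weights 0, i.e.
    only one-to-one matches above the threshold th count."""
    gt_ids = [v for v in np.unique(gt) if v]    # N ground-truth segments
    res_ids = [v for v in np.unique(res) if v]  # M result segments
    o2o = sum(1 for j in gt_ids for i in res_ids
              if match_score(gt, res, i, j) > th)
    dr = o2o / len(gt_ids)
    ra = o2o / len(res_ids)
    fm = 2 * dr * ra / (dr + ra) if dr + ra else 0.0
    return dr, ra, fm

gt = np.zeros((8, 8), dtype=int)
gt[1:4, 1:4], gt[5:8, 5:8] = 1, 2               # two GT characters
res = gt.copy()
res[5:8, 5:8] = 0                               # detector misses one
dr, ra, fm = evaluate(gt, res)
print(dr, ra, round(fm, 3))   # -> 0.5 1.0 0.667
```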

| ICDAR2013 handwriting dataset
To test the practicability of the MSFCN model, we also evaluate MSFCN on a set of handwritten images from the ICDAR2013 handwriting segmentation contest [33]. The dataset includes 150 images written in Greek and English as well as 50 images written in Indian Bangla. Following the study, we compare the effect of several segmentation algorithms (CUBS, GOLESTAN etc.) on this data set. The methods mentioned achieved the best results in the ICDAR2013 handwriting character segmentation contest [33]; there are other methods with similar performance, so we do not list them all.
We evaluated the performance of all participating algorithms for word segmentation using Equations (8)-(11). The acceptance threshold used was th = 0.95 for word segmentation. For the convenience of comparison, we set ω_1 = ω_4 = 1 and ω_2 = ω_3 = ω_5 = ω_6 = 0. In addition, to satisfy the input requirements of our model, we preprocessed the dataset images to generate training pictures of the three sizes denoted by Scale n; Table 1 shows the sizes of these feature maps. We then generate the corresponding label images of size 256 × 256 × 3, in which each word is marked with a different colour. For the word segmentation task, we create 2000 images in total, half for training and half for testing. Figure 13 shows training data (ICDAR2013) and the labels mapping to the training data. Figure 14 shows some qualitative segmentation results by MSFCN.

| RESULTS AND ANALYSIS
We used the same workstation as described in Section 4. The number of iterations is set to 700 epochs, and the final model is obtained after 40 h of training. Figure 15 shows the loss curve for the training process of MSFCN.
First, we use AUC to evaluate the performance as we do in Section 4.2. Table 9 and Figure 16 show the AUC results in ICDAR2013.
The AUC results show that MSFCN also performs well on ICDAR2013 and has robust multilingual adaptability. To further analyse the performance of the model, we apply the evaluation methods described in Section 5.1.
The evaluation results obtained from MSFCN and from all algorithms submitted to the contest are presented in Table 10; Figures 17 and 18 are visual representations of Table 10. From Figure 18 we can see that MSFCN is superior to the other character segmentation methods in both accuracy and recall, and the FM is pushed to 91.02, showing that our method achieves state-of-the-art performance.

| CONCLUSION
MLCS is challenging, especially when characters of Eastern Asian languages such as Chinese are involved. We proposed MSFCN and a novel connected-component search algorithm to solve this problem. Thanks to multiscale feature extraction, intercharacter interval information is collected by multiple scaling-down modules, which makes our model more robust for multilingual characters. Using the connected-component search algorithm, we obtain a labelled result from the eroded semantic segmentation output of MSFCN; the boundary expansion algorithm then restores the labelled result to its normal size and produces the final segmentation result. To demonstrate the practicability of MSFCN, we built a new dataset, MLCS. Experimental results obtained on two datasets show that MSFCN achieves state-of-the-art segmentation performance for multilingual characters with arbitrary shapes and is sensitive to text with different character intervals.
In the future, we will focus on obtaining more abstract features that can describe the intervals between two characters, and on improving our model to accelerate the prediction procedure.