ILBPSDNet: An improved local binary pattern shallow-deep convolutional neural network for character recognition

This paper proposes an architecture based on an improved local binary pattern (LBP) shallow-deep convolutional neural network, which integrates hand-crafted feature pre-processing with the supervised high-level feature learning of a CNN in order to enhance recognition performance. This study introduces scale-space information into the LBP to reduce its sensitivity to noise, and applies two feature maps: the maximum selection feature map (MLBP) and the first selection feature map (FLBP). The former selects the edge with the strongest intensity to reduce the influence of noise points, while the latter measures local binary features through the scale detection of an effective edge. In the network architecture design, branches of different depths are used for learning according to the differences of the input features, and the features learned by the two branches are combined for classification. The experimental results show that the proposed ILBPSDNet achieves good recognition ability on many character data sets while reducing network parameters and computation, making it well suited to real-time character recognition. Finally, compared with other recent networks, its performance is maintained at a competitive level.


This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2021 The Authors. IET Image Processing published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.

INTRODUCTION
With the progress of science and technology, portable cameras, smart phones, driving recorders, and security cameras are widely used in daily life. A large amount of image information can be obtained through such devices, including various objects, natural scenery, and vast amounts of textual information. Extracting textual information and understanding its semantics can greatly help real-life and business applications, such as content-based image retrieval, automatic file generation, human-machine interaction, driverless car navigation, industrial automation, robot navigation [1], and other applications [2]. Hence, effective automatic detection and recognition of character information is a very important subject. Optical character recognition (OCR) is a computer-vision technology for recognizing text, including natural scene text, and has achieved great success in many fields. However, traditional OCR technology achieves high accuracy only in well-controlled environments (e.g. lighting conditions, character directions, and background complexity) [3]. In real applications, character recognition may suffer from complex backgrounds, uneven lighting, low contrast, blurry text, varying text directions, font colors, writing styles, and mixed languages. Hence, character recognition in natural scenes has become a major research focus in this field [4][5][6].
In recent years, with the widespread attention to the bag-of-words (BOW) model [7], many methods have been proposed to segment a character into small image "words", such as the end of a stroke, a curved stroke, or a cross stroke, and these small words can successfully increase the recognition rate. Fundamentally, robust feature extraction algorithms, such as the local binary pattern (LBP) [8], BRIEF [9], SIFT [10], and SURF [11], are required to solve this problem, as these algorithms have different advantages in different scenarios. In a typical pipeline [12], character images are transformed into gray-scale and binary images and the characters are segmented; the features of all segmented images are then extracted by histogram of oriented gradients, and a support vector machine classifier performs the classification to complete character recognition. Handwritten characters can also be described by LBP features and recognized by KNN [13]. A hybrid method using fuzzy logic and naïve Bayes to learn strokes [14] has been proposed for character recognition: characters are segmented by support vector machines and a three-feature fuzzy segmentation strategy based on particle swarm optimization, and character classes are predicted by Markov chain Monte Carlo sampling. This hybrid method [14] learns character features and shows good results in experiments. A robust texture descriptor can solve many problems occurring in OCR of natural scenes, such as glare, blurring, low contrast, and reflective surfaces.
The original LBP algorithm has many advantages in texture recognition, such as brightness invariance, rotation invariance, easy implementation, and low computational cost; however, due to hardware noise, edge blurring, and other interference factors in the character images of real scenes, the experimental results of [15] show that LBP performs worse than HOG for OCR in real scenes. Hence, an improved LBP texture descriptor was proposed [15], namely an LBP augmented with concepts from HOG, which was experimentally proven to be effective. When LBP features are fed into a convolutional network, the convolution and activation layers may leave the original features non-activated, which prevents the features from being transferred to deeper layers. Hence, this paper adopts the feature extraction method proposed in [15], which retains the rotation invariance and brightness tolerance of the original local binary features and adds the concept of scale space, so that the local binary features have a certain tolerance to noise and scale. As applying improved LBP (ILBP) features directly to a convolutional network causes feature loss, this paper proposes a shallow-deep convolutional network architecture to solve the training problem, simplifies the network architecture to reduce the total network parameters and computation cost, and balances accuracy against overall performance. In recent years, more and more deep learning technologies have been introduced to recognize text in natural scenes, such as [16] and [17].
The main contribution of this paper is an architecture based on an improved LBP shallow-deep convolutional neural network (ILBPSDNet), which integrates ILBP feature pre-processing with a shallow-deep CNN: according to the differences of the input features, branches of different depths are used for learning, and the features learned by the two branches are combined for classification. The experimental results show that the proposed ILBPSDNet achieves good recognition ability on many character data sets with a simplified architecture and reduced network parameters and computation. Therefore, it is well suited to real-time character recognition. The remainder of this paper is organized as follows. Section 2 introduces related works. Section 3 describes the main methods and the network architecture pruning methods. Section 4 introduces the data sets and experimental results that demonstrate the performance of the proposed model. The last section offers the conclusion.

Character recognition and CNN
Character recognition technology was first proposed by Tauschek in 1929 to analyze text, images, and files, and to obtain text information. Template comparison [18], first published by Casey and Nagy of IBM for character recognition, was used in most early technologies. In template comparison, text images are usually normalized, and then correlation coefficients [19], Euclidean distance [20], and other measurements are calculated to compare the similarities between text images and templates. Later, researchers began to study features of text images. Common image features include histograms of oriented gradients [21], local binary features [8], Haar features [22], and the scale-invariant feature transform [23], which are all widely used in character recognition. In recent years, due to the popularity of machine learning, most image features have been combined with machine learning algorithms, such as the support vector machine (SVM) [24], AdaBoost [25], and K-nearest neighbors (KNN) [26]. The earliest prototype of the CNN was the Neocognitron [27], which proposed the sliding convolution kernel commonly used in today's convolutional neural networks; rectified linear units [28] were later adopted to make networks non-linear. LeCun et al. [29] integrated backpropagation into the convolutional neural network and applied it to handwriting recognition. LeCun et al. [30] proposed the LeNet network model, which has almost the same architecture as contemporary convolutional neural networks and embodies the design concepts of local receptive fields, weight sharing, down-sampling, and invariance to displacement, scaling, and deformation.

Local binary pattern
The LBP is a descriptor of local texture features. The original LBP used a 3×3 grid, where the gray-scale value of the centre point was compared with that of the adjacent 8 pixels.
If the pixel value of a neighborhood point was greater than or equal to that of the centre point, the point was marked as 1, otherwise as 0. A string of binary digits is obtained through this process, namely, the LBP value of the pixel. However, the basic LBP can only cover a fixed-size region. In order to adapt to different sizes, Ojala et al. [31] replaced the square region with a circular region and extended the 3×3 area to an arbitrary range. According to the definition of the LBP, the same set of binary digits yields different local binary features under different starting points; thus, local binary features with rotation invariance were proposed: a series of codes is obtained by rotating the circular region, and the minimum is taken as the LBP value of the pixel. The uniform-pattern variant of the LBP is mainly used to handle the rapid increase of distinct codes when the neighborhood is extended. Ojala argued that most local binary feature codes contain at most two transitions (from 0 to 1 or from 1 to 0); thus, codes with no more than two transitions were each kept as their own category, while the remaining codes were classified into a single further category; codes with four or more transitions, also known as mixed modes, cannot effectively express texture features.
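The basic operator described above can be sketched as follows; this is an illustrative implementation (the neighbour ordering and the rule for ties at equal intensity are conventions that vary between implementations), not the exact code of [31]:

```python
import numpy as np

def lbp_3x3(img, r, c):
    """Basic 3x3 LBP code for the pixel at (r, c).

    The 8 neighbours are visited clockwise from the top-left; each
    neighbour >= centre contributes a 1-bit, otherwise 0, and the bits
    are packed into one byte (the pixel's LBP value)."""
    centre = img[r, c]
    # Clockwise neighbour offsets starting at the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr, c + dc] >= centre:
            code |= 1 << bit
    return code

def transitions(code):
    """Number of 0->1 / 1->0 transitions in the circular 8-bit code;
    uniform patterns have at most two transitions."""
    bits = [(code >> i) & 1 for i in range(8)]
    return sum(bits[i] != bits[(i + 1) % 8] for i in range(8))
```

For a flat region all neighbours equal the centre, so every bit is set (code 255) and the circular code has zero transitions, i.e. it is a uniform pattern.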

Improved local binary pattern
ILBP [15], our previous work, is an improved LBP texture descriptor, namely an LBP modified with concepts from HOG. There are four main modifications in the concrete implementation: creating an LBP by averaging pixel intensities via integral images, reducing the bins in the histogram, creating the LBP in a scale space, and using the block-and-cell concept of HOG. In the bin-reduction step, the bins of the LBP histogram are reduced and a magnitude map similar to that of HOG is added; hence, each pixel can represent complete integral magnitude and direction information, which is called the edge type. The second step is to create a bin-reduced LBP in a scale space, and the third step is to determine the meaningful scale in the scale space. ILBP uses the same concept as SIFT, and uses average intensities rather than comparing single pixel intensities in the LBP to achieve scale invariance.
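As a sketch of the average-intensity idea (an illustration of the principle, not the authors' exact ILBP code), block means can be computed in constant time with an integral image and then substituted for the single-pixel comparisons of the basic LBP:

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row/column prepended, so the sum
    over img[r0:r1, c0:c1] is ii[r1,c1]-ii[r0,c1]-ii[r1,c0]+ii[r0,c0]."""
    return np.pad(img, ((1, 0), (1, 0))).cumsum(0).cumsum(1)

def block_mean(ii, r0, c0, size):
    """Mean intensity of the size x size block with top-left (r0, c0)."""
    r1, c1 = r0 + size, c0 + size
    return (ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]) / (size * size)

def avg_lbp(img, r, c, size):
    """LBP code built from size x size block means around the centre
    block, approximating one scale of the scale space."""
    ii = integral_image(img)
    centre = block_mean(ii, r, c, size)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if block_mean(ii, r + dr * size, c + dc * size, size) >= centre:
            code |= 1 << bit
    return code
```

Larger `size` values correspond to coarser scales, which is how averaging replaces explicit Gaussian smoothing for scale tolerance.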

Image features combined with the convolutional network architecture
HaarNet [32] combined Haar features with a deep convolutional model, intending to enhance the discriminative ability of facial descriptions. The HaarNet architecture consists of a backbone network and branch networks. The backbone network learns global feature representations through the inception module [33] and a stack of convolution layers; Haar features are added in the branch networks to extract local and asymmetric facial features; finally, the features of the backbone and branch networks are concatenated by a fully connected layer to obtain a more complete facial description. Annadani et al. [34] fed the edge features learned from the LBP and the histogram of oriented gradients into a fully connected layer to fine-tune the overall CNN, took the feature maps of all convolutional layers, and identified key points in all maps with SIFT operators as the final output features, in order to form new features and improve feature performance. Shi et al. [35] addressed ship identification with Gabor-based multi-scale completed LBP (MSCLBP) and a deep convolution model. This architecture uses a deep convolution model to extract high-level features, while another network branch learns edge features in different directions with a Gabor filter and then obtains local features at different scales with MSCLBP; low-order and high-order information are then combined to produce more discriminative features. In a multi-scale rotation-invariant LBP model [36], the multi-scale and multi-directional feature extraction of the Gabor filter and the rotation invariance of local binary features are also used to complement the features generated by the convolutional neural network model, in order to provide richer information. Hence, in this paper, the proposed shallow-deep convolutional neural network architecture is used with ILBP features to improve performance.
Branches of different depths are adopted for learning, and the features learned by the two branches are combined to improve classification accuracy.

PROPOSED METHOD
This paper proposes an architecture based on the improved LBP shallow-deep convolutional neural network (ILBPSDNet), as shown in Figure 1. The ILBPSDNet inputs consist of original images and ILBP feature maps, in which the first selection feature map (FLBP) and the maximum selection feature map (MLBP) represent the features generated by the first selection method and the maximum selection method, respectively, and the original images are gray-scale maps. Before training the network models, the images were preprocessed: first, the original images were normalized so that the pixel values lay between 0 and 1, and then the normalized images were scaled to 100×100 for network training. Due to the different natures of the data input by the two methods, the concept of the shallow-deep convolutional neural network was introduced to design two network branches of different depths. The branch in the upper half of Figure 1 is designed for ILBP features; as these are a class of higher-order features, a shallow network architecture is used to retain and protect them. The branch in the lower half of Figure 1 uses a deep network architecture to learn the features of the input original images. In order to make the overall network lightweight, lightweight network architectures were selected and modified for the components of the different branches. In the shallow branch, the SqueezeNet architecture [37] was used to decrease the parameters by reducing the input channels of the 3×3 convolution kernels, and a weighted depthwise separable fire module (WDS fire module) was proposed to adjust its internal structure, simplifying the architecture while preserving recognition ability. In the deep branch, the ShuffleNet architecture with group convolutions was used to reduce the network computing cost.
After the training of the shallow and deep network branches, global average pooling was connected to the terminal of each branch in order to output the feature points obtained by that branch, and concatenation was then adopted to combine the features of the two branches. Finally, the one-dimensional vector formed by concatenating the feature points was input to the softmax function to output the classification results. This paper uses the SqueezeNet architecture, the weighted depthwise separable fire module, and the ShuffleNet architecture to realize real-time image computing while maintaining recognition accuracy. In summary, the ILBPSDNet architecture is shown in Figure 1.
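The fusion stage can be illustrated with a minimal NumPy sketch; the branch output shapes and the classifier weights below are placeholders for illustration, not the actual network dimensions:

```python
import numpy as np

def gap(feature_map):
    """Global average pooling: (h, w, c) -> (c,)."""
    return feature_map.mean(axis=(0, 1))

def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_and_classify(shallow_out, deep_out, weights, bias):
    """Pool each branch, concatenate the pooled vectors, and apply a
    softmax classifier, mirroring the fusion stage described above."""
    fused = np.concatenate([gap(shallow_out), gap(deep_out)])
    return softmax(weights @ fused + bias)
```

Because each branch is reduced to a vector before concatenation, the two branches may use different spatial resolutions and channel counts without any alignment step.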

FLBP and MLBP
The ILBP feature extraction algorithm was used in this paper to generate the feature maps. In the first selection method, in order to avoid the loss of image information caused by repeated Gaussian blurring during scale-space construction, the first non-zero edge type is used to represent the edge pattern of the pixel; the result is called the first selection feature map (FLBP).
In the maximum selection method, the maximum response of the pixel is selected across the scale space to reduce the influence of noise on the features; the result is called the maximum selection feature map (MLBP). As shown in Figure 2, the red dots indicate pixels with no meaningful edge pattern at any scale. We found that the red dots produced by the first selection method were significantly fewer than those produced by the maximum selection method. As the first selection method takes the first non-zero value in the scale space, it is less blurred, retains more detailed information, and describes details more clearly. As the maximum selection method takes the maximum response across scales, it is more strongly blurred and some details of the image are lost. Therefore, the two feature maps are used together in this paper to preserve image details and to suppress image noise, respectively.
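Assuming the per-scale edge responses for each pixel are stacked along a leading scale axis (an assumption about data layout for illustration, not the paper's exact representation), the two selection rules can be sketched as:

```python
import numpy as np

def first_selection(scale_stack):
    """FLBP rule: for each pixel, keep the first non-zero response
    along the scale axis (0 means no meaningful edge at any scale)."""
    out = np.zeros(scale_stack.shape[1:], dtype=scale_stack.dtype)
    for layer in scale_stack[::-1]:             # visit later scales first...
        out = np.where(layer != 0, layer, out)  # ...so earlier scales overwrite
    return out

def maximum_selection(scale_stack):
    """MLBP rule: for each pixel, keep the response with the largest
    magnitude across all scales."""
    idx = np.abs(scale_stack).argmax(axis=0)
    h, w = scale_stack.shape[1:]
    return scale_stack[idx, np.arange(h)[:, None], np.arange(w)]
```

A pixel that is zero at every scale stays zero under both rules, which corresponds to the red dots of Figure 2.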

Weighted depthwise separable fire module and shallow neural network
In order to achieve a lightweight network architecture, this paper uses the SqueezeNet network module [37] to reduce parameters. SqueezeNet introduced a new network module called the fire module, which consists of a squeeze layer and an expand layer connected by a ReLU activation layer that increases their generalization ability. In the squeeze layer, the number of 1×1 convolution kernels is denoted S1×1. In the expand layer, the numbers of 1×1 and 3×3 convolution kernels are denoted E1×1 and E3×3, respectively. In order to compress the squeeze layer, S1×1 is set to be less than E1×1 + E3×3.
In the original fire module, in order to reduce network parameters, some 3×3 convolution kernels are replaced by 1×1 convolution kernels, and thus the input channels of the 3×3 convolution kernels are reduced. However, the main value of 1×1 convolution kernels lies in the fact that, without changing the image size, the dimensionality can be raised or reduced arbitrarily to increase or decrease the parameters, and channels can be combined linearly to realize channel fusion. Accordingly, since the main feature extraction lies in the 3×3 spatial convolution kernels, the 1×1 convolution kernels do not greatly influence the overall network accuracy. Therefore, the weighted depthwise separable (WDS) fire module is proposed to weight the input channels of the 3×3 convolution kernels and thereby increase their number. In the WDS fire module, the channels are allocated to the convolution kernels of the original expand layer according to a channel ratio, which determines the number of channels input to each kernel size, as shown in Equations (1)-(3), where kernel size 1×1 denotes the size of a 1×1 convolution kernel and kernel size 3×3 denotes the size of a 3×3 convolution kernel. In order to enhance the recognition ability, the channels input to the 3×3 convolution kernels should be increased, which increases the module parameters. Hence, the depthwise separable convolution proposed in MobileNet [38] is imported to replace the standard convolution operations. The advantage of depthwise separable convolution is that it saves computation even with more and larger kernel maps. Therefore, by increasing the channels of the 3×3 convolution kernels, the depthwise separable convolution saves computation cost while improving the recognition rate.
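The channel bookkeeping and the computational saving can be made concrete with a small calculation. The exact allocation rule of Equations (1)-(3) is not reproduced here, so `expand_channels` is only a plausible reading of a ratio such as 1:3; the two cost functions are the standard multiply-accumulate counts for a full convolution and for its depthwise separable replacement:

```python
def expand_channels(total, ratio_1x1, ratio_3x3):
    """Split the expand layer's output channels between the 1x1 and
    3x3 kernels according to a channel ratio such as 1:3."""
    unit = total // (ratio_1x1 + ratio_3x3)
    return unit * ratio_1x1, total - unit * ratio_1x1

def standard_conv_cost(h, w, c_in, c_out, k):
    """Multiply-accumulate count of a standard k x k convolution."""
    return h * w * c_in * c_out * k * k

def depthwise_separable_cost(h, w, c_in, c_out, k):
    """Depthwise k x k per channel followed by a 1x1 pointwise conv."""
    return h * w * c_in * k * k + h * w * c_in * c_out
```

For 3×3 kernels the separable form costs roughly k²·c_out/(k² + c_out) times less, which is why the 3×3 channel count can be raised without inflating the module.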
After the computation cost of the network was reduced, a method to preserve the characteristics of the input features was required. As discussed above, with the introduction of the concept of scale space, ILBP has a certain expressive ability in local and global descriptions and a certain tolerance to most environmental factors, which enables the preservation of feature characteristics. In this paper, the ReLU function is replaced using two mechanisms, namely, the shortcut mechanism [39] and the SELU activation function [40]. The shortcut strategy can alleviate network degradation when training deep networks. When the channels are compressed, the generated features become concentrated in the reduced channels, and applying a non-linear activation function such as ReLU at this point has a high probability of losing data; this is the common phenomenon of manifold collapse. The depth of the feature space affects the choice of non-linear transformation (such as ReLU): for deeper feature spaces, the module is deeper and the acquired features are more abundant, so the non-linear transformation can enhance the module's generalization ability; however, for shallower feature spaces, non-linear transformations usually cause data loss. The SELU function, whose self-normalizing effect is similar to batch normalization, solves this problem for the proposed shallow network: SELU normalizes the data and improves the network convergence rate. The proposed weighted depthwise separable (WDS) fire module is shown in Figure 3, and the overall shallow network architecture is shown in Table 1.
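The SELU activation used in place of ReLU is fully specified in [40]; a direct NumPy transcription:

```python
import numpy as np

# Constants from the self-normalizing networks paper [40].
SELU_LAMBDA = 1.0507009873554805
SELU_ALPHA = 1.6732632423543772

def selu(x):
    """SELU activation: lambda * x for x > 0 and a scaled exponential
    otherwise; under suitable weight initialization it pushes
    activations toward zero mean and unit variance, acting as an
    implicit normalization."""
    x = np.asarray(x, dtype=float)
    return SELU_LAMBDA * np.where(x > 0, x, SELU_ALPHA * (np.exp(x) - 1))
```

Unlike ReLU, SELU is non-zero for negative inputs (saturating near -lambda*alpha ≈ -1.758), so compressed channels are not zeroed out wholesale.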

Channel quantization and deep neural network
The deep network architecture used in this paper is based on ShuffleNet [41]. As a lightweight architecture, ShuffleNet is internally constructed from network modules such as group convolutions and depthwise convolutions. Channel shuffling is placed after the group convolution to increase the information flow among channels. The basic principle of group convolution is to partition the input feature map channels and the convolution kernels into groups, convolve each group separately, and finally splice the results of all groups; this greatly reduces the required computation. Group convolution essentially limits the convolution operations within groups, so each output feature is derived from only a few input features; this lack of information flow between groups decreases the recognition ability of the learned features, which is why the channel shuffling mechanism is imported. Channel shuffling is essentially a recombination of channel maps that ensures the input features of subsequent group convolutions contain information from all groups, thus achieving information flow among the groups. However, the more groups there are, the more information is lost. In order to solve this problem, [42] proposed the merging and evolution (ME) module, in which the merge step is a pointwise convolution operation. The main functions of the pointwise convolution are the linear combination of channels, the aggregation of channel features, and the reduction of information loss among groups. After merging, the evolution step is mainly used to obtain the spatial information of the feature maps and to compensate for the insufficient flow among the channels of ShuffleNet. Based on the concepts of ShuffleNet and ME, we observe that the key factor influencing the recognition ability of ShuffleNet is channel shuffling, which reduces the loss of channel features among groups and improves the expressive ability of the overall features.
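ShuffleNet's channel shuffling has a well-known reshape-transpose-reshape form; a sketch on a channels-last feature map:

```python
import numpy as np

def channel_shuffle(x, groups):
    """ShuffleNet channel shuffle on an (h, w, c) feature map:
    view the channels as (groups, c // groups), transpose the two
    axes, and flatten, so the next group convolution receives
    channels drawn from every group."""
    h, w, c = x.shape
    return (x.reshape(h, w, groups, c // groups)
             .transpose(0, 1, 3, 2)
             .reshape(h, w, c))
```

For 6 channels in 2 groups the order [0, 1, 2, 3, 4, 5] becomes [0, 3, 1, 4, 2, 5]: each consecutive pair now spans both original groups.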
Hence, this paper proposes another channel mixing mechanism, called channel quantization. Channel quantization depends mainly on the feature maps generated by all channels through the activation functions. The most important use of activation functions is to extract the prototypes of the features, and the channel features can be ranked by importance by quantizing the activation values. The four steps of channel quantization are shown in Figure 4. First, calculate the sum of the activation values of each channel feature map after the group convolution operation. Given an input feature map X ∈ ℝ^{h×w×c}, let f be the 1×1 group convolution transformation f: ℝ^{h×w×c} → ℝ^{h×w×c}, where h and w are the height and width of the feature map, c is the number of channels, and g is the number of groups in the group convolution. The output feature map Z is obtained by the group convolution

Z = σ(BN(W * X)),    (4)

where W is the kernel tensor of the g groups, each of size 1×1×(c/g)×(c/g); * is the convolution operator; BN is batch normalization; and σ is the ReLU activation function. The output feature map Z can be regarded as the set of all channel feature maps Z_k. The activation value V_k of each channel is then calculated as

V_k = Σ_{i,j} Z_k(i, j),    (5)

where Z_k(i, j) is the activation value of pixel (i, j) in the feature map of channel k.
In step 2, the activation values V_k of all channel feature maps are ranked from largest to smallest. In step 3, the channels are grouped according to this ranking into a total of c/g quantization groups: the top g channels form the first quantization group, the channels ranked g+1 to 2g form the second quantization group, and the remaining channels are grouped in the same way. In the last step, each group convolution randomly selects one channel feature from each quantization group and places it into its own group. As the channels are recombined in this process, the flow among channels increases and the information loss among groups is reduced. The channel quantization module proposed in this paper is shown in Figure 5, where (a) is the standard unit and (b) is the down-sampling unit. The overall deep network architecture is shown in Table 2.
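The four steps above can be sketched in NumPy as follows; the tie-breaking and the random-assignment details are assumptions, since the paper specifies them only at the level of Figure 4:

```python
import numpy as np

def channel_quantization(z, g, rng=None):
    """Regroup the channels of feature map z (h, w, c) into g groups,
    each containing one channel drawn from every quantization group.

    Steps mirror Figure 4: (1) sum each channel's activations, (2) rank
    channels by that sum, (3) split the ranking into c // g quantization
    groups of g channels, (4) give each output group one randomly chosen
    channel from each quantization group."""
    rng = np.random.default_rng(rng)
    h, w, c = z.shape
    v = z.sum(axis=(0, 1))                    # step 1: activation sums V_k
    ranked = np.argsort(-v)                   # step 2: largest first
    quant_groups = ranked.reshape(c // g, g)  # step 3: rows of g channels
    groups = np.empty((g, c // g), dtype=int)
    for q, row in enumerate(quant_groups):    # step 4: deal channels out
        groups[:, q] = rng.permutation(row)
    return groups  # groups[i] lists the channel indices of group i
```

Every output group thus spans the full importance range, from the strongest to the weakest channels.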

Description of the experimental environment
The software and hardware specifications of this experiment are as follows. In terms of software, the Windows 10 operating system was used; the Keras high-level API and the TensorFlow framework were used as the front end and back end of the deep learning toolchain, respectively; the CUDA platform launched by Nvidia was used for GPU computing to speed up highly parallel tasks; and cuDNN, Nvidia's library for deep neural networks, was used to accelerate the operation of the deep neural networks. In terms of hardware, the CPU was a quad-core Intel Core i7-6700K processor, the memory was 32 GB, and the display adapter was a GeForce RTX 2080.

TABLE 3 Number of training, validation, and test images in each data set

Data set       Training   Validation   Testing
Chars74K       9207       1023         930
ICDAR03        5501       512          379
ICDAR03-aug    47070      20173        379
IIIT5K         6774       2904         269
IIIT5K-aug     95812      10646        269
MNIST          42000      18000        10000
SVHN           65931      7326         10321

Dataset
The experimental data sets in this paper were: IIIT5K [43], ICDAR03 [44,45], Chars74K [46], MNIST [30], and SVHN [47]. The IIIT5K data set was obtained from Google image search, and the data included billboards, signboards, doorplates, movie posters, and other images, which were cropped natural scenes and artificial images; there were more than 5000 single-word images. The ICDAR03 data set, taken from the character recognition competition organized by ICDAR in 2003, provided training sets to adjust the algorithms and test sets for performance evaluation. Chars74K, a classic data set for character recognition, contained capital and lowercase English letters (A to Z, a to z) and numbers (0 to 9); there were three image acquisition methods: natural scenes, handwritten images, and computer-generated images with different fonts, for a total of about 74,000 images. The MNIST data set, an image data set of handwritten numerals, contained 60,000 training images and 10,000 test images, each of size 28×28. The SVHN data set was an image data set of real scenes, with images obtained from the house numbers in Google Street View, intended to address character recognition in the real world. Table 3 shows the number of training, validation, and test images for all data sets, where "aug" indicates that data augmentation was used. Data augmentation greatly enlarged the data size through random cropping, horizontal flipping, and brightness transformation of the original input images.

Training details
Before training, the images were scaled to 100×100 and the data were normalized so that the values were between 0 and 1. During training, the batch size was set to 128 and 100 epochs were run in total. The learning rate was set by a piecewise function with an initial value of 0.01, reduced to 10% of its value every 60 epochs, and the Adam optimizer [48] was used. In order to improve the overall generalization ability of the network, a 10-fold cross-validation strategy was adopted. Figures 6 to 12 show the training accuracy curve and training loss curve for each data set. Table 4 shows the recognition rates of all test data sets and the tested network architectures. The proposed ILBPSDNet shows good recognition ability on all data sets. On the Chars74K data set, ILBPSDNet has better recognition ability than small network architectures such as MobileNet. Image data are scarce in ICDAR 2003 and IIIT5K, yet ILBPSDNet retains a certain recognition ability under such circumstances. By contrast, although Inception and ResNet have large network architectures, they cannot achieve the desired results on small samples. When data augmentation is used on ICDAR 2003 and IIIT5K, the recognition rates of all networks increase significantly, demonstrating that the input data are critical for training good models. Table 5 shows the network parameters and floating-point operations of all tested networks. In the shallow-deep neural network modules of ILBPSDNet, the overall network parameters and computations are reduced considerably by the proposed WDS fire module and by the group convolutions and depthwise convolutions of ShuffleNet, which is consistent with the lightweight network architecture concept proposed in this paper. Tables 6 and 7 show the GPU performance during training and the GPU speed during testing of all tested networks, respectively.
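The piecewise learning-rate schedule described above amounts to the following helper (the function name and parameterization are ours, for illustration):

```python
def learning_rate(epoch, initial=0.01, drop=0.1, step=60):
    """Piecewise-constant schedule: start at `initial` and multiply by
    `drop` every `step` epochs, matching the setting described above."""
    return initial * (drop ** (epoch // step))
```

With 100 total epochs, the rate is therefore 0.01 for epochs 0-59 and 0.001 for epochs 60-99.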
The data in the tables are the results produced when the SVHN data set is used for training and testing. According to our observations, the architectures with many parameters and large amounts of computation, as shown in Table 5, have poor GPU performance and speed. Surprisingly, while the parameters and computation of ILBPSDNet are reduced to a satisfactory extent, its GPU performance is poorer than expected, mainly because the memory access cost (MAC) was ignored. A recent study [49] pointed out that too many group convolutions and network branches often decrease overall network performance. Assume that the GPU memory is large enough to store all feature maps and parameters, and consider a 1×1 group convolution with c1 input channels, c2 output channels, a feature map of size h × w, and g groups. The computation cost O is as shown in Equation (6):

O = hwc1c2/g.    (6)

ILBPSDNet performance test
The memory access cost is shown in Equation (7), MAC = hw(c1 + c2) + c1c2/g. It is determined by the sizes of the input and output feature maps (the first term) and the number of weights (the second term).
It can be seen from Equation (7) that, when the size of the input feature map and the calculation cost are fixed, the memory access cost increases with the number of groups g. This confirms that the ILBPSDNet architecture uses too many group convolutions, which leads to its low GPU performance.
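To make the trade-off between Equations (6) and (7) concrete, the sketch below evaluates both costs for a 1×1 grouped convolution; the function names are our own, and the formulas follow the cited analysis [49]. Scaling the channel count with g so that the calculation cost stays roughly constant shows the MAC growing with the number of groups.

```python
def conv1x1_flops(h, w, c1, c2, g):
    """Calculation cost of a 1x1 grouped convolution, Equation (6)."""
    return h * w * c1 * c2 // g

def conv1x1_mac(h, w, c1, c2, g):
    """Memory access cost, Equation (7): feature-map I/O plus weights."""
    return h * w * (c1 + c2) + c1 * c2 // g

# Fixed 28x28 feature map; channels chosen so c*c/g stays near 4096,
# i.e. the calculation cost is held roughly constant while g grows.
settings = [(1, 64), (2, 91), (4, 128), (8, 181)]  # (groups, channels)
macs = [conv1x1_mac(28, 28, c, c, g) for g, c in settings]
```

With the calculation cost pinned, `macs` increases monotonically across the four settings, matching the observation that heavy use of group convolutions hurts GPU performance.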

Performance test of groups in a group convolution
The output channels, as shown in Table 2, are adjusted according to the parameters and computations. With a fixed number of parameters (less than 1.5 M) and a fixed amount of computation (less than 3 M), a similar number of channels can be allocated in all group convolutions. According to the observations in Table 8, although the recognition rate increases slightly when more than one group is used, the recognition ability does not improve further, but instead decreases as the number of groups grows. Regarding GPU performance, the more groups in a group convolution, the more it costs to complete one epoch. The relationship between the number of groups and accuracy is shown in Figure 13.
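As a minimal illustration of what a grouped 1×1 convolution computes (a sketch of our own, not the paper's implementation): the c1 input channels are split into g groups, each group is convolved with its own weight matrix, and the group outputs are concatenated, so every output channel sees only c1/g of the inputs and each group holds only (c1/g)·(c2/g) weights instead of c1·c2.

```python
def grouped_conv1x1(x, weights):
    """1x1 grouped convolution at a single spatial position.

    x:       list of c1 input channel values
    weights: list of g weight matrices; group k maps its c1/len(weights)
             input channels to len(weights[k]) output channels
    """
    g = len(weights)
    size = len(x) // g                      # input channels per group
    out = []
    for k, wk in enumerate(weights):
        chunk = x[k * size:(k + 1) * size]  # this group's input slice
        for row in wk:                      # one row per output channel
            out.append(sum(w * v for w, v in zip(row, chunk)))
    return out
```

For example, with g = 2 and inputs [1, 2, 3, 4], the first group only ever mixes channels 1 and 2 and the second group channels 3 and 4, which is why information exchange between groups needs an extra mechanism such as ShuffleNet's channel shuffle.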

Channel ratio performance test
The importance of spatial convolution in improving recognition ability was explored. The parameters and computation can be significantly reduced by increasing the number of channels fed into the 1×1 convolution kernels and reducing the number of channels fed into the 3×3 convolution kernels. Table 9 shows the channel ratios of E1x1 and E3x3 in the WDS fire module of the shallow network architecture. According to the experimental results, recognition ability could be improved by moderately increasing the channels fed into the 3×3 convolution kernels. In this experiment, the best result was obtained at a channel ratio of 1:3.

The performance of the 1×1 convolution operation was significantly lower than that of the 3×3 convolution operation. According to the results obtained at different channel ratios, the recognition rate increased slightly as the computation of the 3×3 convolutions was gradually increased. However, at a channel ratio of 1:7, we found that the recognition rate suddenly decreased, which might be caused by the excessive reduction of the 1×1 convolutions. While 3×3 convolutions can extract spatial features, 1×1 convolutions can both fuse channels and generate new features through different linear channel combinations, and such features improve the overall generalization ability of the network; therefore, how to balance 1×1 convolutions and spatial convolutions remains a very important issue. The relationship between the channel ratio and accuracy is shown in Figure 14.
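The parameter trade-off behind the E1x1:E3x3 ratio can be sketched as follows; the helper and its argument names are our own illustration of a fire-style expand stage, not the paper's code. Assuming the expand stage reads s squeeze channels, each E1x1 channel costs s weights while each E3x3 channel costs 9s, so shifting channels toward the 3×3 path raises the cost roughly ninefold per channel moved.

```python
def expand_params(squeeze, e1x1, e3x3):
    """Weight count of the expand stage of a fire-style module:
    each 1x1 output channel costs `squeeze` weights, each 3x3
    output channel costs 9 * `squeeze` weights (biases ignored)."""
    return squeeze * e1x1 * 1 * 1 + squeeze * e3x3 * 3 * 3

# Splitting 64 expand channels at the paper's best ratio of 1:3:
total, ratio = 64, (1, 3)
e1 = total * ratio[0] // sum(ratio)   # channels into the 1x1 path
e3 = total * ratio[1] // sum(ratio)   # channels into the 3x3 path
```

At ratio 1:3 the 3×3 path already dominates the weight budget, which is consistent with the observation that pushing further toward 1:7 starves the channel-fusing 1×1 path.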

CONCLUSION
This paper proposed an architecture based on the improved LBP shallow deep convolutional neural network, which integrates ILBP feature pre-processing to improve character recognition. Two feature maps (FLBP and MLBP) were selected to preserve image details and remove image noise, respectively. In terms of the network architecture, two branch networks of different depths were constructed by introducing shallow and deep neural network architectures, based on the lightweight design principle. In the shallow network, the SqueezeNet architecture was used and a weighted depthwise separable fire module was proposed, in order to reduce the overall network parameters and improve recognition ability. In the deep network, the ShuffleNet architecture was used to reduce the computation cost through group convolutions. Finally, according to the experimental results on many different data sets and different network tests, the ILBPSDNet architecture designed here could effectively improve the overall recognition rate and enable real-time image recognition, with the SELU activation function, the weighted depthwise separable fire module, and the channel quantization mechanism.