Deep adaptation of CNN in Chinese named entity recognition

Named entity recognition (NER) is an important task in the field of natural language processing, but it is more challenging in Chinese because of the lack of natural delimiters. Traditional character-based Chinese NER models directly use long short-term memory (LSTM), gated recurrent units, and other sequence models to extract sentence-level information from character sequences, so word-level information is missing from the model. Therefore, a Chinese NER model called ChineseBERT-CNNs-BiLSTM-CRF was proposed, which uses the ChineseBERT pretrained model as the embedding layer so that the vector representation of each Chinese character contains pinyin, glyph, and conventional character information. In addition, a CNN-based neural network structure called CNNs was presented to extract word-level information from character sequences and alleviate the problem of entity boundary recognition. BiLSTM was used to extract global features (i.e., sentence-level information) and predict the corresponding labels of character sequences. Further, a conditional random field (CRF) was employed to impose certain rule constraints on the predictions of BiLSTM to enhance the recognition effect of the model. The experimental results revealed that the F1 values of the model on the MSRA, People's Daily, and Weibo datasets reached 95.76%, 96.61%, and 70.00%, respectively, highlighting the effectiveness of the model.

Early NER mainly relied on rule-based methods, with dictionaries and hand-crafted rules as representative technologies. 15 Within the limited scope of the dictionary, such methods can achieve good results, but they suffer from low recall and poor portability, and the rules must be re-specified for each new domain. Subsequently, NER technologies based on machine learning methods gradually took the leading position; representative technologies include the hidden Markov model, 16 conditional random field (CRF), 17 and the like. Although such methods largely solve the problems faced by rule-based NER technologies, they place high demands on manual feature extraction. At present, with the improvement of computing power, deep learning algorithms have seen another leap in performance and have once again become a research hotspot in the field of artificial intelligence. A paradigm shift started in 2012, when a deep learning-based model, AlexNet, 18 won the ImageNet competition by a large margin. Since then, deep learning models have been applied to a wide range of tasks in computer vision and NLP, improving the state of the art. 19 Compared with machine learning models, deep learning models avoid manual feature selection by learning feature representations and performing classification (or regression) in an end-to-end fashion, and the final effect is generally better. These models [20][21][22][23][24][25][26][27][28][29] have undoubtedly become mainstream frameworks for various tasks in the NLP field, including NER.
NER models also differ according to the characteristics of different languages. At present, mainstream NER models are designed for English, including the classical deep learning-based NER model BiLSTM + CRF. 29 English uses spaces as natural separators between words, so English NLP models usually employ words as tokens, which is simple and effective. Chinese, on the other hand, has no natural separator between words. An intuitive idea is to use word segmentation tools to segment Chinese text and then apply English NER models to Chinese NER (CNER) tasks. However, segmentation errors inevitably occur during word segmentation and affect the subsequent NER task, and the use of dictionaries and similar resources introduces out-of-vocabulary problems. 30 Some researchers instead took Chinese characters as the tokens of the NER model, used the English NER model to obtain the NER label of each Chinese character, and achieved good results. In the field of CNER, 31-33 such character-token models are classified as character-based NER models, while models that first segment words and then recognize entities are classified as word-based NER models. Some researchers 34,35 explicitly compared character- and word-based methods for NER, confirming that character-based methods avoid errors from the word segmentation stage and perform better. However, such character-based models only contain character- and sentence-level information and lack word-level information. Moreover, in the NER task, an entity is recognized correctly only if both its type and its boundary match exactly; entity type recognition is relatively easy, whereas entity boundary recognition is relatively difficult.
Accordingly, a CNER model called ChineseBERT-CNNs-BiLSTM-CRF is proposed to solve the above-mentioned problems; this model mainly includes embedding, CNNs, BiLSTM, and CRF layers. The embedding layer maps the character sequence input to the model into the vector space; the ChineseBERT 36 pretrained model is employed as the embedding layer. ChineseBERT is designed based on the BERT 37 pretrained model. According to the characteristics of the Chinese language, it simultaneously encodes the pinyin, glyph, and character information of Chinese characters into the vector representation, achieving SOTA performance on multiple Chinese NLP tasks. The CNNs layer is composed of a group of CNNs with different convolution kernel sizes; it is applied to extract word-level information and can alleviate the problem of entity boundary recognition. The BiLSTM layer is utilized to extract global features (i.e., sentence-level information) and predict the corresponding NER label of each character. The CRF layer imposes certain rule constraints on the predictions of BiLSTM, which improves the accuracy of entity recognition.

RELATED WORK
After the stage of statistical machine learning algorithms, NER has stepped into the era of deep learning, and our research is likewise based on deep learning. BiLSTM + CRF 29 is a typical representative of the early stage of deep learning and has since become the mainstream framework for the NER task; most research work has so far been based on this model. BiLSTM + CRF models the NER task as a sequence labeling problem: the model uses BiLSTM to extract sentence-level information from the input sequence and output a predicted label sequence, and applies CRF to impose certain rule constraints on the labels predicted by BiLSTM. For example, in the BIO annotation scheme, the label sequence of each named entity must start with B, and the category within the same named entity must be consistent (e.g., the label sequence {B-PER, I-ORG} is invalid). The BiLSTM + CRF model performs well on English NER tasks but is not satisfactory on CNER tasks. The main reason is that English NER is a word-level sequence labeling task, whereas CNER is a character-level sequence labeling task: a pipeline that first segments words and then performs NER suffers from error propagation, while directly applying BiLSTM + CRF at the character level leaves the model without word-level information. To solve these problems, Zhang et al. 38 proposed the lattice LSTM model, which simultaneously encodes characters and all potential words that match a dictionary, explicitly making use of character- and word-level information. Compared with word-based CNER models, this model has no segmentation-error problem. Building on lattice LSTM, Li et al. 39 changed its structure with the help of the transformer 40 and proposed the flat-lattice transformer (FLAT) model, which solves the difficulty of lattice LSTM in making full use of GPU parallelism and improves the running speed of the model. In addition, the lattice LSTM model has a complex structure, which makes it difficult to apply in industrial settings with strict real-time requirements. To address this, Ma et al. 41 proposed incorporating lexicon information directly into the character representation, avoiding complex sequence-model structures; the method is applicable to any neural-network-based NER model. Compared with the lattice LSTM model, it achieves 6.15 times faster inference and better results. Unlike the above-mentioned models, which encode character- and word-level information into the LSTM, our model uses the CNNs structure to encode all potential words containing a Chinese character into the vector representation of that character and then employs a fully connected network to further purify the character features, so that the model contains word-level information.
On the other hand, given that Chinese is a logographic writing system, some studies attempted to encode the glyph information of Chinese characters into the character representation vector to improve the performance of CNER models. To solve the problem of missing pictographic information in simplified Chinese, Meng et al. 42 collected the glyph information of eight kinds of scripts, including bronzeware, clerical, and seal scripts, and proposed a tianzige-CNN model to extract glyph information, addressing the difficulty traditional CNNs have in extracting glyph information effectively. Xuan et al. 43 introduced the CGS-CNN model, which extracts not only the glyph information of each Chinese character but also the interaction information between adjacent Chinese characters; for example, Chinese characters with the same radical (e.g., "杨树" and "柳树") have similar semantics. Further, Xuan et al. 43 presented a method combining a sliding window and an attention mechanism to fuse BERT and glyph representations. We do not extract glyph information directly from glyph images but use the ChineseBERT pretrained model as the embedding layer to obtain the glyph information of each Chinese character. ChineseBERT is designed based on the BERT pretrained model; in addition to glyph information, it integrates Chinese pinyin information and conventional character information as well, so using ChineseBERT yields a better character vector representation.

APPROACH
This section introduces the structure of the CNER model proposed in this article, together with the loss function and optimizer settings used for training. The structure of the proposed CNER model is shown in Figure 1; it mainly includes embedding, CNNs, fully connected, BiLSTM, and CRF layers.

The embedding layer
This layer maps the text sequence input to the NER model into the vector space to obtain a low-dimensional, dense vector representation of each token. Our model takes Chinese characters as tokens and obtains the vector representation of each character in the text sequence through the embedding layer. The mapping between characters and vector representations is given by the embedding matrix E ∈ R^{V×D}, where D is the embedding dimension, namely, the dimension of the vector used to represent a character, and V is the dictionary size. The text sequence input to the model is expressed as s = [w_1, w_2, …, w_n], s ∈ R^{n×V}, where w_i ∈ R^V is the one-hot code of the i-th character and n is the length of the text sequence. The output of the embedding layer can then be expressed as the vector sequence e = s × E = [e_1, e_2, …, e_n], e ∈ R^{n×D}, where e_i is the vector representation corresponding to character w_i in the text sequence. The embedding layer can be obtained by randomly initializing the parameters of the embedding matrix E, but the subsequent training cost is high and the training effect is limited by the size of the dataset. Currently, most studies use GloVe, 44 BERT, ERNIE, 45,46 and other pretrained models as the embedding layer. A pretrained model essentially obtains a good vector representation of each token from a large text corpus through self-supervised learning. Compared with a randomly initialized embedding layer, the NER model can achieve better performance using a pretrained model. In this article, the ChineseBERT pretrained model is employed as the embedding layer. Based on the BERT pretrained model and in combination with cutting-edge techniques in the field of CNER, ChineseBERT simultaneously encodes Chinese character information, glyph information, and pinyin information into the vector representation of characters and achieves new SOTA performance on a variety of Chinese NLP tasks. An overview of the ChineseBERT model is displayed in Figure 2.
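As a minimal illustration of the lookup e = s × E described above, the following PyTorch sketch builds a randomly initialized embedding layer; the vocabulary size, embedding dimension, and character ids are placeholders, and in the actual model this lookup is replaced by the ChineseBERT encoder described next.

import torch
import torch.nn as nn

V, D = 21128, 768                  # assumed dictionary size and embedding dimension
embedding = nn.Embedding(V, D)     # embedding matrix E in R^{V x D}

# one character sequence, already mapped to hypothetical dictionary indices
char_ids = torch.tensor([[101, 2476, 676, 1343, 1266, 776]])
e = embedding(char_ids)            # shape (1, n, D): one vector per character
print(e.shape)                     # torch.Size([1, 6, 768])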
The pinyin embedding of ChineseBERT uses the open-source pypinyin package to convert the input character sequence into a romanized pinyin sequence and employs four special marks to represent the tones. After the pinyin sequence is embedded into the vector space, a CNN with a width of 2 and a max pooling layer are applied to obtain the pinyin embedding of each character. The pinyin embedding characterizes the pronunciation of Chinese characters, which helps handle the highly prevalent heteronym phenomenon in Chinese (the same character has different pronunciations with different meanings). The glyph embedding adopts three fonts: FangSong, XingKai, and LiShu. The glyph of each Chinese character in each font is stored as a 24 × 24 image with pixel values in the range 0-255. For each Chinese character, the three glyph images are combined into a 24 × 24 × 3 tensor and then flattened into a 2352-dimensional vector; the glyph vector representation of the character is then obtained through a fully connected layer. Chinese characters are logographic, and the glyph carries some semantic information, so glyph embedding can enhance the vector representation of Chinese characters. The character embedding is consistent with the BERT model, that is, the embedding of Chinese text sequences at character granularity. ChineseBERT is trained only with the masked LM task, and the masking strategy combines whole word masking and char masking. In Chinese, most words are composed of multiple characters, and char masking alone makes the task too simple: given "我喜欢紫禁[M]," the model can easily recognize that the masked character is "城." Compared with the BERT pretrained model, the whole word masking strategy alleviates this problem and enables the model to learn more effective information.
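The following sketch (not the official ChineseBERT code) mirrors the glyph-embedding path described above: the font images of a character are stacked, flattened, and projected by a fully connected layer. The image tensors are random placeholders and the output dimension is an assumption; the flattened size is computed from the tensor rather than hard-coded.

import torch
import torch.nn as nn

D = 768                                    # assumed glyph embedding dimension
n = 5                                      # number of characters in the sequence
# hypothetical glyph images: three fonts per character, 24 x 24 pixels each
glyph_images = torch.rand(n, 3, 24, 24)
flat = glyph_images.flatten(start_dim=1)   # one flat vector per character
glyph_fc = nn.Linear(flat.shape[1], D)     # flattened glyph vector -> D
glyph_emb = glyph_fc(flat)                 # (n, D) glyph embeddings
print(glyph_emb.shape)                     # torch.Size([5, 768])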

The CNNs layer
The vector sequence output from the embedding layer only contains character-level information. Traditional character-based NER models then directly process the character sequence with LSTM, gated recurrent units (GRU), and other sequence models, so word-level information is missing from the model. To solve this problem, a CNN-based neural network structure called CNNs, which extracts word-level information from character sequences, is proposed in this research. The CNNs consist of a group of CNNs with convolution kernels of different sizes. Each CNN uses a single-channel two-dimensional convolution kernel. The convolution kernel size is expressed as [L, D], where L is the number of characters involved in one convolution operation and D is the embedding dimension of the characters. According to the length of common Chinese words, L is set to 2, 3, 4, and 5, respectively. Convolution kernels with various sizes make the CNNs behave like sliding windows of different widths; as shown in Figure 3, all potential words containing a given character are covered by these sliding windows. A concrete example is given in Figure 4: a matrix with a size of 5 × 4 is obtained after the character sequence "张三去北京" passes through an embedding layer with an embedding dimension of 4, where each row vector is the vector representation of the corresponding character. In Figure 4, the CNNs use two different convolution kernel sizes, namely, [2, 4] and [3, 4], with four convolution kernels of each size. After the character sequence is convolved by these two kernel sizes, each character receives two feature vectors whose dimension equals the number of convolution kernels per size, and all feature vectors are concatenated to form the final vector representation of the character. The output of this layer is expressed as c = [c_1, c_2, …, c_n], c_i ∈ R^{N×Z}, where N is the number of convolution kernel sizes adopted by the CNNs and Z is the number of convolution kernels of each CNN. Each c_i is the vector representation of a character encoded by the CNNs and contains word-level feature information.
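A minimal PyTorch sketch of the CNNs layer is given below: one convolution per kernel width slides over the character-embedding matrix, every kernel width is padded so that each character keeps exactly one output position, and the per-width feature vectors are concatenated. The kernel widths follow the values given above, while the channel count and input sizes are assumptions; the article describes single-channel two-dimensional kernels of size [L, D], and the equivalent one-dimensional-convolution form is used here for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNs(nn.Module):
    def __init__(self, embed_dim, kernel_sizes=(2, 3, 4, 5), channels=64):
        super().__init__()
        self.kernel_sizes = kernel_sizes
        # one convolution per kernel width L; in_channels = embedding dimension D
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, channels, kernel_size=L) for L in kernel_sizes
        )

    def forward(self, e):                         # e: (batch, n, D)
        x = e.transpose(1, 2)                     # (batch, D, n) for Conv1d
        outputs = []
        for L, conv in zip(self.kernel_sizes, self.convs):
            # pad so every kernel width yields exactly n output positions
            padded = F.pad(x, ((L - 1) // 2, L // 2))
            outputs.append(torch.relu(conv(padded)))   # (batch, channels, n)
        c = torch.cat(outputs, dim=1)             # (batch, N * Z, n)
        return c.transpose(1, 2)                  # (batch, n, N * Z)

cnns = CNNs(embed_dim=768)
c = cnns(torch.rand(1, 6, 768))                   # six characters
print(c.shape)                                    # torch.Size([1, 6, 256])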

The fully connected layer
This layer seeks to further purify the data features, forcing the model to learn more effective information and filtering out noise in the data. In addition, a dropout layer is added to prevent the model from overfitting and to enhance its robustness. Dropout deactivates a certain proportion of the neurons in the neural network during training; in the subsequent forward propagation, these deactivated neurons have no effect on downstream neurons, and their weight and bias parameters are not updated during back propagation (Figure 5). The output of this layer is expressed as x = [x_1, x_2, …, x_n], where x_i ∈ R^M and M is the output dimension of the fully connected layer.
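A small sketch of this layer under assumed sizes: a linear projection purifies the concatenated CNN features and dropout randomly deactivates neurons during training; the activation function and dropout rate are assumptions.

import torch
import torch.nn as nn

fc = nn.Sequential(
    nn.Linear(256, 128),   # project the concatenated CNN features (N * Z -> M)
    nn.ReLU(),
    nn.Dropout(p=0.5),     # randomly deactivate neurons during training only
)

x = fc(torch.rand(1, 6, 256))   # (batch, n, M)
print(x.shape)                  # torch.Size([1, 6, 128])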

The BiLSTM layer
The aim of the BiLSTM layer is to extract global features, namely, sentence-level information, and to predict the NER label corresponding to each input character. BiLSTM is composed of a forward LSTM and a backward LSTM, which allows it to capture context information from both directions.
LSTM alleviates the problems of long-term dependence, gradient vanishing, and gradient explosion in RNNs and is well suited to processing sequence inputs. The LSTM unit structure is depicted in Figure 6 and includes three gating structures (i.e., the forget, input, and output gates). The forget gate is used to discard some information in the cell state, and its output f_t is computed as

f_t = σ(W_f · [h_{t−1}, x_t] + b_f),     (1)

where x_t and h_{t−1} represent the current input and the hidden state at the previous time step, respectively, W and b denote the weight and bias, respectively, and σ is the sigmoid activation function.
The input gate stores some information from the current input x_t and the previous hidden state h_{t−1} in the cell state. The outputs of the input gate, i_t and C̃_t, are computed as

i_t = σ(W_i · [h_{t−1}, x_t] + b_i),     (2)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C).     (3)

According to the forget and input gates, the cell state of the LSTM is updated as Equation (4):

C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t,     (4)

where C_t, C_{t−1}, and f_t denote the cell state at the current time step, the cell state at the previous time step, and the forget gate output, respectively, and i_t and C̃_t are the outputs of the input gate. The output gate controls the output information of the current unit, and its outputs o_t and h_t are

o_t = σ(W_o · [h_{t−1}, x_t] + b_o),     (5)
h_t = o_t ⊙ tanh(C_t).     (6)
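The sketch below shows how a bidirectional LSTM can turn the purified character features into per-character label (emission) scores for the CRF; the hidden size, the tag count, and the extra linear projection are assumptions for illustration.

import torch
import torch.nn as nn

num_tags = 13                      # e.g., BMES tags for PER/ORG/LOC plus O (assumed)
bilstm = nn.LSTM(input_size=128, hidden_size=256,
                 batch_first=True, bidirectional=True)
emission = nn.Linear(2 * 256, num_tags)   # forward + backward states -> tag scores

x = torch.rand(1, 6, 128)          # output of the fully connected layer
h, _ = bilstm(x)                   # (batch, n, 2 * hidden_size)
scores = emission(h)               # (batch, n, num_tags), passed on to the CRF
print(scores.shape)                # torch.Size([1, 6, 13])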

The CRF layer
The CRF layer imposes certain rule constraints on the label sequence predicted by BiLSTM so as to ensure the validity of the predicted labels. The CRF in NER models generally refers to the linear-chain CRF, which is a discriminative probability model and is written as P(Y|X). Given that the random variable X takes the value x, the conditional probability that the random variable Y takes the value y is expressed as

P(y|x) = (1 / Z(x)) exp( Σ_{i,k} λ_k t_k(y_{i−1}, y_i, x, i) + Σ_{i,l} μ_l s_l(y_i, x, i) ),     (7)

where Z(x) is the normalization factor, t_k and s_l are the feature functions, and λ_k and μ_l are the corresponding weights.
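One possible realization of this layer (the article does not prescribe an implementation) is the third-party pytorch-crf package, which provides a linear-chain CRF with a log-likelihood forward pass and Viterbi decoding; the emission scores and gold tags below are random placeholders.

import torch
from torchcrf import CRF           # pip install pytorch-crf (assumed dependency)

num_tags = 13
crf = CRF(num_tags, batch_first=True)

emissions = torch.rand(1, 6, num_tags)        # BiLSTM emission scores
tags = torch.randint(0, num_tags, (1, 6))     # gold label sequence
log_likelihood = crf(emissions, tags)         # negated to obtain the training loss
best_path = crf.decode(emissions)             # Viterbi-decoded label sequence
print(log_likelihood.item(), best_path)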

The training
The loss function of the model is the negative log-likelihood, formulated as

L(θ) = − Σ_{s∈S} log p(y_s | h_s; θ),     (8)

where S, p, and θ denote the set of sentences in the training set, the conditional probability produced by the CRF, and the parameter set, respectively, and h_s and y_s are the hidden vector sequence and the label sequence of sentence s. The Adam optimizer 47 is adopted for model training.
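A minimal sketch of one training step under these settings: the negative log-likelihood returned by the CRF is minimized with Adam. The linear layer merely stands in for the upstream ChineseBERT-CNNs-BiLSTM stack, and all sizes and the learning rate are assumptions.

import torch
import torch.nn as nn
from torchcrf import CRF

num_tags = 13
emitter = nn.Linear(128, num_tags)            # stand-in for the upstream layers
crf = CRF(num_tags, batch_first=True)
optimizer = torch.optim.Adam(
    list(emitter.parameters()) + list(crf.parameters()), lr=1e-3)

h = torch.rand(1, 6, 128)                     # hidden vectors h_s of one sentence
y = torch.randint(0, num_tags, (1, 6))        # gold label sequence y_s

optimizer.zero_grad()
loss = -crf(emitter(h), y)                    # negative log-likelihood L(theta)
loss.backward()
optimizer.step()
print(loss.item())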

Datasets
Three benchmark datasets were used in our experiments, and all original corpora were converted to the BMES tagging scheme (illustrated briefly below). The first is the MSRA corpus released by the third SIGHAN Chinese language processing bakeoff. 48 It contains 45,000 sentences for training and 3442 sentences for testing. The MSRA dataset is annotated with three named entity categories: PER (person), ORG (organization), and LOC (location).
The second dataset is the People's Daily dataset. It includes 20,864, 2318, and 4636 sentences for training, validation, and testing, respectively. The People's Daily dataset is annotated with the same three named entity categories: PER, ORG, and LOC.
The third dataset is the Weibo corpus extracted from Sina Weibo. 49 It contains 1350, 270, and 270 sentences for training, validation, and testing, respectively. The Weibo dataset covers four named entity categories: PER, ORG, LOC, and GPE (geo-political entity).
The detailed statistics of each dataset are presented in Table 1.
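For concreteness, the short sketch below shows what BMES labels look like on a made-up sentence (the sentence from Figure 4); it illustrates the tagging scheme only and is not drawn from any of the three corpora.

sentence = ["张", "三", "去", "北", "京"]
labels = ["B-PER", "E-PER", "O", "B-LOC", "E-LOC"]   # 张三 = PER, 北京 = LOC
for char, label in zip(sentence, labels):
    print(char, label)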

Experimental setup
In this article, precision (P), recall (R), and the F1 value were employed as the evaluation indicators of the model. Precision is the proportion of entities correctly identified by the NER model (both entity type and boundary are correct) among all entities predicted by the model. Recall is the proportion of correctly identified entities among the gold standard entities. The F1 value is the harmonic mean of precision and recall and reflects the overall performance of the NER model. The formulas are as follows:

P = num(correct) / num(predict),
R = num(correct) / num(gold),
F1 = 2 × P × R / (P + R),

where num(x) denotes the number of x, correct denotes the correctly identified entities, and predict and gold denote the entities predicted by the model and the gold standard entities, respectively. The main experimental environment is presented in Table 2.
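The following sketch computes these entity-level metrics from sets of (start, end, type) spans, counting an entity as correct only when both boundary and type match; the span format and the example spans are assumptions for illustration.

def prf1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# hypothetical spans for one sentence: (start, end, entity type)
pred = [(0, 2, "PER"), (3, 5, "LOC")]
gold = [(0, 2, "PER"), (3, 5, "ORG")]
print(prf1(pred, gold))   # (0.5, 0.5, 0.5)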

Experimental results
To verify the effectiveness of our proposed model, namely, ChineseBERT-CNNs-BiLSTM-CRF (ChineseBERT-CLC), the BiLSTM + CRF model, the CAN-NER model, 50 and the SoftLexicon (LSTM) model 41 were introduced for comparison. Experiments were conducted on the MSRA, People's Daily, and Weibo datasets. In addition, the effects of different embedding layers, including a random embedding layer, the BERT pretrained embedding layer, and the ChineseBERT pretrained embedding layer, were compared on top of CNNs-BiLSTM-CRF (CLC). The final experimental results are provided in Table 3. Compared with the CLC model using the random embedding layer and the BERT-CLC model using the BERT pretrained embedding layer, the F1 value of the ChineseBERT-CLC model using the ChineseBERT pretrained embedding layer increased by 1.38% and 0.71% on the MSRA dataset, by 1.07% and 0.49% on the People's Daily dataset, and by 6.26% and 1.97% on the Weibo dataset, respectively.

Experimental analysis
To verify the effectiveness of the structure of CNNs in our model, the BiLSTM + CRF model was introduced as the baseline.
Based on the baseline model, the CNN-BiLSTM-CRF model was constructed, in which the CNN uses a single-channel two-dimensional convolution kernel whose size is set to 3. Figure 7 shows how the F1 values of the three models on the MSRA dataset vary with the number of epochs. At the initial stage of training, the performance of all three models improved rapidly. Compared with BiLSTM-CRF, the CNN-BiLSTM-CRF model improved faster and reached a higher level after the CNN was added, and CNNs-BiLSTM-CRF showed a further significant improvement over CNN-BiLSTM-CRF. The three models gradually became stable after the 40th epoch. Over the whole training phase, the performance of the CNN-BiLSTM-CRF model was about 6% better than that of the BiLSTM-CRF model, and the performance of CNNs-BiLSTM-CRF was nearly 3% better than that of the CNN-BiLSTM-CRF model.
Figure 8 presents the corresponding comparison on the People's Daily dataset. According to the results in Figure 9, the F1 values of the three models on the Weibo dataset also vary with the number of epochs. The MSRA and People's Daily datasets are relatively complete and comprehensive, whereas the Weibo dataset is small and relatively difficult to train on. At the initial stage of training, the performance of the three models improved steadily, and the differences were not significant. As training progressed, the BiLSTM-CRF model gradually became stable after the 30th epoch, while the CNN-BiLSTM-CRF and CNNs-BiLSTM-CRF models gradually became stable after the 40th epoch, and the final performance of the latter two models was far better than that of the BiLSTM-CRF model. Compared with the BiLSTM-CRF model, the performance of the CNN-BiLSTM-CRF model improved by approximately 10%. Over the entire training stage, the performance of the CNNs-BiLSTM-CRF model was about 4% better than that of the CNN-BiLSTM-CRF model.

FIGURE 9 Comparison of the three models on the Weibo dataset
The BiLSTM + CRF model directly encodes character sequences with BiLSTM, so the model contains only character-level and sentence-level information and lacks word-level information. After a CNN was added to extract local features from character sequences, the model's performance improved significantly. However, a CNN with a fixed convolution kernel size only encodes the adjacent characters of each character into its vector representation, which is local feature extraction rather than word-level information extraction. The CNNs structure can encode all potential words near each character into the vector representation of that character, which efficiently extracts word-level information and alleviates the problem of entity boundary recognition. The model's performance was further improved after word-level information was extracted with the CNNs.

CONCLUSION
A character-based CNER model, ChineseBERT-CNNs-BiLSTM-CRF, was proposed in this article. To solve the problem that character-based CNER models lack word-level information, a CNN-based neural network structure called CNNs was introduced to extract word-level information from character sequences, followed by the common architecture in the NER field, BiLSTM + CRF. BiLSTM was employed to extract sentence-level information and predict the corresponding NER label of each character, and CRF was used to impose certain rule constraints on the label sequence predicted by BiLSTM, which improves the entity recognition ability of the model. In addition, the ChineseBERT pretrained model was utilized as the embedding layer to obtain a good vector representation of each character, which reduces the cost of model training and improves the performance of the model. The experimental results on three common public datasets (MSRA, People's Daily, and Weibo) demonstrated the effectiveness of our proposed model.

CONFLICT OF INTEREST
The authors declare no financial or commercial conflict of interest.