Improving word vector model with part-of-speech and dependency grammar information

Abstract: Part-of-speech (POS) tagging and dependency grammar (DG) parsing are basic components of natural language processing. However, current word vector models have not made full use of POS and DG information, which limits their performance to some extent. The authors first put forward the concept of the POS vector and then, based on the continuous bag-of-words (CBOW) model, construct four models, CBOW + P, CBOW + PW, CBOW + G, and CBOW + G + P, to incorporate POS and DG information into word vectors. The CBOW + P and CBOW + PW models are based on POS tagging, the CBOW + G model is based on DG parsing, and the CBOW + G + P model is based on both POS tagging and DG parsing. POS information is integrated into the training process of word vectors through the POS vector, solving the problem that POS similarity is difficult to measure. The POS vector correlation coefficient and a distance weighting function are used to train the POS vectors together with the word vectors. DG information is used to correct the information loss caused by fixed context windows, and the dependency relation weight is used to measure the differences between dependency relations. Experiments demonstrate the superior performance of the proposed models while keeping the time complexity of the same order as the CBOW base model.


Introduction
Word representation is a basic task of natural language processing (NLP). To simplify processing, traditional NLP tasks use the one-hot representation to represent words [1]. The one-hot representation uses high-dimensional vectors with only one non-zero dimension, which cannot capture any grammatical or semantic information. Rumelhart et al. [2] proposed the distributed word vector representation (WVR), also known as word embedding, which captures grammatical and semantic information by representing words in a dense, low-dimensional vector space. However, on closer examination, these WVRs and their models still have several shortcomings: (i) all words in the context are treated equally, so the differences between words cannot be distinguished [3][4][5][6][7][8]; (ii) a fixed context window is used to predict the target word, resulting in the loss of some important words in long and compound sentences [3][4][5][6][7][8][9][10]; (iii) using word frequency alone to select words in sub-sampling and negative sampling, which are meant to speed up training and improve word vector quality, is oversimplified [3][4][5][6][7][8][9][10].
To address these issues, in this paper we apply part-of-speech (POS) tagging information and dependency grammar (DG) parsing information, both already available from traditional NLP pipelines, to word vectors to further improve the performance of the word vector model.
POS tagging refers to correctly tagging each word in the result of word segmentation, i.e. determining the POS tag of each word, which may be a noun, verb, adjective, or another type. The basic task of DG parsing, on the other hand, is to determine the syntactic structure of a sentence, i.e. the dependency relationships between its words. Applying DG parsing to the results of word segmentation and POS tagging further reveals the dependency relationships between words. To make full use of POS tagging and DG parsing, we combine POS information and DG information into a word vector model based on the popular continuous bag-of-words (CBOW) model [3,11,12]. POS information refers to the lexical information that helps to determine the POS tag of the target word, carried in the form of a POS vector. The concept of the POS vector is put forward, to our knowledge for the first time, in this paper, to measure the similarity between POS tags and to improve the sub-sampling and negative sampling techniques. DG information, on the other hand, refers to the grammar information contained in the extended part of a given context, which compensates for the information loss caused by the fixed context window of the word vector model. The dependency relation weight, which brings DG information into the word vector model, is used to distinguish the differences between dependency relations. Specifically, we propose four models using POS and DG information: the CBOW + P and CBOW + PW models, based on POS tagging; the CBOW + G model, based on DG parsing; and the CBOW + G + P model, based on both. In the CBOW + P and CBOW + PW models, we optimise the hidden layer's representation of the context by using the POS vector, thereby optimising the prediction of the target word.
The POS vector correlation coefficient and distance weighting function are used to train the POS vectors together with the word vectors. In the CBOW + G model, we use DG information to obtain long-distance dependency word information and to increase the weight of the words related to the target word in the context window. Adding sentence-based DG information to the training process of the word vector model reduces the information loss caused by a fixed-size context window. Furthermore, since no data set exists to measure the quality of POS vectors, we construct a POS analogy data set with 55 groups of test data to measure their effect. Experiments on word similarity, word analogy, and POS analogy tasks presented in this paper show that the proposed word vector models significantly outperform other relevant models while keeping the time complexity of the same order of magnitude as the popular CBOW base model. This paper is organised as follows: Section 2 briefly reviews related works; Section 3 presents the four models proposed in this paper; Section 4 gives the experimental results and comparisons with other models; and Section 5 concludes the paper.

Related works
In recent years, Mikolov et al. [3] proposed the CBOW and Skip-gram models, which are simple, effective, and easy to train on large-scale corpora. These models are therefore popular in NLP tasks such as text classification [13,14], named entity recognition [15,16], and word segmentation [17,18]. Turian et al. [19] classified distributed WVR into three categories: (i) matrix-based, (ii) clustering-based, and (iii) neural network-based. Matrix-based distributed WVR represents a word via a word-context matrix: each row corresponds to a word, each column to a context, and each element to the co-occurrence count of the word and the context. Clustering-based distributed WVR uses clustering to represent a word by its Huffman code. Neural network-based distributed WVR captures semantic associations between words through multi-layer neural networks that can represent complex contexts, and has therefore become the mainstream technology of WVR.
In this paper, we further divide neural network-based distributed WVR into four types according to the information incorporated into the model: (i) basic distributed WVR, (ii) distributed WVR incorporating word structure information, (iii) distributed WVR incorporating an external knowledge base, and (iv) distributed WVR incorporating grammatical information. Type 1 uses context information as the only training signal, e.g. the Skip-gram [3], CBOW [3,[11][12], and GloVe [13] models. Type 2 incorporates structural information of words, including characters, radicals, and strokes. Chen et al. [5] proposed a character and word embeddings model that jointly trains Chinese characters and words and takes the character vector as part of the word vector. Yin et al. [6] proposed adding the target word's radicals to the context representation and then used an improved CBOW model to train words, characters, and radicals jointly. Cao et al. [7] proposed the cw2vec model, which extracts stroke N-gram features of words and trains on them instead of the words themselves. Type 3 attempts to formalise the semantic information contained in an external knowledge base and integrate it into the training process. Chen et al. [5] proposed the concept of the sense vector, which divides a polysemous word into different sense vectors and improves the effect of word vectors in the English disambiguation task. Niu et al. [8] proposed the sememe attention over target model, among others, using the sememes of HowNet to improve WVR; they were the first to use an attention mechanism to add HowNet sememes to WVR learning [20]. Type 4 provides overall semantic information for word representation through the lexical and grammatical analysis of traditional NLP. In a sense, the type 1-3 WVRs that fuse word structure information or an external knowledge base improve the quality of word vectors by expanding the corpora.
Type 4, in contrast, uses the lexical and grammatical analysis of traditional NLP, which reflects the dependency of words at the sentence level and provides semantic information for WVRs without requiring any expanded corpora. POS tagging and DG parsing are important parts of traditional NLP's lexical and grammatical analysis.
To improve the expressive capability of word embeddings, Liu et al. [9] and Pan et al. [10] tried incorporating POS into word embeddings, and Levy and Goldberg [21] first suggested integrating DG into the training process of the CBOW model. However, the POS information for learning word embeddings (PWE) model proposed by Liu et al. [9] and the CWindow-POS (CWP) model proposed by Pan et al. [10] cannot make full use of POS information, as they only modify the words' weights in the context window by using a POS correlation weighted matrix. The dependency-based word embeddings (DBWE) model proposed by Levy and Goldberg [21] does not consider the diversity of dependency relations and limits the scope of DG to the context window. Therefore, we propose four models, CBOW + P, CBOW + PW, CBOW + G, and CBOW + G + P, to make use of the POS and DG information already obtained in traditional NLP tasks and thereby enhance the performance of the word vector model.

Methodology
In this paper, the CBOW + P and CBOW + PW models based on POS tagging are proposed, and the concept of the POS vector is, to our knowledge, formally put forward for the first time. The POS vector is introduced to formalise POS information. The POS vector and the word vector are defined in the same vector space and trained in the same way, so that the context's POS information can help to predict the target word. On the other hand, we propose the CBOW + G model based on DG parsing, together with two strategies, pre-training cosine distance (PTCD) and negative sampling cosine distance (NSCD), to integrate DG information into word vectors. The CBOW + G + P model is a combination of the CBOW + G and CBOW + P models.

CBOW + P model
The PWE model [9] modifies the words' weights in the context window by using a POS correlation weighted matrix that is related to the distance between the target word and the context words. Inspired by the PWE model, we propose the CBOW + P model based on POS tagging, which integrates POS information further into the training process of word vectors and introduces the concept of the POS vector to solve the difficulty of measuring POS similarity. We assume that the contribution of POS to the prediction of the target word is related to the context distance, and we use a function of the context distance to replace the POS correlation weighted matrix of the PWE model, which also reduces the number of training parameters. To measure the effectiveness of the POS vector, similar to the word analogy task, we construct a data set to verify its effect. Surprisingly, the POS vector also exhibits analogical properties, such as vn − v + a = an; hence, we call this data set the POS analogy test set.
The framework of the CBOW + P model is shown in Fig. 1. The CBOW + P model is a three-layer feedforward neural model consisting of an input layer, a hidden layer, and an output layer. The input layer accepts the context's word vectors and POS vectors, both expressed as one-hot vectors. The dimension of a word vector equals the vocabulary size of the corpus, denoted |V|, while the dimension of a POS vector equals the size of the POS set, denoted |E|. The input layer's dimension is therefore 4C × (|V| + |E|), where C is the window size of the context. The hidden layer represents the latent semantics of the context as a low-dimensional vector of dimension N, ranging from 50 to 300. The output layer represents the target word as a one-hot vector with dimension |V| + |E|. The weight matrices between the input layer and the hidden layer and between the hidden layer and the output layer are the parameters of the model. The POS tagging set in this paper follows the PFR People's Daily annotated corpus standard of the Institute of Computational Linguistics, Peking University [22]. We define E as the POS set and |E| as its size; x_i as the ith target word; x_{i−k} and x_{i+k} as the (i−k)th and (i+k)th words around x_i in the context window, respectively; k (0 < k ≤ C, with C the window size of the context) as the distance between word x_{i−k} (or x_{i+k}) and word x_i; w(k) as a function positively correlated with the context distance k; z_{i−k} ∈ E and z_{i+k} ∈ E as the POS vectors of x_{i−k} and x_{i+k}, respectively; and r_{z_{i−k}} as the correlation coefficient of the POS vector, initialised with a random number between 0 and 1 and trained with the model. When r_{z_{i−k}} = 0, the CBOW + P model reduces to the CBOW model.
The CBOW model ignores the effect of POS information in predicting the target word. Yet the POS tags of the words in the context determine the POS tag, and part of the semantics, of the target word, and it is natural to assume that the farther a context word is from the target word, the weaker this effect, i.e. the decisive effect of POS is inversely proportional to the distance k between the target word and the context word. The POS vector is added to the objective function L of the CBOW + P model, which regards POS information as part of the context of the target word, as shown in (1):

L = Σ_{x_i ∈ D} log p(x_i | x_{i−C}, …, x_{i−1}, x_{i+1}, …, x_{i+C})    (1)

where p(x_i | x_{i−C}, …, x_{i+C}) represents the probability that the target word is x_i when the context word sequence is x_{i−C}, …, x_{i−1}, x_{i+1}, …, x_{i+C}. Differing from the context representation of the CBOW model, the CBOW + P model adds the POS vector z_{i−k}, the POS vector correlation coefficient r_{z_{i−k}}, and the distance weighting function w(k) to the context representation. The hidden-layer vector h is computed as in (2):

h = (1/2C) Σ_{1≤k≤C} [v(x_{i−k}) + r_{z_{i−k}} w(k) v(z_{i−k}) + v(x_{i+k}) + r_{z_{i+k}} w(k) v(z_{i+k})]    (2)

Statistics on the frequency of POS tags in Chinese Wikipedia (i.e. the training corpus) show that the occurrence probabilities of the POS types are not equal. The frequency of a POS tag in the training corpus reflects how common its words are, and can serve as another indicator in sub-sampling when calculating the drop probability Q(x_i) of word x_i. Therefore, we propose that both the frequency of a word in the training corpus and the frequency of its POS be considered in sub-sampling to measure how common the word is. The drop probability Q(x_i) of sub-sampling is shown in (3):

Q(x_i) = 1 − sqrt(t / (f(x_i) · g(z_i)))    (3)

where f(x_i) represents the frequency of word x_i in the training corpus; t is a preset threshold, generally set between 10^{−3} and 10^{−6}; and g(z_i) is the frequency of the POS z_i corresponding to word x_i in the training corpus, which we call the POS proportion.
Its definition is given in (4), where E_{z_i} is the set of words in the POS set E with the same POS type as x_i, and the power of 1/4 is a trade-off value to adjust the impact of POS frequency:

g(z_i) = (Σ_{x_j ∈ E_{z_i}} f(x_j))^{1/4}    (4)

On the other hand, the CBOW + P model also uses POS to optimise negative sampling. When choosing negative samples, we consider both word frequency and POS proportion: the CBOW + P model prefers words with high frequency and high POS proportion. The probability P(x_i) that a word is selected as a negative sample is given in (5):

P(x_i) = f(x_i)^{3/4} g(z_i) / Σ_{x_j ∈ V} f(x_j)^{3/4} g(z_j)    (5)
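As a concrete sketch of how the POS proportion could enter sub-sampling and negative sampling, the following Python fragment implements the relations described above on toy statistics. The exact normalisation and the 3/4 damping in the negative-sampling weight are our assumptions in this sketch, mirroring the common word2vec practice, not a definitive reproduction of the paper's formulas.

```python
import math

# Toy corpus statistics: relative word frequencies and each word's POS tag.
freq = {"the": 0.05, "run": 0.01, "quickly": 0.002}
pos_of = {"the": "u", "run": "v", "quickly": "d"}
t = 1e-4  # sub-sampling threshold, matching the experiment settings

def pos_proportion(z):
    """g(z): aggregated frequency of a POS tag, damped by the 1/4 power."""
    total = sum(f for w, f in freq.items() if pos_of[w] == z)
    return total ** 0.25

def drop_probability(word):
    """Q(x): sub-sampling drop probability from word frequency and g(z)."""
    q = 1.0 - math.sqrt(t / (freq[word] * pos_proportion(pos_of[word])))
    return max(q, 0.0)

def negative_sampling_weights():
    """Unnormalised weights favouring high frequency and high POS proportion
    (the 3/4 power is an assumption borrowed from standard word2vec)."""
    return {w: freq[w] ** 0.75 * pos_proportion(pos_of[w]) for w in freq}
```

Frequent words such as "the" receive a higher drop probability than rare words, which is exactly the behaviour sub-sampling is meant to produce.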

CBOW + PW model
Based on the CBOW + P model, we further propose the CBOW + PW model. The difference is that the CBOW + P model treats the POS tag as the object of study, while the CBOW + PW model treats the word itself as the object of study: it assumes that each word in the vocabulary has its own POS vector correlation coefficient, which further refines the influence of POS in the WVR. Specifically, the difference between the CBOW + P and CBOW + PW models lies in the context representation of the target word. The hidden-layer vector h is computed as in (6):

h = (1/2C) Σ_{1≤k≤C} [v(x_{i−k}) + r_{w_{i−k}} w(k) v(z_{i−k}) + v(x_{i+k}) + r_{w_{i+k}} w(k) v(z_{i+k})]    (6)

where r_{w_{i−k}} represents the correlation coefficient of the POS vector corresponding to each word, obtained in the training process and replacing r_{z_{i−k}} of the CBOW + P model. The definitions of the other symbols are as stated for (2) and are not repeated here. The framework of the CBOW + PW model is shown in Fig. 2.
The CBOW + PW model is also a three-layer feedforward neural model consisting of an input layer, a hidden layer, and an output layer, the same as the CBOW + P model. It also uses the improved sub-sampling and negative sampling techniques of the CBOW + P model, as shown in (3) and (5).

CBOW + G model
The CBOW + P and CBOW + PW models use POS information to predict target words. In traditional NLP, DG parsing is often carried out after POS tagging to further reveal the diversity of the dependency relationships between words. Therefore, in this paper, we further improve the word vector model based on DG parsing to correct the information loss caused by the fixed context window. At the same time, the dependency relation weight, introduced to represent the differences between dependency relations, is used to integrate DG information into the word vector model. The CBOW + G model proposed in this paper is also a three-layer model (as shown in Fig. 3), but with the dependency relations between words added to the input layer through DG parsing. The hidden layer and output layer are the same as in CBOW + P and CBOW + PW. The model proposed by Levy and Goldberg [21] does not consider the diversity of dependency relations and limits the scope of DG to the context window. In the CBOW + G model, DG information plays two main roles: (i) DG information takes a sentence, rather than a context window, as its unit. Since a Chinese document usually consists of long, complex, and compound sentences, a context window that is too short or too long degrades the model. Adding sentence-based DG information to the training process captures long-distance dependency information and hence reduces the impact of the fixed-size context window. (ii) DG information raises the relative weight of the words related to the target word in the context window; the dependency relation thus exhibits an effect similar to the attention mechanisms prevailing in recent years. We use the DG annotation system of the Language Technology Platform (LTP) of the Harbin Institute of Technology [23].
We define the dependency relation set Q according to the LTP DG annotation system. The hidden-layer vector h of the CBOW + G model is given in (7):

h = (1/2C) · Σ_{−C≤k≤C, k≠0} v(w_{i−k}) + Σ_{w_j: Relation(w_t, w_j) ∈ J} r_{Q_m} v(w_j)    (7)

where k is an integer window variable bounded by the window size C; w_{i−k} represents the (i−k)th word of the context; Relation(w_t, w_j) represents the dependency relation between the target word w_t and its jth dependent word w_j; J = {Relation(w_t, w_1), Relation(w_t, w_2), …, Relation(w_t, w_j)} is the dependency relation sequence of the target word w_t; Q_m is the mth relation of the dependency relation set Q; and r_{Q_m} is the dependency parameter between the target word w_t and a dependent word whose relation is Q_m, used to distinguish the influence of different dependency relations. We define r_{Q_m} as the average cosine distance of the pairs of words belonging to the same dependency relation in the corpus, and propose two computational strategies to estimate it: PTCD and NSCD. The PTCD strategy computes

r_{Q_m} = (1/|C_m|) · Σ_{(w_i, w_k) ∈ C_m} cos(v(w_i), v(w_k))

where C_m is the set of all word pairs in the corpus belonging to the dependency category Q_m; |C_m| is its size; and (w_i, w_k) is a pair of words in C_m. The NSCD strategy instead averages over a negative sampling set of the dependency relation:

r_{Q_m} = (1/K_m) · Σ_{(w_i, w_k) ∈ NEG_G(Q_m)} cos(v(w_i), v(w_k))

where K_m = |NEG_G(Q_m)| represents the size of the negative sampling set of dependency relation Q_m, and (w_i, w_k) is a pair of words in NEG_G(Q_m) with Relation(w_i, w_k) = Q_m. The CBOW + G model thus captures long-distance DG information and increases the proportion of words that depend on the target word in the context representation. Take window size = 2 as an example: 'record (记录)', '、', 'people (人)', 'called (称为)', and 'historians (历史学家)' are not in the context window of the target word 'study (研究)', yet there is a coordinate (COO) relation between 'record (记录)' and the target word 'study (研究)'.
DG parsing finds words that are closely related to the target word outside the context window, which is of great value in predicting the target word. On the other hand, different dependency relations play different roles. For example, '、' and 'and (和)' have little value in predicting the target word 'study (研究)'. To address this, we pre-train the word vectors and use the NSCD or PTCD strategy to calculate r_{Q_m} for each dependency relation.
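A minimal sketch of the PTCD strategy, assuming the pre-trained vectors are available as a plain lookup table (the vectors below are toy values chosen for illustration):

```python
import numpy as np

def ptcd_weight(pairs, vec):
    """PTCD sketch: r_Qm as the average cosine similarity over the word
    pairs (w_i, w_k) sharing one dependency relation Q_m.
    `vec` is a hypothetical word -> pre-trained vector lookup."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sum(cos(vec[wi], vec[wk]) for wi, wk in pairs) / len(pairs)

# Toy pre-trained vectors; in the paper these come from a first training pass.
vec = {"study": np.array([1.0, 0.0]),
       "record": np.array([1.0, 0.2]),
       "、": np.array([0.0, 1.0])}

r_coo = ptcd_weight([("study", "record")], vec)  # closely related COO pair
r_punc = ptcd_weight([("study", "、")], vec)      # uninformative pair
```

A relation whose word pairs are semantically close (such as COO coordination) receives a high weight, while relations involving punctuation receive a low one, which is the behaviour the model relies on.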

CBOW + G + P model
The CBOW + G + P model is a hybrid of the CBOW + P and CBOW + G models. Experiments show that the CBOW + P model performs better on word similarity tasks, while the CBOW + G model does better on word analogy tasks. Therefore, we integrate the CBOW + P and CBOW + G models into the CBOW + G + P model, taking their respective advantages. The hidden-layer vector h follows the same formalism as in the CBOW + P and CBOW + G models, combining the POS-vector terms of (2) with the dependency-weighted terms of (7).
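A toy sketch of a single hidden-layer term of such a hybrid, combining a POS-vector contribution with a dependency weight. The exact way the two terms are combined here is our illustrative assumption, and all values are toy numbers rather than trained parameters.

```python
import numpy as np

N, C = 3, 2
rng = np.random.default_rng(1)
wv = rng.normal(size=N)  # a context word vector v(w)
pv = rng.normal(size=N)  # its POS vector v(z), in the same space

r_z = 0.5   # POS vector correlation coefficient (toy value)
w_k = 4.0   # distance weighting w(k) = k**2 at distance k = 2
r_q = 0.8   # dependency relation weight of this word's relation (toy value)

# One context slot's contribution: the dependency weight scales the word
# vector (CBOW + G side), the POS term follows CBOW + P's formulation.
slot = r_q * wv + r_z * w_k * pv
h = slot / (2 * C)  # averaged as in the hidden-layer formulas
```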

Experiments
We quantitatively evaluate the quality of word vectors learned by CBOW + P, CBOW + PW, CBOW + G, and CBOW + G + P models on word similarity task, word analogy task, and POS analogy task proposed in this paper.

Experiment settings
Training corpus: Chinese Wikipedia is used as the training corpus of this paper. In preprocessing, LTP [23] and Jieba [24] are used for Chinese word segmentation, POS tagging, and DG parsing. After preprocessing, we obtained a 3.16 GB training corpus of 73,174,132 words, with a vocabulary size of 1,581,744. The POS tagging standard is the POS tagging set of the PFR People's Daily annotated corpus, which contains 48 POS tags, published by the Institute of Computational Linguistics, Peking University. Based on this POS tagging set, we propose a POS analogy test set with 55 groups of test data.
Parameter settings: We compare our methods with the CBOW model proposed by Mikolov et al. [3], the PWE model proposed by Liu et al. [9], the CWP model proposed by Pan et al. [10], and the DBWE model proposed by Levy and Goldberg [21]. For the CBOW + P and CBOW + PW models, the parameters are set as follows: the context window size is 5, the number of iterations is 10, the number of negative samples is 5, the preset sub-sampling threshold is 10^{−4}, and the distance weighting function is w(k) = k^2. For the CBOW + G and CBOW + G + P models, the dimension of the word vector is 200, the number of iterations is 10, the number of negative samples is 5, the preset sub-sampling threshold is 10^{−4}, and the distance weighting function is likewise w(k) = k^2.
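For reproduction purposes, the settings above can be collected into a plain configuration object; the key names below are our own, not taken from any toolkit.

```python
# Experiment settings from this section, as a plain Python config dict.
config = {
    "window_size": 5,            # context window for CBOW + P / CBOW + PW
    "iterations": 10,            # training epochs for all four models
    "negative_samples": 5,       # negative samples per target word
    "subsample_threshold": 1e-4, # preset sub-sampling threshold t
    "vector_dim": 200,           # stated for CBOW + G and CBOW + G + P
    "distance_weighting": lambda k: k ** 2,  # w(k) = k**2
}
```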

Word similarity experiment
The word similarity task evaluates the capability of word vectors to reveal the semantic relevance of words. Two Chinese word similarity data sets provided by Chen et al. [5], Wordsim-240 and Wordsim-297, are selected for evaluation. Wordsim-240 contains 240 pairs of Chinese words and Wordsim-297 contains 297 pairs. One word pair, 'OPEC-石油', which does not appear in Chinese Wikipedia, is deleted from Wordsim-297, yielding Wordsim-296. The similarity score of a pair of words is calculated as the cosine similarity of their word vectors. The Spearman correlation coefficient described by Myers and Sirois [25] is used to measure the agreement between the human annotators' scores and the similarity scores calculated from the word embeddings. The experimental results are shown in Tables 1 and 2, where +P denotes the CBOW + P model, +PW the CBOW + PW model, +G the CBOW + G model, and +G + P the CBOW + G + P model.
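The evaluation protocol can be sketched as follows: compute cosine similarities for each word pair and compare their ranking with the human scores via the Spearman coefficient. The human scores and vectors below are toy values for illustration only.

```python
import numpy as np

def spearman(xs, ys):
    """Spearman rank correlation (no tie handling; enough for a sketch)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical human scores and embeddings for three word pairs.
human = [7.5, 3.0, 1.0]
vecs = [((1.0, 0.0), (0.9, 0.1)),   # near-synonyms
        ((1.0, 0.0), (0.5, 0.5)),   # related words
        ((1.0, 0.0), (0.0, 1.0))]   # unrelated words
model = [cosine(np.array(a), np.array(b)) for a, b in vecs]
rho = spearman(human, model)  # 1.0 when the model ranks pairs as humans do
```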
The results show that the +P and +PW models proposed in this paper are superior to the CBOW, PWE, and CWP models on word similarity tasks, with improvements of up to 3.25%. The +G and +G + P models outperform the CBOW and DBWE models on word similarity tasks, with improvements of up to 8.08%.

Word analogy experiment
This task evaluates the quality of word vectors by examining their capability to discover linguistic regularities between pairs of words. The data set proposed by Chen et al. [5], called the Analogy-1125 data set in this paper, includes three types of analogies: national capitals, cities/provinces, and family vocabulary, with a total of 1125 analogies. As there is, to our knowledge, no Chinese analogy data set for POS vectors, we propose a data set of 55 groups of test data based on the POS tagging set of the PFR People's Daily annotated corpus, called the POS-Analogy-55 data set. (v, vn, a, an), (a, ad, g, dg), and (n, vn, v) are examples from the POS-Analogy-55 data set. The experimental results are shown in Tables 3 and 4.
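Analogy questions of this kind are typically answered with the 3CosAdd rule; below is a minimal sketch with toy POS vectors arranged so that vn − v + a = an holds, mirroring the POS analogy reported in this paper (the vector values are illustrative only).

```python
import numpy as np

def analogy(a, b, c, vocab):
    """3CosAdd sketch: return the label d maximising cos(v_b - v_a + v_c, v_d).
    `vocab` maps labels (words or POS tags) to toy vectors."""
    query = vocab[b] - vocab[a] + vocab[c]
    query = query / np.linalg.norm(query)
    best, best_sim = None, -2.0
    for label, v in vocab.items():
        if label in (a, b, c):  # exclude the question's own labels
            continue
        sim = float(np.dot(query, v / np.linalg.norm(v)))
        if sim > best_sim:
            best, best_sim = label, sim
    return best

# Toy POS vectors constructed so that vn - v + a = an exactly.
vocab = {"v": np.array([1.0, 0.0]),
         "vn": np.array([1.0, 1.0]),
         "a": np.array([0.0, 1.0]),
         "an": np.array([0.0, 2.0])}
answer = analogy("v", "vn", "a", vocab)
```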
The results show that the +P and +PW models proposed in this paper are superior to the CBOW, PWE, and CWP models on the POS-Analogy-55 data set, with improvements of up to 83.32%, though their performance on the Analogy-1125 data set is slightly inferior. On the other hand, the +G and +G + P models outperform the CBOW and DBWE models, with improvement rates of up to 166.74% on the POS-Analogy-55 data set and up to 3.01% on the Analogy-1125 data set, as shown in Table 4.
It can be seen from Tables 1-4 that the CBOW + P, CBOW + PW, CBOW + G, and CBOW + G + P models perform better than the original CBOW model. Since CBOW, PWE, and CWP do not output POS vectors, we use the average of all word vectors belonging to the same POS type as that type's POS vector. The four proposed models improve the quality of word vectors by using both POS and DG information, whereas the PWE and CWP models only use POS weight information between the target word and the context words. The POS vectors proposed in this paper lie in the same vector space as the word vectors and formally represent the impact of POS on word meaning. The time complexity of the four proposed models is of the same magnitude as CBOW, with the number of predictions being O(|V|), where |V| is the vocabulary size; POS tagging and DG parsing are conducted in preprocessing.

Conclusions
In this paper, we proposed four models using POS and DG information: the CBOW + P, CBOW + PW, CBOW + G, and CBOW + G + P models. The CBOW + P and CBOW + PW models optimise the vector representation of the context by using the POS vector, thereby optimising the prediction of the target word. The CBOW + G model uses DG information to obtain long-distance dependency word information and to increase the weight of the words related to the target word in the context window. To verify the effectiveness of the POS vector, we proposed a POS analogy data set consisting of 55 groups of test data. Experiments show that the proposed models outperform the PWE and CWP models when using POS information and outperform the DBWE model when taking DG information into account. In addition, the time complexity of the four proposed models is of the same magnitude as the CBOW model. Future work may explore the following aspects. First, the POS vector exhibits the capability of discovering linguistic regularities between pairs; how to apply this capability in depth to traditional NLP tasks deserves further research. Second, the dependency parameters are calculated directly from cosine distances, and more effective methods of calculating them should be explored. Third, a closer or deeper combination of POS information and DG information should be considered.

Acknowledgments
This work was supported in part by the Department of Education of Guangdong Province under Special Innovation Program (Natural Science), grant number 2015KTSCX183, and in part by the South China University of Technology under 'Development Fund' with fund number x2js-F8150310.