GKEEP: An Enhanced Graph‐Based Keyword Extractor With Error‐Feedback Propagation for Geoscience Reports

As the amount of published geoscience literature grows, reading and summarizing large collections of texts has become a challenging task. Publication keywords can be considered basic components of knowledge structure representations and have been used to reveal knowledge concerning research domains. In contrast to other research domains, however, work on keyword extraction from textual geoscience data remains limited. In this paper, we propose an unsupervised algorithm, the graph-based keyword extractor with error-feedback propagation (GKEEP), that enhances graph-based keyword extraction approaches by using an error-feedback mechanism similar to the concept of backpropagation. The proposed approach comprises the following steps. A preprocessed document is used as the input of the proposed model and is represented as a weighted undirected graph, where the vertices represent words and the edges represent the cooccurrence relationships between words constrained by a window size. The nodes are then ranked by importance scores calculated with a graph-based ranking algorithm, and the resulting word scores are used to compute the scores of keyword candidates. Next, the Word2Vec method is applied to recalculate the candidate scores and rank the candidates to select the final keywords. The algorithm also utilizes error feedback to boost the rankings of the most salient terms that would otherwise be deemed less important. In empirical experiments on two real data sets (including our newly built data set), the proposed GKEEP model outperforms state-of-the-art unsupervised models and existing graph-based ranking models. The proposed method can effectively reflect intrinsic keyword semantics and interrelationships.

have gained prominence. The first comprises machine learning (ML)-based approaches, and the second is based on graph representations of text.
ML methods come in supervised and unsupervised flavors (Sarkar et al., 2010; Sung et al., 2020; C. Zhang, 2008; Zhang et al., 2017; Zhang et al., 2020). Supervised learning approaches need labeled training samples to induce the model. Each instance in the training set denotes a term in the document/report with label 1 (representing a keyword) or 0 (representing a nonkeyword). Developing a training set requires manual annotation of the text, making the task tedious, subjective, and possibly inconsistent. Due to the intense human intervention required, supervised approaches to the keyword extraction task have not been able to sustain interest and popularity. For this reason, unsupervised approaches are favored as an alternative for recognizing keywords/keyphrases. Graph-based methods represent candidate keywords as nodes and the relationship between two nodes as an edge. A set of scoring functions is used to rank the candidates based on a specific graph property (Figueroa et al., 2017; Li et al., 2016; Mihalcea & Tarau, 2004; Rose et al., 2012). Graph-based extraction, however, has the drawback that common or frequently used terms obtain higher scores because they have more edges connected to them. Similarly, rare or infrequent terms obtain lower scores because they are less connected. To overcome this observed weakness of graph-based approaches, we find that the backpropagation concept is well suited.
Although considerable research has been devoted to this topic over the years, the problem of extracting relevant keywords with high accuracy remains unsolved. Several factors contribute to this. (1) More variations: the text in geoscience reports is complex in terms of text characteristics and patterns and is typically written by many different writers/inspectors from various institutions and localities (e.g., "elastic modulus" and "Young's modulus" represent the same quantity). (2) Domain-specific uniqueness: geological reports demonstrate domain-specific uniqueness and include high levels of specialized detail, complex concept identifications, and relationship associations about minerals, rocks, and geological times (e.g., identifying complex terms about the Permian-Triassic Boundary Section in Changxing, Zhejiang and the Maastrichtian Age). The existing statistics-based and ML-based approaches cannot handle such complexity and variability while still obtaining high precision and recall. (3) Existing state-of-the-art graph-based keyword extraction methods rely only on a cooccurrence relation between the candidate keywords, completely ignoring the semantic relationship. These observations motivate the design of word-scoring approaches that account for semantic connectivity among words.
In this study, we propose a graph-based KE algorithm, the graph-based keyword extractor with error-feedback propagation (GKEEP), which utilizes the semantics of word embedding to assist in extracting keywords from geoscience reports.
The key idea of GKEEP is that, when a node with an incorrect score is detected by word embedding, backpropagation learning is used to propagate the error backward through all the nodes of the graph. GKEEP leverages the benefits of word embedding and error-feedback propagation to address the above challenges of KE from geoscience texts, which contain ambiguities due to their lexical variations, free-text narrative style, arbitrary word ordering, and descriptive diversity. The results of our experiments show how backpropagation can be utilized in graph-based keyword extraction methods to produce better-quality keywords. To the best of our knowledge, this work is the first to use an enhanced graph-based approach to mine geoscience keywords from geoscience reports written in the Chinese language.
Our contributions can be summarized as follows.
(1) The proposed method incorporates word embedding to capture the dependency structure as well as the data distribution, and it computes semantic relations to solve the content sparsity problem.
(2) GKEEP utilizes error feedback to boost the most salient terms that graph-based approaches deem less important and designs a word-scoring approach that aims to capture the semantic connectivity of the words in the text.
(3) We conduct a set of experiments to verify the effectiveness of the proposed method on two available manually constructed data sets. The experimental results demonstrate that our algorithm outperforms the state-of-the-art methods and obtains better-quality keywords. Although this paper focuses on the geoscience domain, our approach could easily be extended and adapted to a wider variety of fields such as medicine and chemistry.
The remainder of the paper is organized as follows. In Section 2, some related work about KE methods is reviewed. Our proposed GKEEP method is described in Section 3. Experimental results are presented in Section 4. Section 5 presents a discussion, and Section 6 provides conclusions and discusses future work.

Literature Review
Highlighting these keywords within the body of a document increases the browsing speed of readers. In this section, we briefly review the literature concerning KE methods. Recent work has categorized them as either supervised or unsupervised KE approaches.

Supervised KE
Supervised approaches for keyword extraction make full use of training data sets consisting of texts and their corresponding keywords. KEA (Frank et al., 1999) applied a Naïve Bayes classifier developed from two features extracted from phrases in the documents: the TF-IDF and the relative position of the phrase.
Additional innovative supervised methods have been proposed in recent years, ranging from conditional random fields (CRF; C. Zhang, 2008) to neural networks (Sarkar et al., 2010). C. Zhang (2008) proposed a CRF model that performs string labeling over 20 phrase features. A framework using a maximum-entropy classifier was proposed to extract keywords from meeting transcripts (F. Liu et al., 2011). In addition, various ML-based approaches have been constructed to automatically acquire knowledge from training data. In particular, these approaches are used to develop classification models (Onan et al., 2016; Yang et al., 2016). These models are first trained using a training set with labeled tags; then, the model is validated using a test set. In recent years, DL methods have exhibited surprising abilities to learn features and have achieved state-of-the-art performances on NLP tasks in several fields (Chung et al., 2014; Goodfellow et al., 2016; Santos et al., 2018). Convolutional neural networks and recurrent neural networks have been widely applied to NLP tasks and have achieved remarkable results in many fields (Guo et al., 2017; Palafox et al., 2017; H. Zhang et al., 2017).
Supervised keyword extraction approaches, although the current state of the art, still require manually annotated training data sets, which are rather challenging and time consuming to construct.

Unsupervised KE
Labeled training data are not needed when we use unsupervised approaches for KE. Some approaches are based on term statistics, which assess a term's degree of importance to the text or collection in which it is found (even though many of these methods require a document corpus for their calculations, they are not considered supervised because no labeled corpus is needed).
Statistics-based approaches, such as n-gram statistics, word frequency, and TF-IDF (Salton et al., 1974), are domain independent and do not require a large amount of labeled data; these methods use statistical techniques to extract keywords (Beliga et al., 2015; Zhu & Hua, 2017). Niu et al. (2014) applied high-frequency keywords to extract key information from geoscience and environmental documents. Su et al. (2014) selected high-frequency keywords to analyze research hotspots and develop domain trends. Chen et al. (2015) used the active index (AI) to demonstrate whether a country/region has comparative advantages based on its share of total world publications. Hu et al. (2018) presented a domain-keyword analysis approach using a Google Word2Vec model by extending the term frequency-keyword AI. The experimental results showed that the proposed method enabled more effective analysis of domain knowledge.
Graph-based models generally view a document as a graph whose vertices denote words and whose edges denote the relationship between two words constrained by the window size. TextRank is an unsupervised algorithm that uses only the information in the document itself for KE, and many studies have used TextRank to improve the graph-based ranking algorithm (Figueroa et al., 2017). Rose et al. (2012) proposed the Rapid Automatic KE algorithm, in which stop words and delimiters are used to divide a text into candidate keywords; this algorithm performs better than TextRank. Wang et al. (2015) used pretrained word embeddings as external knowledge to improve TextRank and extract keywords from scientific publications. Their experiments show that training the word embeddings improves the algorithm's performance compared to the original TextRank algorithm.
However, these models have a common drawback: extracting high-quality keywords depends on whether the documents in the collection belong to the same domain. Moreover, graph-based models build a graph to extract keyword candidates, and these candidates are subsequently ranked to select the real keywords from a single document.

Motivations and Limitations of the Study
Considerable research efforts, both inside and outside the geoscience domain, have been undertaken to extract keywords from unstructured text (Gupta et al., 2018; K. Liu & El-Gohary, 2017; C. Wang et al., 2018). Despite their achievements, the existing KE approaches are still limited in their ability to automatically recognize keywords in the geoscience domain, with its highly heterogeneous and complex text. Hence, the main motivation of this research is to empirically enhance the learning methods. Two primary knowledge gaps urgently need to be addressed.
First of all, when extracting keywords from highly heterogeneous and complex text, there is a lack of KE methods that can both reduce manpower input and achieve high performance. Most of the existing KE methods are rule-based methods (J. Zhang & El-Gohary, 2016; Zhou & El-Gohary, 2017) or supervised ML-based methods (G. Wang et al., 2014; Xia et al., 2011). These approaches can address highly heterogeneous and complex text through a large number of representative samples, but large manual efforts are required to construct rules or annotate training data. In essence, it is extremely challenging and time consuming to build patterns (for rule-based KE) or to tag examples (for supervised ML-based KE) from geoscience reports.
Second, semantic approaches to support automated KE are lacking. A body of research exists that has explored the use of semantics in addressing various NLP tasks. Formally and explicitly defined semantics can facilitate the recognition and extraction of keywords. Utilizing semantics to assist automated KE is therefore highly important in our study, given the heterogeneous and complex text in geoscience documents. However, semantics have not been well explored in terms of rule-based or ML-based KE.
Starting from this observation and knowledge gap, this paper focuses on a graph-based KE algorithm with error-feedback propagation for use with geoscience reports. The GKEEP algorithm takes advantage of graph-based KE by considering the structure and content of a single document. Backpropagation is then used to improve its effectiveness, thereby increasing the scores of terms/phrases that are more likely to be relevant and decreasing the scores of those that are less likely.

The Proposed GKEEP Model
To extract keywords from geoscience reports, we begin by analyzing the ways in which terms for rocks, structures, and geological time are used. Figure 1 demonstrates an example related to geological structure. Terms such as "Triassic," "Rock types," and "Lithology" are keyword terms that characterize the source document, which refers to rocks and geological structures. Other terms, such as "Stripes" and "Gray white," reveal the descriptive aspects of these keywords. From the keyword terms, it can be inferred that this report focuses on geological structure.
In this section, we introduce GKEEP and elaborate on its components. The overall KE process of GKEEP is graphically presented in Figure 2 and described in detail in Algorithm 1. As shown in Figure 2, the KE process of the GKEEP algorithm for an input text is performed in four phases. First, the input text is cleaned, and the graph of the input text's words is constructed. Second, each node is ranked according to the graph structure and the ranking algorithm. Third, each node is measured, and the error-detection method is used to detect whether its rank is higher or lower than it should be. In this step, the basic idea is to identify inaccurate scores and assign expected scores to the nodes in the graph with Word2Vec. Fourth, the errors detected in the previous step are applied to adjust the weight of every edge. Finally, a list of candidate keywords with final scores is extracted from the input reports. The keywords are ranked by their scores, and the top N candidate keywords are selected as the core, meaningful keywords.
QIU ET AL.

Algorithm 1. GKEEP
Input: a preprocessed document
Output: the top-N keywords
Step 1: construct the cooccurrence graph G of the document
Step 2: rank each node of G with the graph-based ranking algorithm
Step 3: detect error nodes and compute their expected scores with Word2Vec
Step 4:
repeat
    for each V_j ∈ G do
        update the score A_j
    end for
    for each w_ij ∈ G do
        w_ij ← w_ij + Δw_ij
    end for
until E < ε

In contrast to the most recent graph-based KE approaches, GKEEP uses Word2Vec to enhance graph-based KE algorithms with backpropagation learning. Thus, when important and relevant keywords are falsely assigned low scores during the initial KE process, the semantic information from Word2Vec properly rebalances them through error-feedback propagation.
The following subsections describe the proposed GKEEP and its core phases in detail.

Graph Construction
In the graph-based KE method, text units are first identified and added to the graph as nodes. Then, some meaningful relationships between text units are identified and added as edges between nodes. The resulting graph is used to calculate the importance of each node relative to all the others. Similarly, the score of a particular node affects the score of the node to which it is connected.
The formation is mainly exposed in niechaqu area, and its rock types are also complex. It is mainly composed of a set of gneiss, schist, quartzite and marble. It is composed of upper and lower lithology, which cannot be represented on the map. In chayong area of the survey area, only a small amount of biotite plagioclase gneiss is distributed in late Triassic and late Jurassic intrusive bodies.

Manually assigned keywords: niechaqu area, rock types, biotite plagioclase gneiss, late Triassic and late Jurassic

Note. Keywords shown in bold are those that actually appear in the text.

Table 1
Preprocessed Sample Text From the Collection Corpus, and Its Manually Assigned Keywords

Figure 2. Detailed information about the architecture of the GKEEP model. GKEEP comprises four steps to extract the keywords of the input document.
Step 1 constructs the graph.
Step 2 ranks each node based on the graph structure.
Step 3 detects the error nodes in the graph.
Step 4 applies error feedback, iteratively modifying the edge weights of the graph. Finally, the algorithm aggregates the scores and ranks the keywords. GKEEP, graph-based keyword extractor with error-feedback propagation.

Preprocessing
Regional geological reports are known to be very noisy for mining tasks because they include symbols that do not contain any useful information and that make further processing ineffective. Therefore, an effective preprocessing phase is needed to remove meaningless symbols and allow more effective keywords to be extracted. There are some preprocessing differences between Chinese and English texts; unlike English, Chinese consists of consecutive characters that are not delimited by word boundaries. The goal of text preprocessing is to convert free raw text into a format that is more conducive to further analysis. Following the standard procedure, the preprocessing steps in this research are text cleaning, sentence splitting, tokenization, and stop-word removal. Text cleaning eliminates tables of results, mathematical formulas, author affiliations, and other noisy lines. Sentence splitting divides the text into grammatically meaningful sentences based on sentence boundaries such as periods and question marks. Tokenization segments the continuous raw text into semantically and syntactically meaningful units: words, punctuation, digits, and whitespace. Stop-word removal removes function words and some common words using a standard list of stop words and then eliminates them from the set using a word-matching method. Table 1 shows a sample text selected from the collection corpus after processing, along with its manually assigned (real) keywords.
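The four preprocessing steps above can be sketched as follows. This is a minimal illustration only: whitespace tokenization stands in for a real Chinese word segmenter (e.g., jieba), and the stop-word list is a tiny placeholder rather than the standard list used in this work.

```python
import re

# Illustrative stop-word list; a real pipeline would load a standard list.
STOP_WORDS = {"the", "is", "of", "in", "and", "a"}

def preprocess(raw_text):
    # Text cleaning: drop characters other than word characters, whitespace,
    # and sentence-boundary punctuation (stand-in for removing noisy lines).
    cleaned = re.sub(r"[^\w\s.?!]", " ", raw_text)
    # Sentence splitting on common sentence boundaries.
    sentences = [s.strip() for s in re.split(r"[.?!]", cleaned) if s.strip()]
    # Tokenization (whitespace stand-in for a Chinese segmenter)
    # plus stop-word removal per sentence.
    return [[tok.lower() for tok in s.split() if tok.lower() not in STOP_WORDS]
            for s in sentences]

sents = preprocess("The formation is exposed. Rock types are complex!")
```

Processing sentence by sentence here also matters later: edges of the cooccurrence graph are not allowed to cross full stops.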

Graph Construction: TextRank
The TextRank technique can be easily configured according to different applications. In this section, we describe the configuration used in the experiments to obtain the best results.
The text units added as nodes to the graph are n-grams of size 1-4. The cooccurrence (i.e., the maximum distance in words) between two text units is used to represent the relation between the nodes. In our research, a cooccurrence value of 2 is applied, meaning that an edge is created between two nodes if they cooccur within a window of two words; an exception is made when a full stop separates them. Because of the manner in which GKEEP later applies backpropagation to the text graph, weighted and undirected edges are required, all of which are initialized with a weight of 1. Figure 3 shows the graph constructed by TextRank from the selected sample text.
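The graph construction just described can be sketched as below. The sketch is a simplified reading of the configuration: unigram nodes only (the paper also uses n-grams up to size 4), a window of two words, undirected edges keyed by `frozenset` pairs, initial weight 1, and no edges across full stops because sentences are processed separately.

```python
from collections import defaultdict

def build_graph(sentences, window=2):
    """Build an undirected cooccurrence graph: one edge per word pair that
    cooccurs within `window` words inside a sentence, weight initialized to 1."""
    edges = defaultdict(float)
    for tokens in sentences:  # a full stop ends a sentence: no cross-sentence edges
        for i, u in enumerate(tokens):
            for j in range(i + 1, min(i + window, len(tokens))):
                v = tokens[j]
                if u != v:
                    edges[frozenset((u, v))] = 1.0  # initial weight 1
    return edges

g = build_graph([["late", "triassic", "intrusive"], ["rock", "types"]])
```

A library such as networkx could store the same structure, but a plain dict keeps the later weight updates explicit.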

Node Ranking
TextRank applies a recursive algorithm to the graph, iteratively assigning a score to each node until convergence is achieved. In short, the score of a node in the graph affects all the scores associated with that node, and vice versa.
For the recursive algorithm, all nodes in the graph are initialized with a score of 1, and all edge weights are set to 1. For each node V_i in graph G, In(V_i) denotes the set of incoming nodes (those that point to it), and Out(V_i) denotes the set of outgoing nodes (those that V_i points to). Using TextRank, the rank/relevance of node V_i is calculated by Equation 1:

S(V_i) = (1 − d) + d · Σ_{V_j ∈ In(V_i)} [w_ji / Σ_{V_k ∈ Out(V_j)} w_jk] · S(V_j)    (1)

Here, d represents a damping factor (usually set to 0.85), which represents the probability of jumping from one node to the next, (1 − d) is the probability of transferring to a new node, and w_ji is the weight of the edge from V_j to V_i. This recursive ranking algorithm is applied to the graph until it converges to within a set threshold (set to 0.005 in these experiments).
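The iteration of Equation 1 can be sketched as follows. The graph representation (undirected edges keyed by `frozenset` pairs, so In(V) and Out(V) coincide with the neighbor set) and the convergence test on the maximum score change are illustrative choices, not the authors' implementation.

```python
def textrank(nodes, edges, d=0.85, tol=0.005, max_iter=100):
    """Weighted TextRank: scores start at 1 and are updated per Equation 1
    until the largest change falls below `tol`."""
    scores = {v: 1.0 for v in nodes}
    # Undirected graph: In(V) and Out(V) are both simply the neighbors of V.
    neighbors = {v: [u for u in nodes if u != v and frozenset((u, v)) in edges]
                 for v in nodes}
    for _ in range(max_iter):
        new = {}
        for v in nodes:
            rank = 0.0
            for u in neighbors[v]:
                # Normalize w_uv by the total outgoing weight of u.
                out_sum = sum(edges[frozenset((u, w))] for w in neighbors[u])
                rank += edges[frozenset((u, v))] / out_sum * scores[u]
            new[v] = (1 - d) + d * rank
        converged = max(abs(new[v] - scores[v]) for v in nodes) < tol
        scores = new
        if converged:
            break
    return scores

# Toy chain a - b - c: the central node b should rank highest.
edges = {frozenset(("a", "b")): 1.0, frozenset(("b", "c")): 1.0}
s = textrank(["a", "b", "c"], edges)
```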

Error Detection
The enhanced graph-based representation step detects whether a node has been assigned the correct score in each individual sentence. Key domain terms, however, often have a low term frequency and are therefore assigned a low score, even though low frequency is sometimes a signal that a word is a key term. Inspired by error backpropagation in neural networks, this step is designed to consider the error and trace it back to the responsible units. Our work, which aims to recognize and extract keywords at the topic aspect level from each individual geoscience report, leads to the idea that KE would be improved if the reason for an error could be assessed. We noticed that low-frequency words usually contribute much more. Therefore, our goal is to build an error-feedback mechanism to enhance KE.
The next step in the GKEEP model is to detect and identify nodes with inaccurate scores and then assign them the expected scores. Many graph-based methods obtain inappropriate scores when using only word frequencies as the measure of a term's importance. From the constructed graph, GKEEP detects nodes with inappropriate scores (i.e., those assigned a higher or lower score than they deserve) and then assigns them more accurate scores. To address the lack of domain-specific synonyms resulting from the traditional use of word frequencies, we select the Word2Vec model (Mikolov, Sutskever, et al., 2013; Mikolov, Yih, et al., 2013). In this subsection, we describe the two measures used in our experiments: TF-IDF and Word2Vec.

TF-IDF
TF-IDF represents the importance of a phrase or word to the input document within the collection; the higher the value, the higher the correlation. Since our test corpus is composed of reports and abstracts of research papers belonging to a specific field, the TF-IDF measure is a sufficient and simple measure for our backpropagation algorithm. Given each node V_i in graph G (i.e., each phrase or word in an input document), we compute its TF-IDF. The TF-IDF measure for phrase P in document D is calculated as follows:

TF-IDF(P, D) = [frequency(P) / wc(D)] · log(|C| / df(P))

where frequency(P) denotes the number of times P occurs in D, wc(D) denotes the number of words in D, |C| denotes the number of documents in document collection C, and df(P) denotes the number of documents in C that include P.
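The TF-IDF measure can be sketched directly from these definitions. The log form of the inverse document frequency is the standard choice and is assumed here; the toy documents are illustrative only.

```python
import math

def tf_idf(phrase, doc, collection):
    """TF-IDF of `phrase` in `doc` over `collection`:
    [frequency(P) / wc(D)] * log(|C| / df(P))."""
    tf = doc.count(phrase) / len(doc)               # frequency(P) / wc(D)
    df = sum(1 for d in collection if phrase in d)  # df(P)
    return tf * math.log(len(collection) / df) if df else 0.0

docs = [["gneiss", "schist", "gneiss"], ["marble", "schist"], ["quartzite"]]
score = tf_idf("gneiss", docs[0], docs)
```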
In the KE task, however, because the documents being processed are no longer traditional text documents, TF-IDF cannot provide accurate domain-specific scores. In our example, some words are more likely to appear in both the target document and the background documents than in a document that consists of a list of keywords.

Word2Vec
The Word2Vec model is corpus based but extends the n-gram language model to help determine the semantic distance between two words without any supervision information. Therefore, we chose the Word2Vec model to dynamically model the semantics behind low-frequency keywords and thereby reduce the impact of the synonym problem.
The Word2Vec model (Hu et al., 2018) is a shallow-neural-network model used for word embedding. The model is trained on a large text corpus and aims to create a unique vector for each word in a high-dimensional space. Word2Vec applies two architectures to create word vector representations: continuous bag-of-words (CBOW) and skip-gram. The CBOW architecture uses the surrounding context words to predict the current word, while the skip-gram architecture predicts the surrounding words in a fixed-size window around the current word. Because the skip-gram model works better for uncommon words, the skip-gram architecture is trained and applied in this work. A corpus of 14 geoscience reports (about 1.4 million words) was used to train the model, and the vector of each word is computed from the trained model.
The built-in Word2Vec calculation treats each text unit as a vertex of the graph and calculates the similarity distribution between text units based on the edges between the vertices. Finally, in the integration of TextRank and Word2Vec, semantic similarity is used to calculate the edge values between words: the semantic similarity computed from the Word2Vec distance replaces the uniform weights of Equation 1 by resetting the weight w_ji.
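Resetting an edge weight from word-vector similarity can be sketched as below. In practice the vectors would come from a skip-gram Word2Vec model trained on the geoscience corpus (e.g., via gensim); the two-dimensional hand-made vectors here are illustrative stand-ins, and cosine similarity is assumed as the similarity measure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Illustrative vectors: two rock terms close together, one unrelated term.
vectors = {"gneiss": [0.9, 0.1], "schist": [0.8, 0.2], "report": [0.1, 0.9]}

# Reset the edge weight w_ji with the semantic similarity of the word pair.
w = cosine(vectors["gneiss"], vectors["schist"])
```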

Calculation of Expected Node Scores
After identifying errors in the nodes of graph G, the next step is to calculate the expected score for each node. There are three possibilities for the nodes in the graph: accurate scores and two types of inaccurate scores (scores lower than deserved and scores higher than deserved). To obtain the expected scores in graph G, one of the above-described approaches (henceforth referred to as metric μ) must be chosen and computed for each node. Graph G is then split into three mutually disjoint collections: G_low, G_mid, and G_hig.
A lower bound λ and an upper bound u are defined from the standard deviation of the metric values: nodes whose metric value falls below the lower bound are assigned to G_low, and nodes whose metric value exceeds the upper bound are assigned to G_hig.
The nodes in G_low, which have relatively low metric values, are expected to have lower assigned scores than the other nodes. S_mid denotes the set of node scores in G_mid. T_i denotes the expected score of a node in G_low and is defined as shown in Equation 4. Analogously, nodes from G_hig are expected to have higher scores than other nodes; the expected score T_i for every node V_i in G_hig is defined as shown in Equation 5. Because terms/phrases in G_mid are in the middle range of metric values, their importance is unlikely to increase or decrease; their expected scores are therefore left empty (set to Ø). Table 2 shows the three subsets obtained from graph G. For brevity, only the subsets obtained using the TextRank algorithm for KE and Word2Vec for error detection are shown. It can be observed that some general terms from Table 2, such as weifen, jiechu, and linjin, obtain high TextRank scores, while their Word2Vec values indicate that they should rank lower. By contrast, domain-specific terms such as pianmayan and shiyingyan, which are actual keywords, obtained lower TF-IDF and TextRank scores than they deserved. It is worth noting that this metric (in this case, Word2Vec) can be replaced by another metric, such as clustered values. More importantly, a supervised approach can be used to obtain the ranking target, assuming that the output is a probability score and a training data set is available.
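The three-way split of graph nodes by metric μ can be sketched as follows. Because the bounds λ and u are stated only loosely in the text, this sketch assumes one plausible reading, the mean plus or minus one standard deviation of the metric values; the metric values themselves are illustrative toy numbers.

```python
import statistics

def split_nodes(metric):
    """Split nodes into (G_low, G_mid, G_hig) by metric value, assuming
    bounds of mean - std and mean + std (one reading of the paper's rule)."""
    mean = statistics.mean(metric.values())
    std = statistics.pstdev(metric.values())
    low = {v for v, m in metric.items() if m < mean - std}   # G_low
    hig = {v for v, m in metric.items() if m > mean + std}   # G_hig
    mid = set(metric) - low - hig                            # G_mid
    return low, mid, hig

# Toy Word2Vec-based metric values: the general term scores low, the
# domain-specific keyword scores high.
metric = {"weifen": 0.1, "jiechu": 0.5, "linjin": 0.5, "pianmayan": 0.9}
low, mid, hig = split_nodes(metric)
```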

Error Feedback
Once the expected scores in the graph have been obtained, the next step is to use the error-feedback strategy to update the edge weights iteratively. For convenience, A_i denotes S(V_i), and T_i denotes the expected score of V_i.
For further clarity, we swap indices i and j in the TextRank score formula (Equation 1) so that w_ij represents the edge weights in the numerator of the sum:

S(V_j) = (1 − d) + d · Σ_{V_i ∈ In(V_j)} [w_ij / Σ_{V_k ∈ Out(V_i)} w_ik] · S(V_i)

For convenience, we use ŵ_ij to denote the normalized weight of w_ij:

ŵ_ij = w_ij / Σ_{V_k ∈ Out(V_i)} w_ik

Many loss functions can be used in the error-feedback phase, including the L_1 loss, the L_2 loss, and the log loss. Motivated by Figueroa et al. (2017), we select the L_2 loss and the cross-entropy loss as the loss functions.
The L_2 loss is defined as follows:

E_1 = (1/2) · Σ_{V_j ∈ G} (T_j − A_j)²

where the factor of 1/2 is used to cancel the exponent after differentiating.
The cross-entropy loss is defined as

E_2 = −Σ_{V_j ∈ G} T_j · log A_j

In this stage, the goal is to minimize the error function. To reach this goal, gradient descent (Rumelhart et al., 1985) is used. During gradient descent, the modification value Δŵ_ij = −η · ∂E/∂ŵ_ij is iteratively computed and applied to update each normalized edge weight.
Here, η denotes the learning rate, which controls the rate of gradient descent and usually ranges from 0 to 1.
The chain rule is used to compute the derivative as follows:

∂E/∂ŵ_ij = (∂E/∂A_j) · (∂A_j/∂ŵ_ij), with ∂A_j/∂ŵ_ij = d · A_i

Following the approach of Figueroa et al. (2017), when E_1 is selected as the loss function, we obtain

∂E_1/∂A_j = −(T_j − A_j) = −δ_j

The next step is to calculate the corresponding derivative for E_2. When minimizing the cross-entropy loss, there are two possible cases for ∂E_2/∂A_j. When V_j has an expected score, we compute it as follows:

∂E_2/∂A_j = −T_j / A_j

10.1029/2020EA001602
When V_j does not have an expected score (T_j = Ø), its term contributes nothing to the loss, so ∂E_2/∂A_j = 0. We then obtain the final modification value:

Δŵ_ij = η · δ_j · d · A_i

where δ_j denotes the difference between the current node score A_j and the expected node score T_j.

Table 3
The Final List of Keywords Extracted From the Sample Text, With the TextRank and GKEEP Scores
Finally, the proposed algorithm converges when the difference between the current node scores A j and the expected node scores T j falls below a threshold ε (set to 0.01 in our experiments).
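One pass of the error-feedback update can be sketched as below. This is a sketch under stated assumptions: it uses the L_2-loss form of the update, Δŵ_ij = η · δ_j · d · A_i, skips nodes without an expected score, and uses toy weights, scores, and expected values that are not from the paper.

```python
def feedback_step(weights, scores, expected, eta=0.1, d=0.85):
    """Apply one error-feedback pass: for each directed edge (i, j) whose
    target node j has an expected score T_j, move the normalized weight by
    eta * (T_j - A_j) * d * A_i."""
    updated = dict(weights)
    for (i, j), w in weights.items():
        t = expected.get(j)
        if t is None:           # node without an expected score: leave edge as-is
            continue
        delta_j = t - scores[j]  # difference between expected and current score
        updated[(i, j)] = w + eta * delta_j * d * scores[i]
    return updated

weights = {("a", "b"): 0.5, ("c", "b"): 0.5}
scores = {"a": 1.0, "b": 0.6, "c": 1.0}
expected = {"b": 1.2}            # b was scored lower than it deserves
new_w = feedback_step(weights, scores, expected)
```

In the full algorithm this step alternates with rescoring (Equation 1) until the difference between current and expected scores falls below ε.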

Candidate KE, Scoring, and Ranking
When the error-feedback method converges, the nodes are given new, more accurate scores. The keywords are then ranked in descending order of their scores and selected. The number of keywords can be chosen freely (e.g., 3, 5, or 7) or as a percentage of the total number of words in the text. Although there are many strategies for combining a keyword's scores into a final score, in GKEEP the final score for a candidate keyword is simply its new, updated, and more accurate score. The extracted candidate keywords are then sorted according to their final scores, and the N best keywords are returned as the ultimate keywords. Table 3 illustrates the keyword list after executing the entire GKEEP algorithm. For each keyword, the table shows the original TextRank score, the final GKEEP score, and the direction in which the original score was adjusted. As can be seen from this example, the original ranking (TextRank score) and the error-detection method (Word2Vec) play irreplaceable roles in the final ranking obtained by applying backpropagation to the graph.

Computational Complexity
The complexity of the first step of GKEEP is O(log₂ V), where V is the number of words in the corpus. In the second step, the complexity is O(max(V², WL)), where V denotes the number of vertices of the graph, W denotes the window size, and L denotes the number of words in the input reports. In the last step of GKEEP, in which the initial set of candidates is acquired from the input text, the complexity is O(L).
Considering all of the above analysis, the overall computational complexity of GKEEP is O(max(V², WL, L)), which in practice translates to O(V²) in real-world applications because the other terms are smaller than V².

Data Sets and Evaluation Metrics
In the experiments, two data sets were constructed and used. The first collects the abstracts and keywords of geoscience journal papers published from 2012 to 2017, obtained from the China National Knowledge Infrastructure (http://cnki.net/). This data set contains 5,500 abstracts and is denoted GEOA.
The second data set used in this study is a collection of 17 regional geological reports from the National Geological Archives of China (http://www.ngac.org.cn/). These reports include valuable and representative thematic information on stratigraphic distribution, lithology, tectonics, minerals, rocks, and geological attitudes of various types. They were selected because they are representative: (1) they record regional geological information written by different authors in different years; (2) they come from different regions; and (3) they reflect the complexity of the field, with texts ranging from simple to complex. We denote this data set RGDS.
The data sets were divided into ten folds, and tenfold cross-validation was used for the proposed model. Six workers with rich knowledge of both geosciences and KE read each sentence and annotated all the keywords; a voting mechanism was applied to resolve any disagreements. Each geological report includes more than 10 chapters and more than 100,000 Chinese characters. After applying these criteria, we obtained a corpus of 5,000 documents.

Table 4 Some Statistical Analysis Results for the Collections Used in Our Work
Note. Key. (%) = the average percentage of manually annotated keywords within the text. #Word = the average number of words per text. Key./txt = the average number of manually annotated keywords per text. Tot. key. = the total number of manually annotated keywords per collection. Text (%) = the average number of manually annotated keywords divided by the total word count per text. Count = the total number of texts per collection.

We conducted statistical analyses of these data collections; Table 4 reports the statistics for each collection.
The precision, recall, and F1-score are applied as performance indicators, which are calculated after lemmatizing both the extracted and the real keywords. Let EK be the set of keywords extracted from document D, and KEY be the set of manually assigned keywords; the precision, recall, and F1-score are calculated as follows:
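The standard definitions are P = |EK ∩ KEY| / |EK|, R = |EK ∩ KEY| / |KEY|, and F1 = 2PR / (P + R). A minimal sketch of these metrics follows (lemmatization of both keyword sets is assumed to have been applied beforehand):

```python
def prf1(extracted, gold):
    """Standard precision, recall, and F1-score over sets of
    (lemmatized) keywords: extracted = EK, gold = KEY."""
    ek, key = set(extracted), set(gold)
    correct = len(ek & key)          # |EK ∩ KEY|
    p = correct / len(ek) if ek else 0.0
    r = correct / len(key) if key else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```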

Baselines
To better illustrate the performance of the model, we compared it with the following generic baseline KE algorithms.

TF-IDF (Salton et al., 1974) is calculated from the term and inverse document frequencies in the given documents.
NB (naïve Bayes algorithm) (Han & Kamber, 2006) is based on the class-conditional independence assumption, which reduces the required computational cost.
SVM (support vector machine) (Joachims, 1998) uses a nonlinear mapping to transform the original data into a high-dimensional space, in which a hyperplane that properly partitions the data is sought. Hyperplanes are decision boundaries that divide the data into classes.
RF (random forest algorithm) (Breiman, 2001) is an ensemble of classification and regression trees induced from bootstrap samples of the training data.
SingleRank (Wan & Xiao, 2008) is a novel method to collaboratively perform single-document keyphrase extraction by making use of mutual influences of multiple documents within a cluster context.
TopicRank (Bougouin et al., 2013) is a graph-based keyphrase extraction method that relies on a topical representation of the document. Candidate keyphrases are clustered into topics and used as vertices in a complete graph. A graph-based ranking model is applied to assign a significance score to each topic. Keyphrases are then generated by selecting a candidate from each of the top-ranked topics.
TopicalPageRank (Sterckx et al., 2015) incorporates topical information from topic models and then computes a single PageRank for each text regardless of the number of topics in the model.
PositionRank (Florescu & Caragea, 2017) is an unsupervised graph-based model that incorporates information from all positions of a word's occurrences into a biased PageRank to score keywords that are later used to score and rank keyphrases in research papers.
MultipartiteRank (Boudin, 2018) is an unsupervised keyphrase extraction model that encodes topical information within a multipartite graph structure.
RankUp (Figueroa et al., 2017) is an unsupervised method that applies TF-IDF to identify inaccurate scores in graph-based keyphrase extraction approaches. It includes an error-feedback mechanism that iteratively applies a chain effect to all nodes.

EnhancedWord2Vec (Qiu et al., 2019) is an ontology- and enhanced-word-embedding-based methodology for automatic KE.

Table 6 The Impact of λ and u on GKEEP's Performance on the RGDS Data Set

Parameter Setting
To obtain the desired keywords, GKEEP exposes four effective parameters in the keyword-extraction process. Appropriate settings of these important parameters lead to meaningful, high-quality keywords. The four parameters are as follows: (1) the weighting factors λ and u, (2) the vector dimensionality d, (3) the context window k, and (4) the number of keywords extracted n. To find the best parameter settings, we first created a set of promising configurations based on our knowledge of the model. The best configuration was then selected and improved by varying parameter values one at a time.
In the rest of this section, we investigate the impact of GKEEP's parameters on the performance of this method. To evaluate a certain parameter, the other parameters are set to their best values in the conducted experiments.

The Weighting Factors
As mentioned in Section 3.2.1, the Stdv lower bound λ and the Stdv upper bound u, which act as word-importance indicators, are used to establish a balance between the importance of different words. We conducted a set of experiments to study how different weights affect KE performance and to find an optimal weight that improves extraction performance.
A total of 50 controlled experiments were conducted, in which the weight λ was tested with linearly separated values between 0.2 and 1.0 with a step of 0.2, and the weight u likewise ranged from 0.2 to 1.0 with a step size of 0.2. Tables 5 and 6 present the average accuracy of the experimental results on the GEOA and RGDS data sets, respectively. As shown in Table 5, increasing the weights increases the average performance. For the GEOA data set, a rational range (λ = 0.4 and u = 0.6) provides the best results (the algorithm achieved an average precision of 30.75%). On the RGDS data set, the proposed algorithm obtains its best performance when the weights are set to λ = 0.4 and u = 0.8 (an average precision of 34.82%). Our results show that raising and lowering keyword ratings after the application of error feedback can elevate the most prominent keywords to the highest positions while demoting words and phrases that are considered less important.

Figure 6. Impacts of the number of keywords extracted on both the GEOA and the RGDS data sets.
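The tuning of λ and u over a 0.2-spaced grid can be sketched as an exhaustive grid search; the `evaluate` callback below is a hypothetical stand-in for running GKEEP and measuring average precision, and the λ < u constraint reflects that λ and u are lower and upper bounds:

```python
import itertools

def grid_search(evaluate, step=0.2):
    """Score every (lambda, u) pair on a 0.2 grid with lambda < u and
    return the best pair. `evaluate` is a hypothetical stand-in for
    running GKEEP and returning average precision."""
    grid = [round(step * i, 1) for i in range(1, 6)]  # 0.2, 0.4, ..., 1.0
    best = max(
        ((lam, u) for lam, u in itertools.product(grid, grid) if lam < u),
        key=lambda pair: evaluate(*pair),
    )
    return best
```

For instance, if `evaluate` peaks at λ = 0.4 and u = 0.6 (as reported for GEOA), the search returns (0.4, 0.6).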

The Vector Dimensionality
We also conducted experiments to verify the impact of several key hyperparameters of GKEEP, that is, the vector dimensionality d, the context window k, and the number of keywords extracted n. We use P, R, and F-measure to demonstrate the impact of the various hyperparameter values on the performance of GKEEP. GKEEP_1 denotes the proposed approach with cross-entropy loss, and GKEEP_2 denotes the proposed approach with L2 loss.
The impact of vector dimensionality on the performance of GKEEP (including GKEEP_1 and GKEEP_2) is demonstrated in Figure 4. As seen in Figure 4, increasing the vector dimensionality improves keyword extraction from geoscience reports. For example, the proposed approaches achieved their least satisfactory performance with 50-dimensional vectors on the GEOA data set, and increasing this value to 300 improved the F1-measure by 5.2% and 4.2%. GKEEP achieves its best performance with a vector dimensionality of 300 on the GEOA data set and 200 on the RGDS data set. Thus, we set d = 300 and d = 200 for the GEOA and RGDS data sets, respectively.
Figure 5 presents the impact of the context window k on both the GEOA and the RGDS data sets. We observe that performance improves with k, as a larger context window captures longer-range relationships. For example, when the context window is set to 2 on the GEOA data set, GKEEP achieves its least satisfactory performance, and increasing k to 7 improved the F1-measure by 31.78% and 30.19%. However, a larger k also requires longer training times. Thus, we set k = 7 to achieve a trade-off between KE performance and training time.
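The window size governs how many word pairs are linked: as described earlier, edges of the word graph encode co-occurrence within a window of size k. A minimal sketch of windowed co-occurrence counting (illustrative only, not the authors' code):

```python
from collections import Counter

def cooccurrence_edges(tokens, k=7):
    """Count word pairs that co-occur within a sliding window of length k
    (i.e., at distance < k). The counts can serve as edge weights in the
    word graph; a larger k links more pairs."""
    edges = Counter()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + k, len(tokens))):
            if w != tokens[j]:
                edges[frozenset((w, tokens[j]))] += 1
    return edges
```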

The Number of Extracted Keywords
The experimental results in Figure 6 suggest that the number of extracted keywords is important in determining the overall performance on both the GEOA and the RGDS data sets. As seen in Figure 6, when the number of extracted keywords increases from 3 to 5, the F1-measure increases, because extracting more keywords yields more correct keywords. However, the number of correct keywords grows more slowly than the number of extracted keywords. The curves peak at 6 and 5 extracted keywords on the GEOA and RGDS data sets, respectively. Thus, we set n = 6 and n = 5 for the GEOA and RGDS data sets, respectively.

Comparison With Other Methods
We conducted a total of 14 experiments using the GEOA and RGDS data sets described in Section 4.1.1 to evaluate GKEEP's performance compared to other KE approaches. Taking the average number of manually annotated keywords in the two data sets into account, the top keyword count is set to N = 5 for each data set.
Comparing the results of GKEEP with the 12 baseline methods, the following results are obtained.

Table 7 Comparison of Different Supervised Methods With the Precision (P), Recall (R), and F-Measure (F) of the Extracted Keywords on the GEOA and RGDS Data Sets
Note. The best results are shown in bold.

Table 7 reports the precision, recall, and F-measure of the extracted keywords on the two data sets. Figure 7 presents the precision-recall diagrams of GKEEP and the six baseline graph-based approaches (TopicRank, SingleRank, TopicalPageRank, PositionRank, MultipartiteRank, and TextRank) on the two data sets.
Compared with the three supervised algorithms (NB, SVM, and RF; Table 7), the proposed method obtained the highest predictive performance in all compared cases. Among the supervised methods, SVM outperforms the naïve Bayes and random forest algorithms in all compared cases. Comparing the supervised and unsupervised approaches in Table 7 also reveals that the keywords are indeed difficult to detect.
Based on these experimental results, utilizing error feedback and Word2Vec augments the process of promoting and demoting keyword scores: it enhances the most salient terms while relegating words and phrases that are deemed less important to lower positions. The reason is that error feedback produces a chain effect that affects the terms connected to the promoted and demoted keywords, thus modifying the graph structure. Additionally, GKEEP can easily be extended to other graph-based KE algorithms and node-scoring approaches whenever an expected score can be obtained.

Discussion
As shown in Tables 9 and 10, we detail the results extracted by the proposed GKEEP model and the baselines. We compare our model with TextRank in particular because our model is based on that algorithm. In Tables 9 and 10, correctly extracted keywords are marked with purple highlights and incorrectly extracted keywords with blue highlights. From these tables, we can draw the following conclusions.
(1) All approaches extract some incorrect keywords (marked with blue highlights), which further shows that KE from geoscience texts is a challenging task. The two sample documents used in this comparison are reproduced below.

The first sample document: "The Southern Great Xing'an Range (SGXR), located in the eastern segment of the Central Asian Orogenic Belt, is an important Pb-Zn-Ag-Cu-Mo metallogenic belt in northern China. It was influenced by the subduction and closure of the Paleo-Asian Ocean to the north in the Paleozoic, the subduction of the Mongol-Okhotsk Ocean to the northwest, and the subduction of the Paleo-Pacific Ocean to the east in the Mesozoic. However, some issues, such as the magma source, tectonic setting, and metallogenetic materials, remain controversial. In this paper, the typical Baiyinnuoer Zn-Pb-Ag polymetallic deposits, the Bairendaba Zn-Pb-Ag polymetallic deposit, the Weilasituo Zn-Cu polymetallic deposits, and the Bianjiadyuan Pb-Zn-Ag polymetallic deposits were studied. We also carried out detailed petrological, whole-rock geochemical, laser ablation-inductively coupled plasma-mass spectrometry (LA-ICP-MS) zircon U-Pb geochronological, and Sr-Nd-Pb isotopic analyses of the Late Paleozoic-Mesozoic magmatic rocks in the SGXR, which are also associated with the lead-zinc mineralization, with the aim of understanding the genetic relationship between the magmatism and mineralization and clarifying the tectonic evolution of the SGXR."

The second sample document: "This paper focuses on the association between geological spatial entity objects and their external description texts; that is, associating text data can realize a new paradigm of geological data application, 'graphics-text mutual query.' Although the extraction of named entities from geological texts has been carried out, research on the extracted entities is lacking. Based on the information-service demands of geological big data, this paper applies a representation-learning model and studies in depth the semantic similarity calculation between text data and spatial data in the geological field. Consequently, a prototype system with practical functions was constructed to provide a new method for geological data integration and a new paradigm for geological data extraction and application."

(2) The baseline approach TextRank extracts few and sometimes incorrect keywords, while the proposed GKEEP is able to extract richer, more comprehensive keywords. We attribute this to the integrated backpropagation mechanism. As shown in Table 9, some keywords (such as "Magmatism" and "Mesozoic") are interrelated; thus, the GKEEP model can learn the relevance of these keywords and feed it back to extract useful keywords that provide more information about the given document. This further demonstrates the validity of our GKEEP model.
Traditional graph-based approaches use distributional semantics learned from word co-occurrences in the corpus context, and they rely on a large, high-quality corpus. Our proposed semantic approach, GKEEP, on the other hand, combines word embeddings with background knowledge maintained in the geoscience domain. For instance, Breccia, Basal Conglomerate, and Syngenetic Conglomerate are all kinds of Conglomerate; therefore, these words can be regarded as the same semantic unit. We found that our proposed approach had an advantage over previous graph-based approaches. This finding is likely because in traditional graph-based extraction, common words score higher simply because they have more edges attached to them, which is a disadvantage of the approach; similarly, rare or uncommon words score lower because they have fewer connections. Our approach combines this semantic concept with graph-based keyword extraction because backpropagation can be applied to other types of networks: if any node score is found to be incorrect, all edge weights of the graph are modified according to the score and the error, making the final result more accurate.
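Merging near-synonymous terms into one semantic unit relies on embedding similarity; a minimal sketch using cosine similarity over toy 3-dimensional vectors (the vectors below are fabricated for illustration only, not learned embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Fabricated 3-d vectors: the conglomerate-related terms point roughly
# the same way, so they can be merged into one semantic unit.
breccia = [0.9, 0.1, 0.0]
conglomerate = [0.8, 0.2, 0.1]
granite = [0.0, 0.1, 0.9]
cosine(breccia, conglomerate) > cosine(breccia, granite)  # → True
```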
Traditional graph-based methods assign higher scores to terms that are more common or more frequently used, since more edges point to them. Furthermore, our results show that after the application of error feedback, the increase and decrease of keyword scores can elevate the most prominent keywords to the highest positions while demoting words and phrases that are considered less important. Error feedback also has a knock-on effect, affecting terms connected to the promoted and demoted keywords and thus modifying the structure of the entire graph. Finally, GKEEP can easily be adapted to other graph-based keyword extraction algorithms and node-scoring methods whenever the expected score of a node is known. This research has several limitations. The enhanced graph-based method is limited by the vagueness of semantics and by corpus size: existing semantic units might not share the same meanings, which directly affects the feedback and modification process, so the granularity of semantic units is difficult to determine. As shown in Figure 7, different numbers of keywords result in different performances on the GEOA and RGDS data sets; words ranked below the threshold are ignored in the analysis. Future work could attempt to design a self-adapting method for selecting the number of keywords.

Conclusions and Future Works
The keywords of a text consist of a concise set of high-quality, meaningful words drawn from the main content of that text. KE from geoscience texts is a critical and challenging task. Previous studies, especially graph-based approaches, suffer from a potential drawback of frequency-based analysis: they compute ranking scores from frequency, which may assign higher scores to more common but less relevant terms, while some less common but more relevant terms obtain lower scores. In this paper, we proposed a KE model called GKEEP, which comprises three stages: preprocessing, enhanced graph-based representation, and keyword ranking and output. The preprocessing phase involves noise removal, stop-word removal, and tokenization. The enhanced graph-based representation involves error detection and error feedback: at the error-detection step, the Word2Vec model is used to detect errors in the node weights of the graph, and this error feedback is then used to update the edge weights iteratively. Finally, keywords are extracted using the ranking scores. The proposed GKEEP model was compared with seven existing graph-based models, and the experimental results show that GKEEP significantly outperforms them. To the best of our knowledge, no other work has attempted to employ word embeddings to aid the TextRank algorithm.
In the future, new weighting methods may be proposed that can be used in graph-based methods to find high-quality, meaningful keywords, and more semantically based approaches to KE can be explored using the proposed model to obtain better-quality results.