Extracting the evolutionary backbone of scientific domains: The semantic main path network analysis approach based on citation context analysis

Main path analysis is a popular method for extracting the scientific backbone from the citation network of a research domain. Existing approaches ignored the semantic relationships between the citing and cited publications, resulting in several adverse issues in terms of the coherence of main paths and the coverage of significant studies. This paper advocated the semantic main path network analysis approach to alleviate these issues based on citation function analysis. A wide variety of SciBERT-based deep learning models were designed for identifying citation functions. Semantic citation networks were built by either including important citations, for example, extension, motivation, usage and similarity, or excluding incidental citations like background and future work. The semantic main path network was built by merging the top-K main paths extracted from various time slices of the semantic citation network. In addition, a three-way framework was proposed for the quantitative evaluation of main path analysis results. Both qualitative and quantitative analysis on three research areas of computational linguistics demonstrated that, compared to semantics-agnostic counterparts, different types of semantic main path networks provide complementary views of scientific knowledge flows. Combining them together, we obtained a more precise and comprehensive picture of domain evolution and uncovered more coherent development pathways between scientific ideas.


| INTRODUCTION
There were many methods to extract the evolutionary pathways between scientific ideas based on citation network analysis, such as algorithmic historiography (Garfield et al., 2003) and scientific historiograms (Lucio-Arias & Leydesdorff, 2008). Recently, main path analysis (MPA), originally proposed in Hummon and Doreian (1989), has become popular for extracting the major knowledge diffusion paths among the main ideas advancing an analyzed scientific domain, since Batagelj (2003) proposed the efficient search path counting algorithms to weight citation edges and Verspagen (2007) laid out the algorithmic foundations for main path extraction.
Most MPA methods were citation semantics-agnostic, that is, they ignored the semantic relationships between publications. A direct consequence is semantically incoherent main paths. Figure 1 illustrates a potential cause of this problem: inappropriate search path counts (SPC). In the top-right schematic image, the citation edges (A, B) and (B, C) are both background citations ("Neutral") while the citation edge (A, C) is an extension citation ("Extends"). Ignoring citation function, we have SPC(A, B) ≥ SPC(A, C) because the former is the sum of the number of paths through A → B → C, which is equal to SPC(A, C), and the number of paths through A → B → X (X ≠ C). So traditional MPA approaches will select (A, B), but it is more reasonable to include the extension citation (A, C). Some studies adjusted citation weight by, for example, considering citation preferences according to discipline and publication time (Yu & Pan, 2021) or scaling search path count using the citing publication's prestige (Yu & Sheng, 2021). However, the problem was not solved. For example, if B is highly cited, then Yu and Pan's approach will still choose (A, B) in main path exploration. Some weighting schemes used measures of similarity between the abstracts of citing and cited publications (Chen et al., 2022; Huang et al., 2022; Liu et al., 2014). However, such (indirectly inferred) similarity measures are likely to be less precise than authors' own (directly stated) rationales to cite, that is, the citation function (Iqbal et al., 2021; Kunnath et al., 2022; Lyu et al., 2021).
Theoretically, traditional MPA approaches also tend to prefer long local paths. Figure 1 illustrates this case.
The left-most image shows a vanilla (semantics-agnostic) main path network (MPN). The longest local path from A00-2018 to D07-1096 is very stretched: distance (A00-2018, D07-1096) = 16. It is questionable whether knowledge indeed flows along such long paths with many unimportant citations such as "Neutral." The middle image shows a snapshot of the semantic main path network (semantic MPN) extracted by considering extension ("Ext") and motivation ("Mot") citations. The path becomes more compact: distance (A00-2018, D07-1096) is decreased to 5. For another example, by further considering usage ("Use") and similarity ("Sim") citations, the longest distance from W96-0213 to W05-0516 is reduced from 17 to 5.
To the best of our knowledge, this is the first paper which marries citation function classification to MPA. We proposed a systematic approach to semantic main path network analysis (Section 4) based on citation function classification (Section 3), which addresses both issues raised above. Multiple semantic citation networks were built using different citation functions, for which multiple semantic main path networks were extracted, assuming that different semantic networks capture different types of knowledge flows between different knowledge entities, such as ideational basis, methodological extension, tool usage, and similarity in problem or methodology. We conjecture that different semantic main path networks will collectively provide a more comprehensive representation of an analyzed domain. Note that there were also some recent studies relying on citation importance classification (Ghosal et al., 2022; Hassan et al., 2018). Essentially, these approaches weighted citation edges by 1 (important) or 0 (incidental) and screened out unimportant citations, but did no further processing for knowledge flow analysis. The current paper is methodologically different. Citation function classification provides us with more flexible ways to perform MPA. The superiority of the proposed approach was qualitatively justified using two case studies (Section 5). In Section 6, this paper proposed a three-way quantitative evaluation framework. To the best of our knowledge, this is the first study about quantitative evaluation of MPA results. Experiments showed that extracting and merging multiple semantic main path networks achieved better (topical) coverage, (topical) coherence and (ranking) pertinence (Section 6).

| Topological approaches of main path analysis
According to Verspagen (2007), MPA has two steps: citation weighting and main path extraction. Refer to Liu et al. (2019, 2020) for discussions of best practices for each step. Citation weighting is traditionally based on each edge's traversal count in the search paths between a set of origin nodes and target nodes in a (usually reversed) citation network. We call them topological approaches. The ground-breaking work of Hummon and Doreian (1989) defined three measures: Node Pair Projection Count (NPPC), Search Path Link Count (SPLC), and Search Path Node Pair (SPNP). SPLC is predominantly used today. Batagelj (2003) proposed an efficient unified algorithm based on "standardizing" citation networks (summarized in Table 1), and proposed a fourth measure, Search Path Count (SPC). For each citation edge (u, v) in a standardized citation network, the citation weight is equal to the number of paths from the pseudo-source to u multiplied by the number of paths from v to the pseudo-sink. As citation networks are mostly acyclic, the calculation is done iteratively based on topological sort. Kuan (2020) empirically discussed the choices of these weighting variants. Several adjustments exist. Liu and Kuan (2016) proposed to decay search paths by length, in the belief that knowledge diffusion has higher information loss along long paths, while Yu and Sheng (2021) used citing papers' citation influence for adjustment.
FIGURE 1 Motivations for semantic main path network analysis.
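As a minimal sketch of the SPC weighting described above (not the authors' implementation), the following Python code computes, for each edge (u, v) of an acyclic network, the number of paths from a pseudo-source to u multiplied by the number of paths from v to a pseudo-sink, visiting nodes in topological order. The function name `search_path_counts` and the toy edge list, which follows the A, B, C, X schematic discussed in the Introduction, are illustrative assumptions.

```python
from collections import defaultdict, deque

def search_path_counts(edges):
    """Return {(u, v): SPC weight} for an acyclic network given as an edge list."""
    succ, pred = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))

    # Kahn's algorithm yields a topological order of the nodes.
    indegree = {n: len(pred[n]) for n in nodes}
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in succ[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    if len(order) != len(nodes):
        raise ValueError("the network contains a cycle; SPC assumes acyclicity")

    # Paths from the pseudo-source: sources count 1, others sum over predecessors.
    from_source = {}
    for n in order:
        from_source[n] = sum(from_source[p] for p in pred[n]) if pred[n] else 1
    # Paths to the pseudo-sink: sinks count 1, others sum over successors.
    to_sink = {}
    for n in reversed(order):
        to_sink[n] = sum(to_sink[s] for s in succ[n]) if succ[n] else 1

    return {(u, v): from_source[u] * to_sink[v] for u, v in edges}

# Toy network: A -> B, B -> C, A -> C, plus one extra continuation B -> X.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("B", "X")]
spc = search_path_counts(edges)
print(spc[("A", "B")], spc[("A", "C")])  # 2 1
```

On this toy network the edge (A, B) receives a higher weight than (A, C) purely because B has more continuations, which is the effect the semantics-agnostic weighting is criticized for.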
Typically, main path extraction starts from certain chosen startpoints and greedily follows the highest-weighted citation edges. Verspagen (2007) enumerated paths from the source(s) with the maximal outgoing edge weight as startpoint(s), so the main paths were called forward local main paths (Liu & Lu, 2012). Batagelj (2003) also tried the longest path as the global main path. Liu and Lu (2012) defined two new types of local main paths. The backward local main path starts from sinks and represents the significant knowledge flow from the past to the most recent studies. They also found that these methods often miss the most significant citation edges, called key-routes, so they proposed a fourth alternative, the key-route main path, which searches forward and backward simultaneously from key-routes. To increase the comprehensiveness of the extracted main paths, Liu and Lu (2012) heuristically selected the top-K startpoints or key-routes and merged the main paths extracted from them. Recently, Chen et al. (2022) proposed a more efficient dynamic programming algorithm for exhaustive main path extraction.
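The difference between forward local search and key-route search can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure of any cited package: edge weights are given in a dictionary with made-up values, a single key-route (the highest-weighted edge) is used, ties are broken arbitrarily, and the network is assumed to be acyclic. All function names are hypothetical.

```python
def build_adjacency(weights):
    """Index weighted edges by tail (successors) and by head (predecessors)."""
    succ, pred = {}, {}
    for (u, v), w in weights.items():
        succ.setdefault(u, []).append((w, v))
        pred.setdefault(v, []).append((w, u))
    return succ, pred

def greedy_walk(start, neighbours):
    """Repeatedly follow the highest-weighted edge until none is left."""
    path = [start]
    while path[-1] in neighbours:
        _, nxt = max(neighbours[path[-1]])
        path.append(nxt)
    return path

def forward_local_main_path(weights, start):
    succ, _ = build_adjacency(weights)
    return greedy_walk(start, succ)

def key_route_main_path(weights):
    succ, pred = build_adjacency(weights)
    u, v = max(weights, key=weights.get)   # the key-route
    backward = greedy_walk(u, pred)        # from u back to a source
    forward = greedy_walk(v, succ)         # from v on to a sink
    return backward[::-1] + forward

# Hypothetical weights: the heaviest edge (B, D) does not lie on the greedy
# forward path from S, because (S, A) outweighs (S, B) at the first step.
weights = {
    ("S", "A"): 3, ("S", "B"): 2,
    ("A", "C"): 1, ("B", "D"): 5,
    ("C", "E"): 1, ("D", "E"): 4,
}
forward_path = forward_local_main_path(weights, "S")
key_route_path = key_route_main_path(weights)
print(forward_path)    # ['S', 'A', 'C', 'E']
print(key_route_path)  # ['S', 'B', 'D', 'E']
```

The forward search misses the most significant edge, whereas the key-route search is guaranteed to include it by construction.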
2.2 | Semantic approaches in main path analysis
Liu et al. (2014) pioneered the use of (expert-assigned) citation relevancy to adjust traversal count-based citation weighting. Of course, it can be replaced by any semantic relatedness measure. For instance, Huang et al. (2022) claimed that using the weighted sum of the textual and structural similarities between cited and citing publications leads to better convergence, that is, different slices of main path correspond well to different phases of domain development. Topic modeling is another popular semantic approach. Kim et al. (2022) used Latent Dirichlet Allocation (LDA) to analyze topic diffusion along main paths. The Citation Influence Model, an extended LDA model which also models the generation process of each citing publication's citation mixture (Dietz et al., 2007), has also been used to measure citation weights by topic similarity. Chen et al. (2022) calculated the cosine similarity between citing and cited articles' topic distributions obtained by Latent Semantic Indexing (Deerwester et al., 1990). Notably, the citation relevancy of (u, v) is the sum of the pair-wise similarities between v and all other nodes u′ on the current path toward v. While this treatment theoretically ensured the topical coherence of the main path, it looks more straightforward to extract main paths from topic subnetworks and merge them. Community detection could be seen as an alternative way of finding topic subnetworks (Kim & Shin, 2018; Yu & Pan, 2021). To the best of our knowledge, citation function classification (Kunnath et al., 2022; Lyu et al., 2021) has never been applied to main path analysis before.

| Dataset and annotation schemes
We created a large citation function dataset by merging and reannotating six existing datasets in the computational linguistics domain: Teufel2010 (Teufel, 2010; Teufel et al., 2006a), Dong2011 (Dong & Schäfer, 2011), Jha2016 (Jha et al., 2017), Alvarez2017 (Hernández-Alvarez et al., 2017), Jurgens2018 (Jurgens et al., 2018), and Su2019 (Su et al., 2019). The source papers were crawled from ACL anthology. Different annotation guidelines were adopted, so all citation contexts were reannotated according to Teufel et al.'s 12-class annotation scheme (Teufel et al., 2006b) plus a "Future" class about future work. Reannotation is detailed in Supplementary Section B.1. Some minority classes were still small, so we merged "PModi" with "PBas" into "Basis," and reannotated "CoCo-" into "CoCoGM" or "CoCoRes." This resulted in our own 11-class annotation scheme, which was also mapped to 7-class and 6-class schemes by category merging. Table 2 shows the statistics of our dataset Jiang2022.

TABLE 1 Search path counting methods for main path analysis.
Method | Origins | Targets | Citation network standardization (add a pseudo-source s* and a pseudo-sink t*)
SPLC | All | Sinks | Connect s* (resp. t*) to all nodes (resp. sinks)
SPNP | All | All | Connect s* and t* to all nodes
SPC | Sources (zero-indegree) | Sinks | Connect s* (resp. t*) to all origins (resp. sinks)

| Citation function classification models
To recognize citation functions more accurately, a series of deep learning models were developed. SciBERT (Beltagy et al., 2019) was used to encode citation context, currently fixed to 2 and 3 sentences to each side of the citation sentence (citance). Three types of features were generated from the SciBERT-encoded context: (a) the citation representation h, from the citation segment (represented by a pseudo-word "CITSEG"), (b) the citance representation s, pooled by the citance encoder from the citation sentence, and (c) the context representation c, pooled by the context encoder from the whole context. The final feature vector f was the concatenation of the three: f = [h; s; c]. Citation representation is mandatory because different citations in the same citance should have different feature representations, but citance and context representations were optional. We tested two types of citation contexts. In a sequential context, no "[SEP]" (sequence separator) was inserted to separate context sentences. In this case, citance and context representations were directly pooled from citance tokens and context tokens respectively. Two options of citance encoder were tested: max-pooling and self-attention (Munkhdalai et al., 2016). In a hierarchical context, "[SEP]" symbols were inserted after each context sentence. Sentence representations were pooled using a sentence pooler, for which "[SEP]" was used as the third option in addition to max-pooling and self-attention, and the context representation was pooled indirectly from the representations of all context sentences. There were in total 34 model variants. Due to the large GPU time required for training, we selected a subset of 11 relatively promising variants, shown in Table 3, based on initial experiments of all model variants with the 11-class scheme. Section 4.1 will discuss how to pick the appropriate models to perform semantic MPA based on per-class performance analysis of different models.
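To make the feature construction f = [h; s; c] concrete, here is a minimal sketch of the sequential-context variant with max-pooling. It is an assumption-laden illustration: the token vectors are toy 2-dimensional placeholders rather than SciBERT outputs, the self-attention and hierarchical variants are not shown, and the helper names `max_pool` and `build_feature` are hypothetical.

```python
def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]

def build_feature(token_vectors, citseg_index, citance_span):
    """Concatenate citation (h), citance (s) and context (c) representations."""
    h = list(token_vectors[citseg_index])    # vector of the "CITSEG" token
    start, end = citance_span                # citance tokens occupy [start, end)
    s = max_pool(token_vectors[start:end])   # pooled over citance tokens only
    c = max_pool(token_vectors)              # pooled over all context tokens
    return h + s + c                         # f = [h; s; c]

# Toy 4-token context; token 2 plays the role of "CITSEG" and tokens 1-2
# form the citance. The values are placeholders, not model outputs.
token_vectors = [[0.1, 0.9], [0.5, 0.2], [0.7, 0.4], [0.3, 0.8]]
f = build_feature(token_vectors, citseg_index=2, citance_span=(1, 3))
print(f)  # [0.7, 0.4, 0.7, 0.4, 0.7, 0.9]
```

Because h is taken at the position of the citation segment, two citations in the same citance share s and c but still receive different feature vectors.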

| Model selection: Precision or recall
Per-class performance analysis showed that no single best model could beat others on all citation functions or on all annotation schemes (Tables S1-S3). Therefore, we needed to choose the most appropriate model as a binary classifier for each specific citation function. The most pertinent citation function for MPA should be extension ("Basis"/"Extends") of cited work, and motivation ("Motivation") by previous studies. Figures 2 and 3 show the performances of these two classes' top models. The darker the color, the higher the performance. Although the best extension model was model 4 (seed = 5,171, "seed =" omitted hereafter) with the 6-class scheme, its recall was less competitive. Considering the small size of the extension class, for example, only 4.33% in our dataset, we decided to slightly weigh recall over precision (recall-oriented) and F1. The final choice had a good F1 and the highest recall, that is, model 11 (47,353, in solid red rectangle) trained with the 6-class scheme. Taking a similar recall-oriented approach, we chose model 7 (32,491) trained with the 6-class scheme as the "best" motivation model.
We hoped that semantic citation networks could capture as many important citations as possible such as usage according to Valenzuela et al. (2015) and similarity according to Lu et al. (2014). For usage citations, we also took a recall-oriented approach. According to Figure 4, we opted for model 7 (13,249) trained with the 11-class scheme which achieved the highest F1, because the recall of the chosen model was already high enough and its precision was much higher than other candidates. To further enrich the semantic citation network, we decided to add similarity citations because Teufel's annotation guidelines say similarity is between problems and solutions rather than results (Teufel, 2010). According to Figure 5, the selected model was model 11 (25,603) trained with the 11-class scheme.
The other way is to delete unimportant citations, for example, neutral citations ("Neutral"/"Background") or future work citations ("Future") in our case. Due to the dominant size of neutral citations and high performance on this class (Figure 6), we decided to trade recall for precision (precision-oriented) for neutral ("Neutral"/"Background"), so model 2 (5,171) with the 7-class scheme was selected. Because both precision and recall were high for future work citations (Figure 7), it was acceptable to adhere to the precision-oriented approach and select model 8 (32,941) with the 11-class scheme because it achieved high enough precision and the best F1.
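One way to picture the recall-oriented and precision-oriented choices is as a two-stage rule: keep only candidates whose F1 is good enough, then pick the one with the highest recall (or precision). The sketch below is only an approximation of the manual selection described in this section; the F1 threshold, the scores, and the model names are all hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def select_model(scores, prefer, min_f1):
    """Pick the model with the best preferred metric among those with F1 >= min_f1.

    scores maps a model name to a (precision, recall) pair for one class.
    """
    if prefer not in ("precision", "recall"):
        raise ValueError("prefer must be 'precision' or 'recall'")
    eligible = {m: pr for m, pr in scores.items() if f1_score(*pr) >= min_f1}
    if not eligible:
        raise ValueError("no model reaches the required F1")
    index = 0 if prefer == "precision" else 1
    return max(eligible, key=lambda m: eligible[m][index])

# Made-up per-class scores for three candidate models.
scores = {
    "model_a": (0.80, 0.60),
    "model_b": (0.70, 0.72),
    "model_c": (0.50, 0.75),
}
recall_choice = select_model(scores, prefer="recall", min_f1=0.65)
precision_choice = select_model(scores, prefer="precision", min_f1=0.65)
print(recall_choice, precision_choice)  # model_b model_a
```

Here model_c has the highest recall but is excluded because its F1 falls below the threshold, which mirrors the idea of preferring recall only among models with a good F1.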

| Citation network building
Starting from an empty citation network, a citation edge was added between a pair of publications if there existed at least one in-text citation about extension or motivation (add_Ext_Mot) using the "best" extension or motivation models selected in the recall-oriented approach in Section 3.1. Taking the same recall-oriented approach, more citation edges were added if there existed at least one usage citation (plus_add_Use), and the semantic citation network was further expanded with similarity citations (plus_add_Sim). On the other hand, we also built the fourth semantic citation network by deleting unimportant in-text citations from the original citation network. For each pair of publications, if all in-text citations between them were neutral or future work citations, the citation edge was removed from the citation network (del_Bkg_Fut).
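The four network-building rules can be sketched as set operations over classified in-text citations. This is a minimal illustration, assuming each publication pair maps to the list of predicted functions of its in-text citations; the short labels, the paper identifiers, and the function names are hypothetical stand-ins for the classifier outputs.

```python
def build_by_adding(in_text, functions):
    """Keep a pair if at least one in-text citation has a wanted function."""
    return {pair for pair, labels in in_text.items() if functions & set(labels)}

def build_by_deleting(in_text, incidental):
    """Drop a pair only if all of its in-text citations are incidental."""
    return {pair for pair, labels in in_text.items()
            if not set(labels) <= incidental}

# (citing, cited) -> predicted functions of the in-text citations between them.
in_text = {
    ("P2", "P1"): ["Bkg", "Ext"],
    ("P3", "P1"): ["Bkg"],
    ("P3", "P2"): ["Use"],
    ("P4", "P2"): ["Sim", "Bkg"],
    ("P4", "P3"): ["Fut", "Bkg"],
    ("P5", "P4"): ["CoCo"],
}

add_ext_mot = build_by_adding(in_text, {"Ext", "Mot"})
plus_add_use = build_by_adding(in_text, {"Ext", "Mot", "Use"})
plus_add_sim = build_by_adding(in_text, {"Ext", "Mot", "Use", "Sim"})
del_bkg_fut = build_by_deleting(in_text, {"Bkg", "Fut"})

print(sorted(add_ext_mot))  # [('P2', 'P1')]
print(sorted(del_bkg_fut))
```

Note that the deletion-based network keeps the pair ("P5", "P4"), whose only citation is a comparison, while none of the addition-based networks do; the two strategies therefore yield different edge sets rather than nested ones.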
FIGURE 3 Performances of selected models for motivation citations.
FIGURE 4 Performances of selected models for usage citations.
FIGURE 5 Performances of selected models for similarity citations.
FIGURE 6 Performances of selected models for neutral/background citations.

| Main path network extraction
The semantic citation networks we analyzed have many small strongly connected components (SCC), so we applied the Simple Search Path Count approach (Jiang et al., 2020), an extension of SPC to deal with cyclic citation networks, for MPN extraction. Their JMPA package (Java package for MPA) was used for implementation. Following Jiang et al. (2020), we segmented the network under analysis into several time slices, extracted top-K (K = 10) key-route main paths from each slice, and merged them into an MPN. More details are given in Supplementary Section B.2.
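The merging step can be pictured as a union of the nodes and edges of all extracted paths. The sketch below assumes the main paths of each time slice have already been extracted and are given as node lists; how slices are delimited is left out, and the function name, slice labels, and paper identifiers are hypothetical.

```python
def merge_main_paths(paths_by_slice):
    """Union the nodes and edges of all main paths into one main path network."""
    nodes, edges = set(), set()
    for paths in paths_by_slice.values():
        for path in paths:
            nodes.update(path)
            edges.update(zip(path, path[1:]))  # consecutive nodes form edges
    return nodes, edges

# Two made-up slices; the edge ("p1", "p2") appears in both and is kept once.
paths_by_slice = {
    "slice_1": [["p1", "p2", "p3"]],
    "slice_2": [["p1", "p2", "p4"], ["p5", "p4", "p6"]],
}
mpn_nodes, mpn_edges = merge_main_paths(paths_by_slice)
print(len(mpn_nodes), len(mpn_edges))  # 6 5
```

Because the result is a set of edges rather than a single chain, the merged structure is a network with branches, which is why it is called a main path network.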

| QUALITATIVE ANALYSIS
For experimental analysis, citation data came from the 2015 version of ACL anthology network (AAN; Radev et al., 2013) about computational linguistics/natural language. Three areas were selected: natural language parsing (AANPar), automatic document summarization (AANSum), and machine translation (AANMT). Due to space limit, this section showcases AANPar and AANSum to demonstrate the superiority of semantic MPA.
A00-2018) or in another name log-linear model (P04-1014), conditional random fields (N03-1028), and max-margin parsing (W04-3201, P05-1012). Note that C00-1011 and P00-1009 were two papers on data-oriented parsing (DOP) promoted by Rens Bod, which however ceased in the wave of statistical parsing dominated by other proposals presented above. Early studies about dependency analysis blossomed into the huge Branch 5 and became the dominant trend since around 2005, further expedited by two important shared tasks W06-2920 and D07-1096, which then diverged into Branch 6 about dependency parsing of morphologically rich languages and Branch 7 about cross-lingual dependency parsing. An issue was that many main path papers were connected by incidental citations. For instance, the citation from A00-2018 said that C00-1011 "stays behind the scores of" the former, a weak citation about performance comparison. For another instance, H91-1037 received only 10 citations in our dataset. SPC (H91-1037, J93-2004) was high only because of the high-impact citing paper J93-2004 (1,006 citations), although the citation was incidental.
FIGURE 7 Performances of best models for future work citations.

| Semantic main path network: Add extension and motivation citations
The above observations motivated us to exploit the semantic relationships between papers in MPA. Figures 9-12 show the semantic MPNs extracted from the four semantic citation networks induced from AANPar, namely AANPar_add_Ext_Mot, AANPar_plus_add_Use, AANPar_plus_add_Sim, and AANPar_del_Bkg_Fut. Interesting chemical reactions occurred when MPA met citation function classification. Each semantic MPN revealed some novel branches or new papers. They collectively drew a more comprehensive picture of domain development. Supplementary Section D presents selected citation context excerpts to help readers understand the citation functions marked on certain edges.
On AANPar_add_Ext_Mot (Figure 9 and Tables 5 and S6 for a complete list of main path papers), the early development of parsing technology was tested. Branch 2 is a new branch about old parsers such as shift-reduce parsing, left-corner parsing, tabular parsing, and left-to-right (LR) parsing. Similarly, we saw another (isolated) early development of probabilistic approaches (Branch 3; details in Table S6). In addition to A00-2018 as the source of the statistical parsing mainstream, a third source started from E85-1024 ("A probabilistic parser") to J94-2001 ("Tagging English Text with a Probabilistic Model") and W96-0213, then through P02-1034 into the new Branch 4 about multiple parse ranking and re-ranking. Note that Branch 5 went into a "dead" end about "Chinese TreeBank" (W00-1201).
From the right part of Figure 9, we saw a branch of DOP papers published by Rens Bod until P01-1010. Similar to the evolution pathway in Figure 8, it was gradually merged into the dominant dependency parsing branch. D08-1059 ("A Tale of Two Parsers: Investigating and Combining Graph-based and Transition-based Dependency Parsing") was motivated (denoted by "Mot" on the edge) by two papers P07-1050 ("K-best Spanning Tree Parsing") and D07-1013 ("Characterizing the Errors of Data-Driven Dependency Parsing Models").
Note that there was a potentially problematic Branch 8 about machine translation (MT) using dependency parsing. Concerning (P05-1012, H05-1066), the citation context excerpt below reveals that although "improving upon" may indicate an extension, the whole context may be recognized as "Similar" or "CoCoGM." This shows that multilabel classification might be a promising future direction to explore (Lauscher et al., 2022).
"We mentioned above that our approach appears to be similar to that of reranking for statistical parsing (Collins, 2000; Charniak and Johnson, 2005). While it is true that we are improving upon the output of the automatic parser, we are not considering multiple alternate parses."
FIGURE 8 Main path network extracted from AANPar.
Vague cases exist, such as (W00-1201, C02-1126), a self-citation by D. M. Bikel and D. Chiang. From the citation context excerpt below, expressions like "starting from" and "we have modified" might have been selected as strong signals for the extension class ("Ext").
"The third experiment was on the Chinese Treebank, starting with the same head rules used in (Bikel and Chiang, 2000). These rules were originally …, and although we have modified them for parsing, …"
5.1.3 | Semantic main path network: Further add usage and similarity citations
By further adding usage citations, that is, on AANPar_plus_add_Use, we saw drastically richer diversity in the development branches (Figure 10, Tables 6 and S7). Again, statistical parsing techniques evolved from multiple intelligent sources (Branches 1-3). A clear notion of "corpus-based" parsing emerged (Branch 1). Branch 2 was motivated by H93-1047 ("Automatic Grammar Induction And Parsing Free Text: A Transformation-Based Approach," a duplicate of P93-1035) and developed into "shallow parsing" of words into "text chunks." This time, the seminal paper J93-2004 about the Penn Treebank project emerged in Branch 3 and developed through W96-0213 to J04-4004. Most subsequent papers used Penn Treebank for development and evaluation. We also saw
TABLE 4 Representative main path papers extracted from AANPar.
(Brill, 1995) and Collins' parser (Collins, 1999) were used to obtain parse trees for the English side of the corpus."
The DOP branch led by Rens Bod "developed" through C00-1011 into Branch 4 and found the important shared task W05-0620 on semantic role labeling (SRL) of predicate arguments, and "vanished." This is understandable because SRL became a rather standalone area since then and began to cite less and be less cited by parsing papers. In addition, the branch about cross-lingual dependency parsing embraced a more diverse set of papers.
which was heavily cited (387 times). The following citation context excerpt illustrates that similarity citation is indeed relevant to the knowledge flow of scientific ideas.
"The maximum entropy models used here are similar in form to those in (Ratnaparkhi, 1996; Berger, Della Pietra, and Della Pietra, 1996; Lau, Rosenfeld, and Roukos, 1993)."
The domain then evolved to the dominant dependency parsing branch (Branch 3), where we were excited to see two new shared tasks about joint syntactic and semantic dependency parsing (W08-2121, W09-1201), and then to Branch 4 of subsequent studies on semantic dependency parsing (W09-1208, D09-1004).

| Semantic main path network: Delete neutral and future work citations
Finally, on AANPar_del_Bkg_Fut (Figure 12, Tables 8 and S9), we observed some interesting branches or papers. Since P08-1068, the domain diverged into a new branch about optimization techniques used in parsing algorithms, such as dynamic programming, integer linear programming and dual decomposition (Branch 2). Branch 3 was a similar cross-lingual dependency parsing branch, but it evolved into Branch 4 about parsing morphologically rich languages through a new shared task
TABLE 5 Representative main path papers extracted from AANPar_add_Ext_Mot (columns: ACLID, Title).
Figure 8. We postulate the result is meaningful since dependency parsing was directed by important shared tasks. Note that deleting neutral and future work citations might result in weaker semantic coherence than adding more significant citations like extension and similarity (quantified in Section 6.3). For example, N07-1069 only made a result comparison with W06-2928, so we are less confident that scientific ideas flowed through this path.
"Here we can compare directly with the best systems for this dataset in CoNLL-X. The best system (Corston-Oliver & Aue, 2006), …."
In summary, we conjecture that multiple semantic MPNs extracted from different types of semantic citation networks reveal complementary views and novel knowledge flows, and thus should be merged into a more comprehensive representation of a scientific domain's topic evolution.

| Case Study 2: Automatic document summarization
Due to space limit, an informative summary is presented here (Figures 13-17). See Tables S10-S14 in Supplementary Section E for the details of main path papers and Supplementary Section F for citation context excerpts. The MPN extracted from AANSum (Figure 13) covered a few early summarization studies centering around the usage of semantic coherence devices (Branch 1), such as discourse structure, rhetorical relations, and lexical chains (W97-0703: Using Lexical Chains For Text Summarization). Then the main body of literature focused on multidocument summarization (Branch 2) pioneered by the seminal journal article J98-3005 ("Generating Natural Language Summaries From Multiple On-Line Sources"). The subsequent studies in this topic eventually gave birth to an important Special Issue on Summarization (J02-4001). Since the advent of PageRank in 1998, the graph-based ranking idea was introduced to the summarization domain for sentence ranking for extractive summarization (Branch 3). Seminal works included P04-3020 ("Graph-Based Ranking Algorithms For Sentence Extraction Applied To Text Summarization"), W04-3252 ("TextRank: Bringing Order Into Texts"),
TABLE 7 Representative main path papers extracted from AANPar_plus_add_Sim.
-2052, D14-196). Notably, comparison (sometimes weakness) function was the dominating citation function in Branch 4 in Figure 13. In addition, the only paper about summarization evaluation was N03-1020 about ROUGE ("Automatic Evaluation Of Summaries Using N-Gram Co-Occurrence Statistics"). These two drawbacks motivated us to explore novel branches of summarization using semantic MPNs.
By adding extension and motivation citations (Figure 14), we could see a larger early branch about the usage of rhetorical structure and found a seminal application in scientific summarization (J02-4002), which was extended by subsequent studies in other areas, like W03-0505 ("Summarising Legal Texts: Sentential Tense And Argumentative Roles"), evidenced by the citation context excerpt below.
"Our methodology builds and extends the Teufel and Moens (Teufel and Moens, 2002) approach to automatic summarization."
In addition to the common topics like multidocument summarization (Branch 2) and graph-based ranking algorithms (Branch 5), we were also excited to see Branch 3 about automatic evaluation and related studies. Heavily cited ones included N03-1020 and W04-1013 about the ROUGE package. We also saw more studies about sentence reduction, compression and fusion for summarization. Both Branch 4-1 and 4-2 were pioneered by K. R. McKeown in A00-1043 ("Sentence Reduction For Automatic Text Summarization"), A00-2024 ("Cut and Paste Based Text Summarization"), and J05-3002 ("Sentence Fusion For Multidocument News Summarization").
By further adding usage citations (Figure 15), although we lost the graph-based ranking branch (even though we gained a new paper W04-3247 about LexPageRank), we could uncover more novel topics and branches. Branch 2 about automatic evaluation included more important papers such as N04-1019 about the Pyramid method ("Evaluating Content Selection In Summarization: The Pyramid Method"). A significant new branch was Branch 3 about scientific summarization at the bottom right, starting from the seminal paper J02-4002 to citation function classification (W06-1613, N07-1040) and citation-based summarization (C08-1087, N09-1066, P10-1057, and C10-1101). By further adding similarity citations (Figure 16), we could see one obvious expansion of Branch 1 about evaluation, starting from factoid analysis (W04-3254) to summarization evaluation without human models, including D09-1032 ("Automatically Evaluating Content Selection in Summarization without Human Models") and C10-2022 ("Multilingual Summarization Evaluation without Human Models"), both written by famous researchers in this domain (A. Nenkova and H. Saggion respectively).
Finally, the MPN extracted from AANSum_del_Bkg_Fut (Figure 17) recovered the vanished or shrunk branches about multidocument summarization (Branch 1) and graph-based ranking (Branch 2), and at the same time introduced some new papers, such as C04-1129 for Branch 1 ("Syntactic Simplification For Improving Content Selection In Multi-Document Summarization"), P08-1048 for Branch 2 ("Summarizing Emails with Conversational Cohesion and Subjectivity," whose abstract says "Second, we use two graph-based summarization approaches, …, to extract sentences as summaries."), and W09-1802 ("A Scalable Global Model for Summarization," whose abstract says "We present an Integer Linear Program for … for automatic summarization.") and C10-2105 ("Opinion Summarization with Integer Linear Programming Formulation for Sentence Extraction and Ordering") for Branch 3 about optimization methods for summarization.
Again, by gradually adding more citation semantics, the semantic MPNs together proved to be more expressive than the semantics-agnostic counterpart.

| QUANTITATIVE ANALYSIS
Few studies have addressed the quantitative evaluation of MPA. Filippin (2021) claimed that it is questionable whether a main path is representative of the real technological trajectory because, based on domain experts' opinions, a main path may be "limited to a much narrower neighborhood of the technology space than it really is" and may miss many crucial studies and big players of the analyzed field. Huang et al. (2022) claimed to have achieved better convergence, which was only qualitatively justified. This situation motivated us to propose a three-way framework for quantitative MPA evaluation. The first drawback pointed out by Filippin implies that a good main path should cover the scientific topics of the analyzed domain well. It should also include as many critical studies as possible; we name this aspect the pertinence of a main path. Furthermore, according to Huang et al., nearby main path nodes should exhibit a certain level of local clustering and show higher topical coherence. Our framework evaluated all three aspects: coverage, pertinence and coherence.

FIGURE 17 Main path network extracted from AANSum_del_Bkg_Fut.

| Topic modeling
Coverage and coherence were both defined based on topic modeling, here LDA (Blei et al., 2003) trained using the Gensim package (see note 11). Each article u in the citation network, denoted as CN, was represented by its topic distribution $u = [u_1, \ldots, u_t, \ldots, u_T]$, where $T$ is the number of topics, $u_t$ is the probability of article u belonging to topic t, and $\sum_{t=1}^{T} u_t = 1$. Two issues arose: the right value of T and the right number of training epochs P (to avoid overfitting during LDA training). Supplementary Section G details how to decide these values. In summary, we trained several LDA models with a range of values of T for evaluation and reported the average. For AANPar, T values fell in {10, 11, …, 20, 22, 24, 26}. For AANSum and AANMT, the maximum value of T was set to 20. P was set to 50, 40, and 50 for AANPar, AANSum, and AANMT, respectively.

| Topical coverage
Let MN denote an extracted MPN. Topical coverage measures how well MN covers the topics of the analyzed domain. It is approximated by the closeness between the topic distribution of MN, denoted as $dist_{tpk}(MN)$, and the topic distribution of CN, denoted as $dist_{tpk}(CN)$, both of which are averaged over the enclosed publications. In evaluation, we used Hellinger distance to measure topical coverage, defined below:

$$Coverage(MN) = H\big(dist_{tpk}(MN),\, dist_{tpk}(CN)\big)$$

where the Hellinger distance between two vectors u and v is defined as

$$H(u, v) = \frac{1}{\sqrt{2}} \sqrt{\sum_{t=1}^{T} \left(\sqrt{u_t} - \sqrt{v_t}\right)^2}$$

The smaller the Hellinger distance is, the better topical coverage is in our sense.

Table 9 shows the results. Each "Δ%" column shows the difference of the corresponding semantic MPN from the vanilla MPN in percentage format. Thus, a positive percentage (a larger distance) means a decrease in topical coverage and a negative percentage means an increase. The upward and downward arrows signify a further increase and decrease from the semantic MPN in the column to the left. On all three datasets, compared to the semantics-agnostic counterpart (the "MPN" column), topical coverage decreased (signified by upward arrows) by adding extension and motivation citations (the "add_Ext_Mot" column), but adding usage relations led to improved topical coverage (signified by downward arrows in the "plus_add_Use" column). This is reasonable because publications linked with extension and motivation citations are technically closer. In contrast, usage can be about a variety of different things, from algorithms and methods to data and definitions, and thus results in main paths that are topically more diverse.

Two composite semantic MPNs were extracted: "add_Combined" corresponds to the composite semantic MPN which merged the three semantic MPNs corresponding to "add_Ext_Mot," "plus_add_Use" and "plus_add_Sim"; "del_Combined" corresponds to the composite semantic MPN which further merged the semantic MPN corresponding to "del_Bkg_Fut."
The results showed that different types of semantic MPNs complemented each other and collectively worked better, that is, they covered and approximated the topic distribution of the underlying domain much better. Meanwhile, we acknowledge that the better coverage was partly due to the composite semantic MPNs being larger in size (also see Table 11).
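The coverage measure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`hellinger`, `mean_distribution`, `topical_coverage`, `delta_percent`) are hypothetical, the Hellinger distance uses the standard 1/√2 normalization, and "Δ%" is assumed to be the relative change of the semantic MPN's distance with respect to the vanilla MPN's distance.

```python
import math


def hellinger(u, v):
    """Hellinger distance between two discrete probability distributions.

    Uses the standard normalization, so the result lies in [0, 1].
    """
    if len(u) != len(v):
        raise ValueError("distributions must have the same number of topics")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(u, v))
    return math.sqrt(squared) / math.sqrt(2)


def mean_distribution(topic_dists):
    """Average per-article topic distributions component-wise."""
    n = len(topic_dists)
    if n == 0:
        raise ValueError("need at least one topic distribution")
    return [sum(column) / n for column in zip(*topic_dists)]


def topical_coverage(mn_dists, cn_dists):
    """Distance between the averaged topic distributions of MN and CN.

    Smaller values mean that the main path network covers the topics
    of the citation network more closely.
    """
    return hellinger(mean_distribution(mn_dists), mean_distribution(cn_dists))


def delta_percent(semantic_value, vanilla_value):
    """Relative change in percent (an assumed reading of the "Δ%" columns)."""
    return 100.0 * (semantic_value - vanilla_value) / vanilla_value
```

Under these assumptions, a negative `delta_percent` for a semantic MPN means a smaller Hellinger distance than the vanilla MPN, that is, better topical coverage.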

| Topical coherence
A perfect definition of coherence does not exist. We analyzed coherence by adapting the coherence definition originally proposed to evaluate topic model quality (Newman et al., 2010, p. 102). Given a main path network MN, we defined topical coherence as the mean of distances between all pairs of main path nodes, where D(u, v) is the distance between the topic distributions of u and v. Again, D was the Hellinger distance defined for topical coverage.

TABLE 9 Topical coverage of main path networks.

Table 10 shows the results of topical coherence evaluation. From the "Evaluate on MPN" rows, again, we observed that adding usage citations (the "plus_add_Use" column) led to worse topical coherence compared to using extension and motivation citations (the "add_Ext_Mot" column). This corroborates the evaluation results of topical coverage: adding usage citations may introduce more diversified topics, which increases topical coverage at the expense of topical coherence. In contrast, adding similarity citations (the "plus_add_Sim" column) improved topical coherence. This may be because similarity in research goal or methodology often occurs between topically closer studies. On all three datasets, better topical coherence was consistently obtained (i.e., with a negative Δ% value) except on "plus_add_Use," which demonstrated that semantic MPNs may exhibit better semantic coherence than the semantics-agnostic counterpart. For comparison purposes, the lower half of the table shows the results evaluated on CN[MN], the citation subnetwork induced from MN, which contains a few more unimportant citations. As anticipated, topical coherence was worse on CN[MN]. This conforms to our initial conjecture that semantically important citations may help improve semantic coherence.
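As a rough illustration of the coherence measure, the sketch below averages Hellinger distances over pairs of main path nodes. It is a sketch under stated assumptions rather than the authors' code: the function names and the toy paper identifiers are hypothetical, and the optional `pairs` argument (for example, a list of citation edges) is an addition of this illustration; by default all unordered node pairs are used, following the verbal definition in the text.

```python
import math
from itertools import combinations


def hellinger(u, v):
    """Hellinger distance with the standard 1/sqrt(2) normalization."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(u, v))
    return math.sqrt(squared) / math.sqrt(2)


def topical_coherence(topic_dists, pairs=None):
    """Mean Hellinger distance over pairs of main path nodes.

    topic_dists: dict mapping a paper id to its topic distribution.
    pairs: optional iterable of (u, v) id pairs; when omitted, all
           unordered pairs of nodes are used.
    Smaller values indicate topically more coherent networks.
    """
    if pairs is None:
        pairs = combinations(sorted(topic_dists), 2)
    distances = [hellinger(topic_dists[u], topic_dists[v]) for u, v in pairs]
    if not distances:
        raise ValueError("need at least one pair of nodes")
    return sum(distances) / len(distances)


# Toy input with hypothetical paper ids and two topics.
toy = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
all_pairs_score = topical_coherence(toy)
single_pair_score = topical_coherence(toy, pairs=[("a", "b")])
```

Restricting the average to an explicit pair list is one way to compare two networks that share the same nodes but differ in their citation links.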

| Ranking pertinence
Ranking pertinence measures whether an extracted MPN effectively and efficiently represents the significant studies of a research field. To approximate expert evaluation, we built three gold standard sets following the approach of Jiang et al. (2019). The three gold standard sets, named GS-Par, GS-Sum and GS-MT, contain 99, 204, and 197 papers, respectively (see note 12). Note that some gold standards were not recoverable by the way we built citation networks (refer to Supplementary Section B about experimental setup), so evaluation was based on the total number of gold standards recoverable from the citation network. For GS-Par, GS-Sum and GS-MT, the sizes of recoverable gold standards were 78, 151, and 176, respectively.
Taking an MPN as an unranked set of papers, pertinence could be evaluated using classical information retrieval evaluation measures. Table 11 summarizes the results, where V represents the MPN size, GS represents the number of matched gold standard papers, and $\widehat{GS}$ represents the maximal number of gold standards in the corresponding citation network or semantic citation network, followed by precision, recall and F1 score. We observed that, although a single semantic MPN might not return more matches, the composite semantic MPNs achieved much better ranking performance. Comparing the "add_Combined" and "del_Combined" rows against the "MPN" row, the recalls of the former more than doubled on AANPar and AANSum, and gained more than 65% in relative terms on AANMT. Recall that it is extremely important that MPA detects as many crucial studies as possible. At the same time, F1 scores were also largely improved except on AANMT_add_Combined. In addition, from the last three rows, we saw that the "add_Combined" and "del_Bkg_Fut" results also complemented each other. The most extreme case was on AANMT: the sum of recalls of "add_Combined" and "del_Bkg_Fut" was only slightly larger than the recall of "del_Combined," implying that they returned drastically different subsets of gold standards. This justifies our claim that semantic MPNs may exhibit higher diversity to complement each other, and that it would be better to merge them for a more comprehensive view. Finally, the recalls and F1 scores on all three datasets corroborate the findings of Filippin (2021) about MPA's unsatisfactory recognition rate of the most significant studies. Although semantic MPA improved ranking pertinence by a large margin, there still seemed to be considerable room to improve recall. To achieve this, we conjecture that it may be helpful to start and guide main path exploration by first ranking and selecting important publications in some way (Bae et al., 2014; Zhang et al., 2014; Tao et al., 2017; Ding et al., 2022).

TABLE 11 Evaluation results of pertinence of main path networks.
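The set-based pertinence measures can be illustrated with a short Python sketch. It assumes, as described above, that only gold standard papers recoverable from the citation network count toward recall; the function name `pertinence`, its parameters and the paper identifiers in the usage example are hypothetical.

```python
def pertinence(mpn_papers, gold_standard, network_papers):
    """Precision, recall and F1 of an MPN treated as an unranked paper set.

    mpn_papers: ids of the papers in the main path network (size V).
    gold_standard: ids of the expert-selected significant papers.
    network_papers: ids of all papers in the (semantic) citation network;
        gold papers outside this set are not recoverable and are ignored.
    """
    mpn = set(mpn_papers)
    recoverable_gold = set(gold_standard) & set(network_papers)
    matched = mpn & recoverable_gold

    precision = len(matched) / len(mpn) if mpn else 0.0
    recall = len(matched) / len(recoverable_gold) if recoverable_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0

    return {
        "V": len(mpn),
        "matched": len(matched),
        "recoverable_gold": len(recoverable_gold),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical ids: four MPN papers, five gold papers, one of which
# ("g5") does not appear in the citation network.
scores = pertinence(
    mpn_papers=["p1", "p2", "p3", "p4"],
    gold_standard=["p1", "p2", "g3", "g4", "g5"],
    network_papers=["p1", "p2", "p3", "p4", "g3", "g4"],
)
```

Because the MPN is treated as a set, merging several MPNs can only keep or raise the number of matched gold papers, while precision depends on how many non-gold papers the merge brings in.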

| CONCLUSIONS
This paper advocated a novel semantic main path network analysis approach for extracting the scientific backbone from a citation network based on citation function analysis. First, according to per-class performance analysis, the best models for extension, motivation, usage, similarity, neutral (equiv. background) and future work citations were cherry-picked from 55 contextualized citation function classification models trained from 11 model architectures based on SciBERT. Then, four types of semantic citation networks were created by gradually adding extension and motivation citations, usage citations, and similarity citations in a recall-oriented fashion, and by deleting neutral and future work citations in a precision-oriented way. On each semantic citation network, a semantic main path network was extracted by merging the top-K key-route main paths extracted from different time slices of the network. Meanwhile, for the first time, this paper performed quantitative main path analysis evaluation by proposing a three-way framework consisting of topical coverage, topical coherence and ranking pertinence. The effectiveness of semantic main path network analysis was demonstrated on three computational linguistics fields, namely natural language parsing, automatic text summarization and machine translation. Qualitative analysis showed that each semantic main path network was able to reveal novel topic branches, new important papers of existing branches, and the development pathways between papers and branches, and thus provided complementary views of domain evolution.
For example, for large domains such as natural language parsing that were guided by a few seminal studies (like Penn Treebank) and ground-breaking shared tasks, the semantic main path networks were much better at finding these representative works, such as the two early shared tasks on (multilingual) dependency parsing and many later shared tasks on a plethora of topics including semantic dependency parsing, semantic role labeling and dependency parsing of morphologically rich languages, most of which were missed by traditional main path analysis. For automatic text summarization, the semantic main path network approach was able to find an important novel branch about summarization evaluation and the branch about optimization methods for summarization, and at the same time enriched the multidocument summarization, graph-based ranking and sentence fusion/compression branches that were recognized by the traditional approach.
Merging multiple semantic main path networks resulted in significantly better topical coverage. When main path analysis is seen as a method to return an unordered set of top-ranked studies, the composite semantic main path networks achieved much better ranking pertinence based on expert-selected gold standards, and thus proved to be more comprehensive representations of scientific development. In addition, extension, motivation and similarity citations achieved better semantic coherence on all three datasets than traditional approaches which ignore citation semantics, but adding usage citations may introduce topical diversity, which resulted in lower coherence but higher coverage. In the extracted semantic main path networks, most recognized citation relations were more relevant to uncovering the knowledge flow among scientific ideas. On the contrary, in the traditional approach, many main path papers were connected via incidental citations such as neutral citations. Therefore, we conclude that the semantic main path network analysis approach can discover more pertinent topic branches, uncover more coherent knowledge flows, and provide a more comprehensive scientific domain representation.

(2022) that most citation instances' functions could be determined using the citance alone.
5 When f = h, depending on context_type, the number of model variants is 2. When f = [h; s], the number of model variants is: 2 (context_type = "sequential") + 2 × 3 (context_type = "hierarchical") = 8. When f = [h; c], if context_type = "sequential", the model variant number is 2; otherwise, if context_type = "hierarchical", it is 3 × 2 = 6 (3 sentence poolers by 2 context encoders). When f = [h; s; c], if context_type = "sequential", the model variant number is 2 × 2 (2 citance encoders multiplied by 2 context encoders) = 4; otherwise, if context_type = "hierarchical", there are 2 × 3 × 2 = 12 model variants (2 citance encoders by 3 sentence poolers by 2 context encoders). Therefore, there are in total 2 + 8 + (2 + 6) + (4 + 12) = 34 model variants.
6 https://github.com/xiaoruijiang/JMPA
7 Parsing: Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. See Wikipedia page: https://en.wikipedia.org/wiki/Parsing.
8 Note that more grammars were proposed even earlier, outside our time range of analysis.
9 From Wikipedia, shallow parsing is also called chunking or light parsing: https://en.wikipedia.org/wiki/Shallow_parsing
10 Both semantic role labeling and dependency parsing became rather standalone topics, each with dedicated monographs.
11 https://radimrehurek.com/gensim