Research conducted along the lines of the Cascading Citations Analysis Project (C-CAP, http://www.ccapnet.org), funded by the Research Committees of ATEI and the University of Macedonia, Thessaloniki, Greece
We report on the utilization of the cascading citations indexing framework (C2IF) for the identification of similarities among items (in this case research articles) in a bibliographic database. More specifically, the problem of chaining forward from a given focal article is addressed by considering the direct as well as the indirect citations that target the article in question. From the population of articles that cite the given article directly, those associated with a larger number of higher-level C2IF constructs are found to be more similar to it. The findings also appear to be of value for the mirror image problem of chaining backward from the focal article to a population of referenced articles. Cited publications for which the focal article represents/hosts a larger number of higher-level C2IF constructs are likely to be more similar to it.
As a testbed, sixty (60) highly cited computer science research articles are considered together with their associated bibliographic links over a six-year period (1999-2005) in the Science Citation Index Expanded (SCIE) data. The dataset has been made available by Thomson Scientific for conducting research along the lines of the Cascading Citations Analysis Project (C-CAP). Similarity values are calculated by considering author-supplied as well as automatically generated keywords registered in the SCIE dataset. The purpose of this research is to develop a strategy that will improve the effectiveness of retrieval in digital libraries that incorporate bibliographic citations.
Bibliographic data include the published item itself, the references cited by it (i.e. the bibliography of resources used), and the citations subsequently made to the publication in question, i.e. all published items that include the given item among their cited references. In this respect, citations data are seen to populate a citations graph whereby nodes represent published items and directed edges represent references. The latter originate from the citing item and target the cited reference item. Citations data are usually utilized for ranking the impact of published items, as well as that of the hosting journals; the larger the number of citations received, the higher is the impact of the published item. Another, equally if not more important, use of citations data is in implementing certain bibliographic search strategies. Researchers forward- and backward- chain along the citation links in order to track relevant publications. Content relevance is the goal when it comes to assigning keywords to each publication. Research article indexing today becomes an issue of utmost importance when one considers: (a) the internet based scholarly communication environment, (b) the open access movement, and (c) the globalization in academia and research. Such developments underline the importance the accessibility factor has on published works. When a digital object is not indexed properly it cannot be retrieved and it is missed, even if it is exactly on the subject desired (Bates, 2002).
Keyword- and descriptor-based retrieval together comprise the dominant mode of commercial bibliographic database searching today. Keywords are assigned to articles either by the authors themselves, by professional indexers, or they are machine-generated by considering selected sections of the scientific article's textual content (e.g. its title, abstract, cited references, etc.). Still, it was as early as 1961 when Dr. Eugene Garfield, the inventor of the citation index, first thought critically about the possibility of using an article's cited references as index terms, rather than using machines to automatically assign traditional subject keywords (ISI, 1999). Later, M.L. Pao compared and contrasted the two searching modes, referring to keyword- and/or descriptor- based retrieval by the name of semantic retrieval and to citation based retrieval by the name of pragmatic retrieval (Pao, 1993).
Regarding the drawbacks of semantic retrieval, M.L. Pao is in agreement with S.P. Harter's explanation for the low recall values observed, namely that poor recall is due to the existence of many reasonable candidate search elements and search terms for any given topic (Pao, 1993; Harter, 1990). Alternatively and with equal plausibility, the problem can be attributed to the absence of authority control for the keywords involved, as well as to the fact that keywords and descriptors are subject to linguistic change or obsolescence (ISI, 1999).
As for pragmatic retrieval, its main weakness is that the user needs to initiate the search with a relevant seed article (Larsen and Ingwersen, 2002). B. Larsen and P. Ingwersen attempt to rectify this drawback by introducing their boomerang effect approach, whereby a set of top-ranked articles is identified by conducting an initial keyword-based search. Next, the frequently occurring citations are extracted and used to compile a query that is processed against the citation index. Still, their approach is reported to require further improvement in order to outperform semantic retrieval (Larsen and Ingwersen, 2006).
The present paper focuses on the forward chaining problem and considers the semantic similarity of a focal article to each one of the articles that include it in their cited references lists. Both author-supplied and automatically generated keywords are used. It is shown that article-to-article similarity relates to information present in the cascading citations indexing framework (C2IF). The latter extends the citations paradigm by introducing two new concepts: the indirect citation and the chord (Dervos et al., 2006). An indirect citation is defined to be an instance in which a target (cited) article receives a citation not by being a cited reference of the source (citing) article, but by being a cited reference of a third article which is in turn a cited reference of the source article. Indirect citations may be considered up to any desired depth/level. Direct (depth = 1) citations are code-named 1-gen, the next level (depth = 2) citations are code-named 2-gen, and the depth = 3 ones are code-named 3-gen citations. A 2-chord instance is defined to be one in which a direct (1-gen) and an indirect (2-gen) citation both involve the same source (citing) and target (cited) articles. Similarly, a 3-chord instance implies the co-existence of one 1-gen and one 3-gen citation involving the same source and target articles. The approach has a lot in common with the one in (Rousseau, 1987). The difference is that, instead of calculating measures that reflect the influence that directly and/or indirectly referenced publications have had on the development of a given article, emphasis is placed on incorporating C2IF data as such in the retrieval paradigm.
As in earlier studies (Dervos and Kalkanis, 2005; Dervos et al., 2006), indirect citations are considered up to depth = 3. When the source and target items of a 2-gen (or 3-gen) citation are identical, the latter is said to comprise a cycle. Cycles do exist and they are excluded from the calculations that follow.
In the Research Aim and Methodology section the approach applied in this project is presented. The citations dataset and its preparation are outlined in Testbed and Data Preparation. The results obtained are presented and discussed in the Results section. In Conclusion and Future Work, issues requiring further research are identified.
Research Aim and Methodology
One valid assumption to be made is that keywords are indicative of the subject/research area(s) covered by the article they index. By ‘keyword’ it is also meant a keyphrase, namely: a composition of terms that identifies a research field and/or discipline area, e.g. Database Management Systems. This can be generalized to a (any) collection of articles: it is assumed that the keywords used to index its members, plus their frequency distribution values are indicative of the subject/research area(s) covered by the collection, as a whole.
Having made these two assumptions, one may proceed and, in the context of C2IF, ask the following question: considering a focal article next to those articles that include it in their cited references lists, how similar is the former to the subject area(s) covered by the latter? Questions of this type are worth exploring since, for example, a low similarity score indicates that the focal article is of interdisciplinary value. In this respect, it makes sense to calculate the similarity of a typical focal article to its 1-gen citations population, and compare it to the calculated similarity of the same focal article to its 2-gen citations population. Furthermore, one may also consider sub-populations of the focal article's 1-gen citations, like (a) the 1-gen citations that channel (i.e. lie along the path of) higher order (i.e. 2-gen or 3-gen) citations, and (b) the 1-gen citations that represent 2- or 3-chord instances targeting the focal article.
As an example, let the citations graph shown in Figure 1 be considered: the 2→1 (1-gen) citation targeting the focal article 1 is seen to also channel one 2-gen (4→2→1) and one 3-gen (5→4→2→1) citations that target 1, plus it represents one 2-chord (2→3→1), and one 3-chord (2→6→7→1) instance, both of which have article 1 as their target. Quite analogously, the 3→1 (1-gen) citation is seen to channel one 2-gen (2→3→1) and one 3-gen (4→2→3→1) citation, the 7→1 (1-gen) citation channels one 2-gen (6→7→1) and one 3-gen (2→6→7→1) citation, whereas the 5→1 (1-gen) citation does not channel any indirect citation, but it is a host to one 3-chord instance (5→4→2→1) that targets article 1. Indirect citations and chords of level (depth) 4 or higher are ignored, since in the present research indirect citations and chords are considered only to depth = 3.
More formally, consider a given focal article A. By forward chaining, the set D of articles which include A in their own cited references lists is identified; equivalently, D denotes A's 1-gen citations. In a similar way, let I2 denote A's 2-gen citations, let I3 denote A's 3-gen citations, let C2 denote the set of articles that host 2-chord instances that target A, and let C3 denote the set of articles that host 3-chord instances that target A. By definition, C2 = D ∩ I2 and C3 = D ∩ I3 always hold true.
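The definitions above can be checked on a small example. The following Python sketch is illustrative only: the edge list is an assumption, reconstructed from the citations that the text attributes to Figure 1, and the helper names are hypothetical. It enumerates the citation paths of depth 1 to 3 that target the focal article, skips any path in which an article would appear twice, and derives D, I2, I3, C2 and C3 from them.

```python
from collections import defaultdict

# (citing, cited) pairs; an assumed reading of Figure 1 based on the prose.
EDGES = [(2, 1), (3, 1), (5, 1), (7, 1), (4, 2), (2, 3), (5, 4), (2, 6), (6, 7)]


def citing_index(edges):
    """Map each cited article to the set of articles that cite it."""
    index = defaultdict(set)
    for citing, cited in edges:
        index[cited].add(citing)
    return index


def citation_paths(index, target, depth):
    """Return all paths (source, ..., target) with exactly `depth` edges.

    A path is dropped as soon as an article would appear in it twice,
    which excludes cycles.
    """
    paths = [(target,)]
    for _ in range(depth):
        paths = [
            (src,) + path
            for path in paths
            for src in sorted(index.get(path[0], ()))
            if src not in path
        ]
    return paths


index = citing_index(EDGES)
focal = 1
gen = {d: citation_paths(index, focal, d) for d in (1, 2, 3)}

D = {p[0] for p in gen[1]}    # 1-gen (direct) citing articles
I2 = {p[0] for p in gen[2]}   # sources of 2-gen citations
I3 = {p[0] for p in gen[3]}   # sources of 3-gen citations
C2 = D & I2                   # hosts of 2-chords
C3 = D & I3                   # hosts of 3-chords

# Indirect citations channelled by each 1-gen citation y -> focal:
# the 2-gen and 3-gen paths whose last hop is y -> focal.
channelled = {
    y: sum(1 for d in (2, 3) for p in gen[d] if p[-2] == y) for y in sorted(D)
}
# Chords hosted by each member of D: indirect paths that start at that member.
hosted = {
    y: sum(1 for d in (2, 3) for p in gen[d] if p[0] == y) for y in sorted(D)
}

print(D, I2, I3, C2, C3)
print(channelled, hosted)
```

On this graph the sketch gives D = {2, 3, 5, 7}, I2 = {2, 4, 6}, I3 = {2, 4, 5}, C2 = {2} and C3 = {2, 5}, in agreement with the chords described for Figure 1.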
One may now proceed further and introduce a similarity score that utilizes the author-supplied keywords to quantify the likeness of A to each one of the five populations considered, namely: D, I2, I3, C2, and C3. In this respect, five similarity scores need to be calculated: Sim(A, D), Sim(A, I2), Sim(A, I3), Sim(A, C2), and Sim(A, C3). The five similarity scores, once calculated, can be normalized to obtain values in [0,1] as follows:
where x ∈ {D, I2, I3, C2, C3}, and S(A,x) = 0 when Sim(A,D) = Sim(A,I2) = Sim(A,I3) = Sim(A,C2) = Sim(A,C3) = 0
The Sim(A,x) scores are calculated in a way analogous to the calculation of query-to-document similarity scores by utilizing the tf*idf indexing term importance indicator (Salton, 1989; Robertson, 2004). More specifically, a number (Q) of highly cited articles in some subject/discipline area are identified. Next, each one of the Q highly cited articles is taken to comprise a focal article. There will be N = 5 * Q populations of articles that cite the given Q articles either directly or indirectly, 5 being the cardinality of the domain from which x obtains its values. For each keyword t_i of the Q highly cited articles, its idf value (i.e. its discriminatory value when used as an index term) is calculated:
where N = 5 * Q (i.e. the number of populations), and n_i is the number of populations that contain at least one article which is indexed by t_i.
Regarding the calculation of Sim(A,x), let t_i denote the keywords that index A. The five similarity scores (one for each population of citing articles considered) are calculated as follows:
where tf_ix denotes the frequency of occurrence of t_i in x.
Equations (1), (2) and (3) may be adapted to cases where the cardinality of the domain from which x obtains its values is other than 5 (as in the case of the calculations in the Results section, below).
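As a hedged illustration of the calculation described above, the following Python sketch assumes the textbook form idf_i = log(N / n_i), takes Sim(A,x) to be the sum of tf_ix * idf_i over the keywords of A, and normalizes by the sum of A's Sim scores across its populations. All three forms are assumptions (division by the maximum score, for instance, would also yield values in [0,1]), and the function names and the toy keyword data are hypothetical.

```python
import math
from collections import Counter


def idf_values(populations):
    """Assumed idf: log(N / n_i), with N the number of populations and n_i
    the number of populations holding an article indexed by keyword i."""
    n_populations = len(populations)
    containing = Counter()
    for articles in populations.values():
        # Count each keyword once per population, however often it occurs.
        containing.update({kw for keywords in articles for kw in keywords})
    return {kw: math.log(n_populations / n) for kw, n in containing.items()}


def sim(focal_keywords, articles, idf):
    """Assumed Sim(A, x): sum of tf_ix * idf_i over the keywords of A."""
    tf = Counter(kw for keywords in articles for kw in keywords)
    return sum(tf[kw] * idf.get(kw, 0.0) for kw in set(focal_keywords))


def normalized_scores(focal_id, focal_keywords, populations, idf):
    """Assumed S(A, x): Sim(A, x) divided by the sum of A's Sim scores,
    with every score set to 0 when all Sim values are 0."""
    raw = {
        label: sim(focal_keywords, articles, idf)
        for (fid, label), articles in populations.items()
        if fid == focal_id
    }
    total = sum(raw.values())
    return {label: (value / total if total else 0.0) for label, value in raw.items()}


# Hypothetical toy data: two focal articles with two populations each (N = 4).
# Each population is a list of keyword lists, one list per citing article.
populations = {
    ("A", "D"): [["xml", "indexing"], ["xml"]],
    ("A", "I2"): [["graphs"], ["xml"]],
    ("B", "D"): [["indexing"]],
    ("B", "I2"): [["graphs"]],
}
idf = idf_values(populations)
scores = normalized_scores("A", ["xml", "indexing"], populations, idf)
print(scores)
```

With this toy data every keyword occurs in two of the four populations, so the sketch yields S(A,D) = 0.75 and S(A,I2) = 0.25. Changing the number of populations per focal article only changes N, which is how the sketch accommodates a domain of x with a cardinality other than 5.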
Testbed and Data Preparation
To test the approach outlined in the previous section, six years of Science Citation Index Expanded (SCIE) citations data (1999-2005) were utilized. The dataset registered 7,364,211 articles published in 372,544 issues of 11,076 journals. Each individual article was indexed by two sets of keywords: (a) author-supplied keywords, and (b) keywords plus®, a set of machine-generated terms extracted from the titles of the article's cited references (ISI, 1999). A first set of measurements revealed an average of 2.06 keywords and 4.2 keywords plus® instances per article. The typical article was found to cite an average of 22 (within-the-set) references, and to receive an average of 4.82 citations from articles present in the given dataset.
In the SCIE dataset, each individual journal issue is associated with one or more subject categories. To obtain a subset involving computer science (CSc) articles only, a filter was applied which selected journal issues relating to CSc subject categories, with ‘Computer Science Interdisciplinary Applications’ excluded. The filter identified 9,342 journal issues (2.5% of the journal issues present in the original dataset). Interestingly enough, four of the 1-gen citations comprised cycles, namely articles citing themselves. A total of 110,151 cycle-free 1-gen citations were present in the dataset, involving 71,688 (distinct) articles. Having excluded all the cycles, as well as all the 2-gen and 3-gen citation instances whereby the same article appears more than once in the citation path, the CSc dataset was measured to involve 112,725 and 60,721 2-gen and 3-gen citations, respectively.
Next, a testbed environment was constructed by selecting the sixty (60) most highly cited articles in the CSc dataset. In the given (closed world) setup, the average highly cited article received 43 1-gen, 136 2-gen, 165 3-gen, 48 2-chord, and 45 3-chord citations. Furthermore, the 60 (highly cited) articles together with their (direct or indirect) citations comprised a population of 4,227 article identifiers, involving a total of 4,682 author-supplied keywords and 2,544 machine-generated keywords plus® instances.
Focusing on the 4,682 author-supplied keywords, it was found that the set involved a large number of synonyms (cf. Table 1), as well as typos and/or American/UK English language spelling variations (cf. Table 2).
The keyword synonyms and spelling errors/variations present in the testbed environment were identified and excluded manually. Having done so, the sizes of the author-supplied keywords and the keywords plus® lists were reduced by 18% and 16%, respectively.
Table 1. A sample of keyword synonyms present in the testbed
World Wide Web
world wide web
meshless local Petrov-Galerkin (MLPG) method
Meshless Local Petrov-Galerkin (MLPG) Methods
Meshless Local Petrov-Galerkin approach (MLPG)
meshless local Petrov-Galerkin approach (MLPG)
Table 2. A sample of spelling errors and/or American/UK English variations in the testbed
nearest neighbor searching
nearest neighbour rule
Table 3 summarizes the ten (10) article populations considered:
Table 3. Article populations used for the calculation of S(A,x)
D: Articles which cite A directly (1-gen citations)
I2: Articles which cite A indirectly, via 2-gen citations
I3: Articles which cite A indirectly, via 3-gen citations
C2: Articles from which 2-chords originate that target A
C3: Articles from which 3-chords originate that target A
P1: D members, each one of which channels at least 5 indirect citations (2- or 3-gens, irrespectively) that target A
P2: D members, each one of which channels at least 10 indirect citations (2- or 3-gens, irrespectively) that target A
P3: D members, each one of which represents at least 5 chords (2- or 3-chords, irrespectively) that target A
P4: D members, each one of which is associated with at least 10 C2IF entities (2-/3-gens, or 2-/3-chords, irrespectively) that target A
P5: D members, each one of which is associated with at least 15 C2IF entities (2-/3-gens, or 2-/3-chords, irrespectively) that target A
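The last five populations of Table 3 are threshold filters over the members of D. A minimal Python sketch follows, assuming that the labels P1 to P5 used later in the text refer, in order, to those five rows, and that a member's C2IF entities are the sum of the indirect citations it channels and the chords it represents. The article identifiers and counts are hypothetical.

```python
# Hypothetical counts for the D members of one focal article A:
# "indirect" = 2-/3-gen citations channelled, "chords" = 2-/3-chords represented.
counts = {
    "a1": {"indirect": 12, "chords": 1},
    "a2": {"indirect": 6, "chords": 5},
    "a3": {"indirect": 2, "chords": 0},
    "a4": {"indirect": 9, "chords": 7},
}

# Thresholds taken from Table 3; the mapping of labels to rows is assumed.
RULES = {
    "P1": lambda c: c["indirect"] >= 5,
    "P2": lambda c: c["indirect"] >= 10,
    "P3": lambda c: c["chords"] >= 5,
    "P4": lambda c: c["indirect"] + c["chords"] >= 10,
    "P5": lambda c: c["indirect"] + c["chords"] >= 15,
}


def select_populations(member_counts):
    """Return, for each label, the D members that satisfy its threshold."""
    return {
        label: {article for article, c in member_counts.items() if rule(c)}
        for label, rule in RULES.items()
    }


selected = select_populations(counts)
for label in sorted(selected):
    print(label, sorted(selected[label]))
```

Because each rule only tightens or combines counts, the populations are nested: P2 is always a subset of P1, and P5 is always a subset of P4.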
As stated in the previous section, 60 highly cited CSc articles comprised the testbed environment which was used for calculating the S(A,x) scores. Unfortunately, the SCIE database does not assign author-supplied keyword and keyword_plus® instances to every article it registers. More specifically, of the 4,227 articles only 1,600 were found to be associated with a non-empty list of author-supplied keywords, and only 2,963 articles were found to be associated with at least one keyword_plus® instance. Moreover, only 1,231 articles had both a non-empty author-supplied keywords list and a non-empty keyword_plus® list. As a result, two different sets of 20 highly cited CSc articles were selected for conducting the S(A,x) measurements: one for the author-supplied keywords case, and one for the keyword_plus® case. The two sets had only seven (7) articles in common. Still, as will become clear from the following, the two 20-article sets produced analogous results. Considering the experimental setup outlined in Table 3, the Sim(A,x) and S(A,x) scores were calculated using Equations (1) and (2), with N = 10 * 20 = 200.
Table 4. Average S(A,x) values for the two sets of highly cited CSc articles considered
The results in Table 4 reveal that, on average, indirect citation populations (i.e. I2 and I3) tend to involve (citing) articles which are less similar to the focal article when compared to chord populations (C2 and C3). This is in line with what one would intuitively expect. Articles in C2 and C3 cite the focal article both directly and indirectly, which is not the case with all the articles in I2 and I3. The highest similarity scores achieved were those of the P3 and P2 populations for the author-supplied keyword and the keyword_plus® cases, respectively. For the author-supplied keywords case, the P3 articles population represented an average similarity score improvement of 24% over D, i.e. the articles citing the target article directly, whereas for the keyword_plus® case, P2 was shown to achieve a 43.3% improvement over D (Figures 2 and 3).
In Figure 2, the P3 articles population is seen to represent similarity scores which are higher than those of the direct citations (D) for 13 of the 20 focal articles considered. On average, the former was shown to represent a 24% similarity score improvement over the latter (Table 4). In Figure 3, the P2 articles population is shown to excel over D for 16 of the 20 highly cited CSc articles considered, and to represent a 43.3% similarity score improvement, despite the drop in the case of article number 11 (whereby none of the focal article's keyword_plus® instances was registered to index any one of the articles in the corresponding P2 population).
Returning to the results presented in Table 4, one notes that chords do better in implying semantic relationship for the author-supplied keywords case than they do for the keyword_plus® case. More specifically, C2 and C3 were measured to involve (average) similarity scores slightly lower than those of the 1-gens (D) in the keyword_plus® case. Also, P3 (which is chords-based) is seen to comprise the population achieving the highest (average) similarity score for the author-supplied keywords case, whereas the one that excels in the keyword_plus® case is P2, which involves indirect citations, not chords. This notable difference between the results obtained in the author-supplied keywords case and those of the keyword_plus® case could be attributed to the algorithm used to extract the keyword_plus® (index) entries. The latter are extracted from the titles of a given article's cited references (ISI, 1999). As a result, articles which lie along the same citation path (like the ones in P2) are more likely to be indexed by the same entry when compared to articles citing one another indirectly along non-overlapping citation paths (e.g. the articles in P3). Consequently, the algorithm used to extract the keyword_plus® entries tends to bias the semantic content of the (focal article's) index with pragmatic (i.e. cited references-relating) information.
Conclusion and Future Work
The present study considers a bibliographic citations database and addresses the article-to-article similarity issue in the context of the cascading citations indexing framework (C2IF). More specifically, it focuses on the ‘Where do I go from here?’ dilemma, whereby a researcher (or even a process) has identified a published item to be one of specific interest and considers the citations received (i.e. articles that include the given item in their cited references lists) in order to chain forward. It is shown that it makes sense to extend the citation indexing paradigm by incorporating C2IF features for increased retrieval effectiveness. By C2IF features are meant entities like indirect citations and chords, both of which are indicative of a cited (target) article's impact value in promoting science and/or technology. In this respect, semantic (keywords- and/or descriptors-based) information may be combined with pragmatic information (i.e. the number of indirect citations channeled, and/or the number of chords represented along a given citation link) in order to calculate new (improved) rank values for retrieval. All of this becomes possible since the outcome of the present study suggests that (pragmatic) C2IF-based information may be used to imply semantic similarity between two articles along a citation link, in addition to relating to the focal article's bibliographic/research impact value.
The relevant calculations and measurements involved Computer Science articles from the Science Citation Index Expanded (SCIE) database with publication dates from 1999 to 2005 inclusive. The database was used to construct a ‘closed world’ target dataset which involved sixty (60) highly cited articles and their associated citations (direct and indirect, up to a depth level of 3). The similarity scores were calculated on the basis of the author-supplied as well as the keyword_plus® entries registered in the SCIE database. A new view of the well-known tf*idf term importance indicator for indexing was established. Instead of using it to rank a query against a set of documents (which is the typical approach), an article was effectively ranked against a set of article populations by considering the (index) entries assigned to each individual article. This new approach is a useful side-benefit of the present research effort, since it can be used to formulate a similarity (‘distance’) function for clustering items in a bibliographic citations database. This is one direction which we intend to exploit further in the future stages of the current research.
Additional issues that we intend to address next, include:
(a) the automation of the procedure which identifies synonyms and amends typos and/or American/UK English language spelling variations of the author-supplied keyword and keyword_plus® bibliographic entries in the SCIE database. As reported in the Testbed and Data Preparation section above, the experimental testbed consisted of 4,227 articles, involving 4,682 author-supplied keywords and 2,544 machine-generated keywords plus® entries. The two lists were processed manually, resulting in a decrease of their sizes by 18% and 16%, respectively;
(b) the relaxation of the ‘closed world’ restriction applied in the present study. It will be interesting to measure the effectiveness the C2IF features have in implying semantic similarity when bibliographic items from more than one discipline area and cross-disciplinary citation links are considered. Equally interesting will be the investigation of the case whereby self-citations are excluded from contributing to the above;
(c) the implementation of the C2IF algorithm, its efficiency and scalability. These issues are to be addressed/reported in a separate paper/report.
Concluding, the research outcome of the present study is seen to comprise a clear indication that the cascading citations indexing framework (C2IF) can turn out to be useful in improving the effectiveness of retrieval in the modern digital library environment.
The authors wish to thank David Kohl for his assistance in editing the final version of this report, as well as Richard Hartley and Peiling Wang, members of the C-CAP extended advisory board, for their advice and continuous support and encouragement. Special thanks are due to Thomson Scientific for making their SCIE database available to C-CAP, and to IBM, since it is via their academic initiative program that the DB2 DWE v.9.1 software is used in this project. Last but not least, the authors wish to thank the reviewers, whose comments and suggestions have helped improve the content of the present report.