Six degrees of scholarship


1. Introduction

In a paper entitled “Exploring Co-Citation Chains,” the author developed the idea of extending co-citation analysis to include additional terms or authors “…that are not directly co-cited, but are linked via a co-cited, yet intransitive, relationship” (Buzydlowski, 2006). That extension was called a co-citation chain.

To illustrate that concept, that paper provides several chains based on the Arts & Humanities Citation Index (AHCI). For instance, the following chain was used as an example:

BACON-F-CICERO-VARRO.

In the above chain, Francis Bacon is co-cited with Cicero, Cicero is co-cited with Varro, but Varro is NOT co-cited with Bacon. The links (dashes) indicate the two different literatures that separate Varro and Bacon, with Cicero as the joining author. Such a chain could prompt scholars to ponder the significance of the joining author, Cicero, or to explore the mutually exclusive literatures surrounding that author.
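The chain property described above can be sketched as a small check. This is only an illustration, assuming that "intransitive" means no two non-adjacent names in the chain are co-cited; the function name `is_cocitation_chain` and the pair set are hypothetical, with the pairs taken from the Bacon example.

```python
from itertools import combinations

def is_cocitation_chain(chain, cocited):
    """Return True if adjacent names are co-cited and non-adjacent names are not."""
    for i, j in combinations(range(len(chain)), 2):
        linked = frozenset((chain[i], chain[j])) in cocited
        adjacent = (j - i == 1)
        if adjacent != linked:
            return False
    return True

# Co-cited pairs as stated in the text: Bacon-Cicero and Cicero-Varro,
# but not Bacon-Varro.
cocited = {
    frozenset(("BACON-F", "CICERO")),
    frozenset(("CICERO", "VARRO")),
}

print(is_cocitation_chain(["BACON-F", "CICERO", "VARRO"], cocited))
```

If Bacon and Varro were also co-cited, the check would fail, since the two names would then be directly linked and Cicero would no longer be needed as a joining author.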

The above concept is not new: systems such as Arrowsmith (http://kiwi.uchicago.edu/) apply it to keywords. Co-citation chains extend those ideas to author names and suggest longer chains of joining authors.

As an example of a longer chain, the aforementioned paper opens with the idea of linking Howard D. White, the information scientist, with Emily Dickinson, the poet, and concludes with just such a chain:

DICKINSON-E-STEVENS-W-BATES-MJ-BORGMAN-CL-WHITE-HD.

While the above chain does link the two names in question, the author indicates that the derivation of such chains is computationally difficult and user intensive, and that the prototype system used did not necessarily provide the shortest chain.

2. Methodology

This paper builds upon co-citation chains and uses a specialized database, Piotr, created to provide the shortest chains in real time. Using Piotr and a supercomputing-class machine, Rachel (Grandinetti, 2007), this paper explores other chain types that were not possible with the previous system because of their computational complexity. The data used to compute the examples in this paper are also from AHCI (see footnote 1).

Piotr is primarily a set of in-memory data structures providing links such that the paths from authors to papers and back again can be traversed very quickly. Additionally, a data structure that eliminates previously inspected authors and papers from further consideration makes the database very fast and efficient. The data from the AHCI was specifically prepared and transferred to the database for the examples used in this paper.
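The following is a minimal sketch of what such in-memory structures might look like, not a description of Piotr's actual implementation. The class name `CoCitationIndex`, the method `new_neighbors`, and the record layout (a paper identifier with the list of names it cites) are all assumptions made for illustration.

```python
from collections import defaultdict

class CoCitationIndex:
    """Hypothetical two-way index: cited name -> citing papers, paper -> cited names."""

    def __init__(self, records):
        self.papers_of = defaultdict(set)   # name -> papers that cite it
        self.authors_in = defaultdict(set)  # paper -> names it cites
        for paper_id, cited_names in records:
            for name in cited_names:
                self.papers_of[name].add(paper_id)
                self.authors_in[paper_id].add(name)

    def new_neighbors(self, name, seen_authors, seen_papers):
        """Return names co-cited with `name` that have not been inspected yet.

        Papers and names already inspected are skipped and the two sets are
        updated in place, so repeated calls never revisit old material.
        The caller is expected to put the starting name in `seen_authors`.
        """
        found = set()
        for paper in self.papers_of.get(name, ()):
            if paper in seen_papers:
                continue
            seen_papers.add(paper)
            for other in self.authors_in[paper]:
                if other not in seen_authors:
                    seen_authors.add(other)
                    found.add(other)
        return found

# Toy records, not AHCI data: p1 cites A and B, p2 cites B and C.
index = CoCitationIndex([("p1", ["A", "B"]), ("p2", ["B", "C"])])
seen_authors, seen_papers = {"A"}, set()
print(index.new_neighbors("A", seen_authors, seen_papers))
print(index.new_neighbors("B", seen_authors, seen_papers))
```

Skipping an already inspected paper is safe in this sketch because every name on that paper was added to `seen_authors` the first time the paper was visited.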

Rachel, a high-performance computer hosted by the Pittsburgh Supercomputing Center, is “a loosely-coupled pair of SMP machines. Each system has 64 1.15 GHz EV7 processors with 256 Gbytes of shared memory … Rachel is primarily intended to run applications with low to moderate parallelism that require large memory bandwidth and/or large shared memory” (see footnote 2). The machine was primarily used for its large memory capacity, but the highly parallel nature of the machine can be used for the ideas put forth in the Conclusion & Future Directions section of this paper.

Using Piotr, Rachel, and the AHCI dataset, the aforementioned chain between Dickinson and White was recomputed, and the two names were found to be separated by only one term (degree):

DICKINSON-E-US-EPA-WHITE-HD.

The system allows for a real-time check to see if any two names are connected. If they are connected, it will return a shortest chain. If not, it will return an empty chain.
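A breadth-first search is one standard way to obtain this behavior; the sketch below assumes the co-citation links are available as an adjacency mapping and is not taken from Piotr itself. The names `shortest_chain` and `build_graph` are hypothetical, and the toy graph contains only the links from the two Dickinson-White chains quoted in this paper.

```python
from collections import deque

def build_graph(chains):
    """Build an undirected adjacency mapping from a list of chains."""
    graph = {}
    for chain in chains:
        for a, b in zip(chain, chain[1:]):
            graph.setdefault(a, []).append(b)
            graph.setdefault(b, []).append(a)
    return graph

def shortest_chain(graph, start, goal):
    """Return a shortest chain from start to goal, or [] if unconnected."""
    if start not in graph or goal not in graph:
        return []
    parent = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        name = queue.popleft()
        if name == goal:
            chain = []
            while name is not None:  # walk back to the start
                chain.append(name)
                name = parent[name]
            return chain[::-1]
        for other in graph.get(name, ()):
            if other not in parent:
                parent[other] = name
                queue.append(other)
    return []

graph = build_graph([
    ["DICKINSON-E", "STEVENS-W", "BATES-MJ", "BORGMAN-CL", "WHITE-HD"],
    ["DICKINSON-E", "US-EPA", "WHITE-HD"],
])
print("-".join(shortest_chain(graph, "DICKINSON-E", "WHITE-HD")))
# prints DICKINSON-E-US-EPA-WHITE-HD
print(shortest_chain(graph, "DICKINSON-E", "VARRO"))
# prints []
```

Because breadth-first search visits names in order of distance from the start, the first time the goal is reached the chain is guaranteed to be a shortest one, though not necessarily the only one of that length.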

3. Discussion

With the ability to generate chains in real time, exploratory co-citation chain analysis is possible. For example, rather than generating chains based on two given names, only one name (a nameseed) could be used to find chains of interest to a user, similar to the exploratory author co-citation analysis used in the system AuthorMap (Lin, 2001).

Such a chain, an interest chain, could be defined as a chain of at least, say, two degrees linked to a name of sufficient citation frequency, say, thirty. Given this definition and starting with a nameseed of Emily Dickinson, the system produces the following chain:

DICKINSON-E(300)-ABBEY-E(72)-AMERY-C(31)-PEI-IM(57).

It may interest, and surprise, a Dickinson scholar to learn that Dickinson is connected to the architect I.M. Pei.

Or, given a nameseed of Francis Bacon, the system produces a chain that joins him with the San Francisco Ballet:

BACON-F(896)-JONES-M(354)-AUSTRALIAN-BALLET(21)-SAN-FRANCISCO-BAL(32)

(The numbers in parentheses indicate the citation strength; e.g., Bacon is cited 896 times in AHCI, and the San Francisco Ballet is cited 32 times.)
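One possible reading of the interest-chain definition can be sketched as follows. It assumes that a degree is an intermediate name (as in the one-degree Dickinson-White chain) and that the frequency threshold applies to the final name; both are interpretations, and the function name `interest_chain` and its parameters are hypothetical. The toy graph holds only the links and citation counts of the Dickinson chain above.

```python
from collections import deque

def interest_chain(graph, freq, seed, min_degrees=2, min_freq=30):
    """Return the first chain found, in breadth-first order, that has at
    least `min_degrees` intermediate names and ends in a name cited at
    least `min_freq` times; return [] if there is none."""
    parent = {seed: None}
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        name = queue.popleft()
        degrees = depth[name] - 1   # intermediate names between seed and name
        if degrees >= min_degrees and freq.get(name, 0) >= min_freq:
            chain = []
            while name is not None:
                chain.append(name)
                name = parent[name]
            return chain[::-1]
        for other in graph.get(name, ()):
            if other not in parent:
                parent[other] = name
                depth[other] = depth[name] + 1
                queue.append(other)
    return []

# Links and citation counts copied from the Dickinson chain in the text.
graph = {
    "DICKINSON-E": ["ABBEY-E"],
    "ABBEY-E": ["DICKINSON-E", "AMERY-C"],
    "AMERY-C": ["ABBEY-E", "PEI-IM"],
    "PEI-IM": ["AMERY-C"],
}
freq = {"DICKINSON-E": 300, "ABBEY-E": 72, "AMERY-C": 31, "PEI-IM": 57}

chain = interest_chain(graph, freq, "DICKINSON-E")
print("-".join(f"{name}({freq[name]})" for name in chain))
```

In this toy graph AMERY-C is cited often enough but lies only one degree from the seed, so the search continues until it reaches PEI-IM.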

While the database Piotr provides the shortest chain, it does not provide every possible chain linking the two names. However, providing all the chains would sometimes produce thousands of nearly identical chains that would overwhelm the user. Additionally, the calculation of the full set of chains is extremely expensive computationally, in time and especially in storage.

Every possible chain would contain many redundant names. As a compromise, orthogonal chains could be generated instead: chains that follow unique paths (no redundant names) between the two authors of interest.

Below is a sampled subset of the orthogonal chains generated between Emily Dickinson and I.M. Pei:

  • PEI-IM-AMERY-C-ABBEY-E-DICKINSON-E,

  • PEI-IM-ATTALI-J-ABRAHAM-N-DICKINSON-E,

  • PEI-IM-AULENTI-G-FOUCAULT-M-DICKINSON-E,

  • PEI-IM-CARRENODEMIRAND.J-JENKINS-M-DICKINSON-E.

Two parameters associated with orthogonal chains, length and occurrence frequency of intermediate authors, could be altered to reduce the number of such chains.
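One simple way to obtain chains with no shared intermediate names is a greedy search: find a shortest chain, block its intermediate names, and repeat. The sketch below illustrates that idea under stated assumptions and is not necessarily how the chains above were produced; a greedy search may also miss some chains that a maximum-flow formulation would find. The names `orthogonal_chains`, `shortest_chain`, `build_graph`, and the `max_len` parameter (standing in for the length parameter mentioned above) are hypothetical.

```python
from collections import deque

def build_graph(chains):
    """Build an undirected adjacency mapping from a list of chains."""
    graph = {}
    for chain in chains:
        for a, b in zip(chain, chain[1:]):
            graph.setdefault(a, []).append(b)
            graph.setdefault(b, []).append(a)
    return graph

def shortest_chain(graph, start, goal, blocked):
    """Breadth-first search that never passes through a blocked name."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        name = queue.popleft()
        if name == goal:
            chain = []
            while name is not None:
                chain.append(name)
                name = parent[name]
            return chain[::-1]
        for other in graph.get(name, ()):
            if other not in parent and other not in blocked:
                parent[other] = name
                queue.append(other)
    return []

def orthogonal_chains(graph, start, goal, max_len=None):
    """Greedily collect chains that share no intermediate names."""
    blocked, chains = set(), []
    while True:
        chain = shortest_chain(graph, start, goal, blocked)
        if not chain or (max_len is not None and len(chain) > max_len):
            break
        chains.append(chain)
        inner = chain[1:-1]
        if not inner:        # direct link: nothing to block, so stop
            break
        blocked.update(inner)
    return chains

# Toy graph built only from the four sampled chains listed above.
graph = build_graph([
    ["PEI-IM", "AMERY-C", "ABBEY-E", "DICKINSON-E"],
    ["PEI-IM", "ATTALI-J", "ABRAHAM-N", "DICKINSON-E"],
    ["PEI-IM", "AULENTI-G", "FOUCAULT-M", "DICKINSON-E"],
    ["PEI-IM", "CARRENODEMIRAND.J", "JENKINS-M", "DICKINSON-E"],
])
for chain in orthogonal_chains(graph, "PEI-IM", "DICKINSON-E"):
    print("-".join(chain))
```

Filtering by the occurrence frequency of intermediate authors could be added in the same way, by placing low-frequency names in `blocked` before the search starts.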

As a last example of the possibilities allowed for by exploratory chain analysis, this paper introduces the idea of a petered chain. This involves taking a nameseed and then generating all of the names within one degree of (co-cited with) that name. Next, do the same for all of the new names that were so generated. Repeat this until the chains become exhausted, i.e., there are no new authors. Given various nameseeds, how long would the chains be?

Using the Bible, the most frequently cited source in AHCI, as the nameseed yields the following chain:

BIBLE: 93982, 1047855, 57682, 879, 38, 1, 0.

With Dickinson, the following chain is produced:

DICKINSON-E: 3134, 793018, 400378, 3771, 134, 2, 0.

Finally, with Howard D. White as the nameseed, this chain is created:

WHITE-HD: 616, 382464, 806211, 10808, 324, 14, 0.

(The comma-separated numbers after the nameseed indicate the number of unique new names generated at each successive level. For example, there are 616 unique authors associated with White, then 382,464 new names associated with those 616 authors, and so on.)
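The petered-chain procedure amounts to a level-by-level breadth-first expansion that records how many new names appear at each level. The following is a minimal sketch under the assumption that co-citation links are held in an adjacency mapping; the function name `petered_chain` and the five-name toy graph are hypothetical and unrelated to AHCI.

```python
def petered_chain(graph, seed):
    """Return the number of new names found at each level, ending with 0."""
    seen = {seed}
    frontier = {seed}
    sizes = []
    while frontier:
        next_level = set()
        for name in frontier:
            for other in graph.get(name, ()):
                if other not in seen:   # count each name only once
                    seen.add(other)
                    next_level.add(other)
        sizes.append(len(next_level))
        frontier = next_level           # empty set ends the loop
    return sizes

# Toy graph: A is co-cited with B and C, B with D, D with E.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "E"],
    "E": ["D"],
}
print(petered_chain(graph, "A"))
# prints [2, 1, 1, 0]
```

The final zero is recorded when a level produces no new names, which matches the trailing zero in each of the chains above.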

4. Conclusion & Future Directions

The reader may have noticed that the last value of each of the previous petered chains is zero. This indicates that after six levels, there were no authors in the set that had not been used in a previous level. In fact, preliminary analysis has found that most authors used as seeds are exhausted within six degrees; i.e., a sort of “six degrees of scholarship” exists within the AHCI.

Furthermore, it is left to the reader as an exercise to sum the numbers created by the petered chains for each of the examples above. The result is startling and leads to an interesting hypothesis: everyone who is connected to anyone is connected to everyone within AHCI.
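For readers who prefer to check the exercise mechanically, the sums can be computed as below. The level counts are copied from the three petered chains in the Discussion; the variable names are only illustrative.

```python
# Level counts copied from the petered chains for the three nameseeds.
petered = {
    "BIBLE": [93982, 1047855, 57682, 879, 38, 1, 0],
    "DICKINSON-E": [3134, 793018, 400378, 3771, 134, 2, 0],
    "WHITE-HD": [616, 382464, 806211, 10808, 324, 14, 0],
}

# Total number of distinct names reached from each seed (seed excluded).
totals = {seed: sum(levels) for seed, levels in petered.items()}
for seed, total in totals.items():
    print(f"{seed}: {total:,}")
# prints 1,200,437 for each of the three seeds
```

All three seeds reach the same total of 1,200,437 names, which agrees with the figure of about 1.2 million unique authors given in footnote 1.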

The use of Piotr and a large computer allows for further exploration of such ideas, as well as of traditional social-network measures, and that is an area for future research. Additionally, work is planned to move the system from a specialized computer to a more user-friendly framework, such as a web site.

Acknowledgements

This research was supported in part by the National Science Foundation through TeraGrid resources provided by Pittsburgh Supercomputing Center.

The dataset used for this study, AHCI, was given by the Institute for Scientific Information (ISI) as a research grant to Drexel University.

Footnotes

  1. The AHCI dataset is from the years 1988–1997 and contains over 1 million records, with about 1.2 million unique authors.

  2. http://www.psc.edu/resources.html
