Going beyond counting first authors in author co-citation analysis

Abstract

The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based. We compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first author of a cited work, and a non-traditional type, which takes into account the first five authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, when the same number of top-ranked authors is selected and analyzed, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting. Reasons for these effects are discussed.

Introduction

Since its introduction by White & Griffith (1981), author co-citation analysis (ACA) has gained great popularity in the study of intellectual structures of scholarly fields and of the implied social structures of the corresponding communities. While most studies have applied the general steps and techniques of classic ACA to different research fields with little or no modification, some studies have proposed new techniques for mapping author clusters (White, 2003) or for statistically processing co-citation counts (Ahlgren, Jarneving & Rousseau, 2003). However, few studies (Persson, 2001) have examined the fundamental aspects of ACA, such as the way co-citation counts are defined, even though these counts provide the raw data on which all subsequent statistical analyses are based. The present study seeks to help fill this gap and to shed some light on future directions of ACA studies.

Research questions

ACA is one particular type of co-citation analysis. It is generally accepted that the co-citation concept was discovered independently by Small (1973) and Marshakova (1973), and that document co-citation analysis was introduced by Small (1973) and author co-citation analysis by White & Griffith (1981). Many co-citation analysis studies have been conducted since.

In co-citation analysis, a set of items (authors, documents, journals, etc.) is selected to represent a research area. Relationships between these items are then analyzed using co-citation counts - the number of articles that have cited two items together - as similarity measures and multivariate analysis techniques as analysis tools, in order to study the intellectual structure of this research field and to infer some of the characteristics of the corresponding scientific community.

Depending on the units of analysis (documents, authors, journals, etc.) and on the citation / co-citation thresholds, both the macro-structure - the overall map of an entire science with each dot on the map representing a discipline - and the micro-structure - the structure of a single specialty with each dot on the map representing a single document - of a science can be mapped, providing either overviews of research areas or a look at the underlying fine structures (Small, 1999).

ACA takes authors as its units of analysis, which makes it more complex than document co-citation analysis in at least two respects. First, the interpretation of results is complicated by the fact that an author represents a larger and less homogeneous unit than an article does in terms of what the unit of analysis connotes. This problem was well addressed when ACA was first introduced, as well as in subsequent studies (White, 2003). Second, defining co-citation is tricky because of multiple authorship. Unlike the first problem, this one has rarely been discussed, partly due to the constraints imposed by the main data source that has so far been used for ACA studies - the set of databases developed by the Institute for Scientific Information (ISI). These databases only index the first authors of cited documents. All authors of a cited document can be found if the document has itself been indexed as a source (citing) paper, but the number of such documents, although increasing over time, has been small. As a result, classic ACA only takes into account first authors in the definition of co-citation. Specifically, two authors are considered as being co-cited when at least one document from each author's oeuvre occurs in the same reference list, an author's oeuvre being defined as all the works with that author as the first author (McCain, 1990). For convenience, this is called “first-author co-citation” in the present study.

Taking advantage of citation indexes for scholarly publications on the Web that index all cited authors, the present study attempts to go beyond first-author co-citation and to define an author's oeuvre as all works with this author as one of the authors, which is called “all-author co-citation” in the present study for the purpose of convenience. We hope to see what kind of a picture we can get this way from ACA regarding the intellectual structure of a scholarly field, and how and why it might be different from that resulting from the first-author co-citation-based ACA.

We took a simplified approach to all-author co-citation in that we only took into account the first five authors rather than all authors. We hoped that this approach would approximate strict all-author co-citation counts sufficiently well, as publications with more than five authors turned out to be infrequent in the data of the present study (Table 1). Even if the approximation was insufficient, it would still help us to see beyond the classic first-author co-citation analysis.

One immediate implication of all-author co-citation is that two authors will also be considered as being co-cited when a single paper which they co-authored is cited. In other words, cited co-authorship is also counted as co-citation, which may be conceptually confusing at first thought. However, all-author co-citation appears to be an authentic measure of the connectedness between authors because, just like co-citation, co-authorship indicates that authors are related to each other in some sense, and it is actually a closer relationship between authors than that formed by co-citation. It appears to be an even better measure, since this way of counting co-citations takes into account more links between related authors, which may make it easier to identify interrelationships among authors. This was partly confirmed by one of our earlier studies (Zhao & Logan, 2002) and is illustrated in the present study.
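
The difference between the two definitions can be made concrete with a minimal sketch. It is not the program used in the present study; the function name, the input layout (each citing paper as a list of cited references, each reference as a list of author names in byline order), and the sample data are assumptions made purely for illustration.

```python
from collections import Counter
from itertools import combinations

MAX_AUTHORS = 5  # the study's simplification: only the first five authors count


def cocitation_counts(citing_papers, all_authors=True):
    """Count, for each author pair, the citing papers that cite both authors.

    citing_papers: list of reference lists (hypothetical layout); each
    reference is a list of author names in byline order.
    """
    counts = Counter()
    for references in citing_papers:
        cited = set()
        for authors in references:
            # First-author counting keeps one name per reference;
            # all-author counting keeps up to MAX_AUTHORS names.
            cited.update(authors[:MAX_AUTHORS] if all_authors else authors[:1])
        # Each citing paper contributes at most 1 to any author pair.
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts


# Illustrative data: two citing papers with made-up author labels.
papers = [
    [["A", "B", "C"], ["D", "E"]],  # cites a three-author and a two-author work
    [["B", "F"]],                   # cites a single co-authored work
]
first = cocitation_counts(papers, all_authors=False)
full = cocitation_counts(papers, all_authors=True)
```

With first-author counting, only the pair (A, D) is co-cited. With all-author counting, B and F are co-cited through the single paper they co-authored, which is the cited co-authorship effect described above.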

It would be interesting to do another analysis which excludes cited co-authorship from all-author co-citation counts to see the differences this might make in the ACA results in order to explore the conceptual implications.

Table 1. Distribution of papers by number of authors

Methodology

Data collection

The research area we analyzed in the present study was XML - eXtensible Markup Language. Although XML has applications in a wide range of areas, the core of XML research belongs to computer science. Thus, we used the NEC Corporation Research Institute's ResearchIndex (now a joint effort of NEC and the School of Information Science and Technology at Pennsylvania State University) to collect citing papers on XML along with their reference lists. ResearchIndex (also known as CiteSeer) automatically indexes research papers that both fall within a broadly defined computer science field and are publicly available on the Web. It is an SCI-like tool freely available on the Web, but it provides more information on cited papers than SCI, including their full titles and the names of all authors. Studies have shown that author rankings using data from ResearchIndex are highly correlated with those using data from the ISI's SCI when identical citation counting methods are used (Zhao, 2003; 2005), which suggests that this tool is as valid for citation analysis as SCI. More information about ResearchIndex can be found in Lawrence et al. (1999), Zhao & Logan (2002), and Zhao & Strotmann (2004).

We developed a Java program to search for all documents indexed by ResearchIndex in “Header” fields under the term “XML” or “eXtensible Markup Language,” and to download all of the records that met the search criteria to a local machine. No citation windows were specified in the present study, meaning that publications from all years were used. The algorithm of this program can be found in Zhao (2003). The actual search was conducted on December 18, 2001.

Another program was developed in Java to parse these records, and to store the resulting citation information such as titles, authors, publishing sources and years of both citing and cited documents in a data structure that was convenient for later data analysis such as counting citations and co-citations using multiple methods. Since an earlier study (Zhao & Logan, 2002) had found the existence of duplicates to be one of the major differences between the ISI databases and ResearchIndex, the citing documents were examined first by another Java program and then manually to remove possible duplicates. Citations made by these duplicates were removed as well. The data structure and the algorithms of these programs can also be found in Zhao (2003).

This way, we collected 312 publications which made 4,578 citations altogether.

Data analysis

Based on the perceptions of the field held by the authors of these 312 publications, as reflected in their citations, we conducted an ACA using both first-author and all-author co-citation counts.

We followed the commonly accepted steps and techniques of ACA (McCain, 1990; White & McCain, 1998; White, 2003; Zhao, 2003) except for the way we defined co-citation, as discussed earlier. Core sets of authors were selected based on “citedness” — the number of citations they received. Two sets of highly visible authors were thus selected using two different citation counting methods — straight counts and complete counts. Simply put, when a paper with N authors is cited, with straight counts, only the number of citations of the first author of this paper increases by 1, and with complete counts, full credit is given to all authors of the paper, i.e. the number of citations of each of the N authors increases by 1. However, similar to our simplified approach to all-author co-citation counts discussed earlier, we also took a simplified approach to complete counts in that it only took into account the first five authors rather than all authors.
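
The two citation counting methods can be sketched as follows. This is an illustration only, not the program used in the study; the function name, the input layout (one author list per citation received), and the sample data are assumptions.

```python
from collections import Counter

MAX_AUTHORS = 5  # simplified complete counts: only the first five authors


def citation_counts(cited_references, method="straight"):
    """Count citations per author.

    cited_references: one entry per citation made (hypothetical layout),
    each entry being the cited paper's author names in byline order.
    method: "straight" credits only the first author;
            "complete" credits each of the first MAX_AUTHORS authors.
    """
    counts = Counter()
    for authors in cited_references:
        credited = authors[:1] if method == "straight" else authors[:MAX_AUTHORS]
        for name in credited:
            counts[name] += 1
    return counts


# Illustrative data: three citations with made-up author labels.
references = [["A", "B"], ["B", "C"], ["A", "B"]]
straight = citation_counts(references, "straight")
complete = citation_counts(references, "complete")
```

In this example, author B receives one citation under straight counts but three under complete counts, which shows how frequent co-authors can rise in a ranking based on complete counts.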

There are no strict rules regarding thresholds for citation-based author selection in author co-citation analysis studies (McCain, 1990). Assuming that a research field is better represented the more authors are included, the present study used low thresholds to allow 100 authors to be included in the final multivariate analysis. This is the maximum number of variables possible when using ALSCAL, the multidimensional scaling routine in SPSS (version 10.0).

A Java program was developed to count both all-author co-citation frequencies and first-author co-citation frequencies, and to record them in two separate matrices. These co-citation matrices were then cleaned by deleting authors who were co-cited with very few other authors, based on the assumption that authors who have little connection with the rest of the field are not good representatives of the field. Specifically, an author was deleted if the corresponding row/column contained more than 95% zero-value cells. The resulting matrices were then converted to Pearson's r correlation matrices, which were in turn used as input to the two multivariate analysis procedures employed: Factor Analysis (FA) and Multidimensional Scaling (MDS).
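
The cleaning and conversion steps can be sketched as follows. This is a rough illustration rather than the procedure actually implemented. Several details are assumptions not stated in the text: the diagonal cell is ignored when computing the share of zeros, the cleaning is done in a single pass, the cells of the two authors being compared are left out of their correlation, and a correlation with zero variance is set to 0. All names and the sample matrix are hypothetical.

```python
from math import sqrt

ZERO_SHARE_LIMIT = 0.95  # delete authors with more than 95% zero cells


def drop_sparse_authors(names, matrix, limit=ZERO_SHARE_LIMIT):
    """Remove authors whose row has more than `limit` zero cells.

    Assumption: the diagonal cell is ignored, and one pass is made.
    """
    keep = []
    for i, row in enumerate(matrix):
        off_diagonal = [v for j, v in enumerate(row) if j != i]
        zero_share = sum(1 for v in off_diagonal if v == 0) / len(off_diagonal)
        if zero_share <= limit:
            keep.append(i)
    kept_names = [names[i] for i in keep]
    kept_matrix = [[matrix[i][j] for j in keep] for i in keep]
    return kept_names, kept_matrix


def pearson(x, y):
    """Pearson's r; returns 0.0 when either profile has no variance (assumption)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_matrix(matrix):
    """Correlate every pair of author profiles (rows of the co-citation matrix).

    Assumption: for the pair (i, j), cells i and j are excluded from both rows.
    """
    n = len(matrix)
    result = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            x = [matrix[i][k] for k in range(n) if k not in (i, j)]
            y = [matrix[j][k] for k in range(n) if k not in (i, j)]
            result[i][j] = result[j][i] = pearson(x, y)
    return result


# Illustrative data: author E is never co-cited with anyone.
names = ["A", "B", "C", "D", "E"]
matrix = [
    [0, 3, 2, 1, 0],
    [3, 0, 4, 2, 0],
    [2, 4, 0, 1, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
kept_names, kept_matrix = drop_sparse_authors(names, matrix)
correlations = correlation_matrix(kept_matrix)
```

Author E is removed because all of its cells are zero, and the remaining four authors are correlated on their co-citation profiles. How the diagonal is treated is a known design choice in ACA, so the handling shown here should be read as one option among several.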

Factors were extracted by Principal Component Analysis (PCA) with an oblique rotation (SPSS Direct OBLIMIN). An oblique rotation was chosen because it is often more appropriate than orthogonal rotations when it can be expected theoretically that the resulting factors (in this case, specialties) would in reality be correlated (Hair et al., 1998). The number of factors extracted was determined based on Kaiser's rule of eigenvalue greater than 1 because the resulting model fit was adequate as represented by total variance explained, communalities, and correlation residuals as discussed below (Hair et al., 1998).

The multidimensional scaling procedure used in this study was SPSS ALSCAL, as in many previous studies (White & McCain, 1998; Kreuzman, 2001). It produced powerful two-dimensional solutions, with a squared correlation (RSQ) value of 0.99 and a Stress 1 measure of 0.05 for the all-author analysis (Figure 1), and an RSQ value of 0.98 and a Stress 1 measure of 0.07 for the first-author analysis (Figure 2). The two-dimensional maps (the MDS maps shown in Figures 1 and 2) were visualized using LaTeX from the coordinates resulting from the ALSCAL procedure.

With the aid of both factor analysis and multidimensional scaling techniques, the grouping of the scholars within each of the two sets of authors was analyzed, and results compared.

Results and discussion

We will base our discussion mainly on the factor analysis results presented in Table 2 and Table 3, complemented by MDS maps as presented in Figure 1 and Figure 2, because factor analysis applied in ACA has been shown to provide clear and revealing results as to the nature of a discipline (White and McCain, 1998).

Kaiser's rule of eigenvalue greater than one resulted in a five-factor model from all-author co-citation analysis (Table 2) and an eleven-factor model from first-author co-citation analysis (Table 3). They account for 97.2% and 96% of the total variance respectively, and almost all of the differences between observed and implied correlations are smaller than 0.05 in both cases. The factor names shown at the top of Tables 2 and 3 correspond to the numbered column headings and were given based on an examination of the cited articles written by authors in the corresponding factors. Following White and McCain's example (White & McCain, 1998), authors are ranked in the factor on which they load most highly, and their loadings on other factors that are above 0.4, if any, are also presented, indicating their contributions to more than one specialty. If an author does not load 0.4 or higher on any of the factors, the author's highest loading, whatever it may be, is presented.
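
The presentation rule for loadings can be expressed as a short sketch. The function name and the sample loadings are hypothetical. Comparing loadings by absolute value and treating a loading of exactly 0.4 as reportable are assumptions, since the text does not settle either point.

```python
THRESHOLD = 0.4  # loadings at or above this value are reported (assumption: by magnitude)


def loadings_to_report(loadings):
    """Pick the factor an author is listed under and the loadings to display.

    loadings: dict mapping factor number to the author's loading on it.
    Returns (primary_factor, reported_loadings).
    """
    primary = max(loadings, key=lambda f: abs(loadings[f]))
    reported = {f: v for f, v in loadings.items() if abs(v) >= THRESHOLD}
    # The highest loading is always shown, even if it is below the threshold.
    reported.setdefault(primary, loadings[primary])
    return primary, reported


# Illustrative loadings for two made-up authors.
two_specialties = loadings_to_report({1: 0.85, 2: 0.45, 3: 0.10})
weak_loader = loadings_to_report({1: 0.30, 2: 0.20})
```

The first author is listed under factor 1 with a secondary loading on factor 2, while the second author is listed under factor 1 with only the highest loading shown.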

Because the factor analysis result based on first-author co-citation counts is large in both dimensions (number of authors and number of factors), Table 3 is laid out differently from Table 2, which shows all factors in both its left and right halves. The left part of Table 3 shows only factors 1 to 3, together with any other factors on which the authors who load most highly on factors 1 to 3 have secondary loadings. The right part shows factors 4 to 11, together with any other factors on which the authors who load most highly on factors 4 to 11 have secondary loadings.

Table 2. Factor Analysis of 100 authors in the XML research field (all-author co-citation)
Table 3. Factor Analysis of 100 authors in the XML research field (first-author co-citation)

If major factors are interpreted as specialties, the results of the factor analyses presented in Tables 2 and 3 reveal structures of specialties within the XML research field and the associated authors' memberships in one or more specialties. A comparison between results in Tables 2 and 3 reveals that the major subfield structure of the XML research field is very similar for both first-author and all-author co-citation analysis, which matches an observation by Persson (2001) in the Library and Information Studies research field. This is probably due to the independent existence of the intellectual structure of a field. However, the number of subfields and authors' relative positions are different between the two sets of results.

Both types of co-citation analysis have identified four major specialties in the XML research field: (1) XML or semi-structured databases, (2) Foundations of XML or semi-structured data management and processing, (3) programming for or processing of XML data, and (4) The Semantic Web. This can be seen from the fact that almost all authors in these four factors who are common to both sets of top-ranked authors have been placed in the same factors in the two types of analysis.

The first two specialties are the most active ones, as indicated by the number of authors working in those areas (the size of the corresponding factors). They both deal with XML or semi-structured data management but with different emphases: design and implementation versus theory. The Semantic Web group has little to do with the other three specialties, which can be seen from the lack of overlap between the corresponding factor and other factors. In fact, this group has a high negative correlation in both types of analysis with the XML or semi-structured databases group (-0.638 and -0.31 respectively) and these two groups are located at opposite ends on the MDS maps (Figure 1 and Figure 2). It is different from the rest of the field in that it attempts to add semantics to the Web using technologies such as XML while other specialties essentially deal with a syntactic view of XML.

This structure is quite clear in the results from the all-author co-citation analysis depicted in Table 2, considering that the XML and relational databases group is highly correlated with the XML or semi-structured databases group, as indicated by the correlation coefficient (0.545) given by the oblique rotation procedure. The small group of authors labeled as XML and relational databases represents the research focus of mapping data between relational databases and XML, that is, the representation of XML data in a relational database or of relational data in XML format. It is closely related to XML or semi-structured databases because both apply database theory and technology to the management of XML data. The difference is just that one uses relational database technology and the other semi-structured database technology. This close relationship is confirmed by the same general location of these two groups on the MDS map (Figure 1), and by the merging of the XML and relational databases group into the XML or semi-structured databases group when different factor models with a smaller number of factors were tried.

This structure can also be clearly seen on the MDS map generated from the all-author co-citation analysis, as shown in Figure 1. The authors who study XML and databases, including XML or semi-structured databases and XML and relational databases, are located on the right; the Foundations of XML or semi-structured data management and processing group runs along the bottom from left to right; The Semantic Web group is far away from the others in the upper left corner; and the Programming group lies along the X axis on the left.

Figure 1. MDS map of top 100 authors in the XML research field (all-author co-citation). A light dot is placed at the origin and four circles around it to show more clearly the distance of author-points to the origin. The distance between the first circle and the origin and between any two consecutive circles is the same, namely 0.5.

It appears that one of the two dimensions on the map (X axis) is the degree to which database technology is a research focus. From left to right, the importance of database technology becomes more pronounced. This can be seen from the group structure: from The Semantic Web group at the far left end, which has little to do with databases, to the XML or semi-structured databases group at the right end, in which database technology is the focus. It can also be seen from the structure inside the Foundations of XML or semi-structured data management and processing group at the bottom: the core database people (e.g. Buneman, Ullman, Davidson, and Vianu) are at the right, and the authors of XML-related standards and specifications that are not database related in their own right are at the left (e.g. Apparao, Champion, Thompson, Paoli, Sperberg-MacQueen and Bray). This reinforces the significant difference between The Semantic Web specialty and the XML or semi-structured databases group indicated by the high negative correlation (-0.638) between them. The meaning of the second dimension (Y axis), however, is not as readily apparent.

On the MDS map generated from the first-author co-citation analysis (Figure 2), however, the author grouping is not as clear, although the four specialties are identified from the factor analysis results (Table 3). Nevertheless, it is still quite visible on this map that the X dimension represents the degree of database technology being a focus of research.

In addition to these four major specialties that are common to both sets of results discussed above, the first-author co-citation analysis identified other areas of research within the field. The largest two are Natural Language Processing (NLP), which focuses on techniques for representing natural text in XML, and General database and information retrieval foundations, in which authors, rather than working with XML per se, have tended to discuss general database and information retrieval technologies that can be applied in XML research. The remaining factors have captured some tightly focused small groups including Version management, Data integration, Functional and Logic Programming in XML, Knowledge management (KM), and Access Control.

Figure 2. MDS map of top 100 authors in the XML research field (first-author co-citation). The dot at the origin and the circles around it are drawn in the same way and for the same purpose as in Figure 1.

It appears that the major differences between the structure revealed through all-author co-citation analysis and that found through first-author co-citation analysis are (1) the number of specialties identified and (2) the cohesion level of specialties identified. The picture produced through all-author co-citation analysis contains author groups that are more coherent, and is therefore considerably clearer. However, the same picture represents fewer specialties in the research field being studied than that produced through first-author co-citation analysis.

The first of these differences is probably due to the different methods used in selecting the representative authors to be included in the analyses. Since complete counts are used in the all-author co-citation analysis, authors who often publish as co-authors have the same chance as first authors of being selected for the analysis. These authors, however, are unlikely to make the cut in the first-author co-citation analysis, which uses straight counts to select authors. Co-authors have usually been working on the same general topics, and their inclusion in the analysis tends to push out of the list of top-ranked authors other authors who may represent smaller, unrelated research areas. As a result, the same number of top-ranked authors is likely to represent more research topics in first-author co-citation than in all-author co-citation analysis. In other words, when the same number of highly cited authors is selected, all-author co-citation analysis appears to reveal a picture of the most active research specialties, while first-author co-citation analysis can cover more specialties in the research field being studied. This is evidenced by the observation that authors in the all-author co-citation analysis results are very concentrated, with the first two large specialties including about 75% of the authors, while those in the first-author co-citation analysis results are scattered across many specialties. It would be interesting to test whether more specialties emerge if more highly cited authors are included in the all-author co-citation analysis, and what pattern this may follow.

The second difference appears to be due to the co-citation counting methods used in the two types of analysis. All-author co-citation takes into account co-citations received by scholars as authors other than first authors in addition to those as first-authors. It also counts co-authors of cited articles as being co-cited. In all, considerably more links between scholars are considered in all-author co-citation analysis. As a result, related authors tend to get higher co-citation counts in all-author co-citation than in first-author co-citation analysis, which ties authors in the same groups closer and pulls authors in different groups farther away from each other, resulting in a clearer picture. This can be easily seen from the MDS maps: the four groups in Figure 1 are quite clear-cut whereas those in Figure 2 show considerable overlap even between the four major groups common to these two maps.

Conclusion

We have examined one of the fundamental aspects of author co-citation analysis (ACA) that has rarely been touched since its introduction in 1981, namely the way co-citation counts are defined. These counts provide the raw data on which all subsequent statistical analyses and mapping are based. A comparison between first-author and all-author co-citation analyses of the XML research field has indicated that an all-author co-citation analysis, which takes into account more links between related authors, results in a considerably clearer picture of the intellectual structure of a research field than the classic first-author co-citation analysis. The same number of authors selected by citedness tends to represent fewer specialties when all authors are counted than when only first authors are counted. This should not be a problem, however, if future studies can confirm that including a larger number of authors in the analysis increases the number of specialties identified, because more recent techniques such as Pathfinder Networks (PFNETs) can handle larger author sets, unlike SPSS, which only allows a limited number of variables. For example, PFNETs when applied to ACA allow about 200 author names to be mapped and have shown considerable advantages for ACA (White, 2003).

ACA has been shown to be a powerful approach to the study of scholarly communication. However, as collecting co-citation counts in the print world is nearly impossible without the aid of citation indexes, ACA studies have been relying heavily on the ISI databases, and consequently have been limited to first-author co-citation. As full-text scholarly publications and tools for searching them are becoming increasingly available on the Web, there are now alternatives to the ISI databases for collecting co-citation data that allow us to go beyond first-author co-citation and thus to get a clearer picture of scholarly communication patterns. The present study has offered an example. We are confident that future studies will combine recent advanced information visualization techniques with various co-citation counting methods to produce even more interesting and revealing ACA results.

Acknowledgements

The author wishes to thank Dr. Andreas Strotmann, Dr. Elizabeth Logan, and Dr. Gary Burnett for their many helpful insights.
