On the normalization and visualization of author co-citation data: Salton's Cosine versus the Jaccard index

Abstract

The debate about which similarity measure one should use for the normalization in the case of Author Co-citation Analysis (ACA) is further complicated when one distinguishes between the symmetrical co-citation—or, more generally, co-occurrence—matrix and the underlying asymmetrical citation—occurrence—matrix. In the Web environment, the approach of retrieving original citation data is often not feasible. In that case, one should use the Jaccard index, but preferentially after adding the number of total citations (i.e., occurrences) on the main diagonal. Unlike Salton's cosine and the Pearson correlation, the Jaccard index abstracts from the shape of the distributions and focuses only on the intersection and the sum of the two sets. Since the correlations in the co-occurrence matrix may be spurious, this property of the Jaccard index can be considered as an advantage in this case.

Introduction

Ahlgren, Jarneving, and Rousseau (2003) argued that one should consider using Salton's cosine instead of the Pearson correlation coefficient as a similarity measure in author co-citation analysis, and showed the effects of this change on the basis of a dataset provided in Table 7 (p. 555) of their article. This led to discussions in previous issues of this journal about the pros and cons of using the Pearson correlation or other measures (Ahlgren, Jarneving, & Rousseau, 2004; Bensman, 2004; Leydesdorff, 2005; White, 2003, 2004). Leydesdorff and Vaughan (2006) used the same dataset to show why one should use the (asymmetrical) citation instead of the (symmetrical) co-citation matrix as the basis for the normalization. They argued that not only the value but also the sign of the correlation may change between two cited authors when using the Pearson correlation in the symmetrical versus the asymmetrical case. For example in the dataset under study, Ahlgren et al. (2003, p. 556) found a correlation of r = +0.74 between “Schubert” and “Van Raan” while Leydesdorff and Vaughan (p. 1620) reported that r = −0.131 (p < .05) using the underlying citation matrix.

One can download a set of documents in which the authors under investigation are potentially (co-)cited in the library environment, but this approach of retrieving original citation data and then using Pearson's r or Salton's cosine to construct a similarity matrix often is not feasible in the Web environment. In this environment, the researcher may have only the index available and may search the database with a Boolean AND to construct a co-citation or, more generally, a co-occurrence matrix without first generating an occurrence matrix. Should one in such cases also normalize using the cosine or the Pearson correlation coefficient or, perhaps, use still another measure?

I argue that in this case, one may prefer to use the Jaccard index (Jaccard, 1901). The Jaccard index was elaborated by Tanimoto (1957) for the nonbinary case. Thus, one can distinguish between using the Jaccard index for the normalization of the binary citation matrix and the Tanimoto index in the case of the nonbinary co-citation matrix. The results will be compared using Salton's cosine (Salton & McGill, 1983), the Pearson correlation, and the probabilistic activity index (Zitt, Bassecoulard, & Okubo, 2000) in the case of both the symmetrical co-citation and the asymmetrical citation matrix.

The argument is illustrated with an analysis using the same data as in Ahlgren et al. (2003). This dataset (provided in Table 1) is extremely structured: It contains exclusively positive correlations within both groups and negative correlations between the two groups. The two groups are thus completely separated in terms of the Pearson correlation coefficients; however, there are relations between individual authors in the two groups. An optimal representation should reflect both this complete separation in terms of correlations at the level of the set and the weak overlap generated by individual relations (Leydesdorff, 2005, 2007; Waltman & van Eck, 2007). (A visualization of the co-citation matrix before normalization is provided in Figure 13 by Leydesdorff & Vaughan, 2006, p. 1625.)

Table 1. Author co-citation matrix of 24 information scientists (Table 7 of Ahlgren et al., 2003, at p. 555; main diagonal values added).
                     1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24    Sum
 1 Braun            50   29   19   19    8   13    5    9    7    7    2    0    0    0    0    0    0    0    0    0    0    0    0    0    118
 2 Schubert         29   60   30   18   10   20    5    5    5   14    2    1    0    0    0    0    0    0    0    0    0    0    0    0    139
 3 Glanzel          19   30   53   16   10   22    9   14    9   11    5    3    0    0    0    0    0    0    0    0    0    0    0    0    148
 4 Moed             19   18   16   55   11   20    5   17   14   12    6    4    0    0    0    0    0    0    0    0    0    0    0    0    142
 5 Nederhof          8   10   10   11   31   12    8   13    7    4    4    2    0    0    0    0    0    0    0    0    0    0    0    0     89
 6 Narin            13   20   22   20   12   64   11   20   21   20   11    9    0    0    1    1    0    0    1    1    0    0    0    0    183
 7 Tijssen           5    5    9    5    8   11   22   13   10    5    6    1    0    1    2    1    0    0    0    1    0    0    0    0     83
 8 VanRaan           9    5   14   17   13   20   13   50   13   12   11    6    0    1    2    1    0    0    0    1    0    0    0    0    138
 9 Leydesdorff       7    5    9   14    7   21   10   13   46   18   14    9    1    0    1    1    0    0    0    2    0    0    0    0    132
10 Price             7   14   11   12    4   20    5   12   18   54   10    9    1    1    1    1    0    0    2    0    1    0    1    2    132
11 Callon            2    2    5    6    4   11    6   12   14   10   26    4    0    0    1    1    0    0    0    1    0    0    0    0     79
12 Cronin            0    1    3    4    2    9    1    6    9    9    4   24    1    0    0    0    0    0    0    1    0    1    1    1     53
13 Cooper            0    0    0    0    0    0    0    0    1    1    0    1   30   14    5   11    5    8    6    2    0    0    1    1     56
14 Vanrijsbergen     0    0    0    0    0    0    1    1    0    1    0    0   14   30    7   15    5   13    5    3    1    0    1    1     68
15 Croft             0    0    0    0    0    1    2    2    1    1    1    0    5    7   18    9    6    7    8    6    2    1    2    2     63
16 Robertson         0    0    0    0    0    1    1    1    1    1    1    1   11   15    9   36    7   12   11   10    8    5    4    4    103
17 Blair             0    0    0    0    0    0    0    0    0    0    0    0    5    5    6    7   18    9    4    2    2    2    0    0     42
18 Harman            0    0    0    0    0    0    0    0    0    0    0    0    8   13    7   12    9   31    9    5    5    3    1    1     73
19 Belkin            0    0    0    0    0    1    0    0    0    2    0    0    6    5    8   11    4    9   36    9    9   10   14   10     98
20 Spink             0    0    0    0    0    1    1    1    2    0    1    1    2    3    6   10    2    5    9   21   11    7    5    4     71
21 Fidel             0    0    0    0    0    0    0    0    0    1    0    0    0    1    2    8    2    5    9   11   23   11    9    6     65
22 Marchionini       0    0    0    0    0    0    0    0    0    0    0    1    0    0    1    5    2    3   10    7   11   24   10    5     55
23 Kuhlthau          0    0    0    0    0    0    0    0    0    1    0    1    1    1    2    4    0    1   14    5    9   10   26   14     63
24 Dervin            0    0    0    0    0    0    0    0    0    2    0    1    1    1    2    4    0    1   10    4    6    5   14   20     51
   Sum             118  139  148  142   89  183   83  139  132  132   78   54   56   68   63  102   42   73   98   71   65   55   63   51  2,244

In summary, two problems have to be distinguished: the problem of normalization and the type of matrix to be normalized. In principle, one can normalize both symmetrical and asymmetrical matrices with the various measures. Ahlgren et al. (2003) provided arguments for using the cosine instead of the Pearson correlation coefficient, particularly if one aims at visualization of the structure as in the case of social network analysis or multidimensional scaling (MDS). Bensman (2004) provided arguments regarding why one might nevertheless prefer the Pearson correlation coefficient when the purpose of the study is a statistical (e.g., multivariate) analysis. The advantage of the cosine, namely that it is a similarity measure and not a statistic, then disappears. Formally, these two measures are equivalent, with the exception that the Pearson correlation normalizes for the arithmetic mean while the cosine does not use this mean as a parameter (Jones & Furnas, 1987); the cosine normalizes for the geometric mean. The question remains which normalization one should use when one has only co-occurrence data available.
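The formal relation between the two measures can be checked numerically: Pearson's r is Salton's cosine applied to mean-centered vectors. A minimal sketch (the vectors and function names are invented for illustration, not taken from the article):

```python
# Sketch: Pearson's r equals the cosine of the mean-centered vectors,
# illustrating that the two measures differ only in the subtraction of the mean.
from math import sqrt

def cosine(x, y):
    num = sum(a * b for a, b in zip(x, y))
    return num / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def pearson(x, y):
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    # cosine of the centered vectors
    return cosine([a - mx for a in x], [b - my for b in y])

x = [5, 0, 3, 0, 2]
y = [3, 0, 2, 0, 1]
print(round(cosine(x, y), 3), round(pearson(x, y), 3))  # 0.997 0.994
```

The two values differ slightly because centering changes the angle between the vectors; for vectors with zero mean they coincide exactly.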

The Jaccard Index

In his original article introducing co-citation analysis, Small (1973) suggested the following solution to the normalization problem in Footnote 6:

We can also give a more formal definition of co-citation in terms of set theory notation. If A is the set of papers which cites document a and B is the set which cites b, then A∩B, that is n(A∩B), is the co-citation frequency. The relative co-citation frequency could be defined as n(A∩B) ÷ n(A∪B). (p. 269)

This proposal for the normalization corresponds with using the Jaccard index or its extension (for the nonbinary case) into the Tanimoto index. For a pair of vectors, X_m and X_n, the index is defined as the size of the intersection divided by the size of the union of the sample sets, or in numerical terms:

S(X_m, X_n) = Σ_i x_mi·x_ni / (Σ_i x_mi² + Σ_i x_ni² − Σ_i x_mi·x_ni)

For binary vectors, this expression reduces to n(A∩B) ÷ n(A∪B). The value of S ranges from 0 to 1 (Lipkus, 1999; cf. Salton & McGill, 1983, p. 203f).
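As a minimal sketch (function names are mine, not from the article), the binary Jaccard index and its Tanimoto generalization can be computed as follows; for 0/1 vectors the two coincide:

```python
def jaccard(a, b):
    """Binary Jaccard index: |A ∩ B| / |A ∪ B| for 0/1 vectors."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union

def tanimoto(x, y):
    """Tanimoto index: x·y / (x·x + y·y − x·y) for nonnegative count vectors."""
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    yy = sum(b * b for b in y)
    return xy / (xx + yy - xy)

print(jaccard([1, 1, 0, 1], [1, 0, 0, 1]))  # 2/3: two shared citers, three in the union
```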

In a number of studies (e.g., Egghe & Rousseau, 1990; Glänzel, 2001; Hamers et al., 1989; Leydesdorff & Zaal, 1988; Luukkonen, Tijssen, Persson, & Sivertsen, 1993; Michelet, 1988; Wagner & Leydesdorff, 2005), the Jaccard index and the cosine have systematically been compared for co-occurrence data, but this debate has remained inconclusive. Using coauthorship data, for example, Luukkonen et al. (1993) argued that “the Jaccard measure is preferable to Salton's measure since the latter underestimates the collaboration of smaller countries with larger countries; …” (p. 23). Wagner and Leydesdorff (2005) argued that “whereas the Jaccard index focuses on strong links in segments of the database the Salton Index organizes the relations geometrically so that they can be visualized as structural patterns of relations” (p. 208).

In many cases, one can expect the Jaccard and the cosine measures to be monotonically related (Schneider & Borlund, 2007); however, the cosine measures the similarity between two vectors (by using the angle between them), whereas the Jaccard index focuses only on the relative size of the intersection of the two sets when compared to their union. Furthermore, one can normalize differently using the margin totals in the asymmetrical occurrence or the symmetrical co-occurrence matrix. Luukkonen et al. (1993, p. 18), for example, summed the co-occurrences in their set (of 30 countries) to obtain the denominator, while Small's (1973) definition of a relative co-citation frequency cited above suggested using the sum of the total numbers of occurrences as the denominator. White and Griffith (1981, p. 165) also proposed using “total citations” as values for the main diagonal, but these authors decided not to use this normalization for empirical reasons.

Table 1 illustrates the two options by providing the data for the set under study, adding the total number of citations on the main diagonal and the total number of co-citations as margin totals. For example, using the margin totals for Schubert and Van Raan, respectively, the Tanimoto index is 5/(139 + 132 − 5) = 0.019, while the Jaccard index based on the citations is 5/(60 + 50 − 5) = 0.048. Note that the co-occurrence matrix itself no longer informs us about the number of cited documents: The co-occurrence matrix contains less information than the occurrence matrix (see Footnote 1). However, the total number of citations can be added by the researcher on the main diagonal. One could also consider this value as the search result for the co-citation of “Schubert AND Schubert,” and so on.
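The two normalizations in this worked example can be recomputed directly; the numbers below are taken from the text and Table 1 (the function name is mine):

```python
# Small's relative co-citation frequency, c(A ∩ B) / (c(A) + c(B) − c(A ∩ B)),
# applied with two different choices of denominator totals.
def relative_cocitation(c_mn, total_m, total_n):
    return c_mn / (total_m + total_n - c_mn)

cocit = 5             # co-citations of Schubert and Van Raan (Table 1)
margins = (139, 132)  # co-citation margin totals (symmetrical matrix)
citations = (60, 50)  # total citations (main diagonal / asymmetrical matrix)

print(round(relative_cocitation(cocit, *margins), 3))    # 0.019 (Tanimoto, margin totals)
print(round(relative_cocitation(cocit, *citations), 3))  # 0.048 (Jaccard, total citations)
```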

Note that the value added on the main diagonal of the co-citation matrix corresponds to the margin total of the asymmetrical matrix (i.e., the total number of citations). Therefore, a normalization of the symmetrical matrix using these values on the main diagonal precisely corresponds with using the Jaccard normalization of the asymmetrical occurrence matrix. Hereafter, I distinguish between the two normalizations in terms of the symmetrical and the asymmetrical matrix, respectively. In the latter case, I use the values on the main diagonal; in the former, the margin totals.

Recall that the Jaccard index does not take the shape of the distributions into account, but only normalizes the intersection of two sets with reference to the sum of the two sets. In other words, the cell values are evaluated independently in relation to the margin totals and not in relation to the other cells in the respective rows and columns of the matrix. This insensitivity to the shape of the distributions can be considered as both an advantage and a disadvantage. In the case of the asymmetrical matrix, the Jaccard index does not exploit the full information contained in the matrix; this can be considered a disadvantage. Both the cosine and the Pearson correlation fully exploit this information. In the case of the symmetrical matrix, however, the information about the underlying distributions in the asymmetrical matrix has already been lost. Importing the margin totals of the asymmetrical matrix as values on the main diagonal then adds to the information contained in the symmetrical matrix.

The Jaccard index has this focus on cell values instead of distributions in common with the probabilistic activity index (PAI), which is the preferred measure of Zitt et al. (2000). The PAI is the (traditional) ratio between observed and expected values in a contingency table based on probability calculus:

PAI_mn = c_mn / (c_m · c_n / N) = (c_mn · N) / (c_m · c_n)

where c_mn is the observed co-occurrence frequency of m and n, c_m and c_n are the corresponding margin totals, and N is the grand total of the table.

Like the Jaccard and Tanimoto indices, this index can be applied to the lower triangles of symmetrical co-occurrence matrices, while the Pearson coefficient and the cosine are based on full vectors and thus use the information contained in a symmetrical matrix twice (Hamers et al., 1989; see Footnote 2).
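The observed-over-expected logic of the PAI can be sketched on the lower triangle of a small symmetric co-occurrence matrix (the 3×3 matrix is invented for illustration):

```python
# PAI as the ratio of the observed co-occurrence to its expectation under
# independence: expected(i, j) = margin_i * margin_j / grand_total.
def pai(c, i, j):
    margins = [sum(row) for row in c]
    total = sum(margins)
    expected = margins[i] * margins[j] / total
    return c[i][j] / expected

c = [[0, 4, 1],
     [4, 0, 3],
     [1, 3, 0]]

for i in range(3):
    for j in range(i):  # lower triangle only
        print(i, j, round(pai(c, i, j), 2))
```

Values above 1 indicate co-occurrence more frequent than expected by chance; values below 1, less frequent.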

Results

Table 2 provides the Spearman rank-order correlations among the lower triangles of the various similarity matrices under discussion. Spearman's ρ is used instead of Pearson's r because objects in proximity matrices are based on dyadic relationships (Kenny, Kashy, & Cook, 2006); the assumption of independence required for parametric significance tests is violated (Schneider & Borlund, 2007).
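The comparison procedure described here can be sketched in pure Python: extract the lower triangles of two similarity matrices and correlate their ranks (the tiny matrices are invented for illustration, not the article's data):

```python
# Spearman's rho between the lower triangles of two similarity matrices.
def lower_triangle(m):
    return [m[i][j] for i in range(len(m)) for j in range(i)]

def ranks(v):
    """Average ranks, 1-based, with ties averaged."""
    order = sorted(range(len(v)), key=lambda k: v[k])
    r = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Pearson's r computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    dx = sum((a - mx) ** 2 for a in rx) ** 0.5
    dy = sum((b - my) ** 2 for b in ry) ** 0.5
    return num / (dx * dy)

a = [[1.0, .8, .2], [.8, 1.0, .4], [.2, .4, 1.0]]
b = [[1.0, .9, .1], [.9, 1.0, .3], [.1, .3, 1.0]]
print(round(spearman(lower_triangle(a), lower_triangle(b)), 6))  # 1.0 (same ordering)
```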

Table 2. Spearman correlations among the lower triangles of similarity matrices using different criteria, and both asymmetrical citation and symmetrical co-citation data for 24 authors in both scientometrics and information retrieval.
Spearman's ρ       Pearson   Cosine    Jaccard   Pearson   Cosine    Tanimoto  PAI
                   asymm.    asymm.    asymm.    symm.     symm.     symm.     symm.
Pearson asymm.     1.000      .910**    .909**    .828**    .818**    .904**    .910**
Cosine asymm.       .910**   1.000     1.000**    .834**    .857**    .998**    .983**
Jaccard asymm.      .909**   1.000**   1.000      .834**    .856**    .998**    .983**
Pearson symm.       .828**    .834**    .834**   1.000      .818**    .837**    .823**
Cosine symm.        .818**    .857**    .856**    .818**   1.000      .856**    .848**
Tanimoto symm.      .904**    .998**    .998**    .837**    .856**   1.000      .984**
PAI symm.           .910**    .983**    .983**    .823**    .848**    .984**   1.000

Note. N = 276 in each cell. ** Correlation is significant at the .01 level (two-tailed).

The perfect rank-order correlation (ρ = 1.00; p < .01) between the cosine matrix derived from the asymmetrical citation matrix, and the Jaccard index based on this same matrix, supports the analytical conclusions given earlier about the expected monotonicity between these two measures (Schneider & Borlund, 2007). There are, however, some differences in the values which matter for the visualization. Figures 1 and 2 provide visualizations using these two matrices of similarity coefficients, respectively.

Figure 1.

Cosine-normalized representation of the asymmetrical citation matrix (Pajek; see Footnote 3; Kamada & Kawai, 1989).

Figure 2.

Jaccard-index-based representation of the co-citation matrix using total citations for the normalization (Pajek; see Footnote 3; Kamada & Kawai, 1989).

The cosine remains the best measure for the visualization of the vector space because this measure is defined in geometrical terms. Although the Spearman correlation of the cosine-normalized matrix with the Jaccard index of this same matrix is unity, Figure 2 does not provide the fine structure within the clusters to the same extent as does Figure 1. Figure 3 shows that the Jaccard index covers a smaller range than does the cosine (Hamers et al., 1989). The smaller variance (0.08 vs. 0.21 for the cosine-based matrix) may further limit the discriminating capacity of the measure in visualizations.

Figure 3.

Relation between the Jaccard index and Salton's cosine in the case of the asymmetrical citation matrix (N = (24 × 23)/2 = 276).

In both cases, the analyst can emphasize the separation between the two groups by introducing a threshold. In the case of the Jaccard index, the amount of detail in the relations between the two groups is then lower than in the case of the cosine-normalized matrix. For example, only the two co-citation relations between “Tijssen” and “Croft” pass a 0.05 threshold for the Jaccard index because both these authors have relatively low values on the main diagonal, and therefore in the denominator of the equation, while several other co-citation relations (e.g., the relatively intermediate positions of “Price” and “Van Raan”) remain visible in the case of the cosine normalization and a cosine ≥ 0.05.

The rank-order correlations of both these lower triangles with the Tanimoto index of the symmetrical matrix also are near unity (ρ = 0.998). All correlations with the PAI are slightly lower (ρ < 0.99). The correlations between using the Pearson correlation or the cosine on the asymmetrical and the symmetrical matrices, respectively, are below 0.90. Despite the relatively small differences among the lower triangles, the visualizations (not shown here) are different.

In summary, the cosine-normalized asymmetrical occurrence matrix provides us with the best visualization of the underlying structure. When one is not able to generate an occurrence matrix, the Jaccard index using the values of the total number of citations on the main diagonal for the normalization is the second-best alternative. Table 3 reports the results of using the 12 scientometricians as a subset. The results confirm that the Jaccard index normalized this way leads to results very similar (ρ = 0.995; p < .01; see Table 3) to those of the cosine-normalized occurrence matrix.

Table 3. Spearman correlations among the lower triangles of similarity matrices using different criteria, and both asymmetrical citation and symmetrical co-citation data for the subgroup of 12 scientometricians.
Spearman's ρ       Pearson   Cosine    Jaccard   Pearson   Cosine    Tanimoto  PAI
                   asymm.    asymm.    asymm.    symm.     symm.     symm.     symm.
Pearson asymm.     1.000      .862**    .838**   −.042      .253*     .766**    .912**
Cosine asymm.       .862**   1.000      .995**   −.268*     .114      .966**    .857**
Jaccard asymm.      .838**    .995**   1.000     −.273*     .109      .974**    .842**
Pearson symm.      −.042     −.268*    −.273*    1.000      .682**   −.256*    −.005
Cosine symm.        .253*     .114      .109      .682**   1.000      .069      .190
Tanimoto symm.      .766**    .966**    .974**   −.256*     .069     1.000      .837**
PAI symm.           .912**    .857**    .842**   −.005      .190      .837**   1.000

Note. N = 66 in each cell. * Correlation is significant at the .05 level (two-tailed). ** Correlation is significant at the .01 level (two-tailed).

Conclusions

Leydesdorff and Vaughan (2006) provided reasons for using the asymmetrical matrix underlying the co-occurrence matrix as a basis for multivariate analysis (e.g., MDS, clustering, factor analysis). For the purposes of visualization, the cosine is the preferred measure for the reasons given by Ahlgren et al. (2003); for other statistical analyses, one may prefer to normalize using the Pearson correlation coefficient (Bensman, 2004) or Euclidean distances (in the case of MDS).

If the only option is to generate a co-occurrence matrix, as is often the case in webometric research, the Jaccard index is the best basis for the normalization because this measure does not take the distributions along the respective vectors into account. Like the Jaccard index, the PAI focuses only on the strength of the co-occurrence relation. If available, however, the frequencies of the occurrences, which are conventionally placed on the main diagonal, can be expected to improve the normalization. In the empirical examples, the Jaccard index normalized this way performed as well as the cosine-normalized citation matrices. Remember that the research question concerned which similarity measure to use when the occurrence matrix cannot be retrieved.

Which of the two options for the normalization of the Jaccard index is preferable in a given project depends on the research question and the availability of the data. However, one should be very cautious in using the symmetrical matrix as input to further statistical analysis because the size, and potentially the sign, of a correlation can change when the citation matrix is multiplied by its transpose. Using the Jaccard index with the diagonal values based on the margin totals of the asymmetrical matrix circumvents this problem.

Acknowledgment

I am grateful to Liwen Vaughan and three anonymous referees for comments on previous drafts of this article.

Footnotes

  1. Two symmetrical matrices can be derived from one asymmetrical matrix. Borgatti, Everett, and Freeman (2002) formulated this (in the reference guide of UCINet) as follows: “Given an incidence matrix A where the rows represent actors and the columns events, then the matrix AA′ gives the number of events in which actors simultaneously attended. Hence AA′(i,j) is the number of events attended by both actor i and actor j. The matrix A′A gives the number of events simultaneously attended by a pair of actors. Hence A′A(i,j) is the number of actors who attended both event i and event j” (p. 41).
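This footnote can be illustrated with a small incidence matrix (the matrix is invented for the example):

```python
# For a binary incidence matrix A (rows = actors, columns = events),
# A·Aᵀ counts events shared by pairs of actors, and Aᵀ·A counts actors
# shared by pairs of events.
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

A = [[1, 1, 0],   # actor 0 attended events 0 and 1
     [1, 0, 1],   # actor 1 attended events 0 and 2
     [0, 1, 1]]   # actor 2 attended events 1 and 2

AAt = matmul(A, transpose(A))  # actor-by-actor co-attendance
AtA = matmul(transpose(A), A)  # event-by-event shared attendance
print(AAt[0][1], AtA[0][1])    # prints: 1 1
```

The diagonals of the two products give the margin totals of A (events per actor and actors per event, respectively), which is exactly the quantity placed on the main diagonal of the co-citation matrix in the main text.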

  2. Leydesdorff (2005) discussed the advantages of using information measures for the precise calculation of distances using the same co-occurrence data. Information theory also is based on probability calculus (cf. Van Rijsbergen, 1977). The information measure generates an asymmetrical matrix based on a symmetrical co-occurrence matrix because the distance from A to B can be different from the distance from B to A. The measure thus generates a directed graph, while the measures under discussion here generate undirected graphs. Directed graphs can be visualized using Waldo Tobler's Flow Mapper, available at http://www.csiss.org/clearinghouse/FlowMapper/

  3. Pajek is a software package for social network analysis and visualization which is freely available for academic usage at http://vlado.fmf.uni-lj.si/pub/networks/pajek/
