All PubMed authors are processed with our disambiguation system, and the results are updated weekly. PubMed currently contains more than 22 million citations, comprising 76 million author name instances. About 4 million namespaces are created from all the author name instances, and about 2.5 million namespaces containing multiple citations are subject to name disambiguation. In total, 10.2 million clusters are produced by the disambiguation process, and the average number of citations per individual author is 7.3, similar to figures reported in other studies. We apply a variety of methods to evaluate the disambiguation accuracy of our clustering results: (1) measure the error rate of random pairwise samples; (2) measure the precision and recall of clustering results against manually curated author profiles; (3) measure the split error on gold standard data (self-reference and common-grant pairs).
Previous work (Torvik & Smalheiser, 2009) evaluates disambiguation performance by measuring lumping errors (merging citations by different authors into the same cluster) and splitting errors (separating citations by the same author into multiple clusters). To measure the overall error rate, the pairwise splitting error rate (PSER) and pairwise lumping error rate (PLER) are defined:
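A plausible reconstruction of these definitions (the notation is ours, chosen to be consistent with the error rates reported in Table 5) lets $C$ and $U$ denote the sets of sampled pairs that the algorithm clusters together and leaves unclustered, respectively:

$$\mathrm{PSER}=\frac{|\{(p,q)\in U : p,q \text{ by the same author}\}|}{|U|},\qquad \mathrm{PLER}=\frac{|\{(p,q)\in C : p,q \text{ by different authors}\}|}{|C|}$$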
The overall error rate then combines the splitting and lumping error rates, weighted by the distribution of pairs between the unclustered and clustered groups:
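Under the same notation, a combination that reproduces the totals reported in Table 5 weights each error rate by the fraction of sampled pairs in the corresponding group:

$$\text{Error rate}=\mathrm{PSER}\cdot\frac{|U|}{|C|+|U|}+\mathrm{PLER}\cdot\frac{|C|}{|C|+|U|}$$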
In addition to these error-rate metrics, we also define precision and recall as follows:
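A plausible reconstruction of these definitions, with $T$ denoting the set of sampled pairs whose citations are truly by the same author (this form reproduces the F-scores reported below):

$$\text{Precision}=\frac{|C\cap T|}{|C|}=1-\mathrm{PLER},\qquad \text{Recall}=\frac{|C\cap T|}{|T|},\qquad F=\frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$$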
To evaluate our disambiguation performance and compare it to a state-of-the-art method, over 2 million pairs were sampled from the Authority 2009 data (Torvik & Smalheiser, 2009) and compared against the same set of pairs from our disambiguation results. All possible pairs within a namespace are considered as candidates for sampling: a namespace of size N yields N×(N-1)/2 possible pairs, and each namespace with more than one citation is sampled with probability proportional to its total pair count. A sampled pair is either clustered (C) or unclustered (U) by the clustering algorithm. Our results agree with the Authority 2009 data on about 83% of the samples and make different decisions on the remaining 17%, as shown in Table 4.
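As a concrete illustration of this sampling scheme, the following minimal Python sketch draws pairs with probability proportional to each namespace's pair count; the data layout and function names are our own and are not taken from the paper's implementation.

```python
import random

def sample_candidate_pairs(namespaces, n_samples, seed=0):
    """Draw citation pairs for evaluation.

    `namespaces` maps a namespace key (e.g., a normalized author name)
    to the list of citation IDs it contains.  A namespace of size N
    contributes N*(N-1)/2 candidate pairs and is selected with
    probability proportional to that count, matching the scheme above.
    """
    rng = random.Random(seed)
    keys = [k for k, cites in namespaces.items() if len(cites) > 1]
    weights = [len(namespaces[k]) * (len(namespaces[k]) - 1) // 2 for k in keys]
    samples = []
    for _ in range(n_samples):
        key = rng.choices(keys, weights=weights, k=1)[0]  # pick a namespace
        pair = tuple(rng.sample(namespaces[key], 2))      # then a random pair within it
        samples.append((key, pair))
    return samples
```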
Table 4. Comparing clustering results
| | Our clustering: Clustered (C), 66.6% | Our clustering: Unclustered (U), 33.4% |
|---|---|---|
| Authority 2009: Clustered (C), 80.4% | CC: 65% | CU: 15.4% |
| Authority 2009: Unclustered (U), 19.6% | UC: 1.6% | UU: 18% |
Random samples are extracted from the four groups for validation by human reviewers. To ensure a fair comparison, these samples were re-disambiguated within the same namespaces using only citations dated up to 2009. Each sample is first examined independently by two reviewers, who may draw on any available information source to determine whether the pair of citations belongs to the same author. If both reviewers reach the same decision, the validation ends with that consensus judgment. For samples without consensus, the outcome depends on whether one reviewer can persuade the other to reach agreement. If the reviewers still disagree, the pair is arbitrated as unclustered.
For group CC (pairs clustered in both Authority 2009 and our result) and group UU (pairs unclustered in both), the clustering decisions closely match human judgment. This is confirmed by reviewing 50 random samples from group CC and 50 from group UU. Initially, the reviewers disagreed on 6% and 2% of the samples in the CC and UU groups, respectively; after discussion, the disagreement rate dropped to 2% for the CC group, while the UU group remained unchanged. Measured against the reviewers' labels, the observed error rates of our clustering are 2% in group CC and 6% in group UU, as shown in Table 5.
Table 5. Pairwise error rate by human review
| Category (# of samples) | CC (50) | UU (50) | CU (100) | UC (100) | PSER | PLER | Error rate (splitting + lumping) |
|---|---|---|---|---|---|---|---|
| Authority 2009 | 2% | 6% | 57% | 46% | 9.3% | 12.5% | 11.9% = 1.8% + 10.1% |
| Our clustering | 2% | 6% | 43% | 54% | 23.1% | 3.2% | 9.9% = 7.7% + 2.2% |
In group CU (pairs clustered in Authority 2009 but unclustered in our result) and group UC (pairs unclustered in Authority 2009 but clustered in our result), however, the error level is higher. For the 100 samples in group CU and 100 samples in group UC, the reviewers initially disagreed on 23% and 16% of samples, respectively. Discussion between reviewers reduced the disagreement rates to 5% and 4% for the CU and UC groups, and the remaining cases were resolved through examination and majority voting by all participating reviewers. Based on the reviewers' labels, our clustering outperforms Authority 2009 in group CU while underperforming in group UC. As a result, the PSER is 9.3% for Authority 2009 versus 23.1% for our result, while the PLER is 12.5% for Authority 2009 versus 3.2% for our result. The difference between our clustering and Authority 2009 on the 100 samples in each of the CU and UC groups is examined with a one-tailed sign test (Mendenhall, Wackerly, & Scheaffer, 1989), which gives P values of 0.097 and 0.242 for the CU and UC groups, respectively.
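For reference, the sign test here reduces to a one-tailed binomial test on how often the reviewers sided with one system within a group (ties are discarded). A minimal sketch using SciPy's `binomtest`, with counts implied by the error rates above:

```python
from scipy.stats import binomtest

def sign_test_p(wins, n):
    """One-tailed sign test: P(X >= wins) under X ~ Binomial(n, 0.5),
    i.e., the null hypothesis that both systems are equally likely
    to make the correct decision on a disagreement pair."""
    return binomtest(wins, n, p=0.5, alternative="greater").pvalue

# CU group: reviewers sided with our (unclustered) decision on 57 of 100 pairs.
print(sign_test_p(57, 100))  # ~0.097
# UC group: reviewers sided with Authority's (unclustered) decision on 54 of 100.
print(sign_test_p(54, 100))  # ~0.242
```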
Splitting errors contribute 7.7% and 1.8% to the final error rates for our clustering and Authority 2009, respectively, while lumping errors contribute 2.2% and 10.1%. Because group CU is almost ten times the size of group UC (15.4% versus 1.6% of samples), our clustering reduces the overall pairwise error rate from 11.9% for Authority 2009 to 9.9%. Based on the pairwise error results, we also compute precision and recall scores for both systems: compared to Authority 2009, our clustering improves precision while lowering recall, and the overall F-score increases from 92.2% to 92.9%.
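To make the arithmetic explicit, each group's observed error rate (Table 5) is weighted by that group's share of the samples (Table 4). Splitting errors arise among the pairs a system left unclustered (groups CU and UU for our clustering; UC and UU for Authority 2009), and lumping errors among the pairs it clustered:

$$\underbrace{0.154 \times 43\% + 0.180 \times 6\%}_{\text{splitting} \,\approx\, 7.7\%} + \underbrace{0.650 \times 2\% + 0.016 \times 54\%}_{\text{lumping} \,\approx\, 2.2\%} \approx 9.9\% \quad \text{(our clustering)}$$

$$\underbrace{0.016 \times 46\% + 0.180 \times 6\%}_{\text{splitting} \,\approx\, 1.8\%} + \underbrace{0.650 \times 2\% + 0.154 \times 57\%}_{\text{lumping} \,\approx\, 10.1\%} \approx 11.9\% \quad \text{(Authority 2009)}$$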
In addition to evaluation based on random samples, researchers have also proposed computing precision, recall, and F-scores by comparing reliable reference authorship records with the clusters generated by a disambiguation process (Kang et al., 2009; Ferreira et al., 2010; Levin et al., 2012). Given all the publications by a particular author (the gold cluster), the matching cluster is the computed cluster with the most publications overlapping the gold cluster; if a tie occurs, the cluster with the fewest nonoverlapping publications is chosen. Pairwise precision, recall, and F-score are then defined:
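The matching-cluster rule translates directly to code. In the minimal Python sketch below, clusters are sets of publication IDs; the simple set-overlap precision and recall it computes are one common formulation and stand in as our assumption for the pairwise definitions:

```python
def matching_cluster(gold, clusters):
    """Among the computed clusters, pick the one with the most publications
    overlapping the gold cluster; break ties by the fewest
    non-overlapping publications."""
    return max(clusters, key=lambda c: (len(c & gold), -len(c - gold)))

def precision_recall_f(gold, clusters):
    """Set-overlap precision/recall of the matching cluster against the
    gold cluster (an assumed simplification of the pairwise metrics)."""
    m = matching_cluster(gold, clusters)
    overlap = len(m & gold)
    precision = overlap / len(m)
    recall = overlap / len(gold)
    f = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f
```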
We collected reference authorship information by examining more than 200 researchers in the clinical medicine field from the preliminary lists of Highly Cited Researchers (http://ip-science.thomsonreuters.com/hcr/clinical_medicine.xlsx). Where available, profiles of these researchers were downloaded from various sources, including www.researchgate.net (an online research profile hub built from researchers' own contributions), Harvard Catalyst Profiles (http://catalyst.harvard.edu/, an academic profile database created with the aid of faculty members), COS Pivot (http://pivot.cos.com, a commercial profile database editorially compiled from publicly available content, university websites, and user input), and www.researcherid.com (online profiles managed by researchers, each with a unique identifier). However, these sources also suffer from inaccurate bibliographies for some authors. Institutional databases (e.g., Harvard Catalyst) may not cover publications written at other institutions, and the COS Pivot database splits some authors' publications into separate profiles according to the different institutions with which the same author has been affiliated. Conversely, some authors may be associated with publications they did not write (e.g., more than half of the publications listed under Peter M. Schneider's ResearcherID do not list his name as an author). By comparing and reconciling bibliographic information from these sources, supplemented with the researchers' individual homepages, we created gold standard publication records for 40 highly cited researchers and used them as reference clusters for evaluation.
Using the reference clusters for the 40 researchers, Table 6 compares the precision, recall, and F-scores of our clustering results against Authority 2009. Our clustering results are generated for the namespaces containing the 40 author names, using only citations up to 2009. The average precision of our clustering registers a 0.6% increase over Authority 2009, while the average recall is 0.3% lower; the overall F-score improves by 0.4%. To test whether these differences are statistically significant, we chose the Wilcoxon signed rank test (Neuhäuser, 2012), a nonparametric test that examines the magnitudes and signs of the differences between paired observations; the Student's t-test is not applicable because the samples fail the Shapiro–Wilk normality test (Neuhäuser, 2012). With the one-tailed Wilcoxon test, the P values for precision, recall, and F-score are 0.015, 0.326, and 0.136, respectively. The test confirms a statistically significant precision improvement in our clustering over Authority 2009, while the loss of recall is not statistically significant at the 0.05 level; as a combined effect, the average gain in F-score is also not statistically significant. Additionally, to evaluate the impact of the name compatibility check on disambiguation performance, we repeated the disambiguation for the 40 researchers with first name information removed. Consequently, more lumping errors and fewer splitting errors occur: average precision drops from 95.7% to 87.9%, the F-score drops from 93.4% to 89.0%, and recall increases from 92.3% to 92.7%.
Table 6. Pairwise precision, recall, and F-scores for highly cited researchers
| Researcher name (last name, first name) | Our clustering: Precision | Our clustering: Recall | Our clustering: F-score | Authority 2009: Precision | Authority 2009: Recall | Authority 2009: F-score |
|---|---|---|---|---|---|---|
| Rodriguez Martinez, Heriberto | 0.780 | 1.000 | 0.877 | 0.768 | 1.000 | 0.869 |
| Ter Kuile, Feiko | 0.823 | 1.000 | 0.903 | 0.823 | 1.000 | 0.903 |
| Vincent, Jean Louis | 1.000 | 0.876 | 0.934 | 1.000 | 0.957 | 0.978 |
| Average ± standard deviation | 0.957 ± .057 | 0.923 ± .123 | 0.934 ± .083 | 0.950 ± .071 | 0.925 ± .125 | 0.930 ± .076 |
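A minimal sketch of the significance testing described above, using SciPy; the per-author score vectors (the 40 paired precision, recall, or F-score values) are assumed inputs:

```python
from scipy.stats import shapiro, wilcoxon

def compare_paired_scores(ours, authority):
    """Check normality of the paired differences (Shapiro-Wilk), then run
    a one-tailed Wilcoxon signed-rank test of `ours` > `authority`."""
    diffs = [a - b for a, b in zip(ours, authority)]
    _, normality_p = shapiro(diffs)   # low p: reject normality, so avoid the t-test
    _, wilcoxon_p = wilcoxon(ours, authority, alternative="greater")
    return normality_p, wilcoxon_p
```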
We also collected individual publication records from NIH researchers (many holding fellowships) who had at least two publications before 2009. In total, 47 profiles were collected and employed as gold standard clusters for evaluation. Compared to the Authority 2009 results, the average pairwise precision, recall, and F-scores of our clustering improve by 2.0%, 2.3%, and 2.2%, respectively. Compared to the highly cited researchers, these NIH researchers have substantially fewer publications (mostly fewer than 100 papers each); small gold standard clusters tend to produce higher variance in performance numbers, and the gains are not statistically significant.
In addition to gold standard author profiles, we also evaluated the pairwise error rate with two other gold standard data sets: self-reference pairs and grant participants. By analyzing citation data in PubMed Central (PMC), more than 2.7 million pairs of citations were verified as same-author pairs through self-citation. Some pairs of self-citing papers share multiple common authors, and each common author is converted into a sample pair for its own namespace, yielding 4.7 million pairs in total. These constitute a high-quality gold standard of true positive pairs, since it is rare for an author to cite a different author with the same name, and our name compatibility check automatically removes incompatible author names. Our results fail to cluster 3% of these pairs, while 2.1% are unclustered in Authority 2009. The second gold standard data set is created by analyzing over 173 thousand distinct grants appearing in over 1.3 million PubMed citations. Applying the name compatibility check to papers sharing common grant information yields about 23 million pairs of common-grant citations, one for each compatible common author; for a pair of citations sharing multiple grants, only one pair is created. Authority 2009 splits 7.4% of these pairs, while our result splits 8.5%.
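As an illustration of the common-grant pair construction, the sketch below pairs citations that share a grant and emits one sample per compatible common author name; the data structures and the `compatible` predicate are stand-ins for the paper's actual pipeline, not its real data model.

```python
from itertools import combinations

def common_grant_pairs(grant_to_citations, authors_of, compatible):
    """Build gold-standard same-author pairs.

    `grant_to_citations` maps a grant ID to the citation IDs it funds;
    `authors_of(cid)` returns the author names on a citation; and
    `compatible(a, b)` is the name compatibility check described earlier.
    Using a set de-duplicates pairs of citations that share several grants.
    """
    pairs = set()
    for grant, citations in grant_to_citations.items():
        for c1, c2 in combinations(sorted(citations), 2):
            for a in authors_of(c1):
                for b in authors_of(c2):
                    if compatible(a, b):  # compatible author name on both papers
                        pairs.add((c1, c2, a))
    return pairs
```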
Both self-reference and common-grant information are high-quality features indicating a same-author relationship; these data sets therefore exhibit much lower splitting error rates than the random samples. In the self-reference data, a pair of papers strongly tends to share a common topic as well as coauthors. In large grants, by contrast, different citations by the same author may be published with completely different collaborators and share fewer features (e.g., different affiliations), which can lead to lower pairwise similarity and higher splitting error rates.