Sneaked references: Cooked reference metadata inflate citation counts

We report evidence of an undocumented method to manipulate citation counts involving 'sneaked' references. Sneaked references are registered as metadata for scientific articles in which they do not appear. This manipulation exploits trusted relationships between various actors: publishers, the Crossref metadata registration agency, digital libraries, and bibliometric platforms. By collecting metadata from various sources, we show that extra undue references are actually sneaked in at Digital Object Identifier (DOI) registration time, resulting in artificially inflated citation counts. As a case study, focusing on three journals from a given publisher, we identified at least 9% sneaked references (5,978/65,836) mainly benefiting two authors. Despite not existing in the articles, these sneaked references exist in metadata registries and inappropriately propagate to bibliometric dashboards. Furthermore, we discovered 'lost' references: the studied bibliometric platform failed to index at least 56% (36,939/65,836) of the references listed in the HTML version of the publications. The extent of the sneaked and lost references in the global literature remains unknown and requires further investigations. Bibliometric platforms producing citation counts should identify, quantify, and correct these flaws to provide accurate data to their patrons and prevent further citation gaming.


Introduction
It is now well recognised that the Publish or Perish atmosphere fuels questionable research practices (Crous, 2019). The introduction and widespread adoption of computed indicators (h-index, impact factor, etc.) has led academics to a situation where publishing is not enough and being cited is crucial. In this world of Be Cited or Perish, motivations for citation manipulations are on the rise (Lawrence, 2007). Possibilities of such manipulations have been documented by whistleblowers and researchers alike (Baccini, De Nicolao, & Petrovich, 2019; Haley, 2017). Beel and Gipp (2010) experimented with hiding citations from human eyes by using 'white on white' text. Labbé (2010) achieved h-index manipulation through injection of meaningless texts containing a fixed set of references. Delgado López-Cózar, Robinson-García, and Torres-Salinas (2014) reproduced the previous experiment, demonstrating how the h-index and impact factors of real researchers and journals can be manipulated. It is worth noting that some editorial practices may not be too far away from this type of manipulation: a seemingly legitimate editorial could cite all articles from a journal, thereby increasing its Impact Factor (e.g., Foley & Valkonen, 2012; Heathers & Grimes, 2022). Another method is the so-called 'citation cartel' method (Franck, 1999). As part of the cartel, you cite specific authors who will cite you in return. This kind of manipulation also arises at the journal level (Davis, 2016; Kojaku, Livan, & Masuda, 2021). Another example is called 'citation plantation' and refers to undue over-citation of certain authors, even on unrelated topics. Last but not least, one of the most famous and common methods is the addition of references through the peer-review process. At review time, authors may be asked by reviewers and editors to add undue references to their submission. Whistleblowers and academic sleuths often try to detect citation manipulations through skews in citation (or self-citation) data (Szomszor, Pendlebury, & Adams, 2020; Van Noorden, 2020b; Wren & Georgescu, 2022).
As the motivation for and practice of citation manipulation gain traction, the consequences of such a practice are starting to become visible in academia. From time to time, highly cited researchers are banned from editorial boards (Van Noorden, 2020a) because of their unethical practice of trading citations for manuscript acceptance. In 2021, Clarivate excluded 300 researchers from its Highly Cited Researchers list, and about 550 in 2022 (Oransky, 2022). This decision was taken based on evidence of citation manipulation. Another example: some malevolent individuals forge hijacked journals by imitating current or defunct journals (Abalkina, Cabanac, Labbé, & Magazinov, 2022). They publish non-reviewed papers that cite other papers, generating potential undue citations. Some manage to get these indexed by Elsevier's Scopus, a bibliometric platform that computes author-level indicators for research assessment (Baas, Schotten, Plume, Côté, & Karimi, 2020).
It is worth pointing out that citation manipulation by various actors occurs at many places and at different times during the life cycle of a scientific publication. Up until now, the documented manipulations always implied modifications of the version of record (Hinchliffe, 2022) (i.e., the real article available in PDF/HTML in its final version) by adding references to it. In this paper, we document a new flaw that is currently exploited: sneaking undue references during the DOI registration by supplying extra and irrelevant metadata.
The scientific publication itself, namely the version of record, remains unaltered and undue citations are actually unreachable by readers. We provide evidence that this manipulation is in use, as we discovered it in at least three journals of an open access publisher. This exploit will remain available as long as the metadata pushed by publishers are not carefully verified.

The exploit: Increased citation counts with sneaked references
From a paper's bibliography to bibliometric dashboards, the path is long for references to be counted. Different actors using various deception techniques can sneak undue references in along this path.

Context: the DOI and metadata registration process
As Figure 1 shows, after acceptance and before publication of papers, publishers register DOIs with Registration Agencies. The main one is Crossref, which mints DOIs for a fee and hosts the publishers' metadata that become publicly available (Hendricks, Tkaczyk, Lin, & Feeney, 2020). Most publishers push the reference lists of their papers as part of the registered metadata (Singh Chawla, 2022). Crossref is then used as a source by multiple platforms such as SpringerLink, The Lens (Penfold, 2020), or Dimensions (Herzog, Hook, & Konkiel, 2020). Bibliometric platforms source from the metadata registered at Crossref, inter alia, to report indicators at the individual/institutional/journal levels, such as citation counts, impact factors, and h-indices. When registering a new publication and its references at Crossref, a publisher may sneak extra undue references into the metadata sent, in addition to the ones originally present. Then, digital libraries (e.g., SpringerLink) and bibliometric platforms (e.g., Dimensions) harvest these metadata, undue citations included. These sneaked references are processed and counted even if they are not present in the original publication.
This new way to manipulate citation counts relies on metadata manipulations that leave the original text untouched. This exploit is made possible because Crossref trusts publishers to extract, report, and send metadata about the publications, including the references. As a matter of fact, the accuracy of the metadata provided by publishers is not controlled by Crossref, which creates a 'security breach' within the information flow. The next section shows that this manipulation is actually in use.

Case study: Evidence of sneaked references in three journals of a given publisher
To provide evidence of citation count manipulation, one needs to collect samples of metadata at three different places along the reference registration path depicted in Figure 1. Sneaked references are revealed when comparing the reference lists of publications as provided 1) by the publisher on its website, 2) on the metadata registry at Crossref, and 3) by a bibliometric platform: Dimensions.
As proof of the 'sneaked references' manipulation happening, let us analyse three journals published by Technoscience Academy, an Indian open access publisher and Crossref member. These three journals were selected after we identified incoherent metadata that we flagged in May 2022 on PubPeer (Figure 2). This case involves a Hindawi journal article published on 22 March 2022. The Hindawi website showed a large number of citations (n = 107) for a publication that had been online for less than two months. On the screenshot in Figure 2, the number 107 stems from Altmetric, a service offered by Dimensions that sources data from publishers and Crossref. Moreover, this number was far greater than the number of downloads (n = 62). Together, these two observations led us to suspect that manipulation was going on.
Further examination revealed that this Hindawi publication had no citations on Google Scholar. According to Dimensions, citations stemmed mostly from three main journals with 1,000+ DOIs registered at Crossref. After careful verification, the citing publications did not contain any references to the Hindawi article. This is clear evidence that some references registered at Crossref (for the citing publication) do not exist in reality. We assess the extent of the discrepancy between the bibliographies of 1) the published papers and 2) the metadata that were registered, hypothesising that these two sets of references should be identical except for undue sneaked references.

Method to assess the extent of sneaked references
This section introduces a two-step method to measure differences between reference lists. First, we collect metadata about a publisher's catalogue from three sources: the publisher's website, Crossref, and Dimensions. Second, we compare the reference lists as they appear in these three sources. We illustrate this method with the three largest journals published by Technoscience Academy and report numbers as of January 2023.

Collecting metadata from Crossref
Crossref releases the list of DOIs they mint by journal and by publisher in the Crossref Depositor. For example, the Depositor lists the DOIs of the journals registered for Technoscience Academy. We retrieved the reference list of each publication by querying the Crossref API. For instance, https://api.crossref.org/works/10.32628/IJSRST229212 provides the metadata of publication doi:10.32628/IJSRST229212, including an attribute called reference-count. For this particular example, Crossref provided a list of 47 references (Figure 3).
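The Crossref query described above can be sketched in Python as follows. This is a minimal illustration rather than the code used for the study: the function names are ours, and we assume the usual layout of a Crossref REST response, in which the record sits under a `message` key holding the `reference-count` attribute and a `reference` list.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi):
    """Build the REST URL for one DOI (the slash inside the DOI is kept)."""
    return CROSSREF_WORKS + quote(doi, safe="/")


def registered_references(payload):
    """Return (reference count, reference list) from a parsed API response."""
    message = payload.get("message", {})
    refs = message.get("reference", [])
    # Fall back to the list length when the counter attribute is absent.
    count = message.get("reference-count", len(refs))
    return count, refs


def fetch_work(doi, timeout=30.0):
    """Download and parse one Crossref record (requires network access)."""
    with urlopen(crossref_url(doi), timeout=timeout) as response:
        return json.load(response)
```

A call such as `registered_references(fetch_work("10.32628/IJSRST229212"))` would return the count and the list currently registered for the running example; the value may differ from the 47 references observed in January 2023 if the metadata have since been updated.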

Metadata collection from the Publisher's web site
We retrieved the reference list of each publication identified in the previous section. Without any available API to retrieve metadata from the publisher, this step is specific to each journal. The journal articles Technoscience Academy publishes are in open access: available in both PDF and HTML. We assumed that the reference lists provided in HTML conformed to the ones present in the PDF files, and verified this by visual inspection of a dozen cases. HTML pages feature a tab with the list of references that we collected via ad hoc scripts.
The paper of our running example (doi:10.32628/IJSRST229212) has seven references shown in the PDF and on the HTML page (Figure 4). The references listed in HTML are also found in Crossref. But an additional set of 40 well-formed references turn out to be undue references to unrelated publications. This set comprises sneaked references that might have been added at registration time.

Metadata collection from Dimensions
Dimensions provides registered accounts for free, allowing users to query their database and export results up to 5k publication records. We used the 'Publisher' filter of Dimensions to collect the metadata of all papers published by Technoscience Academy and exported results using the 'Export for bibliometric mapping' feature. The export came as a CSV file of 3,634 publication records. One of the columns contains the reference list for each paper, as recorded by Dimensions.
According to this file, the article of the running example (doi:10.32628/IJSRST229212) has 13 references, to be compared to seven in HTML and 47 registered at Crossref. Visual inspection of the references found at Dimensions (Figure 3) reveals that none of these 13 references are from the original set of seven references (PDF and HTML, Figure 4).
Along the registration process, the seven original references were replaced by 13 undue sneaked references. The original version of the publication lists seven references; it was registered at Crossref with 40 undue sneaked references. Finally, Dimensions reports 13 references for this paper, all sneaked. The seven original references appearing in HTML/PDF got lost along the path.
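Counting the references that Dimensions records for each paper can be sketched as below. This is a hedged illustration only: the column names `DOI` and `Cited references` and the `;` separator between references are assumptions about the layout of the export, not details given in this article, so they are exposed as parameters and should be adapted to the actual file.

```python
import csv
import io


def reference_counts(csv_text, doi_column="DOI",
                     refs_column="Cited references", separator=";"):
    """Map each DOI to the number of references listed in the export.

    Column names and separator are assumptions about the export layout.
    """
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = (row.get(refs_column) or "").strip()
        # Ignore empty fragments, e.g. from a trailing separator.
        refs = [r for r in raw.split(separator) if r.strip()]
        counts[row[doi_column].lower()] = len(refs)
    return counts
```

DOIs are lower-cased because they are case-insensitive and appear with different capitalisation across sources. A reference whose text itself contains the separator would be over-counted, so the result is best treated as an estimate.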

Detecting sneaked and lost references
Tracing the propagation of individual references from one platform to another proves quite challenging due to the variability of reference formatting (e.g., APA, MLA, Chicago). We decided to examine and compare the number of references to estimate inconsistencies between the size of the reference list in HTML/PDF versions and the registered metadata.
For each publication $p$, let $R_C^p$ (respectively $R_D^p$) be the number of references registered at Crossref (respectively Dimensions) and $S^p$ the number of references shown in the PDF or HTML versions. Then $\delta_x^p = R_x^p - S^p$, with $x \in \{C, D\}$, estimates inconsistencies. The value $\delta_C^p$ (respectively $\delta_D^p$) reflects inconsistencies between the references registered at Crossref (respectively Dimensions) and those present in HTML/PDF for publication $p$. Let us interpret $\delta_x^p$:

• $\delta_x^p = 0$ indicates that, for publication $p$, the number of references registered in $x$ equals the number of references listed in its PDF/HTML version. However, $\delta_x^p = 0$ does not guarantee that the registered references are the same as the references in the PDF/HTML.

• $\delta_x^p < 0$ reveals lost references: some are present in the publication $p$ but are not registered. In that case, $|\delta_x^p|$ is a lower bound of the number of lost references.

• $\delta_x^p > 0$ is the lower bound of the number of sneaked references for publication $p$.
Let us illustrate the 'lower bound' nuance on the running example: $p$ = IJSRST229212. The number of sneaked references is underestimated when computing $\delta_D^p = R_D^p - S^p = 13 - 7 = 6$, in comparison with the exact number of sneaked references, which is equal to 13 (see Figure 3 and Figure 4). In that example, since $\delta_D^p > 0$, we cannot conclude from the counts alone that references are lost. However, comparing the content of the reference lists allows us to see that all seven references of the HTML/PDF version are lost (see Figure 3 and Figure 4). We can therefore see that $\delta_D^p$ also underestimates the number of lost references.
For a particular set $A$ of journal articles, three publication subsets can be distinguished:
• The subset OK, noted with a checkmark ✓, contains publications for which $\delta_x^p = 0$.
• The subset Sneaked, noted with a ghost symbol, contains publications for which $\delta_x^p > 0$, where we have evidence that references have been sneaked.
• The subset Missing, noted with a skull symbol, contains publications for which $\delta_x^p < 0$, where we have evidence that references are lost.
For a set $A$, we can compute $\Delta_x^{\text{sneaked}}$ (respectively $\Delta_x^{\text{lost}}$), the overall lower bound of sneaked (respectively lost) references, as the sum over $p \in A$ of the positive (respectively negative) $\delta_x^p$:

$$\Delta_x^{\text{sneaked}} = \sum_{p \in A,\ \delta_x^p > 0} \delta_x^p \qquad\qquad \Delta_x^{\text{lost}} = \sum_{p \in A,\ \delta_x^p < 0} |\delta_x^p|$$

It is also possible to see if references found in publications of the Sneaked set benefit a few people or a few journals in particular. We detail the results of our analysis below.
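The indicators defined in this section can be sketched in a few lines of Python. This is an illustration under our own naming (the class `Counts` and the functions below are hypothetical, not taken from the released code); the lost-reference bound is accumulated as a positive count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    shown: int       # number of references in the HTML/PDF version
    registered: int  # number of references registered in source x


def delta(c):
    """Per-publication indicator: registered minus shown."""
    return c.registered - c.shown


def status(c):
    """Classify a publication as 'OK', 'Sneaked', or 'Missing'."""
    d = delta(c)
    if d > 0:
        return "Sneaked"
    if d < 0:
        return "Missing"
    return "OK"


def lower_bounds(corpus):
    """Return (sneaked, lost) lower bounds over a dict of DOI -> Counts."""
    deltas = [delta(c) for c in corpus.values()]
    sneaked = sum(d for d in deltas if d > 0)
    lost = sum(-d for d in deltas if d < 0)
    return sneaked, lost
```

For the running example in Dimensions, `delta(Counts(shown=7, registered=13))` gives 6 and the status `Sneaked`, even though all 13 registered references are sneaked and all seven original ones are lost: the counts only bound the problem from below.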

Quantitative analysis
The lower bounds of sneaked ($\Delta_x^{\text{sneaked}}$) and lost ($\Delta_x^{\text{lost}}$) references for the articles of the three journals presented previously are given in Table 1 and Table 2. Data were collected from three different sources (publisher's website, Crossref, and Dimensions). Differences observed between HTML/PDF and Crossref ($x = C$) are shown in Table 1, whereas Table 2 shows the differences between HTML/PDF and Dimensions ($x = D$).
In Table 1, an article is counted in the Sneaked set if the reference list in HTML/PDF is shorter than the one found at Crossref ($\delta_C^p > 0$). Among the 3,506 articles published by these three journals, at least 230 articles are registered with more references than they should have. $\Delta_C^{\text{sneaked}}$ = 5,978 is the lower estimate of the total number of references that were unduly sneaked at registration time. This represents an increase of 9.8% over the original set of references (60,635). Out of the 65,836 references that were registered, 9.1% (5,978/65,836) are therefore sneaked. In addition, for 73 articles some references were missing (status Missing), and in total, at least 777 references are missing in Crossref. This represents a decrease of 1.2% (777/60,635).
Table 2 compares the sizes of the reference lists in HTML/PDF and in Dimensions. For the vast majority of publications, some references are missing. This is the case for 3,184 articles (status Missing) out of the total of 3,506. For these publications, some references can be seen in HTML/PDF but are not registered in Dimensions. In total, at least 40.7% (24,712/60,635) of the original references are missing in Dimensions. For 120 publications, more references can be found in Dimensions than in the HTML version (status Sneaked). In total, at least 2.7% (1,016/36,939) of the references registered in Dimensions for these journals are undue sneaked references.
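The totals above can be cross-checked with a little arithmetic. Summing the per-publication differences over all articles gives the number of registered references minus the number of shown references, which must equal the sneaked total minus the lost total. The sketch below, with names of our own choosing, applies this check to the figures reported in this section and recomputes the shares.

```python
# Totals reported in this section (January 2023 snapshot).
SHOWN = 60_635  # references listed in the HTML/PDF versions

CROSSREF = {"sneaked": 5_978, "lost": 777, "registered": 65_836}
DIMENSIONS = {"sneaked": 1_016, "lost": 24_712, "registered": 36_939}


def expected_registered(shown, sneaked, lost):
    """Registered total implied by the shown, sneaked, and lost totals."""
    return shown + sneaked - lost


def share(part, whole):
    """Percentage of `part` in `whole`."""
    return 100.0 * part / whole


for name, source in (("Crossref", CROSSREF), ("Dimensions", DIMENSIONS)):
    total = expected_registered(SHOWN, source["sneaked"], source["lost"])
    print(
        name,
        "balance holds:", total == source["registered"],
        f"| sneaked share of registered: {share(source['sneaked'], source['registered']):.2f}%",
        f"| lost share of shown: {share(source['lost'], SHOWN):.2f}%",
    )
```

Both balances hold for the reported figures, which suggests the counts in the two tables are internally consistent.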

Qualitative analysis
To understand the discrepancies highlighted above, we decided to closely inspect some examples of problematic cases. In particular, we decided first to inspect the cases displaying significantly large discrepancies. For instance:
• doi:10.32628/ijsrset21852 has 150 references in its HTML version but 300 are registered in Crossref. We noticed that the reference list is duplicated. Only 114 references can be found in Dimensions. Among the 186 = 300 − 114 missing references, an example is a reference claimed to be a technical report from the Liverpool John Moores University, UK by Younis & Kifayat which, after verification, is not indexed by Dimensions (but is indexed in Google Scholar).
• doi:10.32628/ijsrst229394 lists 27 references in HTML/PDF but 108 = 4 × 27 were registered in Crossref. We noticed that the same set of 27 references was registered four times. Nevertheless, only 19 references can be found in Dimensions, such that eight references are missing.
From these examples, we can conclude that lost references (status Missing) may often result from a failure to attach a given reference to a citable item because of incomplete or erroneous registered metadata in Crossref. Notably, some types of references are, by definition, not indexed in Dimensions (private correspondence, songs, etc.). We can also conclude that some of the sneaked references may be due to careless management of metadata resulting in such erroneous registrations. These duplications, however, do not seem to propagate to Dimensions: at most one occurrence of the duplicated references was listed.
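Duplicated registrations such as the two cases above can be told apart from other surplus references by counting repeated entries in a registered list. The following sketch assumes each reference is available as a plain string; the normalisation (lower-casing and collapsing whitespace) is a simplification of ours and will miss duplicates that differ in formatting.

```python
from collections import Counter


def normalise(ref):
    """Lower-case a reference string and collapse runs of whitespace."""
    return " ".join(ref.lower().split())


def duplication_report(refs):
    """Return (distinct references, surplus copies) for a registered list."""
    counts = Counter(normalise(r) for r in refs)
    distinct = len(counts)
    surplus = sum(n - 1 for n in counts.values())
    return distinct, surplus
```

A list in which every reference was registered four times yields a surplus of three times the number of distinct references. A positive difference between registered and shown references that is fully explained by this surplus points to careless registration rather than to references benefiting third parties.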
However, not all sneaked references can be explained by careless metadata registration, as can be seen in the following example. The article doi:10.32628/ijsrst229154 has an HTML/PDF version that lists 23 references. However, 63 can be found in Crossref and 33 in Dimensions. An analysis of the 10 sneaked references in Dimensions reveals that they mainly benefit two authors (Rao & Kataria). It therefore seems that additional references may be sneaked to benefit specific scholars. To verify this hypothesis, we computed the most frequent words in Crossref's metadata for papers identified as containing sneaked references. This analysis reveals that undue sneaked references mostly benefited two scholars and a few journals published by Technoscience Academy. It is worth noting that Abalkina et al. (2022) identified the last two journals in the resulting list of beneficiaries as 'hijacked'.
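A word-frequency analysis of this kind can be sketched as follows. This is not the script used for the study: we assume that each registered reference is a dictionary as returned by the Crossref API, carrying some of the text fields named in `TEXT_FIELDS`, and the function name and thresholds are ours.

```python
import re
from collections import Counter

# Text-bearing fields that Crossref reference entries may carry.
TEXT_FIELDS = ("unstructured", "author", "journal-title", "article-title")


def frequent_terms(references, top=10, min_length=4):
    """Return the `top` most frequent words across reference entries."""
    words = Counter()
    for ref in references:
        text = " ".join(str(ref.get(field, "")) for field in TEXT_FIELDS)
        # Keep alphabetic tokens (any script); drop digits and punctuation.
        tokens = re.findall(r"[^\W\d_]+", text.lower())
        words.update(t for t in tokens if len(t) >= min_length)
    return words.most_common(top)
```

Applied to the references of papers in the Sneaked set, recurring surnames and journal names should surface among the most frequent terms, which can then be checked by hand.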

Discussions: Outcomes and possible countermeasures
Crossref, the largest DOI registration agency, provides metadata to many downstream consumers, such as Dimensions, The Lens, or SpringerLink. The numbers provided by these downstream services guide funding decisions and state policies. Our results shed light on flawed metadata affecting reference registration and, in turn, citation counts. We have identified a new source of quality problems: undue references sneaked at metadata registration time. To the best of our knowledge, the vulnerability we discovered is the first documented exploit of metadata that does not modify the underlying PDF/HTML article. Our analysis highlights that the problems may arise from various origins, ranging from publishers' careless management of metadata to potential citation count manipulations. We indeed observed artificially inflated citation counts that seem to mostly benefit specific scholars or scientific journals. The metadata registration process is vulnerable: it was and is likely to be abused by various actors (authors, journals, publishers) to unduly inflate their citation counts. Additionally, this vulnerability, if exploited, may harm other scholars who will not obtain their deserved citations.
To prevent exploits of this vulnerability affecting the computation of citation counts, many actions and countermeasures exist. The most trivial ones imply the three key actors (see Figure 1) checking each other's metadata:
• Publishers and Crossref should check and compare the coherence of references registered and the ones actually present in publications (PDF/HTML).
• Bibliometric platforms and Crossref should check on each other to make sure that citation counts are coherent with registered metadata.
• Bibliometric platforms and publishers should check on each other, to ensure that citations credited to articles are indeed supported by the associated references in the citing publications.
A more extensive countermeasure would involve third parties independently auditing the whole process: from checking the metadata uploaded into metadata registration agencies to checking the validity of citation counts. The Initiative for Open Citations (I4OC) currently estimates that 99% of all citations in the literature are pushed to Crossref (Schiermeier, 2017; Shotton, 2013; Singh Chawla, 2022). Open and free access to APIs at various steps of the process is required to enable third parties to check the global quality of the provided data.
A curative action is also needed. The COPE (2019) guidelines on citation manipulation should account for the exploits that we have introduced in this article. We believe it is important to issue guidelines to specify the appropriate reporting and editorial actions regarding such cases of exploits and manipulations. On top of correcting science and the scholarly literature in due time (Besançon, Bik, Heathers, & Meyerowitz-Katz, 2022), extra attention must be given to correcting erroneous reference metadata.

Conclusion
This article showed evidence of an undocumented vulnerability affecting the process of metadata registration for academic works. Despite being absent from the Version of Record (in HTML/PDF), sneaked references exist in the metadata, which in turn inflates citation counts unduly. The method we proposed estimates lower bounds for the number of references that were lost and sneaked in. Through a case study, we show that this vulnerability is actually exploited. One still needs to apply this method to the entire literature to estimate the extent of the 'sneaked/lost references' issue at the global scale.
Our work questions the quality and veracity of the reference metadata harvested at Crossref and used by bibliometric platforms, such as Dimensions. These metadata support commercial bibliometric services and inform influential rankings of institutions and individuals. All actors involved should be held accountable for the quality of the data they provide and trade. We believe they must prevent metadata abuse, keeping in mind the inherent drawbacks of the extensive use of citation metrics, which fuels elaborate cheating schemes.

Supporting information
We release supplementary materials for reproducibility purposes and future scientific literature screening. The code developed to collect and analyze the data reported in this article is archived at Zenodo (https://doi.org/10.5281/zenodo.8388930).

Fig. 2. PubPeer post https://pubpeer.com/publications/A172115FC8D0A5F44B31A18B08BB26 reporting a Hindawi journal article with more citations than downloads. Most citations appear not to match any of the references in the allegedly citing publications. After careful examination, it appeared that these were sneaked references: existing in the metadata only and not in the PDFs of the allegedly 'citing' publications.

Fig. 4. Reference list in PDF (left) and in HTML (right) versions of doi:10.32628/IJSRST229212. In this case, the PDF and HTML versions match each other, which is expected.

• 'J. Nageswara Rao' benefited from 3,103 extra citations.
• 'Bhavesh Kataria' benefited from 1,564 extra citations.
• The International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET) gained 826 extra citations.
• The International Journal of Advanced Science and Technology (IJAST) was unduly cited 537 times.
• The Turkish Journal of Physiotherapy and Rehabilitation appeared 428 times in sneaked references.

Table 1
Statistics on the Technoscience Academy corpus showing the discrepancies between the references found in the versions of record (HTML/PDF) and the ones registered at Crossref.

Table 2
Statistics on the Technoscience Academy corpus showing the discrepancies between the references found in versions of record (HTML/PDF) and the ones registered in Dimensions.