Why are these publications missing? Uncovering the reasons behind the exclusion of documents in free‐access scholarly databases

This study analyses the coverage of seven free-access bibliographic databases (Crossref, Dimensions—non-subscription version, Google Scholar, Lens, Microsoft Academic, Scilit, and Semantic Scholar) to identify the potential reasons that might cause the exclusion of scholarly documents and how they could influence coverage. To do this, 116 k randomly selected bibliographic records from Crossref were used as a baseline. API endpoints and web scraping were used to query each database. The results show that coverage differences are mainly caused by the way each service builds its database. While classic bibliographic databases ingest almost exactly the same content as Crossref (Lens and Scilit miss 0.1% and 0.2% of the records, respectively), academic search engines present lower coverage (Google Scholar misses 9.8%, Semantic Scholar 10%, and Microsoft Academic 12%). Coverage differences are mainly attributed to external factors, such as web accessibility and robot exclusion policies (39.2%–46%), and internal requirements that exclude secondary content (6.5%–11.6%). In the case of Dimensions, the classic bibliographic database with the lowest coverage (7.6% missing), internal selection criteria such as the indexation of full books instead of book chapters (65%) and the exclusion of secondary content (15%) are the main reasons for missing publications.


| INTRODUCTION
Scholarly bibliographic databases are key elements supporting the advance of science because they provide updated information about past scientific developments that makes it possible to contrast current discoveries. Many of these products index the cited references included in publications to enlarge the discovery of information and to assess the influence of those records. Web of Science and Scopus are traditional citation indexes that gather bibliographic records from a selected list of sources, mainly scholarly journals.
However, the advent of the Web in 1989 brought a transformation of the publishing model (Borgman & Furner, 2002) and, consequently, a new way to gather publications and measure citations. Launched in 1997, CiteSeer was the first academic search engine to use a crawler to harvest electronic publications, extracting and computing citations between publications (Fiala, 2011). This model served as the basis for subsequent developments such as Google Scholar (Delgado López-Cózar et al., 2019) and Microsoft Academic (Wang et al., 2019, 2020). Search engine-based bibliographic information systems tend to provide more comprehensive document coverage than traditional selective systems, due to the digital transformation of the publishing system and the proliferation of repositories and web platforms (Ortega, 2014).
Currently, new hybrid models combining publications gathered both through traditional curation and web crawling have been released, thanks to improvements in the harvesting, storage, and processing of bibliographic data, as well as the free release of citation metadata (OpenCitations) (Ortega, 2021). The free version of Dimensions (Herzog et al., 2020; Orduña-Malea & Delgado-López-Cózar, 2018; Thelwall, 2018), Lens (Penfold, 2020), and Scilit are new hybrid information services (referred to here as free-access databases) that are facilitating the discovery of the scientific literature as well as providing new analytic tools and bibliometric indicators (i.e., altmetrics, field-normalized metrics, usage-based metrics).
The proliferation of new free-access scholarly databases has fostered many studies comparing their coverage and overlap (see Section 2) to help practitioners, metaresearchers, and scholars select the most appropriate databases for systematic literature reviews, meta-analyses, bibliometric analyses, or literature searches (Bramer et al., 2017; Gusenbauer & Haddaway, 2020). As Mongeon and Paul-Hus (2016) state, "the validity of bibliometric analyses for research evaluation lies in large part on the databases' representativeness of the scientific activity studied." An incorrect selection of scholarly databases might yield incomplete or misleading results and false conclusions.
However, most of these analyses are based on the direct comparison of one database with the others. In our opinion, this procedure could carry over biases from the original database and distort the assessment of those sources' coverage. This study develops a different approach, using a random sample from a non-selective service (Crossref) to compare the coverage of different scholarly databases. We hypothesize that using a third-party service, that is, using a third database to compare the coverage of the other two, would reduce possible biases in the comparison of databases, and would also reveal how selection criteria and technical requirements influence the coverage of the scientific literature.

| LITERATURE REVIEW
As information products, scholarly databases can be evaluated along different dimensions (e.g., search and results interface, quality of data, exporting capabilities), where coverage is one of the most important parameters to test their bibliometric capabilities. Coverage is measured not only to test a database's power to find and index scientific literature, but also to check its completeness and to detect potential biases. Coverage can be measured in two different ways: measuring indexed documents and measuring cited documents (coverage via citations). Comparative analyses can be carried out by applying different methods (e.g., direct comparisons, third-party comparisons). The most relevant literature on the coverage of free-access bibliographic databases is discussed below.

| Coverage via direct comparisons
One way to test coverage biases is a direct comparison between databases, with the aim of identifying the most appropriate database according to disciplines, document types, or languages.
Due to the scarce availability of data, the first studies on the topic focused on cited documents (citations) as a proxy for coverage (Bakkalbasi et al., 2006; Levine-Clark & Gil, 2008; Meho & Yang, 2007). All of them concluded that Google Scholar surpassed Web of Science and Scopus. Kousha et al. (2011) demonstrated that Google Scholar captured more citations to books and book chapters than traditional citation indexes, while Adriaanse and Rensleigh (2013) warned that the higher citation counts of Google Scholar could be due to duplicated records, and that other errors might occur due to the uncontrolled nature of the database (Orduña-Malea et al., 2017). The appearance of Microsoft Academic produced several studies showing this new product to perform similarly to Google Scholar (Haley, 2014; Harzing & Alakangas, 2017; Ortega & Aguillo, 2014), also improving on the citation coverage of Web of Science and Scopus (Hug & Brändle, 2017).
Another approach is measuring indexed documents, using standardized search queries to compare the results across several platforms. Jacso (2005) was the first to use specific search queries to compare the coverage of several databases, finding that Google Scholar surpassed the coverage of Web of Science and Scopus. Khabsa and Giles (2014) used this procedure to estimate the size of Google Scholar (100 M), Microsoft Academic (50 M), Web of Science (50 M), and PubMed (20 M). Orduña-Malea et al. (2015) employed the same method to estimate the size of Google Scholar, arriving at 160–165 million documents, a figure subsequently updated to 331 million documents, including publications, cited references, and patents (Delgado López-Cózar et al., 2019). Later, Gusenbauer (2019) performed the largest comparison of scholarly databases by counting query hits, calculating 389 million documents indexed in Google Scholar. The drawback of this method is that the results are always estimations and depend on the search interface of each database.
The availability of data (e.g., API endpoints, dump files) and the proliferation of new products increased the number of coverage studies using direct comparisons. Van Eck et al. (2018) were the first to compare Crossref with the traditional citation indexes (i.e., Web of Science and Scopus). Their results showed that Crossref had similar coverage, but with limitations regarding reference and metadata quality. Harzing (2019) concluded that Crossref and Dimensions could be good alternatives to the traditional citation indexes, but not to academic search engines such as Google Scholar and Microsoft Academic. Singh et al. (2021) adopted a journal coverage approach to compare Web of Science, Scopus, and Dimensions. Their results showed that Dimensions is more inclusive in journal indexation than the other platforms. Guerrero-Bote et al. (2021) compared Scopus and Dimensions at the country and organizational level, finding that Dimensions lacked affiliation data in more than half of the publications. Finally, Purnell (2022) showed that large databases such as Dimensions and Microsoft Academic have more affiliation discrepancies than Scopus or Web of Science.

| Coverage via third party comparisons
The use of third-party sources to compare the coverage of bibliographic databases is scarce. We can highlight the use of Google Scholar's Classic Papers product1 as a baseline to generate comparisons between free-access and traditional databases, measuring both indexed documents (Martín-Martín, Orduna-Malea, & Delgado López-Cózar, 2018) and citations (Martín-Martín et al., 2021; Martín-Martín, Orduna-Malea, Thelwall, & Delgado López-Cózar, 2018). Specifically, Martín-Martín, Orduna-Malea, and Delgado López-Cózar (2018) showed that a large fraction of highly cited documents in the Social Sciences and Humanities (8.6%–28.2%) were invisible to Web of Science and Scopus. Martín-Martín, Orduna-Malea, Thelwall, and Delgado López-Cózar (2018) compared 2 M Google Scholar citations with Scopus and Web of Science across disciplines, evidencing that Google Scholar detected 37% more citations than the traditional citation indexes. Martín-Martín et al. (2021) compared Google Scholar citations with five other bibliographic products, confirming that Google Scholar is the most comprehensive service for finding citations.
However, these studies could be influenced by Google Scholar's own coverage, as the classic papers used to compare all the databases (2515 highly cited documents written in English and published in 2006) constituted a subset of Google Scholar, all of them being indexed in that database.

| Reasons for no indexation
Beyond the differences between scholarly databases, whether measuring citing or cited documents, or using direct or third-party comparisons, very few studies have explored the reasons why these coverage differences occur. While testing her own publication record, Harzing (2019) observed which publications were not indexed in several platforms. Visser et al. (2021) manually checked the content of non-indexed documents in several sources, finding that some of these documents did not contain scientific content. However, there are no studies whose objectives focused specifically on the causes of non-indexation.

| OBJECTIVES
The main objective of this article is to compare the coverage of the largest possible number of freely accessible databases (Dimensions, Google Scholar, Lens, Microsoft Academic, Scilit, and Semantic Scholar) using a third-party comparison (via Crossref) to show how these databases differ in the coverage of publications, which allows us to identify potential reasons for the non-indexation of data. Specifically, this study aims to answer the following research questions:
RQ1. Are there significant coverage differences among the currently available free-access bibliographic sources?
RQ2. Which document typologies cause greater coverage differences?
RQ3. What are the potential reasons behind the non-indexation of documents in free-access bibliographic sources?

| METHODS
All the analyzed databases provide a search interface that makes it possible to search and retrieve records without any cost (which excludes paywalled citation indexes such as Web of Science and Scopus) and to compute bibliometric indicators. For example, Dimensions and Lens could be considered freemium products that provide free access to the search interface of a public version but require a subscription or agreement to access a version with more functionalities (i.e., Dimensions Plus, Lens Reports). In total, six bibliographic databases (Dimensions, Google Scholar, Lens, Microsoft Academic, Scilit, and Semantic Scholar) were analyzed against a reference sample from Crossref.

| Crossref sample
This study takes a third-party approach, in which the comparison between databases is made through a third, control database. The strength of this procedure is its ability to avoid potential coverage biases in one database that could influence the comparison. Using a third-party database reduces this risk because all the databases being compared are influenced in the same way by the same external database, thus balancing the comparison.
Crossref was used as the control sample for several reasons. The first reason is operational. This database is the main provider of Digital Object Identifiers (DOIs) for research publications, the most widespread persistent identifier in the publishing system.2 Despite its coverage not being exhaustive (Visser et al., 2021), its use is justified because all six databases under study integrate the DOI as a searchable field, facilitating rapid and exact matching. The second reason is methodological. Crossref allows the extraction of random samples of documents from its API and dump files. This favors the representativeness of the sample, avoiding ranking algorithms, filters, or matching procedures that could disrupt the coverage analysis. Random samples also reduce time and processing costs, favoring the comparison of multiple sources. The third reason is procedural. Crossref assigns DOIs to any published material in a book, journal, or conference, regardless of its informative value (e.g., front covers, indexes, news). Therefore, no inclusion criteria limit the coverage of certain types of documents. This non-selective policy lets us clearly appreciate the inclusion policies of the other bibliographic databases.

| Data collection
A random sample of 116,648 DOIs from Crossref was retrieved in August 2020 and subsequently updated in July 2021. This sample was generated by performing 1200 automatic requests to https://api.crossref.org/works?sample=100. Because this random process produced duplicate records, these were removed to obtain the final list. The requests were limited to documents published between 2014 and 2018. The distribution by document type matches that of the entire database (Hendricks et al., 2020), confirming the reliability of the sample. In addition, Table 1 compares some parameters of the entire Crossref database in May 2020 and the random sample in August 2020. The proportions of record types are similar, with differences remaining around 0.1%, which confirms that the sample is balanced with respect to the total database. This control sample was subsequently queried against each database to match the records and extract all the information related to each publication. This task was performed during July 2021. A detailed description of the extraction process for each database and additional information (size and sources) is offered in Table 2.
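The sampling procedure above can be sketched as follows. This is a minimal illustration, not the authors' actual script: the fetch function is injected so the de-duplication and year-filtering logic can be shown without network access (in practice it would call https://api.crossref.org/works?sample=100 and parse the JSON response), and the stub records are hypothetical.

```python
# Sketch of the Crossref sampling procedure: repeated calls to the
# /works?sample=100 endpoint, de-duplication by DOI, and filtering to the
# 2014-2018 publication window.

def collect_sample(fetch_fn, n_requests=1200, year_range=(2014, 2018)):
    """Gather a de-duplicated DOI sample restricted to a year range.

    fetch_fn() must return a list of dicts with 'DOI' and 'year' keys.
    """
    seen = {}
    for _ in range(n_requests):
        for rec in fetch_fn():
            year = rec.get("year")
            if year is None or not (year_range[0] <= year <= year_range[1]):
                continue
            # DOIs are case-insensitive, so normalize before de-duplicating
            seen.setdefault(rec["DOI"].lower(), rec)
    return list(seen.values())

# Stub standing in for the Crossref API (hypothetical records for illustration)
def fake_fetch():
    return [
        {"DOI": "10.1000/a", "year": 2015},
        {"DOI": "10.1000/A", "year": 2015},   # duplicate in different case
        {"DOI": "10.1000/b", "year": 2010},   # outside the year window
    ]

sample = collect_sample(fake_fetch, n_requests=3)
print(len(sample))  # -> 1 (only the unique, in-window DOI survives)
```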
In addition to searching by DOI, we also searched by publication title in both Microsoft Academic and Google Scholar. The reason for searching Microsoft Academic by title is its low DOI indexation rate (37.1%). Consequently, we resorted to downloading the complete table of publications from Zenodo (https://zenodo.org/record/2628216) and matching the publications by their titles. As for Google Scholar, a search by title was carried out because there is no specific search option for DOIs. This was done to verify whether additional publications could be retrieved by a title search. The results showed that only 898 (0.8%) additional publications were identified, suggesting that the benefits of a title search are minimal in comparison to the effort required. The other databases were not tested in this way, either because their endpoints do not provide full title search or because they use Crossref as their main source.
Additional data processing was performed to explain the coverage of specific document types. For instance, to check the coverage of the entire book instead of the book chapter, we had to remove the chapter suffix (e.g., https://doi.org/10.1002/9781119160243.ch3) or search for the title of the book on the Web and then extract its DOI. The categorization of secondary content was done from the title of the document and by exploring its content on its landing page.
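The chapter-suffix stripping step can be sketched as follows. This is a minimal illustration assuming the `.ch<number>` convention shown in the example DOI above; other publishers use other suffix patterns, which the study handled by searching for the book title on the Web instead.

```python
import re

# Derive a book-level DOI from a chapter-level DOI by stripping the
# ".ch<number>" suffix (one common publisher convention; a sketch of this
# pattern only, not a general rule for all publishers).
def book_doi_from_chapter(doi):
    return re.sub(r"\.ch\d+$", "", doi, flags=re.IGNORECASE)

print(book_doi_from_chapter("10.1002/9781119160243.ch3"))
# -> 10.1002/9781119160243
```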

| Identification of non-indexation reasons
We have defined the following two main types of reasons:
• Internal requirements: Each database defines what materials shall be indexed. These criteria could be motivated by informative reasons (some documents could be more interesting to scholarly audiences), technical reasons (some document types could require additional fields), or access reasons (some publications might not be openly available). For example, Google Scholar only indexes "scholarly articles," excluding "news or magazine articles, book reviews, and editorials" (Google Scholar, 2022), whereas Dimensions includes articles "from a scientific journal or trade magazine, including news and editorial content" (Dimensions Plus, 2019). Internal requirements are mostly associated with document types.
• External criteria: These conditions are caused by external sources that do not provide the information in the form the database requires. That is, the database decides to include the information, but the source does not supply sufficient information (e.g., metadata) for it to be indexed. This problem is especially important for academic search engines, which use bots to crawl the Web and require that the information be suitable for data harvesting; an example is Google Scholar's Inclusion Guidelines for Webmasters (Google Scholar, 2022). External criteria are related to specific sources, such as data providers or publishers.
Due to the particular operating mode of these external criteria, we adopted a web crawler perspective to identify indexation problems. To do this, the 12,404 documents not found in Google Scholar, Microsoft Academic, and Semantic Scholar were resolved using their DOIs (https://hdl.handle.net/) to explore the landing page of each publication. Next, a link checker (Xenu's Link Sleuth3) was used to test the accessibility of these webpages. Only the pages that returned the 200 (OK) status code were selected to be crawled, while the remaining ones were classified as access problems. An R script was written to extract robots instructions (i.e., meta name="robots", {{ngMeta.robots}}) and robot exclusion directives (i.e., noarchive, noindex).
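The robots-directive extraction described above (performed in the study with an R script) can be sketched in Python as follows. The HTML snippet is illustrative, and only the plain `<meta name="robots">` pattern is handled here, not the `{{ngMeta.robots}}` template variant.

```python
import re

# Scan a landing page's HTML for <meta name="robots"> tags and report whether
# any exclusion directive (noindex / noarchive) is present.
META_ROBOTS = re.compile(
    r'<meta[^>]*name\s*=\s*["\']robots["\'][^>]*content\s*=\s*["\']([^"\']*)["\']',
    re.IGNORECASE,
)

def robots_directives(html):
    """Return the set of directives declared in robots meta tags."""
    directives = set()
    for content in META_ROBOTS.findall(html):
        directives.update(d.strip().lower() for d in content.split(","))
    return directives

def excluded_from_indexing(html):
    """True if the page carries a robot exclusion directive."""
    return bool(robots_directives(html) & {"noindex", "noarchive"})

page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(robots_directives(page))        # {'noindex', 'nofollow'}
print(excluded_from_indexing(page))   # True
```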
This distinction of criteria allows us to identify the principal reasons for not indexing specific documents, and to analyze the coverage problems of bibliographic databases and academic search engines separately.

| Coverage
To compare the performance of each database and highlight the differences in the coverage of Crossref publications, the number and percentage of missing documents in each database is shown in Table 3. Lens (0.1%) and Scilit (0.2%) almost exactly reproduce the initial sample. On the other hand, Microsoft Academic (12%) and Semantic Scholar (10%) are the databases that miss the most publications from our sample. The high proportions of missing records in Dimensions (7.6%) and Google Scholar (9.8%) were unexpected: first, because Crossref feeds Dimensions; and second, because Google Scholar is considered the largest academic database (Gusenbauer, 2019; Martín-Martín et al., 2021). The high percentage of duplicated DOIs in Microsoft Academic is also worth mentioning (1107, 1.1%), and might be caused by the assignment of the same DOI to preprint copies and book chapters. Overall, these results show a high degree of overlap (>85%) with respect to the initial sample, confirming a high overlap between scholarly databases, a fact already found in the literature through other databases (Harzing, 2019; Visser et al., 2021). The coverage differences shown in Table 3 are analyzed in the following sections to identify the reasons behind the non-indexation of documents in certain databases. This review allows us to uncover how methodological differences in the building, design, and data feeding of these databases influence the indexation of scholarly publications.

| Similarities and differences
A first step to understanding the different coverages is to study the similarities and differences among databases according to the overlap of documents. This overlap was calculated by comparing the records retrieved from the Crossref sample in each database. Figure 1 shows a multidimensional scaling (MDS) plot, in which the distances between services are calculated according to the proportion of overlapping documents in each platform. MDS is a visualization technique for displaying the information contained in a distance matrix; it was chosen to overcome the limitation of Venn diagrams, which cannot plot more than three sets. The K-means clustering algorithm, which groups elements according to the nearest cluster mean, was used to confirm the clusters (node color) observed in the MDS map.
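The MDS step can be sketched as follows: a minimal classical (Torgerson) MDS applied to a dissimilarity matrix defined as 1 minus pairwise overlap. The overlap values below are invented for demonstration, not the study's actual figures, and the K-means confirmation step is omitted for brevity.

```python
import numpy as np

# Illustrative pairwise overlap matrix (proportion of shared documents);
# these numbers are hypothetical, chosen only to mirror the pattern described
# in the text (Lens/Scilit near Crossref, search engines farther away).
names = ["Crossref", "Lens", "Scilit", "Dimensions",
         "GoogleScholar", "MicrosoftAcademic", "SemanticScholar"]
overlap = np.array([
    [1.00, 0.99, 0.99, 0.92, 0.90, 0.88, 0.90],
    [0.99, 1.00, 0.99, 0.92, 0.90, 0.88, 0.90],
    [0.99, 0.99, 1.00, 0.92, 0.90, 0.88, 0.90],
    [0.92, 0.92, 0.92, 1.00, 0.91, 0.89, 0.89],
    [0.90, 0.90, 0.90, 0.91, 1.00, 0.93, 0.93],
    [0.88, 0.88, 0.88, 0.89, 0.93, 1.00, 0.95],
    [0.90, 0.90, 0.90, 0.89, 0.93, 0.95, 1.00],
])
D = 1.0 - overlap  # dissimilarity: 1 - proportion of overlapping documents

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: embed a distance matrix into k dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:k]       # top-k eigenvalues
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))

coords = classical_mds(D)
for name, (x, y) in zip(names, coords):
    print(f"{name}: ({x:.3f}, {y:.3f})")
```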
The K-means algorithm identifies a first group (blue) formed by Scilit and Lens, very similar to Crossref. This closeness evidences that both databases feed on Crossref, incorporating almost all the records stored in it (missing <1%). The main characteristic of this group is that these products build their databases by incorporating publications from external data sources (i.e., Crossref, PubMed, Microsoft Academic).
A second, intermediate group (red) is formed by Dimensions and Google Scholar (missing ≈10%). These databases use different methodologies to build their collections: while Dimensions is also based on external sources, Google Scholar relies mainly on web crawling. Their similarity could be due to their broad coverage.
A third group (green), farthest from Crossref (>10% missing), is formed by Microsoft Academic and Semantic Scholar, two similar academic search engines that obtain their information through web crawlers. This similarity is also explained by the fact that Semantic Scholar uses Microsoft Academic as a data source (i.e., Microsoft Academic Graph) (Boyle, 2018).
These results evidence that the methodological and technical approaches used in building scholarly databases influence the coverage of documents to a great extent.

| Missing publications
Figure 2 shows the proportion of missing publications by document type in comparison with Crossref. The aim is to check whether document typology has any influence on coverage, and to highlight which specific types of documents are prone to be indexed in each database. For this reason, Crossref is included in the graph to contrast the proportion of indexed documents in this database with the proportion of missing documents in the others. Crossref categories were used as the reference in the comparison. These were grouped into eight principal classes: book (includes monographs and book series), book chapter (includes reference entry, reference book), dataset, journal article (includes journal issue), posted content, proceedings (includes proceedings article), report, and other (includes component, correction, retraction, peer review). Appendix A details the number and proportion of all document typologies (Table A1). Overall, the bar graph shows that some typologies, in particular book chapters and journal articles, experience more indexation problems (Figure 2).
Figure 2 also shows different patterns according to the type of database. In databases mainly based on Crossref (e.g., Scilit and Lens), most of the documents from the sample that are not found are journal articles (Scilit: 76.1%, Lens: 83.3%). However, these percentages are similar to the total share of journal articles in Crossref (74.7%), which suggests that this lack of coverage may not be due to this specific document typology. Academic search engines (Google Scholar, Microsoft Academic, and Semantic Scholar) show a different pattern, finding difficulties in the indexation of journal articles and book chapters in similar proportions. For instance, of the documents in the sample that Google Scholar does not index, 41.3% are book chapters and 43.5% are journal articles. A manual inspection of a random sample of these documents (N = 1354; confidence level = 95%; error margin = 2.5%) disclosed that only 63.7% of the book chapters had scholarly content, and 17.3% were reference entries. In the case of journal articles, where there is a wider variety of types, only 32.8% were strictly research papers. In Microsoft Academic, book chapters account for 39.9% of the documents that are not found and journal articles for 42.9% of the missing documents; in Semantic Scholar, 43.9% of the missing documents are book chapters and 40% journal articles. Finally, Dimensions displays a particular pattern, with problems specifically in the indexation of book chapters (68.8% of missing documents).
FIGURE 1 MDS map showing the distance between scholarly databases according to the overlap of indexed documents

| Reasons for non-indexation
Next, we analyze why certain document types experience more indexation problems and how the different databases manage to index them.

| Bibliographic databases
The coverage of these databases is mainly determined by internal requirements. In the case of Lens (99.9%) and Scilit (99.8%), we consider that there are no inclusion criteria restricting Crossref data because coverage is almost complete. In the case of Scilit, it is worth mentioning that 61.4% of the missing publications are records without a title, which suggests that both Lens and Scilit only apply technical criteria to exclude content, such as the metadata completeness of the records.
Beyond internal indexing criteria, we find additional causes that explain the non-indexation of documents. Considering Dimensions as a case study (Table 4), which also uses Crossref as its primary source, we find the following causes:
• Book chapters (65.4%): Book chapters are not separated from the full book. In other words, despite the full book being indexed, some of its chapters are missing. This problem affects 37.3% of the book chapters in the whole sample.
• Secondary content (15.3%): Secondary content refers to publications that have a DOI but are not, strictly speaking, research publications. For example, in the case of journal articles, we can find editorials, news, tables of contents, front matter, covers, etc., which accompany research articles but have no scientific content of their own. In the case of book chapters, this secondary content comprises indexes, forewords, abbreviations, glossaries, etc. 30.8% of all the secondary content in Crossref is excluded from Dimensions, while the remainder corresponds to news and editorials that are indeed indexed. It is worth mentioning that more than half of the books and book chapters from Oxford University Press (50.1%) are not indexed, which suggests that Dimensions experiences certain problems when indexing bibliographic data from this publisher.

| Academic search engines
This group refers to scholarly information databases that mainly use crawlers and bots to gather bibliographic information. 10,554 webpages (85.1%) returned the 200 (OK) status code; the remaining ones were classified as "access" problems. Of the accessible pages, only 5283 (50.1%) contained instructions for robots, and of those, 4819 (91.2%) included robot exclusion directives.
Table 5 depicts the main causes explaining why some Crossref publications are not indexed in the academic search engines under analysis. Notice that some of these criteria differ from those of bibliographic databases, illustrating the important methodological differences in the construction of these products. The distribution of causes found in the three search engines shows similar percentages, suggesting that some of these external criteria influence each search engine equally. However, it is important to notice that these causes only explain 88.3% of the missing documents in Semantic Scholar, 87.8% in Google Scholar, and 87.9% in Microsoft Academic. As with Dimensions, the main limitation on indexing publications in Google Scholar is the indexation of entire books instead of chapters (38.3%), an issue that affects 28% of the book chapters in the sample. Although it is officially stated that "Google Scholar automatically includes scholarly works from Google Book Search" (Google Scholar, 2022), the chapters of these books are not disaggregated, so these documents cannot be retrieved from Google Scholar unless the author or publisher has uploaded the specific chapter to some source indexed by Google Scholar. This problem is also remarkable in Microsoft Academic (19.3%) and Semantic Scholar (31.3%), although the reasons are unknown; they could be due to the absence of appropriate landing pages or to insufficient information for indexing them correctly.
The second most important cause of non-indexation is robot exclusion. This is the main external limitation preventing the indexing of publications in academic search engines. Google Scholar (24%) is the service least affected by this problem, while it is the principal reason in Microsoft Academic (35.2%) and Semantic Scholar (35.2%). The better performance of Google Scholar in this area may have to do with how it indexes documents that it finds in the lists of cited references of other documents.
Academic search engines also have internal indexing criteria to select the content to be indexed. Google Scholar states that "Content such as news or magazine articles, book reviews, and editorials is not appropriate for Google Scholar" (Google Scholar, 2022), while Microsoft Academic (2021) and Semantic Scholar (2022) do not provide clear information about selection criteria. Because of this, secondary content is not indexed in Google Scholar (6.5%), Semantic Scholar (9.5%), and Microsoft Academic (11.6%). In the case of Google Scholar, the manual inspection showed that this percentage would climb to 28.3% excluding other causes. In this sense, Google Scholar also claims that "Sites that show […] bare bibliographic data without abstracts will not be considered for inclusion" (Google Scholar, 2022). Thus, publications without a short description of their content (e.g., no abstract) are also excluded in Google Scholar (5.9%) and Microsoft Academic (2.9%). Other excluded document types are datasets and posted content, which altogether represent 3.8% of missing documents in Google Scholar, 4.8% in Microsoft Academic, and 5.6% in Semantic Scholar.

| DISCUSSION
This study has attempted to compare different scholarly databases from an original point of view, exploring the reasons behind the non-indexation of publications in each of them. This approach has revealed significant differences between two types of products, bibliographic databases and academic search engines, which build their collections using different methodologies that greatly influence the coverage of publications. Document typology has been the primary analytical approach in this work, and the results have demonstrated that it is the most explanatory element for detecting coverage differences: more than 80% of the missing documents were explained by their typology. Other variables, such as discipline or language, were less explanatory and presented important methodological problems. Only 19% of the Crossref documents included a thematic category, and only for journal articles, so a disciplinary analysis would be incomplete and biased. Regarding language, an initial analysis showed that the proportion of missing documents in all the databases was biased in favor of English-language publications in similar proportions, ranging from 81.4% of English-language publications in Google Scholar to 86.5% in Scilit. Statistically significant pairwise differences were found only in the case of Google Scholar. Therefore, the language analysis was excluded due to the little information it provided.

| Reasons for missing publications in classical bibliographic databases
The results have shown that bibliographic databases, principally Scilit and Lens, attain the highest coverage levels relative to Crossref.
The specific causes of non-indexation found (see Table 3) are mainly related to the adaptation of Crossref data to the characteristics of each bibliographic database (internal requirements). Because Scilit and Lens ingest Crossref data without remarkable modifications, their coverage is higher. However, in those databases where the adaptation process is deeper, indexation problems arise. This is the case of Dimensions with book chapters, 37.3% of which are not found, accounting for 65.4% of the unmatched documents in that database. This problem was already pointed out by Harzing (2019), who found only one chapter out of 25 in her sample. The reason is that, like search engines, Dimensions indexes 95% of the books to which those missing book chapters belong. This issue is even more striking because the Dimensions core is based on the Crossref database, where book chapters are independently recorded (Hook et al., 2018). The explanation for this lack of book chapters is that Dimensions does not include book chapters from books labeled monograph in Crossref.4 This would suggest that Dimensions does not consider book chapters as independent publications, because their indexation is conditional on the prior indexation of the book.
These differences between bibliographic databases are also perceptible in the management of secondary content. While 15.3% of the missing documents in Dimensions fall into this category, Scilit and Lens scarcely limit their indexation. This result suggests that the latter databases do not have indexation criteria that filter this type of content.

| Reasons for missing publications in academic search engines
Academic search engines (Google Scholar, Microsoft Academic, and Semantic Scholar) build their databases by crawling and harvesting research publications available on the Web, independently of third-party sources. This might explain their lower coverage of Crossref publications (<90%).
Access problems (either robot exclusion or accessibility issues) constitute the principal cause of missing publications in academic search engines, accounting for 43.2% in Microsoft Academic, 41.9% in Semantic Scholar, and 33.3% in Google Scholar. These external factors highlight the important technical limitations of collecting publications from the Web, where metadata are not always accessible or accurate. Perhaps because of these problems, search engines have stricter indexation criteria. The significant percentage of missing documents in Microsoft Academic (19.3%), Google Scholar (16.2%), and Semantic Scholar (15.1%) due to internal requirements indicates that search engines avoid indexing documents with limited information (e.g., pages without abstracts, datasets, comments).
A generalized problem is the coverage of book chapters; academic search engines do not index these publications properly either. In the case of Google Scholar, there is a technical limitation: one URL corresponds to only one publication (Delgado López-Cózar et al., 2019). The fact that a book in PDF format can include different independent publications, each authored by different authors, is not automatically handled by the indexing algorithm. Chapters are not indexed unless they are independently available with their own URL. Eighty-one percent of the book chapters not indexed in Google Scholar are included in Google Books within the full book, which suggests that Google Books is the main reference for books in Google Scholar and also reveals the lack of coordination between these two databases. Microsoft Academic and Semantic Scholar might face similar problems when crawling books. Beyond this technical limitation, the existence of books published as images instead of text prevents the correct indexation not only of the corresponding chapters but also of the references. The lack of commercial agreements with publishers and the limitations of book publisher websites might also explain the limited indexation of book chapters.

| Research implications
These findings have important implications both for the design of scholarly information systems and for research evaluation.
From a technical point of view, the observed differences between bibliographic databases and academic search engines encourage us to recommend using both approaches. This mixed approach could provide a more complete picture of research fields or organizations by combining the exploration of the scientific literature with the design of accurate information services. The recent case of OpenAlex5 is a good example of the integration of academic search engine data (Microsoft Academic) with external sources (Crossref, PubMed). This source was tested for inclusion in the study, finding 101,053 (86.6%) records created before 2022. However, at that moment all the records came from MAG, so we assume that OpenAlex would not provide more information than that reported for MAG. A recent publication testing differences between MAG and OpenAlex showed that, in its early days, OpenAlex was just a MAG mirror enriched with Crossref's DOIs (Scheidsteger & Haunschild, 2023), which is in line with our preliminary results.
For research evaluation, the most problematic result is the incomplete indexation of book chapters. Regardless of the criteria of each database, the absence of a great volume of book chapters in many of the databases undervalues the contribution of researchers and organizations when these services are used for research evaluation. This problem is especially harmful in research areas with a high production of books and book chapters, such as the social sciences and humanities (Huang & Chang, 2008).
Methodologically, this study has evidenced that the size and coverage of databases should be interpreted according to the reference sample used in the analysis, because this sample always introduces a selection bias. The most illustrative example in our case is Google Scholar, regarded as the largest scholarly information service (Gusenbauer, 2019; Martín-Martín et al., 2021), yet showing a lower coverage of the items deposited in Crossref. This result does not invalidate Google Scholar as the largest scholarly information service, but illustrates that there is a considerable amount of scientific literature that is not indexed in it. Previous studies already warned about this fact (Adriaanse & Rensleigh, 2013; Bar-Ilan, 2010; Giustini & Boulos, 2013; Martín-Martín & López-Cózar, 2021). This consideration leads us to a second criticism of coverage studies: they should take into account the quality and value of the indexed documents rather than the mere number of publications. Thus, for example, the fact that Dimensions or Google Scholar covers fewer publications from Crossref than Scilit or Lens should not be seen as a weakness, but as a sign that these services have stricter indexation criteria, selecting publications with rich scientific content (e.g., journal articles, book chapters) and filtering out scarcely informative items (e.g., indexes, announcements, front covers, prefaces, glossaries). This has important implications for the appreciation of scholarly databases: if a database does not filter and select content, it does not add value, and its use is therefore less attractive, precisely because the lack of content processing would cause noise in document retrieval and inflated coverage.
The comparative study of bibliographic sources always faces data access problems that make it difficult to assess the performance of these services. These problems are more evident in the case of commercial platforms, some of which prevent their data from being used for research purposes or condition data usage on the approval of a research project proposal, considering aspects beyond the technical use of their servers and downloading services. We understand that these policies constrain the development of research focused on describing how these platforms operate. This is undesirable, as many of these platforms take their data from open sources such as Crossref, PubMed, or MAG. Added to this, information about content selection is sometimes limited (e.g., Dimensions, Google Scholar, Semantic Scholar) or even absent (e.g., Scilit, Lens). This makes it difficult to discuss in more detail to what extent indexation criteria determine the non-coverage of publications.

| Limitations
A third-party study is determined by the coverage limitations of the reference sample. In our case, Crossref only includes publications from partner publishers, leaving aside some conference proceedings and local journals (see footnote 2). These publications are not curated and may therefore include materials that are not strictly research outputs. The identification of secondary content indicates that publishers deposit any type of material, regardless of its scholarly content. This underscores a second limitation of Crossref: document typologies are not precise, because publishers may confuse or misattribute them. We encountered this issue with Dimensions, where book chapters from books typed as monograph are not indexed. Similarly, manual inspection of Google Scholar revealed that 32.8% of missing journal articles actually fell into different categories.
Another problem could stem from the extraction process. In the case of Dimensions, Semantic Scholar, and Microsoft Academic, specific R packages were used to query these services (dimensionsR, microdemic, and semscholar). In our experience, all these packages present some type of bug or error, which led us to query the APIs directly in some cases. These problems could have caused some loss of information.
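Querying a REST endpoint directly, instead of through a wrapper package, is straightforward. The sketch below uses the public Crossref works endpoint, treating an HTTP 404 as a missing record; the helper names (`work_url`, `is_indexed`) are ours and not part of any package used in the study.

```python
from urllib.parse import quote
from urllib.request import urlopen
from urllib.error import HTTPError

CROSSREF_WORKS = "https://api.crossref.org/works/"

def work_url(doi: str) -> str:
    """Build the Crossref works URL for a DOI, percent-encoding the suffix."""
    return CROSSREF_WORKS + quote(doi, safe="")

def is_indexed(doi: str) -> bool:
    """Return True if Crossref resolves the DOI, False on HTTP 404."""
    try:
        with urlopen(work_url(doi), timeout=30) as resp:
            return resp.status == 200
    except HTTPError as err:
        if err.code == 404:
            return False
        raise  # other HTTP errors are real failures, not missing records

# No network request is made here; we only show the constructed URL.
print(work_url("10.1000/xyz123"))
```

Distinguishing a 404 (record absent) from other HTTP errors (rate limiting, outages) is what keeps this kind of direct query from silently inflating the count of "missing" publications.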
Searching by DOIs introduces the risk that this identifier may not be assigned to the document in the searched database (Van Eck et al., 2018). This issue was particularly clear in Microsoft Academic. To mitigate this risk, searches by title were conducted in those cases in which a reliable endpoint was not available. The slight improvement obtained in Google Scholar showed that this form of search requires considerably more effort than the reward it yields.
In the specific case of Google Scholar, we found problems with the results page. First, when searching by title, documents with very short titles made of common words did not produce an exact match, and several items were returned; this made identifying the correct document very difficult and time-consuming. Another problem was false positives, where a DOI query retrieves the wrong document because that document mentions the DOI in its abstract (e.g., retractions). These retrieval problems lead us to point out that limited search functionalities may influence the matching of documents, distorting the measured coverage, as could be the case with Google Scholar (Boeker et al., 2013; Gusenbauer & Haddaway, 2020).
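Title-based matching of the kind described above typically normalizes both strings before comparison. This is a minimal sketch of one plausible normalization (lowercase, accent and punctuation stripping, whitespace collapsing); it is our own illustration, not the matching procedure actually used by any of the databases studied, and exact matching on such normalized titles remains too strict for very short, common-word titles.

```python
import re
import unicodedata

def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    t = unicodedata.normalize("NFKD", title)
    t = "".join(c for c in t if not unicodedata.combining(c))
    t = re.sub(r"[^a-z0-9\s]", " ", t.lower())
    return re.sub(r"\s+", " ", t).strip()

def titles_match(a: str, b: str) -> bool:
    """Exact match on normalized titles."""
    return normalize_title(a) == normalize_title(b)

print(titles_match("Éditorial: A Review.", "editorial a review"))  # True
```

Normalization of this sort absorbs superficial differences (accents, punctuation, casing) between a database record and a search result, but it cannot resolve the ambiguity of several distinct documents sharing the same short title.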

| CONCLUSIONS
The results of this study allow us to conclude that, from the Crossref point of view, there are remarkable coverage differences between scholarly databases. These differences are mainly due to the methodological approach each service uses to build its database. The proportion of missing documents has evidenced that bibliographic databases such as Scilit (0.2%) and Lens (0.1%) almost exactly reproduce the content of Crossref, whereas academic search engines such as Microsoft Academic (12%) and Semantic Scholar (10%) show an important absence of records. Dimensions (7.6%) and Google Scholar (9.8%) stand in an intermediate position. However, these coverage differences should be considered critically, because a high coverage of Crossref records also implies low filtering levels of scholarly publications, causing noise in retrieval and poor content curation.
These disparities are principally due to the management of specific document types. Bibliographic databases experience more problems covering journal articles (>75%), while search engines find limitations both in journal articles (≈40%) and book chapters (≈40%).
The reasons for this non-indexation differ according to the type of scholarly product. For bibliographic databases such as Dimensions, it is due to internal selection criteria that index full books instead of book chapters (65%) and exclude secondary content (15%). In the case of academic search engines, there are important external limitations (web accessibility, robot restrictions) that prevent the indexation of research documents (39.2%-46%), as well as internal requirements that exclude secondary content (6.5%-11.6%).
This work represents an advance in the study of the coverage of bibliographic databases, by introducing the reference sample (third-party) method and by considering free-access bibliographic databases and academic search engines. The results obtained have made it possible to identify accurately the reasons for the non-indexation of documents, pinpointing specific motives according to the type of database (classical databases or academic search engines). These results are helpful to meta-researchers learning about the characteristics of the databases used in bibliometric studies, as well as to librarians and practitioners who need to use scholarly databases to assist researchers or carry out training tasks. Likewise, the work uncovers the need for publishers to properly update their websites and reach specific agreements with academic search engines in order to be correctly indexed by these products, which are bound to coexist with the classic databases.

FIGURE 2 Distribution of non-indexed documents by typology in each platform.
TABLE A1 Distribution of indexed publications in the Crossref sample by document type and the proportion of missing documents in the remaining databases.
TABLE 1 Comparison between the total coverage of Crossref in May 2020 and the random sample (July 2021). Note: Estimated values as of August 26, 2022.
TABLE 2 Data collection process carried out in each bibliographic database. For Microsoft Academic, SPARQL (https://makg.org/sparql) and REST API (https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate) endpoints were first used to extract publications using DOIs; then the entire table of publications (available in Zenodo) was downloaded and locally matched with the sample, using DOIs and titles.
Missing publications of the Crossref random sample (N = 116,647) over the different scholarly databases.
Principal causes for the missing records from Crossref in academic search engines.