Feasibility and desirability of screening search results from Google Search exhaustively for systematic reviews: A cross‐case analysis

A commonly reported challenge of using Google Search to identify studies for a systematic review is the high number of results retrieved. Thus, ‘stopping rules’ are applied when screening, such as screening only the first 100 results. However, recent evidence shows that Google Search estimates a much higher number of results than the viewable number, raising the possibility of exhaustive screening. This study aimed to provide further evidence on the feasibility of screening search results from Google Search exhaustively, and to assess the desirability of this in terms of identifying studies for a systematic review. We conducted a cross‐case analysis of the search results of eight Google Search searches from two systematic reviews. Feasibility of exhaustive screening was ascertained by calculating the viewable number of results. Desirability was ascertained according to: (1) the distribution of studies within the results, irrespective of relevance to a systematic review; (2) the distribution of studies which met the inclusion criteria for the two systematic reviews. The estimated number of results across the eight searches ranged from 337,000 to 72,300,000. The viewable number ranged from 272 to 364. Across the eight searches the distribution of studies was highest in the first 100 results. However, the lowest ranking relevant studies were ranked 227th and 215th for the two systematic reviews. One study per review was identified uniquely from searching Google Search, both within the first 100 results. The findings suggest it is feasible and desirable to screen Google Search results more extensively than commonly reported.

is typically prohibitive to screen in full, thus 'stopping rules' are applied, such as limiting the screening process to the first 100 results. However, a recent study found that the number of estimated results in Google Search is much higher than the viewable number of results, thus raising the possibility of screening the results exhaustively.

What is new
This study contributes further evidence on the feasibility of screening Google Search results exhaustively, demonstrating that the viewable number of results is typically in the low hundreds rather than the hundreds of thousands or millions which are estimated by the search engine. The study also found that screening a higher proportion of the results is potentially useful for identifying relevant studies for inclusion in systematic reviews.
Potential impact for research synthesis methods readers outside the authors' field
Systematic reviews may benefit from more extensive screening of Google Search results than commonly carried out if this leads to the identification of additional relevant studies for inclusion in analyses.

| BACKGROUND
Searches for studies for inclusion in systematic reviews typically use a variety of different resources. Bibliographic databases are usually the main source of studies, with 'supplementary' sources used alongside to identify additional studies not retrieved by bibliographic databases. 1,2 Search engines such as Google Search (www.google.com), which are a gateway to a vast amount of content on the World Wide Web (hereafter, web), are one such supplementary resource. Several case studies report the value of using Google Search or another search engine in a systematic review, measured in terms of 'uniquely' identified relevant studies, i.e. studies which meet the inclusion criteria for a systematic review and which are not identified by other search methods. 3-5 Despite this, search engines have been contested as valid sources of studies for systematic reviews. 6 This is mainly due to how search engines retrieve and rank results using hidden algorithms which take into account a user's search history and geographical location. 6,7 However, in view of the potential for finding relevant studies uniquely, systematic review guidance on searching for studies recommends their use as supplementary to bibliographic databases and other search methods. 1,8
The extensiveness of information which can be found on the web makes search engines valuable resources for identifying studies, but this also poses challenges. One such challenge is that Google Search often estimates very high numbers of search results, numbering in the hundreds of thousands or more, which would be impractical to screen exhaustively. 5,9 Thus, systematic reviewers typically use 'stopping-rules' which rely on Google Search's algorithms for ranking search results according to relevance to a search query. 10 Stopping-rules for web searching involve either limiting screening to a pre-specified number of results (e.g. the first 100) or screening the results until one or two pages of results are inspected without identifying any relevant content. 5,11,12 However, a recent study by Briscoe and Rogers showed that the number of estimated results in Google Search is sometimes far in excess of the viewable number, thus raising the possibility of screening the results exhaustively. 13 In summary, Briscoe and Rogers showed that the mean number of viewable results for three Google Search searches was 463, in contrast to the mean number of estimated results reported by the search engine of 569,454,000. 13 The viewable number of results was calculated by setting Google Search to display 100 results per page and scrolling to the final page of results. 13 Despite this finding, it is not clear whether exhaustive screening is desirable, particularly in view of Google Search's PageRank algorithm which ranks content according to relevance, which might make screening the results in full unprofitable (in terms of the identification of studies) even if feasible. 14

| Aims and objectives
This study had two main aims:
1. To provide further evidence on the feasibility of exhaustively screening the results retrieved by Google Search.
2. To assess the desirability of screening the results exhaustively, measured in terms of the likelihood of identifying studies for a systematic review.
The second aim was assessed in two stages:
i. the distribution of journal articles and grey literature within the results of Google Search searches, irrespective of relevance to a particular systematic review question, and
ii. the distribution of journal articles and grey literature within the results which met the inclusion criteria for two pre-specified systematic review questions, including specifically journal articles and grey literature which were uniquely identified by Google Search.
The purpose of the first stage of the second aim was to ascertain whether studies in general are distributed evenly throughout the results or whether they are grouped within a specific section of results. This was undertaken in view of how Google Search indexes all web-crawler accessible content on the web, which makes it helpful to know whether studies are more or less likely to be identified throughout the results. The purpose of the second stage was to narrow this focus to the identification of studies for systematic reviews which the searches were intended to resource.
We achieved this by analysing Google Search results from two reviews which included a systematic search for studies: (1) a scoping review of qualitative studies on the perspectives of primary care clinicians on interacting with women patients with gynaecological conditions or symptoms suggestive of gynaecological conditions (hereafter, the Women's Health review); 15 (2) an umbrella review of effectiveness and cost-effectiveness systematic reviews which evaluate multi-disciplinary occupational health interventions that aim to help people return to work (hereafter, the Occupational Health review). 16 Searches for studies for both reviews used a variety of search methods, including bibliographic database searches, checking reference lists, forward citation searching, and web searching using the UK version of Google Search (www.google.co.uk).

| Data collection
For the Women's Health review, we carried out six searches for studies using Google Search on 3rd November 2021. This included five searches which aimed to identify qualitative studies relating to specific gynaecological conditions or symptoms included in the review (namely, endometriosis, menopause, menstrual disorders, polycystic ovary syndrome and chronic pelvic pain), and one search which aimed to identify studies relating to gynaecological conditions generically. For the Occupational Health review, we carried out two searches for studies using Google Search on 6th July 2021. This included one search which aimed to identify systematic reviews of multidisciplinary return to work interventions and one search which aimed to identify systematic reviews of multidisciplinary vocational rehabilitation interventions. However, because we did not document the results of the Occupational Health Google Search searches in sufficient detail for the present study, we re-ran the searches on 14th June 2022.
The search strings were constructed prior to the commencement of the two reviews, using an iterative process which attempted to adapt the complex searches used in the bibliographic databases for the more basic search interface of Google Search. This involved experimenting with different search terms and ascertaining that the search operators worked as expected. The resulting search terms and basic structure of the searches reflected the bibliographic database searches for each systematic review, albeit in a simplified format which was appropriate for Google Search. 15,16 Ascertaining that the search operators worked as expected included: checking that the Boolean operator 'OR' retrieved the various terms that we had specified; checking that the use of quotation marks retrieved the specified phrases; and checking that at least one term from each set of terms within parentheses was retrieved in the search results. We did not use the AND Boolean operator as this is automatically applied between search terms if OR is not specified. 17 All the operators we used are included on the Google Search help page except for parentheses. 18 There is a discrepancy in unofficial guidance on whether parentheses are supported by Google Search. 11 However, there is consensus that the OR Boolean operator is prioritised over AND in the order of execution, and in all eight searches parentheses were used solely to group search terms which were combined with OR. 17 Thus, the logic of the search strings was the same whether or not the parentheses were functioning. We were able to confirm this by comparing the first pages of search results with and without parentheses. The full details of searches that were carried out are presented in Table 1.
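The structure of these search strings can be sketched in code. This is a minimal illustration rather than the tooling used for the reviews: the helper names `quote`, `or_group` and `build_query` are hypothetical, and the sketch assumes, as described above, that a space between terms acts as an implicit AND and that parentheses only group terms combined with OR. The example reproduces the 'return to work' search string reported for the Occupational Health review.

```python
def quote(phrase):
    """Wrap a phrase in double quotation marks for exact-phrase matching."""
    return f'"{phrase}"'


def or_group(terms, parentheses=True):
    """Combine alternative terms with OR; a single term is returned unchanged."""
    if len(terms) == 1:
        return terms[0]
    joined = " OR ".join(terms)
    return f"({joined})" if parentheses else joined


def build_query(groups, parentheses=True):
    """Join OR-groups with spaces, which the search engine treats as AND."""
    return " ".join(or_group(group, parentheses) for group in groups)


# Concept groups for the 'return to work' search.
groups = [
    [quote("return to work")],
    [quote("multi-disciplinary"), "multidisciplinary"],
    ["report", "review"],
]

query = build_query(groups)
query_without_parentheses = build_query(groups, parentheses=False)
print(query)
print(query_without_parentheses)
```

Building both variants makes it straightforward to repeat the check described above, i.e. comparing the first page of results with and without parentheses.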
In order to facilitate data analysis, we set Google Search to display 100 results per page as described in Briscoe and Rogers. 13 This involved using the slide-bar option in the 'See All Settings' submenu of the main settings menu (accessed via the 'gear cog' icon on the Google Search homepage) to increase the results per page from the default of 10 to the maximum of 100. Prior to searching we also ensured that we were logged out of our personal Google accounts and used the option in the Search Settings page to deactivate search customisation, which stops Google from using the user's search history to personalise the ranking of search results according to their previous searches. These are recommended measures when searching for systematic reviews in order to reduce the bias associated with the personalisation of search results. 8

| Feasibility of exhaustive screening
The feasibility of exhaustive screening was ascertained by calculating the viewable number of results for each search. We assumed that numbers of results which were of a similar order of magnitude to the numbers reported when using a stopping-rule, for example, the first 100 results, were feasible to screen in full. 5 The viewable number was also compared with Google Search's estimated number of results. The estimated number of results was taken from underneath the search bar on the first page of results. The viewable number was calculated using the following procedure described by Briscoe and Rogers. 13 First, we selected the final page of results using the page numbers at the bottom of the page. We then selected the option to "Repeat the search with the omitted results showing", which includes search results that are similar to the initial set of results, but which Google Search initially omits to avoid potential duplication. We selected the final page of results from this more exhaustive set and manually counted the number of results on this page. Finally, we multiplied the number of results per page (i.e. 100) by the total number of pages minus one; then, added the number of results on the final page to ascertain the total viewable number of search results.
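The final step of this procedure is simple arithmetic and can be written out as a short function. This is a sketch only: the function name and the example inputs (four pages with 64 results on the final page) are assumptions chosen for illustration, not figures taken from the searches.

```python
def viewable_results(total_pages, results_on_final_page, results_per_page=100):
    """Viewable results = full pages before the last one, plus the last page.

    total_pages            number of result pages after repeating the search
                           with the omitted results showing
    results_on_final_page  manually counted results on the final page
    results_per_page       page size set in the search settings (100 here)
    """
    if total_pages < 1:
        raise ValueError("total_pages must be at least 1")
    if not 0 <= results_on_final_page <= results_per_page:
        raise ValueError("final page cannot exceed the page size")
    return results_per_page * (total_pages - 1) + results_on_final_page


# Illustrative inputs: 3 full pages of 100 plus 64 on the final page.
example_total = viewable_results(total_pages=4, results_on_final_page=64)
print(example_total)  # 364
```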

| Desirability of exhaustive screening
The desirability of exhaustive screening was ascertained according to two criteria:
i. the distribution of journal articles and grey literature within the results, irrespective of relevance to a particular systematic review question,
ii. the distribution of journal articles and grey literature within the results which met the inclusion criteria for our pre-specified systematic review questions, including specifically journal articles and grey literature which were uniquely identified by Google Search.
To facilitate the analysis of these criteria we first copied and pasted the results of each search into Microsoft Word documents in 'chunks' of 100 results as displayed per page. To assess the first criterion, we counted how many journal articles and grey literature publications were retrieved within each page of 100 results per search. Journal articles were counted if they were empirical studies, commentaries or opinion pieces; letters and editorials were excluded. Grey literature publications were counted if they were conference abstracts, pre-prints, reports (typically, topical reports produced by charities or government) or theses; guidance documents were excluded unless they also reported in full the study on which the guidance was based, e.g. UK NICE guidelines with accompanying systematic review. 19 This criterion broadly reflects the inclusion criteria for the types of document eligible for inclusion in the Women's Health and Occupational Health reviews, although unlike these two reviews we did not limit according to study design for the purpose of assessing the first criterion of our second aim. Furthermore, for a result to be counted it had to link directly to a journal article or grey literature publication, usually indicated by the publication title, author and/or the source website in the search result (for example, see Figure 1). Ambiguous links were investigated. SB counted studies per page in all eight sets of search results, documenting each type using colour coded highlighting in the Word documents. BA and HL each checked one set of search results to corroborate that they agreed with SB's decision about which results constituted published or grey literature studies.
Table 1 (extract): Occupational Health review searches
Return to work: "return to work" ("multi-disciplinary" OR multidisciplinary) (report OR review)
Vocational rehabilitation: "vocational rehabilitation" ("multidisciplinary" OR multidisciplinary) (report OR review)
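The per-page counting used for the first criterion can be sketched as follows. This is a hypothetical illustration, not the procedure actually used (which relied on colour coded highlighting in Word documents): it assumes each result has already been labelled by hand as "journal", "grey" or "other", and the label names and example data are invented.

```python
from collections import Counter


def counts_per_page(labels, page_size=100):
    """Count journal articles and grey literature within each page of results.

    labels is a list with one manually assigned label per search result,
    in ranked order: "journal", "grey" or "other".
    """
    pages = []
    for start in range(0, len(labels), page_size):
        chunk = labels[start:start + page_size]
        tally = Counter(chunk)
        pages.append({
            "page": start // page_size + 1,
            "results": len(chunk),
            "journal": tally["journal"],
            "grey": tally["grey"],
        })
    return pages


# Invented example: 150 labelled results spread over two pages.
example_labels = ["journal"] * 3 + ["other"] * 97 + ["grey"] * 2 + ["other"] * 48
example_pages = counts_per_page(example_labels)
for row in example_pages:
    print(row)
```

Dividing the "journal" or "grey" count by "results" for each page gives the proportion of results on that page which are studies.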
To assess the second criterion, i.e. the distribution of journal articles and grey literature within the results which met the inclusion criteria for our pre-specified systematic review questions, we made a note of any such studies while assessing the first criterion. We only noted the first appearance of relevant journal articles and grey literature publications within the results, in order to ascertain how many results needed to be screened to see all relevant results. Furthermore, we extended the analysis to include links to webpages which provided 'hints' to relevant journal articles or grey literature publications which were not directly linked via the URL. For example, we documented webpages which were news items discussing ongoing or recently completed studies which on further inspection were relevant to the review question. This type of exploratory searching can be particularly valuable for identifying studies which are not retrieved by bibliographic databases. For example, recently completed studies which are not yet indexed in databases. In this respect, the analysis for the second criterion was different to the first criterion, for which the URL needed to link directly to a study. We only explored hints to studies for the second criterion because it would have been prohibitively time consuming to explore hints to any study throughout all eight sets of Google Search results. Thus, focusing on hints to potentially relevant studies provided a helpful boundary for this part of the analysis.
Because we re-ran the Google Search searches for the Occupational Health review 11 months after the initial searches, we only included relevant studies in the analysis of uniquely identified studies if on inspection they: (a) were published online prior to the date of the initial searches (i.e. July 2021), and (b) would not have been retrieved by bibliographic databases, either because the bibliographic database search strategy lacked appropriate terminology or because they were not indexed in the databases we searched.
The methods we used are summarised in a flow diagram in Figure 2.

| Feasibility of exhaustive screening
The estimated and viewable number of results for the eight Google Search searches are shown in Table 2. The mean number of estimated results for the six searches for the Women's Health review was 9,798,667 (range 342,000-16,800,000) and the mean number of viewable results was 324 (range 272-364). The mean number of estimated results for the two searches for the Occupational Health review was 36,318,500 (range 337,000-72,300,000) and the mean number of viewable results was 326 (range 319-332). Thus, the viewable numbers of results were of a similar order of magnitude to that typically screened when using a stopping-rule in Google Search, i.e. in the low hundreds of results, albeit three times more than the 100 results commonly reported. 3,5,7 On this basis, we suggest that the viewable numbers were feasible to screen in full for these eight searches; indeed, this was the approach used for the systematic reviews to which these searches contributed. 15
Table 3 shows that on average the number of journal articles was highest on page one (i.e. the first 100 results), and gradually diminished throughout subsequent pages of results. Only one of the eight searches did not conform to this trend, namely the 'vocational rehabilitation' search for the Occupational Health review. For this search, the number of journal articles per page was higher on page three (n = 78) than on page one (n = 71). This partly explains the higher proportion of journal articles on page three of the Occupational Health review search results (61%) than on page three of the Women's Health review search results (16%). However, the 'return to work' search also retrieved proportionally more journal articles on page three (45%) than any of the Women's Health review searches. For both the Women's Health review and the Occupational Health review searches, the distribution of grey literature was more consistent across the search results than that of journal articles.
A higher proportion of the search results were grey literature publications for the Occupational Health review searches than for the Women's Health review searches (see Table 3).
For the Women's Health review searches, up to four relevant journal articles per search were identified on page one (mean = 2), whereas no more than one relevant journal article was identified per search on subsequent pages. Only one of the two Occupational Health review searches identified any relevant journal articles, namely, the 'return to work' search, and these were sparsely dispersed across pages one to three (see Table 3). Across both the Women's Health review and the Occupational Health review searches, the lowest ranking first appearance of a relevant journal article was on page three, ranked at 227 for the Women's Health review and 215 for the Occupational Health review (see Table 2). Thus, across both sets of searches, all included journal articles were identified within the first 65% of the search results (range 62.3%-64.8%, see Table 2). Across the other five Women's Health review searches, the lowest ranking first appearances of relevant journal articles were identified higher in the rankings of the results (range 1-106). Across the two reviews, Google Search retrieved two uniquely identified relevant studies, i.e. studies which were not identified by other search methods used in either review. One was the 74th result in the 'return to work' search for the Occupational Health review, which was also the only grey literature publication which met the inclusion criteria for either review. 20 The other was the seventh result in the endometriosis search for the Women's Health review, which was a hint to a recently completed study which had not yet been published. 21 Following correspondence with the authors we ascertained that the study was due for publication in journal article format during the period that the review would be undertaken, and thus it was included in the review later in the review process. 22 Although the Occupational Health review Google Search searches were re-run 11 months after the initial searches for the purposes of this analysis, we did not identify any previously unidentified relevant studies for the Occupational Health review which were not also identifiable by the bibliographic database searches, i.e. no additional uniquely identified studies were found by Google Search.
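The proportion of results that must be screened to reach the lowest ranking relevant study is the rank of that study divided by the viewable number of results. The sketch below shows the calculation under stated assumptions: the function name is hypothetical, and pairing rank 215 with 332 viewable results is an assumption made for illustration, since the per-search figures are reported in Table 2 and are not reproduced here.

```python
def percent_screened(rank, viewable):
    """Percentage of viewable results screened on reaching a given rank."""
    if not 1 <= rank <= viewable:
        raise ValueError("rank must lie between 1 and the viewable number")
    return round(100 * rank / viewable, 1)


# Assumed pairing for illustration only: rank 215 within 332 viewable results.
example_percent = percent_screened(rank=215, viewable=332)
print(example_percent)  # 64.8
```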

| Distribution of journal articles and grey literature relevant to women's health or occupational health reviews
Based on our criterion of identifying relevant studies, for the eight Google Search searches carried out across these two reviews it was most desirable to screen at least the first 100 results in order to identify the two uniquely retrieved studies. It was also desirable to screen until the third page of results to identify all first appearances of relevant studies, as a potentially useful strategy for ensuring that studies were not missed by bibliographic databases and other supplementary search methods. Of those Google Search searches which retrieved more than 300 results, not only were no relevant studies retrieved beyond this point, there were also fewer studies in journal article format (6%-7% of results in total) and no grey literature publications (see Table 3). Thus, there was no evidence that screening these results was useful for identifying relevant studies, nor that there was much chance of identifying a study at all, relevant or not. Therefore we suggest that the desirability of screening to the end of the search results was diminished, particularly where there were more than 300 results.

| DISCUSSION
This cross-case analysis of eight Google Search searches adds to existing evidence on the feasibility of screening the results of searches exhaustively. 13 We have also suggested that, for the two case studies in the present study, it was desirable to screen to the third page of results, where the lowest ranking first appearances of relevant studies were identified (approximately 65% of the retrieved results in total in both cases). The distribution of studies, relevant or not, was much lower on page four, which diminished the desirability of screening these results.
The feasibility of screening the results of Google Search searches exhaustively, in cases where there are relatively low numbers of viewable results, is important because it sets a new baseline for the development of appropriate approaches to screening the results for a systematic review. That is, historically, systematic reviewers and expert searchers have reported that high numbers of results in Google Search necessitate the use of a stopping-rule, or make screening impractical, 5,7,9,11,12 but the present study and Briscoe and Rogers challenge this assumption. 13 Thus, the rationale for developing an approach to screening will need to incorporate the desirability of screening exhaustively. However, this does not necessarily mean that when using Google Search it will always be feasible to screen in full. For example, there may be instances where searches retrieve higher numbers of results than are feasible to screen; and, if multiple Google Search searches are carried out per review, the total number of results across all searches may not be feasible to screen in full. 23 Nevertheless, our findings show that on some occasions Google Search results are feasible to screen exhaustively.
It is unclear why there is a large discrepancy between the estimated and viewable number of results. As noted in Briscoe and Rogers, the relatively small number of viewable results is unlikely to account for all webpages indexed by Google Search that match a search query. 13 The difference may partly be explained by how search engines organise their indexes in "tiers and partitions", not all of which are scanned on every search. 24 Thus, for example, a webpage deep inside a website may not be retrieved by a general web search, but will be retrieved if the search is restricted to the website using the 'site' command. 24 However, this does not account for why the search engine would still report a number of results that is higher than that which is viewable. We also noted that the numbers of results retrieved by Google Search were not always what we would expect. In particular, we sometimes found that adding a term to a search string using the OR Boolean operator decreased the estimated number of results, whereas this ought to increase the number of results. For example, searching for women OR females within the chronic pelvic pain search string for the Women's Health review retrieved fewer results than searching solely for women (see Table 1). Despite this, we were satisfied that Google Search recognised the OR Boolean operator because we could see both the words "women" and "females" in the search results when combined using OR. However, it is important that systematic reviewers more familiar with searching bibliographic databases than search engines are alert to the hidden mechanisms that determine which results are retrieved, and do not place too much faith in the careful construction of Boolean searches for retrieving all potentially relevant studies. 
Furthermore, as undertaken for this study, and recommended by Gusenbauer and Haddaway, 25 and Briscoe et al., 11 the extent to which search operators are supported needs careful consideration when searches are developed.
The increased proportion of relevant studies in higher ranking search results in the Women's Health review searches is consistent with the commonly reported view that the value of screening diminishes for lower ranking results, i.e. those studies appearing higher in the list of search results are more likely to be relevant than those appearing lower in the search results. 5,8,11,12 We are aware of one other study to date by Cooper et al. which has assessed the distribution of studies within Google Search results, with a particular focus on comparing the results when searching in different geographical locations. 7 However, the search they used for analysis retrieved fewer than 100 results, thus they were not able to assess the desirability of screening more than this number. 7 Furthermore, assuming it is not common practice to use the "repeat the search with omitted results showing" function in Google Search, each 'chunk' of 100 results in our searches might look different to searches which are typically carried out without using this function (for which the number of viewable search results, as in Cooper et al., will be even fewer than the numbers reported in the present study). 7 In order for the findings of the present study to be usefully applied to, or compared with, other Google Search results, it is necessary for searchers to apply this function before screening. We suggest this is valuable because it potentially increases the likelihood of identifying relevant evidence through increased exposure to search results. Nonetheless, as Cooper et al. found, we suspect that the ranking of search results would still be distributed differently depending on the geographical location of the searcher when using this function. 7
Although not part of our assessment of the feasibility or desirability of screening exhaustively, the potential difference in the distribution of studies within the results with and without using the repeat search function also makes it difficult to assess the appropriateness of using 'feedback' based stopping-rules, such as screening until one or two pages of results have been inspected without identifying any relevant content. 5 However, we did note that there were over 100 results between included studies in some searches, which suggests that feedback-based approaches may not be effective if using the default setting in Google Search of ten results per page.
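The interaction between a feedback-based stopping-rule and the number of results per page can be illustrated with a small simulation. This is a sketch under stated assumptions: the function name is hypothetical, the rule modelled is "stop after a set number of consecutive pages containing no relevant results", and the ranks of relevant results (5 and 120 within 300 results) are invented to show a gap of more than 100 results between relevant studies.

```python
def last_rank_screened(relevant_ranks, total_results, page_size=10, empty_pages=2):
    """Rank of the last result screened under a feedback-based stopping-rule.

    Screening proceeds page by page and stops once `empty_pages` consecutive
    pages have been inspected without finding a relevant result.
    """
    relevant = set(relevant_ranks)
    consecutive_empty = 0
    screened = 0
    for start in range(1, total_results + 1, page_size):
        end = min(start + page_size - 1, total_results)
        screened = end
        if any(rank in relevant for rank in range(start, end + 1)):
            consecutive_empty = 0
        else:
            consecutive_empty += 1
            if consecutive_empty == empty_pages:
                break
    return screened


# Invented ranks of relevant results within 300 viewable results.
ranks = [5, 120]
stop_at_10_per_page = last_rank_screened(ranks, 300, page_size=10)
stop_at_100_per_page = last_rank_screened(ranks, 300, page_size=100)

found_10 = [r for r in ranks if r <= stop_at_10_per_page]
found_100 = [r for r in ranks if r <= stop_at_100_per_page]
print(stop_at_10_per_page, found_10)
print(stop_at_100_per_page, found_100)
```

With ten results per page, screening in this example stops at rank 30 and the study at rank 120 is missed, whereas with 100 results per page both are seen before the rule is triggered.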
Theoretically, extensive screening might also be supported by the fact that search engine algorithms interpret relevancy differently to systematic reviewers. Whereas the latter assess relevance according to content and study design, search engines assess relevance according to an array of factors, including content (typically measured as frequency of search terms within webpages), age, length, and 'authority' based on number of links to a webpage. 24 Thus, it may be desirable to screen the results more extensively in order to see results which contain relevant content but are not prioritised by a search engine algorithm due to non-relevant factors. Relatedly, rather than routinely clearing a web browser's search history before searching (as recommended in some guidance), 8 it may be worth only doing this prior to the development of searches for new topics, thus potentially encouraging the retrieval of content that is missed on initial iterations of searches. Indeed, qualitative research on how expert searchers undertake web searching suggests that relatively rapid and repeated attempts at identifying relevant studies are sometimes preferred to the careful construction of an intricate search strategy. 26 This is particularly the case where the searcher is seeking to fulfil a clearly bounded information need (such as a known study or a sample set of a specific type of study), and is expecting to see relevant search results towards the top of the list. 26 In contrast, more exploratory or speculative searches can require more extensive screening. However, in either scenario (i.e. where multiple searches are used for the same search topic), it may be useful to retain a browser's search history in order to encourage the retrieval of search results which are similar to results from earlier attempts at searching.
Finally, we note that we copied and pasted the search results into Word documents for screening, rather than manually adding them into reference management software. The latter option, although preferred by some reviewers, would most likely be sufficiently time consuming to greatly reduce the desirability of screening a full set of search results from Google Search. 27 Thus, we suggest that screening is undertaken using an approach similar to that outlined in this study.

| Strengths and limitations
To the best of our knowledge, this is the first study to explore the distribution of studies within Google Search results which retrieve more than 100 results, i.e. the commonly reported screening limit applied when searching for studies for systematic reviews. We have focused only on Google Search to the exclusion of other search engines, such as DuckDuckGo (https://duckduckgo.com/) and Bing (https://www.bing.com/). However, this reflects current practice where generally search engines other than Google Search are not widely used for the purpose of searching for studies for systematic reviews. 11,28 By using eight searches for the analysis, we have avoided relying on a small set of data, although additional testing would be welcome to strengthen the evidence-base. We did not measure the additional time required to screen more extensively, but we have suggested that the feasibility of this approach is based on the search results being in the same order of magnitude as the number commonly screened, i.e. the low hundreds. We also note that the analysis of identified studies for the first criterion of the second aim reflects the types of document which met the inclusion criteria for the Women's Health and Occupational Health reviews, albeit not limited by study design. Other reviews might have narrower or broader inclusion criteria, particularly with respect to grey literature, editorials and letters. However, we are confident that most systematic reviews include published studies, for which the analysis we present will be informative.

| CONCLUSIONS
The feasibility of screening the results of Google Search exhaustively for some searches is now clear. Although the desirability of this is less apparent, this study has provided evidence that it may be useful to screen Google Search results more extensively than is often reported.
DATA AVAILABILITY STATEMENT
Data available on request from the authors.