The deep web in institutional repositories in Japan

Authors


Abstract

Investigating the current status of the deep web is important for both the general public and researchers. Several deep web surveys based on institutional repositories (IRs) have recently been conducted. We calculate the extent of the deep web from the content of IRs in Japan, using an exhaustive search with three major search engines (Google, Yahoo!, and Bing) and an appropriate interval between harvesting and searching. We find that roughly 30% of this content lies outside the combined coverage of the major search engines, i.e., in the deep web.

INTRODUCTION

The deep web refers to web pages that cannot be accessed via ordinary search engines. Thus, the “deep web” can be seen as the antithesis of the “searchable web.”

Investigating the current status of the deep web is important for both general users and researchers. However, accessing the deep web is extremely difficult. It is unrealistic to collect the deep web exhaustively by crawling, considering that even the crawlers of large-scale search engines cannot reach this area.

However, in recent years, two deep web surveys involving an exhaustive collection of content based on harvesting metadata from IRs through OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting) were undertaken to investigate the searchable web and calculate the extent of the deep web. After exhaustively harvesting OAIster in June 2005 and randomly extracting 1,000 content URLs from the metadata, McCown et al. conducted searches using Google, Yahoo!, and MSN and showed that 21% of the OAI-PMH corpus was not indexed by any of the search engines. In a follow-up study in June 2008, Hagedorn et al. conducted a large-scale survey of the web using the Google Research API and demonstrated that Google's coverage of the web was 44.35%.

However, there may be some room for improvement in both of these studies. First, the study by McCown et al. examined several search engines and analyzed the overlap between them, but the number of URLs included in their study was extremely small. Hagedorn et al., on the other hand, included a substantial number of URLs, but were unable to analyze overlap, since they used only one search engine. Furthermore, neither study makes clear whether an interval was left between collecting the metadata and querying the search engines. Without such an interval, when content is not found via the searches, it is not possible to determine whether it is simply new content that routine crawler runs have not yet picked up, or content that is genuinely buried in the deep web.

METHOD

Harvesting of Data from IRs

The web content used as the subject of this study consisted of full-text PDF files, described by metadata in the junii2 format and available to the public via 92 IRs in Japan.

The reasons for focusing on full-text PDF files in IRs are as follows. 1) All of the content can be collected via OAI-PMH harvesting, without relying on a crawler that follows links. 2) The Robots Exclusion Protocol (REP) is rarely restrictive, since the main purpose of an IR is to make research results publicly available. 3) The vast majority of items in IRs are full-text files containing academic information.

The reason we limited the study to Japanese IRs was that they provide metadata in the junii2 format. Compared with other formats, the volume of information in this format is considerably richer, enabling further analysis after collection.

Exhaustive metadata harvesting from the IRs was undertaken on April 11, 2009. URLs of full-text files were extracted from the metadata. As a result, a total of 404,431 items were obtained and used as subject URLs.
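As an illustration, the following is a minimal sketch of how such OAI-PMH harvesting and URL extraction could be performed. The repository endpoint in the usage comment and the reliance on the junii2 fullTextURL element for full-text links are assumptions made for illustration; this is not the exact script used in the study.

    # Minimal OAI-PMH ListRecords harvester using only the Python standard library.
    # The fullTextURL element name and the example endpoint are illustrative assumptions.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI = "{http://www.openarchives.org/OAI/2.0/}"

    def harvest_fulltext_urls(base_url, metadata_prefix="junii2"):
        """Yield candidate full-text URLs from one repository via ListRecords."""
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        while True:
            query = urllib.parse.urlencode(params)
            with urllib.request.urlopen(base_url + "?" + query) as resp:
                root = ET.fromstring(resp.read())
            for record in root.iter(OAI + "record"):
                # junii2 records carry full-text links in a dedicated element;
                # match on the local element name to stay namespace-agnostic.
                for elem in record.iter():
                    if elem.tag.endswith("fullTextURL") and elem.text:
                        yield elem.text.strip()
            # Follow resumptionToken pages until the repository returns none.
            token = root.find(OAI + "ListRecords/" + OAI + "resumptionToken")
            if token is None or not (token.text or "").strip():
                break
            params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

    # Hypothetical usage for a single repository:
    # urls = set(harvest_fulltext_urls("https://repository.example.ac.jp/oai"))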

Investigation of Search Engines

The search engines used in this investigation were Google, Yahoo! Japan (although the engine was nominally “Japan,” the indexed content was not limited to Japanese content), and Bing. These engines were selected based on the following conditions: 1) the search engine is a major, widely used engine; 2) the engine provides a worldwide service rather than only a domestic one; 3) a search engine API that can be called from external programs is publicly available; and 4) a search with a URL as the query is possible.

Search engine APIs used in the study were: the Google AJAX Search API for Google, the web search API of the Yahoo! Developer Network for Yahoo!, and the Bing API 2.0 for Bing. URLs of full-text files from IRs were input as the query to each search engine API. A URL is deemed unsearchable when no search results are returned for it, and searchable when one or more results are returned.
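The searchable/unsearchable decision rule can be sketched as follows. Since the 2009-era APIs named above have since been discontinued, the per-engine hit-count functions here are hypothetical stand-ins for "submit the URL as a query and return the number of results".

    # Sketch of the searchability check; the engine callables are hypothetical.
    import time
    from typing import Callable, Dict, Iterable, Set

    def find_searchable(urls: Iterable[str],
                        engines: Dict[str, Callable[[str], int]],
                        delay: float = 1.0) -> Dict[str, Set[str]]:
        """For each engine, collect the subject URLs that return one or more hits."""
        searchable: Dict[str, Set[str]] = {name: set() for name in engines}
        for url in urls:
            for name, result_count in engines.items():
                if result_count(url) >= 1:   # one or more results: searchable
                    searchable[name].add(url)
            time.sleep(delay)                # simple rate limiting between queries
        return searchable

    # Hypothetical wiring, with one hit-count function per engine:
    # searchable = find_searchable(subject_urls,
    #     {"Google": google_hits, "Yahoo!": yahoo_hits, "Bing": bing_hits})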

We conducted a study of search engines using URLs of 404,431 full-text files as queries between September 6 and September 8, 2009.

Coverage and overlap

Calculating the coverage, i.e., the proportion of the subject content that is searchable using search engines, and the overlap between the search engines makes it possible to determine the extent of the deep web.

First, the formula below was used to determine the coverage.

\[ \text{Coverage}\ (\%) = \frac{\text{Number of searchable URLs}}{\text{Number of subject URLs}} \times 100 \]

Next, the overlap ratio between two search engines was calculated from the number of URLs that were searchable by both, expressed as a proportion of the URLs searchable by one of the engines.
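For concreteness, a minimal sketch of both calculations, assuming the searchable URLs for each engine are available as sets (the variable names are illustrative):

    from typing import Set

    def coverage(searchable: Set[str], subject_urls: Set[str]) -> float:
        """Coverage: searchable URLs as a percentage of all subject URLs."""
        return 100.0 * len(searchable) / len(subject_urls)

    def overlap(a: Set[str], b: Set[str]) -> float:
        """Share of engine A's searchable URLs that engine B also finds (asymmetric)."""
        return 100.0 * len(a & b) / len(a)

    # With the figures reported in Table 1, for example:
    #   coverage(google_set, subject_urls)                        -> 53.2
    #   coverage(google_set | yahoo_set | bing_set, subject_urls) -> 72.0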

RESULTS

Coverage rate of the Deep Web

As shown in Table 1, the coverage rate for each search engine was calculated from the number of searchable URLs in the subject URLs.

Table 1. Search Engine Coverage

                          Google     Yahoo!     Bing       Total
No. of searchable URLs    215,259    174,805    115,679    291,024
Rate of coverage          53.2%      43.2%      28.6%      72.0%

No. of subject URLs in the study: 404,431

The search engine with the highest coverage rate was Google, but even Google on its own covered only 53.2% of the subject URLs. On the other hand, when Google, Yahoo!, and Bing were used in combination, they achieved a combined coverage of 72.0%. In other words, 28.0% of the subject content could not be found by the major search engines even when they were used in combination; this area can be identified as the deep web.
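Plugging the combined figure into the coverage formula above gives the same result:

\[ \frac{291{,}024}{404{,}431} \times 100 \approx 72.0\%, \qquad 100\% - 72.0\% = 28.0\%. \]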

Rate of overlap of the search engines

Table 2 shows the results of calculating the overlap between the search engines, expressed as the overlap ratio of each engine's coverage. Note that the ratio is asymmetric: for example, Google has a 54.2% overlap with Yahoo!, whereas Yahoo! has a 66.7% overlap with Google.

Table 2. Overlap of Search Engines

Based on the values in Tables 1 and 2, Fig. 1 shows the extent of coverage of the search engines in relation to the public web according to the size of the circles, while the ratio of overlap of the engines is denoted by the overlap of the circles. This figure clearly demonstrates that a single dominant search engine is not capable of searching the entire web. It also shows that while there is overlap in coverage of the respective search engines, there are areas without overlap.

Figure 1. Coverage and Overlaps of Search Engines

SUMMARY

This study investigated the current status of the deep web using a collection of full-text URLs harvested from IRs. The coverage rate was highest with Google, at 53.2%, and when Google was combined with the two other major search engines, coverage increased to about 72%. If this portion is considered the searchable web, the remaining roughly 30% can be regarded as the deep web.
