Liwen Vaughan is Professor in the Faculty of Information and Media Studies at the University of Western Ontario. Currently, the main focus of her research is the Web, including Web data mining, evaluation of Web search engines, and Webometrics (the analysis of Web-related information using quantitative methods).
Address: Faculty of Information and Media Studies, University of Western Ontario, London, Ontario, N6A 5B7, Canada
Yanjun Zhang is a Ph.D. candidate in the Program of Library and Information Science at the University of Western Ontario in London, Ontario. His major research interests are Web search engine evaluation, quantitative studies of information dissemination on the Web, and use of Semantic Web related technology to improve retrieval and evaluation of Web-based information, including online quality health information.
Address: Faculty of Information and Media Studies, University of Western Ontario, London, Ontario, N6A 5B7, Canada
The study examined search engine coverage of websites across countries and domains. Websites in four domains (commercial, educational, governmental, and organizational) from four countries (U.S., China, Singapore, and Taiwan) were randomly sampled by custom-built computer programs and then manually filtered for their suitability for the study. Representation of the 1,664 sampled sites in four major search engines (Google, Yahoo!, MSN, and Yahoo! China) was examined in terms of whether the site was covered and the number of pages indexed by the search engines. The study found that U.S. sites received higher coverage rates than their counterparts in other countries. The language of a site did not affect the site’s chance of being indexed by search engines. Sites that were more visible had a higher chance of being indexed, but this factor did not seem to explain the differentiated coverage across countries. Yahoo! China provided better coverage of sites from China and surrounding regions than its global counterpart, Yahoo!. The poor coverage of Chinese commercial and governmental sites is noted and the implications are discussed in light of the tremendous development of the Web in China.
The significance of the Web in various aspects of our life is well known and well documented. Less well known is the impact of Web search engines on what information we find and consume. Given that 75% of Web users rely on search engines as the primary means to traverse the Web (Bruemmer, 2001), the significance of search engines in an information society should not be underestimated. The sheer size of the Web makes it impossible for a single search engine to cover all Web pages in existence. Different search engines cover different parts of the Web. As websites that are covered by search engines are much more visible and accessible to Web users, the selective coverage of the Web by search engines has great social, political, cultural, and economic implications.
For example, e-commerce sites that are not covered by search engines are economically disadvantaged. Government sites that are not indexed by search engines have a reduced chance of communicating their political messages to their target audience. In fact, the political significance of Web search engines has long been recognized by various governments and critics. Search engines have been banned in China for political reasons and, indeed, controversy still persists over what types of sites should be censored or removed from search engines (Asiasearch, 2004; McHugh, 2003). The recent dispute over Google China’s censorship of certain political topics is further proof of the political implications of search engines (Thompson, 2006; USINFO, 2006).
Given the significance of search engines, it is thus important to ask: What types of websites (language, country, and content) are more likely to be covered by search engines? Which search engines give a more balanced representation of different types of websites? Do local search engines for a particular country provide a more balanced representation than U.S.-based search engines? These are among the research questions examined in this study.
It is important to note that this study intends to find out whether search engines provide an equal representation of websites from different countries. The concept of “equal” is measured by the proportion of websites that are covered. By equal representation, we mean that the same proportion of websites from different countries is covered, so that websites from different countries have an equal chance of being presented to Web users. We consider this situation ideal, one that search engines should strive to achieve; deviation from it is interpreted as unequal representation. This unequal coverage, however, does not imply an intentional or purposeful under-representation. Search engines all use Web crawlers to build up their databases, so a “biased” representation is most likely caused by technical factors, except in certain cases such as Google China’s current (early 2006) policy of suppressing certain political Web pages. Regardless of the cause of (un)equal representation, the situation itself is worth exploring.
This study focused on search engines and websites from a few countries, as it would be impossible to examine all search engines and all countries in existence. Google, Yahoo!, MSN, and Yahoo! China are the chosen search engines, while the United States, China, Singapore, and Taiwan are the chosen countries. The rationale for the choices is as follows. The search engine market is dominated by a few giants who take the lion’s share of the market. A recent study (Sullivan, 2006) showed that the first three search engines selected for the study accounted for 81.8% of the search engine use in the U.S. in May 2005 (47.3%, 20.9%, and 13.6% for Google, Yahoo! and MSN, respectively). These search engines are all based in the U.S. but are used worldwide. However, international Web users do not rely exclusively on U.S.-based search engines. Many country-specific search engines have been developed. Yahoo! China (cn.yahoo.com) is one such example. It was the result of a merger between Yahoo! and a major Chinese search engine 3721 around the end of 2003, and it currently retains its position as the second largest search engine in China, as measured by amount of use. The largest is Baidu (www.baidu.com), a Chinese search engine that has Google as a shareholder, but that is not Google China, the Chinese version of Google (discussed below). Baidu was not included in this study, as it is purely a Chinese search engine, and the focus of the study was global search engines.
A local version of a global search engine can take different forms. For example, Google has a Canadian version (www.google.ca) but this is just a Canadian interface for the global Google (www.google.com). The two have the same underlying databases, so the search results are the same. However, not all local versions of a global search engine work this way. A search on Google China (www.google.cn) will retrieve very different pages from those of global Google, particularly with respect to politically sensitive search queries. Google China imposes political censorship on its search results and explicitly states in its results screen that part of the search results are omitted to comply with local laws and regulations. Google China was not included in the study because at the time of data collection (summer 2005), Google China did not exist. This can be confirmed by a search of URLs www.google.cn and www.google.com.cn in the Internet Archive (www.archive.org). All records in the archive under these two URLs (October 2000 to April 2005) show that a Chinese organization unrelated to Google was using these URLs.
Yahoo! China has a different database from that of global Yahoo!, so the same query will retrieve different results in the two search engines. To distinguish the two Yahoo! search engines, the regular Yahoo! engine (www.yahoo.com) will be called global Yahoo! throughout this study. It is the underlying database that determines the content and thus the search results of a search engine, and it is the database, rather than the interface, that is the focus of the current study. If a website is indexed in the database of a search engine, then it will be searchable through that engine. The words “indexed” and “covered” are used interchangeably throughout this article. Yahoo! China was chosen for the study to serve as a comparison with these U.S.-based search engines. The comparison will also give us a sense as to whether a regional search engine provides a better representation of websites from the region and if it under-represents sites from other regions.
The U.S. and China were chosen for the study due to their political and economic significance in world affairs, and due to the enmity or rivalry that exists between the two countries. These two countries, it could be argued, are the two most dominant players on the Web. Very few studies have looked at the dominant players in this regard, let alone minor ones, so it is natural to start this kind of study with the two giants. Another reason for choosing the U.S. is that U.S.-based search engines are more likely to give the U.S. favorable coverage if a country bias does exist. Taiwan was chosen as a comparison with China, as the two have the same language but very different political and social structures.1 If the two countries are represented very differently in search engines, then the difference cannot be attributed to the language factor of the website (i.e., the difference is not caused by the possible technical difficulties of processing the Chinese language). To further examine the language factor, Singapore was also chosen. Although English is only one of the official languages of Singapore, the vast majority of Singapore websites are in English. They thus serve as a contrast with the English-language U.S. sites, a contrast parallel to that of China and Taiwan.
To determine whether websites of different countries or languages are proportionally represented by search engines, it would be useful to know the distribution of websites over countries or languages. However, this information is difficult to obtain, due to the sheer size of the Web and its constant change. Grefenstette and Nioche (2000) showed that non-English language sites were growing more rapidly than English sites. In 2002, 72% of Web content was in English (OCLC, 2002). However, it was estimated that by 2005, less than 50% of Web content would be in English (Peters, 2002). The Online Computer Library Center (OCLC) carried out large-scale studies to estimate the size of the Web and to obtain country and language statistics (OCLC, 2002). However, the study has not been repeated in recent years, perhaps due to the difficulty of such an effort. A UNESCO report by Paolillo, Pimienta, Prado, et al. (2005) offers somewhat more up-to-date estimates. China’s tremendous development on the Web in recent years has changed the landscape of the Web and its user population. Although data from different sources show slightly different figures, they are consistent with the claim that China’s Web user population is currently second only to that of the U.S. (Chua, 2004).
Given the importance of search engines in information access and information freedom, it is ironic that accurate and reliable data on search engines are often not publicly available. Not only are search algorithms proprietary commercial secrets; so, too, is the size of their databases. The recent controversy over the size of the Yahoo! database shows the difficulty of obtaining accurate information on search engines. A Yahoo! insider hinted that Yahoo! indexed over 20 billion items (Mayer, 2005). The claim was immediately questioned and followed by a debate over the size of Yahoo! vs. Google (Cheney & Perry, 2005). To compare the database sizes of Yahoo! and Google, Cheney and Perry (2005) submitted randomly selected uncommon words to the two search engines and then compared the number of hits returned. Although this method provided a rough comparison of the two databases, it is not scientifically rigorous. The method involved only uncommon words that may not be a fair representation of the database content. Another limitation of the method is that it was limited to the use of the English language, a method that definitely cannot be applied to the present study.
The incidence of unequal visibility of websites has been well documented (Albert et al., 1999). Concerns have been expressed regarding the danger posed by search engines emphasizing certain websites while making others essentially disappear (Introna & Nissenbaum, 2000). Thelwall (2000) found that commercial sites from various countries received different coverage according to the search engine in question. Mowshowitz and Kawaguchi (2002) examined search engine bias, defined as “the degree to which the distribution of items in a collection (provided by a search engine) deviates from the ideal.” Their method of comparing a particular search against “a control group of search engines” is not applicable to the present study, as the present study examines three major search engines rather than one particular site, as was the case in the Mowshowitz and Kawaguchi (2002) study.
Vaughan and Thelwall (2004) took a different approach from that of Mowshowitz and Kawaguchi (2002) to analyze search engine coverage bias. The present study follows the model used by Vaughan and Thelwall (2004) and conceptually defines bias as non-proportional coverage of particular groups of websites—for example, sites from a particular country or in a particular language. The Vaughan and Thelwall (2004) study focused on commercial websites, while the present study extends the scope to include four major types of websites: Commercial, educational, governmental, and organizational sites. The search engine market has changed significantly since 2003 when Vaughan and Thelwall (2004) collected data. While Google remains strong, two other search engines examined in that study, AltaVista and AllTheWeb, were purchased by Yahoo! in early 2004. MSN was not a major player at that time, and Yahoo! China did not exist. The present study not only provides up-to-date information on the search engine coverage issue, but also sheds light on the question of whether search engines have changed over time.
The overall process of the study involved several steps. First, random samples of websites from each country for each type of site were generated and verified by computer programs. Each of these sites was then manually checked to select only those that were suitable for the study. Next, all four search engines in the study were queried to find out if the engine covered a site and the exact number of pages of the site that were indexed. For all sites in the study, the number of links pointing to the site (inlinks) was also determined by querying search engines. This inlink count was analyzed as a possible factor that affects the chance of the site being indexed by search engines. The details of the methods are described below.
Types of Sites (Domains)
As different types of sites (e.g., commercial sites, educational sites) have different characteristics and sizes (e.g., university sites are typically larger than organizational sites in that they have more pages), it is important to examine different types of sites separately. Four of the most common types of sites were studied: Commercial sites, educational sites, governmental sites, and organizational sites. Websites with domain name designations of .com, .edu, .gov, and .org were sampled to represent these four types of sites, respectively (details of sampling are reported below). Sites of less common interest, such as military sites, were omitted from the study.
As there is no existing list of websites from a particular country for ready use, the study had to generate the lists from the ground up. Two methods have been used in the past to generate a random list of websites: Random generation of domain names (e.g., Vaughan & Thelwall, 2004) and random generation of IP addresses (e.g., OCLC, 2002). The latter method is not feasible for this study, as it would involve reverse lookup of IP addresses to determine the country location of each site. This would be highly inefficient, given the number of countries in the world. Although a script could be written to do this automatically, the large number of lookup queries in a short time could be mistaken for a denial of service attack by DNS servers.
This method also has other problems. For example, a U.S. site could share the same IP address with multiple sites in other countries. Random generation of domain names is particularly suited to the purpose of this study, as it allows for the specification of countries (e.g., .cn for China) and types of sites (e.g., .com for commercial sites), so this method was used. Specifically, a domain name was randomly generated that consists of four components: www.randomletter.type.country. The meaning of the last two components is self-explanatory (e.g., .com.cn for Chinese commercial sites and .gov.sg for Singapore government sites). U.S. sites do not usually have a country designation, so they only have the first three components. In very rare cases, U.S. sites have the country designation of “us” in their URL. However, these sites are the exception rather than the rule, and there is no reason to believe that their exclusion from the study will cause a systematic bias in the sample. The component ‘randomletter’ is a randomly generated string of one to six letters. A Java computer program was written specifically for this project that can generate, for example, all possible combinations from www.a.gov.cn to www.zzzzzz.gov.cn for Chinese government sites.
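The generation step described above can be sketched as follows. The sketch is in Python rather than the authors' Java, and the function name is ours; only the www.randomletter.type.country structure and the one-to-six-letter limit come from the study.

```python
import random
import string

def random_domain(site_type, country=None, max_len=6):
    """Build a candidate URL of the form www.<randomletters>.<type>[.<country>],
    e.g. www.abc.gov.cn for a Chinese government site. The letter string is
    one to max_len lowercase letters, mirroring the study's six-letter limit."""
    length = random.randint(1, max_len)
    letters = "".join(random.choices(string.ascii_lowercase, k=length))
    suffix = site_type if country is None else site_type + "." + country
    return "www." + letters + "." + suffix

print(random_domain("gov", "cn"))  # e.g. www.qkz.gov.cn
print(random_domain("com"))        # U.S. sites carry no country code
```

Most names produced this way will not exist, which is why the verification step described next is needed.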
The domain name generated in this way may not be a valid URL, that is, such a website may not exist. Therefore each domain name generated needed to be verified. A Java HTTP connection program was written specifically to perform this task. If the attempted connection to a randomly generated URL was unsuccessful after three tries, then the URL was discarded and another URL was generated and the connection tested again. The process continued until a sufficiently large number of valid sites was found for a further filtering process (described below).
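A minimal version of this verification step might look like the sketch below (again Python rather than the original Java). The retry count of three matches the procedure above; the timeout value and the helper name are our own assumptions.

```python
import urllib.error
import urllib.request

def site_exists(domain, tries=3, timeout=10):
    """Return True if an HTTP connection to the candidate domain succeeds
    within the allowed number of attempts; otherwise the URL is discarded."""
    url = "http://" + domain
    for _ in range(tries):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.getcode() < 400
        except (urllib.error.URLError, OSError):
            continue  # connection failed; try again
    return False
```

A sampling loop would keep generating and testing candidates until the target number of valid sites for the filtering stage was reached.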
The decision to limit the randomletter part of the URL to no more than six letters was based on the practical consideration of completing the project in a reasonable amount of time. If we set a higher limit on the number of letters, the possible permutations of the domain names would increase exponentially and the vast majority of these randomly guessed domain names would not exist. This means that the vast majority of the HTTP connection attempts would fail and the program would keep running with low chances of finding a valid URL. Thus the time for data collection would increase greatly. For similar reasons, previous studies have also traditionally set a limit on the length of domain names—for example, the Vaughan and Thelwall (2004) study set a limit of four letters.
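The growth described above is easy to quantify: the number of candidate names with up to k lowercase letters is the sum of 26^i for i = 1 to k.

```python
def candidate_names(max_len):
    """Count possible lowercase-letter strings of length 1..max_len."""
    return sum(26 ** i for i in range(1, max_len + 1))

print(candidate_names(6))  # 321,272,406 candidates at the six-letter limit
print(candidate_names(7))  # 8,353,082,582 -- roughly a 26-fold jump
```

Each additional letter multiplies the search space by about 26 while the number of real sites grows far more slowly, so the hit rate of random guessing collapses.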
It is acknowledged that this sampling method excluded websites that have more than six letters in their domain names, introducing a bias favoring abbreviated domain names. However, this bias should not undermine the findings and conclusions from the study, as the focus of the study is to make a relative comparison of coverage of sites from different countries. Sites with longer domain names from all countries were excluded. There is no reason to believe that search engines tend to over-represent sites with long names from country A and under-represent sites with short names from country B. Thus the comparison of websites with short names will not cause the conclusion to be invalid. Sites with short or abbreviated names are not uncommon, e.g., IBM, HP, or UWO (the authors’ university). A sample of websites used in the study is shown in Appendix 1.
Each valid site found by the computer programs was manually examined to filter out sites that were not appropriate for the study. The first criterion of filtering was that the site must be the type that it was supposed to be (e.g., a .com site must be a commercial site, not a personal site). Sites that were under construction or password protected were filtered out. In addition, sites that did not fit the language requirements were filtered out. This means that Singapore sites must be in English while Chinese sites must be in Chinese. Exceptions include Taiwan government sites, some of which are bilingual in English and Chinese. If we only allowed pure Chinese language sites, there would not be enough sites for this type. As .com, .org, and .gov domain names can be used by countries other than the U.S., the non-American .com, .org, and .gov sites were moved to the corresponding country group or discarded altogether if the country was not in the study.
The sampling of educational sites used a different procedure. To ensure that the sites in the .edu type were comparable, we decided to study only university/college sites. Taiwan and Singapore had only 68 and 7 universities/colleges, respectively. These were too few data points for random sampling and statistical analysis, so these two countries were excluded from the .edu part of the study. Since there are reliable listings of universities and colleges in the U.S. and China, we used those sources for sampling instead of the random domain name method, which would have excluded universities with long domain names. American Universities and Colleges (American Council on Education, 2001) was used to sample U.S. universities and colleges, while World List of Universities and Other Institutions of Higher Education (International Association of Universities, 2004) was used to generate the Chinese sample. An online source (Viron, 2005) was used to complement and verify data from these two print sources. The numbers of universities/colleges in these listings were 4,064 and 610 for the U.S. and China, respectively. A random sample of 202 universities/colleges from each country was taken.
A total of 1,926 sites were manually examined and 1,664 sites were finally included in the study. See Table 1 for the breakdown by countries and types of sites.
Table 1. Number of sites in the study
Search Queries Used
The second column of Table 2 shows the queries that were used to find out the number of pages that a search engine indexed for a hypothetical site www.xyz.gov. If the number returned was zero, it means that the site was not covered by the search engine.
The third column of Table 2 shows the query syntax to find out the number of inlinks to the hypothetical site. There are two types of inlinks: total inlinks and external inlinks. Total inlinks include all links pointing to a particular site, while external inlinks include only links coming from websites outside the site in question (Vaughan, 2005). In other words, external inlinks do not include links within the site itself, such as the “back to home” type of navigational links. The study collected data on external inlinks (also called inlinks later in this study), because the number of internal inlinks is not an indicator of online visibility, the factor that the study was examining. Google cannot perform external inlink searches. Although Google has the “link” command, this command cannot be combined with the “site” command to filter out internal links. In other words, Google can find all links to a page but cannot separate internal links from external links.
In searching for the inlink count, partial domain names were used in the “site” portion of the query in Table 2 (i.e., “www” was not included). The partial domain name search captures inlinks pointing to a website more completely, because it is conceivable that a government site named www.xyz.gov uses mail.xyz.gov for its mail service.
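Since Table 2 is not reproduced here, the sketch below shows how such queries might be assembled. The "site:" and "linkdomain:" operators are typical of the engines studied (the latter is Yahoo-style syntax); they are illustrations of the approach, not necessarily the exact strings in Table 2.

```python
def page_count_query(site):
    """Query whose hit count is the number of pages indexed for the site;
    a count of zero means the site is not covered by the engine."""
    return "site:" + site

def external_inlink_query(domain):
    """External inlinks to the site, using the partial domain name
    (no 'www.') so that hosts such as mail.xyz.gov are captured too."""
    return "linkdomain:{0} -site:{0}".format(domain)

print(page_count_query("www.xyz.gov"))   # site:www.xyz.gov
print(external_inlink_query("xyz.gov"))  # linkdomain:xyz.gov -site:xyz.gov
```

The second query excludes the site's own pages, which is what separates external inlinks from total inlinks; as noted above, Google's "link" command cannot be combined with "site" in this way.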
Data Analysis and Results
Data were analyzed in the following manner. First, the percentages of sites covered by each search engine for each country and each type of domain were compared to determine if there was a differential representation across countries. Then site visibility, as measured by the number of inlinks to the site, was examined as a possible factor that affects the chance of a site being covered. Finally, for sites that were covered by all search engines, a further analysis was carried out on the number of pages of the site being covered. This was again done by country and by type of domain to ascertain if a particular search engine tended to under-represent a particular type of site by indexing fewer pages.
Percentages of Sites Covered
The percentages of sites covered by each search engine for each type of site are shown in Figure 1 to Figure 4.
Although there are differences among the four figures, general patterns are evident. Various statistical tests were carried out to examine patterns by search engines and countries. Below is the summary of findings:
• On average, Google indexed more sites than other search engines did (chi-square test, p<0.01). This is consistent with the general perception that Google has the largest database of all search engines. Google’s coverage of Chinese sites was even better than that of Yahoo! China.
• Academic sites from the two countries (U.S. and China) received very high coverage rates—all were well over 90% with the exception of MSN, which indexed 79% of Chinese university sites.
• Yahoo! China provided much better coverage of Chinese sites than did global Yahoo! (chi-square test, p<0.01). In fact, its coverage of sites from other countries in the region (Taiwan and Singapore), regardless of the site language, was also better than that of global Yahoo!.
• U.S. sites were more likely to be covered than sites from other countries. (Chi-square tests for all four search engines were carried out. Country and coverage are two variables, p<0.01 for all cases.) When the domains of .com, .org, and .gov were combined (.edu was excluded as only two countries were studied and the coverage there was higher, as discussed above), the average coverage rate for U.S. sites was 91.8%, while those for China, Taiwan, and Singapore were 74.9%, 83.4%, and 87.5%, respectively; see Table 3. The average coverage rate for China is the lowest among the four countries studied. Given that both U.S. and Singapore sites used the English language, while China and Taiwan sites used the Chinese language, the favorable coverage of U.S. sites could not, at least in this study, be attributed to the language factor.
• Table 3 shows that the commercial sites from China and Singapore received particularly poor coverage compared to their U.S. counterparts. When all search engines were combined, an average of 94.8% of U.S. commercial sites was indexed, while the average for Chinese commercial sites was only 67.6% and for Singapore 67.9%. Another strong contrast was for the governmental sites: 98.3% and 77.9% for the U.S. and China, respectively. Still, there are variations among the three global search engines. Google’s coverage of Chinese government sites was relatively good (89.5%), while that of Yahoo! was only 64.8%.
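The chi-square tests reported above compare covered versus not-covered counts across countries. A minimal, dependency-free version of the test statistic is sketched below; the counts in the example are hypothetical, not the study's data.

```python
def chi_square(table):
    """Pearson chi-square statistic for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical covered / not-covered counts
# (columns: U.S., China, Taiwan, Singapore)
table = [[110, 90, 100, 105],   # covered
         [ 10, 30,  20,  15]]   # not covered
print(round(chi_square(table), 2))  # 13.83 > 11.34, the p=0.01 critical
                                    # value at (2-1)*(4-1) = 3 degrees of freedom
```

A statistic above the critical value, as here, means coverage rates differ across countries by more than chance would explain.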
Table 3. Percent of coverage—all search engines combined
A possible technical factor that could affect the chance of a site being indexed by a search engine is the visibility of the site on the Web, as measured by the number of inlinks pointing to the site. This is because all search engines use crawlers or spiders to build up their databases. A crawler finds websites by following links on the sites visited, so the more inlinks a site attracts, the more visible the site is to the search engine and the more likely it will be indexed. To determine if the number of inlinks does affect the chance of a site being indexed, a two-way analysis of variance test was carried out. The dependent variable was the inlink count and the two independent variables were country and coverage (a binary variable of covered or not). The analysis was carried out for each search engine separately, as different engines index different websites and thus will retrieve different numbers of inlinks. The results of all these tests reached the same conclusion: Sites that were covered had significantly (p<0.05) more inlinks than those not covered. However, there was generally no significant difference among countries in this regard. This would imply that inlink count is a factor contributing to the chance of being indexed, but does little to explain the differentiated coverage of sites across countries.
Number of Pages Covered
The above analysis examined the coverage issue through a binary variable that measured the extent to which sites were or were not covered. For those covered by all search engines, a further analysis was carried out on the number of pages that were indexed in order to determine if the representation is equal across countries on this measurement. Four two-way analysis of variance tests were carried out for the four types of domains. The dependent variable was the number of pages indexed, while the two independent variables were country and search engine. The frequency distributions of the dependent variable for all four domains were skewed, which violated the normality requirement of a two-way analysis of variance test. A logarithmic transformation (Howell, 2002; Judd & McClelland, 1989) was carried out, and the resulting frequency distributions were no longer skewed. Figures 5 to 8 present the average number of pages indexed for the four types of domains. The vertical axes in these figures represent the logarithm of the average number of pages covered. The higher a data point is on the vertical axis, the more pages indexed. Because the exact number of pages of a site was unknown, we did not know exactly what portion (i.e., what percent of pages) of the website was indexed. However, for any given site, we can compare the number of pages indexed by different search engines to know which engine covered a larger portion of the site. We can extend this comparison from a single site to sites in a country. For a given country in Figures 5 to 8, the search engine that is located higher on the vertical axis covered a larger portion of the sites in that country. For example, in Figure 5, MSN is located much higher than global Yahoo! for Chinese sites, which means that MSN covered a much larger portion of the Chinese commercial sites.
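The effect of the logarithmic transformation can be seen on a small hypothetical sample: raw page counts are heavily right-skewed, while their logarithms are far less so. The page counts below are invented for illustration only.

```python
import math
import statistics

# Hypothetical numbers of indexed pages for ten covered sites
pages = [3, 5, 8, 12, 20, 45, 110, 260, 900, 4000]

print(statistics.mean(pages), statistics.median(pages))
# raw scale: mean 536.3 vs. median 32.5 -- a strong right skew

logged = [math.log10(n) for n in pages]
print(round(statistics.mean(logged), 2), round(statistics.median(logged), 2))
# log scale: mean and median are much closer, reflecting the reduced skew
```

Averaging on the log scale, as the vertical axes of Figures 5 to 8 do, keeps a few very large sites from dominating the comparison between engines.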
The two-way analysis of variance tests revealed significant differences among the search engines. As shown in Figures 5 through 8, MSN and Google indexed a larger portion of these sites in general, except for the Chinese academic sites, where Yahoo! China fared much better. Yahoo! China indexed a larger portion of the Chinese sites than did global Yahoo!. The coverage of the Chinese government sites by global Yahoo! was particularly poor. There were larger discrepancies among search engines in indexing non-U.S. sites than in indexing U.S. sites (shown by the larger spread on the vertical axis for non-American sites). Relatively speaking, MSN and Google had better coverage of the non-U.S. sites.
Discussion and Conclusions
The study found that websites from different countries received unequal representation in major search engines. U.S. sites had a statistically significantly higher chance of being indexed than did sites from other countries. Chinese sites had the lowest average coverage rate among the four countries studied. When all types of sites and all search engines in the study were combined, an average of 93.6% of U.S. sites was indexed, whereas only 79.8% of sites from China were indexed. The favorable coverage of the U.S. sites could not be attributed to the language of the sites, as both the U.S. and Singapore sites in the study used English. However, the Singapore sites had lower rates of coverage than their U.S. counterparts. The same phenomenon was found in a study that collected data in 2002 on commercial sites (Vaughan & Thelwall, 2004). Google was the only search engine that was in both the current study and the Vaughan and Thelwall study. The comparison between data from 2002 (i.e., the Vaughan & Thelwall study) and 2005 (current study) for the U.S. and China is shown in Table 4. The coverage rates for both countries had increased, perhaps as a result of the increasing size of the Google database. However, the rate of increase for Chinese sites seemed lower, which caused the gap between the two countries to widen.
Table 4. Google’s coverage of commercial sites (2002, 2005, and the increase from 2002 to 2005)
The visibility of a site as measured by the number of inlinks pointing to the site did affect the chance that the site was indexed by a search engine. Sites that were indexed by search engines had a larger number of inlinks than those that were not indexed. This is true across sites from different countries, regardless of the types of sites. However, the inlink factor did not seem to explain the differentiated coverage of sites across countries.
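The inlink comparison can be sketched as a simple contrast between the indexed and unindexed groups. The counts below are invented for illustration only; the study's actual inlink data are not reproduced here:

```python
from statistics import mean, median

# Invented inlink counts for illustration (not the study's data)
indexed_inlinks = [120, 45, 300, 18, 76, 210]  # sites found in the engine's index
unindexed_inlinks = [3, 0, 12, 7, 1, 9]        # sites absent from the index

# Indexed sites tend to have more inlinks, on both measures:
print(f"indexed:   mean={mean(indexed_inlinks):.1f}, median={median(indexed_inlinks)}")
print(f"unindexed: mean={mean(unindexed_inlinks):.1f}, median={median(unindexed_inlinks)}")
```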
Yahoo! China, the Chinese counterpart of global Yahoo!, provided better coverage of sites from China and the surrounding regions. This held whether coverage was measured by the percentage of sites covered or by the portion of a site indexed. An interesting finding is that Yahoo! China also provided very good coverage of U.S. sites: when all four types of websites were combined, Yahoo! China indexed 93.1% of U.S. sites and 81.9% of China sites. This should not be interpreted to mean, however, that Yahoo! China is entirely unbiased. The large number of U.S. sites in Yahoo! China could be a legacy of the global Yahoo! database that Yahoo! China inherited (China Daily, 2004). It remains to be seen whether Yahoo! China will continue to cover U.S. sites at a high rate as it evolves independently from global Yahoo!.
The lower coverage of Chinese sites relative to their U.S. counterparts was most pronounced for commercial and governmental sites. When data from all four search engines were combined, the contrast between the two countries for commercial sites was 94.8% for the U.S. vs. 67.6% for China, while that for governmental sites was 98.3% vs. 77.9%. Given that websites not indexed by a search engine are effectively invisible to people who use that search engine to find sites, the under-representation of commercial sites could result in economic disadvantages, while the under-representation of governmental sites may have political implications. One may suspect that the higher inclusion rate of U.S. commercial sites could be the result of the pay-for-inclusion practice, whereby companies pay search engines to have their sites indexed. It is unlikely, however, that this practice can explain the 27.2-percentage-point discrepancy between the two countries, as the vast majority of .com sites are not pay-for-inclusion sites; in the .com world, pay-for-inclusion is the exception rather than the rule.
A limitation of the study is that websites without an explicit country designation in their URLs were not included. This limitation is unlikely to undermine the conclusions of the study, however, as websites without a country designation are not likely to differ significantly from those that have one. Another limitation is that only four countries were included. Although the study found an advantage for U.S. sites in search engine representation, whether this advantage holds over countries not in the study is unknown. It would be very useful to extend the study to other countries and languages. For example, how do French sites from France compare with those from French-speaking African countries? How do sites in a less commonly spoken language fare in search engine representation? Conclusions from the current study cannot be generalized to other countries unless and until such studies are carried out.
One factor that was not investigated in the study is the age of the websites. Could the differentiated coverage found in the study be attributed to the age factor? That is, could the better coverage of U.S. sites be the result of those sites having been on the Web for a longer period of time and thus having had a greater chance of being discovered by search engines? We considered examining this factor but found no reliable way to investigate the question empirically.
An objective way to collect data on the age of a website is to search the Internet Archive (www.archive.org) for the first time the site in question appeared in the Archive. However, an earlier study (Thelwall & Vaughan, 2004) found that the Archive's coverage was itself biased toward the U.S., in that U.S. sites were more likely to be included. This bias renders Internet Archive data inappropriate for our study. Although no reliable conclusion can be drawn on the age factor, we conjecture that the differentiated coverage cannot be wholly attributed to website age. China lagged behind the U.S. in Web development, particularly in the late 1990s, but its development in recent years has been extraordinary. Moreover, MSN is a new search engine, released in February 2005, whose database was built from the ground up by Microsoft (PRNewswire, 2005); the delayed development of Chinese websites should not be a major cause of their under-representation in this new database.
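The first-appearance lookup described here can be automated through the Internet Archive's CDX API, which returns capture records carrying a YYYYMMDDhhmmss timestamp. A sketch under the assumption that the endpoint and JSON row layout below match the Archive's documented interface; the sample response is abridged, and no network call is made:

```python
import json
from typing import Optional
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def first_capture_query(site: str) -> str:
    """Build a CDX query URL for the earliest capture of a site (one row, JSON)."""
    return CDX_ENDPOINT + "?" + urlencode({"url": site, "output": "json", "limit": 1})

def first_capture_year(cdx_json: str) -> Optional[int]:
    """Extract the year of the earliest capture from a CDX JSON response."""
    rows = json.loads(cdx_json)
    if len(rows) < 2:  # header row only: the site was never archived
        return None
    header, first = rows[0], rows[1]
    timestamp = first[header.index("timestamp")]  # e.g. "19961220154510"
    return int(timestamp[:4])

# Parsing an abridged sample response:
sample = ('[["urlkey","timestamp","original"],'
          '["com,example)/","19961220154510","http://example.com/"]]')
print(first_capture_year(sample))  # -> 1996
```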
Although Web search engines all use crawlers to find Web pages to build up their databases, not all pages found are indexed. Search engines use their own proprietary algorithms to determine the “importance” of sites or pages as the basis for indexing. Since the algorithms are commercial secrets, we do not know exactly what factors are involved in building up the databases, but we can speculate that they are likely to be technical rather than political in nature. This study explored the language and inlink factors to some extent and conjectured about the age factor, but it did not find a convincing cause for the differentiated coverage. While it is difficult to determine what other factors caused the differentiated coverage and what remedy, if any, should or could be made, the differentiated representation itself is a cause for concern, given the political and economic significance of search engines discussed at the beginning of this article.
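The engines' actual importance criteria are secret, but PageRank, a published link-based importance score, illustrates the kind of technical factor likely involved. A toy power-iteration sketch on a hypothetical four-page graph, not any engine's actual algorithm:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            targets = outs if outs else pages  # dangling page: spread rank evenly
            for q in targets:
                new[q] += damping * rank[p] / len(targets)
        rank = new
    return rank

# A page that many others link to ends up with the highest score:
graph = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
scores = pagerank(graph)
print(max(scores, key=scores.get))  # -> hub
```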
As the Web is constantly changing, so are the databases of search engines. A clear conclusion from this study is that search engines have undergone significant changes over the years. Compared to findings from Lawrence and Giles (1999), where major search engines of the time covered no more than 16% of their sample, the current generation of search engines covered a much larger portion of the Web in general. How will search engines continue to evolve? Will the unequal representation found in the present study persist? Future studies are needed to monitor these important issues.
Acknowledgments
This study was funded by the Initiative on the New Economy (INE) Research Grants program of the Social Sciences and Humanities Research Council of Canada (SSHRC). Research assistant Karl Fast helped with the programming work.
Taiwan is not recognized as an independent country by the United Nations and other international organizations. It is therefore more accurate to refer to it as a country/region. However, for convenience of reading and writing, we use the wording “country” throughout this article.
About the Authors
Liwen Vaughan is Professor in the Faculty of Information and Media Studies at the University of Western Ontario. Currently, the main focus of her research is the Web, including Web data mining, evaluation of Web search engines, and Webometrics (the analysis of Web-related information using quantitative methods).
Address: Faculty of Information and Media Studies, University of Western Ontario, London, Ontario, N6A 5B7, Canada
Yanjun Zhang is a Ph.D. candidate in the Program of Library and Information Science at the University of Western Ontario in London, Ontario. His major research interests are Web search engine evaluation, quantitative studies of information dissemination on the Web, and use of Semantic Web related technology to improve retrieval and evaluation of Web-based information, including online quality health information.
Address: Faculty of Information and Media Studies, University of Western Ontario, London, Ontario, N6A 5B7, Canada
Appendix Table 1. Examples of websites in the study