Although designed for general Web searching, commercial search engines are also used in Webometrics and related research to produce estimated hit counts or lists of URLs matching a query. Unfortunately, however, they do not return all matching URLs for a search and their hit count estimates are unreliable. In this article, we assess whether it is possible to obtain complete lists of matching URLs from Windows Live, and whether any of its hit count estimates are robust. As part of this, we introduce two new methods to extract extra URLs from search engines: automated query splitting and automated domain and TLD searching. Both methods successfully identify additional matching URLs, but the findings suggest that there is no way to get complete lists of matching URLs or accurate hit counts from Windows Live, although some estimating suggestions are provided.
Commercial search engines like Google, Yahoo! and Windows Live constantly crawl the Web and maintain huge searchable databases of the pages that they have found. Search engine results are now widely used for measurement purposes, not only by information researchers in Webometrics (Almind & Ingwersen, 1997; Bar-Ilan, 2004b), and related fields (Foot, Schneider, Dougherty, Xenos, & Larsen, 2003; Park, 2003; Pennock, Flake, Lawrence, Glover, & Giles, 2002) but also by commercial activities such as Web analytics and search engine optimisation. Hence, there is a need for research into the reliability of the results that search engines deliver and two relevant issues are discussed here.
First, search engine hit count estimates (e.g., 119,000 in “Results 1–10 of about 119,000”) are often used in Webometrics research, for example to determine how many pages in one country link to another (Ingwersen, 1998). These hit count estimates are normally reported on each results page and can vary between results pages (e.g., the second results page might state: “Results 11–20 of about 116,000”). Hence, it is logical to question which estimate is the most reliable: that on the first page of results or that on a subsequent or the last page of results? Nevertheless, despite the continued use of search engines in Webometrics research, there has been no systematic study of how hit count estimates vary between results pages. Such a study could shed light on reasons for differences and any systematic biases as well as providing simple best practice advice.
Second, instead of hit count estimates, some Webometrics research requires lists of URLs matching a query, for example if the individual URLs need to be visited or their country of origin determined (Thelwall, Vann, & Fairclough, 2006). This is often problematic because search engines normally stop at about the 1000th result, with all other matching URLs remaining hidden from the user (Jepsen, Seiden, Ingwersen, Björneborn, & Borlund, 2004). It is currently not known whether it is possible to use other methods to extract all of the remaining URLs in such cases. Moreover, search engines employ unreported methods to select which URLs they return, such as their page ranking algorithms (Chakrabarti, 2003), and so it is unclear whether their results are representative of their databases. Of course, because search engines do not index the whole Web, it is not possible to get a complete list of all pages matching a query.
In order to address the two issues above, this article introduces new methods to obtain extended lists of URLs for a search, including the initially hidden URLs, and to evaluate the hit counts reported by a search engine for queries with multiple pages of results. Previous research (reviewed below) has already developed several methods to assess various aspects of search engine performance but, surprisingly, none has fully investigated whether the hit count estimates are reliable reflections of the number of matching URLs in a search engine's database. We apply these methods to a case study of Windows Live, via its search service, and also present similar results for Google and Yahoo!. No previous study has evaluated Windows Live for Webometrics, and this is an important omission because it is currently the best for some types of investigation, as described below. Note that this article is concerned with extracting results from a single search engine and is not concerned with methods to obtain more complete URL lists or more comprehensive hit count estimates, such as through the use of multiple search engines (cf., Lawrence & Giles, 1998).
Webometric Methods for Search Engine Evaluation
This section briefly reviews research evaluating search engines to set the background for the current study. An important issue in the early years of the Web was to discover the percentage of the Web in the databases of major commercial search engines. A method has been developed to assess this: submitting a set of queries to search engines and comparing the results (lists of URLs from each search engine) to discover their degree of overlap. This method was also used to make inferences about the percentage of the whole Web (however defined) that each one indexed (Lawrence & Giles, 1998, 1999). The research showed that the search engines of the day covered up to 16% of the “indexable Web,” i.e., the pages that search engines could retrieve, in theory, by finding all Web site home pages and following their links recursively. From the Lawrence and Giles research we can be confident that no search engine today indexes the whole Web, and it also seems that any two unrelated search engines are likely to overlap by less than 50%.
Although there is no perfect method to evaluate search engine coverage, related research has continued. For example, a Web site sampling method has been used to show that search engine coverage has an almost inevitable international bias against Web newcomers, caused by the link structure of the Web (Vaughan & Thelwall, 2004). Others have focussed on the ranking of search engine results in an attempt to propose an alternative ranking system that is not too biased towards popularity and against page quality (Cho & Roy, 2004; Cho, Roy, & Adams, 2005).
A separate research strand has focussed on the consistency of the results reported by search engines. Even though search engines do not cover the whole Web, the numbers that they report as hit count estimates for any query are interesting for at least two reasons. First, Webometric research has used these hit counts as the raw data for many studies of Web information (e.g., Aguillo, Granadino, Ortega, & Prieto, 2006; Ingwersen, 1998). Second, from an information retrieval perspective, it is useful to know how reliable the estimates reported by search engines are. In response, several researchers set out to systematically analyse variations in the results reported by commercial search engines. First, a comparison of results for the same query over short periods of time showed that fluctuations of several orders of magnitude could occur and also that sets of related queries could give inconsistent results (Snyder & Rosenbaum, 1999). Second, Rousseau (1999) tracked the variation over time of specific queries in NorthernLight and AltaVista, showing that the results tended to be quite stable but were subject to large fluctuations, presumably due to software or hardware upgrades. Bar-Ilan (1999) investigated the results of six search engines in more detail, discovering that they forgot information, in the sense that URLs were occasionally not reported in results, and that these URLs pointed to information that was not available elsewhere in the search engine results returned. Subsequent research encompassed Google and tracked the coverage of a large set of Web sites, finding a pattern of stability but with occasional sudden changes (Thelwall, 2001). The research of Mettrop and Nieuwenhuysen (2001) also used a time series approach but used a set of controlled seed URLs in order to get more detailed information on search engine performance. 
They confirmed that search engines sometimes did not report a page even when it matched a query and was in their index (Mettrop & Nieuwenhuysen, 2001). Bar-Ilan describes search engines as “concealing” pages when they do not report them, despite matching a query and the page being in their database (Bar-Ilan, 2002).
In conclusion, search engines should be viewed as engineering products, designed to produce fit-for-purpose results but not as mathematical “black boxes” that deliver logically correct results. Search engines may take shortcuts when estimating and returning results in order to improve their speed or efficiency of operation. For example, they may only search a fraction of their index for a query, stopping when they run out of time or have found enough results. See also Arasu, Cho, Garcia-Molina, Paepcke, & Raghavan, 2001; Bar-Ilan, 2004a, 2004b; Brin & Page, 1998 for technical issues that may impact on search engine results.
New Methods for Extracting URLs from Search Engines
This section contains details of two new methods for automatically extracting URLs from search engines for queries with more results than the search engine would normally return. First, however, the software environment and facilities used to implement the methods are described and a case is made for the value of Windows Live in Webometric research.
Data Collection—The Microsoft Search Web Service
The three major families of search engines (Google, Yahoo!, and Windows Live from Microsoft) all maintain at least one service that allows computer programs to submit search queries and retrieve results automatically. These are variously called Application Programming Interfaces (APIs) or Web services, and they operate in different ways but perform the same function. Each restricts the number of searches that can be made per day and has some conditions of use, although all of these conditions broadly allow them to be used for research purposes. Of the three, the Microsoft Search Web Service is the most useful for investigations and large-scale deployment in applications for two reasons. First, its results are the same as those of the online search engine, which is not the case for the Google API (Mayr & Tosques, 2005) and the Yahoo! Web Search API (e.g., through our own testing and as claimed at http://bsd119.ib.hu-berlin.de/~ft/index_e.html on 8.2.2006). Second, the number of queries that can be submitted per day is 10,000, with each giving up to 50 results (http://search.msn.com/developer/), i.e., up to 500,000 results per day. This is much more than the Google API (1,000 queries of 10 results, http://code.google.com/apis/soap-search/, support partially discontinued on Dec 6, 2006, in favour of a less powerful Ajax Search API) and the same daily total as the Yahoo! Web Search API (5,000 queries of 100 results, http://developer.yahoo.com/search/Web/).
For this research, we extracted all data automatically using the Microsoft Search Web Service, submitting queries through the free LexiURL Searcher software, which was upgraded for this article with the two new methods described below (http://lexiurl.wlv.ac.uk). For each query submitted, a list of up to 50 URLs is returned, together with the title of the page and a description of the page (a snippet, as displayed in Windows Live). With LexiURL Searcher a complete list of up to 1,000 URLs matching a query can also be obtained (the maximum allowed by Windows Live). This is achieved by submitting up to 20 separate queries, each giving a different starting point for the URL list (i.e., 1, 51, 101, … 951) and then merging all of the lists. Because the search service sometimes returns duplicate URLs and often returns identical pages of URLs without warning for the last pages of results, LexiURL Searcher automatically checks for duplicate URLs in all the results and discards any found.
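The paging and merging step described above can be sketched roughly as follows. This is only an illustration and not the LexiURL Searcher code: `fetch_page` is a hypothetical stand-in for a call to the search service, and `demo_fetch` is a toy replacement that mimics the repeated final pages so that the sketch can be run without network access.

```python
from typing import Callable, List

PAGE_SIZE = 50      # results per request, as described in the text
MAX_RESULTS = 1000  # maximum number of results returned for one query


def collect_urls(query: str,
                 fetch_page: Callable[[str, int], List[str]]) -> List[str]:
    """Request every results page for `query` and merge them without duplicates.

    `fetch_page(query, start)` is a hypothetical callable returning up to
    PAGE_SIZE URLs beginning at the 1-based offset `start`.
    """
    seen = set()
    merged = []
    for start in range(1, MAX_RESULTS + 1, PAGE_SIZE):  # 1, 51, 101, ..., 951
        for url in fetch_page(query, start):
            if url not in seen:  # drops duplicates and repeated final pages
                seen.add(url)
                merged.append(url)
    return merged


def demo_fetch(query: str, start: int) -> List[str]:
    """Toy service with 120 results that repeats its last page for later offsets."""
    total = 120
    if start > total:
        start = 101  # mimic identical final pages being returned without warning
    return [f"http://example.org/{query}/{i}"
            for i in range(start, min(start + PAGE_SIZE, total + 1))]


if __name__ == "__main__":
    print(len(collect_urls("kavli", demo_fetch)))
```

Keeping a set of seen URLs alongside the ordered list preserves the order in which the service reported the results while still discarding repeats.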
Method 1: Automated Query Splitting
Query splitting is a technique previously used to increase the number of URLs returned by a search engine as matching a query (Thelwall, Vann, & Fairclough, 2006; as previously used by Bar-Ilan and Peritz, 2004 and elsewhere on an ad-hoc basis, e.g., http://www.Webmasterworld.com/forum7/1379.htm from November, 2005). For example, if a search engine reported 2,000 matches for the query “Webometrics,” then it would return just the first 1,000 results (e.g., in 20 pages of 50) and then stop. In order to get the full set of 2,000 results, it might be possible to split the query “Webometrics” into two separate logically disjoint queries by adding and subtracting the same word. For example, if the word was Cybermetrics, then this would give “Webometrics Cybermetrics” (i.e., all pages containing both the words Webometrics and Cybermetrics) and “Webometrics -Cybermetrics” (i.e., all pages containing the word Webometrics but not the word Cybermetrics). If both of these queries gave 1,000 results, then the complete list of URLs matching “Webometrics” could theoretically be obtained by submitting the two subqueries and obtaining all 1,000 matching URLs for each one. In theory, any query, no matter how many URLs it matched, could be split into a large set of logically separate queries by recursively adding and subtracting extra words, and the results combined to create a complete list of matching URLs (e.g., see Appendix, Table 5). In practice, however, this process has a limit because of query length restrictions (see below).
The following example illustrates query splitting using the query “Ingwersen”, for which Windows Live reported 33,779 results. Figure 1 shows the results of the first two query splitting levels. First, the query was split by adding and subtracting the word “peter”. Then each new subquery was split differently, the first by adding and subtracting “j” and the second by adding and subtracting “und.” In all cases, the total number of results was still above 1,000, so the splitting had to continue (not shown). This process eventually generated 39 separate queries (each with up to 20 results pages), terminating when each query returned under 1,000 results. For example, one of the leaf queries was “ingwersen -peter -und -information -de -geranium -1” with 627 results. Combining the results of all the leaf queries (i.e., those with fewer than 1,000 results) gave a new estimated result count (10,702) and combining the URLs extracted from the leaf queries gave a list of 10,702 URLs matching the query “Ingwersen.”
Although query splitting has been previously implemented manually with human intervention to decide which words to add and subtract (Thelwall et al., 2006), here we propose a simple automatic method. This method is to create a list of all words used in the title or description of the matching URLs for a query and then to choose a “splitting” word that occurs in 15% of them. The value, 15%, was determined through testing as being the value that was most likely to produce an approximately even split of results, even though it often gives the most common word in the list that is not already in the query. In addition, very common words should not be used (we produced a list of 54) because they are ignored by Windows Live when conducting a search. Of course, words already in the query should clearly be excluded. This method normally works well but is occasionally poor because it selects a very common word that is rare in titles and descriptions (e.g., pronouns, common verbs) or a rare word that is common in titles and descriptions (e.g., “untitled”).
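The word selection rule can be illustrated with a short sketch. It is written under several assumptions that are not taken from the article: the tokenisation (lower-cased runs of letters and digits), the alphabetical tie-break, and the `COMMON_WORDS` set, which is a small illustrative subset because the article's list of 54 common words is not reproduced here.

```python
import re
from collections import Counter
from typing import Iterable, Optional, Set, Tuple

# Illustrative subset only; NOT the article's list of 54 common words.
COMMON_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}


def choose_split_word(results: Iterable[Tuple[str, str]],
                      query_words: Set[str],
                      proportion: float = 0.15) -> Optional[str]:
    """Pick the word found in closest to `proportion` of the results.

    `results` holds (title, description) pairs for the URLs returned.
    A word is counted once per result, however often it occurs in it.
    """
    doc_freq = Counter()
    r = 0
    for title, description in results:
        r += 1
        words = set(re.findall(r"[a-z0-9]+", f"{title} {description}".lower()))
        doc_freq.update(words)
    target = proportion * r
    candidates = [w for w in doc_freq
                  if w not in COMMON_WORDS and w not in query_words]
    if not candidates:
        return None
    # Closest count to the target wins; ties are broken alphabetically (assumption).
    return min(candidates, key=lambda w: (abs(doc_freq[w] - target), w))


if __name__ == "__main__":
    sample = ([("kavli science", "")] * 3
              + [("kavli institute", "")] * 10
              + [("kavli prize", "")] * 7)
    print(choose_split_word(sample, {"kavli"}))
```

In the toy sample there are 20 results, so the target count is 0.15 × 20 = 3, and the word occurring in exactly three results is selected.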
Query splitting is fast because it only uses the data provided by the Microsoft Search Web Service. A more effective method would be to download the complete text of each page and choose a word that was in 50% of these pages, but this would take considerably longer and consume much more computing power and network bandwidth. A practical limitation is that query splitting does not work for very large queries because search engines employ query length restrictions (e.g., on January 15, 2007, the Windows Live limit was 150 characters http://help.live.com/help.aspx?project=wl_searchv1&market=en-gb). In practice we apply it recursively up to nine times so it can, in theory, generate complete lists of up to 1,000 × 2^9 = 512,000 URLs, but because the method is imperfect, 100,000 URLs is a more practical upper limit. Note that the recursive application of the method means that a new word list and query splitting word must be generated for each split query.
Query splitting can also be used for a second purpose: to gain revised estimates of the number of URLs matching a search by splitting the search into a set of multiple searches that combine to be equivalent to the original search.
Pseudocode for query splitting is given below in the form of a function designed to return an estimate of the total number of results available for a query based upon query splitting to produce constituent searches with a maximum of 950 results each (as required for investigation 2, below). The minimum number of results to trigger query splitting used was 950 because in some search engines the 1,000 maximum is not quite reached, even for queries with hundreds of thousands of results. Calling the function also produces a list of URLs matching the query, including URLs returned by recursive function calls, and this is the extended URL list (as required for investigations 3 and 4, below).
In the function below, 950 is the minimum number of results to trigger query splitting, 0.15 is the proportion of results in which the splitting word should occur, 150 is the maximum query length permitted by Windows Live, and 9 is the standard maximum query splitting level. For the queries in the investigations for this article, level 9 was not reached and the maximum query length 150 was not reached and so the return value QUERY-SPLIT(q, 0) is an estimate for the total number of results based on the last page estimates of all queries that were not split (i.e., the bold numbers in Table 5). In addition, a second type of hit count estimate can be obtained by counting all the URLs in the extended URL list produced in the execution of QUERY-SPLIT(q, 0), after discarding duplicates.
function QUERY-SPLIT(Query q; Level v)
  Submit query q to Windows Live Web Search Service, record the description and title of the r <= 1,000 results and the total results estimate e obtained from the last results page.
  (Add all of the URLs to a global-scoped URL list.)
  if r < 950
    return e
  if v < 9
    Find a word w that is closest to occurring in 0.15 * r of the URLs' descriptions or titles, excluding words already used and those in the common words list.
    if adding w to q keeps the query within 150 characters
      return QUERY-SPLIT(q + " " + w, v + 1) + QUERY-SPLIT(q + " -" + w, v + 1)
  return e
Table 5 in the Appendix gives the results of the query splitting for the word “kavli,” i.e., calling QUERY-SPLIT(“kavli,” 0). To illustrate one call of the function, for the initial query “kavli” a total of r = 987 URLs were returned by the call with an estimate of e = 50,362 matches, as reported by the final results page. The ideal word for query splitting would occur in 15% of the returned URLs' titles or descriptions, i.e., 0.15 * 987 = 148. Although there was a total of 9,547 different words found in the titles or descriptions of the 987 URLs, none occurred in exactly 148 URLs' titles or descriptions. The word “science” occurred in 147, making it the closest match—a closer match than the next closest “b” (151). Hence, the calls QUERY-SPLIT(“kavli science,” 1) and QUERY-SPLIT(“kavli -science,” 1) were triggered.
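The recursion performed by QUERY-SPLIT can be sketched in Python as below. This is a minimal illustration under stated assumptions rather than the software used for the article: `search` and `pick_word` are hypothetical callables supplied by the caller, the fallback of returning e when a limit is hit is inferred from the description above, and `make_toy_search` and `toy_pick` are toy stand-ins (with exact rather than estimated hit counts) so that the sketch runs offline.

```python
from typing import Callable, List, Optional, Set, Tuple

Result = Tuple[str, str, str]                         # (url, title, description)
SearchFn = Callable[[str], Tuple[List[Result], int]]  # -> (results, estimate e)
PickFn = Callable[[List[Result], str], Optional[str]]

SPLIT_THRESHOLD = 950   # minimum number of results that triggers a split
MAX_LEVEL = 9           # maximum query splitting level
MAX_QUERY_LENGTH = 150  # query length limit in characters


def query_split(q: str, v: int, search: SearchFn, pick_word: PickFn,
                url_list: Set[str]) -> int:
    """Return the summed last-page estimates of all queries that were not split."""
    results, e = search(q)
    url_list.update(url for url, _, _ in results)  # extended URL list
    if len(results) < SPLIT_THRESHOLD or v >= MAX_LEVEL:
        return e
    w = pick_word(results, q)
    # Two extra characters: a space and the minus sign of the longer subquery.
    if w is None or len(q) + len(w) + 2 > MAX_QUERY_LENGTH:
        return e
    return (query_split(f"{q} {w}", v + 1, search, pick_word, url_list)
            + query_split(f"{q} -{w}", v + 1, search, pick_word, url_list))


def make_toy_search(n_docs: int = 3000) -> SearchFn:
    """Toy index: document i contains 'kavli' plus one token per set bit of i."""
    docs = [(f"http://example.org/{i}",
             {"kavli"} | {f"b{k}" for k in range(4) if (i >> k) & 1})
            for i in range(n_docs)]

    def search(q: str) -> Tuple[List[Result], int]:
        terms = q.split()
        include = {t for t in terms if not t.startswith("-")}
        exclude = {t[1:] for t in terms if t.startswith("-")}
        hits = [(url, " ".join(sorted(words)), "") for url, words in docs
                if include <= words and not (exclude & words)]
        return hits[:1000], len(hits)  # at most 1,000 results are returned

    return search


def toy_pick(results: List[Result], q: str) -> Optional[str]:
    """Toy word chooser: the first token from a fixed list not yet in the query."""
    used = {t.lstrip("-") for t in q.split()}
    return next((w for w in ("b0", "b1", "b2", "b3") if w not in used), None)


if __name__ == "__main__":
    urls: Set[str] = set()
    print(query_split("kavli", 0, make_toy_search(), toy_pick, urls), len(urls))
```

Passing the URL set explicitly replaces the global-scoped URL list of the pseudocode, which keeps the function free of shared state.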
Method 2: Automated Top Level Domain (TLD) and Domain Searching
An alternative method to gain additional results from a search engine is to use the site: command to refine a search in the hope of gaining additional results from specific domains. This command restricts the results to a single domain name or domain name ending. For example, to gain additional Danish results matching “Ingwersen” the query “ingwersen site:dk” could be submitted, which would match all URLs of pages containing the word “Ingwersen” with domain names ending in .dk. Site searching can be automated in two ways.
Automated domain searching: Conduct a normal search for a query q (with or without query splitting). For each URL in the results, extract its domain name d and submit the query “q site:d” (with or without query splitting). If multiple URLs have the same domain name d then of course the query “q site:d” only needs to be submitted once.
Automated TLD searching: This is the same as automated domain searching except that the Top Level Domain (TLD) t of each URL (e.g., com, net, edu, uk, dk) is extracted and the query “q site:t” is submitted.
Automated domain searching is intuitively promising since search engines often report a maximum of one or two results per page from the same Web site. Automated TLD searching cannot be given a similar justification but is potentially useful as an additional method of identifying URLs with unknown domain names.
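Generating the follow-up queries for both variants can be sketched as follows. The function names are illustrative only, and the TLD is taken simply as the final label of the host name (e.g., uk or dk), which would not be meaningful for URLs that use a bare IP address.

```python
from typing import Iterable, List
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Return the lower-cased host name of a URL, or '' if there is none."""
    return (urlsplit(url).hostname or "").lower()


def tld_of(url: str) -> str:
    """Return the final label of the host name (e.g. 'dk'), or '' if absent."""
    host = domain_of(url)
    return host.rsplit(".", 1)[-1] if "." in host else ""


def site_queries(q: str, urls: Iterable[str], by_tld: bool = False) -> List[str]:
    """Build one 'q site:d' query per distinct domain (or TLD) in `urls`."""
    extract = tld_of if by_tld else domain_of
    seen = set()
    queries = []
    for url in urls:
        d = extract(url)
        if d and d not in seen:  # each domain or TLD is only queried once
            seen.add(d)
            queries.append(f"{q} site:{d}")
    return queries


if __name__ == "__main__":
    found = ["http://www.db.dk/a", "http://www.db.dk/b", "http://example.com/x"]
    print(site_queries("ingwersen", found))
    print(site_queries("ingwersen", found, by_tld=True))
```

Each generated query could then be submitted with or without query splitting, as described above.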
There are two different types of reasons why accurate or complete results may be needed from search engines: to have an accurate count of the number of URLs matching a query or to have an accurate list of all the URLs matching a query. The latter may be needed for investigative purposes or because the URLs are to be processed, for example, to count or list all domain names or TLDs represented.
This research has two overall aims: (a) to introduce a method to assess the internal consistency of search engine results estimates, and (b) to introduce an automatic method to gain listings of URLs for queries that generate large numbers of results, including the hidden URLs that are not returned in the first 1,000 results. Note that we use hit count estimate here to refer to the estimated number of URLs reported by a search engine and URL count to refer to the actual number of URLs returned by a search engine or the number of URLs in any other list. The following two specific research questions are addressed in the remainder of the article through two sets of linked investigations.
1. Are any of the hit count estimates returned by Windows Live reliable estimates of the number of matching URLs in its database, and, if not, how can the most reliable estimate be found?
2. Can query splitting and domain and TLD searching produce additional or complete lists of URLs in Windows Live matching a query?
Consistency of Hit Count Estimates: Investigations and Results
Investigation 1: Hit Count Estimate Changes in the First 20 Results Pages
We submitted a set of queries to Windows Live and recorded the estimated hit count on each page of results in order to discover how often these changed and if there was a systematic pattern to the variations. For this we chose 4,000 words of varying usage frequencies; for convenience, we chose these words from a set of about 68,000, mainly English language blogs collected during 2005–2006, and the queries were submitted in December, 2006.
Figure 2 reports the frequency of changes in hit counts over the results pages. This shows that for queries with initial hit count estimates of up to 200, changes are extremely rare (there was only one change). Also, for hit counts of 8,001 or above, changes were uncommon in the sense that the numbers did not vary in over 80% of cases. In contrast, for initial hit counts of between 300 and 2,000 the average number of changes was one per set of results, with some queries changing more than once. In summary, the variability of “mid-range” hit count estimates is significantly greater than that of others.
Figure 3 shows the average relative changes in results between different results pages, broken down by the size of the initial hit count estimate. When the count estimates change, sometimes the count increases and sometimes it decreases. For small initial hit counts (up to about 300), the average is close to 1 because few results change at all. For hit count estimates of 8,001 and above, the tendency is for the changes to be decreases, although most do not change. In contrast, for results between about 501 and 2,000 the results tend to change and result in a net reduction in the hit count estimates of about 50%. In summary, for most queries with initial estimated count above 300, when a changed estimate is produced, it tends to reduce the estimate by about 50%. (See the changed queries line in Figure 3.)
Figure 4 reports the results page on which the hit count estimate first changed, if it did. It shows that the changes tend to occur on the second page for queries with up to 950 results (as initially estimated). The apparent anomaly around 4 million pages is a single query rather than a trend. For queries of 8,001 results and above, the locations of the initial changes were much more spread out and could occur anywhere, as Figure 5 shows.
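The three quantities examined in investigation 1 (how often the estimate changes, where it first changes, and the relative change between the first and last pages) can be computed from a list of per-page estimates along the following lines. The function name and the choice of the final-to-initial ratio as the measure of relative change are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def summarise_estimates(estimates: List[int]) -> Tuple[int, Optional[int], float]:
    """Summarise the hit count estimates reported on successive results pages.

    Returns (number of changes, 1-based page of the first change or None,
    ratio of the final estimate to the initial estimate).
    """
    if not estimates or estimates[0] == 0:
        raise ValueError("need at least one page with a non-zero estimate")
    changes = 0
    first_change_page = None
    # Page numbering starts at 1, so the first possible change is on page 2.
    for page, (prev, cur) in enumerate(zip(estimates, estimates[1:]), start=2):
        if cur != prev:
            changes += 1
            if first_change_page is None:
                first_change_page = page
    return changes, first_change_page, estimates[-1] / estimates[0]


if __name__ == "__main__":
    # Estimates as in the "about 119,000" then "about 116,000" example.
    print(summarise_estimates([119000, 116000, 116000]))
```

A ratio of 1 means that the estimate on the last page equals the initial one, and a ratio of about 0.5 corresponds to the halving of estimates discussed above.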
Investigation 2: Hit Count Estimate Changes From Automated Query Splitting
In order to assess the extent to which the numbers of URLs extracted using query splitting differ from the final (page 20) estimates for the original queries, we conducted an investigation to measure the difference for queries with a large number of original hits, as these would be a realistic test. We chose a set of 30 words from investigation 1 with approximately 50,000 hits, as estimated by the final hit count estimates (i.e., results page 20). Similar numbers of hits were chosen so that any variability in the results could not be attributed to size-specific factors. We applied the full query splitting technique to each word and compared the hit count estimate produced by this method with the hit count estimate for the original query. Two of the queries returned under 1,000 hits in this test (“kuali” and “chemlab,” surprisingly, because only a few weeks passed between the two investigations), but the remaining 28 words returned similar hit counts in both investigations (with a maximum 6% difference), and so these 28 were used. The data collection was conducted during one week in December, 2006.
The results of investigation 2 are given in Table 1. Note that the figures reported are all based upon the estimated hit counts on the final results pages rather than counts of unique URLs returned (unique URL lists are discussed below in investigation 4). The hit count estimate obtained by the query splitting method was always less than the original estimate, varying from 14% to 79% of it, with a mean of 51% and a standard deviation of 20.9%.
Table 1. Search results for investigation 2 with and without query splitting.
Estimated hit count on page 20 of original query
Estimated hit count using query splitting at 950 (and % of column 2)
To follow up investigation 2, we investigated the overlaps with split queries to find out why the differences might occur. In particular, we investigated the URLs in the original query to find out if any were not present in the subqueries that logically should contain them. In all except one of the 28 queries, there was at least one URL returned from the original query that was not present in any of the split queries. We investigated these apparently missing URLs in some of the queries and other similar queries to find out why they were not reported in the split queries that should have contained them. The results showed that in almost every case either (a) the missing URL was of a page that was identical to that of another URL in the results (sometimes from a different Web site) or (b) another URL from the same Web site was present in the results. For example, the query “Ingwersen” included the URLs http://www.umcpartners.org/5161.cfm and http://www.manitoulin-island.com/algonquin/position_paper.htm, which no split queries reported, although the query “ingwersen peter” returned another URL from the same site as the first (http://www.umcpartners.org/5234.cfm), and several split queries reported http://www.algonquin-eco-watch.com/position_paper.htm, which is the same position paper as the second URL. Recall also that search engine results have been shown to not always be logical (Smith, 1999; Snyder & Rosenbaum, 1999), and so we investigated the Boolean consistency of the results (whether two logically nonoverlapping queries ever reported the same result) but found no such problems.
The investigation suggests that duplicate page elimination and the reduction of multiple results from the same site are the two main reasons for the lower numbers of results for the query-splitting method, and is thus consistent with the results of investigation 1. (See the expanded description in investigation 3 below, however.) An interesting corollary from this is that query splitting is not an effective technique to gain accurate numbers of URLs matching a query (e.g., Aguillo et al., 2006) or to ensure that all relevant information was retrieved (e.g., Bar-Ilan, 1999) or to count by site or domain name rather than by individual URLs (e.g., Björneborn, 2006; Henzinger, 2001; Thelwall, 2002a, 2002b).
Hit Count Estimates Compared to URL Counts: Investigations and Results
Investigation 3: Low Frequency Hit Count Estimates Compared to URL Counts
It is not possible to directly compare the URLs returned with and without query splitting for queries with over 1,000 results because only the first 1,000 are returned. As a proxy for this, however, we assessed query splitting for words with fewer than 1,000 results. From our original word list, we selected words with between 501 and 1,000 hits and applied query splitting with a threshold of 500; the queries were processed in January 2007. Note that these words are all rare and most are also hyphenated words or spelling mistakes.
The results of investigation 3 are given in Table 2. The number of unique URLs returned for each search was within 4% of the hit count estimate on the final page in all cases, so this small difference is not reported. The number of URLs obtained by the query splitting method was always different from the original estimate, varying from 71% to 187% with a mean of 108% and a standard deviation of 17.8%. The URL counts should be read as minimum numbers of matching URLs, suggesting that the estimated hit counts are probably almost always significant underestimates of the total number of matching URLs available for the search engine. To test this, we chose the worst-case example, “outranged,” which gave 644 URLs through query splitting at 500 and applied automated domain and TLD splitting to it to gain extra matches, finding 1,432 in total, i.e., many more than the original estimate.
Table 2. Search results for investigation 3 with and without query splitting.
Estimated hit count on page 20 (or final page) of original query
URL count using query splitting at 500 (and % of column 2)
We investigated the reasons for changes in numbers between the original and split number of URLs for the query “pyrolized”:
Original query: pyrolized (632)
Split query 1: pyrolized c (366)
Split query 2: pyrolized −c (204)
Estimated total results: 570 (= 366 + 204)
Total unique URLs: 700 (= 632 + 366 + 204 − number of overlapping URLs)
Of the 700 URLs, 492 were returned by the original query and one of the two split queries, 136 were returned by the original query but neither of the split queries, and 72 were returned by one of the split queries but not by the original query. Of these 208 (= 136 + 72) anomalous URLs, 104 had the same domain name as at least one other URL represented in the results; 28 came from the same site as at least one other URL represented in the results (as identified by the ending of the domain name); 4 came from sites with multiple equivalent domain names, with another URL from the same site represented in the results; and 2 had identical content but completely different URLs to another page in the results (both were electronic publications available on a university Web site and in an archive or journal Web site).
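The breakdown above is set arithmetic on the three URL lists, and a sketch of it is given below. The function name is illustrative, and the example uses synthetic integer identifiers sized to reproduce the counts reported in the text (492, 136, and 72); they are not the actual URLs from the investigation.

```python
from typing import Dict, Hashable, Set


def overlap_breakdown(original: Set[Hashable], split_a: Set[Hashable],
                      split_b: Set[Hashable]) -> Dict[str, int]:
    """Compare the URLs of an original query with those of its two split queries."""
    split_union = split_a | split_b
    return {
        "unique_total": len(original | split_union),
        "original_and_split": len(original & split_union),
        "original_only": len(original - split_union),
        "split_only": len(split_union - original),
        # URLs reported by both logically disjoint subqueries (Boolean conflicts).
        "boolean_conflicts": len(split_a & split_b),
    }


if __name__ == "__main__":
    # Synthetic identifiers standing in for URLs (assumption for illustration).
    original = set(range(628))
    split_a = set(range(300)) | set(range(628, 670))
    split_b = set(range(300, 492)) | set(range(670, 700))
    print(overlap_breakdown(original, split_a, split_b))
```

The number of anomalous URLs is the sum of the `original_only` and `split_only` counts, and a non-zero `boolean_conflicts` value would indicate that two logically disjoint subqueries had reported the same URL.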
A significant number of the URLs (70) could not be accounted for by the identification of duplicate contents or sites, however. Perhaps these are local near duplicates: URLs representing documents that are similar to another document in the results set in terms of the snippets returned by the search engine. Search engines apparently remove results that are not duplicates if they would seem to be very similar from the snippets describing them in the results pages or from the overall contents of the pages. Because the snippets created are dependent upon the query terms, it is possible that URLs are local duplicates for one query but not local duplicates for another. This could explain why query splitting sometimes produces fewer results and sometimes produces more results. As an example, one of the 70 URLs was “Papers by JP Vacanti and the papers citing JP Vacanti” (http://garfield.library.upenn.edu/histcomp/vacanti-jp_citing/index-8.html) created by Eugene Garfield's HistCite software. In our results, this was the only page from the University of Pennsylvania and it is only hosted by the University of Pennsylvania Web server and so there is no page or site duplication issue to explain why it was present in the results of the query “pyrolized” but for neither “pyrolized −c” nor “pyrolized c.” We generated an artificial snippet for “pyrolized c” from Windows Live by submitting the query “pyrolized c site:upenn.edu,” but the snippet was different from all of the other snippets in our “pyrolized c” results, and so we could not confirm that near duplicate elimination was a cause of its omission for this query. An alternative explanation for some or all of the 70 anomalies is that they may be caused by logical inconsistencies or time-saving shortcuts in the search engine software. For example, there may be an implicit requirement for multiple query terms to occur near each other in matching documents.
Finally, note that the necessary choice of rare words for this small investigation may result in an overestimation of the extent of duplication in results because it seems that rare words are more likely to be significantly affected by the removal of multiple URLs from the same site. This is because any copying or repeated reuse of a rare word in a Web site would reflect a greater proportion of the word in use than similar copying of a common word.
Investigation 4: Higher Frequency Hit Count Estimates Compared to URL Counts
The fourth investigation (a) assesses whether automated TLD and domain searching can be used to discover significant numbers of new URLs and domains, (b) assesses the value of using different query splitting levels in an attempt to identify additional matching URLs, and (c) compares hit count estimates of over 1,000 with URL counts to assess the accuracy of both.
A small number of queries were used because the method employed requires a large number of separate queries. The queries are all Webometrics-related and were selected to give a range of different sizes of results. The following procedure was implemented, and the results are shown in Table 3 for the query “Kousha.”
1. A standard query for “Kousha” was submitted, extracting all URLs from all 20 pages of results.
2. Automated query splitting was applied at the default 0.15 level (15%).
3. Automated TLD searches were run on all the TLDs extracted from all the URLs obtained in step 2. The results were added to the results of step 2.
4. Automated domain searches were run on all the domains extracted from all the URLs obtained in step 3. The results were added to the results of step 3.
5. Steps 2 to 4 were repeated for the split levels 0.05, 0.1 and 0.2.
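Steps 1 to 4 of this procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the software used in the study: `search` and `split_search` are hypothetical callables standing in for a standard query and for automated query splitting at one split level, and the sketch assumes that TLD and domain searches are expressed by appending a `site:` term to the query, as in the “pyrolized c site:upenn.edu” example.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased host name of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def tld_of(url):
    """Return the final label of the host name (e.g. 'edu'), or ''."""
    host = host_of(url)
    return host.rsplit(".", 1)[-1] if "." in host else ""


def expand_results(query, search, split_search):
    """Accumulate URLs for one query at one split level (steps 1 to 4).

    search(q) and split_search(q) are hypothetical callables that each
    return an iterable of URLs for the query string q.
    """
    urls = set(search(query))          # step 1: standard query
    urls |= set(split_search(query))   # step 2: automated query splitting

    # Step 3: one search per TLD seen so far (assumed 'site:' syntax).
    for tld in sorted({tld_of(u) for u in urls} - {""}):
        urls |= set(search(f"{query} site:{tld}"))

    # Step 4: one search per domain seen so far, including step 3 results.
    for host in sorted({host_of(u) for u in urls} - {""}):
        urls |= set(search(f"{query} site:{host}"))

    return urls
```

Step 5 would amount to calling `expand_results` once per split level with a correspondingly configured `split_search` and taking the union of the returned sets.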
Table 3. Unique URLs and Domains extracted for the query “Kousha” using various methods.
Note: The initial hit count estimate was 24,563 and the page 20 estimate was 992. Percentages given are of all the results combined.
Similar figures were obtained for the other six queries, although there was a trend for the percentage of URLs returned by the initial query to be higher for lower frequency queries. Note also that although domain-specific queries normally return results from a single domain, occasionally they return results from multiple related domains.
Table 4 summarises the overall results more briefly. To cut down the number of queries required, for the remaining queries we ran the full procedure at split level 0.15, as in Table 3, but only query splitting at the other three split levels, followed by a combined TLD search and domain search for all TLDs and domains not already searched for at the 0.15 level. The searches were conducted in January 2007.
Table 4. Unique URLs and Domains extracted for several queries using various methods.
Columns: Query splitting (0.15); and TLDs (0.15); and domains (0.15); All 4 levels combined; Initial hit count estimate; Page 20 hit count estimate.
[University home page
Many useful facts can now be derived from Table 4.
The three 0.15 columns of Table 4 clearly demonstrate that, following query splitting, TLD searching can produce a significant number of extra URLs and domains, and domain searching can produce even more URLs.
A comparison of the “All 4 combined” column with the previous column shows that the query splitting at three additional levels (0.05, 0.1 and 0.2) produced only a minor increase of 4%–6% in the results.
The page 20 hit count estimate is clearly not an accurate estimate of the number of matching URLs in two cases (“kousha” and “Webometrics”). The initial hit count estimates are all larger than the number of URLs found, but it is possible that there are additional URLs not found, especially considering that query splitting at different levels was able to discover new URLs, albeit a small percentage. Similarly, we have no concrete evidence to prove that they are too small. It seems reasonable to assume, by default, that they tend to be approximately correct or over-estimate by up to 50%. The cut-off point between giving accurate estimates and an underestimate is probably not related to the page of the results from which the estimate comes, but the size of the estimate, with higher estimates tending to be more reliable approximations of the number of matching URLs.
The methods used were able to find an unknown proportion of the matching domains, although probably a higher proportion than the (also unknown) proportion of matching URLs: perhaps above two thirds in all cases.
A possible explanation for all of the facts discovered so far would be that the initial hit count estimates given by Windows Live are reasonable estimates of the number of matching URLs as long as there is not a significant amount of page duplication or local duplication in the first few pages of results returned. Queries with high initial hit count estimates, say >8,000, may tend to give reasonably accurate estimates of the total number of matching URLs, at least on the first results page. Queries with very low numbers of matching URLs (as reported on the first results page, e.g., <200) probably return reasonably accurate estimates of the total number of matching URLs after duplicate elimination and local near-duplicate elimination because the duplicate elimination process will cover a significant proportion of the URLs. The hit count estimates returned by all other queries are often likely to be a mix of the two types of estimate and hence to be unreliable for all purposes.
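The three cases in this tentative explanation can be written as a simple rule. The sketch below is illustrative only: the function name and return labels are hypothetical, and the thresholds of 8,000 and 200 are the indicative values suggested above rather than firmly established cut-off points.

```python
def interpret_first_page_estimate(estimate, high=8000, low=200):
    """Label what a first-page hit count estimate probably represents.

    The default thresholds are the indicative values from the text and
    should be treated as rough guides, not established boundaries.
    """
    if estimate > high:
        return "approximates all matching URLs"
    if estimate < low:
        return "approximates matching URLs after duplicate elimination"
    return "mixed; treat as unreliable"


# The initial estimate reported for "Kousha" falls in the highest band.
print(interpret_first_page_estimate(24563))
```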
Other Search Engines
Although the main purpose of this article is to investigate Windows Live, for comparison purposes a cut-down version of investigation 1 (2,000 words) with the Google API and the Yahoo! Search Service was run in March 2007 to assess how hit count estimates changed between different results pages for the same queries. Recall that these two APIs do not give the same results as the normal Web interface.
The Google API hit count estimates behaved differently from those of Windows Live. Typically, the estimates were the same in all results pages for a single query or fluctuated between two different values. Out of 2,000 searches submitted, 620 returned the maximum number of pages of results (100) and, of these, in 399 cases the first hit count estimate was the same as the last. The remaining cases comprised 154 reductions and 67 increases. The changes in hit counts seemed to always be the result of fluctuations and never the result of a re-evaluation of the number of hits, which was in contrast to the Windows Live results. Nevertheless, Google frequently stopped returning results significantly before the reported maximum had been reached. The results are consistent with Google having two independent servers for the API, each capable of giving different estimates, and with these estimates, once made, remaining unchanged thereafter for each server. The estimate may well be for the total results in the database rather than the ones that the API will return, e.g., the estimates may ignore the effects of near duplicate elimination. This phenomenon was also present in the normal Google interface at the start of this research but seemed to have disappeared by the end (April 2007), when the Web interface for Google did seem to revise its hit count estimates in response to the number of matches that it returned.
The Yahoo! Search Service behaved differently from both Google and Windows Live. For initial hit count estimates above ten million, the hit counts tended not to change (152 out of 202 stayed the same). For initial hit count estimates between 1,000 and ten million, the hit counts normally changed (only 98 out of 1,335 stayed the same). Between 2,500 and ten million the change tended to be a decrease (1,075 out of 1,217) and between 1,000 and 2,500 the change tended to be an increase (100 out of 121). For searches that returned fewer than the maximum number of results, the hit count estimate on the final page was accurate. In summary, the Yahoo! hit count estimates seem to be genuine estimates of the total number of URLs that could be returned rather than the total number of matching URLs, although it is not clear why these estimates tend to increase for searches with lower initial hit counts and to decrease for searches with higher initial hit counts.
After the initial research for this article had been completed, Microsoft introduced a new Windows Live feature: disable host collapsing. This feature turns off the default behaviour of returning at most two results per domain. We repeated the investigations with the disable host collapsing option selected and found some differences in the results. The main difference was that the hit count estimates were less likely to change for mid-range queries (investigation 1). This suggests that a significant amount of the reduction in hit count estimates reported for Windows Live searches was due to the elimination of multiple results from the same Web site. Presumably multiple results from the same Web site are not easily identified either for searches with very large initial hit counts or for the initial estimates of searches with medium-size initial hit count estimates.
On the basis of the results here, it is clear that for many Webometric purposes the Windows Live results are imperfect. In general terms, it seems that the search engine does not deliver complete lists of results, whatever combination of queries is used, and its hit count estimates tend to be approximations of the number of matching URLs if the estimates are high or approximations of the number of matching URLs after duplicate and near-duplicate elimination if the estimates are low. The following recommendations are designed to help researchers to get the most value from Windows Live.
To get an approximate number of URLs matching a query, the results are not clear-cut, but the hit counts given by the first page of a Windows Live search may well tend to be reasonable estimates if they are at least 8,000. It would thus be reasonable to use these for high frequency queries and to avoid using any of the additional techniques described here. For low frequency queries, it seems that a method to extract and count as many URLs as possible should be used, such as query splitting combined with TLD searching and domain searching. Because this method will produce underestimates, if its results are to be compared with those of high frequency queries obtained from Windows Live hit count estimates, then the low frequency query results should be increased, for example, multiplied by 1.5, to compensate.
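This recommendation can be expressed as a short rule of thumb. The sketch below is a hedged illustration: the function and parameter names are hypothetical, the 8,000 threshold and the 1.5 multiplier are the example values given above, and `urls_found` is assumed to be the count of unique URLs obtained by query splitting combined with TLD and domain searching.

```python
def estimate_matching_urls(first_page_estimate, urls_found,
                           high=8000, scale=1.5):
    """Return a rough estimate of the number of URLs matching a query.

    High frequency queries: use the first-page hit count estimate as is.
    Low frequency queries: scale up the number of URLs actually found,
    since that count is expected to be an underestimate.
    """
    if first_page_estimate >= high:
        return float(first_page_estimate)
    return urls_found * scale


# Hypothetical low frequency query: 900 reported hits, 400 URLs extracted.
print(estimate_matching_urls(900, 400))  # 600.0
```

The multiplier only makes low frequency figures roughly comparable with high frequency hit count estimates; it does not recover the URLs that remain hidden.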
To get an expanded list of URLs matching a query, query splitting, TLD searching, and domain searching are recommended, although these return incomplete results. If it is essential to get as many URLs as possible, then query splitting with parameters other than 0.15 could give an increase of about 5%, but some URLs will inevitably remain hidden.
To get an approximate number or list of domains of URLs matching a query, query splitting alone is recommended as likely to give reasonable results. If it is essential to get as many domains as possible, then TLD searching could also be used, as could query splitting and TLD searching with parameters other than 0.15.
Finally, note that this entire article is concerned with gaining results that accurately reflect the contents of a search engine database rather than the whole Web. Further research is needed to assess the validity of using the figures derived from the methods described here as unbiased estimates (presumably significant underestimates in virtually all cases) for the Web itself.
Permission to use the Microsoft Search Web Service (Microsoft Live, subsequently renamed Live Search) free for noncommercial purposes is gratefully acknowledged (http://msdn2.microsoft.com/en-us/library/ms813980.aspx, last updated 9/6/05). This article should not be interpreted as criticism of the Windows Live search engine, or its hit count estimates, because it is designed to give useful information to general searchers rather than comprehensive URL lists or accurate URL counts. The referees are warmly thanked for their fast and insightful comments.
Table 5. The sequence of queries submitted to get a hit count estimate for the word “kavli.”
Note: Bold figures (under 950) are not split and count towards the final consolidated estimate.
kavli science theoretical
kavli science theoretical california
kavli science theoretical california director
kavli science theoretical california -director
kavli science theoretical -california
kavli science -theoretical
kavli science -theoretical national
kavli science -theoretical -national
kavli science -theoretical -national fred
kavli science -theoretical -national -fred
kavli -science theatre
kavli -science -theatre
kavli -science -theatre physics
kavli -science -theatre -physics
kavli -science -theatre -physics og
kavli -science -theatre -physics og f
kavli -science -theatre -physics og -f
kavli -science -theatre -physics og -f har
kavli -science -theatre -physics og -f -har
kavli -science -theatre -physics -og
kavli -science -theatre -physics -og m
kavli -science -theatre -physics -og m z
kavli -science -theatre -physics -og m -z
kavli -science -theatre -physics -og m -z 1
kavli -science -theatre -physics -og m -z -1
kavli -science -theatre -physics -og -m
kavli -science -theatre -physics -og -m 1
kavli -science -theatre -physics -og -m 1 institute