Data Collection—The Microsoft Search Web Service
The three major search engine families, Google, Yahoo!, and Windows Live (Microsoft), all maintain at least one service that allows computer programs to submit search queries and retrieve results automatically. These are variously called Application Programming Interfaces (APIs) or Web services; they operate in different ways but perform the same function. Each restricts the number of searches that can be made per day and imposes some conditions of use, although all of these conditions broadly allow research use. Of the three, the Microsoft Search Web Service is the most useful for investigations and large-scale deployment in applications, for two reasons. First, its results are the same as those of the online search engine, which is not the case for the Google API (Mayr & Tosques, 2005) or the Yahoo! Web Search API (e.g., through our own testing and as claimed at http://bsd119.ib.hu-berlin.de/~ft/index_e.html on 8.2.2006). Second, up to 10,000 queries can be submitted per day, each giving up to 50 results (http://search.msn.com/developer/). This is much more than the Google API allows (1,000 queries of 10 results, http://code.google.com/apis/soap-search/, support partially discontinued on Dec 6, 2006, in favour of a less powerful Ajax Search API) and equal in total results to the Yahoo! Web Search API (5,000 queries of 100 results, http://developer.yahoo.com/search/Web/).
For this research, we extracted all data automatically using the Microsoft Search Web Service, submitting queries through the free LexiURL Searcher software, which was upgraded for this article with the two new methods described below (http://lexiurl.wlv.ac.uk). For each query submitted, a list of up to 50 URLs is returned, together with the title and a description of each page (a snippet, as displayed in Windows Live). With LexiURL Searcher, a complete list of up to 1,000 URLs matching a query can also be obtained (the maximum allowed by Windows Live). This is achieved by submitting up to 20 separate queries, each giving a different starting point for the URL list (i.e., 1, 51, 101, … 951), and then merging all of the lists. Because the search service sometimes returns duplicate URLs and often returns identical pages of URLs without warning for the last pages of results, LexiURL Searcher automatically checks for duplicate URLs in all the results and discards any found.
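The paged retrieval and de-duplication step described above can be sketched as follows. This is a minimal illustration, not LexiURL Searcher's actual code; the `search_page` function is a hypothetical stand-in for one 50-result request to the Web service, assumed to return a list of (URL, title, description) tuples.

```python
def fetch_all_urls(query, search_page, page_size=50, max_results=1000):
    """Collect up to max_results URLs for a query, page_size per request,
    discarding any duplicate URLs that the service returns."""
    seen = set()
    results = []
    # Starting offsets 1, 51, 101, ..., 951, as described in the text.
    for offset in range(1, max_results + 1, page_size):
        page = search_page(query, offset)
        if not page:
            break
        for url, title, description in page:
            if url not in seen:   # drop duplicates returned without warning
                seen.add(url)     # on the last pages of results
                results.append((url, title, description))
    return results
```

The duplicate check uses a set of URLs already seen, so identical pages returned for late offsets contribute nothing new.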
Method 1: Automated Query Splitting
Query splitting is a technique previously used to increase the number of URLs returned by a search engine as matching a query (Thelwall, Vann, & Fairclough, 2006; as previously used by Bar-Ilan and Peritz, 2004 and elsewhere on an ad-hoc basis, e.g., http://www.Webmasterworld.com/forum7/1379.htm from November, 2005). For example, if a search engine reported 2,000 matches for the query “Webometrics,” then it would return just the first 1,000 results (e.g., in 20 pages of 50) and then stop. In order to get the full set of 2,000 results, it might be possible to split the query “Webometrics” into two separate, logically disjoint queries by adding and subtracting the same word. For example, if the word were Cybermetrics, then this would give “Webometrics Cybermetrics” (i.e., all pages containing both the words Webometrics and Cybermetrics) and “Webometrics -Cybermetrics” (i.e., all pages containing the word Webometrics but not the word Cybermetrics). If both of these queries gave 1,000 results, then the complete list of URLs matching “Webometrics” could theoretically be obtained by submitting the two subqueries and obtaining all 1,000 matching URLs for each one. In theory, any query, no matter how many URLs it matched, could be split into a large set of logically separate queries by recursively adding and subtracting extra words, and the results combined to create a complete list of matching URLs (e.g., see Appendix, Table 5). In practice, however, this process has a limit because of query length restrictions (see below).
The following example illustrates query splitting using the query “Ingwersen”, for which Windows Live reported 33,779 results. Figure 1 shows the results of the first two query splitting levels. First, the query was split by adding and subtracting the word “peter”. Then each new subquery was split differently, the first by adding and subtracting “j” and the second by adding and subtracting “und.” In all cases, the total number of results was still above 1,000, so the splitting had to continue (not shown). This process eventually generated 39 separate queries (each with up to 20 results pages), terminating when each query returned under 1,000 results. For example, one of the leaf queries was “ingwersen -peter -und -information -de -geranium -1” with 627 results. Combining the results of all the leaf queries (i.e., those with less than 1,000 results) gave a new estimated result count (10,702) and combining the URLs extracted from the leaf queries gave a list of 10,702 URLs matching the query “Ingwersen.”
Figure 1. An illustration of the start of query splitting for the query “Ingwersen.” The number of matches for each query is in brackets.
Although query splitting has previously been implemented manually, with human intervention to decide which words to add and subtract (Thelwall et al., 2006), here we propose a simple automatic method: create a list of all words used in the titles or descriptions of the matching URLs for a query and then choose a “splitting” word that occurs in approximately 15% of them. The value of 15% was determined through testing as the value most likely to produce an approximately even split of results, even though it often selects the most common word in the list that is not already in the query. Very common words should not be used (we produced a list of 54) because they are ignored by Windows Live when conducting a search, and words already in the query are also excluded. This method normally works well but is occasionally poor because it selects a very common word that is rare in titles and descriptions (e.g., pronouns, common verbs) or a rare word that is common in titles and descriptions (e.g., “untitled”).
Query splitting is fast because it uses only the data provided by the Microsoft Search Web Service. A more effective method would be to download the complete text of each page and choose a word found in 50% of these pages, but this would take considerably longer and consume much more computing power and network bandwidth. A practical limitation is that query splitting does not work for very large queries because search engines impose query length restrictions (e.g., on January 15, 2007, the Windows Live limit was 150 characters; http://help.live.com/help.aspx?project=wl_searchv1&market=en-gb). In practice, we apply it recursively up to nine times, so it can in theory generate complete lists of up to 1,000 × 2^9 = 512,000 URLs, but because the method is imperfect, 100,000 URLs is a more practical upper limit. Note that the recursive application of the method means that a new word list and query splitting word must be generated for each split query.
Query splitting can also be used for a second purpose: to gain revised estimates of the number of URLs matching a search by splitting the search into a set of multiple searches that combine to be equivalent to the original search.
Pseudocode for query splitting is given below in the form of a function that returns an estimate of the total number of results available for a query, based upon query splitting to produce constituent searches with a maximum of 950 results each (as required for investigation 2, below). The threshold of 950 results to trigger query splitting was chosen because in some search engines the 1,000 maximum is not quite reached, even for queries with hundreds of thousands of results. Calling the function also produces a list of URLs matching the query, including URLs returned by recursive function calls; this is the extended URL list (as required for investigations 3 and 4, below).
In the function below, 950 is the minimum number of results that triggers query splitting, 0.15 is the target proportion of results for the splitting word, 150 is the maximum query length permitted by Windows Live, and 9 is the standard maximum query splitting level. For the queries in the investigations for this article, neither level 9 nor the 150-character query length limit was reached, and so the return value of QUERY-SPLIT(q, 0) is an estimate of the total number of results based on the last-page estimates of all queries that were not split (i.e., the bold numbers in Table 5). In addition, a second type of hit count estimate can be obtained by counting all the URLs in the extended URL list produced in the execution of QUERY-SPLIT(q, 0), after discarding duplicates.
function QUERY-SPLIT(Query q; Level v)
  Submit query q to the Windows Live Web Search Service; record the description and title of the r <= 1,000 results and the total results estimate e obtained from the last results page.
  Add all of the URLs to a global-scoped URL list.
  if r < 950 or v >= 9
    return e
  Find a word w that is closest to occurring in 0.15 * r of the URLs' descriptions or titles, excluding words already used and those in the common words list.
  Let q1 = "q w" and q2 = "q -w"
  if q2 contains at most 150 characters
    return QUERY-SPLIT(q1, v+1) + QUERY-SPLIT(q2, v+1)
  else
    return e
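A Python rendering of the QUERY-SPLIT function may make the recursion concrete. It is a sketch under stated assumptions: `search` is a hypothetical stand-in for the Web service call, assumed to return the list of up to 1,000 (URL, title, description) tuples together with the last-page estimate, and `choose_word` stands in for the 15% splitting-word selection.

```python
def query_split(q, v, search, choose_word, url_list,
                trigger=950, max_len=150, max_level=9):
    """Return an estimate of the total results for query q, splitting
    recursively; matching URLs accumulate in url_list."""
    results, estimate = search(q)           # up to 1,000 results + estimate
    url_list.extend(url for url, _, _ in results)
    if len(results) < trigger or v >= max_level:
        return estimate                     # leaf query: no further split
    w = choose_word(q, results)             # word nearest 15% frequency
    if w is None:
        return estimate
    q1, q2 = f"{q} {w}", f"{q} -{w}"        # logically disjoint subqueries
    if len(q2) > max_len:                   # query length restriction
        return estimate
    return (query_split(q1, v + 1, search, choose_word, url_list) +
            query_split(q2, v + 1, search, choose_word, url_list))
```

Summing only the leaf estimates gives the first hit count estimate described above; counting the distinct URLs in `url_list` gives the second.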
Table 5 in the Appendix gives the results of the query splitting for the word “kavli,” i.e., calling QUERY-SPLIT(“kavli,” 0). To illustrate one call of the function, for the initial query “kavli” a total of r = 987 URLs were returned by the call with an estimate of e = 50,362 matches, as reported by the final results page. The ideal word for query splitting would occur in 15% of the returned URLs' titles or descriptions, i.e., 0.15 * 987 = 148. Although there was a total of 9,547 different words found in the titles or descriptions of the 987 URLs, none occurred in exactly 148 URLs' titles or descriptions. The word “science” occurred in 147, making it the closest match—a closer match than the next closest “b” (151). Hence, the calls QUERY-SPLIT(“kavli science,” 1) and QUERY-SPLIT(“kavli -science,” 1) were triggered.