In this paper we report the results of a user study that compared country-specific search results from Google and Live Search. The users were Israelis, and the search results came from six sources: Google Israel, Google.com, Google UK, Live Search Israel, Live Search US and Live Search UK. The users evaluated the results of nine pre-selected queries, created their own preferred rankings and picked the best ranking from the six sources. The results indicate that the users in this study liked the local Google interface most, i.e. Google succeeded in its country-specific customization of search results. Live.com was much less successful in this respect. However, search engines are highly dynamic, so these findings have to be viewed cautiously. The main contribution of the paper is a two-phase methodology for comparing and evaluating search results from different sources.
Searching for information on the Web has become a major activity. According to the Pew Internet and American Life surveys (Fallows, 2008), 89% of American Internet users have used search engines to find information, and almost half of them used a search engine on the day before the survey was conducted. Searching for information is the second most frequently mentioned Internet activity after email. By far the most popular search engine is Google (see, for example, Nielsen Online, 2009 or comScore, 2009), followed by Yahoo! and Microsoft's Live Search. Google seems to be even more popular outside the US than in the US: according to Nielsen Online, in December 2008 Google's reach was 61.29% in the US, and above 80% in all European countries for which data were provided. In Israel, Google's reach is similar to that in the European countries; as of December 2008 it was 88.8% (Parag, 2009).
Currently Google has local sites for 168 countries or territories (Google, 2009). The localized Google versions allow displaying country-specific sponsored results, and in addition they allow Google to provide a country-specific display of the organic results as well. As an example, consider Figures 1 and 2: in both cases we searched for Brown on January 27, 2009, once on google.com and once on google.co.uk. Only on google.co.uk did we see, among the top three results, a result related to Prime Minister Gordon Brown. On google.com there was no mention of Gordon Brown among the top-ten results for the query Brown. Similarly, at Live Search, the user can select a region to “discover a search experience tailored to your part of the world” (Live Search, 2009). Currently one can choose from 58 regions.
The aim of the current study was to investigate whether users prefer the country-specific results over results obtained from other versions of the search tool. More specifically, the results from the Israeli, British and US versions of the two above-mentioned search engines Google and Microsoft's Live Search were compared.
We are not aware of previous evaluation studies of country-specific results. There is, however, a large body of literature on the personalization of search engine results. Most of these papers propose personalization algorithms and then some means for evaluating them. For example, Pitkow et al. (2002) based their evaluation on the number of interface actions of users (mouse clicks and keyboard entries), and Dou, Song and Wen (2007) on the clickthroughs in the search engine log. Teevan, Dumais and Horvitz (2005) evaluated their system with the help of 15 users and the 50 top-ranked results for ten queries (some self-selected and some pre-selected). To evaluate the ranking quality of the different personalized result sets, they used the discounted cumulative gain (DCG) measure, which gives higher weight to highly ranked documents (Järvelin & Kekäläinen, 2000). Shen, Tan and Zhai (2005) used TREC topics, asked six users to formulate queries, showed them 30 results for each query (some of the results came from Google and some from their system) and asked them to evaluate the relevance of the results.
The study setup
Twenty-four Israeli students participated in a two-phase study. The first phase lasted for three weeks in July 2008. Each week the students received three search scenarios and the results of a pre-selected query for each scenario. The scenarios and the queries appear in Table 1.
Table 1. The queries and scenarios
The search results were collected from six search interfaces: google.com, google.co.uk, google.co.il (Google Israel), live.com (United States), live.com (United Kingdom) and live.com (Israel). The query for each scenario was submitted to all six search interfaces, and the first page of organic results (10 results) was collected from each source. After eliminating duplicates, between 21 and 40 different results remained for each query. Each user received a file with a randomly ordered list of results for each query. The users were told that the results came from various search engines and appeared in random order, and they were asked to choose the 10 best results from each list and to rank these results from 1 to 10 (ties were not allowed). The list contained, for each result, its clickable title, snippet and URL, similar to the regular search result display (see Figure 3). The queries were hand-picked so that the results from the different search interfaces differed considerably.
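The preparation of each query's result list can be sketched as follows. This is an illustrative sketch only: the paper does not state how duplicates were detected, so using the URL as the duplicate key is an assumption.

```python
import random

def build_result_list(results_per_source):
    """Merge the top-10 results from the six interfaces, drop duplicates,
    and shuffle, as in the first phase of the study.

    results_per_source: list of six lists of (title, snippet, url) tuples.
    Duplicates are identified by URL here, an assumption for illustration.
    """
    seen = set()
    merged = []
    for results in results_per_source:
        for title, snippet, url in results:
            if url not in seen:   # keep only the first copy of each URL
                seen.add(url)
                merged.append((title, snippet, url))
    random.shuffle(merged)        # users saw the results in random order
    return merged
```

With ten results per interface and six interfaces, this yields the 21 to 40 distinct results per query reported above.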
The scenarios and most of the queries were in Hebrew, because we wanted to eliminate the effects of a language barrier. Only two queries were entirely in English: Israel and the abbreviation BMI. Three additional queries contained terms in both English and Hebrew (queries q1, q6 and q9). The users provided their rankings in July 2008, i.e., before the Beijing Olympics and before the US elections.
In the second phase, for each query the users received two three-column tables. In one table the three Google SERPs were assigned randomly to the columns, and in the second table the Live Search SERPs were displayed similarly. The two tables appeared in random order, i.e. sometimes the Google results appeared first and sometimes the Live Search results. The users were not told anything about the sources, but they were told that this time each column was a ranked list of results for the specific query. They were asked to choose the best ranking from each table and the best ranking overall. Figure 4 illustrates what the users saw in the second phase of the study.
The search engine and the user rankings were compared using three complementary measures, developed by Bar-Ilan, Keenoy, Levene and Yaari (2007): the size of the overlap, Spearman's footrule, and the M-measure. The overlap (OC) is the number of search results common to two lists. The footrule (F) compares the orderings of the overlapping elements, and the M-measure extends it by taking into account the placement of the non-overlapping elements as well. Both measures are normalized so that their values lie between 0 and 1, where 0 means no similarity and 1 means identical orderings. The F measure is only defined when the size of the overlap is at least 2. Details of these measures can be found in Bar-Ilan et al. (2007).
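The overlap and footrule measures can be sketched in Python as follows. The M-measure, which also weights the non-overlapping elements, is omitted; the sketch follows the exact formulas of Bar-Ilan et al. (2007) only in spirit, and the normalization shown here is an assumption.

```python
def overlap(list1, list2):
    """OC: the number of results common to the two ranked lists."""
    return len(set(list1) & set(list2))

def footrule(list1, list2):
    """Normalized Spearman's footrule on the overlapping elements.

    Both lists are restricted to their common elements, the footrule
    distance between the induced orderings is computed, and the result
    is mapped to [0, 1] so that 1 means identical orderings.
    Defined only when the overlap is at least 2.
    """
    in2, in1 = set(list2), set(list1)
    common = [x for x in list1 if x in in2]       # induced order in list1
    k = len(common)
    if k < 2:
        return None                               # F undefined for overlap < 2
    pos1 = {x: i for i, x in enumerate(common)}
    pos2 = {x: i for i, x in enumerate(y for y in list2 if y in in1)}
    dist = sum(abs(pos1[x] - pos2[x]) for x in common)
    max_dist = (k * k) // 2                       # maximum footrule distance
    return 1 - dist / max_dist
```

With exactly two overlapping elements this sketch reproduces the behaviour noted below for Table 2: F is 1 when the pair appears in the same order in both lists and 0 otherwise.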
Search engine-search engine similarities
We were somewhat surprised that even for queries in Hebrew the results are country-specific. This is not the case for all queries, but it was not difficult to find queries for which the results are not identical when searching google.co.il (abbreviated in the paper as GI), google.com (GU) and google.co.uk (GK). There were only two cases where the results from the US and UK Google sites were identical (q5 and q6), and in all nine cases they differed from the results displayed at google.co.il. In none of the cases were the ranked lists at the different live.com sites identical. The abbreviations used for the different Live sites in the paper are LI for Live Israel, LU for Live US and LK for Live UK.
We computed the similarity measures for all search engine (SE) pairs and for all the queries. Table 2 displays the averages and the minimum and maximum values for all the Google-Google pairs, all the Live-Live pairs and some of the cross-engine pairs (for the other cross-engine pairs the values are similar). We see that, at least for the sample queries, the results of google.com and google.co.uk were the most similar. The top results produced by Google and Live are very different, as can be seen from the very low values of OC and M. The F values are sometimes very high: this is because either there is at most one overlapping element (in which case F is not defined) or there are exactly two overlapping elements. If these two overlapping elements appear in the same order in both lists the F value will be 1, otherwise it will be 0.
Table 2. SE-SE similarities
SE-user similarities based on the rankings
Based on the users' individual rankings, we produced a ranking called AVG (see also Bar-Ilan et al., 2007), an ‘average’ ranking of the users. AVG was defined as follows: each item not ranked by a user was assigned the virtual rank 11, and for each item i we calculated Σ_j (11 − rank_j(i)), summing over all users j, where rank_j(i) is the rank user j assigned to item i. The item with the highest sum was assigned rank 1, the second highest rank 2, and so on down to the tenth highest, which was assigned rank 10. In Table 3 we see the similarity measures between this average ranking and the search interfaces. Google.co.il was most similar to the average user ranking, but the values are very close to those for the other Google sites. On the other hand, the similarity values are considerably lower for the live.com sites. Note that the users had no information about the source of the items they ranked. It seems that the extremely high popularity of Google in Israel is justified. The results indicate that, judging by the M measure, the users liked the non-Israeli interfaces of live.com slightly more than the Israeli interface, even though the highest average overlap among the live.com interfaces is for live.com (Israel).
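The AVG ranking described above can be sketched as follows. The tie-breaking rule for items with equal sums is an assumption; the paper does not specify one.

```python
def average_ranking(user_rankings, top_k=10):
    """Compute the AVG ranking from the users' individual top-10 rankings.

    user_rankings: one dict per user, mapping item -> rank (1..10) for
    the items that user ranked.  Unranked items implicitly receive the
    virtual rank 11, so they contribute 11 - 11 = 0 points and can be
    skipped.  An item ranked 1 contributes 10 points, rank 10 one point.
    """
    scores = {}
    for ranking in user_rankings:
        for item, rank in ranking.items():
            scores[item] = scores.get(item, 0) + (11 - rank)
    # Highest sum first; ties broken arbitrarily (an assumption).
    ordered = sorted(scores, key=lambda item: -scores[item])
    return ordered[:top_k]   # positions 0..9 correspond to ranks 1..10
```

The scheme is essentially a Borda count with the virtual rank 11 standing in for "not in the user's top 10".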
Table 3. Similarity measures for the search engines versus the average ranking
For each query and each user we also calculated which SE ranking was most similar to the ranking provided by the user. Here we considered only M as the similarity measure. The results are displayed in Table 4. Even though google.co.il was most similar to the users' rankings overall, we observe that the results are query-specific. For example, for the fourth query, ‘Israel’, the ranking of google.com was most similar to the users’ rankings. This can be explained by the scenario, which described a user who does not live in Israel and wants to gather some information about the country. Google.com was also the most similar for the last query, ‘Google new developments’. Again, a possible reason could be that this query is somewhat US-oriented. However, for the other cases where the most similar ranking was not the country-specific ranking, we do not have an explanation. Google.co.uk was most similar to the users’ ranking for the second query, ‘Hilary Clinton’. Live.com (UK) was most similar to the users’ ranking twice, for ‘skin cancer prevention’ and ‘WHO’. Note that there was no query for which the ranking of live.com (Israel) was most similar to the users' ranking.
Table 4. Number of times the user created ranked list was most similar to the ranked list retrieved by the specific search interface
We also checked whether we could observe user-specific differences. For fourteen users (58.3%), google.co.il was most often the most similar interface, and in three additional cases google.co.il was tied with another Google interface. Thus it seems that the results are quite consistent both across users and across queries.
Best SE chosen by the users
In the second phase we asked the users to pick the best ranked list among the six ranked lists. This time the users showed an even greater preference for google.co.il than in the previous phase, as can be seen in Table 5: Google Israel provided the most preferred ranked list for 7 out of the 9 queries. When comparing rankings based on similarities, google.com was the ranking most similar to the users' rankings for 62.5% of the users for the query Israel; however, in the second phase only 33.3% of the users chose it as the best ranking. In a few cases the users were not able to choose a single ranking (or there were two identical ranked lists among the six).
Table 5. Number of times users chose the specific search engine's results as the ‘best ranked list'
Discussion and conclusions
In the first phase of the study the users were asked to evaluate individual search results and to provide their own ranked lists from results retrieved from several sources. In this way they indirectly expressed an opinion about the rankings of the sources, which we measured by computing the similarity between their ranked lists and the ranked lists provided by the sources. In the second phase of the study the users were directly asked to choose the best ranked list. Overall the results of the first phase agree with the results of the second phase, although we observed some inconsistencies.
We believe that employing multiple methods for assessing user satisfaction with search results is a promising direction. At this point it is not clear what influences the user more: a few excellent results or the overall quality of the search results. In the first phase of our study the users concentrated on the best individual results, while in the second phase they were asked about their opinion about the overall quality of specific ranked lists.
More specifically, the results indicate that the users were satisfied with the country-specific results provided by Google, and much less satisfied with the results provided by live.com; for live.com, they were more satisfied with country-specific results from countries other than their own. This was a user study conducted at a specific time with a limited number of queries. Search engine results and rankings are highly dynamic, and specific results based on searches carried out in July 2008 do not necessarily reflect the situation at any other time. However, the methodology introduced here can be employed for testing user satisfaction with search results at any given time.