Abstract
- Top of page
- Abstract
- Introduction
- Related Research
- Research Methods and Design
- Research Results
- Conclusion, Limitations, and Future Research
- Acknowledgments
- References
- Appendix
An eye tracking experiment revealed that college student users have substantial trust in Google’s ability to rank results by their true relevance to the query. When the participants selected a link to follow from Google’s result pages, their decisions were strongly biased towards links higher in position, even if the abstracts themselves were less relevant. While the participants reacted to artificially reduced retrieval quality with greater scrutiny, they failed to achieve the same success rate as with unmodified results. This demonstrated trust in Google has implications for the search engine’s tremendous potential influence on culture, society, and user traffic on the Web.
Introduction
Finding online information using search engines has become a part of our everyday lives (Gordon & Pathak, 1999). Currently the search engine serving the largest percentage of queries (at 47.3%) is Google, with an index of around 25 billion Web pages and 250 million queries a day (Brooks, 2004; Search Engine Watch, 2007). Google now provides search functions on handheld devices and smart phones (Google Inc., 2005a). With the ubiquitous presence of mobile devices, anytime and anywhere access to the information world has become a reality. Other popular search engines include Yahoo, MSN, AOL, and Ask.com, and all serve the pervasive need of finding pertinent information within the enormity of the Web. Despite the popularity of search engines, most users are not aware of how they work and know little about the implications of their algorithms (Gerhart, 2004).
All of the search engines noted above respond to a query with a ranked list of 10 abstracts in their default setting. The ranking reflects the search engine’s estimate of each Web page’s relevance to the query. Individual search engines differ in their underlying ranking implementation, in how they display the ranked results, and in any additional support they provide for finding related Web pages. Users can evaluate the abstracts, or other information displayed about a given result, before deciding whether to visit any of the suggested pages by clicking on a hyperlink. In this study, we chose to use Google because of the frequency of its use, the simplicity of its display of query results (which can serve as a common basis when compared to many other search engines), and our prior experience studying Web search on Google (Granka, Joachims, & Gay, 2004). We also confined the study to only one search engine to ensure a constant visual display on which we could analyze and interpret the subjects’ eye movements.
The information search process is made possible through three parties: Web authors, the search engines themselves, and the users of search engines. The Web authors put their Web pages online with appropriate linking to other pages. The link structure has been used by popular search engine algorithms (Brin & Page, 1998) that can take advantage of this structure to rank relevant Web pages. Users of search engines enter various keywords (sometimes with Boolean commands) according to their understanding of the task and the functionality of the search engine, and they evaluate the results returned by the search engine, making a decision on whether or not to select one of the returned results or reformulate the query. Search engines act as an information intermediary that facilitates the information seeking process.
However, how well a Web page actually reflects the users’ search intentions is hard to measure. For example, the ranking algorithm of Google uses a page’s measure of in-links to help inform its quality and relevance (Pandey, Roy, Olston, Cho, & Chakrabarti, 2005). Some have argued that such algorithms, including PageRank, simply set up a rich-get-richer loop, whereby relatively few sites dominate the top ranks (Hindman, Tsiotsioliklis, & Johnson, 2003). Retrievability and visibility represent only part of the search process. We wondered what role the user plays in perpetuating this rich-get-richer dynamic. In particular, we wondered how much of the correlation between Web traffic and site popularity (Hindman et al., 2003) is due to the alleged efficiency of these algorithms, as opposed to users’ tendency to simply trust the ranked output displayed by a search engine and forgo any in-depth analysis or comparison of the retrieved results. More importantly, Google’s imperfect algorithm is open to abuses such as Google bombing (Tatum, 2005; see also Bar-Ilan, this issue). This might deliver erroneous messages to a large population when the searchers trust Google without questioning its underlying ranking mechanism.
Our curiosity regarding this question was piqued by an earlier study conducted by Granka et al. (2004). Their results indicated that most student subjects only view and click the top two results returned by Google. The design of this earlier study did not tease apart whether those choices were the result of the top positions of the two abstracts as influenced by Google’s ranking algorithm, or if those were truly the most relevant results as evaluated by the subjects. We were interested in finding out whether a user’s choice of a particular abstract was based on the position of that abstract, the user’s evaluation of the relevance of that abstract, or a combination of the two.
In order to explore the relative contribution of relevance and position, we employed eye tracking as the methodology in the current study. Eye tracking devices record eye movements and thereby reveal subjects’ attention and cognitive processes. In cognitive psychology, human-computer interaction, and marketing, eye tracking methods have been used for decades (Rayner, 1998). We use eye tracking to investigate how users make decisions when confronted with returned Google results following a query. Eye tracking adds meaning to the more traditional log file or click behavior analysis. It allows for a more complete assessment of the information-seeking process by revealing which query result abstracts users looked at, or were aware of, before selecting a query result or refining their query. This article provides behavioral evidence that sheds light on the factors that influence the evaluation process in search engine use.
Research Methods and Design
In the present study, three terms are used as follows. Rank refers to the original sequence of abstracts returned by Google; a lower rank indicates that Google’s algorithm judged the Web page less relevant and therefore placed it later in the sequence. Position refers to the physical location of an abstract on the Google results page, for example, from top to bottom (1 to 10) on the first results page. Relevance refers to the subjective judgment of how likely a piece of information is to be related to the answer to the question or the goal of the search task. In this study, we obtained relevance through human judgments of the abstracts returned by Google (abstract relevance), as well as of the pages associated with those abstracts (Web page relevance). The judgment data were needed to verify that the 10 results were not all equally relevant.
Based on the findings from our previous study (Granka et al., 2004), the current work was designed to exploit Google’s ranking function in order to investigate how much the subjects rely on Google’s ranking to make their decisions about relevance. Unbeknown to the subjects, we manipulated the order of Google’s returned results in some cases, such that abstracts of actual lower ranked Web pages appeared higher in position and vice versa. Thus, choosing a lower ranked abstract that is in a higher position in the Google results page but is evaluated to be less relevant by human judges would be evidence that the subjects have assigned priority to Google’s “expertise” over the actual relevance of the abstract.
This section introduces eye tracking as a methodology and introduces the details of the research methods and procedures used in the current study. A laboratory setting was necessary to capture all aspects of the search sessions and related eye movements, in order to compare the variables across all subjects systematically. Although some scholars have argued that external validity is compromised in a laboratory setting, previous studies have shown that in laboratory settings and Web settings there are few or no differences in the subjects’ behavior on information search, especially on those tasks using keywords (Epstein, Klinkenberg, Wiley, & McKinley, 2001; Schulte-Mecklenbeck & Huber, 2003).
The Subjects
In this study, participants were undergraduate students with various majors (including communication, engineering, and arts and sciences) at Cornell University (U.S.A.). All students were given extra class credit for their participation in the experiment. Twenty-two subjects were recruited, and 16 complete data sets, including 11 males and 5 females, were obtained. Attrition was due to random recording difficulties and the inability of some subjects to be calibrated precisely. The average age of participating subjects was 20 years and 4 months. All subjects reported that they used Google as their primary search engine and had a high familiarity with the Google interface (all scored 10 out of 10); when asked about the levels of trust in Google, they reported an average of 7.9 (out of 10). Thus, our subjects, in general, are savvy users of Google and tend to trust Google to a high degree.
Search Tasks
Ten search tasks were included in this study, each of which addressed a unique aspect of the information retrieval experience. Half of the searches were navigational in nature, asking subjects to find a specific Web page or homepage. These were definitive searches, meaning that only one correct Web page would provide an acceptable answer. The other five tasks were informational, asking subjects to find a specific bit of information (Broder, 2002). Much of the content for the tasks was generated according to the content of top searches listed on Google Zeitgeist (Google Inc., 2005b). Our purpose was to ensure that the tasks in this experiment represented the various genres of searches that the general population uses on a regular basis, including travel, movies, current events, celebrities, and local issues. These tasks were also pre-tested to ensure that the most intuitive queries would not always result in top-ranked results; therefore, the findings should be interpreted in light of the fact that these queries are on average more difficult than a subject’s typical query. Table 1 briefly describes the 10 search tasks included in the experiment and the correct answers to these tasks.
Table 1. The 10 information search tasks

| Task Type | Task | Correct Answer |
|---|---|---|
| Navigational | Find the homepage of Michael Jordan, the statistician. | http://www.cs.berkeley.edu/~jordan/ |
| Navigational | Find the page displaying the route map for Greyhound buses. | http://www.greyhound.com/maps/ |
| Navigational | Find the homepage of the 1000 Acres Dude Ranch. | http://www.1000acres.com/ |
| Navigational | Find the homepage for graduate housing at Carnegie Mellon University. | http://www.housing.cmu.edu/graduatehousing/ |
| Navigational | Find the homepage of Emeril, the chef who has a television cooking program. | http://www.emerils.com/emerilshome.html |
| Informational | Where is the tallest mountain in New York located? | The Adirondacks OR High Peaks Region |
| Informational | With the heavy coverage of the democratic presidential primaries, you are excited to cast your vote for a candidate. When are/were democratic presidential primaries in New York? | March 2, 2004 |
| Informational | Which actor starred as the main character in the original Time Machine movie? | Rod Taylor |
| Informational | A friend told you that Mr. Cornell used to live close to campus, near University and Steward Ave. Does anybody live in his house now? If so, who? | Members of Llenroc, the Cornell chapter of the Delta Phi Fraternity live in the mansion. |
| Informational | What is the name of the researcher who discovered the first modern antibiotic? | Alexander Fleming |
Experimental Procedure
All participants were required to give informed written consent prior to the start of the experiment. Before the actual experiment, the eye tracker was calibrated using a nine-point standard calibration procedure for each subject (Duchowski, 2003). Participants were instructed to search for the 10 different tasks through the Google interface. Subjects were told to view the Web pages and search as they typically would under normal conditions, with the opportunity to scroll up and down the page at their leisure. The experimenter sat to the right of and behind the subject, where she was able to watch the subject, the subject’s eye, and also the corresponding eye movements on the two control monitors. If the experimenter recognized that the eye tracking system temporarily lost a subject’s eye path due to extreme movements, she could re-center and if appropriate, perform a quick recalibration fix. This happened rarely and randomly; it did not interrupt the experimental session since the experimenter could perform the quick fix in a few milliseconds.
The 10 search tasks were read aloud to the subject by the experimenter to eliminate unnecessary eye movements away from the computer monitor; such eye movements could potentially hinder the accuracy of the ocular calibration. Typically, due to the monitor size, scrolling was required to view the abstracts in positions 7 through 10 on the Google results pages. To eliminate potential bias from question order effects, the order of the search questions was randomized for each subject. The maximum time for completing each task was restricted to three minutes. As before, the time constraint allowed for a sufficient amount of eye tracking data to be collected for each task and also minimized the total time required of each subject. The most important data for this study come from how each subject responds to and interacts with the 10 results following each query, not from the full completion of the task itself.
Design
Because Google is continuously updating its search algorithms, one specific query will not produce the same exact results on two separate occasions. Because much of the data analyses were to occur after the experimental sessions, it was necessary to cache the Web pages with which the subjects actually interacted. A proxy server was set up to mediate the interaction between the subjects and the Google Web server. The proxy script was run on the subject’s computer and stored every search query typed by the subjects, as well as all links and Web pages that were viewed, along with the corresponding times that they were accessed and viewed. When the subject typed in a query, the query was sent to the proxy server, and the proxy server relayed it to Google. After receiving the results from Google, the proxy server manipulated the results and passed on the modified results to the subject’s Web browser.
Results were modified in two ways. First, the proxy server removed the advertisements on the Google results page to avoid distraction and ensure consistent stimulus exposure across all subjects. This also saved the authors from having to filter out eye movements on ads, since the study was concerned with which query results were viewed and selected. Second, in order to explore the relative contribution of relevance versus position to the decision making process, the results were further manipulated for each subject in one of three ways. In the “Normal” condition, the proxy server returned the results in their original ranked order. In the “Swapped” condition, the proxy server swapped the positions of the first ranked abstract and the second ranked abstract, keeping the rest of the ranking intact. In the “Reversed” condition, the proxy server reversed the positions of the abstracts on the first result page as follows: the rank 1 abstract was swapped with the rank 10 abstract, the rank 2 abstract was swapped with the rank 9 abstract, and so on.
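The three reordering conditions can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the proxy script used in the study: the function name `arrange_results` and the lowercase condition labels are hypothetical, and the input is assumed to be the list of abstracts on one results page in Google’s original rank order.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def arrange_results(ranked: Sequence[T], condition: str) -> List[T]:
    """Return the display order (top to bottom) for one results page.

    `ranked` holds the abstracts in original rank order, rank 1 first.
    `condition` is one of "normal", "swapped", or "reversed" (hypothetical labels).
    """
    page = list(ranked)  # copy so the caller's list is left untouched
    if condition == "normal":
        return page
    if condition == "swapped":
        # Exchange rank 1 and rank 2; everything else keeps its place.
        if len(page) >= 2:
            page[0], page[1] = page[1], page[0]
        return page
    if condition == "reversed":
        # Rank 1 <-> rank 10, rank 2 <-> rank 9, ... is a full reversal.
        return page[::-1]
    raise ValueError(f"unknown condition: {condition!r}")


ranked = [f"rank_{i}" for i in range(1, 11)]
print(arrange_results(ranked, "swapped")[:3])   # ['rank_2', 'rank_1', 'rank_3']
print(arrange_results(ranked, "reversed")[:2])  # ['rank_10', 'rank_9']
```

In this sketch, the index of an item in the returned list is its position on the page, while its label still records its original rank, which is the distinction between position and rank that the design relies on.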
Eye Tracking Indices
During an eye tracking experiment, several measurements are typically recorded that are relevant for studying college students’ interactions with search engines. ‘Fixation’ refers to a relatively stable eye-in-head position within some threshold of dispersion (typically ∼2°) over some minimum duration and with a velocity below some threshold (typically 15–100 degrees per second). In this study, we set the minimum duration as 50 milliseconds, as suggested in the ASL504 eye tracker manual (Applied Science Laboratories, 2005). Eye fixations are the most relevant metric for evaluating information processing in online search. Fixations represent the instances in which most information acquisition and processing occurs (Rayner, 1998). The total number of fixations is often used as an indicator of processing difficulty, with fixation density related to the complexity and informativeness of the visual stimulus (DeGraef, De Troy, & d’Ydewalle, 1992; Friedman, 1979; Henderson, Weeks, & Hollingsworth, 1999), such that as informativeness increases, so too does the number of fixations in that area. In the current study, we also used measures of the average number of fixations. A higher number of fixations on an abstract will represent intensified information processing.
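To make the definition of a fixation concrete, the following sketch groups raw gaze samples into fixations using a dispersion-threshold approach (often called I-DT). This is an assumption for illustration only: the article does not say which detection algorithm the eye tracking software applied, the velocity criterion is omitted, and the sample format, the function name `detect_fixations`, and the dispersion measure (horizontal range plus vertical range) are hypothetical choices. The default thresholds reuse the values quoted above (about 2° of dispersion and a 50 millisecond minimum duration).

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float, float]            # (time_ms, x_deg, y_deg)
Fixation = Tuple[float, float, float, float]   # (start_ms, end_ms, mean_x, mean_y)


def _dispersion(window: Sequence[Sample]) -> float:
    """Horizontal range plus vertical range of the samples, in degrees."""
    xs = [s[1] for s in window]
    ys = [s[2] for s in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def detect_fixations(
    samples: Sequence[Sample],
    max_dispersion_deg: float = 2.0,
    min_duration_ms: float = 50.0,
) -> List[Fixation]:
    """Dispersion-threshold fixation detection over time-ordered samples."""
    fixations: List[Fixation] = []
    n = len(samples)
    start = 0
    while start < n:
        # Grow the window until it spans at least the minimum duration.
        end = start
        while end < n and samples[end][0] - samples[start][0] < min_duration_ms:
            end += 1
        if end >= n:
            break  # not enough remaining data for another fixation
        if _dispersion(samples[start:end + 1]) <= max_dispersion_deg:
            # Extend the window while the gaze stays within the threshold.
            while end + 1 < n and _dispersion(samples[start:end + 2]) <= max_dispersion_deg:
                end += 1
            window = samples[start:end + 1]
            mean_x = sum(s[1] for s in window) / len(window)
            mean_y = sum(s[2] for s in window) / len(window)
            fixations.append((window[0][0], window[-1][0], mean_x, mean_y))
            start = end + 1
        else:
            start += 1  # drop the first sample and try again
    return fixations
```

Counting the fixations that fall on each abstract then gives the per-abstract fixation measure used as an indicator of processing intensity.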
‘Pupil Dilation’ refers to widening of the pupil. It has long been known that pupils dilate in response to emotion-evoking stimuli (Beatty, 1982). While it is also the case that pupil size is affected by light, the lighting remained constant in our experiment. As Rayner and others have pointed out (Rayner, 1998), using only a single indicator of processing difficulty may result in an oversimplification of the relationship between the indicator and processing difficulty. Hence, both fixation and pupil dilation measures are frequently used as corroborating measures of cognitive workload (Hess, 1965; Just & Carpenter, 1980; Kahneman, 1973). Last, a ‘scanpath’ is the spatial arrangement of a sequence of fixations, or simply the sequence of LookZones that a subject views, as in the present study.
Definition of LookZones
In addition to logging the clickstream and Web page data of subjects, the script also constructed ‘LookZones’ around key content regions. The script utilized a feature inherent to the GazeTracker software system that automatically creates LookZones around links and pictures, which the software recognizes within the HTML tags. (For more information on the GazeTracker software system and the eye tracking apparatus itself, see Appendix A.) Thus, the script enabled the creation of distinct LookZone regions around each of the ten displayed results (Figure 1). For the analysis, each of these displayed results on Google—abstracts in rank #1, rank #2, rank #3, to rank #10—is given its own set of LookZones, from which we can then compare eye tracking behaviors across all queries, relative to these zones. LookZones were not visible during the time participants were engaged in the experiment.
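The relationship between fixations, LookZones, and scanpaths can be illustrated with a small sketch. It is not the GazeTracker implementation: the `LookZone` class, the rectangle coordinates, and the helper names are hypothetical, and each zone is assumed to be an axis-aligned rectangle in screen coordinates.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class LookZone:
    """A rectangular screen region around one displayed result (hypothetical)."""
    name: str
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x < self.right and self.top <= y < self.bottom


def zone_for(zones: Sequence[LookZone], x: float, y: float) -> Optional[str]:
    """Name of the first zone containing the point, or None if outside all zones."""
    for zone in zones:
        if zone.contains(x, y):
            return zone.name
    return None


def fixations_per_zone(
    zones: Sequence[LookZone], fixations: Sequence[Tuple[float, float]]
) -> Dict[str, int]:
    """Count how many fixation centers fall inside each zone."""
    counts = {zone.name: 0 for zone in zones}
    for x, y in fixations:
        name = zone_for(zones, x, y)
        if name is not None:
            counts[name] += 1
    return counts


def scanpath(
    zones: Sequence[LookZone], fixations: Sequence[Tuple[float, float]]
) -> List[str]:
    """Sequence of zones visited, with consecutive repeats collapsed."""
    hits = [zone_for(zones, x, y) for x, y in fixations]
    visited = [h for h in hits if h is not None]
    return [name for name, _ in groupby(visited)]
```

With ten zones stacked from top to bottom, a scanpath such as position 1, position 2, position 1 would show a subject returning to an abstract already viewed, which is the kind of backtracking examined in the hypotheses.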
Judged Relevance
As stated above, we considered rank, position, and judged relevance in this study. For all queries and results pages that were encountered in the study, we gathered relevance assessments of the abstracts. This allowed us to examine subjects’ choices as a function of both the position and the judged relevance of the chosen result, in case Google’s rank did not reflect what other humans might consider relevant. Five non-participants served as judges. For each results page, we randomized the order of the abstracts and asked judges to weakly order the abstracts (ties were allowed) by how promising they looked for leading to relevant information. Each of the five judges assessed all results pages for two questions, plus 10 results pages from two other questions, for inter-judge agreement verification. The set of abstracts/pages we asked judges to weakly order was not limited to the (typically 10) hits from the first results page; rather, it included all results encountered by a particular subject for a particular question. The inter-judge agreement on the abstracts was 82.5%. Furthermore, we also collected relevance judgments for the actual Web pages those abstracts represent. The inter-judge agreement on the relevance assessment of the pages was 86.4%.
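The article does not state how inter-judge agreement was computed. One plausible reading, sketched here purely as an assumption, is pairwise agreement: for every pair of abstracts that both judges ordered, check whether the two judges put the pair in the same relation (first is better, tied, or second is better). The function name and the representation of a weak ordering as a mapping from abstract identifier to an integer level (1 = most promising, ties share a level) are hypothetical.

```python
from itertools import combinations
from typing import Dict


def _relation(a: int, b: int) -> int:
    """-1 if a is ordered before b, 0 if tied, 1 if a is ordered after b."""
    return (a > b) - (a < b)


def pairwise_agreement(judge_a: Dict[str, int], judge_b: Dict[str, int]) -> float:
    """Fraction of abstract pairs that two judges order the same way.

    Each argument maps an abstract identifier to its level in that judge's
    weak ordering; equal levels mean the judge tied the two abstracts.
    """
    shared = sorted(set(judge_a) & set(judge_b))
    pairs = list(combinations(shared, 2))
    if not pairs:
        raise ValueError("judges must share at least two abstracts")
    agreeing = sum(
        1
        for p, q in pairs
        if _relation(judge_a[p], judge_a[q]) == _relation(judge_b[p], judge_b[q])
    )
    return agreeing / len(pairs)


# Hypothetical orderings of four abstracts by two judges.
first = {"a": 1, "b": 2, "c": 2, "d": 3}
second = {"a": 1, "b": 2, "c": 3, "d": 3}
print(pairwise_agreement(first, second))  # 4 of 6 pairs agree
```

Under this reading, a figure such as 82.5% would mean that the judges ordered roughly five out of every six abstract pairs the same way.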
Hypotheses and Analysis
We analyzed a combination of mouse click behaviors and the various ocular indices on three levels. On the task level, the analysis aggregates the behavior of a subject over all queries corresponding to one task; on the page level, the choice of clicks and ocular behavior on all pages were analyzed; on the abstract level, we analyzed the determinants of whether an abstract was viewed or clicked.
Our analyses were guided by the following hypotheses:
H1: At the page level of analysis, ocular data would differ among the three conditions.
In particular, we expected that participants in the Reversed condition would demonstrate greater scrutiny of returned results when compared with controls, and that this would manifest itself through longer fixation times overall, more total fixations, more abstracts viewed per page, and more visual backtracking (regressing) to previously viewed abstracts. In addition:
H2: At the abstract level of analysis, the eye data from participants in the Reversed condition would indicate explicit trust for Google’s ranking, as evidenced by a lack of significant difference among the three conditions in the number of fixations per abstract on the top two positioned abstracts. Furthermore, subjects would look at the last two positioned abstracts (the number one and two Google ranked abstracts) more than in the other two conditions, indicating an implicit awareness of their significance, either from confusion or interest. This implicit awareness would also be demonstrated in subjects’ pupil dilation data.
However, we suspected that the subjects would tend to trust Google as an authoritative search engine, since users mostly clicked on the first returned result in the first eye tracking study (Granka et al., 2004). They might still click on abstracts in higher positions even when the actual relevance of those abstracts was low. Thus:
H3: Participants in both the Swapped and Reversed conditions would still choose abstracts of actual lower rank more often than subjects in the control condition (those who viewed Google results in their actual ranked order).
Thus, we anticipated a dissociation between the ocular data and the click data: the ocular data would indicate some implicit conflict between the position and the actual Google rank, yet the subjects would still choose a higher positioned abstract, reflecting greater trust in Google’s algorithms than in their own judgment. This can be tested through a regression of whether or not an abstract was clicked on against the relevance and the position of all abstracts.
Conclusion, Limitations, and Future Research
In summary, the findings here show that college student subjects are heavily influenced by the order in which the results are presented and, to a lesser extent, the actual relevance of the abstracts. These subjects trust Google in that they click on abstracts in higher positions even when the abstracts are less relevant to the task. When looked at in combination, the behavioral data (clicked choices) and the ocular data indicate that while there might be some implicit awareness of the conflict between the displayed position and their own evaluation of the abstracts, it is either not enough, or not strong enough, to override the effects of displayed position.
Trust is one mechanism humans use to reduce the complexity of decision making in uncertain situations (Luhmann, 1989) and may be viewed as a fast and frugal heuristic that exploits the regularity of the information environment (Gigerenzer & Selten, 2002; Simon, 1956). Google’s information retrieval algorithm sorts the results by an estimate of the probability that each result will fulfill the user’s information need, thereby potentially reducing the cognitive effort and time costs for searchers. In order to determine whether trust does, in fact, lead users to the most relevant documents, we further asked human judges to rate the relevance of the actual pages instead of the abstracts on Google results pages. Further analysis showed that under the Normal condition, participants clicked on the abstracts that represented the most relevant Web pages 51% of the time. Had the subjects simply clicked on Google’s number one ranked abstract, they would have found the most relevant Web page in 43% of the cases. Thus, trust does help the subjects reduce time and effort costs and locate the most relevant abstracts successfully some of the time.
What happens when users’ trust extends beyond their individual evaluation of a set of returned results? To the extent that the PageRank algorithm works well on any given query, the imbalance between trust and evaluation may simply mean greater or lesser search costs, in terms of, say, time and/or effort. More critical, perhaps, is an increased probability of misinformation, particularly in circumstances of topic naiveté. More insidious, however, is the potential for misguided trust to exacerbate what others already fear regarding the non-egalitarian distribution of information (Hindman et al., 2003; Introna & Nissenbaum, 2000), whether as a result of economic resources, indexing policies, or algorithms.
Combining users’ proclivity to trust ranked results with Google’s algorithm increases the chances that those “already rich” by virtue of nepotism get “filthy rich” by virtue of robotic searchers. Smaller, less affluent, alternative sites are doubly punished by ranking algorithms and lethargic searchers. A simulation study by Cho and Roy (2004) examined how new Web pages gain popularity under different information access methods. They compared how one page can become popular under two conditions: assuming all users do random surfing online versus assuming all users access information through a popularity-based search engine like Google. Their results demonstrated that it takes 66 times longer for a page to become popular under the search-engine model.
Users, as a whole, are not familiar with how search engines “find” what they are looking for (Introna & Nissenbaum, 2000). The present results suggest that some users might benefit from having more information regarding the mechanisms by which Google and other search engines “crawl” the Web and determine how a website is ranked. Raising awareness through design is a promising direction. This might be accomplished on the results page of a search through a short explanation provided to the user on how the results were ranked, or perhaps through the visualization of the inbound and outbound link relationships that a website has fostered. As in social network analysis, users would be able to view central and peripheral sites, and more importantly, trace the connections or lineage between sites to determine the interest and relevance of a site based on its similarity to other sites and provenance. However, this requires a delicate design balance between maintaining simplicity and adding information content on search engine interfaces.
The limitations of this study lie primarily in its experimental nature. Research shows that information searching in a lab setting can be generalized to a larger context (Epstein et al., 2001; Schulte-Mecklenbeck & Huber, 2003). However, the subjects in the current study were all undergraduate students, in the same age group, who used Google as their primary search engine, and who trusted Google greatly; they conducted information searches in a lab setting on artificially designed search tasks. The applicability of the study results to a broader range of users and contexts is thus limited.
Google was chosen as the test search engine for the reasons described earlier; the generalization to other search engines is also limited. In a follow-up study we intend to investigate the generalizability of these effects with different existing search engines as well as with one that we contrive on our own, with no known history or a priori influence on subjects. In this way we will be able to explore the pervasiveness of this trust and whether (and if so, how much) user trust transfers to other search engine contexts. In addition, we intend to recruit participants from different age groups and cultures and with different levels of search expertise, since our current study was limited to a group of college students and may not extend outside of that group.
Another promising direction is more refined analysis of the eye tracking data within the framework of information foraging theory (Pirolli & Card, 1999), especially with respect to the cost of clicking and viewing lower ranked abstracts, as well as the value of information scent. In the meantime, as one of the first studies to explore evaluation processes in Web search, this study makes a significant theoretical and empirical contribution.