Exploration of dynamic query suggestions and dynamic search results for their effects on search behaviors

Abstract

While research on search behavior with dynamic query suggestions is scarce, it is virtually non-existent for dynamic search results (as currently experienced with Google Instant). We report results from a controlled lab study aimed at exploring the effects of these recent search interface developments – dynamic query suggestions and dynamic search results – on users' search behaviors. Based on the availability of these two features, 36 participants were assigned to three conditions and were asked to complete an exploratory search task. Analyses of user behaviors were conducted based on log data, screen videos, and eye tracking. Our results showed that while the dynamic search results feature exposed the participants to more search results pages and led to shorter querying time and shorter queries, it did not change users' general search process transitions, nor the number of search sites, queries, and visited webpages. The findings also indicate a need to evaluate search interface features in the broader context of task completion rather than information searching and query running alone.

INTRODUCTION

Search engine services have attempted to improve the effectiveness and efficiency of online searching through both behind-the-scenes algorithmic advancements and front-end interface enhancements. Two recent interface features include dynamic query suggestions, now offered by almost every major search engine, and dynamic search results, currently available as the Google Instant service. However, little research has been conducted to evaluate the usefulness of dynamic query suggestions, and to the best of our knowledge, no work has tried to evaluate dynamic results features from a user's perspective with a focus on search in the context of a task. Google claims that its Instant feature (dynamic search results) saves 2-5 seconds per search.1 But a serious question is whether this scales beyond individual queries within a search session, or leads to better results with respect to the task that led the person to engage in information seeking. Even if a few seconds are saved in running a query, does this translate into being more efficient and effective in collecting more or better information, or in general affect a user's search behavior? To investigate these questions, we designed a user study in which we gave the participants a simulated work task scenario that involved searching for and collecting information. This allowed us to study users' search behaviors in different conditions, defined based on the search features available to the user, in the context of a task. The present paper provides details of this user study, as well as its implications for future evaluation and design of search systems.

To test the effects of the system features, the study had participants randomly divided into three conditions:

  • C1Standard: Query-suggestions off, Google Instant off

  • C2Suggestion: Query-suggestions on, Google Instant off

  • C3Instant: Query-suggestions on, Google Instant on

In this work, we were interested in examining and comparing between conditions multiple aspects of users' behaviors, including users' physical actions, state transitions, time spent in different states of search events, and query formulation. More specifically, we aimed at exploring the following questions:

  • RQ1. Are there differences in users' logged search behaviors in the three conditions? If so, what are they?

  • RQ2. Are the state transition patterns in the three conditions different from each other? If so, how?

  • RQ3. Do users spend time differently in the three conditions, at the search engine homepages, search engine result pages (SERPs), and query formulating and reformulating phases?

  • RQ4. Is query length different in the three conditions? If so, how?

  • RQ5. Is the number of concepts covered in queries different in the three conditions? If so, how?

RELATED WORK

Our research relates to three areas in the literature: (1) dynamic search interface techniques – specifically, query suggestion features since there has been no work so far on dynamic search results that we know of, (2) time spent on searching, and (3) evaluating information retrieval (IR) systems. Below we provide a brief summary of important works relating to these three topics from the literature.

IR System Query Suggestion

The literature of IR has seen quite a lot of effort spent on improving search system interfaces to enhance search results and user experiences. Query suggestion is one feature that attracts much attention. Two types of query suggestion methods are (1) query expansion, in which the system modifies the searchers' queries automatically without the searchers being involved or even realizing that this has happened, and (2) interactive query suggestion, in which users are able to select query suggestions. Research (e.g., Beaulieu, 1997; Koenemann & Belkin, 1996) has shown that users would like some kind of control with query expansion features, and that increasing the level of user control improves search effectiveness. Recently, IR systems have begun to employ dynamic (also called real-time) query suggestions, which suggest queries while users type their own query terms.

Some studies have examined the effect of real-time query suggestions, whether of query terms or whole queries. White & Marchionini (2007) conducted a user study aimed at examining the effectiveness of real-time query expansion through suggested query terms. The study compared three interfaces: a baseline interface with no query assistance feature, an interface that provides a query expansion option by suggesting query terms after users submit their queries, and an interface that provides a real-time query expansion option that suggests query terms while users type their queries. Users were given two types of tasks: known-item tasks, which have a single correct answer, and exploratory tasks, which do not. Their results showed that with the real-time query expansion interface, users had higher-quality initial queries (measured by the number of concepts in queries), engaged more in the search, and were more likely to take the query suggestions, but the three interfaces did not show any significant differences in the quality of all queries, task completion time, or result quality (measured by correctness for known-item tasks and precision@10 for exploratory tasks). While users were more satisfied with the real-time query function with respect to engagement and enjoyment, they were more satisfied with the baseline system with respect to effectiveness and usability.

Leroy et al. (2007) compared three medical search engines with query term support tools. Two of them dynamically suggested terms based on users' keywords, and one used a static method. They also used different thesauri as term sources. The three systems required different amounts of user effort in selecting and using suggested terms. A user study was conducted with 23 participants, who searched on assigned topics and were asked to record subtopics and supporting abstracts. It was found that the dynamic term support tools led to higher efficiency: specifically, users performed fewer searches and found more documents per query. However, the three systems did not show significant differences in the number of useful subtopics recorded by the users. In summary, the different term support tools showed significant differences in efficiency but not in effectiveness.

Through the analysis of a large-scale dataset of 4 days of logs collected over a period of 4 months, Anick & Kantamneni (2008) examined users' engagement with the real-time query assistance (i.e., query expansion) feature in Yahoo!. This assistant offered suggest-as-you-type expansions, showing suggested whole queries to the users in a display tray under the query input box. The study found that users' attention to and interaction with the query assistant increased over time. However, the study did not assess how this query assistant helped users in finding information and finishing their information tasks.

Time Spent on Searching

Time spent in different states of the search process, such as the search engine homepage, SERPs, and content pages, has frequently been used as a way to observe and analyze users' behaviors. The proportion of time spent in each state has been examined as an indicator of task difficulty. Aula, Khan, & Guan (2010) found that in unsuccessful tasks, users spent a larger proportion of the task time on SERPs than in successful tasks. Dwell time in each state has also been examined by previous studies in relation to factors such as task type. For example, Liu et al. (2010) found that decision time (defined as the time users take during the search process to decide whether a document is useful) was longer in tasks in which a webpage as a whole is judged for its usefulness than in tasks in which only part of a webpage is judged. In addition, dwell time on webpages has been studied as an indicator of the usefulness or relevance of the webpage (e.g., Morita & Shinoda, 1994; Kelly & Belkin, 2004; Liu & Belkin, 2010; White & Huang, 2010). However, time spent in different states has rarely been used to examine the effect of different system conditions.

Figure 1.

A snapshot of the Coagmento system used in the user study.

IR System Evaluation

The traditional approach to system evaluation, as represented by the Cranfield/TREC (Text REtrieval Conference)2 evaluation paradigm, uses document relevance as a criterion to assess system responses to single queries. With research in IR expanding to take a broader perspective of the information seeking process – explicitly including users, tasks, and contexts in a dynamic setting rather than treating information search as static or as a sequence of unrelated events – the traditional evaluation approach is not appropriate in many cases. Recently there have been alternative approaches that put evaluation in a broader context and employ multiple criteria and measurements (e.g., Belkin, Cole, & Liu, 2009; Cole et al., 2009; Vakkari, 2010).

As Belkin, Cole, & Liu (2009) and Cole et al. (2009) stated, information seeking takes place in the circumstances of having a goal to achieve or a task to complete, and an information seeking episode consists of a sequence of interactions between users and the information objects provided by information systems. Each interaction has both an immediate goal and a goal with respect to the accomplishment of the general goal; therefore, evaluation of IR systems should be modeled under the goal of information seeking and should measure a system's performance in fulfilling users' goals through its support of information seeking. Through an analysis of the characteristics of exploratory searching, Vakkari (2010) suggested ideas on how to evaluate exploratory search systems and proposed that system evaluation criteria should be set in accordance with the objectives of the systems. This also extends the evaluation paradigm from a focus only on the output of the system to the whole search process. In their study evaluating user efforts along search trails, i.e., series of webpages starting with a search query and terminating with an event such as session inactivity, White & Huang (2010) employed multiple evaluation criteria, including relevance, topic coverage, topic diversity, novelty, and utility. Such multi-dimensional evaluation criteria offer a more comprehensive understanding of IR systems, their performance, and their support of information seekers' search processes as well as their overall task accomplishment or the meeting of their information needs.

METHODOLOGY

To address our research questions stated above, we conducted a controlled laboratory study involving 36 participants in three different system conditions, defined based on the kind of interactive features available to the users. The current section provides details of this study.

System

The experiment was run on a Dell workstation, with an average system configuration. We developed an add-on for the Firefox web browser, called Coagmento (Shah, 2010; González-Ibáñez & Shah, 2011) (Figure 1) that allowed the participants to capture relevant information effectively without leaving the browser. The add-on included a toolbar with the following buttons:

  • Home: to bring the participant to the study page for completing questionnaires.

  • Bookmark: to bookmark a webpage.

  • Snip: to highlight a text passage on a webpage and collect a snippet. While saving a snippet, one also has the ability to rate that snippet on a scale of 1-5 for its usefulness for the given topic.

In addition, the add-on provided a sidebar in the browser that included a chat-box as well as the bookmarks, snippets, and queries collected or issued by the current participant for the given task. The chat-box was used to instruct the participant about starting/ending a task and filling in questionnaires.

We also removed the search toolbar from Firefox, so that the participants would have to visit a search engine's website for running a query. This was done to have better control over the three conditions defined later.

The browser was set to remove history data after each session, ensuring that a participant's searching and browsing were not affected by the previous participant's actions. Coagmento, while providing various tools to help the participants in their information seeking, also recorded various actions, such as webpages visited and searches performed. In addition to logging with Coagmento, we also used a Tobii eye-tracker3 for recording screen videos with eye-gazing and eye-fixation information.

Participants

Participants were recruited from Rutgers University by sending experiment advertisements to email lists and posting to bulletin boards. A total of 36 participants were recruited, whose ages ranged from 18 to 36. Of the 36 participants, 28 were female and 8 were male. Four of them were graduate students and the rest were undergraduates. They all indicated their search experience to be at moderate to high levels. 32 out of 36 (88.89%) declared Google to be their primary search engine.

The participants were given $10 for their participation. To encourage them to take the experiment seriously, there was a prize (an iPod Shuffle) for the best performing participant. The participants were told that their performance would be measured by the amount as well as the quality of the information they collected.

Procedures

The participants came to our usability laboratory one at a time and were randomly assigned to one of the three conditions. This assignment was done using a uniform randomizer, resulting in the same number of participants in each condition. Once the participants signed the informed consent form, they were briefed about the study, shown how to use the system for collecting snippets and using them, and then left in the room with the study computer to surf the web freely, visiting any sources they wished, and work on their tasks.4 A typical session was about 40 minutes, with the following steps:

  • Greetings and consent form (2 minutes)

  • System demonstration (3 minutes)

  • Pre-task questionnaires (2 minutes)

  • Task (30 minutes)

  • Post-task questionnaires (2 minutes)

System conditions

Since we were interested in studying two distinct search features (dynamic query suggestions and dynamic results), we defined three different experimental conditions for the study under a between-subjects design. Following is the description of the three conditions.

C1Standard: query-suggestions off, Google Instant off

In this condition, we disabled JavaScript on three major search engines (Bing, Google, and Yahoo!), and gave the participants plain versions of search engines that offered no query suggestions, and no support for dynamic results.

C2Suggestion: query-suggestions on, Google Instant off

In this condition, we enabled JavaScript support in the browser, so that the participants had search engines with query suggestions, but turned Google Instant off using the settings at google.com.

C3Instant: query-suggestions on, Google Instant on

This condition featured query term suggestion as well as dynamic results changes by enabling Google Instant.

Note that due to the fact that both dynamic query suggestions and Google Instant are implemented using the same Asynchronous JavaScript And XML (AJAX) technologies, it was not possible to create a condition with Google Instant on and dynamic query suggestions off.

Task

Typical IR evaluations measure the effectiveness of the search and retrieval process only, but we decided to evaluate search features in the larger context of a work task. We picked a topic from the INEX 2006 Interactive Track topics relating to different revolutions in history.5 We chose this topic since it had the highest number of aspects (four) among the topics defined for that year's INEX. We conducted a couple of pilot runs with this task-topic combination and found it to be suitable for our study.6

Specific language for the task description provided to the participants is given below.

As a history buff, you have heard of the quiet revolution, the peaceful revolution, and the velvet revolution. For a skill-testing question to win an iPod you have been asked how they differ from the April 19th revolution.

Search and visit any websites that help you find information on this topic. As you find useful information, highlight and save relevant snippets. Make sure you also rate a snippet to help you in ranking them based on their quality and usefulness.

You have 25 minutes to read this task description and collect information (snippets) from any online sources, using any search engines you wish.

Note that while C3Instant was defined based on Google Instant, we decided to let our participants use any search engine they wanted, so as not to force them into using Google only. We also had a justification for this design decision from our past studies of a similar nature, which revealed that student participants consistently chose Google at least 85% of the time (Shah & González-Ibáñez, 2011).

RESULTS

The data for the reported study consisted of log data, screen videos, and eye tracking data. We employed both qualitative and quantitative methods for analysis, as presented below.

Search Behaviors (RQ1)

Table 1 provides a summary of the participants' search behaviors, expressed in terms of running searches, visiting webpages, and collecting information (bookmarks or snippets). Note that we performed pairwise comparisons between the three conditions only when the one-way ANOVA was significant.

Table 1. Summary of various physical actions. Values are mean (SD) per condition. Bold values indicate significant difference at p<0.05 using one-way ANOVA.

                                      C1Standard     C2Suggestion   C3Instant      F (p)
Number of pages                       33.25 (14.10)  33.25 (12.37)  49.67 (10.82)  6.90 (0.003)
Number of content pages               21.67 (12.32)  20.33 (7.28)   27.17 (6.24)   1.94 (0.160)
Number of SERPs                       11.58 (6.22)   12.92 (6.97)   22.50 (7.55)   8.85 (0.001)
Number of queries                     10.00 (5.61)   10.25 (5.46)   9.58 (1.93)    0.063 (0.940)
Number of unique queries              8.67 (5.61)    8.75 (4.02)    7.08 (1.93)    0.61 (0.550)
Number of search sources (websites)   7.50 (3.58)    8.00 (3.57)    10.50 (3.43)   2.50 (0.100)
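The statistical testing behind Table 1 can be reproduced with standard tools. Below is a minimal sketch: the per-participant counts are synthetic placeholders generated to loosely match the "Number of SERPs" row, not the study data, and the use of Tukey HSD for the pairwise follow-up mirrors the post-hoc test the paper reports for query length, so its use here is an assumption.

```python
# Minimal sketch of the statistical testing behind Table 1: one-way ANOVA
# across conditions, followed by pairwise comparisons when significant.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
# Synthetic per-participant SERP counts (12 per condition), NOT the study data;
# means/SDs loosely follow the "Number of SERPs" row of Table 1.
c1 = rng.normal(11.6, 6.2, 12)
c2 = rng.normal(12.9, 7.0, 12)
c3 = rng.normal(22.5, 7.6, 12)

f_stat, p_val = f_oneway(c1, c2, c3)           # one-way ANOVA
print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.3f}")

if p_val < 0.05:                               # pairwise comparisons only if significant
    values = np.concatenate([c1, c2, c3])
    groups = ["C1"] * 12 + ["C2"] * 12 + ["C3"] * 12
    print(pairwise_tukeyhsd(values, groups).summary())
```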

Before we discuss the results in this table, it is important to clarify various terms used in it. Pages or webpages refer to pages that one visits by typing a URL or clicking on a link. Some of these webpages are search engine homepages (e.g., google.com), some are SERPs, and the rest are content pages. Search sources are defined as the websites one visited, which are different from the webpages, since one could visit multiple pages under the same site. For instance, Wikipedia has several pages about various revolutions, and if one visits 10 of those, we would consider the number of webpages to be 10 and the number of sources/websites to be 1.

As evident from Table 1, those in C3Instant were shown more SERPs and visited more webpages than those in C1Standard or C2Suggestion. This may be an artifact of Google Instant providing and revising SERPs (a result page with up to 10 results) as soon as one starts typing a query. However, no significant differences were found in the number of queries tried or content pages visited. In other words, no clear difference seemed to exist among the three conditions for the remaining physical actions.

At this point, it is important to bring up the details of how SERPs were recorded, especially for C3Instant. As described before, we used our own logger to record various activities of interest during the study. The logger runs on a one-second clock, which means it logs activities happening at the turn of a second and ignores those within a second. Thus, it does not capture every keystroke. According to the literature, an average computer user types 50 words per minute (Ostrach, 1997). Also, Google Instant does not provide a new SERP for every keystroke. Combining these two facts, we can safely assume that our logger did not create a SERP entry for each keystroke. To verify this, we watched all of the screen videos recorded for C3Instant. We found that either a participant typed the whole query right away, generating one or two SERPs in 3-5 seconds, or they made a few pauses. In the latter case, Google Instant generated a larger number of SERPs, and our logger captured most of them as legitimate SERPs. However, the fact that the participant paused indicates that they may have looked at the results intermittently. We also did our own sanity check on the reasonableness of this approach. For the query "quiet revolution" (very common in our study), we get one SERP if the query is typed quickly, and two SERPs if there is a pause after "quiet". This is consistent with what the logger recorded. Regarding the number of pages, if a user stayed on a page for less than a second, it may not have been recorded by our logger.
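The one-second sampling behavior described above can be illustrated with a small sketch. This is not Coagmento's actual code; the get_active_url, log, and should_stop hooks are hypothetical stand-ins for the add-on's internals.

```python
import time

def poll_active_page(get_active_url, log, should_stop):
    """Illustration of a 1 Hz logger: the active page is sampled once per
    second, so pages viewed for under a second (and individual keystrokes,
    including many Instant-generated SERP refreshes) can be missed.
    get_active_url, log, and should_stop are hypothetical hooks, not
    Coagmento's actual interfaces."""
    last_url = None
    while not should_stop():
        url = get_active_url()
        if url != last_url:            # record only when the visible page changes
            log(time.time(), url)
            last_url = url
        time.sleep(1)                  # one-second clock
```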

Search Process Analysis (RQ2)

Even though the participants were allowed to use any search engine they wanted, about 95% of the time they chose Google. This was found to be consistent across all the participants in each condition. Since C3Instant is based on the Google Instant service, we decided to look at only those search sessions that were initiated with Google for each condition to keep the comparison fair.

Figure 2.

Video coding interface.

To gain a deeper understanding of the search process in each condition, we watched the screen videos of every participant and coded various states. The search phase is defined as the part that begins with one visiting a search engine homepage (here, google.com) and ends with one leaving the SERP by either selecting a link or switching to an existing content page in a different tab or window. The search portion of user activity was then segmented into three states, namely: Google homepage, SERP, and Query Reformulation. In addition, we coded the possible transitions derived from these states, such as: switch to a different tab, select a suggested query, select a result from the SERP, move to a different SERP, and stop typing a query due to the SERP provided through Google Instant, to name a few. We performed the video segmentation using Avidemux (http://www.avidemux.org). Figure 2 shows the interface that we used for watching participants' screen videos, overlaid with eye-tracking information, and coding their search states. Note that the eye-tracking data was used for segmentation purposes only, and not for other forms of analyses that are done with such data, such as studying users' gaze and fixation behaviors.

Following are the states that we coded:

  • Google: when the participant visits google.com.

  • Start typing: when the participant starts typing a query in the search box.

  • Take a suggested query: selecting a query that is suggested by Google (only in C2Suggestion and C3Instant).

  • Finish query: when the query is finished. In C1Standard and C2Suggestion, this happened when a participant hit the “Enter” key or clicked on the “Search” button, whereas in C3Instant, doing so was optional. In other words, those with the Google Instant feature (C3Instant) could finish a query without pressing Enter or clicking a button. Since such information for C3Instant cannot be easily identified from the logs, we used screen videos with eye-fixation overlay exclusively.

  • SERP: search engine results page is displayed.

  • Click on a result: selecting a result from a SERP.

  • Content page: visiting a content page that is opened by clicking on a result in the SERP or typing the URL, or is already open in another tab/window.

  • Collect useful information: bookmarking or collecting a snippet while in a content page.

Using the codes generated for each participant, we created a state-transition diagram for each condition, depicted in Figures 3 to 5. Here, the ovals represent the states described above, the arrows indicate observed transitions, and the numbers on the arrows give transition probabilities. For instance, in C1Standard (Figure 3), after seeing a SERP, 45% of the time the participants clicked on a result, 29% of the time they reformulated their queries, 18% of the time they went to a content page, and 8% of the time they visited the next page of search results.
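The transition probabilities shown on the arrows can be derived by counting consecutive state pairs in the coded sessions. Below is a minimal sketch under the assumption that each coded session is available as an ordered list of the state labels defined above; the example sequence is made up.

```python
from collections import Counter, defaultdict

def transition_probabilities(sessions):
    """Estimate P(next state | current state) from coded state sequences.
    `sessions` is a list of sessions, each an ordered list of state labels."""
    counts = defaultdict(Counter)
    for states in sessions:
        for current, nxt in zip(states, states[1:]):
            counts[current][nxt] += 1
    return {state: {nxt: n / sum(c.values()) for nxt, n in c.items()}
            for state, c in counts.items()}

# A made-up coded session using the state labels listed above.
example = [["Google", "Start typing", "Finish query", "SERP",
            "Click on a result", "Content page", "Collect useful information"]]
print(transition_probabilities(example))
```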

Figure 3.

C1Standard's search process state transition diagram.

Figure 4.

C2Suggestion's search process state transition diagram.

Figure 5.

C3Instant's search process state transition diagram.

As can be seen, the transition patterns of the three conditions are in general similar to each other, except that C2Suggestion and C3Instant showed one more state (i.e., take a suggested query). The transition probabilities among states are also in general comparable among the three conditions, except for those with the new state in C2Suggestion and C3Instant. Due to the inherent mismatch in search behaviors exhibited among the three conditions, it would be difficult to compare these state-transition diagrams with statistical methods. Nevertheless, there are a couple of observations one could make from these diagrams:

  • In general, searchers were almost five times more likely to reformulate their queries than to visit the next page of results. In other words, if they did not find useful information in the first 10 results, they would rather change their search request than see the next 10 results.

  • Using eye-tracking data, we learned that users with Google Instant available stopped typing their queries about half the time and looked at the search results. This indicates that running partial queries did lead participants to switch their attention from creating a query to viewing the results.

Since no apparent differences other than the above mentioned were found by simply looking at state-transition diagrams for each of the conditions, we decided to conduct a deeper analysis of the amount of time they spent on critical states, and the queries constructed by the participants in each condition.

Table 2. Summary of average time spent (in seconds) at three search states. Values are mean (SD) per condition.

                      C1Standard    C2Suggestion   C3Instant
Google homepage       8.03 (4.86)   12.71 (9.26)   4.76 (4.75)
Query reformulating   5.09 (3.61)   8.28 (7.58)    5.77 (3.37)
SERP                  7.68 (5.82)   7.90 (8.72)    6.06 (5.07)
Table 3. Pairwise comparison between conditions of time spent (in seconds) at three search states. Bold values indicate significant difference at p<0.05. Bold and italic values indicate significant difference at p<0.01.

Time Spent (RQ3)

We calculated the time users spent in each state using our segmented data. The results are reported in Table 2. They show that those in C2Suggestion spent more time on the Google homepage than those in C1Standard or C3Instant. In addition, C1Standard participants also spent more time on google.com than those in C3Instant (Table 3). It is also observed from these results that those in C1Standard and C3Instant spent less time than those in C2Suggestion when it came to reformulating queries. With regard to the time users spent on SERPs, it was found that those in C1Standard and C2Suggestion spent more time than those in C3Instant.
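A sketch of how per-state dwell times like those in Table 2 could be computed from the video segmentation is given below, assuming each coded segment carries a state label with start and end times in seconds; the segment values are made up.

```python
from collections import defaultdict

def time_per_state(segments):
    """Sum dwell time per state from (state, start_sec, end_sec) segments
    produced by the video coding; the segments below are made up."""
    totals = defaultdict(float)
    for state, start, end in segments:
        totals[state] += end - start
    return dict(totals)

segments = [("Google homepage", 0.0, 7.5),
            ("Query reformulating", 7.5, 13.0),
            ("SERP", 13.0, 21.0)]
print(time_per_state(segments))   # {'Google homepage': 7.5, 'Query reformulating': 5.5, 'SERP': 8.0}
```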

Query Length (RQ4)

Query length is defined as the number of terms in a query. We looked at four types of query length, which varied along two dimensions: whether the unit of analysis was the individual query or the user/session (task), and whether all queries or only unique queries (duplicate queries within a user/session removed) were counted. These four types, illustrated in the sketch after the list below, were as follows:

  • query length per query level for all queries (query-all)

  • query length per query level for unique queries (query-unique)

  • query length per task level for all queries (task-all)

  • query length per task level for unique queries (task-unique)
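The bookkeeping behind these four measures is simple; the following is a minimal sketch, assuming per-session query lists are available (the sessions shown are made up, not the study data).

```python
from statistics import mean

def query_length_measures(sessions):
    """Compute the four query-length measures. `sessions` maps a
    participant/session id to the ordered list of query strings issued."""
    lengths = lambda qs: [len(q.split()) for q in qs]   # terms per query
    all_q, uniq_q, task_all, task_uniq = [], [], [], []
    for qs in sessions.values():
        uqs = list(dict.fromkeys(qs))                   # drop duplicate queries, keep order
        all_q += lengths(qs)
        uniq_q += lengths(uqs)
        task_all.append(mean(lengths(qs)))              # per-session (task-level) averages
        task_uniq.append(mean(lengths(uqs)))
    return {"query-all": mean(all_q), "query-unique": mean(uniq_q),
            "task-all": mean(task_all), "task-unique": mean(task_uniq)}

# Made-up sessions, not the study data.
print(query_length_measures({
    "p1": ["quiet revolution", "quiet revolution", "velvet revolution history"],
    "p2": ["april 19 revolution", "peaceful revolution"],
}))
```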

Table 4 summarizes the mean and standard deviations for the four types of query length for each condition. In each condition, the distribution of all four types of query length was normal, so one-way ANOVA was used to test the statistical differences of query length among the three condition groups.

Table 4. Summary of query lengths. Values are mean (SD) per condition. Bold values indicate significant difference at p<0.05.

               C1Standard    C2Suggestion   C3Instant     F (p)
Query-all      3.66 (2.05)   3.62 (2.33)    2.68 (0.83)   10.32 (.000)
Query-unique   3.80 (2.09)   3.80 (2.42)    2.80 (0.89)   7.78 (.001)
Task-all       3.25 (0.92)   3.32 (1.12)    2.70 (0.39)   1.81 (.180)
Task-unique    3.38 (0.88)   3.44 (1.20)    2.77 (0.39)   2.07 (.142)

As can be seen from the results, there were statistically significant differences among the three conditions in query length at the query level (query-all and query-unique), but not at the task level (task-all and task-unique). Post-hoc analysis using the Tukey test found that query length in C3Instant was significantly shorter than in C1Standard and C2Suggestion, while C1Standard and C2Suggestion did not differ from each other.

Query Concept Coverage (RQ5)

In addition to query length, we also looked at the number of concepts in queries, using a method similar to that used by White & Marchionini (2007) for measuring query quality. Although this does not necessarily capture query quality in terms of whether a query leads to better search results, it reflects the concept coverage of the queries.

All queries were judged for the number of concepts they contained, following a scheme that was generated for this particular task based on discussion among the study designers and experimenters. The rating scheme used a 5-point scale according to the number of concepts in the query that are related to the search task topics. The coding scheme is shown in Table 5, and an illustrative scoring sketch follows the table.

Table 5. Query concepts judging scheme (Score = query topic coverage score; Concepts = number of concepts in query).

Score   Concepts   Concept examples
1       0          Terms unrelated to search task concepts
2       1          Revolution
3       2          Revolution + quiet; or revolution + peaceful; or revolution + velvet; or revolution + April 19
4       3          Two of the revolutions mentioned above, or one revolution + another concept
5       3+         Any query with concepts of 3 and above, such as two of the revolutions mentioned above + differences (or others)
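To make the scheme concrete, the following is a rough automated approximation of Table 5. The actual judging was done manually by the assessors; the keyword list and the boundary between scores 4 and 5 here are our own assumptions.

```python
REVOLUTIONS = ("quiet", "peaceful", "velvet", "april 19")

def concept_coverage_score(query):
    """Rough automated approximation of the Table 5 scheme. The real judging
    was manual; the keyword list and the 4/5 boundary here are assumptions."""
    q = query.lower()
    concepts = sum(name in q for name in REVOLUTIONS)   # revolution names mentioned
    if "revolution" in q:
        concepts += 1                                    # the core 'revolution' concept
    if concepts == 0:
        return 1                                         # unrelated terms only
    return min(concepts + 1, 5)                          # 1 -> 2, 2 -> 3, 3 -> 4, more -> 5

print(concept_coverage_score("difference between quiet and velvet revolution"))  # 4
```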

Two assessors (two of the authors) judged a sample of about 10% of all queries (31 out of 299), and the inter-coder reliability was quite high, with a Kappa value of 0.812 (p<.01). One of the authors then judged all remaining queries. Table 6 summarizes query concept coverage for each condition. As seen there, no significant differences were found in the average number of concepts in queries among the three conditions.
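The reliability check can be reproduced with a standard Cohen's kappa implementation. A minimal sketch, assuming the two assessors' scores for the sampled queries are available as parallel lists (the values below are placeholders, not the actual judgments):

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder scores (1-5) from the two assessors on the same query sample,
# NOT the study judgments.
assessor_a = [2, 3, 4, 2, 5, 3, 1, 4, 2, 3]
assessor_b = [2, 3, 4, 2, 5, 3, 2, 4, 2, 3]

print(cohen_kappa_score(assessor_a, assessor_b))
```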

Table 6. Summary of query concept coverage. Values are mean (SD) per condition.

              C1Standard    C2Suggestion   C3Instant     F (p)
Avg. rating   2.80 (1.01)   3.28 (0.41)    2.72 (1.10)   1.48 (0.240)

DISCUSSION

Following the presentation of our results in the previous section, we now discuss the findings in relation to our research questions about the effects of dynamic search features on users' search behaviors.

Physical Actions

Our results (see Table 7 for a summary) showed that the three system conditions did not exhibit differences in users' physical actions, except that C3Instant led to more pages and SERPs being visited. The larger number of SERPs in C3Instant could be a result of the system returning more SERPs, given that it kept refreshing the search result lists while the users typed their queries. However, this does not mean that the users indeed read all the SERPs, especially those appearing while they were still typing query terms.

Table 7. Summary of results.

Time Spent in Search Phases (RQ3)

The fact that participants in C3Instant spent less time on the Google homepage than those in C1Standard and C2Suggestion is not surprising, since the moment they started typing their queries, they were taken to a SERP by the Google Instant service. We think the finding that those in C2Suggestion spent more time there than those in C1Standard could be because, with the dynamic query suggestion feature on, participants paused to look at the suggested queries, and this took longer than simply typing out their own queries. This could indicate that participants' attention was indeed drawn by the dynamic query suggestions, although those suggestions did not lead to more topic coverage (as discussed above).

The participants in C2Suggestion were also found to take longer in query reformulation than those in C1Standard, possibly due to paying attention to the query suggestions being offered. In short, those in C3Instant do appear to be faster during the search stage (being at the Google homepage, typing a query, viewing SERPs) of their task.

Queries

One may expect that the system's automatic query suggestion/completion feature would help users formulate better queries. Our results show that query length, when analyzed at the query level (not the task level), differed between conditions. Meanwhile, query topic coverage did not vary between conditions.

A further examination of the queries found that they were mostly the names of the different revolutions described in the task. System-suggested queries did not expand the number of concepts about these revolutions in this case. Query length in C3Instant was shorter than in C1Standard and C2Suggestion, while C1Standard and C2Suggestion did not differ. This is reasonable considering that, at times, C3Instant provided SERPs while users paused in issuing their queries, and those intermediate SERPs were based on incomplete queries, which would be shorter than those completed by the users. As reported, C3Instant participants also abandoned their queries and moved their attention to a SERP about half the time (the transition from 'Start typing' to 'SERP'), possibly resulting in shorter queries. In summary, our findings about queries indicate that automatic query suggestions did not lead to longer queries or more concepts in the queries, but Google Instant led to shorter queries at the query level.

In summary, although our study was limited to a single task, the findings suggest that for an information seeking activity that is not as simple as running a single parsimonious query, the query suggestion feature did not change users' behavioral transition pattern in the search process, the number of queries, search sources, or content pages viewed, or query length and query concept coverage, although it did affect the time users spent on the Google homepage and in reformulating queries (longer in C2Suggestion than in C1Standard). The dynamic results feature did not change the other examined behaviors beyond exposing users to more SERPs, shorter queries, and less time spent on the Google homepage and SERPs.

We mainly examined users' behaviors in the current paper, and the results showed that the dynamic search features did not change much, especially the general transition pattern, except for a few behaviors. Future work will examine whether these features affect users' task performance in terms of effectiveness and efficiency (e.g., total task completion time and the number of useful pages found), as well as how users perceive these features.

CONCLUSIONS

We explored the impacts of front-end search system features with dynamic interactions on users' behaviors. Through our study, we learned that systems with dynamic query suggestion or dynamic results features may not really change users' general behavioral transition pattern in the search process, save effort in terms of the number of search sources, queries, or content pages viewed, or lead to queries with more concept coverage than systems without those features. The main difference we found is that with dynamic search results, users spent less time querying and were exposed to more SERPs and webpages in general. This could be helpful for the querying process, but may not create a significant difference in general search behavior in the context of a task.

A valid question to ask here is what these results mean for interactive IR and human-computer interaction (HCI) research. Despite the non-significant influences of the dynamic front-end features on users' overall search behaviors, our results do show that dynamic search interface features expose information seekers to more information relating to their search requests, by suggesting queries (C2Suggestion and C3Instant) and by providing more search result pages through constantly updated result sets (C3Instant). The users spent more time looking at the suggestions offered, which may not help in a time-limited information retrieval task, but could be valuable for learning and sense-making. Having dynamic search features did not result in saving time or in constructing queries with more concepts, but the results could be different for simple search tasks (e.g., known-item retrieval), or for unfamiliar tasks/topics, where users have difficulty formulating search requests. While the study reported here was exploratory in nature, the method and results provided could help in future research on dynamic search suggestions, dynamic search results, and other related front-end features of interactive IR systems.

Acknowledgements

We are thankful to Abhijna Baddi and Kanan Parikh for their invaluable efforts in helping to run the user study described here.

Footnotes

  1. http://www.google.com/instant/

  2. http://trec.nist.gov/

  3. http://www.tobii.com/en/group/about-tobii/eye-tracking-by-tobii/

  4. Despite this freedom, Google was used as the choice of search engine 95% of the time, as described later in the Search Process Analysis section.

  5. http://www.iva.dk/binaries/041_Malik-Tombros-Larsen-INEX2006-iTrack-LNCS4518.pdf

  6. We also tried a few other topics during these pilots, but rejected them since they were found to be either too easy or too narrowly defined.
