Evaluating a metadata-based term suggestion interface for PubMed with real users with real requests
This paper reports results of an evaluation study of MAP (Multi-faceted Access to PubMed), a metadata induced query suggestion interface for PubMed bibliographic search.
A novel evaluation methodology was used to address the challenges involved in evaluating an IIR (Interactive Information Retrieval) system such as the MAP interface. The most significant aspect of this methodology is that, instead of using assigned tasks common in traditional IR evaluation, it asks real users with real search requests to search with real systems in an experimental setting. Several performance measures were created, on the basis of which MAP was compared with the PubMed baseline. MAP was shown to perform better on several of these measures, especially when the search requests had not been attempted before.
The finding pointed to search characteristics as an important intervening variable in IIR evaluation. The advantages of and potential threats to our methodology were also discussed.
Arguably one of the most difficult tasks in IR is the representation of users' requests (Belkin, 1980). Search engine users are known to submit very short and ambiguous queries (Jansen, Spink & Saracevic, 2000). The shallowness of the representation of user requests stands in direct contrast to the thoroughness of document representation. This disparity often results in unmanageable search results for users of a heterogeneous and massive bibliographic database such as PubMed. Several system features of PubMed (e.g. default “explode” function and free-text indexing in title and abstract fields) that aim at facilitating end-user searching tend to increase indexing exhaustivity and therefore favor search recall at the expense of precision. Faced with the unmanageable number of returned results, users are often left with few options but to hastily browse the first few returned pages.
The enormous size of the returned set creates at least two barriers to a successful user-system communication. Firstly, there is no telling whether there might be documents relevant to users' needs buried deep down in the returned set. Secondly, the skimming of the surface of the returned set gives inadequate feedback to meaningfully help users' judgment about the query performance.
The breakdown of user-system communication is especially severe in search situations where users are searching for unfamiliar topics or only have vague information needs. Without timely feedback from the system, users are unlikely to be able to interact effectively with the system, and refine their searches.
To address the communication breakdown, an interface was created for PubMed search that utilizes MeSH co-occurrence information to provide support for users' query construction. The article reports the functionality of the interface and results of an experiment designed to assess its effectiveness. A novel approach was proposed to the evaluation of interactive information retrieval (IIR) systems such as ours. In the following sections we first elaborate on the functionality of the interface, followed by the research design and procedures, and conclude with the results of the quantitative analyses.
To alleviate the mismatch of representational exhaustivity at the two ends of the IR process, various techniques have been proposed to expand users' queries, either automatically or interactively. An approach that has recently gained wide adoption is “metadata guided search”, which involves dynamically extracting metadata from the initial returned set for users to augment their queries (Hearst, Elliott, English, Sinha, Swearingen, & Yee, 2002; Lin, 1999; Pollitt, 1998). For literature searches in health sciences specifically, several attempts have been made to exploit term co-occurrence relationships for term suggestion purposes, with terms extracted either from a controlled vocabulary (Doms & Schroeder, 2005) or from free text in the article abstracts (Goetz & von der Lieth, 2005; Perez-Iratxeta, Bork, & Andrade, 2001; Plikus, Zhang & Chuong, 2006).
Multi-faceted Access to PubMed
Our approach involves extracting faceted metadata to guide user exploration of the information space, similar to that proposed by Hearst et al. (2002), with a few modifications made for the specific circumstances of PubMed search. An interface (MAP) was built that delivers the submitted query to PubMed and generates MeSH terms for users to refine their queries. For a massive database like PubMed, it becomes less feasible to adopt the browse-and-select method proposed in Hearst's Flamenco system. Instead of relying solely on browsing, the proposed interface preserves the search mode of access while providing browsable faceted categories in support of searching.
At the implementation level, dynamically extracting metadata at search time could be problematic, as the computing time needed might cause a delay. Therefore, instead of generating a term co-occurrence matrix on the fly from the search results, a database of term co-occurrence data was built beforehand. The term co-occurrence table can be updated regularly to better represent the conceptual relationships in the published literature. The database included descriptors from the MeSH term, author and journal title fields extracted from all the PubMed bibliographic records in 2006 and 2007.
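To make the precomputation step concrete, the following is a minimal sketch of how a co-occurrence table of this kind might be built offline. It is not the MAP implementation: the function name `build_cooccurrence`, the in-memory dictionary layout, and the three toy records are assumptions made for illustration only.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(records):
    """Count how often each pair of descriptors appears in the same record.

    `records` is an iterable of descriptor lists, one list per
    bibliographic record (a hypothetical input format).
    """
    cooc = defaultdict(Counter)
    for descriptors in records:
        # Count each pair once per record, in both directions.
        for a, b in combinations(sorted(set(descriptors)), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc


# Toy records, invented for illustration (not data from the study).
toy_records = [
    ["Morphine", "Pain", "Drug Tolerance"],
    ["Morphine", "Pain"],
    ["Cartilage", "Tissue Engineering"],
]
table = build_cooccurrence(toy_records)
```

Because the table is built ahead of time, a lookup at search time reduces to reading one row, which is the delay-avoiding property the paragraph above describes.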
As the user submits her/his query through the MAP interface, the top two hundred terms that co-occur most frequently with the query term are identified and displayed for browsing. As users are likely to submit non-MeSH terms as their queries, users' queries first have to be mapped to an appropriate MeSH term in the prebuilt database. This is done by utilizing PubMed's automatic translation table function (see more on automatic term mapping in “PubMed Help”). Thus, as the user submits her/his query, the MeSH terms interpreted by PubMed's automatic translation table are also retrieved in order to identify terms co-occurring with the initial term in our local database. In cases where no MeSH term is returned by the translation table, another search mechanism is activated in which descriptors are extracted directly from the top two hundred returned Medline records.
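The lookup flow described above, including the fallback when no MeSH term is mapped, might be sketched as follows. The callables `translate` and `fetch_descriptors` are hypothetical stand-ins for PubMed's automatic term mapping and for retrieving the top returned Medline records; no real PubMed API is shown. Summing counts across several mapped terms and ranking fallback descriptors by raw frequency are also assumptions.

```python
from collections import Counter


def suggest_terms(query, translate, cooc, fetch_descriptors, limit=200):
    """Return up to `limit` suggested terms for a free-text query."""
    mesh_terms = translate(query)  # hypothetical stand-in for term mapping
    counts = Counter()
    if mesh_terms:
        # Normal path: look the mapped terms up in the prebuilt local table.
        for term in mesh_terms:
            counts.update(cooc.get(term, {}))
        for term in mesh_terms:
            counts.pop(term, None)  # do not suggest the query terms themselves
    else:
        # Fallback path: count descriptors in the top returned records.
        for descriptors in fetch_descriptors(query, limit):
            counts.update(set(descriptors))
    return [term for term, _ in counts.most_common(limit)]


# Toy stand-ins (assumptions, not real PubMed calls or study data).
toy_cooc = {"Morphine": {"Pain": 2, "Drug Tolerance": 1}}
toy_translate = lambda q: ["Morphine"] if q == "morphine" else []
toy_fetch = lambda q, n: [["Flame Retardants", "Humans"], ["Flame Retardants"]]

mapped = suggest_terms("morphine", toy_translate, toy_cooc, toy_fetch)
fallback = suggest_terms("pbde", toy_translate, toy_cooc, toy_fetch)
```

Passing the two services in as arguments keeps the sketch self-contained and makes the two code paths easy to exercise separately.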
Two approaches to term display have been attempted: one simply ranks the terms without categorization (List); the other organizes the terms by the MeSH top-level categories (Faceted-category). In both methods, terms are ranked by their co-occurrence frequency multiplied by their inverse term frequency.
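A small sketch of the ranking and the two display modes follows. The text does not give the exact form of the inverse term frequency, so the logarithmic form `log(N / tf)` used here is an assumption borrowed from the familiar IDF weighting; the function names, the facet labels, and all counts are likewise invented for illustration.

```python
import math
from collections import defaultdict


def rank_terms(cooc_counts, term_freq, n_records):
    """Rank terms by co-occurrence count times an assumed log inverse frequency."""
    scores = {
        term: count * math.log(n_records / term_freq[term])
        for term, count in cooc_counts.items()
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda term: (-scores[term], term))


def group_by_facet(ranked, facet_of):
    """Faceted-category display: keep rank order inside each category."""
    facets = defaultdict(list)
    for term in ranked:
        facets[facet_of.get(term, "Other")].append(term)
    return dict(facets)


# Toy numbers and facet labels (assumptions, not data from the study).
toy_counts = {"Humans": 50, "Pain": 20, "Drug Tolerance": 10}
toy_freq = {"Humans": 900, "Pain": 100, "Drug Tolerance": 20}
toy_facets = {"Humans": "facet X", "Pain": "facet X", "Drug Tolerance": "facet Y"}

as_list = rank_terms(toy_counts, toy_freq, n_records=1000)
as_facets = group_by_facet(as_list, toy_facets)
```

In the toy data the very common term "Humans" co-occurs most often but is ranked last, which illustrates why the raw co-occurrence frequency is discounted by how common a term is overall.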
An empirical study was conducted to investigate the usefulness of the faceted-category version of the interface. Specifically, we would like to know, firstly, how users with genuine information needs interact with the proposed interface and, secondly, whether the MAP interface performs favorably compared to the regular PubMed interface. Of particular interest is under what search situations MAP is most effective.
Evaluation of IIR systems
The evaluation of an IIR (Interactive IR) system such as MAP poses a serious challenge to evaluation methodology. In traditional IR evaluation, other than the system components being compared, searchers, search topics, and their interaction effects with the systems are regularly treated as random variance that the experimenters strive to systematically control and minimize. The “Laboratory Model” (Kekäläinen, & Järvelin, 2002) of evaluation is very efficient for comparing the effectiveness of algorithms. However, it becomes inadequate for today's interactive systems, whose effectiveness depends largely on users' active engagement (Belkin et al. 2004). The TREC interactive track represents an early effort to include human subjects in the modeling of IR performance (Dumais, & Belkin, 2005). The inclusion of users in the loop, however, also increases the difficulty of experimental design and analysis. It has been shown in TREC interactive track data that the “topic effect” accounts for the greatest variance in models that include searchers, search topics, systems and their interactions. To make the main system effects comparable, a replicated Latin square design has been adopted in which all the treatment levels (i.e. different systems) have an equal chance of being “crossed” by the same searchers and search topics. Yet the threat of topic-system interactions remains. In the non-interactive test environment, the issue of topic effects and topic-system interactions biasing the system comparison has been addressed by averaging performance criteria over a sufficient number of topics (Lagergren & Over, 1998), which is infeasible where human searchers are involved.
Another inherent constraint in the traditional IR evaluation paradigm is that the systems compared are conceptualized as general-purpose tools, without considering for what kinds of search requests they might be more effective. Therefore the assigned tasks are created mostly in an ad hoc manner, without theorizing task characteristics and how these characteristics might interact with the system features.
To better understand how real users with real search requests might interact with MAP, and its effectiveness under different search situations, a novel approach to IIR evaluation was adopted in our study, most notably the sampling of real users' genuine search requests. Instead of assigning the participants a uniform set of search requests, it was decided to allow the participants to conduct their own search requests in a controlled experimental setting. This was done for the following reasons. Firstly, it was feared that, had we used pre-constructed requests, the participants would simply grab terms from the task narrative as query terms, which would render the interface, designed as it is to facilitate query construction, less useful, if not entirely useless. Secondly, as pointed out earlier, one of our research questions concerns the usefulness of the interface under different search situations. To answer it using pre-constructed requests entails operationalizing search characteristics with topic narratives, which is difficult to carry out in a highly specialized domain such as health sciences without help from domain experts. Asking the participants to characterize their search requests with chosen attributes, such as domain familiarity and whether it is a new or revisited search, therefore affords us a rare opportunity to investigate the interactions between these attributes and the interface on various performance criteria. Specifically, it was hypothesized that the MAP interface, because of the vocabulary support it provides, is more effective when the users are new to a research area and lack the necessary domain knowledge and terminology to conduct effective searches.
The decision to use real users searching for real information needs on real systems entails several thorny methodological issues that need to be addressed. First of all, without a set of predefined tasks we do not have the benefit of objective relevance judgments that serve as the benchmark for traditional IR evaluations. Therefore it is important to come up with valid performance criteria, other than recall and precision, on which system performance can be compared. Secondly, the use of real user requests poses further challenges to creating a research design capable of controlling the confounding factors.
Table 1. Graphic representation of the experimental design
In our design, participants were asked to search their requests with both interfaces (the regular PubMed interface and MAP), which makes it a repeated measure design where each request serves as its own control. The repeated measure has the advantage of reducing the error term thereby giving more power to the statistical test. However, it also comes with its own risks, most significantly carry-over effects that might confound the results. To control for possible carry-over effect, the requests were randomly assigned to alternate treatment order, so that any given request would have an equal chance of being searched first with either interface.
Table 1 shows the eventual mixed factorial design adopted in this study, with the interface as the within-subject factor and the type of search request as the between-subject factor. Participants were randomly assigned to one of the four groups, in which they would search their own requests first on either of the interfaces and then on the other (See Table 1). Notice that the treatments for the same request were never received in immediate sequence; this was done in the hope of further minimizing the carry-over effect, as there was always a “wash-out” period between the two treatments of the same request.
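The counterbalancing logic can be illustrated with a short sketch. Since the contents of Table 1 are not reproduced here, the composition of the four groups is an assumption: the sketch simply takes the four combinations of which interface comes first for each of a participant's two requests, and interleaves the requests so that the same request is never searched in two consecutive sessions. All names are hypothetical.

```python
import random

INTERFACES = ("PubMed", "MAP")
OTHER = {"PubMed": "MAP", "MAP": "PubMed"}


def session_schedule(first_for_r1, first_for_r2):
    """Four sessions; the two searches of a request are never adjacent."""
    return [
        ("request 1", first_for_r1),
        ("request 2", first_for_r2),
        ("request 1", OTHER[first_for_r1]),
        ("request 2", OTHER[first_for_r2]),
    ]


def assign_orders(participants, seed=None):
    """Randomly spread participants evenly over the four assumed orders."""
    rng = random.Random(seed)
    orders = [(a, b) for a in INTERFACES for b in INTERFACES]
    shuffled = list(participants)
    rng.shuffle(shuffled)
    return {
        person: session_schedule(*orders[i % len(orders)])
        for i, person in enumerate(shuffled)
    }


# 44 hypothetical participant identifiers, matching the sample size only.
plan = assign_orders(range(1, 45), seed=0)
```

Each schedule uses both interfaces for both requests, so every request serves as its own control while the order of exposure varies across participants.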
It is recognized that, despite the alternation of the treatment order, a carry-over effect might still persist, as terms picked up from the preceding interface could unduly advantage the present one. The risk seems larger in groups 3 and 4, where the MAP interface was used first, as the participants were likely to be exposed to more terms in these sessions. It was therefore crucial to instruct the participants to start their search with the same query, and not to use terms they had learned in a previous session, when the request was searched the second time.
A total of 44 regular PubMed/Medline users were recruited to participate in the study, all of whom were graduate students in health and biomedical sciences. They were told to prepare two search requests of their own prior to coming to the laboratory.
The research procedure was as follows: upon their arrival, the participants were asked to give consent for the study and fill out a brief entry questionnaire in which data regarding their subject expertise, educational status and PubMed/Medline search experience were collected. An online video tutorial was given to explain how to operate the interface.
Before the search for each request began, a pre-search questionnaire was administered in which the participants were asked to write down a search statement for the request they were about to search. They were also asked to provide what they believed to be the most ideal query for the request at this point. Scaled data were also elicited on the characteristics of the request, such as their familiarity with the problem area, the thoroughness needed for the request, whether the request had been searched before, and so on.
As noted in the methodology section, each participant would search for two requests of their own alternately on two interfaces, resulting in four search sessions. As the search session began, the participants first input the original query they had given in the pre-search questionnaire and then were asked to retrieve ten useful records using the interface they were assigned in that particular session. After the initial input, they were allowed to revise their queries based on feedback from the interfaces and the search results. In other words, they were able to interact with the interfaces as they would normally do when conducting PubMed searches.
The participants were told they had 15 minutes for the task, but could stop whenever they had finished. After each session, they would again write down what they considered to be the best query terms at that time. They were then asked to evaluate the goodness of their pre- and post-search ideal queries on a 0-6 scale. This information allowed us to compare, for each search request, the participant's perceived goodness of his/her query before and after interacting with either interface. Scaled data were also elicited on satisfaction with the results and perceived usefulness of the interface in the post-search questionnaire. All their interactions with the interfaces were logged and recorded by screen capture software. Of particular interest are the number of iterations, the number of terms selected and the number of records viewed, which should give us a clearer idea of how users might interact differently with different interfaces (See Table 2 for a summary of the research procedures).
Table 2. Research procedure
After both search sessions for the same request had finished, the participants were asked to indicate the degree of relevance of each of the 20 records (10 from each session) on a 0-6 scale and single out records they had seen before. This allowed us to compare the quality of the search results retrieved by the two interfaces. Other performance criteria collected in the post-questionnaire include users' assessment of how well their requests were represented by their queries (“goodness” of the query), as well as their satisfaction with the search results. The participants were also asked to comment on the usefulness of the term suggestion function and in what search situations they thought the function might be helpful. In the following section the initial analysis of the results is reported. As the qualitative analysis of the query formulations is still underway, the results reported here are mostly derived from quantitative measurements.
A total of 88 (44 × 2) search requests were sampled, which resulted in 176 search sessions, as each request was searched with both MAP and the PubMed baseline. Among the 88 search requests, 60 had been searched by the participants before (revisited searches) and 28 were searched for the first time (new searches). Sampling genuine search requests affords us an opportunity to examine the relationships among different characteristics of information needs and how they might affect the effectiveness of the interface. Table 3 shows the correlations among different attributes of the search requests. Not surprisingly, the original goodness of the query was found to be highly correlated with the participant's familiarity with the search topic, which is also highly correlated with the completeness needed for the search.
Table 3. Search characteristics correlations (N=88)
As the interface was designed with a view to facilitating query reformulation, naturally we would like to see whether users' querying behaviors differ between the two interfaces. Specifically, four measures were compared to help us get a better sense of how querying behaviors differ: number of terms added and deleted per session, number of query submissions, and the similarity between users' original query and finalized query. The original-final queries similarity can be seen as an indication of the effect of the interfaces on users' queries. The higher the similarity, the less users' final queries diverge from the original. Jaccard's coefficient was used as the similarity measure:
\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|} \]
where A stands for the set of terms in the original query and B stands for the set of terms in the finalized query. A paired-samples t-test was conducted to evaluate whether original-final query similarity differed between the two interfaces. The results indicated that the mean similarity for PubMed (M =.54, SD =.31) was significantly greater than the mean similarity for MAP (M =.39, SD =.27), t (87) = 3.77, p <.001. In other words, when using MAP, users' final queries diverged from their original queries to a significantly larger degree than when using the PubMed baseline.
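As an illustration of the similarity measure, Jaccard's coefficient over two sets of query terms can be computed as in the sketch below. How queries were split into terms in the study is not stated, so treating each phrase as one term is an assumption, and the example term sets are toy values rather than logged queries.

```python
def jaccard(original_terms, final_terms):
    """Jaccard's coefficient: size of the intersection over size of the union."""
    a, b = set(original_terms), set(final_terms)
    if not a and not b:
        return 1.0  # convention chosen here: two empty queries are identical
    return len(a & b) / len(a | b)


# Toy term sets (assumed tokenization, not taken from the session logs).
original = {"cartilage repair", "tissue engineering"}
final = {"cartilage repair", "tissue engineering",
         "articular cartilage", "biomaterials"}

similarity = jaccard(original, final)
```

Here two of the four distinct terms are shared, so the similarity is 0.5; a user who kept the original query unchanged would score 1.0, and one who replaced every term would score 0.0.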
Significant differences were also found between PubMed and MAP with respect to the number of terms added, t (88) = 4.03, p <.001, and deleted, t (87) = 2.06, p <.05, during user interactions, as well as the number of terms in the finalized query, t (87) = 3.00, p <.005 (See Table 4).
The participants were also found to make more query submissions (i.e. each time the search button is clicked) when searching with MAP (M = 5.26, SD = 2.78) than with PubMed (M = 3.74, SD = 2.74); the difference is significant, t (87) = 20.81, p<.001. Despite the great disparity in the number of search iterations, an equivalent number of records was viewed with the two interfaces; this was measured by summing the number of surrogate records contained in the results pages (20 surrogate records per page) brought up by the user during a search session. This indicates that, on average, fewer records were viewed per submission when MAP was used.
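For readers unfamiliar with the paired-samples t-test used in these comparisons, the statistic can be sketched with the Python standard library alone. The function name and the five pairs of values are invented for illustration and are not data from the study; a statistics package would normally also supply the p-value.

```python
import math
from statistics import mean, stdev


def paired_t(x, y):
    """Return (t, degrees of freedom) for paired samples x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # t = mean difference divided by its standard error.
    # Note: raises ZeroDivisionError if all differences are identical.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Toy paired scores for the same five requests under two conditions.
toy_pubmed = [0.6, 0.5, 0.8, 0.4, 0.7]
toy_map = [0.4, 0.5, 0.5, 0.3, 0.4]

t_value, df = paired_t(toy_pubmed, toy_map)
```

Because each request is measured under both conditions, the test works on the per-request differences, which is what gives the repeated-measures design its additional statistical power. With n pairs the degrees of freedom are n - 1.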
Table 4. Comparison of querying behaviors
A one-way multivariate analysis of variance (MANOVA) was conducted to determine the effect of the two interfaces (PubMed baseline vs. MAP) on the four performance criteria variables: perceived usefulness, self-assessed goodness of the query, average relevance score of the ten records saved, and user satisfaction with the results. Significant differences were found between the interfaces on the performance measures, Wilks' Λ =.92, F(4, 167) = 3.79, p<.01.
As the MANOVA test was shown to be significant, individual ANOVAs were then conducted. In the following we report the results of factorial ANOVAs that test for the effects of two factors, namely the interfaces and the types of search requests, on the aforementioned performance criteria. Here search requests were classified by whether they were new or revisited searches. This is based on our hypothesis that the experimental interface is more effective for users who are new to a research area.
The means and standard deviations for participants' assessment of the goodness of queries as a function of the two factors are presented in Table 5. The MAP interface was shown to do significantly better in terms of the final “goodness” of the query, F(1,86) = 7.88, p=.006, especially in cases where the requests had not been searched before by the participants.
Table 5. “Goodness” of final query (n=88)
A similar pattern manifests itself with respect to the usefulness of the interface, F(1,86) = 13.13, p <.001. The MAP interface did better, and especially so when new searches were attempted (Table 6).
Table 6. Usefulness of interface (n= 88)
In terms of average relevance scores, MAP was found to be only slightly better than the PubMed baseline in both new and revisited searches (See Table 7), and the difference is not significant, F(1,86) = 2.38, p=.126. The type of search, however, has a slightly stronger impact on the quality of search results, F(1,86) = 3.84, p=.053. Revisited searches produced better search results than new searches.
Table 7. Average relevance scores (n=88)
Similarly, a 2 × 2 ANOVA was conducted to evaluate the effect of the interfaces and users' previous topic search experiences on their satisfaction with search results. The results of the ANOVA test indicated a non-significant effect for interfaces, F(1,82) = 1.82, p =.18, and a non-significant effect for search types, F(1,82) =.61, p=.44. In order to further investigate whether there was an interaction effect between interfaces and types of searches, we chose to ignore the interface main effect and instead examined the interface simple main effects, that is, the differences between interfaces for new and revisited searches separately. There were no significant differences in users' satisfaction with the results between interfaces for revisited searches, F(1,116) =.75, p =.39, but there was a significant difference between the two for new searches, F(1,52) = 5.71, p =.021, which indicates that MAP did better in this regard only when a new search was attempted.
Table 8. Satisfaction with results (n=84)
Users' satisfaction with terms suggested by MAP was also slightly higher in requests that have not been searched before (Table 9), though the difference is not significant.
Table 9. Satisfaction with suggested terms (n= 88)
The analyses that have been done so far seem to suggest that MAP was beneficial to users' searches, particularly when the searches had not been attempted before. The remaining question is: in what aspect does it help users' search? In the post-questionnaire, the participants were asked specifically about their perception of MAP with several Likert scale (0-6) type questions. They were asked, if the MAP interface helps their search at all, in what aspect does it help: 1. it helps me generate new ideas and concepts for future research, 2. it helps me clarify my search questions, 3. it helps by showing the structure of the literature in the database, and 4. it helps me manage the enormous amount of search results. ANOVA tests were conducted to evaluate whether the participants' answers differed significantly between old and new searches. Among the four dimensions, only one significant difference, “help me generate new concepts”, was found between new and revisited searches (Table 10).
Table 10. Help generate new concepts (n= 88)
Types of query reformulation
Detailed analysis of the characteristics of added and deleted terms during user interactions is currently underway and reserved for a future discussion. The analysis focuses on how the expanded terms are related to the users' original query. Here we present a few examples showing how the initial-expanded term relationships will be categorized. Our current coding scheme largely concerns two relationship categories: terms semantically related to the initial terms, and terms that represent new ideas not included in the initial queries. Semantically related terms are arranged in either a hierarchical (i.e. broader or narrower terms) or synonymous relationship with the initial terms (See Table 11). For example, in the search conducted by subject #19 with “tolerance of morphine dosage” as the original query, the added MeSH terms “pain” and “treatment outcome” were coded as new ideas generated by MAP, as they do not have a clear hierarchical or synonymous relationship with terms in the original query. Another example of new ideas generated can be found in the session conducted by subject #24 with MAP, in which two new MeSH terms, “environment exposure” and “flame retardants”, were added in the final query. An instance of a hierarchical relationship can be found in the search conducted by subject #48 with the original query “cartilage repair AND tissue engineering”. Two MeSH terms were added during user interaction with MAP: “articular cartilage” and “biomaterials”. “Articular cartilage” was identified as a narrower term for the initial term “cartilage”, and “biomaterials” as a broader term for “tissue engineering”. For the same search request, another term, “osteoarthritis”, was added in the session with PubMed, which was coded as a new concept. With the categorization of term relationships, we will be able to examine the distribution of the aforementioned relationships in MAP vs. PubMed, and new vs. revisited searches, which should help us infer whether and how the two factors, interfaces and previous search experiences, might influence the characteristics of expanded terms.
Table 11. Original-expanded terms relationships
Appendix 1. NFC scale.
Appendix I: Screenshot of MAP