Do human-developed index terms help users? An experimental study of MeSH terms in biomedical searching

Abstract

To what extent do MeSH terms improve the search effectiveness of different kinds of users? We observed four kinds of information seekers using an experimental information retrieval system: (1) search novices; (2) domain experts; (3) search experts; and (4) medical librarians. The information needs were a subset of the relatively difficult topics originally created for the Text REtrieval Conference (TREC). Retrieval effectiveness was based on the relevance judgments provided by TREC. Participants searched using either a version of the system in which MeSH terms were displayed or another version in which they had to formulate their own terms. The results suggest that MeSH terms are more helpful, in terms of precision, for domain experts than for search experts. We speculate that this is because of the highly technical nature of the topics; only the domain experts had the knowledge to understand, and therefore make use of, the MeSH terms. The results advance our understanding of the usefulness of controlled vocabulary in interactive information retrieval systems.

Introduction

The type and volume of information resources available to users have increased dramatically with the rapid growth of Internet information services. More importantly, the fact that users can access more non-library resources online (e.g., open access journal articles and authors' preprint files) has contributed to the decreasing use of library subject headings in online catalogs in recent years. One challenge within this changing information landscape is determining what kinds of information search tools will be most useful for users, and under what conditions these tools can be effective in finding information.

Given the widespread recognition of the importance of search engines and full-text searching, and the time and expense required to create controlled vocabulary, a crucial issue is whether human-developed index terms are still needed. Recent reports have suggested the use of automatic subject indexing because manual indexing is not cost effective (Calhoun, 2006) or is difficult to understand and use (“Bibliographic Services Task Force,” 2005). In contrast, Byrd et al. (2006) and Mann (2006) argued for maintaining manual indexing on the grounds that it is needed for systematic and comprehensive literature searching in academic libraries.

Most recently, the Library of Congress Working Group on the Future of Bibliographic Control recommended re-purposing LCSH and recognizing the potential of computational methods in the practice of subject analysis, given the current economic model for data sharing, higher user expectations, and rapidly changing information environments (“On the record: Report of the Library of Congress Working Group on the Future of Bibliographic Control,” 2008).

Both human-developed index terms and automatic indexing systems are created to assist their intended users. Previous research on the impact of user characteristics on the use of information retrieval systems indicated that (a) subject domain knowledge or specific topic knowledge is not correlated with search outcome (Allen, 1991; Pao et al., 1993), and (b) search experience with databases cannot predict search outcome (Fenichel, 1981; Howard, 1982; Sutcliffe et al., 2000).

In view of these issues of vocabulary control, more empirical research is needed to inform practice in subject analysis. In this paper, we report an experimental study of the usefulness of MeSH terms. The study aims to answer the question: to what extent do MeSH terms improve the search effectiveness of different kinds of users? More specifically, it examines the following questions:

  • 1. Do human-developed index terms help users?
  • 2. Do MeSH terms help users overall?
  • 3. Do MeSH terms help different kinds of users?

We observed four different kinds of information seekers using an experimental information retrieval system based on a 10-year subset of MEDLINE bibliographic records: (1) search novices; (2) domain experts; (3) search experts; and (4) medical librarians (see the Subjects section for the operationalization and recruitment of the four types of searchers). The information needs were a subset of the relatively difficult topics originally created for the TREC (Text REtrieval Conference) 2004 Genomics Track (“TREC 2004 genomics track document set,” 2005).

Based on the research questions and empirical research findings in related work, the researchers formulated the following research hypotheses:

  • H1. Queries using MeSH will get better results than queries not using MeSH.
  • H2. Searchers' results will vary by user type.
    • a. Domain experts and medical librarians using MeSH will get better results than search novices and search experts.
    • b. Medical librarians using MeSH will get better results than domain experts.

Related Work

Controlled vocabularies (such as indexing thesauri or subject headings) are used for representing and retrieving documents in order to address the problem of vocabulary variability. Early efforts primarily focused on the construction of complex indexing languages to represent information objects (e.g., Austin, 1974), without particular reference to retrieval effectiveness. The Aberystwyth index languages test is an exception (Keen, 1973). Considering the factors of index language specificity and exhaustivity, and the method of co-ordination, Keen found no large differences in the retrieval effectiveness and efficiency of five index languages.

Previous research has compared the retrieval performance of manually assigned MeSH terms and automatic indexing in laboratory settings without the involvement of human searchers (Salton, 1972; Savoy, 2005). A general finding is that the retrieval results obtained from automatic indexing systems can be comparable to those provided by manual indexing (Salton, 1986). However, it has also been demonstrated that improved search performance in batch-mode studies may not correspond with the results of user evaluations involving search tasks (Hersh et al., 2000; Turpin & Hersh, 2001).

Recent studies by Wacholder and Liu (2006, 2008) are examples of research that specifically compares the usefulness of query terms identified by different methods: one set constructed by a human indexer and the other two identified automatically. A prominent finding is that the query language affects search outcome.

From a theoretical perspective, the usefulness of controlled vocabulary is difficult to measure because several factors are intertwined in the use of an information retrieval system, including the nature of the controlled vocabulary, the subject discipline, the system functions provided for developing search strategies, the users (indexers and searchers), and the requirements of the search topics (Svenonius, 1986). Given this, we designed an experimental study to investigate the usefulness of controlled vocabulary within a specific subject discipline, using MeSH terms and the MEDLINE system as a test bed. MeSH was selected because:

  • 1. It is widely recognized that MeSH is representative of state-of-the-art indexing practice;
  • 2. Pertinent and comprehensive information is critical to biomedical research;
  • 3. Demand for searches of the PubMed database (the publicly accessible version of MEDLINE) is extremely strong, at more than 2.38 million searches per day (“PubMed searches,” 2008).

Method

This study employed user-oriented evaluation methods for information retrieval systems. The general procedure included a pre-search background questionnaire, a training session, search tasks, a post-search questionnaire, and a brief interview after all search tasks.

Subjects

Four types of participants were identified and recruited at a major public university in the US and in nearby medical libraries in early 2007. They were assigned to one of four searcher types based on two initial questions: (1) Have you taken any college-level biology courses? (2) Have you taken classes in how to do online searching? The four types of searchers were operationalized as follows:

  • 1. Search Novices (SN): Individuals without formal training in online searching and without advanced knowledge of the biomedical domain. In our study they were undergraduate students who were not biology majors. While many of these students are users of the Web and of search engines, they are not expected to have an in-depth understanding of online bibliographic databases.
  • 2. Domain Experts (DE): Graduate students in a biomedical domain such as biology or medicine, without formal training in online searching.
  • 3. Search Experts (SE): Graduate students enrolled in Master of Library and Information Science (MLIS) programs who have taken online database searching or related courses and do not have advanced knowledge of the biomedical domain; they neither majored in biology nor hold a Master's degree or above in any biomedical field.
  • 4. Medical Librarians (ML): Medical librarians specializing in online searching services, whose subject domain knowledge is defined by formal education in biomedical areas or more than two years of experience working in medical libraries.

Experimental design

The experiment used a 4 × 2 × 2 factorial design: four types of searchers, use of the experimental system with or without the aid of MeSH terms, and search topic pairs. The order of the two system versions, searcher types, and search topic pairs was controlled by a balanced Graeco-Latin square design (Fisher, 1935). Based on this design, thirty-two subjects (eight of each searcher type) were recruited.
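
As an illustration of this counterbalancing scheme, the sketch below builds an assignment from a standard order-4 Graeco-Latin square. The square itself is a textbook example; the mapping of Latin letters to topic-pair blocks and of Greek letters (here a-d) to system-version schedules is hypothetical, since the paper does not publish the actual assignment.

```python
# A minimal sketch, not the authors' actual assignment. In a Graeco-Latin
# square, each Latin and each Greek symbol occurs once per row and column,
# and each (Latin, Greek) pair occurs exactly once, so topic blocks and
# version schedules are balanced against subjects and session order.

SQUARE = [  # a standard Graeco-Latin square of order 4
    ["Aa", "Bb", "Cc", "Dd"],
    ["Bc", "Ad", "Da", "Cb"],
    ["Cd", "Dc", "Ab", "Ba"],
    ["Db", "Ca", "Bd", "Ac"],
]

TOPIC_BLOCK = {"A": "topic pair block 1", "B": "topic pair block 2",
               "C": "topic pair block 3", "D": "topic pair block 4"}
VERSION_SCHEDULE = {"a": "MeSH first", "b": "Non-MeSH first",
                    "c": "MeSH first", "d": "Non-MeSH first"}

for subject, row in enumerate(SQUARE, start=1):
    for session, cell in enumerate(row, start=1):
        latin, greek = cell  # unpack the two symbols in the cell
        print(f"subject {subject}, session {session}: "
              f"{TOPIC_BLOCK[latin]}, {VERSION_SCHEDULE[greek]}")
```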

Search tasks

Participants were told that several biologists had expressed their information needs as specific search topics; the participants' task was to use the retrieval system to find as many relevant documents as possible. Subjects were then instructed to do a concept analysis, identifying the important facets of each search topic before searching.

The participants filled out a searcher background questionnaire before the search assignment. After a brief training session, they were assigned to one of the arranged experimental conditions. They completed a search perception questionnaire and were asked to indicate the relevance of two pre-judged documents after each search. A brief interview was conducted when they finished all search tasks. Search logs with search terms and ranked retrieved documents were recorded.

Participants searched either with a version of the system in which MeSH terms were displayed (MeSH) or with another version (Non-MeSH) in which they had to formulate their own terms. Users of the MeSH system were instructed to consult an online vocabulary look-up aid, MeSH Browser 2003 (“MeSH Browser,” 2003), for help with concept analysis. Each participant searched eight topics in total (four with each version of the system), drawn from a pool of ten carefully selected search topic pairs. Subjects were allowed up to ten minutes per topic; when the ten minutes were up, the subject was given another topic.

Experimental system

The experimental information retrieval system was based on the Greenstone Digital Library Software version 2.70 (“New Zealand Digital Library Project,” 2006) set up on a server, using the 2004 TREC Genomics Track document set (“TREC 2004 genomics track document set,” 2005) (see Figure 1). This test collection is a 10-year (1994-2003) subset of MEDLINE with a total of 4,591,108 records. The subset fed into the system excluded records without MeSH terms or abstracts, leaving 3,442,321 records, or 75.0% of the whole collection.
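
The exclusion rule itself is simple; the following is a minimal sketch assuming a hypothetical parsed-record format (dicts with `mesh_terms` and `abstract` fields), not the actual layout of the TREC MEDLINE records.

```python
# A minimal sketch of the collection-filtering step: keep only records
# that have both MeSH terms and an abstract. The record format below is
# hypothetical; real MEDLINE records would first be parsed from the
# distributed files.

def keep_record(record: dict) -> bool:
    """True if the record has at least one MeSH term and a non-empty abstract."""
    return bool(record.get("mesh_terms")) and bool(record.get("abstract", "").strip())

records = [
    {"pmid": "10000001", "abstract": "Gene expression in ...", "mesh_terms": ["Humans"]},
    {"pmid": "10000002", "abstract": "", "mesh_terms": []},  # would be excluded
]
usable = [r for r in records if keep_record(r)]
print(f"kept {len(usable)} of {len(records)} records")  # kept 1 of 2
```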

Figure 1. Experimental system based on Greenstone digital library software

Search topics

A total of ten search topic pairs, selected from the pool of fifty search topics originally prepared for the 2004 TREC Genomics Track ad hoc retrieval task, were used (see Figure 2 for an example). The topics were selected in these steps:

  • 1) Consulting an experienced professional searcher with a biology background and a graduate student in neuroscience for an assessment of the intelligibility and technical difficulty of the topics;
  • 2) Ensuring that the major concepts in the search topics could be mapped to MeSH by searching the MeSH Browser;
  • 3) Eliminating topics with very low MAP (mean average precision) and P10 (precision at the top 10 documents) scores in the relevance judgment set; both measures are defined below.
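
For reference, the two scores used in step 3 follow the standard TREC definitions (the formulation below is ours, not quoted from the track report):

```latex
P@10 = \frac{|\{\text{relevant documents in the top } 10\}|}{10}, \qquad
\mathrm{AP} = \frac{1}{R} \sum_{k=1}^{n} P@k \cdot \mathrm{rel}(k), \qquad
\mathrm{MAP} = \frac{1}{|T|} \sum_{t \in T} \mathrm{AP}_t
```

where R is the number of relevant documents for a topic, rel(k) is 1 if the document at rank k is relevant and 0 otherwise, n is the number of retrieved documents, and T is the set of topics.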

Figure 2. Sample search topic

Outcome measures

We measured outcome using precision and recall measures for accuracy and time spent for user effort.

Theoretically, the calculation of recall requires relevance judgments for the whole test collection. However, it is practically impossible to obtain such judgments for a test collection of more than 3 million documents. For practical reasons, the recall measure was therefore based on the pooled judgments from the 27 groups that participated in the TREC 2004 Genomics Track ad hoc task (Hersh et al., 2004). Empirical evidence has shown that recall measured with a pooling method is a reasonable approximation, although recall is likely to be overestimated (Zobel, 1998).
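
Written out (our formulation, consistent with the pooling caveat above), the two accuracy measures are:

```latex
\text{precision} = \frac{|R_{\text{pool}} \cap A|}{|A|}, \qquad
\text{recall} \approx \frac{|R_{\text{pool}} \cap A|}{|R_{\text{pool}}|}
```

where A is the set of documents retrieved for a topic and R_pool is the set of relevant documents identified in the TREC judgment pool. Because the pool can only undercount the truly relevant documents, the recall denominator is too small, which is why pooled recall tends to overestimate true recall.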

Incentive system

An incentive system was designed to motivate the searchers. Each subject was paid $20 for participating. Subjects were also paid up to $10 more based on the number of relevant documents in their top ten search results; on average each participant received an additional $4.40, with a range of $2.00-$8.00.

Searcher background

To distinguish the different classes of users, we used two characteristics: domain knowledge and training in search.

  • Domain knowledge: As expected, DE searchers have the most subject domain knowledge, particularly from graduate-level biology courses. MLs come in second, also as expected, but their domain knowledge is much lower than that of the DE searchers: ML searchers primarily have undergraduate coursework (Figures 3 and 4).
  • Search training: Search experience, measured by formal training in online searching courses, showed that ML searchers had taken the largest number of online searching classes, followed by SE searchers (Figure 5). Most DEs and SNs had no search training.
Figure 3. Box plot of undergraduate level biology knowledge by searcher type

Figure 4. Box plot of graduate level biology knowledge by searcher type

Figure 5. Box plot of search training by searcher type

Results

This section reports the overall results of search success measured by search effectiveness (precision/recall) and search efforts (time spent).

Search effectiveness

The overall comparison of the MeSH and Non-MeSH systems showed no statistically significant difference between the two versions of the experimental system in either precision (one-way ANOVA, F(1, 254) = 0.01, p = .94) or recall (one-way ANOVA, F(1, 254) = 0.30, p = .58). The hypothesis that queries using MeSH will get better results than queries not using MeSH is not supported.
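
As an illustration of this kind of test (with synthetic placeholder scores, not the study's data), a one-way ANOVA over per-search precision can be run as follows:

```python
# A minimal sketch of the one-way ANOVA comparing precision between the
# two system versions; the per-search scores below are synthetic.
from scipy.stats import f_oneway

mesh_precision = [0.40, 0.30, 0.50, 0.20, 0.35]      # hypothetical scores
non_mesh_precision = [0.35, 0.25, 0.45, 0.30, 0.40]  # hypothetical scores

f_stat, p_value = f_oneway(mesh_precision, non_mesh_precision)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```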

Overall, the different types of searchers obtained similar results. Search effectiveness did not differ significantly by searcher type in either precision (one-way ANOVA, F(3, 252) = 1.86, p = .14) or recall (one-way ANOVA, F(3, 252) = 1.66, p = .18) (Table 1).

Table 1. Mean precision and mean recall by searcher type

Searcher Type    Mean Precision    N     Mean Recall    N
SN               0.29              64    0.21           64
DE               0.40              64    0.15           64
SE               0.30              64    0.15           64
ML               0.35              64    0.23           64
Total            0.34              256   0.18           256

Note. SN = search novices; DE = domain experts; SE = search experts; ML = medical librarians.

But when we compared the effectiveness of the different types of searchers within the different systems, we found a strong effect of the searcher type-system version pairs on precision (one-way ANOVA, F(7, 248) = 3.48, p = .001). In particular, there were highly significant differences in precision between DEs and SEs when both used the MeSH version of the system (Tukey's HSD, mean difference = .31, p = .003), and between DEs' use of MeSH and SNs' use of Non-MeSH (Tukey's HSD, mean difference = .28, p = .009) (Table 2); a sketch of this post-hoc comparison appears after the table.

Table 2. Mean precision and mean recall by system version and searcher type

                 MeSH                                 Non-MeSH
Searcher Type    Mean Precision   Mean Recall   N    Mean Precision   Mean Recall   N
SN               0.36             0.21          32   0.23             0.20          32
DE               0.51             0.15          32   0.29             0.15          32
SE               0.21             0.16          32   0.38             0.13          32
ML               0.28             0.22          32   0.42             0.24          32
Total            0.34             0.19          128  0.33             0.18          128

Note. SN = search novices; DE = domain experts; SE = search experts; ML = medical librarians.
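
The post-hoc comparison referenced above can be reproduced in outline with statsmodels; the scores and group labels below are synthetic placeholders, not the study's data.

```python
# A minimal sketch of Tukey's HSD over searcher-type/system-version pairs;
# each observation is a per-search precision score with a synthetic label.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

scores = np.array([0.51, 0.48, 0.55,   # DE-MeSH (hypothetical)
                   0.21, 0.18, 0.24,   # SE-MeSH (hypothetical)
                   0.23, 0.20, 0.26])  # SN-NonMeSH (hypothetical)
groups = (["DE-MeSH"] * 3) + (["SE-MeSH"] * 3) + (["SN-NonMeSH"] * 3)

result = pairwise_tukeyhsd(endog=scores, groups=groups, alpha=0.05)
print(result.summary())  # pairwise mean differences with adjusted p-values
```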

Figure 6. Line plot of the mean and standard error of the square root of precision by searcher type-system version pair

The hypothesis that subject domain experts (DE searchers) using MeSH will get better results than subject domain novices (SN and SE searchers) is therefore partially supported. The hypothesis that search experts (SE and ML searchers) using MeSH will get better results than untrained searchers (SN and DE searchers) using MeSH is not supported (Figure 6).

Search efforts

The participants were highly engaged with the assigned search tasks. A density histogram of time spent across all searches, with a superimposed theoretical normal curve, showed an extremely high density of searches lasting between 550 and 600 seconds (Figure 7).

There was no significant difference in the time spent using the MeSH and Non-MeSH systems (one-way ANOVA, F(1, 254) = 2.77, p = .10). However, time spent differed significantly by searcher type (one-way ANOVA, F(3, 252) = 3.47, p = .02). Further analysis showed that DEs spent significantly more time than SEs (Tukey's HSD, mean difference = 71.86, p = .01).

Figure 7. Histogram of time spent with normal density overlaid, all searches (N = 256)

The greater amount of time may reflect at least two factors: (1) trained searchers found the topics more difficult than did domain experts; (2) trained searchers persist in searching, even when they are having a hard time.

Table 3 summarizes the major findings in terms of search effectiveness and search efforts. The overall results suggest relatively low precision and recall (mean precision and mean recall were only .34 and .18, respectively) and considerable search effort (mean time spent was about 8 minutes per topic). There was no statistically significant difference in search effectiveness or search effort between the two versions of the experimental system, so the hypothesis that queries using MeSH will get better results than queries not using MeSH is not supported. Likewise, the different types of searchers achieved comparable search effectiveness, although DEs spent significantly more time than SEs. Most importantly, MeSH terms were more helpful, in terms of precision, for DEs than for SEs, suggesting that DEs benefit the most from MeSH terms when the search topics are technical.

Table 3. Summary of search effectiveness and search efforts results

                       Search Effectiveness                        Search Efforts
Overall                Mean precision = .34; mean recall = .18     Mean = 485.3 secs/topic
System Version         MeSH = Non-MeSH                             MeSH = Non-MeSH
Searcher Type          SN = DE = SE = ML                           DE ≥ SE
Searcher-System Pair   DE-MeSH ≥ SE-MeSH; DE-MeSH ≥ SN-Non-MeSH    DE-MeSH ≥ SE-Non-MeSH

Note. ≥ means better at the .05 level of significance; = means no significant difference; SN = search novices; DE = domain experts; SE = search experts; ML = medical librarians.

Discussion and Conclusion

Our results indicate that subject domain knowledge plays a particularly important role in the effective use of MeSH terms, especially when the search topics are technical in nature. Specifically, MeSH terms were most useful, in terms of precision, for domain experts. Our results therefore contradict earlier research suggesting that subject domain knowledge is not correlated with search outcome (Allen, 1991; Pao et al., 1993). This may be because the earlier research used homogeneous groups of participants with relatively similar subject backgrounds. Another explanation may be the intrinsic difficulty of the TREC topics: only the domain experts had the knowledge to understand, and therefore make use of, the MeSH terms; searchers who did not understand the topics could not make good use of them. Furthermore, Blair (2002) suggests that the type of search we asked users to conduct (exhaustive searches of a very large document collection, based on complex descriptions of topics) is especially difficult.

Medical librarian searchers in the study indicated that these are not typical search topics in their work settings and that they usually take more time to research difficult topics by consulting reference tools. We therefore do not generalize these results to the usefulness of MeSH terms with less difficult topics. Still, the greater amount of time MLs spent searching indicates that they do persist even when the topic is hard.

The results of the present study suggest that searchers with substantial subject domain knowledge can benefit, in terms of precision, from the use of MeSH terms when they search technical topics in a user evaluation setting. The previous finding that retrieval performance obtained by automatic indexing with various combinations of search models can be as effective as that obtained by manual indexing in batch-mode evaluation is now extended to a user evaluation. The effect of searcher differences, such as those identified in this research and in a study comparing experimental sites in the TREC interactive track (Lagergren & Over, 1998), deserves further investigation to advance our understanding of the usefulness of human-developed indexes in interactive information retrieval systems.

Acknowledgements

This work is funded by NSF grant # 0414557, Michael Lesk and Nina Wacholder, PIs.
