Expediting medical literature coding with query-building


  • Effective January 23, 2011, this work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.


Manual sorting of published journal articles into several pre-defined subsets for the purpose of qualitative analysis is common practice in social science research. Unfortunately, this can be a time-consuming process that requires the attention of a subject specialist and relies on various measures of inter-rater reliability to ensure that the results are valid and reproducible enough to serve as a basis for further study. We describe a system we have implemented, steelir, that helps identify the features common to one set of PubMed® articles that distinguish it from another. The system provides users with word-level unigram and bigram features from the article title and abstract, as well as MeSH® indexing terms, and suggests robust sample queries to find similar articles. We apply the system to the task of distinguishing original research articles on functional magnetic resonance imaging (fMRI) of sensorimotor function from fMRI studies of higher cognitive functions.


When a biomedical research topic is fully defined by known query terms, PubMed's automatic word stemming and MeSH index mapping facilitate high document recall. In other situations, however, researchers who need high recall for comprehensive literature reviews and bibliometric analyses must begin with a broad query and then manually sort, or code, the resulting articles into several pre-defined subsets. This manual filtering is time-consuming and requires considerable expert attention.

In this poster, we present a simple system, dubbed “steelir,” that suggests and assembles PubMed search features from a set of relevant and non-relevant examples provided by the researcher (Piwowar, 2010). Using this system, investigators need to code only as many articles as are necessary to establish relevant subsets within the data.

To evaluate the effectiveness of this approach, we applied it to the task of identifying research articles that used functional magnetic resonance imaging (fMRI) to study one of two discrete types of cognitive function.


We began with a set of fMRI research articles that had been manually curated based on the cognitive function under observation (Illes et al., 2003).

Of these sample documents, 62.5% were randomly selected to serve as a training set for query development. For each query we wished to develop, we supplied the PubMed IDs of our positive and negative examples to our steelir program.
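The random split described above can be sketched as follows. This is a minimal illustration in Python 3, not steelir's own code; the function name split_pmids, the fixed seed, and the placeholder identifiers are assumptions made for the example.

```python
import random

def split_pmids(pmids, train_fraction=0.625, seed=0):
    """Shuffle PMIDs reproducibly and split them into training and test lists."""
    shuffled = list(pmids)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Placeholder identifiers, not article IDs from the study.
train, test = split_pmids(["1", "2", "3", "4", "5", "6", "7", "8"])
print(len(train), len(test))  # 5 3
```

Fixing the seed keeps the held-out test set stable across runs, so a query is never evaluated on articles that were used to develop it.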

Steelir (Python 2.6) retrieved titles, abstracts, and MeSH® indexing terms through the NCBI's Entrez Programming Utilities (eUtils) web service (http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html) and calculated simple features for each article based on unigrams and bigrams of this text.
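One way such a retrieval step might look is sketched below in Python 3 (the original program targeted Python 2.6). It builds a request for the eUtils EFetch endpoint and extracts the title, abstract, and MeSH descriptors from the returned PubMed XML. The function names and the record layout are assumptions for illustration and are not taken from steelir.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(pmids):
    """Build an EFetch URL that requests PubMed records as XML."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return EFETCH + "?" + urllib.parse.urlencode(params)

def _text(node):
    """Flatten an element's text, including any inline markup it contains."""
    return "".join(node.itertext()).strip() if node is not None else ""

def parse_articles(xml_data):
    """Map each PMID to its title, abstract, and MeSH descriptor names."""
    records = {}
    for article in ET.fromstring(xml_data).iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        abstract_parts = citation.findall("Article/Abstract/AbstractText")
        mesh_nodes = citation.findall("MeshHeadingList/MeshHeading/DescriptorName")
        records[citation.findtext("PMID")] = {
            "title": _text(citation.find("Article/ArticleTitle")),
            "abstract": " ".join(_text(part) for part in abstract_parts),
            "mesh": [_text(node) for node in mesh_nodes],
        }
    return records

def fetch_articles(pmids):
    """Download and parse records (performs a network request when called)."""
    with urllib.request.urlopen(efetch_url(pmids)) as response:
        return parse_articles(response.read())
```

Separating URL construction and parsing from the network call makes the parsing logic easy to check against a saved XML response. For large sets of PMIDs, requests would likely need to be batched and rate-limited in line with NCBI's usage guidelines.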

To assemble unigram and bigram features for the abstracts and titles, we delimited the text on spaces and all punctuation except hyphens. We excluded any unigram or bigram that included a word shorter than 3 or longer than 30 characters, did not include at least one alphabetic character, or represented a PubMed stop word. Steelir then disqualified any feature that did not reach at least 10% precision and 10% recall in our training corpus.
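The feature rules in this paragraph can be sketched roughly as below. This is an illustrative Python 3 reading of the description, not steelir's source. Several details are assumptions: the stop word set is a small stand-in for the full PubMed list, bigrams are taken from adjacent tokens regardless of intervening punctuation, a bigram is dropped if either word is a stop word, and a feature's precision and recall are computed from the number of training articles that contain it.

```python
import re

# Small illustrative subset; the full PubMed stop word list is longer.
STOP_WORDS = {"the", "and", "for", "with", "was", "were", "from", "this", "that"}
# Runs of letters, digits, and hyphens; everything else acts as a delimiter.
TOKEN = re.compile(r"(?:[^\W_]|-)+")

def word_ok(word):
    """Keep words of 3 to 30 characters that are not stop words."""
    return 3 <= len(word) <= 30 and word not in STOP_WORDS

def features(text):
    """Return the set of admissible unigram and bigram features in a text."""
    words = TOKEN.findall(text.lower())
    grams = [(w,) for w in words] + list(zip(words, words[1:]))
    return {
        " ".join(gram) for gram in grams
        if all(word_ok(w) for w in gram)
        and any(ch.isalpha() for w in gram for ch in w)
    }

def select_features(positive_texts, negative_texts,
                    min_precision=0.10, min_recall=0.10):
    """Keep features meeting both thresholds on the training examples."""
    pos_sets = [features(t) for t in positive_texts]
    neg_sets = [features(t) for t in negative_texts]
    kept = {}
    for feat in set().union(*pos_sets):
        tp = sum(feat in s for s in pos_sets)  # positive articles with feature
        fp = sum(feat in s for s in neg_sets)  # negative articles with feature
        precision = tp / (tp + fp)
        recall = tp / len(pos_sets)
        if precision >= min_precision and recall >= min_recall:
            kept[feat] = (precision, recall)
    return kept

# Invented example titles, not drawn from the study corpus.
positives = ["Motor cortex activation during finger tapping",
             "Finger tapping and the motor cortex"]
negatives = ["Working memory load in the prefrontal cortex"]
kept = select_features(positives, negatives)
print(kept["motor cortex"])  # (1.0, 1.0)
```

In this toy example "cortex" survives with lower precision because it also occurs in the negative title, which is the kind of feature a researcher might later remove by hand.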

We then applied domain knowledge to exclude features that were not germane to the intent of the query, and considered generalizations of MeSH terms where appropriate.
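Once features have been vetted, they can be assembled into a PubMed query string. The sketch below is a hypothetical illustration in Python 3: the source does not state how steelir combines features, so joining them with OR, tagging text features with [tiab], and the helper name build_query are all assumptions. The feature values in the example are invented.

```python
def quote(term):
    """Quote multi-word or hyphenated terms so PubMed treats them as phrases."""
    return '"%s"' % term if " " in term or "-" in term else term

def build_query(text_features, mesh_terms, base_query=None):
    """OR together title/abstract and MeSH features, optionally ANDed to a base."""
    clauses = ["%s[tiab]" % quote(t) for t in text_features]
    clauses += ['"%s"[MeSH]' % m for m in mesh_terms]
    if not clauses:
        raise ValueError("at least one feature is required")
    query = "(" + " OR ".join(clauses) + ")"
    return "%s AND %s" % (base_query, query) if base_query else query

# Invented feature lists for illustration only.
query = build_query(["sensorimotor", "finger tapping"], ["motor cortex"],
                    base_query='"magnetic resonance imaging"[MeSH]')
print(query)
```

Because PubMed expands MeSH terms to their narrower descendants by default, any MeSH clause produced this way still deserves a manual check before the query is used.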

To evaluate the performance of the queries, we calculated the precision, recall, and f-measure of the full queries across the test samples (the PubMed positive and negative examples that were not used for query development). We compared the results to baseline MeSH queries.
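The evaluation measures can be computed from two sets of PMIDs: those a query retrieves and those coded as relevant. The sketch below is a minimal Python 3 illustration. It assumes the balanced F-measure (F1), since the weighting is not specified in the text, and the function name and example identifiers are placeholders.

```python
def evaluate(retrieved, relevant):
    """Return (precision, recall, F1) for a query's retrieved set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Placeholder identifiers: 2 of 4 retrieved are relevant; 2 of 3 relevant found.
precision, recall, f_measure = evaluate({"a", "b", "c", "d"}, {"b", "c", "e"})
print(round(precision, 3), round(recall, 3), round(f_measure, 3))  # 0.5 0.667 0.571
```

When evaluating on a held-out sample, the retrieved set would be restricted to the test PMIDs before calling the function, so that articles outside the coded sample do not count against precision.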


We applied the steelir system to the task of identifying fMRI research articles on basic sensorimotor function. On the test set, we compared the results of the derived query with those of two naïve queries based on MeSH terms. The derived query had higher precision than either of the MeSH queries (.57 versus .43 and .55) and substantially higher recall (.52 versus .35 and .11) for our intended task, as shown in Table 1.


We describe a simple tool to help researchers expedite the classification of biomedical research articles. As a proof of concept, we applied this approach to a task previously performed by manual annotation: identifying research articles that used fMRI to study either sensorimotor function or language/motor function, as distinct from fMRI studies of other cognitive functions. The derived queries outperformed baseline MeSH queries for our intended application and reduced the time needed to construct a data set to be used as the basis for further research.

In constructing our machine-derived queries, the steelir system suggested several features that would not have been selected by a naïve searcher. While there was little increase in precision as compared with the baseline MeSH queries, recall was improved considerably by incorporating a greater number of more specific word-level features. This allows searchers to broaden their queries with less risk of obtaining minimally relevant documents, further lessening the time spent in coding.

Table 1. Comparison to simple MeSH queries

Query                                                                              n     Precision [interval]   Recall [interval]
[fMRI query elements] AND ("somatosensory cortex"[MeSH] OR "motor cortex"[MeSH])   166   .43 [.38 .48]          .35 [.30 .40]
[fMRI query elements] AND "somatosensory cortex"[MeSH] AND "motor cortex"[MeSH]    166   .55 [.50 .60]          .11 [.08 .14]
Derived sensorimotor query                                                         166   .57 [.52 .62]          .52 [.47 .57]

Although the evaluation demonstrates the usefulness of this approach only in the context of one type of research task, we believe this end-user method for formulating PubMed queries is widely applicable. For example, given a representative set of examples, it can be used to develop an automated query to rapidly complete the annotation of a cross-sectional study or continually monitor trends over time. Unlike relevance feedback, which takes place during the conduct of a search, the method we describe treats query construction as a preliminary stage in the search (Manning, 2008). This method of query refinement could also be applied in an information filtering task, in which a high quality query is needed as a filter for retrieval over time from a data stream (Hanani, 2001).

The system could be extended in many ways. It could incorporate other attributes of MEDLINE metadata, such as journal or author names, MeSH qualifiers, and Entrez links. It could evaluate a list of PubMed limits and subsets (e.g., bioethics[sb]), or present distinguishing features by semantic category, as Arrowsmith does (Smalheiser, 2009). Query results could be passed through the MetaMap program to identify synonyms and multi-word phrases. Additionally, steelir could be generalized to produce features appropriate for searching other databases.

This query development method requires a careful eye for detail to ensure that no undesired elements are included in a derived query due to MeSH term expansion. More evaluation is needed to better understand its accuracy across tasks. Nonetheless, the system is easy to maintain, is free, open, and extensible, and has already saved much manual coding effort.