Literature-based discovery: Beyond the ABCs

Authors

Neil R. Smalheiser

Abstract

Literature-based discovery (LBD) refers to a particular type of text mining that seeks to identify nontrivial assertions that are implicit, rather than explicitly stated, and that are detected by juxtaposing (generally a large body of) documents. In this review, I will provide a brief overview of LBD, both past and present, and will propose some new directions for the next decade. The prevalent ABC model is not "wrong"; however, it is only one of several different types of models that can contribute to the development of the next generation of LBD tools. Perhaps the most urgent need is to develop a series of objective, literature-based interestingness measures, which can customize the output of LBD systems for different types of scientific investigations.

Introduction

Text mining is an umbrella term for extracting and analyzing information expressed in the form of text. Literature-based discovery (LBD) refers to a particular type of text mining that seeks to identify nontrivial assertions that are implicit, rather than explicitly stated, within (generally a large body of) documents. As articulated by Don Swanson (1986a, 1986b, 1988), identifying such assertions is a first step in formulating and assessing new scientific hypotheses that may be regarded as potential new discoveries. Strategies for LBD have been studied primarily by information and computer scientists (see the comprehensive book edited by Bruza & Weeber, 2008, for reviews; e.g., Hristovski, Friedman, Rindflesch, & Peterlin, 2008; Sehgal, Qiu, & Srinivasan, 2008; Smalheiser & Torvik, 2008; Wren, 2008; Yetisgen-Yildiz & Pratt, 2008). The bioinformatics community has also created numerous specialized systems that utilize implicit textual assertions for predicting, for example, gene–disease associations and protein–protein interactions (e.g., Jansen et al., 2003; Rzhetsky, Wajngurt, Park, & Zheng, 2007; Leach et al., 2009; van Haagen et al., 2009; Tjioe, Berry, & Homayouni, 2010). In this review, I will provide a brief overview of LBD, both past and present, and will propose some new directions for the next decade.

The goal of LBD is to generate or assess new hypotheses that might represent potential scientific discoveries and, hence, are worthy of follow-up in the laboratory or clinic. The term LBD can be ambiguous or misleading (Kostoff, 2007; Kostoff et al., 2009), and Bekhuis (2006) has proposed that it be replaced with an alternative term such as "exploratory mining." "Discovery" has many different meanings in different contexts and at different stages in the cycle of scientific discovery (Grinnell, 2009). An LBD system might be very useful when it discovers things that are novel to the investigator doing the search, even if they are well known to other experts or even to the scientific community at large. On the other hand, a great deal of information has been published and, hence, ought to be known by the scientific community, yet lies unknown, inaccessible, or neglected for one reason or another (undiscovered public knowledge, Swanson, 1986a; neglected medical discoveries, Swanson, 2011).

A few years ago, Vetle Torvik and I published a case of undiscovered public knowledge in genomics databases, namely, the fact that a significant subset of mammalian micro-ribonucleic acid (miRNA) precursors derive entirely from genomic repeat elements (Smalheiser & Torvik, 2005). To make this observation, all that was necessary was to view miRNA genes on the UCSC Genome Browser, juxtapose the miRNA track with the RepeatMasker track, and notice the association. The knowledge contained in the Browser is entirely public and explicit; nothing implicit was involved. However, apparently no one had thought to look for such a pattern before—it was literally hidden in plain view.

This single discovery can be deconstructed into a series of discoveries. First, in the course of an earlier study (Smalheiser & Torvik, 2004), we "discovered" the hypothesis that miRNAs might derive from genomic repeat elements. Then, we "discovered" the observation as empirical data lying within public databases. Finally, the finding was analyzed further in detail, written up, and subjected to peer review, to establish the miRNA/genomic repeat link as a generally accepted and biologically significant fact, which would be acknowledged as a discovery by anyone's definition (Grinnell, 2009). With these caveats, in the present review, I will refer to any knowledge or finding identified using an LBD system or strategy as a "discovery," regardless of where it sits in the cycle of scientific discovery—as long as it provides something new to the searcher that assists him or her in the task of generating or assessing a hypothesis.

A “Dirty Little Secret”

A further ambiguity is that LBD can refer either to a system, that is, a software product designed to assist (or replace) humans in formulating hypotheses, or to a "strategy," that is, a cognitive approach that humans employ to combine assertions, whether carried out as a deliberate, conscious effort or in an intuitive manner. For several reasons, it has been very difficult to obtain hard evidence documenting the extent to which LBD does, or potentially can, accelerate the process of scientific discovery.

On the one hand, only a score or so of published scientific articles have proposed hypotheses that the authors said were obtained via LBD systems; only a few have validated the hypotheses experimentally in the same article (e.g., Wren et al., 2004) or even openly acknowledged that LBD played any role in their thinking (Manev & Manev, 2010). Some observers (e.g., Spasser, 1997) have used this paucity of evidence to suggest that LBD arose within the information science community (and stayed there) without successfully connecting with active scientists. However, we must not forget the stark distinction between the private and public phases of discovery: most of the thoughts, conjectures, pilot studies, puzzling findings, modeling activities, and literature searches that are pursued during the private phase of a scientist's work are missing, sanitized, or erased from the final published article (Grinnell, 2009). Just as scientists are generally loath to publish negative findings, most experimental scientists regard hypothesis papers as an inferior type of literature (in the same manner, I suppose, in which poets regard limericks) and generally will postulate new hypotheses in print only when tacked onto the end of an experimental study or a review article. Another factor is that scientists may be reluctant to trust, much less give credit to, LBD systems for their outputs. Computer-based diagnostic systems were rejected by physicians in part for similar reasons: Physicians were unwilling to trust or credit the software even when it gave the correct diagnosis, because they still had to double-check its reasoning and apply their own judgment anyway (Shortliffe, 1987).

More likely, scientists are, indeed, routinely carrying out LBD analyses on their own, manually and unsystematically, perhaps without realizing it. For example, Don Swanson once followed up on the impact of several of his classic LBD hypothesis articles (Swanson, 1986b, 1988) by looking at later articles, written by others, that validated these hypotheses in experimental or clinical studies. He demonstrated persuasively that these later authors had read, and been influenced by, his earlier papers, yet few of them had cited or discussed those papers (Swanson, 1993). Moreover, at the panel "Beyond (simple) Reading: Strategies, Discoveries, and Collaborations" held at the 2009 American Society for Information Science and Technology meeting, I gave a detailed example of one neuroscientist who carried out a classic, systematic A–B–C analysis that led to the discovery of a new extracellular matrix protein receptor, yet was unaware that she was performing a discrete, iterated LBD text-mining task. She thought she was simply reading a number of articles and reasoning logically about them! Indeed, LBD does represent intuitive common sense, but domain scientists do not realize that modeling common sense is a formal (and very hard) problem.

To my knowledge, there has not been any systematic evaluation of when, and how often, scientists carry out LBD-style analyses (manually) in the course of their scientific work. Nor is it clear whether scientists themselves recognize when they are doing an LBD analysis, as opposed to carrying out a literature search or other types of information-seeking activities. This is a great PhD thesis topic for someone.

Yet another hurdle for the LBD community is the fact that most domain scientists in the biomedical and physical sciences seem to be unaware of the various web-based LBD interfaces that have been set up by information scientists (reviewed in Weeber et al., 2005). Only a few of these websites have been maintained continuously by their creators, and only a few have been subjected to user testing (Smalheiser et al., 2006; Yetisgen-Yildiz & Pratt, 2008; Yetisgen-Yildiz & Pratt, 2009). The Arrowsmith two-node search interface (http://arrowsmith.psych.uic.edu) has been shown to assist field testers materially in assessing their hypotheses (Smalheiser et al., 2006), and has even garnered unsolicited testimonials from outside users of the site (Best of the Web, 2007; Manev & Manev, 2010).

Finally, hypothesis formation is only one of many driving forces for discovery. Someone may have a good hypothesis and not pursue it for a variety of reasons, including lack of funding, lack of available analysis tools (Edwards et al., 2011), competing priorities, prevailing biases, and so on. Given all of these considerations, we should not be unduly discouraged that LBD seems to have a low profile among domain scientists. (Bear in mind that most biomedical scientists do not even utilize informatics tools for other basic tasks, such as visualizing their data or summarizing the documents retrieved by a literature search.) Going forward, information scientists can raise LBD's profile not only by improving LBD algorithms, but also by studying the prevalence and role of LBD-like analyses in scientific workflows and by educating both students and scientists in informatics literacy.

Incremental Versus Radical Discoveries

Swanson formulated the strategy of LBD in terms of what has become known as the ABC model (Swanson, 1986b, 1988; Swanson & Smalheiser, 1997). For example, given the assertion “A affects B” appearing in one article, and “B affects C” appearing in a different article, one can derive the implicit assertion “A affects C,” which represents a potential hypothesis. This formulation has simplicity and power, and (given a corpus of articles of the size of PubMed) suffices to generate an enormous number of plausible hypotheses. Nevertheless, the time has come to relax the ABC formulation and consider alternatives for the field of LBD.

The ABC approach, as commonly pursued, begins with a collection of articles “A” within MEDLINE or PubMed that represent a scientific problem (e.g., articles that discuss small-cell lung carcinoma). Words and phrases “Bi” (that appear in the title or abstract of articles in A) are then listed, and for each “Bi” term (or a filtered subset), a separate literature search is carried out using that term as query. The words and phrases “Ci” that appear in each of the Bi literatures are then compiled (and possibly filtered). Finally, by some criteria, the Ci terms are ranked, such that high-ranking Ci terms are thought to represent the most promising hypotheses. (Depending on the system, Bi and Ci may alternatively represent other features extracted from the articles such as Medical Subject Headings or concepts.)

For example, for A = small-cell lung carcinoma, with the Ci terms restricted to the category of therapeutic agents, the high-ranking Ci terms may be the names of drugs that have not yet been tested against small-cell lung carcinoma, but that have been proven to have efficacy in other situations (e.g., in other forms of cancer or in animal models), suggesting that they might be explored as new therapies. (Note: Some authors reverse the A and C in this scheme, so that one begins with a problem C and seeks a possible solution A.)
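To make this procedure concrete, here is a minimal sketch in Python over a toy corpus. The corpus, the category filter, the novelty test, and the rank-by-number-of-linking-B-terms heuristic are illustrative assumptions on my part, not a description of any particular deployed system.

```python
from collections import defaultdict
from itertools import chain

# Toy corpus: each "article" is the set of terms found in its title and
# abstract. A real system would extract these from MEDLINE records.
CORPUS = [
    {"small-cell lung carcinoma", "apoptosis", "cisplatin"},
    {"apoptosis", "caspase activation", "drug X"},
    {"small-cell lung carcinoma", "tumor hypoxia"},
    {"tumor hypoxia", "drug Y"},
    {"drug X", "apoptosis"},
]
DRUGS = {"cisplatin", "drug X", "drug Y"}  # category filter for the Ci terms

def lit(term):
    """The 'literature' of a term: all articles that mention it."""
    return [doc for doc in CORPUS if term in doc]

def abc_search(a_term, c_category):
    """One-node ABC search: collect Bi terms from the A literature, search
    each Bi literature for Ci terms in the target category, and rank each
    Ci by the number of distinct Bi terms linking it back to A."""
    a_docs = lit(a_term)
    b_terms = set(chain.from_iterable(a_docs)) - {a_term}
    linking_bs = defaultdict(set)  # Ci term -> set of Bi terms linking it to A
    for b in b_terms:
        for doc in lit(b):
            for c in doc & c_category:
                # Novelty filter: keep only Ci terms that never co-occur with A.
                if c != b and not any(c in d for d in a_docs):
                    linking_bs[c].add(b)
    return sorted(linking_bs.items(), key=lambda item: -len(item[1]))

for c, bs in abc_search("small-cell lung carcinoma", DRUGS):
    print(c, "via", sorted(bs))
# -> drug X via ['apoptosis']; drug Y via ['tumor hypoxia']
# (cisplatin is filtered out because it already co-occurs with the A term)
```

Even this toy version hints at the two problems discussed next: the loop over Bi terms blows up on a corpus the size of MEDLINE, and the ranking heuristic is crude.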

There are several limitations in this ABC approach. First, the sheer number of Bi terms causes a combinatorial explosion that is hard to handle computationally and that requires one or more shortcuts to be implemented (Wren, 2008). Second, the huge number of resulting Ci terms is difficult to assess or interpret manually, so it is crucial to have effective ranking procedures to identify the most promising finds.

Although different systems have dealt with these two issues in various ways, almost all current systems employ similarity algorithms that rank Ci terms as more promising if they closely resemble terms or concepts that are already known to be true in A. For example, thalidomide has been investigated as a therapy against certain autoimmune diseases, and an LBD analysis predicted that it might be worth investigating in certain other diseases that share similar pathogenetic features (Weeber, Kors, & Mons, 2003). Reelin has been shown to bind to certain proteins, and an LBD analysis identified other proteins (that share certain features with the known set) as promising reelin-binding proteins (Homayouni, Heinrich, Wei, & Berry, 2005). By their very nature, similarity algorithms will find only incremental discoveries—those that are similar to what in machine learning is called “the training set” (see also Kostoff et al., 2009).
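The flavor of such similarity-based ranking can be conveyed by a small sketch; the binary feature profiles and the max-cosine scoring rule below are invented for illustration (real systems derive richer features from the literature itself).

```python
import math

# Hypothetical binary feature profiles for terms (e.g., MeSH ancestors or
# pathway memberships); real systems derive these from the literature.
KNOWN_TRUE = {  # agents already known to be relevant to the A literature
    "thalidomide": {"anti-inflammatory", "anti-angiogenic", "immunomodulator"},
}
CANDIDATES = {
    "drug P": {"anti-angiogenic", "immunomodulator", "kinase inhibitor"},
    "drug Q": {"antibiotic", "cell-wall synthesis"},
}

def cosine(f1, f2):
    """Cosine similarity between two binary feature sets."""
    if not f1 or not f2:
        return 0.0
    return len(f1 & f2) / math.sqrt(len(f1) * len(f2))

def rank_by_similarity(candidates, known):
    """Score each candidate Ci by its best match to any known-true term."""
    scores = {c: max(cosine(f, kf) for kf in known.values())
              for c, f in candidates.items()}
    return sorted(scores.items(), key=lambda item: -item[1])

print(rank_by_similarity(CANDIDATES, KNOWN_TRUE))
# drug P (score ~0.67) outranks drug Q (score 0.0) because it resembles
# the "training set" -- which is why such rankings favor incremental finds.
```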

Another, more subtle limitation of the ABC approach is that systems are generally evaluated according to how likely the ACi assertions are to be true. That is, they look for highly probable assertions. However, novel discoveries often seem very unlikely at the time that they are first proposed (Simonton, 2004). A better approach is to rank the Ci terms according to how many different biological mechanisms link Ci and A, although the sheer number of linking Bi terms (e.g., as tabulated by Don Swanson's Kiwi one-node search system; Swanson & Smalheiser, 1997) is a poor proxy for estimating this. Other methods, such as the mutual information measure, have also been proposed (Wren, 2004). The use of directional action cues (does A inhibit or enhance B?; Giles & Wren, 2008) and the mapping of genes or terms onto functional pathways (e.g., Kim, Wuchty, & Przytycka, 2011) are active research areas in bioinformatics and may contribute to the solution of this problem.
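For instance, a mutual-information-style score rewards term pairs that co-occur (directly, or via shared B-term profiles) more often than chance would predict. The following sketch shows a basic pointwise mutual information calculation over document frequencies; the counts are invented, and this is a generic formulation rather than Wren's (2004) exact measure.

```python
import math

N_DOCS = 1_000_000  # assumed total number of documents in the corpus

def pmi(n_a, n_c, n_ac, n_total=N_DOCS):
    """Pointwise mutual information between two terms, computed from
    document frequencies: log of observed vs. chance co-occurrence."""
    if n_ac == 0:
        return float("-inf")
    p_a, p_c, p_ac = n_a / n_total, n_c / n_total, n_ac / n_total
    return math.log2(p_ac / (p_a * p_c))

# A pair that co-occurs 10x more often than chance predicts scores high;
# a frequent but uninformative pairing scores at chance level (0 bits).
print(pmi(n_a=5_000, n_c=2_000, n_ac=100))      # ~3.32 bits
print(pmi(n_a=50_000, n_c=40_000, n_ac=2_000))  # 0.0 bits
```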

Moreover, several of the discovery systems attempt to improve the signal-to-noise ratio by employing natural language processing techniques that identify explicit statements of the form “A affects/binds/regulates/interacts with B” and “B affects/binds/regulates/interacts with C” (e.g., Hristovski et al., 2008). This is certainly a valid approach, particularly suited to simple statements of chemical interactions, and useful for genomics and proteomics data in particular.

However, I argue that most implicit information present in the scientific literature does not follow such simple templates (and may not comprise simple factual or propositional statements at all). Rather, it is analogies and images—juxtapositions and novel associations of ideas—that appear most often to stimulate scientists to formulate radically new hypotheses (see discussion in Simonton, 2004). Many classic discoveries do follow AB and BC assertions, but at a rather high level of abstraction that is unlikely to be captured or highlighted in explicit, templated factual statements:

  • (a) According to Lenoir and Giannella (2006): "The technological development of peptide and DNA microarrays was driven by analogy to photolithography techniques, particularly those employed by the semiconductor industry. In one of the meetings of the Affymax scientific board, Leighton Read tossed out the idea of just mimicking the makers of semiconductor chips, who use beams of light to manipulate molecules on solid surfaces in order to create random chemical diversity."
  • (b) According to Ban (2006): "Potassium bromide is the oldest widely used sedative in medicine. Charles Lockock, a London internist, discovered the anticonvulsant and sedative action of the drug. His discovery was one of the many quaint examples of serendipity in which an utterly false theory led to correct empirical results. Lockock, like most physicians of his time, believed that there was a cause–effect relationship between masturbation, convulsions, and epilepsy. Bromides were known to curb the sex drive. Lockock's rationale was to control epilepsy, i.e., convulsions, by reducing the frequency of masturbation. The treatment was a success insofar as control of convulsions was concerned. It also brought to attention the sedating properties of the drug." (Admittedly, one could construct this discovery from individual pre-existing statements, but only if one were to accept false statements, thought to be true at the time, as inputs for discovery systems!)
  • (c) In my own scientific work, we proposed that RNA interference might have a physiologic role in regulating learning and memory (Smalheiser, Manev, & Costa, 2001). This hypothesis was based on similarities between gene silencing studies in C. elegans that were published around 2000 and experiments on memory transfer in planarians carried out more than 30 years earlier. For example, (i) one can feed C. elegans bacteria that express double-stranded RNAs to induce silencing, whereas one could transfer memory in planarians by feeding naïve worms extracts of trained worms. (ii) One can inject double-stranded RNAs at one location and gene silencing will spread throughout the body of C. elegans, whereas one could cut a trained planarian in half and it would regenerate a new head that retains the memory. (iii) The silencing activity in C. elegans depends on double-stranded RNAs, whereas the active memory transfer molecules in planarians appeared to be some type of RNA. (iv) RNA interference in C. elegans is extremely potent and self-amplifying, whereas memory transfer in planarians was effective even when the extracts did not contain any detectable RNA at all (at levels measurable by optical density).

Even if each of these individual similarities could be captured in simple, templated, factual assertions within a body of articles within each literature (which is doubtful, at least for the primary research articles), no single feature was very compelling, specific, or unusual, and so it is unlikely that any one of them would have drawn attention in the forward direction from a discovery system. Rather, it was the combination of all four similarities that created an intriguing story and led to the testable hypothesis that endogenous small interfering RNAs (siRNAs) are expressed in brain and upregulated during the onset of learning (Smalheiser et al., 2001). Interestingly, the initial experimental attempts to detect endogenous siRNAs (during 2000–2005) gave negative results. These negative results did not disconfirm the hypothesis, however; the more recent development of deep sequencing methodology has since allowed such siRNAs to be detected (see discussion in Smalheiser, Lugli, Thimmapuram, Cook, & Larson, 2011).

Another limitation of the natural language processing-based approach, i.e., utilizing templated assertions, is that it often enforces semantic agreement across the linking term; that is, to link AB and BC assertions, the term B must have the same meaning or context in both AB and BC. Yet consider magnesium: it can be mapped to many different concepts, being conceptualized as an element, a cation, a dietary ingredient, a bodily fluid constituent, a co-factor of enzymes, a channel blocker, or a therapeutic agent. The same term (Mg) is often discussed in different contexts in the different literatures that we would like to connect. A limited amount of "slippage" across such loose links is desirable, and may be lost if links are forced to share the same semantic meaning or connotation. Root-Bernstein (1989, p. 483) gave an example of the importance of slippage in the discovery of lysozyme by Alexander Fleming: "Enter Fleming the mischievous game player. His problem: What causes his frequent and uncomfortable runny noses? Wait a minute! Runny bottoms are caused by bacteriophage infections! Why not runny noses? A hypothesis is born of verbal analogy!"

Interestingness Measures for LBD Systems

To date, the challenge of LBD (the one-node search) has largely been framed in terms of finding hypotheses that are novel, nontrivial, and likely to be true. Torvik and Smalheiser (2007) employed shared title words and phrases (B-terms) to link two disparate literatures A and C in a biologically meaningful manner, in which the emphasis was on finding terms that are relevant and meaningful in a particular context. Yet significant scientific discoveries have one or more additional aspects: For example, they may exhibit simplicity, or they may be surprising or beautiful in an aesthetic or conceptual sense. They often link disparate disciplines, and ideally they are actionable (i.e., they lead to hypotheses that can be tested immediately or in the near future). They have great impact within their own field, their premises are based on reliable experimental support, and they have explanatory power that generalizes and ripples widely across other domains of science.

Whereas the field of numerical data mining has extensively explored a variety of rule interestingness measures (Han & Kamber, 2006), to my knowledge, few interestingness measures have been formulated in the context of text mining, and even fewer studies have applied literature-based measures (e.g., Weiss, Indurkhya, & Apte, 2010; Sebastian & Then, 2011). Interestingness measures can be objectively formulated for a given finding "A affects B" in terms of formulas derived from literature-based features (i.e., the set of articles that demonstrate, mention, or discuss "A affects B") or literature pairs (i.e., the set of articles related to A and the set of articles related to B). The study of Swanson, Smalheiser, and Bookstein (2001) was a case in which interestingness measures were employed to identify viruses that were particularly promising candidates for exploitation in biological warfare. The premise was that biological warfare investigators would be most likely to choose viruses whose genomes had already been sequenced and that had been investigated with regard to aerosol stability. (This strategy is based on a model of how researchers may themselves select a virus for study.) Thus, the list of potential viruses was ranked according to how actionable they were for experimental manipulation. A parallel study by Smalheiser (2001) used similar criteria to predict that gene therapy biotechnologies (specifically, gene delivery methods) were likely to be employed for viral biowarfare research.
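As a toy illustration of how a literature-based actionability score of this kind might be composed (the feature names, counts, and the product rule are my own assumptions, not the actual formulas of Swanson et al., 2001):

```python
# Toy literature-derived features per virus: counts of articles that
# demonstrate each property. All values are invented for illustration.
VIRUS_FEATURES = {
    "virus A": {"genome_sequenced": 12, "aerosol_stability": 3},
    "virus B": {"genome_sequenced": 0,  "aerosol_stability": 5},
    "virus C": {"genome_sequenced": 8,  "aerosol_stability": 0},
}

def actionability(features):
    """An interestingness score modeled on how a researcher might choose a
    virus to work on: both prerequisites need some literature support, so
    the score is a product (a conjunction), not a sum."""
    return features["genome_sequenced"] * features["aerosol_stability"]

ranked = sorted(VIRUS_FEATURES, key=lambda v: -actionability(VIRUS_FEATURES[v]))
print(ranked)  # ['virus A', 'virus B', 'virus C']; only virus A has both
               # prerequisites, so it is the most "actionable" candidate
```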

Removing the “B” From the ABC Model: Reformulating One-Node Searches as Two-Node Searches

As mentioned above, one-node searches have generally been formulated in a manner that faces an explosion of intermediate links: Starting with a single literature A, one obtains up to thousands of Bi-terms, and for each Bi-term, a new query is performed that retrieves many Ci-literatures. Because of this, all existing LBD strategies restrict the number or type of B-terms, and most restrict the Ci-literatures to those that fall within a predetermined category (e.g., diseases or drugs). Yet one can bypass the process of collecting B-terms altogether, at least for the purpose of identifying candidate Ci-literatures (Torvik & Smalheiser, 2007), because the range of possible Ci-literatures is generally known in advance. Given a specific disease (say, A = Parkinson's disease) for which we are seeking novel therapeutic agents, the Ci-literatures might comprise the list of drugs that are FDA-approved for other indications but have not previously been tested in Parkinson's disease. One simply makes a list of all agents within the general category and examines them one by one. In other words, a one-node search can be performed by carrying out a series of two-node ACi searches, in which the output from each search is a score that estimates how good Ci is as a candidate. One simple score is the estimated amount of implicit information shared between the A and Ci literatures (Torvik & Smalheiser, 2007), though it is likely that better rankings will be achieved using a combination of interestingness measures. Certainly, the Bi-terms are not irrelevant to this process, because they are likely to be useful features in calculating the overall scores for each two-node search. Yet they no longer sit as a bottleneck in the discovery system.
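A minimal sketch of this reformulation follows, using raw overlap of title terms between the A and Ci literatures as a stand-in for the estimate of shared implicit information; Torvik and Smalheiser's (2007) actual model filters and weights B-terms, so the bare overlap count here is a deliberately crude proxy.

```python
from itertools import chain

def title_terms(literature):
    """Pool the title words and phrases of all articles in a literature."""
    return set(chain.from_iterable(literature))

def two_node_score(a_lit, c_lit):
    """Crude proxy for shared implicit information: the number of B-terms
    common to the titles of both literatures."""
    return len(title_terms(a_lit) & title_terms(c_lit))

def one_node_via_two_node(a_lit, candidate_lits):
    """One-node search without the B-term bottleneck: directly score every
    pre-enumerated candidate Ci literature against the A literature."""
    scores = {c: two_node_score(a_lit, c_lit)
              for c, c_lit in candidate_lits.items()}
    return sorted(scores.items(), key=lambda item: -item[1])

# Toy data: A = Parkinson's disease literature; the candidates stand in for
# FDA-approved drugs not yet tested in Parkinson's disease.
a_lit = [{"dopamine", "oxidative stress"}, {"mitochondria", "neurodegeneration"}]
candidates = {
    "drug X": [{"oxidative stress", "mitochondria"}],
    "drug Y": [{"gastric acid", "ulcer"}],
}
print(one_node_via_two_node(a_lit, candidates))  # drug X: 2, drug Y: 0
```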

A Phone Call From Don

My first contact with Don Swanson occurred in the early 1990s, when he phoned me to discuss an apparent anomaly in his analyses. Following up on his Mg-migraine hypothesis (Swanson, 1988), he had noticed that Mg seemed to rank highly as a candidate therapy, no matter what neurological disease was under consideration. How could this happen? I said the issue was very simple: Mg is known to gate (i.e., limit) calcium currents through the N-methyl-D-aspartate (NMDA) receptor. Over-stimulation of the NMDA receptor, or over-accumulation of intracellular calcium, causes excitotoxicity, which occurs in many diverse situations (stroke, ALS, seizures, etc.). Thus, a deficiency of Mg should exacerbate excitotoxicity and Mg supplementation should help to counteract it, not just in migraine, but also across many neurological diseases. In fact, our first joint paper pointed this out in the context of individuals who exhibit mild dietary Mg deficiency (Smalheiser & Swanson, 1994). Putting this back in terms of the ABC model, one could say that the candidate Ci = Mg is highly interesting regardless of the specific A literature, at least within a certain range.

Measures to identify emerging research fronts have been the concern of scientometrics and bibliometrics, but these measures have tended to be geared toward policy makers and sociologists—detecting the fronts after they have already started to become "hot." Some areas are not simply "hot" but also have such pervasive implications (noncoding RNAs, prion proteins, microRNAs) that they should arguably be ranked high on any list of possible topics to study, no matter what the specific question or area of interest of the investigator. This is reminiscent of a t-shirt slogan that I have seen: "No matter what the question is… the answer is to do more yoga."

Nevertheless, most scientists are likely to feel that they can identify "hot" areas already. The biggest need, and the biggest "bang for the buck" for LBD, is to identify research areas that are currently neglected but that, when juxtaposed with other information, have the potential to reveal important frontier areas for investigation (Smalheiser & Torvik, 2008; Swanson, 2011). There are many reasons why a line of work may have become neglected, and these need not be discussed here. However, one would like to reconsider and possibly revive those neglected hypotheses or lines of work that are the most interesting when viewed in light of more recent evidence from other scientific fields, even if—perhaps especially if—the original hypotheses were generally thought to be experimentally disproved.

Inheritance of acquired characteristics is a stellar example of a field that, for more than a hundred years, appeared to be a pre-Darwinian relic that had been thoroughly discredited as scientific nonsense. Recent findings in genomics and molecular biology, however, have validated several mechanisms by which environmental stimuli can influence the genome and pass changes to subsequent generations (Landman, 1991; Liu, 2007; Koonin & Wolf, 2009). In fact, this area has quickly become one of the "hottest" in biomedical science. The studies on memory transfer in planarians (discussed above) are another example of a field that was abandoned after the original practitioners had retired, yet sparked a new line of investigation.

Once again, Don Swanson has pioneered the effort to identify neglected research findings, which he conceptualized as a generalization of one-node searching (Swanson, 2011). However, much more work is needed to discern which neglected findings ought to remain that way, which deserve revival, and which (when combined with other findings) create an entirely new and promising hypothesis.

The Problem of Creating Gold Standards for LBD Systems

To evaluate and compare different LBD systems, it is crucial to develop an extensive set of gold standard examples. The very nature of one-node searches and their traditional goal (to identify totally novel hypotheses with no existing experimental support) makes it difficult to establish gold standards (Smalheiser & Torvik, 2008). Some studies have employed a handful of validated one-node searches derived from Swanson's early predictions (Swanson, 1986b, 1988), and others have advocated the use of time-sliced literatures to evaluate LBD methods. In this approach, LBD predictions are based upon an analysis of MEDLINE as of a given date; one then examines MEDLINE articles at later dates to see whether the predictions have subsequently been confirmed, or at least investigated. Another option is to employ lists of known facts or relationships, either extracted from the literature or manually curated, as an external standard for one-node searches (e.g., Homayouni et al., 2005). For example, suppose one is conducting an LBD analysis to predict novel interactions that reelin may have with other proteins. Given a list of proteins known to interact with reelin, a successful LBD method should rank the known interactors highly, even if they are excluded from the final list of predictions due to lack of novelty.
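For instance, evaluation against a curated list of known facts reduces to checking where the known items land in the system's ranked output, as in this sketch (the recall-at-k metric and the toy names are my choices for illustration):

```python
def recall_at_k(ranked_candidates, gold_set, k):
    """Fraction of gold-standard items recovered within the top k of an
    LBD system's ranked output."""
    hits = sum(1 for c in ranked_candidates[:k] if c in gold_set)
    return hits / len(gold_set)

# Toy ranked output of a one-node search for reelin-binding proteins,
# scored against a curated list of known interactors (names invented).
ranking = ["protein 1", "protein 2", "protein 3", "protein 4", "protein 5"]
known_interactors = {"protein 1", "protein 4"}

print(recall_at_k(ranking, known_interactors, k=3))  # 0.5: one of two found
```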

Besides these evaluation methods, one can imagine innovative ways of utilizing other datasets. For example, the Text REtrieval Conference (TREC) Genomics 2006 and 2007 queries resemble one-node searches insofar as they seek to rank articles within a given category (equivalent to the Ci-literatures) in terms of their relevance to a given item or concept (equivalent to literature A). Thus, if one were to apply one-node search systems to these data, one could employ the gold standard TREC results. Another idea is to obtain the abstracts of new R01 and R21 grants that have been funded by the National Institutes of Health, available via the CRISP/RePORTER database. Certainly, at the time a grant was reviewed, a panel of experts agreed that its central aims were novel and promising for further study, and so a good LBD system should be able to identify them and rank them highly. Similarly, new hypotheses proposed in a published review article can be regarded as a gold standard of what (at least certain) experts feel are promising new research directions. The search for ranking strategies and the project to build gold standards should proceed in parallel, covering a variety of strategies, because a strategy intended to identify relevant information will be expected to rank items quite differently than one intended to identify high-risk, paradigm-shifting ideas.

Concluding Thought

The ABC model is not wrong. However, it is only one of several different types of models that can contribute to the development of the next generation of LBD tools. Perhaps the most urgent need is to develop a series of objective literature-based interestingness measures, which can customize the output of LBD systems for different types of scientific investigations. The field of bioinformatics has exploded in the past few years, because of the richness of genomics and proteomics datasets, despite employing (for the most part) relatively simple data mining, statistics, and text-based mining methods. The scientific literature is certainly rich enough, and expanding rapidly enough, for LBD systems to serve as major facilitators of scientific discovery.

Acknowledgements

This article is an expanded version of a presentation made at the 2009 ASIS&T Annual Meeting. It is dedicated to my senior partner, Don Swanson, and my junior partner, Vetle Torvik, who helped shape the ideas in this article and arguably should have been co-authors (except that then I could not make this dedication to them!).
