Beyond (simple) reading: Strategies, discoveries, and collaborations



Scientists increasingly use the scientific literature in ways that go beyond traditional reading. This panel explores three ways, other than traditional reading, that the computer can help us exploit scientific articles: the strategic reading of large numbers of articles, the discovery of implicit assertions, and the identification of collaboration patterns. A respondent will speculate on where these trends will take us, and what sort of outfits should be packed for the trip.


Allen H. Renear: Strategic reading in science

Renear is an Associate Professor in the Graduate School of Library and Information Science at the University of Illinois at Urbana-Champaign, where he teaches courses and conducts research in information modeling and digital publishing. His current research is on developing strategies for integrating scientific ontologies with the global STM publishing system.

Neil R. Smalheiser: Literature-based discovery

Smalheiser is a neuroscientist and an Associate Professor in the Department of Psychiatry and a member of the Psychiatric Institute at the University of Illinois at Chicago. He has 20 years of experience in biochemical, molecular, developmental, and cellular studies of nerve cells and is one of the pioneers in the study of small RNAs and RNA interference in the nervous system. He is also an international leader in the development of new data-mining techniques in medical informatics and bioinformatics.

Vetle I. Torvik [organizer]: A multidimensional model of scientific collaborations

Torvik is a Visiting Assistant Professor in the Graduate School of Library and Information Science at the University of Illinois at Urbana-Champaign, where he teaches courses in text/data mining, informetrics, literature-based discovery, and bioinformatics. His research addresses problems related to scientific discovery and collaboration, and focuses on fundamental, powerful mathematical properties such as monotonicity, transitivity, and multidimensionality.

Catherine C. Marshall [respondent]

Marshall is a Senior Researcher at Microsoft Corporation and an affiliate of the Center for the Study of Digital Libraries at Texas A&M University. Previously a long-time member of the research staff at Xerox PARC, she has conducted seminal research on reading, hypertext, annotation, and digital libraries.


Working with the literature is an essential everyday task for almost all scientists. Although scientists have long deviated from simple sequential reading in a variety of ways, the situation today is on the cusp of a revolution. The field of text mining has arisen to assist investigators in identifying, summarizing, and integrating the key findings relevant to a given scientific question (Zweigenbaum et al. 2007). These new tools promote new reading strategies by helping to prioritize reading selections, by piecing together snippets of text related to a given scientific concept or question, or by condensing a large body of literature. Whereas information retrieval and information extraction techniques deal with information that is stated explicitly in a scientific article, literature-based discovery tools help identify assertions that are implicit across two or more articles (Smalheiser, Torvik, & Zhou 2009). Some tools go even further and transform the existing literature into entirely new entities, such as co-authorship networks (Newman 2004), that have little (if anything) to do with reading but open up new lines of scholarly investigation. It is often argued that scientists need automated tools to keep up with the literature because of its explosive growth. Yet these tools appeal to scientists not just out of necessity, but also because of other driving factors such as curiosity or “scientific opportunism” (Snyder 2005). The opportunities, we argue, are here now.

1. Strategic reading in science

Allen H. Renear and Carole L. Palmer

Scientists increasingly engage with the literature as if they were playing a fast-paced video game. Using indexing and retrieval systems such as Google Scholar, Scopus, or Web of Science, they skim quickly through scores or even hundreds of documents, refining queries, tracking references backwards and citations forwards, hunting for equations, protocols, and formulas, glancing at graphs and captions, copying equations or chemical formulae for processing in specialized software tools, and, always, dodging publisher sites to find open access copies (Palmer 2007; Renear 2006; Nicholas et al. 2004; Nicholas et al. 2006). We call this “strategic reading.” The activity itself is not new, but the scale, speed, and functionality provided by computer tools and digital documents result in an experience that is vastly different from traditional research with paper journals. Studies of reading behavior confirm that scientists are reading more articles, more quickly (Tenopir 2006).

With the recent explosive development of interoperable scientific ontologies (such as the Gene Ontology), these trends will continue and intensify. Although originally designed for data integration, these ontologies will soon become part of the scientific publishing workflow and support even more advanced browsing and reading tools. Ontology-based retrieval and browsing tools are already in common use in the biomedical sciences (Kim & Rebholz-Schuhmann 2008).

2. Literature-based discovery

Neil R. Smalheiser

Literature-based discovery (LBD) is a strategy for uncovering novel scientific hypotheses from the literature. The fundamental idea behind LBD is that one can bring together explicit statements (taken from different scientific papers) to form implicit assertions. Although most research has focused on computer tools and methods (Weeber & Bruza 2008), LBD is also a cognitive process that scientists routinely follow in their daily work. A common scenario involving implicit information arises when an investigator finds experimentally that two phenomena, previously thought to be unrelated, are unexpectedly related in some meaningful way, and would like to find existing knowledge that might shed light on potential mechanistic links between them. Alternatively, an investigator may hypothesize that a link exists between two disparate phenomena, and wish to assess whether the existing literature provides any implicit support for the hypothesis that would encourage experimental testing. Thus, computer-assisted LBD tools should fit nicely into the scientific workflow. The Arrowsmith two-node search tool finds terms that are shared across two sets of articles within MEDLINE and displays them in a manner that facilitates human assessment. It has been implemented in free, public web interfaces (Smalheiser, Torvik, & Zhou 2009). Recent quantitative modeling has greatly simplified and streamlined the process of carrying out Arrowsmith searches. These developments, taken together, make computer-assisted LBD a useful strategy for the general scientific community.
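
The core idea of a two-node search, finding terms shared across two sets of articles, can be illustrated with a minimal Python sketch. This is not the Arrowsmith implementation: the tokenizer, the stopword list, the ranking rule, and the function names `terms` and `shared_terms` are all assumptions made for illustration, and the article titles are invented toy data rather than MEDLINE records.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a curated resource).
STOPWORDS = {"a", "and", "in", "of", "the"}

def terms(text):
    """Return the set of lowercase word-like terms in a title, minus stopwords."""
    words = re.findall(r"[a-z][a-z0-9-]+", text.lower())
    return {w for w in words if w not in STOPWORDS}

def shared_terms(literature_a, literature_c):
    """List terms that occur in both article sets.

    Terms are ranked by the smaller of their two document counts
    (a hypothetical ranking rule), with ties broken alphabetically.
    """
    count_a = Counter(t for doc in literature_a for t in terms(doc))
    count_c = Counter(t for doc in literature_c for t in terms(doc))
    shared = set(count_a) & set(count_c)
    return sorted(shared, key=lambda t: (-min(count_a[t], count_c[t]), t))

# Invented toy titles standing in for two separate literatures.
literature_a = [
    "Dietary fish oil reduces blood viscosity",
    "Fish oil and platelet aggregation",
]
literature_c = [
    "Blood viscosity in Raynaud's disease",
    "Platelet aggregation in Raynaud patients",
]

for term in shared_terms(literature_a, literature_c):
    print(term)
```

In this toy case the shared terms suggest candidate links between the two literatures that no single title states. Judging whether such a link is meaningful remains a task for the investigator.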

3. A multidimensional model of scientific collaborations

Vetle I. Torvik

Large-scale co-authorship networks open up a window into the complex patterns of collaborative behavior and provide a rich arena for scholarly investigation. This is not a new idea (Newman 2004), but in order to create a realistic and informative model of collaboration behavior, at least five different types of data need to be collected and combined:
a) Information about individuals and their publication behavior such as their gender, affiliation, geographical location, seniority, expertise, topics, publication dates, productivity, collaborators, citations, funding, and patents.
b) Measures relating pairs of individuals such as mentorship (senior vs. junior), topical similarity, social network proximity, geographical proximity, and expertise complementarity.
c) Measures that go beyond pairs to describe the roles of individuals and structures in the social network such as hubs, bridges, isolates, large communities, smaller cliques, central cores, periphery, and hierarchy.
d) Multiple time scales – collaboration behavior has changed collectively over the years, and usually changes during the course of an individual's career.
e) Interactive factors that take into account patterns that are context or discipline dependent – some individuals may tend to collaborate preferentially with others in their own field, but this may not apply to, for instance, statisticians.

Accurate and efficient collection of this data can be supported by Author-ity (Torvik et al. 2005, Torvik & Smalheiser 2009), an author-centered database of disambiguated names. Besides providing a better understanding of collaborative behavior, a multidimensional model can also inform the process of developing recommender tools that could, for example, assist scientists in finding potential collaborators – or co-panelists.
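
As a rough illustration of the simplest of these ingredients, a co-authorship network can be derived from per-paper author lists. The Python sketch below is an assumption-laden toy, not the model described above: it presumes author names are already disambiguated, it covers only pairwise co-authorship counts and one simple structural measure, and the function names `coauthor_network` and `collaborator_counts` and the single-letter author labels are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def coauthor_network(papers):
    """Map each unordered author pair to the number of papers they share.

    `papers` is a list of author lists; names are assumed to be
    disambiguated already (one string per real individual).
    """
    weights = defaultdict(int)
    for authors in papers:
        # Sorting makes (a, b) and (b, a) the same key.
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return dict(weights)

def collaborator_counts(weights):
    """Count distinct collaborators per author (the node degree)."""
    neighbors = defaultdict(set)
    for a, b in weights:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {author: len(others) for author, others in neighbors.items()}

# Invented toy data: three papers by four placeholder authors.
papers = [["A", "B"], ["A", "B", "C"], ["C", "D"]]

network = coauthor_network(papers)
degrees = collaborator_counts(network)
print(network)
print(degrees)
```

Authors with many distinct collaborators are candidate hubs. Richer measures such as topical similarity, seniority, or changes over time would require the additional data types listed above.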

4. Response

Catherine C. Marshall