Discovery in an age of artificial intelligence


Technology will continue to enhance discovery, but there is a sizable gap between today's search tools and true artificial intelligence.

The value of scholarly literature today is assessed by its use – whether authors citing an article, readers downloading it, editors seeking a higher impact factor, funders seeking a broad audience, or librarians calculating cost-per-use. Much of that use begins with a search on Google Scholar or Google, which is often the single largest source of links to articles – far more than most publisher platforms deliver. Given that search tools are the primary means of discovery today, how is discovery changing as artificial intelligence evolves?

In the print environment, journal publishers created the content, and other organizations developed the indexes that identified articles using terms from structured vocabularies. When Eugene Garfield developed citation indexing in the 1950s, he demonstrated that two articles are more likely to be related when they share many of the same references. This approach to discovering related works is especially useful in emerging fields, before a common vocabulary has developed.
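Garfield's insight can be illustrated with a small sketch: treat each article as the set of references it cites and score relatedness by the overlap between those sets. The function and the article data below are hypothetical, purely for illustration; real citation indexes operate at vastly larger scale.

```python
# Citation-overlap sketch: two articles score as more related
# when their reference lists overlap more heavily.
# (Article IDs and reference lists are fabricated for illustration.)

def coupling_score(refs_a: set, refs_b: set) -> float:
    """Jaccard similarity of two articles' reference sets."""
    if not refs_a or not refs_b:
        return 0.0
    return len(refs_a & refs_b) / len(refs_a | refs_b)

article_x = {"ref_a", "ref_b", "ref_c", "ref_d"}
article_y = {"ref_b", "ref_c", "ref_d", "ref_e"}
article_z = {"ref_z"}

print(coupling_score(article_x, article_y))  # heavy overlap -> 0.6
print(coupling_score(article_x, article_z))  # no shared references -> 0.0
```

Because the score depends only on shared references, it works even when two articles use entirely different terminology – exactly the situation in an emerging field.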

During the 1980s, Don Swanson, a professor of information science at the University of Chicago, recognized hidden connections between biomedical articles based on their hypotheses. Subsequent research validated his approach, which pioneered the field of “text-based informatics” – also known as “literature-based discovery” – and served as a precursor to text mining. While there is great interest in text and data mining today, discussions on the topic indicate that progress is incremental and that a surprisingly small number of researchers are taking advantage of the opportunities publishers have enabled.


Discovery today occurs in a global network of content, organizations, and people. The usefulness of content increases as it is linked not just to other content but to other entities as well. To distinguish these entities unambiguously, identifiers are being introduced for authors (ORCID), their institutions (ISNI, Ringgold), and their funders (FundRef). With these entities identified, it is easy to envision a geographic map of the research landscape revealing which funders support faculty conducting research in specific disciplines at locations around the world.
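The payoff of persistent identifiers can be sketched in a few lines: once each article record carries unambiguous author, institution, and funder IDs, mapping the research landscape reduces to a simple aggregation. All identifiers and records below are fabricated for illustration.

```python
# Sketch: with persistent identifiers attached to article records,
# "which funders support which fields?" becomes a simple group-by.
# (ORCID, Ringgold, and funder values here are invented examples.)

from collections import Counter

articles = [
    {"orcid": "0000-0001-0000-0001", "ringgold": "1234", "funder": "NIH", "field": "genomics"},
    {"orcid": "0000-0002-0000-0002", "ringgold": "1234", "funder": "NIH", "field": "genomics"},
    {"orcid": "0000-0003-0000-0003", "ringgold": "5678", "funder": "NSF", "field": "physics"},
]

# Count articles by (funder, discipline) pair.
by_funder_field = Counter((a["funder"], a["field"]) for a in articles)
print(by_funder_field)  # Counter({('NIH', 'genomics'): 2, ('NSF', 'physics'): 1})
```

The same pattern, grouped instead by institution identifier, would yield the geographic view described above; without shared identifiers, name variants make this aggregation unreliable.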

Content networks are expanding to include components of the research process, such as datasets and videos, that are not easily accommodated within the existing publication structure. In addition to books and journals, there is a need for more timely and diverse communication formats, such as blogs, which support debate within a discipline or address research in ways that are accessible to a wider audience.

New tools are being introduced to support these additional formats and connect them with more established forms of scholarship. Figshare provides a DOI for figures, data, images, and video to make these elements more discoverable and accessible. Altmetric recognizes and measures the use of content across different types of outlets, including blogs, news, Twitter, Wikipedia, policy documents, LinkedIn, YouTube, and other sources.

At a time when the role of traditional subject indexes is challenged by the major search engines, two new indexes were recently launched at the American Library Association’s Midwinter Meeting. Both are curated collections of freely available content.

  • The founders of 1Science created an index of peer-reviewed open access (OA) journals, acknowledging that a tipping point had been reached in the volume of open literature. They have since added a service that compares a library’s subscriptions with OA journals to help the library evaluate how much paid content it needs to acquire.
  • ACI's Scholarly Blog Index is a curated collection of 10,000+ blogs across all disciplines. In addition to being selected and indexed, the content is preserved for future use.

Academic networks for individual researchers (Mendeley, ResearchGate) serve as another source of discovery. Two new companies are mapping institutions and their researchers for two very different markets.

  • ResearchConnection provides an institutional view of academic research and researchers.
  • Expernova serves the corporate market by identifying the expertise of researchers within companies and across different industries to reveal where innovation is occurring.

Collectively these different tools reveal elements of the current global research environment that will be useful in building smarter search results.


Inspired by the promise of the Semantic Web, some large publishers and organizations are defining the relationships between the entities in their articles, creating the infrastructure necessary to inform enhanced search results.

Technology companies such as IBM, AI2 and Meta are developing sophisticated technologies that focus on a specific body of research literature and seek to deliver intelligent results.

  • IBM created Watson, which uses natural language processing and computational power to analyze vast amounts of unstructured data. Watson evaluates the literature in a given area, which is curated with human support. Quality content is then indexed, and Watson, using machine learning, is trained with human help to understand linguistic patterns. Watson has applications in data mining, financial modeling, and molecular dynamics.
  • The Allen Institute for Artificial Intelligence (AI2), backed by Microsoft co-founder Paul Allen, launched Semantic Scholar, which uses natural language processing, data mining, and machine learning to understand the content of articles. Starting with three million open access articles in computer science, the software selects keywords and phrases from the text and can determine which of a paper’s cited references actually influenced it rather than being cited incidentally.
  • Meta (formerly Sciencescape) has created a knowledge graph that uses machine intelligence to understand the papers, people, and entities in biomedical research. The company recently incorporated software that enables it to detect emerging research areas, and it offers services to publishers.
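Semantic Scholar's actual model for separating influential from incidental citations is a trained classifier and is not described here, but the intuition can be sketched with a toy heuristic: references that are mentioned repeatedly, or that appear in the methods or results sections, are more likely to have shaped the work. The feature weights and threshold below are invented purely for illustration.

```python
# Toy sketch of influential-vs-incidental citation scoring.
# The real systems use trained classifiers over many features;
# this version uses two intuitive signals with invented weights:
# how often a reference is mentioned in the text, and whether it
# appears in a methods or results section.

def influence_score(mention_count: int, in_methods_or_results: bool) -> float:
    score = 0.4 * min(mention_count, 5)   # repeated mentions suggest reliance
    if in_methods_or_results:
        score += 1.0                      # used in the work itself, not just background
    return score

def is_influential(mention_count: int, in_methods_or_results: bool,
                   threshold: float = 1.5) -> bool:
    return influence_score(mention_count, in_methods_or_results) >= threshold

print(is_influential(4, True))    # cited often, used in methods -> True
print(is_influential(1, False))   # single background mention -> False
```

Even this crude version captures the key point: where and how often a reference is used carries more signal than the bare fact of citation.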


Over the years, discovery of the research literature has evolved from relying on the meaning of words to identifying the meaning of relationships between entities within a document. The ability of software to parse unstructured data makes a much larger body of the digital literature accessible.

When will discovery be replaced by artificial intelligence? It is safe to say not in the near future. Both Siri and Deep Blue are considered Narrow AI, since their functions are limited and highly focused. General AI is typically defined as being able to do what a human can, and Super AI as exceeding human intelligence. Despite the rapid progress of technology and robotics, machines are far from being able to perform any intellectual task that a human can.

As networks of content expand and the utility of search tools increases, discovery will continue to improve, moving toward returning all relevant – and only relevant – results. The next level for many of us will be systems that correctly anticipate what we want to know and provide it before we ask. That is a different set of requirements, and one that would save us all a great deal of time.


Judy Luther