Beyond Data Mining: Delivering the Next Generation of Service from Library Data
Lynn Silipigni Connaway and Timothy J. Dickey
Several WorldCat data mining projects have not only produced theoretical papers and presentations (O'Neill, Connaway & Dickey, 2008; Connaway, O'Neill & Prabha, 2004; Connaway & Heard, 2005; Connaway & Olszewski, 2006; Connaway, Snyder & Olszewski, 2005; Olszewski, Connaway & Snyder, 2005) but have also resulted in prototypes that are being integrated into library products and services. Three of these projects will be discussed and demonstrated: Audience Level, Publisher Name Server, and WorldMap. The discussion will highlight new prototypes, research advances, and interoperability with other products and services developed since the subject was introduced in 2006 (Nicholson, Connaway & Molyneux, 2006).
Audience Level (http://audiencelevel.oclc.org/), a tool that utilizes WorldCat bibliographic data, has been integrated into the WorldCat discovery environment and tested in others. It provides both a numerical and graphical representation of the ideal audience for a monograph, based on the types of libraries that hold the titles in WorldCat. The audience level for a book theoretically represents the type of reader for which the resource is most appropriate, and thus can improve both collection assessment and the development of a ranking system for discovery. Estimating the audience level also can enhance resource discovery by increasing the relevance of items retrieved. The researchers hypothesized that the audience level could be estimated from the types of libraries – research, academic, public, and school – that have acquired the resource. WorldCat includes a holding symbol for every member library holding an item represented in the database. Each holding represents a discrete selection decision implying that the material is relevant to the library's patrons and is consistent with the library's collection development strategy. Thus, the totality of these individual decisions can serve as an indicator of audience level.
The Audience Level tool extends the difficulty effect described by White (1995) and further tested by Lesniaski (2004) and Bernstein (2005) in earlier data mining projects within WorldCat. These earlier researchers had asserted that more “difficult” works would be held by fewer libraries, and drew conclusions regarding audience accordingly. The OCLC Audience Level goes beyond this framework and considers the types of libraries holding the resource – academic, public, or school. By assigning a weight to each type of library that owns the title in WorldCat, an audience level can be calculated for each title based on the aggregate of library holdings. An algorithm was developed on this basis to estimate the audience level for each WorldCat resource. The audience level is determined in two steps. First, a weighted holdings value is derived, either using the target audience in the 008 field from the bibliographic record, or (more often) based on the types of libraries holding the resource. This weighted holdings value is a numeric value between zero and one. In the second step, the weighted holdings value is converted to a percentile to form the audience level. For example, a final Audience Level score of 0.66 indicates that 34% of the books in WorldCat have a higher audience level while 66% have a lower value. Researchers developed and conducted two test methodologies to systematically evaluate their calculations, and found no significant difference between the programmatically assigned Audience Level and human-assigned rankings.
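The two-step calculation can be illustrated with a minimal sketch. It is not the OCLC algorithm: the weights in LIBRARY_TYPE_WEIGHTS, the function names, and the sample titles are all assumptions made for illustration, since the actual weights are not given here, and the alternative path that uses the target audience in the 008 field is left out. The sketch takes a holdings-weighted average of library-type weights for each title (step one) and then ranks each title against the others as the fraction of titles with a lower value (step two).

```python
from bisect import bisect_left

# Hypothetical weights per library type; the real values are not published here.
LIBRARY_TYPE_WEIGHTS = {"school": 0.1, "public": 0.4, "academic": 0.8}


def weighted_holdings(holdings):
    """Step 1: average library-type weight, weighted by holding counts (0 to 1)."""
    total = sum(holdings.values())
    if total == 0:
        raise ValueError("title has no holdings")
    weighted = sum(LIBRARY_TYPE_WEIGHTS[kind] * n for kind, n in holdings.items())
    return weighted / total


def audience_levels(catalog):
    """Step 2: convert each weighted value to the fraction of titles ranked lower."""
    values = {title: weighted_holdings(h) for title, h in catalog.items()}
    ordered = sorted(values.values())
    count = len(ordered)
    # bisect_left counts how many titles have a strictly lower weighted value.
    return {title: bisect_left(ordered, v) / count for title, v in values.items()}


# Invented sample data: holding counts by library type for four titles.
catalog = {
    "picture book": {"school": 80, "public": 20},
    "popular novel": {"public": 90, "academic": 10},
    "textbook": {"public": 10, "academic": 90},
    "monograph": {"academic": 50},
}
levels = audience_levels(catalog)
```

In this toy catalog the title held only by academic libraries receives the highest level and the title held mostly by school libraries the lowest; with only four titles the percentiles are coarse, whereas across all of WorldCat they would be far more finely graded.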
The Publisher Name Server (http://www.oclc.org/research/projects/publisherns/) evolved as a method to enhance WorldCat data mining based upon publishers and their content specializations. Using the publisher data within the MARC record, this project maps WorldCat data onto an experimental authority file of publishing entities. To date, more than 50,000 variant strings have been mapped onto more than 1,650 high-incidence publishing entities from a dozen countries. They include the top 25 entities (by ISBN prefix) in WorldCat from the United States, the top 20 from Great Britain, the top 10 from Canada, Australia, Germany, France, the Netherlands, Japan, Italy, China, the Russian Federation, Spain, Finland, Australia, Taiwan, and New Zealand, as well as the top 10 university presses. In addition, any publisher involved in a merger or acquisition since 2001 has been included. Finally, any imprints or related entities that can be identified are included. Materials from these high-profile entities represent approximately 7 million records within WorldCat.
A variety of information has been captured about these entities by a combination of data mining and human research. A single authoritative string represents each entity, 93% of which correspond to the name in the Library of Congress' Name Authority File (110 field), Bowker's Books in Print, or the International ISBN Registry (K. G. Saur). Former names and variant names associated with each entity, numbering some 53,000 strings, have been mined from the 260b subfield in WorldCat and mapped onto each entity. Information regarding relationships among publishers is difficult to keep current in a climate of mergers and acquisitions; the Publisher Name Server nevertheless records these relationships in a flexible form, classifying them as acquisitions, imprints, subsidiaries, mergers, joint ventures, and the like. The publishers in question could represent more than 10% of the contents of WorldCat, and the associations within this dataset will enable more in-depth collection analysis and comparison within the global database. Aggregate composites of the publications in WorldCat from two major publishers reveal interesting differences in their profiles, and support the methodology.
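The kind of record and lookup described above might be sketched as follows. This is an illustrative data structure only, not the Publisher Name Server's actual schema or matching logic: the PublisherEntity class, the normalize and resolve functions, and the publisher names "Example House" and "Sample Press" are all invented for the example. Each entity carries one authoritative string, a set of variant strings such as those mined from the 260b subfield, and a list of typed relationships; a variant is resolved by normalizing case, punctuation, and spacing before an exact lookup.

```python
import string
from dataclasses import dataclass, field

# Translation table that turns every ASCII punctuation mark into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


@dataclass
class PublisherEntity:
    """One publishing entity in a hypothetical authority file."""
    authoritative_name: str
    variants: set = field(default_factory=set)
    # Typed links such as ("imprint", "Sample Press"); types follow the text above.
    relationships: list = field(default_factory=list)


def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(name.lower().translate(_PUNCT_TO_SPACE).split())


def build_index(entities):
    """Map every normalized authoritative or variant string to its entity."""
    index = {}
    for entity in entities:
        for name in {entity.authoritative_name, *entity.variants}:
            index[normalize(name)] = entity
    return index


def resolve(publisher_string, index):
    """Return the authoritative name for a raw publisher string, or None."""
    entity = index.get(normalize(publisher_string))
    return entity.authoritative_name if entity else None


# Invented sample entity with two variant strings and one relationship.
example = PublisherEntity(
    authoritative_name="Example House",
    variants={"Example Hse", "Example House, Inc."},
    relationships=[("imprint", "Sample Press")],
)
index = build_index([example])
resolved = resolve("EXAMPLE HSE.", index)
```

Exact matching after normalization is the simplest possible choice; a production mapping of tens of thousands of variant strings would presumably also need human review, as the combination of data mining and human research described above suggests.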
Another project that utilizes WorldCat data is the OCLC WorldMap prototype (http://worldmap.oclc.org/). It geographically represents WorldCat and other source data to enhance users' interaction with the data through data visualization. Data visualization, an interactive and sensory layer placed on top of a mined dataset, is a relatively new field for practical library applications. The data in this case include library and bibliographical data on a global level, including country-specific data for WorldCat holdings, languages represented in WorldCat for items published in each country, and total titles published within the country. Data from other reference and statistical sources include a breakdown by type of library for the country's total number of libraries, library volumes, certified/degreed librarians, registered library users in each country, library expenditures (in US $), cultural heritage institutions (museums and archives), and publishers. The WorldMap presents this rich and varied dataset in an interactive and graphical interface involving user interaction with both maps and comparative charts. The WorldMap also encapsulates the breadth of its data in a format users may process at a glance.
Together, these prototypes and research projects not only provide librarians with data for decision-making in collection and service development, but also provide users with enhanced discovery and access methods. They use data collected and stored by libraries to develop enhanced discovery and delivery services that meet the different needs of users.