Transforming data into services: Delivering the next generation of user-oriented collections and services



For decades, the corporate sector has exploited technological advances to better market and deliver products and services to customers via data mining. These techniques were not widely used in libraries; however, with the current emphasis on evidence-based decision making, libraries are beginning to utilize their system- and user-generated data. Data mining usually involves a significant endeavor to extract embedded, previously undiscovered, and potentially useful information from large data sets (Mitra & Acharya, 2003; Hand, Mannila, & Smyth, 2001; Frawley, Piatetsky-Shapiro, & Matheus, 1992; Piatetsky-Shapiro & Frawley, 1991).

Librarians are using these data mining techniques to improve both internal decision-making and external user services by extracting information from operational datasets of both bibliographic and user data. OCLC Research has taken advantage of the WorldCat database, which includes more than 95 million bibliographic records and 1.2 billion holdings records, as well as data provided by other major library systems and consortia, to develop user-oriented collections and services.

Beyond Data Mining: Delivering the Next Generation of Service from Library Data

Lynn Silipigni Connaway and Timothy J. Dickey

Several WorldCat data mining projects not only have produced theoretical papers and presentations (O'Neill, Connaway & Dickey, 2008; Connaway, O'Neill & Prabha, 2004; Connaway & Heard, 2005; Connaway & Olszewski, 2006; Connaway, Snyder & Olszewski, 2005; Olszewski, Connaway & Snyder, 2005) but also have resulted in prototypes that are being integrated into library products and services. Three of these projects will be discussed and demonstrated – Audience Level, Publisher Name Server, and WorldMap. New prototypes, research advances, and interoperabilities with other products and services will be highlighted since the introduction of the subject in 2006 (Nicholson, Connaway & Molyneux, 2006).

Audience Level, a tool that utilizes WorldCat bibliographic data, has been integrated into the WorldCat discovery environment and tested in others. It provides both a numerical and graphical representation of the ideal audience for a monograph, based on the types of libraries that hold the titles in WorldCat. The audience level for a book theoretically represents the type of reader for which the resource is most appropriate, and thus can improve both collection assessment and the development of a ranking system for discovery. Estimating the audience level also can enhance resource discovery by increasing the relevance of items retrieved. The researchers hypothesized that the audience level could be estimated from the types of libraries – research, academic, public, and school – that have acquired the resource. WorldCat includes a holding symbol for every member library holding an item represented in the database. Each holding represents a discrete selection decision implying that the material is relevant to the library's patrons and is consistent with the library's collection development strategy. Thus, the totality of these individual decisions can serve as an indicator of audience level.

The Audience Level tool extends the difficulty effect described by White (1995) and further tested by Lesniaski (2004) and Bernstein (2005) in earlier data mining projects within WorldCat. These earlier researchers had asserted that more "difficult" works would be held by fewer libraries, and drew conclusions regarding audience accordingly. The OCLC Audience Level goes beyond this framework and considers the types of libraries holding the resource – research, academic, public, or school. By assigning a weight to each type of library that owns the title in WorldCat, an audience level can be calculated for each title based on the aggregate of library holdings. An algorithm was then developed to estimate the audience level for each WorldCat resource. The audience level is determined in two steps. First, a weighted holdings value is derived, either using the target audience in the 008 field of the bibliographic record or (more often) based on the types of libraries holding the resource. This weighted holdings value is a numeric value between zero and one. In the second step, the weighted holdings value is converted to a percentile to form the audience level. For example, a final Audience Level score of 0.66 indicates that 34% of the books in WorldCat have a higher audience level while 66% have a lower value. Researchers developed and conducted two test methodologies to systematically evaluate their calculations, and found no significant difference between the programmatically assigned Audience Level and human-assigned rankings.
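The two-step calculation can be sketched as follows. The library-type weights here are hypothetical placeholders (OCLC's actual weights are not given in this abstract), but the structure mirrors the description: a weighted holdings value in [0, 1], then a percentile rank against the rest of the corpus.

```python
from bisect import bisect_left

# Hypothetical weights per library type; the actual OCLC weights differ.
WEIGHTS = {"research": 1.0, "academic": 0.7, "public": 0.4, "school": 0.1}

def weighted_holdings_value(holdings):
    """holdings: dict mapping library type -> number of holding libraries.
    Returns a weighted average between zero and one (step one)."""
    total = sum(holdings.values())
    if total == 0:
        return 0.0
    return sum(WEIGHTS[t] * n for t, n in holdings.items()) / total

def audience_level(value, corpus_values):
    """Step two: convert a weighted holdings value to a percentile, i.e.
    the fraction of the corpus with a lower weighted holdings value."""
    ranked = sorted(corpus_values)
    return bisect_left(ranked, value) / len(ranked)
```

A title held by three research libraries and one school library would score (3 × 1.0 + 1 × 0.1) / 4 = 0.775 in step one; its final Audience Level is then the percentile of 0.775 among all titles' values.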

The Publisher Name Server evolved as a method to enhance WorldCat data mining based upon publishers and their content specializations. Using the publisher data within the MARC record, this project maps WorldCat data onto an experimental authority file of publishing entities. To date, more than 50,000 variant strings have been mapped onto more than 1,650 high-incidence publishing entities from more than a dozen countries. They include the top 25 entities (by ISBN prefix) in WorldCat from the United States; the top 20 from Great Britain; the top 10 from Canada, Australia, Germany, France, the Netherlands, Japan, Italy, China, the Russian Federation, Spain, Finland, Taiwan, and New Zealand; and the top 10 university presses. In addition, any publisher involved in a merger or acquisition since 2001 has been included. Finally, any imprints or related entities that can be identified are included. Materials from these high-profile entities represent approximately 7 million records within WorldCat.

A variety of information has been captured about these entities by a combination of data mining and human research. A single authoritative string represents each entity, 93% of which correspond to the name in the Library of Congress' Name Authority File (110 field), Bowker's Books in Print, or the International ISBN Registry (K. G. Saur). Former and variant names associated with each entity, numbering some 53,000 strings, have been mined from the 260 subfield b in WorldCat and mapped onto each entity. Information regarding relationships among publishers has been difficult to keep current in a climate of mergers and acquisitions; the Publisher Name Server does, however, contain flexible information regarding these relationships, classified as acquisitions, imprints, subsidiaries, mergers, joint ventures, and the like. The publishers in question could represent more than 10% of the contents of WorldCat, and the associations within this dataset will enable more in-depth collection analysis and comparison within the global database. Aggregate composites of the publications in WorldCat from two major publishers reveal interesting differences in their profiles, and support the methodology.
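The variant-to-authoritative mapping can be illustrated with a minimal sketch. The entity names and variants below are invented examples, not entries from the actual Publisher Name Server, which holds tens of thousands of mined 260 $b strings; the normalization rules are likewise an assumption about the kind of cleanup such a mapping requires.

```python
import re

# Illustrative mapping from normalized variant strings (as might be mined
# from the MARC 260 $b subfield) to a single authoritative entity name.
VARIANTS = {
    "harpercollins": "HarperCollins Publishers",
    "harper collins": "HarperCollins Publishers",
    "harper & row": "HarperCollins Publishers",  # former name, pre-merger
}

def normalize(raw):
    """Lower-case a 260 $b string and strip punctuation and extra spaces."""
    s = re.sub(r"[^\w&\s]", "", raw.lower())
    return re.sub(r"\s+", " ", s).strip()

def authoritative_name(raw):
    """Return the authoritative entity for a variant string, or None."""
    return VARIANTS.get(normalize(raw))
```

Normalizing before lookup absorbs the cataloging noise (trailing commas, inconsistent spacing) typical of transcribed publisher statements, so many raw strings collapse onto one entity.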

Another project that utilizes WorldCat data is the OCLC WorldMap prototype, which geographically represents WorldCat and other source data to enhance users' interaction with the data through data visualization. Data visualization, an interactive and sensory layer placed on top of a mined dataset, is a relatively new field for practical library applications. The data in this case include library and bibliographical data on a global level, including country-specific data for WorldCat holdings, languages represented in WorldCat for items published in each country, and total titles published within the country. Data from other reference and statistical sources include a breakdown by type of library for each country's total number of libraries, library volumes, certified/degreed librarians, registered library users, library expenditures (in US dollars), cultural heritage institutions (museums and archives), and publishers. The WorldMap presents this rich and varied dataset in an interactive, graphical interface involving user interaction with both maps and comparative charts. The WorldMap also encapsulates the breadth of its data in a format users may process at a glance.

Together, these prototypes and research projects not only provide librarians data for decision-making for collection and service development, but also provide users with enhanced discovery and access methods. They utilize data collected and stored by libraries to develop enhanced discovery and delivery services to meet the different needs of users.

Analysis of Collection Use in the OhioLINK Library Consortium

Edward T. O'Neill and Julia Gammon

With increasing demand for resources and constrained budgets, it is important for librarians to use data to make informed decisions. The OhioLINK consortium provides library resources statewide for 85 academic institutions in Ohio (USA) that serve 600,000 faculty, staff, students, and researchers. OhioLINK institutions are a diverse group of large and small, public and private libraries. Cooperative collection development and resource sharing have been practiced by OhioLINK to reduce unnecessary duplication, to stretch budgets, to strengthen the aggregate statewide collection, and to share library resources. The OhioLINK Collection Building Task Force is conducting a study that compares circulation data to the libraries' book collections. The circulation data will be captured from the consortium's borrowing system and matched to the bibliographic records in OCLC's WorldCat database.

The project goal is to collect, analyze, and compare the book circulation data against the libraries' book collections. Circulation data gathered from the institutions will be validated prior to the analysis. Most of the data collected will include an OCLC number. For records lacking an OCLC number, the ISBN or LCCN will be used to link the circulation records to the WorldCat bibliographic records. FRBR (Functional Requirements for Bibliographic Records) will serve as the model for the analysis by linking the circulation data to WorldCat records using OCLC's FRBR work-set algorithm. The research questions to be addressed include: 1) What materials are not used or are underused? 2) Are there similar usage and collection patterns between the large research universities and the small community colleges? 3) Are there too few books in some disciplines and too many in others? 4) Are the books appropriately distributed across institutions? 5) What books are the best candidates for remote or compact storage?
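The identifier-fallback matching described above can be sketched as a simple lookup chain. The record and index structures here are hypothetical; the abstract specifies only the order of preference (OCLC number first, then ISBN or LCCN for records lacking one).

```python
def link_to_worldcat(circ_record, by_oclc, by_isbn, by_lccn):
    """Link a circulation record to a WorldCat bibliographic record,
    preferring the OCLC number and falling back to ISBN, then LCCN.
    The index arguments are hypothetical dicts keyed by each identifier."""
    for key, index in (("oclc", by_oclc), ("isbn", by_isbn), ("lccn", by_lccn)):
        ident = circ_record.get(key)
        if ident and ident in index:
            return index[ident]
    return None  # unmatched record, to be excluded from the analysis
```

Once linked, each circulation record can be rolled up to its FRBR work set, so that usage of different editions of the same work is analyzed together.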

Using WorldCat Data for Evidence of Copyright

Brian Lavoie

Determination of copyright status is becoming an increasingly important element in many aspects of collection management, particularly in regard to digitization activities. Unfortunately, the copyright status of many information resources is unclear, and making a reliable assessment often requires evaluating evidence from a variety of sources. To assist libraries and other organizations in determining copyright status, and to reduce the effort involved in copyright status investigations, OCLC is deploying the Registry of Copyright Evidence (RCE) service. The RCE will allow users to search, update, and comment on copyright evidence related to a particular resource, drawn from multiple sources, as well as to share the results of their investigations with other organizations. This talk will report on research done in support of the development of the RCE. Topics covered include identifying data points in MARC records relevant to determining copyright status; issues involved in, and the results of, mining the WorldCat bibliographic database for these data points; work involved in matching WorldCat copyright evidence to other sources of copyright evidence, such as the US Copyright Office database; and, finally, examples of how rules-based algorithms can be applied to copyright evidence to help assess the copyright status of an item.
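A rules-based assessment of the kind mentioned can be sketched with a deliberately simplified rule set for US-published works. These are only the well-known rules of thumb (pre-1923 publications are public domain; US works published 1923-1963 required renewal to remain in copyright); the RCE's actual rules weigh many more evidence points, and the function signature is an assumption.

```python
def copyright_status(pub_year, place, renewal_found):
    """Illustrative, simplified rules for assessing US-published works
    from two common evidence points: publication year and renewal record.
    Returns a tentative status string, not a legal determination."""
    if place != "US":
        return "undetermined"  # non-US works follow different rules
    if pub_year < 1923:
        return "public domain"
    if pub_year <= 1963 and not renewal_found:
        return "public domain (no renewal)"
    return "likely in copyright"
```

Matching a WorldCat record's imprint data against the US Copyright Office database supplies the `renewal_found` evidence point for the 1923-1963 window.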