Old data, new scheme: An exploration of metadata migration using expert-guided computational techniques

Authors


Abstract

This paper explores the evaluation of metadata quality within the context of migration to new conceptual models. While metadata quality and interoperability are commonly studied areas, few studies have explored the issues introduced when migrating metadata to models that require a re-definition of metadata record-to-resource relationships. In order to explore this issue, this study asks the question “How do human and computational techniques compare with regard to the creation of FRBR work-sets from existing bibliographic metadata?” The study compared cataloger assessment of FRBR relationships in a selection of 848 MARC records with automatically generated work-sets. The Levenshtein edit-distance algorithm was used to identify work-sets in the automated approach. The comparison found that an improved version of work-set keys from the OCLC FRBR algorithm provided the highest match rate, with 62% of keys matching perfectly and 87% of correct work-sets showing an 80% or higher similarity rate. The study methodology and evaluative measures overall indicate that a hybrid expert/automated approach and an improved key generation algorithm are effective alternatives to manual or automated approaches alone.

INTRODUCTION

This paper explores the evaluation of metadata quality within the context of migration to new metadata conceptual models. Metadata migration includes a number of processes such as harvesting, transformation and evaluation. While these processes can pose technical challenges for simple migrations, they also pose theoretical challenges when the migrated metadata must establish new record-to-resource relationships or when the migration process must rely on string matching as opposed to unique identifier matching. Metadata models and encoding systems are grounded in real-world service needs and are designed to fit a specific descriptive, administrative or technical purpose. This may include document description as well as event or system state representation. As information needs and metadata services change, existing metadata must be migrated to new standards, new encoding approaches and often new conceptual models. This process introduces problems with regard to metadata record-to-resource relationships as well as practical issues related to metadata consistency, accuracy and ease of migration. While metadata crosswalk standards can be useful in this area, it is difficult to encode the re-definition of record relationships in these models. In addition, when migrating to new models, unique identifiers can be of limited value as they are often designed with a specific schema or standard in mind (e.g., the connection of a Dublin Core record with a website via a URL value in the Identifier element). This paper explores metadata quality assessment in this situation and considers the viability of model application using automatic text analysis methods.

Metadata quality and interoperability are important issues across domains but are a central problem in the library, archive and museum (LAM) community as it re-conceptualizes the purposes, roles and construction of metadata. The work of identifying a new bibliographic metadata standard is under way in the library community, drawing on a number of standards (e.g., RDA, FRBR), encoding techniques (e.g., MARC21, MARCXML, JSON) and publication frameworks including Linked Open Data (LoD) (IFLA, 2011; Marcum, 2011). As the library community explores this migration it must focus both on creating a standard for new data and on migrating old metadata. Because of the range of standards and new vocabularies, this work needs to be grounded in multiple techniques including the application of authorities using LoD endpoints, the formal establishment of new relationships and the encoding of existing relationships in semantically and syntactically accurate ways.

This paper explores a specific metadata quality issue in this area using an approach that promises scalability for large record sets and is applicable to a range of metadata. Specifically, this paper reports the results of an effort to migrate existing bibliographic descriptive fields to the FRBR work-set model. The goal of this work is twofold. First, this study seeks to better understand how instantiations of descriptive metadata elements can be used to group existing bibliographic data into work-set groups. Second, this study seeks to establish a method that will allow high-confidence processing of this data using automated processes. In order to explore this issue, this research compares the outcomes of expert and computational FRBR work-set clustering approaches and considers the implications of metadata quality, cataloger expertise and FRBR model constraints. The data set of bibliographic records selected for this study focuses on Mark Twain (in both subject and author roles). One of the key outcomes of this work is the development of a series of software programs and analytical techniques for completing additional research with similar datasets.

LITERATURE REVIEW

Metadata quality has been discussed in a number of different ways (Guy et al., 2004; Greenberg, 2001; Park, 2009) and has been assessed using a wide range of metrics. In general, metadata quality focuses on both qualitative and quantitative measures of the applicability of metadata to a specific schema or for a specific information use. Park's (2009, 219) meta-analysis groups quality into three broad areas: intrinsic (e.g., accuracy, consistency), relational (e.g., informativeness, verifiability) and reputational (e.g., authority). Park also contextualizes the valuation of specific metadata quality measures including completeness, accuracy and consistency. Park, for example, points to Bruce and Hillman's work (2004) in suggesting that the measure of metadata completeness depends on information need and domain of use.

These three measures (completeness, accuracy and consistency) are good indicators of metadata fit with a schema or intended use but can be difficult to measure. For example, Weagley, Gelches and Park (2010) explored the fit of existing metadata in digital repositories to the Dublin Core Schema (http://dublincore.org) and Stvilia and Gasser (2008) evaluated metadata value as a function of value, effectiveness and cost. These approaches relied on a mix of qualitative assessment of metadata and calculation of quality measures based on those assessments.

In bibliometric and digital humanities work, the use of computational techniques to extract and analyze information from metadata and information documents has enabled the exploration of new concepts and trends. These approaches rely on text analysis using syntactic and semantic relationships to group records, derive meaning and show relationships. This approach can result in a publicly researchable dataset such as Google's Ngram viewer (Michel et al., 2010) or can lead to new approaches to deriving meaning from text analysis (Fleischmann, Templeton and Boyd-Graber, 2011). Text analysis approaches have also been used to evaluate metadata quality or create new metadata records. For example, Wrenn, Stein, Bakken and Stetson (2010) used automated techniques based on the Levenshtein edit-distance algorithm to analyze redundancy in electronic health records. Aumüller and Rahm (2011) used text extraction, database querying and results analysis to create a repository of metadata records from libraries of PDF documents. This study also used the Levenshtein edit-distance algorithm to evaluate similarity of metadata elements among returned records.

While computational techniques can be useful for identification of elements and evaluation of metadata, semi-automated processes that use human judgments are often valuable in assessing quality. For example, Kurtz (2010) used manual expert assessment in determining quality factors such as accuracy and consistency. In contrast, Ochoa and Duval (2009) explore a number of automated evaluation metrics in metadata quality studies but also observe that manual approaches are more often used than statistical approaches (2009, 69).

While these studies examine the fit of metadata to existing schemas or uses, there have been few studies that examine metadata quality within the view of new metadata models. This is of particular importance in the library community as it re-conceptualizes relationships between bibliographic works and instantiations of those works. New approaches to bibliographic data, including Resource Description and Access (RDA) (http://www.rdatoolkit.org/) and Functional Requirements for Bibliographic Records (FRBR), view bibliographic works as instantiations within a resource hierarchy (e.g., Work, Expression, Manifestation, Item). The FRBR model in particular seeks to abstract the concept of the work (e.g., “Huckleberry Finn”) from instantiations of that work (e.g., various editions, printings and formats of the work). In order to create new bibliographic records that can reflect this level of description it is important to create representative work-set combinations that point to a single abstract work. This is particularly difficult for older works that may have been through a number of editions and formats and which have been described using a range of bibliographic standards.

The grouping of existing bibliographic metadata into these new work-sets requires mapping of existing metadata to new fields and also requires the abstraction of metadata from manifestation- or expression-specific bibliographic data. There are existing MARC fields that may contain this information, such as the 240 field (Uniform title) in the MARC bibliographic standard, but this field is not regularly applied in all bibliographic metadata. In contrast, the 245 field (Title) is commonly used to document variations in a work through the inclusion of refining information in subfields. The creation of abstract work-sets has been tested using a number of approaches. For example, OCLC created and published a FRBRization work-set generation algorithm (Hickey and Toves, 2009) which focuses on the generation of four string-based keys built from MARC title, author, unique identifier and publication fields. This model focuses on including metadata that helps identify work-set metadata (e.g., 245 subfields a and b) while excluding manifestation-level metadata (e.g., 245 subfield h, the General Material Designation). The abstraction of work-sets from this metadata has proven difficult given the wide range of relationships that must be negotiated and the variation in the metadata that has been used in these records.
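
To make this key-generation idea concrete, the following is a minimal sketch, written with pymarc, of how a combined author/title key might be derived from a MARC record. The field and subfield choices and the input file name are illustrative assumptions; this is a simplified sketch of the idea, not a reimplementation of the published OCLC algorithm.

    from pymarc import MARCReader

    def workset_key(record):
        # Prefer the 240 (uniform title) when present; otherwise fall back to
        # the 245, keeping only work-level subfields ($a, $b) and skipping
        # manifestation-level detail such as $h (General Material Designation).
        title_fields = record.get_fields('240') or record.get_fields('245')
        title = ''
        if title_fields:
            title = ' '.join(title_fields[0].get_subfields('a', 'b'))
        # Use a main or added author entry when one exists.
        author_fields = record.get_fields('100', '110', '710')
        author = author_fields[0].format_field() if author_fields else ''
        return (author + '/' + title).strip().lower()

    if __name__ == '__main__':
        with open('records.mrc', 'rb') as fh:   # hypothetical input file
            for record in MARCReader(fh):
                if record is not None:
                    print(workset_key(record))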

Zhang and Salaba (2009) documented a number of proof-of-concept and production environments that had implemented FRBR work-sets. For example, the Perseus Project (http://www.perseus.tufts.edu) used FRBR to represent relationships between classical texts. In addition, the OpenFRBR project (http://openfrbr.heroku.com) sought to establish an open FRBR work-set database. More recent work by OCLC has focused on a FRBRized classification service (http://classify.oclc.org) that returns FRBRized work-sets with visualizations of the classification schemes and values utilized. In general, these projects have focused on creating systems that leverage FRBR as opposed to migrating existing metadata to new FRBR record models. In addition, these works have focused more on the outcomes than on the challenges of mapping existing metadata to new FRBR concepts.

The authors of this work initially explored the FRBRization of existing MARC metadata using the OCLC FRBR work-set algorithm (Hickey and Toves, 2009) and found that the generation of concept keys using local metadata roughly fit the success rates reported in the OCLC work-set document (Mitchell, McCallum and Strickland, 2011). This approach used a mix of title (e.g., 240, 245), author (e.g., 100, 110, 710) and identifier (e.g., 035) fields, and it proved difficult to assess the quality of the resulting matches through informal analysis. This pointed to the need both for an improved work-set generation approach and for an authoritative dataset of FRBRized records that could be used to measure the quality of the automatically generated keys.

This study builds on this previous work by refining the FRBR key generation process and by measuring the quality of the work-set matches against a set of expert-grouped FRBR work-sets. The focus of this work is not on the outcome of FRBRization but rather on the process of establishing and evaluating metadata quality measures using computational techniques.

Research questions

This study explores the problem of quality assessment of FRBRization techniques by comparing records that were FRBRized both by utilizing expert cataloger decision-making and computational techniques. This specific case of FRBRization is considered within the larger context of the migration of metadata not only to new representation schemas but also to new conceptual models. FRBR is a good venue for this research because it relies on analysis of text rather than unique identifiers and requires complex distinctions between published versions of similar works.

In order to explore this issue, this study asks the question “How do human and computational techniques compare with regard to the creation of FRBR work-sets from existing bibliographic metadata?” In answering this question, this study explores the judgments that expert catalogers make regarding FRBRization, how these judgments are grounded in bibliographic metadata and text analysis, and the effectiveness of automated approaches in matching expert judgments.

The study seeks both to answer theoretical questions about the process of metadata migration and practical questions about the applicability of computational techniques to large-scale metadata problems. In doing so, this research asks questions about metadata quality that are related to a specific use (e.g., migration to the FRBR model) and which are grounded in the assessment of description exhaustivity and specificity. This research is bounded by the assumption that the creation of FRBRized work-sets must lead to highly accurate work-set clusters. As a result, it assesses quality in terms of the ability of automated techniques to match expert-cataloger groupings and is conservative when trusting automatically generated work-sets.

This study sought to 1) Create a comparison key that improved upon the OCLC FRBR work-set algorithm, 2) Examine the similarity distribution of expert-matched records and 3) Compare this distribution against the distribution of similarity scores for all records in the dataset.

This study involved the creation of data processing scripts using common programming-language tools (e.g. Python, PyMARC, python-Levenshtein) and utilized low-barrier data visualization tools (e.g., Google Refine) to explore the impact of text normalization and clustering. The methodology section discusses both the record set selected for analysis and the script development process.

Methodology

This study was completed in four phases and builds on previous work of the authors (Mitchell, McCallum, Strickland, 2011). The general approach involved 1) Selecting a representative record set that would allow the exploration of FRBR work-set generation issues, 2) The extraction of data from that record set for expert cataloger and automated FRBR work-set assignment and development of programmatic techniques to create and compare normalized text keys from elements of the metadata, 3) Expert cataloger assessment of work-set relationships and 4) The analysis of key similarity scores using the Levenshtein edit-distance algorithm. Each of these phases is discussed separately.

Identification of a record set

The study selected a set of 848 records related to Mark Twain (either published by Samuel Clemens or involving Samuel Clemens as a topic) from a library catalog. All records were in MARC format and described traditional library resources. Records by or about Samuel Clemens were selected for a number of reasons. First, because the study required expert analysis of relationships, it needed a manageable record set size. Second, records were selected from a library catalog that the cataloger had access to. This facilitated work-set analysis as the cataloger could gain access to the materials in question. Third, the records in this selection included a number of complex work-set relationships including multiple works, author pseudonyms, editions and formats. This created a dataset that included a rich selection of FRBRization issues. In contrast, FRBRization in a scientific discipline or using a specific format (e.g., serials) would have led to a more homogenous dataset with fewer work-set relationships.

Extraction of data for analysis

Using a Python-based MARC data extraction and normalization program, the selected records were extracted from MARC data files. The program pulled records where Samuel Clemens' name or pseudonyms (e.g., Mark Twain, Snodgrass, De Conte) were used in either an author or subject field. The data was written to a spreadsheet which included the initial OCLC FRBR work-set keys (Hickey and Toves, 2009), unique record identifiers and other descriptive metadata fields. Once the records were extracted the data was loaded into Google Refine for initial exploration and analysis. Based on the results of clustering and faceting analysis, a new field was added to the spreadsheet to track cataloger assignment of work-sets.
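
As an illustration of this extraction step, a simplified sketch using pymarc and the csv module follows. The list of name forms, the MARC fields checked and the file names are assumptions made for the example; the study's actual extraction and normalization program is not reproduced here.

    import csv
    from pymarc import MARCReader

    # Illustrative name forms only; the study's actual matching list may differ.
    NAMES = ['twain, mark', 'clemens, samuel', 'snodgrass', 'de conte']

    def mentions_twain(record):
        # True when an author (1XX/7XX) or subject (6XX) field contains one
        # of the listed name forms.
        for field in record.get_fields('100', '110', '700', '710',
                                       '600', '610', '650'):
            if any(name in field.format_field().lower() for name in NAMES):
                return True
        return False

    def first_field(record, tag):
        # Return the first instance of a field as a string, or ''.
        fields = record.get_fields(tag)
        return fields[0].format_field() if fields else ''

    with open('catalog.mrc', 'rb') as marc_in, \
            open('twain_records.csv', 'w', newline='') as csv_out:  # hypothetical names
        writer = csv.writer(csv_out)
        writer.writerow(['record_id', 'title_245', 'author_100'])
        for record in MARCReader(marc_in):
            if record is not None and mentions_twain(record):
                writer.writerow([first_field(record, '001'),
                                 first_field(record, '245'),
                                 first_field(record, '100')])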

The analysis of data in Google Refine also indicated that the Key Collision technique, which Google Refine uses to help cluster strings, might be helpful during the automatic work-set generation process. The Key Collision logic is part of the fingerprint clustering technique and is documented on Google's support site (Google Refine, 2012).

Expert cataloger analysis

An expert cataloger grouped records into work-sets using this metadata and created a matrix of unique identifiers based on the accession numbers pulled from the MARC records (e.g., 3234, 34323, 5443, 51). This matrix connected rows of the spreadsheet to indicate which bibliographic records had been grouped into work-sets. During the process, the cataloger found that the variation of metadata in the spreadsheet made it difficult to make work-set determinations. In order to improve the quality of the matches, 5 titles were selected for ‘book-in-hand' analysis, two of which were among Twain's most frequently republished titles (“The Adventures of Tom Sawyer” and “The Adventures of Huckleberry Finn”). This secondary analysis included comparison of title and publication pages, review of the table of contents (ToC), and analysis of the text (particularly for added or removed sections indicated in the ToC or referenced in an edition's introduction, if available). This process involved checking approximately 100 books in a number of locations (e.g., circulating stacks, rare books).

Analysis of similarity scores

Following the determination of work-sets by the cataloger, additional programs were developed to extract five representative text strings from the MARC records. These strings, or ‘keys', were merged with the cataloger's spreadsheet. The spreadsheet contained unique identifiers, the work-set ID matrix, cataloger notes, a series of MARC fields (e.g., 100, 245) and five specific key fields. These keys are shown in Table 1.

Table 1. Key values for the title Pudd'nhead Wilson: a Tale / by Mark Twain (Record ID 12502)

Table 1 shows an example of each key, and each key emphasizes a different approach to text normalization. For example, key 245a has had no normalization applied. Keys 245ab – KC, 245abcfg – KC, and OCLC FRBR Key 1 – KC have all had a specific keying process, called Key Collision, applied to them. The Key Collision process was adapted from the approach used in creating clusters in Google Refine (Google Refine, 2012). This approach converts all alphabetic characters to lower case, removes all non-alphanumeric characters, sorts the words alphanumerically and removes duplicate words. This process is grounded in the idea that removing duplicate words and stripping syntactic formatting leaves only the valuable elements of the string. This string processing is the foundation of the fingerprint clustering technique in Google Refine.
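
A minimal sketch of this normalization step, assuming a plain Python re-implementation rather than Google Refine's own code, might look like the following; the example output is only the normalized title string, not a complete OCLC FRBR key.

    import re

    def key_collision(value):
        # Lower-case, strip non-alphanumeric characters, then split,
        # de-duplicate and sort the remaining tokens, mirroring the Key
        # Collision/fingerprint normalization described above.
        value = re.sub(r'[^a-z0-9\s]', '', value.lower())
        return ' '.join(sorted(set(value.split())))

    print(key_collision("Pudd'nhead Wilson: a Tale / by Mark Twain"))
    # -> a by mark puddnhead tale twain wilson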

Following the creation of keys, a Python script was written to compare the differences between keys of associated work-sets. Two approaches to computing similarity were examined: the SequenceMatcher class of the difflib Python library and the ratio method of the Levenshtein Python library (python-Levenshtein) (Ohtamaa and Necas, 2012). The comparison using difflib's SequenceMatcher returned inconsistent results for strings over 200 characters. This issue is documented in the Python developer community (http://bugs.python.org/issue10534). The Levenshtein library returned consistent results across all string lengths and proved to be faster for string comparisons.
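
The two ratio calculations can be compared directly, as in the short sketch below. The example strings are taken from Table 5; note that SequenceMatcher applies an "autojunk" heuristic once a compared string reaches 200 characters, which appears to be a likely source of the inconsistencies noted above.

    from difflib import SequenceMatcher
    import Levenshtein   # provided by the python-Levenshtein package

    a = '1835 1910 adventures finn huckleberry mark of twain'
    b = '1835 1910 adventures comrade finn huckleberry mark of sawyers tom twain'

    # difflib similarity ratio (0.0 to 1.0).
    difflib_score = SequenceMatcher(None, a, b).ratio()

    # Normalized edit-distance similarity from python-Levenshtein.
    levenshtein_score = Levenshtein.ratio(a, b)

    print(round(difflib_score, 3), round(levenshtein_score, 3))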

Both approaches implement a ratio calculation method that generates a similarity score for two compared strings. Similarity scores are based on the edit-distance between the two strings (the changes required to make them identical) relative to the length of the strings. As pointed out in the literature review, edit-distance has proven to be a good measure for assessing metadata quality. In the Python scripts, similarity calculations were completed for each key. This resulted in 346,528 similarity scores for the 848 records. In order to get more accurate key matching results, record keys were not compared to themselves during the data analysis process. In addition, record similarity pairs were de-duplicated prior to analysis.
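
The pairwise scoring described here can be sketched as follows; itertools.combinations both skips self-comparisons and avoids duplicate (a, b)/(b, a) pairs. The identifiers and keys shown are a tiny illustrative subset rather than the study's data.

    from itertools import combinations
    import Levenshtein

    def pairwise_scores(keys):
        # `keys` maps a record identifier to its normalized key string.
        scores = {}
        for (id_a, key_a), (id_b, key_b) in combinations(keys.items(), 2):
            scores[(id_a, id_b)] = Levenshtein.ratio(key_a, key_b)
        return scores

    keys = {
        '0008514': '1835 1910 adams diary extracts from mark twain',
        '0008515': '1835 1910 adams diary extracts from mark twain',
        '0226871': 'circle circular mark twain',
    }
    for pair, score in pairwise_scores(keys).items():
        print(pair, round(score, 3))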

Similarity scores were analyzed for two purposes. First, average scores were calculated for each key using only expert-matched work-sets. The results of this analysis were compared using paired t-tests to determine which key led to the highest average similarity for expert-matched titles. Second, similarity scores were compared between expert-matched and automatically matched records. The resulting distributions of scores were compared to identify the optimal similarity score cutoff for automatic matching and to get a sense of the overall success of automated work-set generation. The goal of this comparison was to evaluate how well work-set groupings created without expert guidance fit those created with expert guidance.
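
The paired t-tests mentioned above could be reproduced along the lines of the sketch below, assuming the per-pair similarity scores for two key types are aligned over the same expert-matched pairs; scipy is assumed here, and the score values shown are invented purely for illustration.

    from scipy import stats

    # Per-pair similarity scores for the same expert-matched work-set pairs,
    # computed once with the plain OCLC FRBR Key 1 and once with the Key
    # Collision variant (values invented for illustration).
    oclc_key1 = [0.88, 0.91, 0.79, 1.00, 0.86]
    oclc_key1_kc = [0.93, 0.95, 0.83, 1.00, 0.90]

    # Paired (dependent-samples) t-test, since both score lists describe the
    # same underlying record pairs.
    t_stat, p_value = stats.ttest_rel(oclc_key1_kc, oclc_key1)
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}")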

The data and scripts for this research have been published on GitHub (http://github.com/mitcheet).

RESULTS

Results of the expert-matching process, key-based similarity scores and overall distribution of similarity scores are discussed separately in this section. The results include both qualitative findings from the expert review process and quantitative findings from the key similarity analysis.

Expert cataloger grouping

The record set selected for this project included 848 unique bibliographic records related to Mark Twain. The records fit within the Library of Congress call number range PS 1300–1348. Of this record set, 410 records were connected through expert-cataloger review, for a total of 147 work-sets with two or more expressions and 420 work-sets with only one expression. The largest clusters of records in the set involved the titles “The Adventures of Huckleberry Finn” (26 records) and “The Adventures of Tom Sawyer” (14 records).

The cataloger found that the use of the title (i.e., MARC 245 field), author (i.e., MARC 100 field) and OCLC FRBR Key 1 (i.e., a combination of title and author fields) was largely effective in making broad work-set assignments but also introduced uncertainty, particularly in cases of critical editions and the use of alternate subtitles (e.g., 245b). This left open questions about work-set grouping that needed to be resolved by exploring the actual books in depth. As mentioned in the methodology section, the cataloger examined tables of contents, verso and recto sides of title pages, and the books' text to make determinations with regard to work-sets.

While reviewing copies of Twain's works, the cataloger discovered differences between publications of specific titles. For example, the short story “Those Extraordinary Twins” is sometimes included as a separate section at the end of the novel “Pudd'nhead Wilson” and at other times is left out altogether. In the editor's introduction of two individual editions of “Huckleberry Finn,” it was noted that the “raft chapter,” which originally appeared in Twain's 1876 manuscript but was removed by Twain himself, had been restored. While some records included bibliographic data that pointed to these changes, it was uncommon to see them discussed in detail in the metadata selected for analysis. According to the IFLA FRBR report (2009), a work is the primary entity of the Group 1 entities and potentially may have one or more expression children. Text enlargement, as in both of the examples mentioned above, is regarded in the FRBR model as an expression of the work as opposed to a new work. During the secondary ‘book-in-hand' analysis, the cataloger found it necessary to re-read the FRBR report's section on works more than once to help decide whether the book she was examining was a new work or an expression of an existing work.

When conducting the initial grouping of titles using only the 245a key and the OCLC FRBR Key 1, the cataloger grouped “The Art of Huckleberry Finn: Text, Sources, Criticism” (ArtHF) separately from all instances of “The Adventures of Huckleberry Finn” in the record set. After looking at the MARC catalog records and the three physical copies of ArtHF, however, doubts and questions arose for the cataloger about her title groupings. A first edition of the book resided both in the circulating main stacks and in rare books, while a second edition was housed in the main stacks as well. After examining all copies and noting their similarities and differences, the cataloger searched the catalog for this title and found that each copy had its own MARC record. Two interesting elements found in the MARC record of the second edition that were not included in the records for the first edition were a 240 Uniform Title field for “Adventures of Huckleberry Finn” as well as a note stating “The text of Huckleberry Finn … is a facsimile of the first American edition [with t.p. reading: Adventures of Huckleberry Finn (Tom Sawyer's comrade) New York, C. L. Webster, 1885] … Facsimile texts of sources have been chosen wherever possible from an edition Clemens used or probably used.” After this discovery and discussion with another cataloging colleague, it was clear that, with its inclusion of the Huckleberry Finn facsimile, ArtHF could be grouped as an individual work but could also be viewed as a manifestation of an expression of “Adventures of Huckleberry Finn.”

Without viewing the MARC records and the books themselves, the cataloger would not necessarily have known from the 245a and OCLC FRBR Key 1 keys alone that ArtHF contained a full-text facsimile of Huckleberry Finn. The lack of robust and consistent descriptive metadata, especially in older MARC records, can inhibit an individual's ability to precisely group records into FRBR work-sets and to distinguish between a work and an expression among Group 1 entities (IFLA, 2009).

Because the concept and constitution of a work are abstract, interpretations of the FRBR model can vary between catalogers and their respective institutions, which may prove frustrating when catalogers and programmers begin implementing the FRBR model in their institutions' online catalogs. Considering patrons' historical and current catalog search and use patterns may help guide libraries in how best to implement and FRBRize their data so that their resources may be discovered with ease and efficiency.

Evaluation of metadata key similarity

As part of the data analysis, five keys were generated for each record (see Table 1). The similarity values for expert-cataloger work-sets were calculated for each key. The comparison excluded work-sets with only one title and did not compare titles to themselves. Table 2 shows the average and standard deviation for each compared key in this result set. Table 3 shows some examples of how these keys result in different values.

Table 2. Average similarity ratio for expert-guided work-sets
Key Type | Average similarity score for expert-matched work-sets | Standard deviation
245a | 0.869 | 0.154
245ab – KC | 0.884 | 0.171
245abcfg – KC | 0.686 | 0.211
OCLC FRBR Key 1 | 0.881 | 0.154
OCLC FRBR Key 1 – KC | 0.905 | 0.142

Table 2 shows that there was some variation in the average scores for different keys. Because the goal of this comparison was to identify a good measure of similarity, the OCLC FRBR Key 1 – KC was used in t-tests to assess the variation in scores. Specifically, t-tests were conducted to compare the impact of the Key Collision technique on the OCLC FRBR keys. The comparison found that the Key Collision version of the OCLC FRBR Key 1 (M = 0.905, SD = 0.142) did significantly differ from the unprocessed OCLC FRBR Key 1 (M = 0.881, SD = 0.154) (t(1680) = 1.931, p < .001). A statistically significant difference was also found comparing the OCLC FRBR Key 1 – KC (M = 0.905, SD = 0.142) and the 245ab – KC fields (M = 0.884, SD = 0.171) (t(1680) = 14.069, p < .001).

These tests support the assertion that the OCLC FRBR Key 1 with the added refinement of Key Collision processing led to a higher level of computed similarity scores for expert-matched titles. This indicates that a combination of selected title and author elements are a good match for work-set keys and that removal of all non-alphanumeric characters in addition to sorting and de-duplication of tokens (e.g., words) enhances this key. The improvement in the similarity score for the Key Collision approach also provides a good initial value for a cutoff similarity score to use in automated grouping of records into work-sets. Given the findings of this analysis, only the OCLC FRBR Key 1 with Key Collision was selected for unguided work-set generation.

Unguided work-set generation

Work-set groups were analyzed without the benefit of expert cataloger guidance to determine what the unaided success rate of the OCLC FRBR Key 1 – KC would be. Because this key showed the highest statistically valid similarity score when used in the context of expert-matched records, it was used alone as a potential guide for unguided matching. Unique identifiers for related records were output as sets (e.g., 44345:23432) along with the similarity score for the set. For the records grouped using the expert work-sets as a guide, similarity scores ranged between 43% and 100%. Figure 1 shows the distribution of similarity scores for the expert-matched records. The distribution of scores at or above 80% similarity indicates that the potential exists for reliable automated clustering of records.

Figure 1. Distribution of similarity scores for expert grouped work-sets

Figure 2. Distribution of similarity scores for automatically grouped work-sets

Figures 1 and 2 show the distributions of similarity scores for all expert grouped (Figure 1) and automatically grouped (Figure 2) work-sets. Figure 1 shows that expert matched sets (n=840, M = 0.908, SD = 0.136) tend to have similarity scores grouped above 80%. The predominance of values at 100% (n=501) in Figure 1 also indicates that automatic matches at that level may be reliable. In contrast, Figure 2 shows that for all records in the automatically matched sets (n=346,528, M = 0.456, SD = 0.097) the distribution of similarity scores was centered much lower and was more tightly grouped around the mean.

An exploration of the intersection of expert and non-expert work-sets found that the distribution of expert and automatically generated sets converged at similarity scores above 80%. Figure 3 shows the actual number of work-sets for similarity scores equal to or greater than 80% for both approaches.

Figure 3. Distribution of similarity scores above 80%

Figure 3 shows how consistent the distributions are above 90% for both automatically and expert grouped work-sets. In fact, the actual number of sets with a similarity score of 1 is close for both sets (expert-matched n=515, automatically generated n=521). Although the distributions are very different for similarity rates between 80% and 89%, roughly 1 in 8 work-sets is accurate in this range. Additional analysis of automatically generated sets found clusters of “The Adventures of Huckleberry Finn,” which had also proven problematic during the expert-matching process. Below 80%, the ratio of correctly to incorrectly grouped work-sets drops dramatically (0.03% at 70% similarity).
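
One way to turn pairwise scores above a cutoff into candidate work-sets is to treat each qualifying pair as a link and take the connected components, as in the sketch below; this union-find grouping is an illustrative strategy and not necessarily the exact procedure used in the study.

    def group_worksets(pair_scores, cutoff=0.8):
        # Group record identifiers into candidate work-sets by linking every
        # pair whose similarity score meets the cutoff (simple union-find).
        parent = {}

        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path compression
                x = parent[x]
            return x

        def union(a, b):
            parent[find(a)] = find(b)

        for (id_a, id_b), score in pair_scores.items():
            if score >= cutoff:
                union(id_a, id_b)

        worksets = {}
        for record_id in parent:
            worksets.setdefault(find(record_id), set()).add(record_id)
        return list(worksets.values())

    print(group_worksets({('0008514', '0008515'): 1.0,
                          ('0008514', '0226871'): 0.41}))
    # -> [{'0008514', '0008515'}]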

A record-by-record comparison of the set found that all of the expert-matched sets with similarity scores of 100% were also in the automatically generated set. It also found six new titles with a similarity score of 100%. These titles along with their key are included in Table 3.

Table 3. Automatically generated sets that were not identified by expert with similarity score of 1
Set                  Key (both keys matched exactly)
0008514|1322023      1835 1910 adams diary extracts from mark twain
0008515|1322023      1835 1910 adams diary extracts from mark twain
0012775|0118503      1835 1910 letters mark mary to twain twains
0218807|2422479      1835 1910 autobiography mark of twain
0226871|1431645      circle circular mark twain
0911915|1293231      1835 1910 a adventures case controversy critical finn huckleberry in mark of study twain

As Table 3 shows, the additional automatically generated matches are based on keys complex enough that they can be trusted with high reliability to represent the same work. Consulting the expert cataloger's notes for these titles indicated that these relationships had been recognized. While confirming these matches would still require some consideration, only a few keys were generic enough (e.g., “circle circular mark twain” and “1835 1910 autobiography mark of twain”) to require review.

The comparison of sets also found that while 399 sets with similarity scores between 80% and 99% were unmatched (i.e., in the automatically generated group but not in the expert-matched group), only 21 sets were unmatched with similarity scores greater than or equal to 90% and less than 100%. Table 4 shows some example records along with the variation in keys.

Table 4. Example variation in keys, similarity score > 90% and < 100%
Score | Key 1 | Key 2
0.902 | 1601 1835 1910 as by conversation date fireside in it mark of social the time tudors twain twains was | 1601 1835 1910 as at conversation fireside in it mark of or social the time tudors twain was
0.909 | 1835 1910 a dogs mark tale twain | 1835 1910 a horses mark tale twain
0.909 | 1835 1910 a drama mark sawyer tom twain | 1835 1910 abroad mark sawyer tom twain
0.922 | 18671910 budd critical essays j louis mark on twain | 19101980 budd critical essays j louis mark on twain
0.944 | 1835 1910 mark sketches twain twains | 1835 1910 mark speeches twain twains

The variation in these records shows how difficult it can be to make accurate decisions using short keys. For example, while “A Dogs Tale” and “A Horses Tale” are separate works, “Critical Essays on Mark Twain” edited by Louis J Budd (row 4 in Table 4) is likely pointing to two editions of the same work but has incorrect metadata in the 245 field, subfield b. This points to the need to experiment with a more complex field-weighting scheme or to use a successive series of similarity scores to identify better matches.

In contrast to the data from high similarity scores shown in Table 4, the data from expert-matched sets with low similarity scores show considerable key variation due to the inclusion of verbose subtitles. Table 5 includes selected records with similarity scores between 46.5% and 83.6%. The variation of key length in these examples as well as the introduction of a number of refining words make it much more difficult to make work-set decisions without access to other metadata.

Table 5. Example variations in keys where similarity scores are between 40% and 89%
Score | Key 1 | Key 2
0.836 | 1835 1910 adventures finn huckleberry mark of twain | 1835 1910 adventures comrade finn huckleberry mark of sawyers tom twain
0.706 | 1835 1910 adventures ago comrade fifty finn forty huckleberry mark mississippi of sawyers scene the time to tom twain valley years | 1835 1910 adventures comrade finn huckleberry mark of sawyers tom twain
0.465 | 1835 1910 albert an and appreciation bigelow by dean howells introduction mark paine speeches twain twains william with | 1835 1910 mark speeches twain twains

The variation in length of these titles could be a clue for a two-step algorithm that first matched based on the longer key and then attempted to find high similarity scores for shorter keys (e.g. 245a and 100/110/710 fields). While the reliability of the match would have to be very high (e.g., 100%), it could provide a means to highlight potential matches.
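
A sketch of such a two-step check follows; the key choices and cutoff values are assumptions made for illustration rather than tested parameters.

    import Levenshtein

    def two_step_match(record_a, record_b, long_cutoff=0.8, short_cutoff=1.0):
        # `record_a` and `record_b` are dicts holding a longer, more
        # descriptive key (e.g., OCLC FRBR Key 1 - KC) under 'long_key' and a
        # shorter key (e.g., normalized 245a plus author) under 'short_key'.
        long_score = Levenshtein.ratio(record_a['long_key'],
                                       record_b['long_key'])
        if long_score < long_cutoff:
            return False, long_score
        short_score = Levenshtein.ratio(record_a['short_key'],
                                        record_b['short_key'])
        # Accept only when the shorter keys also match at a very high level.
        return short_score >= short_cutoff, short_score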

DISCUSSION

This study explored the issues of metadata quality in regards to the ability to create FRBRized work-set relationships. In doing so it explored a technique grounded in computational approaches but guided by expert cataloger matching. The findings of the cataloger showed that while computers can handle the inconsistency in MARC record syntax and can even generate highly relevant keys by which FRBR work-sets can be created, some decisions must be informed with a more granular understanding of the variation in the resources themselves. As such, a key goal of automation should be to help both catalogers and computers make better informed decisions regarding metadata model migration.

Despite the problems associated with automatically identifying all relationships in the record set, the data represented in Figures 2 and 3 indicate that, even without enhancement, the key comparison approach utilized in this study could be helpful in guiding FRBR clustering. While the variation displayed even in high similarity scores (e.g., Table 4) points to a need to refine the key generation process, overall the study found that 62% of the matched work-sets could be accepted without review (similarity score of 100%) and that 87% of work-sets could be identified by reviewing a relatively small number of records. In this data set, 544 records had scores between 80% and 99%. A number of these records showed overlap between varying title forms of “The Adventures of Huckleberry Finn,” showing the difficulty of making automatic metadata matching choices when the source records reflect a complex publication history.

In order to test a secondary matching routine based on comparison of multiple key similarity scores, automatically grouped work-sets with a similarity score at or above 80% had similarity scores computed for a normalized version of the title alone (245ab – KC in Table 2). This new score was compared to the OCLC FRBR Key 1 – KC score. Although only 27 records showed an improvement in similarity score, all were determined to be correct work-sets by expert review. In contrast, work-set comparisons that showed a decrease in score had a large number of incorrect work-set matches. While additional work would be required to identify a high quality secondary match, the process of generating work-sets by comparing the trajectory of similarity scores could be a useful technique.
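
The trajectory idea can be sketched as follows: for each candidate pair, the score computed on a secondary key (here the title-only 245ab – KC) is compared against the primary OCLC FRBR Key 1 – KC score, and pairs whose score improves are flagged for acceptance or review. The data structures are assumptions made for illustration.

    import Levenshtein

    def score_trajectory(pairs, primary_keys, secondary_keys):
        # `pairs` is an iterable of (id_a, id_b) candidate work-set pairs;
        # `primary_keys` and `secondary_keys` map record ids to key strings.
        flagged = []
        for id_a, id_b in pairs:
            primary = Levenshtein.ratio(primary_keys[id_a], primary_keys[id_b])
            secondary = Levenshtein.ratio(secondary_keys[id_a], secondary_keys[id_b])
            if secondary > primary:
                # The title-only score improved on the combined-key score.
                flagged.append((id_a, id_b, primary, secondary))
        return flagged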

The computational techniques utilized in this research show that comparison of similarity scores is a valuable measure of metadata consistency, particularly in the context of metadata migration. This fills a need in libraries for automated techniques for MARC record processing and, given the relative success of exact key matches, suggests that a sufficiently complex algorithm with multiple key comparison approaches could be successful in handling a high percentage of metadata migration tasks. In addition, while ultimately it did not play a significant role in the analysis of the records, Google Refine was instrumental in enabling the researchers to compare, visualize and discuss the MARC metadata. Certain techniques including the string normalization routine from the fingerprint clustering technique were drawn from Google Refine code and helped improve the work-set generation process.

It is worth noting that the automatic generation of work-sets in this research and the evaluation of quality of these work-sets were only possible because a cataloger invested considerable time and effort in analyzing these records. As Ochoa and Duval (2009) point out, a highly reliable training metadata repository would make quantitative assessment of metadata quality much easier. Because the initial work has been done with even this limited dataset, it would be valuable to replicate these techniques on larger collections in the same domain or using other similar datasets along with random sampling of records for expert assessment.

Analysis of key length against similarity scores indicated that there might be an inverse relationship between the length of the key and the similarity scores for that key. While this initially appeared to be problematic and caused the researchers to move towards longer keys with a lower standard deviation (e.g., the OCLC FRBR Key 1 – KC), the discovery that the low similarity scores of roughly 20% of the expert-matched titles were due to very lengthy subtitle statements indicates that a two-step key assessment technique could be valuable.

Finally, the data in this study show the viability of both computational assessments of metadata quality and of the ability to migrate metadata to new conceptual models based on this work. With a database of valid key relationships in hand, it would be a rather trivial process to select a representative value and create new records along with the appropriate relationships. Although the methods were not explored in this research, the python-levenshtein library includes methods that would facilitate this process including string consolidation and multi-string comparison.
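
For example, python-Levenshtein's median functions can consolidate a group of keys into a single representative string, as in this small sketch (the keys shown are illustrative):

    import Levenshtein

    workset_keys = [
        '1835 1910 adventures finn huckleberry mark of twain',
        '1835 1910 adventures finn huckleberry mark of twain',
        '1835 1910 adventures comrade finn huckleberry mark of sawyers tom twain',
    ]

    # median() returns a generalized median string that minimizes the summed
    # edit distance to the inputs; quickmedian() is a faster approximation.
    representative = Levenshtein.median(workset_keys)
    print(representative)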

CONCLUSION

Ultimately, this study found that using computational techniques does enable mapping to new metadata models and can simplify the work of complex model application. This is an uncommon technique in evaluating the quality of metadata and has the benefit of being scalable to similarly structured metadata and schemas. While the keys tested in this dataset are not likely to transfer directly to other schemas, the key normalization approaches and principles of work-entity analysis using Levenshtein edit-distance provide a framework by which keys could be generated for other metadata schemas. This process could prove to be valuable as libraries, archives and museums migrate to new metadata schemas and encoding standards that de-emphasize the document and focus on relationships between documents and metadata (e.g., Linked Open Data). Finally, while this study focused on a very small record set, new MARC datasets are available from sources that number in the millions of records (e.g., HathiTrust and Harvard University Libraries). While the work of generating and comparing keys for this number of records is non-trivial it is a good candidate for parallel processing and could lead to a repository of high quality FRBRized linked data.
