Metadata quality is directly related to the quality of services provided by digital libraries. This paper presents results from evaluating the metadata quality of the Internet Public Library (IPL, www.ipl.org). Two evaluation methods were used: a preliminary automatic evaluation and a human-involved evaluation using a survey. The automatic evaluation focused on the completeness of major IPL metadata fields. The human evaluation asked evaluators to judge the accuracy, completeness, consistency, functionality and usefulness of IPL metadata fields. Qualitative feedback from evaluators gave us an in-depth picture of IPL metadata quality. We also compared results from the automatic evaluation and the human evaluation.
The Internet Public Library website (www.ipl.org) holds a large collection of authoritative websites on various subjects. The IPL has three primary collections: the general collection, containing about 10,000 links; the youth collection (KidSpace), with about 3,500 links; and the teen collection (TeenSpace), with about 2,000 links. Each of these links points to a website that contributes valuable content to the IPL collection. Each website is described by a metadata record stored in an open source software system known as Hypatia. The IPL metadata records are created by non-professionals, mainly college students majoring in library science who act as volunteers.
We investigated the metadata records currently in the collection. The metadata evaluation project was carried out in two stages: a preliminary automatic evaluation without human evaluators and a human evaluation using a survey. We were interested in whether the results from these two methods differed. The advantages and disadvantages of the automatic approach and the human-involved approach to metadata evaluation are discussed in the paper.
Why Metadata Quality?
Metadata quality is closely related to the quality of services provided by digital libraries. Low-quality metadata can impede resource discovery by users. For instance, incomplete or inaccurate data entry in metadata records may prevent the described digital objects from ever being found by users. Metadata quality is also essential for the interoperability and aggregation of distributed repositories. For example, in digital projects like the Open Archives Initiative (OAI), it is important that data providers submit standard, high-quality metadata. Otherwise, users would not be able to search across different metadata repositories and find all the relevant resources distributed among them. Given the significance of metadata quality, digital libraries need to evaluate their metadata collections frequently. Evaluation results can help digital libraries identify defects in their metadata collections.
When evaluating metadata quality, the first question one may ask is what high-quality metadata is, or what properties high-quality metadata should possess. Researchers have answered this question from different perspectives (Cole, 2002; Lei et al., 2006; Guy et al., 2004). In general, high-quality metadata should facilitate the process of identifying, describing, managing and searching data.
Metadata Evaluation Criteria and Methods
Because evaluation goals, data objects and collections differ, researchers apply different criteria and principles in the evaluation process. Bruce and Hillmann (2004) proposed seven criteria for evaluating metadata quality: completeness, accuracy, provenance, conformance to expectations, logical consistency and coherence, timeliness, and accessibility. Moen & Stewart (1998) evaluated the GILS metadata records according to four criteria: completeness, profile, accuracy, and serviceability. Shreeves et al. (2005) evaluated the quality of Dublin Core metadata records of collections harvested through the Open Archives Initiative (OAI) Protocol. Specifically, their evaluation focused on four dimensions: completeness, structural consistency, semantic consistency and ambiguity. Zeng et al. (2004) applied four criteria to evaluate NSDL's metadata repository: completeness, correctness, consistency and duplication. Tolosana-Calasanz et al. (2006) developed a quantitative method for assessing the quality of geographic metadata. The researchers first formulated a list of geographic quality criteria by consulting domain experts. The identified criteria fell into two groups: structural and semantic. Rather than evaluating metadata quality directly from the structure and content of metadata records, some researchers evaluated metadata with regard to its function and use. Sokvitne (2000) evaluated how effectively Dublin Core metadata embedded in web pages facilitated the retrieval of web resources.
In the above research, the evaluated metadata was created by cataloging or metadata professionals. Some researchers were interested in assessing the quality of metadata generated by authors or contributors. Greenberg et al. (2001) conducted a study to evaluate author-generated Dublin Core metadata. Two cataloging professionals were asked to determine whether the quality of each metadata element was acceptable. They also assessed the intelligibility and the general correctness of the metadata. In particular, the subject keywords were evaluated in terms of specificity and exhaustiveness. The results indicated that authors can create good-quality Dublin Core metadata. Wilson (2007) evaluated contributor-supplied metadata at three levels: syntax, structure, and semantic content. At the syntax level, the overall completeness of the metadata record submitted by the contributor was evaluated. At the structure level, the consistency and correctness of the format of each metadata element were assessed. At the semantic content level, the correctness of each element's content was checked.
While most existing metadata quality evaluation is based on human review, some researchers have explored how to evaluate metadata quality automatically, without human judgment. Hughes (2004) and Bui & Park (2006) collected statistical information at the metadata repository level to obtain an overall estimate of the quality of the metadata records in the repository. Dushay & Hillmann (2003) applied a visualization tool called Spotfire to facilitate automatic metadata evaluation. The tool can present a graphical overview of the structure of a collection's metadata.
Although statistical methods and visualization tools may be efficient during the evaluation process, they can only be used to assess structural aspects of metadata quality, such as completeness and consistency. Ochoa & Duval (2006) proposed an approach to automatic evaluation based on the evaluation framework of Bruce & Hillmann (2004). For each evaluation criterion, they developed quality metrics that can be calculated mathematically. For instance, the completeness of a metadata record is measured by the number of filled fields divided by the total number of fields in the record. Even quality criteria at the semantic level can be measured quantitatively with suitable metrics. For example, the semantic accuracy of fields like title, abstract and keywords is measured through the vector cosine distance between the content of these fields and the content of the original resource.
In our research, we decided to use both automatic evaluation and human review to gain a thorough understanding of metadata quality in the IPL. We focus on the accuracy, completeness and consistency of the IPL metadata. These criteria can be measured both quantitatively and qualitatively. We also wanted to look into the functional roles of metadata. Therefore, evaluations of searchability, browsability and management were also included in our survey.
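The two kinds of metric described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of Ochoa & Duval (2006): the function names, the field names, and the use of raw term-frequency vectors split on whitespace are all assumptions made for the example, and the published metrics may weight fields and terms differently.

```python
import math
from collections import Counter


def completeness(record, fields):
    """Fraction of the listed fields that hold a non-empty value."""
    filled = sum(1 for f in fields if str(record.get(f) or "").strip())
    return filled / len(fields)


def cosine_similarity(text_a, text_b):
    """Cosine similarity between raw term-frequency vectors of two texts.

    Assumption: terms are lower-cased, whitespace-separated tokens.
    """
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical record with two of four fields filled in.
record = {"title": "World Atlas", "author": "", "keywords": None, "abstract": "Maps."}
score = completeness(record, ["title", "author", "keywords", "abstract"])
```

A similarity close to 1 suggests that a field shares most of its vocabulary with the resource it describes, and a value near 0 suggests little overlap.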
The IPL Metadata Schema
The metadata records in the IPL were developed by student volunteers. Before they start working on any IPL metadata records, student volunteers are required to work through an IPL collection development manual step by step. They first familiarize themselves with the IPL's collection policy and practice with assignments. They are then evaluated before being accepted as qualified volunteers. When a link to a new or existing external resource is provided, the volunteer first evaluates the quality of the resource and decides whether to add it to, or keep it in, the collection, depending on whether it meets the IPL's collection policy standards. After the decision is made, the volunteer creates and edits the metadata record for the resource, following the IPL metadata development guidelines.
Collection development in the IPL is conducted using a web interface to the Hypatia database. To give a full picture of the IPL metadata structure, the IPL metadata schema is mapped to the simplified Dublin Core metadata standard in Table 1. The IPL metadata schema contains nine elements from Dublin Core. It also contains some elements that are not present in Dublin Core, such as Youth level, Body, and Comments.
Table 1. Dublin Core and the IPL Metadata elements
The metadata evaluation project was carried out in two stages: a preliminary automatic evaluation without human evaluators and a human evaluation using a survey. In the automatic evaluation stage, we mainly aimed to find out how complete the metadata fields were. For each metadata element, a SQL query was run to find out how many records had a value in the field. A sample query is shown here.
“SELECT COUNT(ic.item) FROM Item_coll ic JOIN Textfield tf ON tf.entry = ic.item JOIN Value_d vd ON vd.idx = tf.type WHERE ic.coll = 1 AND vd.attr = 2 AND vd.idx = 1 AND vd.variant = 0”
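The per-field counting step can be illustrated with a small, self-contained sketch. It does not use the Hypatia schema: the flat `records` table, its column names, and the toy rows are hypothetical stand-ins chosen so that the example runs against an in-memory SQLite database.

```python
import sqlite3


def field_completeness(conn, table, fields):
    """Return {field: percentage of rows holding a non-empty value}.

    Table and field names are interpolated into the SQL text, so they
    must come from trusted code, never from user input.
    """
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    result = {}
    for field in fields:
        filled = conn.execute(
            f"SELECT COUNT(*) FROM {table} "
            f"WHERE {field} IS NOT NULL AND TRIM({field}) <> ''"
        ).fetchone()[0]
        result[field] = 100.0 * filled / total if total else 0.0
    return result


# Hypothetical toy data standing in for a metadata collection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (main_title TEXT, author TEXT, keywords TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        ("Site A", "Doe, Jane", "maps, atlas"),
        ("Site B", None, ""),
        ("Site C", "Smith, John", None),
        ("Site D", "", None),
    ],
)
percentages = field_completeness(conn, "records", ["main_title", "author", "keywords"])
```

Treating empty strings as missing is a design choice made here; a field that is present but blank adds nothing to a record, so it is counted as incomplete.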
For the second evaluation stage, human evaluators' assessments of the metadata quality were collected. An online evaluation form was developed using Google Docs. The questionnaire contained 21 questions that evaluated metadata quality from four aspects: accuracy, completeness, consistency, and functionality & usefulness. The accuracy section included six questions, asking how accurately each of the IPL metadata fields describes the web resource it points to; the completeness section included eight questions, asking how complete the IPL metadata fields are; the consistency section included four questions, asking how well the metadata record adheres to the current IPL metadata development guidelines; and the functionality & usefulness section included three questions, asking how functional and useful the overall metadata record is.
We used a series of computer-generated random numbers to draw a random sample of 467 records from the IPL collection. Eighty-three college students who were taking courses related to digital reference and digital libraries were recruited to act as human evaluators. They evaluated the records using the evaluation form and submitted their evaluations online. The evaluation task was given as part of their course assignments.
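Drawing such a sample could look like the following sketch. The identifier range, the function name and the fixed seed are assumptions for illustration only; the paper does not describe how its random numbers were generated or how records were numbered.

```python
import random


def sample_record_ids(record_ids, sample_size, seed=None):
    """Draw a simple random sample of record identifiers without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(list(record_ids), sample_size)


# Hypothetical identifiers 1..10000; 467 matches the sample size in the text.
sample = sample_record_ids(range(1, 10001), 467, seed=42)
```

Sampling without replacement guarantees that no record is assigned to the evaluation twice.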
Automatic Evaluation Results
A completeness percentage was calculated for each of the IPL metadata fields. The three largest collections in the IPL repository, the general collection, the youth collection (KidSpace) and the teen collection (TeenSpace), were investigated. Since the results from the three collections were similar, we report only the completeness percentages of the general collection, in Figure 1. The results show that while the Main title, Main URL and abstract fields were 100% complete, many other fields had very low completion rates.
Human Evaluation Results: Accuracy of tested fields
Student evaluators judged the accuracy of keywords, subject headings and abstract by giving a rating ranging from 0 to 5. A rating of zero means that the value in the field was not at all accurate. A rating of five means that the value in the field was very accurate. The mean accuracy rating for keywords was 2.06, with a standard deviation of 2.01. The mean accuracy rating for abstract was 3.91, with a standard deviation of 1.25. The mean accuracy rating for subject headings was 4.4, with a standard deviation of 1.11.
We calculated the percentage of ratings that were above three (from somewhat accurate to very much accurate). While 44% of keyword fields were rated above three, 91% of the subject headings fields were rated above three and 83% of the abstract fields were rated above three. Figure 2 shows the percentage of ratings that were above three for the three tested fields: keywords, subject headings and abstract.
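The summary statistics reported in this section can be reproduced with a short sketch such as the one below. The toy ratings are invented, the sample standard deviation is an assumption (the text does not say whether the sample or population formula was used), and the `inclusive` flag exists because the text leaves open whether a rating of exactly three counts toward the percentage.

```python
from statistics import mean, stdev


def summarize_ratings(ratings, threshold=3, inclusive=True):
    """Mean, sample standard deviation, and share of ratings at or above a threshold."""
    if inclusive:
        hits = [r for r in ratings if r >= threshold]
    else:
        hits = [r for r in ratings if r > threshold]
    return {
        "mean": mean(ratings),
        "sd": stdev(ratings),
        "pct": 100.0 * len(hits) / len(ratings),
    }


# Invented ratings for one field across five records.
toy_ratings = [0, 0, 3, 4, 5]
summary = summarize_ratings(toy_ratings)
strict = summarize_ratings(toy_ratings, inclusive=False)
```

Running both variants on the same data shows how much the reported percentage depends on the treatment of the middle rating.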
Evaluators were also asked to test and judge the accuracy of the URL field. The results showed that 93% of the main URL links were working, and 88% of the URLs opened the page that the metadata record described.
Human Evaluation Results: Completeness of tested fields
Student evaluators judged the completeness of keywords, author and title by giving a rating ranging from 1 to 5. A rating of one means that the value in the field was not at all complete. A rating of five means that the value in the field was very complete. Keywords received the lowest mean completeness rating, 2.52, with a standard deviation of 1.50. The mean completeness rating for author was 3.79, with a standard deviation of 1.51. The mean completeness rating for title was 4.32, with a standard deviation of 0.98.
Similarly, we calculated the percentage of ratings that were above three (from somewhat complete to very much complete). While 48% of the keywords fields were rated above three, 74% of the author fields were rated above three, and 93% of the title fields were rated above three, as shown in Figure 3.
The question about the presence of an alternative URL showed that 91% of the records did not have one. The questions about the presence of an email address revealed that 78% of the records had an email address, but only 38% of the provided addresses matched what was given on the web page. The email addresses came from four categories of source: author of the page (29%), contact email (27%), web master (12%), and cannot be determined (31%).
Human Evaluation Results: Consistency of tested fields
The consistency section of the questionnaire focused mainly on the author field and the keywords field, and aimed to evaluate the level of adherence to the IPL metadata development guidelines. Evaluators judged the consistency of the fields by choosing “yes” or “no” in response to a consistency statement.
For the author field, good consistency was described as “the field is filled with a person's name and whether the name is given in Last Name, First Name format”. The results showed that only 12% of the records were considered consistent, with 28% considered not consistent, and the rest undecided. For the keywords field, good consistency was described as “the field includes at least 2 keywords or phrases not found in the Main Title, Abstract and Subject heading fields”. Only 27% of the records were considered consistent, with 72% considered not consistent.
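Parts of these two consistency statements lend themselves to a mechanical check, sketched below under clearly stated assumptions. The pattern only tests for a single comma separating two non-empty parts, so it cannot tell whether the value is really a person's name, and the keyword check uses plain case-insensitive substring matching. The function names and sample values are hypothetical and were not part of the study, which relied on human judgment.

```python
import re

# Exactly one comma with non-empty text on both sides, e.g. "Doe, Jane".
NAME_PATTERN = re.compile(r"^[^,]+,\s*[^,]+$")


def author_is_consistent(author):
    """True if the value looks like 'Last Name, First Name' (shape check only)."""
    return bool(NAME_PATTERN.match(author.strip()))


def keywords_are_consistent(keywords, title, abstract, subjects, minimum=2):
    """True if at least `minimum` keywords do not occur in the other fields."""
    other_text = " ".join([title, abstract, subjects]).lower()
    novel = [
        k for k in keywords
        if k.strip() and k.strip().lower() not in other_text
    ]
    return len(novel) >= minimum


# Hypothetical record values.
ok = keywords_are_consistent(
    ["atlas", "cartography", "maps"],
    title="World Maps",
    abstract="A collection of maps.",
    subjects="Geography",
)
```

A check like this could flag candidate records for review, while the final judgment on names and keyword usefulness would still need a human evaluator.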
Human Evaluation Results: Overall evaluation
We asked the evaluators to judge the overall accuracy, completeness and consistency of the metadata records. As with the previous rating scales, a rating of one means that the metadata record was not at all accurate, complete or consistent, and a rating of five means that the metadata record was very accurate, complete and consistent. Figure 4 shows the ratings of overall consistency, completeness and accuracy. The mean overall consistency rating was 3.46, with a standard deviation of 0.99. The mean overall completeness rating was 3.58, with a standard deviation of 1.07. The mean overall accuracy rating was 3.90, with a standard deviation of 1.02.
We calculated the percentages of ratings above three for overall consistency, overall completeness and overall accuracy. 86% of the metadata records were considered at least somewhat consistent overall. 86% of the metadata records were considered at least somewhat complete overall. 92% of the metadata records were considered at least somewhat accurate overall.
We also asked evaluators to judge the searchability, browsability and management of the metadata records by giving a rating from 1 to 5 in three separate questions. The mean searchability rating was 3.52, with a standard deviation of 1.20. The mean browsability rating was 3.77, with a standard deviation of 1.10. The mean management rating was 3.70, with a standard deviation of 1.09. We calculated the percentages of ratings above three for management, browsability and searchability: 87%, 88% and 81% of the records, respectively, were rated above three.
We gathered qualitative evaluations for three fields: abstract, subject headings and keywords. The evaluators gave detailed descriptions of the rationales for their ratings. The qualitative feedback helped us understand the quantitative results.
From the automatic evaluation, we found that the presence of abstract is close to 100%. However, the human evaluation found that the overall accuracy rating for abstract is 3.91 (a perfect accuracy rating is 5). The abstract field was considered less than fully accurate for the following reasons.
First of all, a large number of abstracts no longer described the targeted web resources. The web sites have been updated continually, but the metadata has not kept pace. One evaluator noted sarcastically that the abstract did not include some of the content that the web site has: “maybe [this content] didn't exist 11 years ago when the metadata was first created”. The IPL holds a large collection, and the metadata development was done manually. Some metadata records were created when the resources were first added to the database and were never updated again.
Second, the abstracts were supposed to be summaries of the web site content rather than quotations from the web site text. Evaluators pointed out that a large number of abstracts were simply copied and pasted from the web site's about page. The reason might be that summarizing the content of a web site with a large volume of data was difficult for student volunteers. Since there was no quality control, volunteers might simply have chosen a “copy and paste” approach.
Third, the wording of the abstracts seemed to be an important factor affecting evaluators' ratings. They described the abstracts as “vague”, “not comprehensive”, “too technical”, “not so descriptive”, or sometimes “too casual”. Sometimes the abstracts were too short to cover all the important resources on the web site; sometimes they were so long that they exceeded the length limit. The tone of the abstract was also a concern. Evaluators were not sure whether abstracts should sound as if they came from the web site owner or from a neutral observer. Such confusion might stem from the IPL metadata development guidelines themselves. There is no coherent standard of quality. Different evaluators might have completely different opinions on whether an abstract was good enough.
Another reason was that some abstracts contained incorrect information. The last reason was the presence of typos, misspellings and grammar mistakes in the body of the abstracts.
Student volunteers decided which subject headings to use by choosing from a drop-down list of pre-defined categories. First, the evaluators commented that the subject heading structure sometimes did not contain a category specific enough to pinpoint the resource. Some subject headings were close but not quite right.
Second, the evaluators found that some records needed more detailed sub-subject headings or simply needed more subject headings. Sometimes there was just one subject heading where a few more could have been added. More subject headings can give patrons additional angles from which to access the information.
The completeness percentage of keywords from automatic evaluation was below 60%. The human evaluation of keywords was consistent with the automatic evaluation: 44% of keyword fields were rated above three for accuracy, 48% of the keywords fields were rated above three for completeness. The mean accuracy rating for keywords was 2.06. The mean completeness rating for keywords was 2.52.
Qualitative feedback revealed that the major reason for the low ratings was the absence of keywords altogether. Almost half of the reviewed records did not contain any keywords. Among the records that had keywords, some did not have enough of them.
Second, some keywords were words already present in the abstract, the titles, or the subject headings. Such keywords did not add any new value to the metadata, which was against the IPL development guidelines.
Third, the evaluators were not satisfied with the quality of the keywords. They were either too broad or too narrow. There were no coherent standards on how general or how specific the keywords should be.
Fourth, some keywords were unlikely to be the ones that patrons would use for searching. The evaluators believed that keyword development should keep the users in mind and produce a list of keywords that would be useful for later searching.
The fifth reason concerned the currency of the information: the target resources had been updated, but the keywords had not been updated accordingly. The last reason came from misspellings, capitalization issues, and typos. Some evaluators also wondered whether the plural and singular forms of the same keyword should both be present in the keyword field.
Overall Consistency Problems
The currency problem seems to be the most common reason for the poor quality of the metadata records. It often happened that the web site described by the metadata was not the one that the URL pointed to: the web site had changed address, the web site had been updated, or the link simply did not work.
Second, the metadata developers were confused about the IPL metadata schema, which resulted in inconsistency in the metadata records. For example, “The record treats the site's editor as the site's author”; “The creator listed is for the parent site, but the author linked at the bottom of the game theory page for that content is an entirely different person, parts of whose cybernetic dictionary have been represented with his permission”; “Full contact info for the real author has been provided, and the creator should actually be the publisher or editor as he is part of an editorial board that maintains the parent site”; “The current URL was placed in the ‘Former URL section’ and the former URL was placed in the ‘Main URL section’”.
The third reason was that some of the resources described no longer met the IPL collection inclusion criteria. In some cases, the web site had started to require payment for access to the information, the site was no longer active, or it had become too commercial and contained too many ads. For example, an evaluator pointed out that the overall information on the site was valuable, but that there was an area for jokes which may not be appropriate.
Fourth, adherence to the IPL development guidelines seems to be a common problem. Violations of the guidelines were observed quite often in the metadata records. For instance, evaluators mentioned many times that the names in the author field were not in Last Name, First Name format.
Last but not least, the metadata fields contained incorrect information, arising either from human mistakes or from typos.
Discussion and Conclusion
In order to assess the current quality of the IPL metadata, we planned and conducted a two-stage evaluation. In the first stage, the automatic evaluation gave us an overview of the completeness of the metadata records in the collection. We found that the main title, main URL and abstract fields seemed to be fine. The author and keywords fields were far from satisfactory, as their completeness percentages were low. We then used a survey to collect human evaluators' feedback on the quality of the metadata, focusing on the following aspects: accuracy, completeness, consistency and overall functionality. Human evaluators were asked not only to rate the quality of the tested fields, but also to give qualitative comments on why they gave such ratings. The results from the human evaluation were much more informative than those from the automatic evaluation.
The evaluation revealed the following findings about IPL metadata quality. The main title did not have any quality problems. However, the title field contained a list of sub-fields: former title, sort title, acronym, alternate title, alternate spelling, real title, and authority title. Some of these sub-fields were required and some optional, which made it a complicated field for student volunteers. As the results show, only 60% of the title fields were considered very complete.
Main URL was 100% complete based on the automatic evaluation. However, for about 12% of the records, the page opened by the URL was not the page described in the record. The abstract field was close to 100% complete based on the automatic evaluation, but only 45% of the abstracts were considered accurate by the evaluators. The automatic evaluation indicated that author and keywords might have the most problems. Indeed, human evaluators found that only half of the author fields were very complete. For the keywords field, less than one fifth of the records were considered very complete or very accurate.
From the evaluators' qualitative feedback, we identified five major causes of the problems with IPL metadata quality: a large amount of the metadata information was outdated; student volunteers did not fully understand the IPL metadata schema and had trouble filling out some of the fields; some of the resources included in the IPL should actually have been cut from the collection; the IPL metadata development guidelines were not understood as well as they should have been; and there were problems with typos, misspellings and grammar mistakes.
We believe that some changes can be made to the IPL metadata development process. First of all, student volunteers should receive more thorough training in the IPL schema itself and in the development guidelines. There should be more quality control in the IPL metadata development process. Second, manual metadata development is an easy way to start, but it gets more difficult as the IPL keeps growing in both collection size and personnel. Some automatic or semiautomatic metadata creation method is needed for the IPL.
We would like to thank all the researchers who have been working on the IPL and IPL related projects. We would also like to thank all the students who participated in the metadata evaluation project and provided their valuable feedback.