Identifying reusable resources in digital reference answers
What can we learn from resources provided in answers to digital reference questions? Using the Internet Public Library's archive of over 60,000 records of answered reference questions, this project aims to determine (1) characteristics of resources provided in answers to digital reference questions, (2) the extent to which these resources are reusable in future answers, and (3) the useful lifespan of a resource that has been provided. As part of our research, we have developed an XML schema for digital reference questions and answers.
Ever since the advent of digital reference services, there has existed a belief that, as Coffman (2001) puts it, “if we could somehow access the work another librarian had done before, there would be no need to start over answering every question from scratch” (p. 152). This is a seductive notion, as it offers at least a partial solution to a perennial problem of reference work: scalability. Human labor is time-consuming and expensive, and significant savings of both could be realized if even some of the products of that labor could be reused. The popularity and usefulness of FAQs provide evidence for this point. Further, accessing the work of other librarians has the potential to provide at least two other benefits: (1) resources for training new librarians, and (2) the possibility of using data mining and other automation to assist in reference work.
This poster reports on a project to identify the resources that are provided in answers to digital reference questions. If it is true, as some authors suggest, that the situation that gives rise to a question is unique for every individual, then even identically phrased questions cannot be treated as actually being identical, even if the question is commonly received by a reference service. It therefore may not make sense to attempt to reuse resources across ostensibly identical questions. Many digital reference services allow users to categorize their question by topic, however, so it may be possible to identify resources commonly provided in answers to questions within specific topics. The research questions guiding this project are therefore as follows:
- 1) What resources are provided in answers to digital reference questions? With what frequency are these resources used in answers to different topics and types of questions?
- 2) To what extent are the resources provided in answers reusable for future answers? What factors affect their reusability?
- 3) What is the useful lifespan of a resource provided in an answer to a past digital reference question? To what extent is time a factor in a resource's reusability?
This project makes use of data from the Internet Public Library's (IPL, ipl.org) archive of reference questions. This archive contains all answered questions submitted to the IPL's reference service from the inception of the IPL in September 1995 to the present. The IPL has deidentified these records to a 90-95% accuracy level.
We built a simple spider and downloaded all 60,393 records that existed in the IPL's archives as of the end of January 2008. The records contain fielded data, as an artifact of the IPL's Ask A Question webform (ipl.org/div/askus/). Some of these fields are closed-ended, such as the drop-down list of subjects, while others are open-ended, including the field in which the user specifies their question. In its current form, 5 fields on this webform are required and 4 are optional.
We developed XML tags based on Dublin Core for every field in the question-answer records, and for all data within the fields that we wished to identify. In particular, we tag URLs provided within the answers, which are relatively easy to identify using regular expressions, and resolve all TinyURLs in answers (IPL policy dictates that an additional shortened link be provided using TinyURL for any URL over 65 characters in length). Other resources (e.g., books and journal articles) are also provided in IPL answers, but far less frequently than URLs. This is likely because the answerer can be certain that the questioner has access to the internet, but cannot be certain which print resources the questioner can access. Future work will be required to develop methods to automatically identify resources other than URLs in answers. We have also developed a parser to automatically tag each record using this XML schema.
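The URL-tagging step described above can be illustrated with a minimal sketch. This is not the project's parser or schema: the regular expression, the element names (`record`, `answer`, `resource`) and the `shortened` attribute are hypothetical, chosen only for illustration. The Dublin Core element namespace is the standard one. Resolving a TinyURL to its target requires a network request and is not shown.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Standard Dublin Core element namespace.
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

# Assumed pattern: an http(s) URL runs until whitespace, a quote, or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(text):
    """Return the URLs found in text, trimming trailing sentence punctuation."""
    return [m.group(0).rstrip(".,;:!?)") for m in URL_PATTERN.finditer(text)]


def is_tinyurl(url):
    """True if the URL's host is tinyurl.com (with or without 'www.')."""
    host = urlparse(url).hostname or ""
    return host in ("tinyurl.com", "www.tinyurl.com")


def tag_answer(subject, answer_text):
    """Build a hypothetical XML record with one <resource> element per URL."""
    record = ET.Element("record")
    ET.SubElement(record, f"{{{DC}}}subject").text = subject
    answer = ET.SubElement(record, "answer")
    ET.SubElement(answer, "text").text = answer_text
    for url in extract_urls(answer_text):
        resource = ET.SubElement(answer, "resource")
        # Flag shortened links so they can be resolved in a later pass.
        resource.set("shortened", "true" if is_tinyurl(url) else "false")
        resource.text = url
    return record


if __name__ == "__main__":
    rec = tag_answer(
        "Science",
        "See http://www.ipl.org/div/farq/. Short link: http://tinyurl.com/abc123",
    )
    print(ET.tostring(rec, encoding="unicode"))
```

Flagging shortened links rather than resolving them inline keeps the tagging pass free of network access, which matters when processing tens of thousands of records.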
As with most projects involving large data sets, data cleaning required considerable time and effort. Having completed data cleaning, we ran the parser on all 60,393 records and have tagged all records using our XML schema. The IPL has devoted a great deal of effort to standardizing the format of the answers provided, by training its volunteer answerers and by maintaining a detailed set of policies for how answers should be structured. Even so, there is considerable variability in the question-answer records, due in part to the evolution of the IPL's question submission form over time and in part to differences in users' and answerers' writing styles. Our schema has proven to be flexible enough to tag all 60,393 records with only two failures. At the conclusion of this project we plan to deposit the schema and parser with the IPL's Learning Community (ipl.ci.fsu.edu), currently under development thanks to a grant from the Institute of Museum and Library Services.
Given the similarities between digital reference services' question submission forms, we believe that this schema may prove useful for marking up digital reference questions universally. Future work is needed to confirm its applicability and to develop any necessary changes or options. We hope that future work will also result in a collection of data analysis tools that may be shared across the digital reference community and applied to commonly marked-up question-answer records.
Our data analysis is still underway as of this writing. We have found that, not surprisingly, resources from the domain ipl.org are provided most often, in 12% of all answers (n=31,485). Next is google.com (5%), followed by yahoo.com, altavista.com, and amazon.com, at 1% each. The distribution of resources provided in IPL answers has an extremely long tail, with fully 60% of domains appearing in only one answer, and 96% in ten or fewer answers, which is an average of less than one appearance per year. The set of subdomains and individual webpages provided in answers is even larger. The subdomain en.wikipedia.org appears in 0.06% of answers, which is surprisingly high as the IPL has a policy discouraging answerers from using Wikipedia.
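A domain-frequency analysis of this kind might be sketched as follows. This is an illustrative sketch, not the analysis code used in the project: the function names are hypothetical, and reducing a host to its last two labels is a simplifying assumption that mishandles suffixes such as .co.uk. Each domain is counted once per answer, so the counts correspond to the number of answers in which a domain appears.

```python
from collections import Counter
from urllib.parse import urlparse


def registered_domain(url):
    """Crude domain key: the last two labels of the host (an assumption)."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def domain_answer_counts(answers):
    """Count, for each domain, the number of answers in which it appears.

    answers is a list of lists of URLs, one inner list per answer.
    """
    counts = Counter()
    for urls in answers:
        # A set ensures a domain is counted at most once per answer.
        domains = {registered_domain(u) for u in urls}
        domains.discard("")  # skip URLs with no parseable host
        counts.update(domains)
    return counts


def tail_share(counts, max_answers):
    """Fraction of domains that appear in at most max_answers answers."""
    if not counts:
        return 0.0
    rare = sum(1 for c in counts.values() if c <= max_answers)
    return rare / len(counts)


if __name__ == "__main__":
    sample = [
        ["http://www.ipl.org/a", "http://ipl.org/b"],
        ["http://www.google.com/"],
        ["http://en.wikipedia.org/wiki/X", "http://www.ipl.org/c"],
    ]
    counts = domain_answer_counts(sample)
    print(counts.most_common())
    print(tail_share(counts, 1))
```

Calling `tail_share(counts, 1)` and `tail_share(counts, 10)` on the full set of tagged answers would give the kind of long-tail proportions reported above.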
We have found very little correlation between the resources provided in answers and the subjects of questions (from the Ask A Question webform). Despite the large number of resources from the ipl.org domain provided in answers, most individual webpages within ipl.org were provided fewer than 100 times per subject. The exception to this is for questions in the FARQ (Frequently Asked Reference Questions) category, but most webpages provided for these questions are pages developed by the IPL specifically to answer FARQs. Further, most FARQ webpages are provided in answers with other FARQ pages, and only rarely with resources outside of ipl.org.
There are (at least) two possible interpretations of these findings. The first is that librarians craft an answer to every question, treating each as having a unique answer. This has an implication for teaching: reference work is conducted as a craft rather than a science, and as such would best be taught by internship or apprenticeship. A second possible interpretation is that librarians do not reuse answers or resources for similar questions. This suggests a need for better support for storing and searching previously answered questions. This also has an implication for the practice of reference work: it may be beneficial for librarians to be trained and encouraged to consult repositories of previously answered questions, rather than conducting a new search every time.
We continue to explore factors that may correlate with the resources provided in answers. Some possibilities include how the questioner plans to use the information and sources already consulted. We also plan to identify a window of text around each URL that provides the reason the URL was provided in the answer. The IPL's policies regarding the formulation of answers specify some mandatory elements of answers, including providing an answer, citing the resource(s) used, and describing the search conducted by the answerer. The inclusion of these elements in answers usually results in the answerer providing a description of how the URL may be useful to the questioner. Even with the IPL's policies for formulating answers, we expect there to be variation in the location of “justification windows” (before the URL, after the URL, or both) and their size (measured in words, sentences or even paragraphs). We will use text extraction and mining techniques in this investigation.
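One simple way to extract a word-based “justification window” is sketched below. This is an assumption about how such windows could be computed, not a description of the planned method: the function name, the whitespace tokenization, and the default window sizes are all hypothetical, and sentence- or paragraph-based windows would need a different segmentation step.

```python
import re

# Assumed pattern: an http(s) URL runs until whitespace, a quote, or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def justification_windows(text, words_before=10, words_after=10):
    """Return the words surrounding each URL in text.

    Each result is a dict with the URL and the text of up to words_before
    words preceding it and up to words_after words following it.
    """
    tokens = text.split()
    windows = []
    for i, token in enumerate(tokens):
        match = URL_PATTERN.search(token)
        if match is None:
            continue
        before = tokens[max(0, i - words_before):i]
        after = tokens[i + 1:i + 1 + words_after]
        windows.append({
            # Trim trailing sentence punctuation from the URL itself.
            "url": match.group(0).rstrip(".,;:!?)"),
            "before": " ".join(before),
            "after": " ".join(after),
        })
    return windows


if __name__ == "__main__":
    answer = (
        "This page lists state symbols: http://www.ipl.org/div/stateknow/ "
        "and it covers all fifty states."
    )
    for window in justification_windows(answer, words_before=3, words_after=2):
        print(window)
```

Keeping the text before and after the URL separate makes it possible to compare where the justification tends to fall (before the URL, after it, or both).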
Coffman, S. (2001). We'll Take it from Here: Developments We'd Like to See in Virtual Reference Software. Information Technology and Libraries, 20(3), 149-153. http://www.ala.org/ala/lita/litapublications/ital/2003coffman.htm