A conceptual framework for improving information retrieval in folksonomy using Library of Congress subject headings



Social tagging or collaborative tagging is a new trend in the organization, management, and discovery of digital information. Folksonomy, a collection of tags assigned to online information by all participants of social tagging systems, has many advantages as an information organization tool. As an information retrieval tool, however, folksonomy lacks the capability of discerning the relevancy of information. We propose a conceptual framework of IR for folksonomy on the basis of Library of Congress Subject Headings. Algorithms for constructing the conceptual framework are described, and values of the framework are also discussed.


Social tagging or collaborative tagging is a new trend in the organization, management, and discovery of digital information (Begelman, Keller, & Smadja, 2006; Cattuto, Loreto, & Pietronero, 2007; Choy & Lui, 2006). Popular social tagging systems include Delicious (http://del.icio.us) for web resources, Flickr (http://www.flickr.com) for pictures, and Technorati (http://technorati.com) for blogs. In practice, anyone can participate in annotating online information: a user thinks of keywords or tags and assigns them to the information with the intention of later accessing it through those tags. As a result, a collection of tags assigned to online information by all participants is created; this collective tag set is called a folksonomy. The idea of social or collaborative tagging has been quickly adopted by library communities, blogs, media and broadcasting websites, and commercial websites such as Amazon (www.amazon.com), to name a few. As social tagging's popularity increases, however, the vast quantity of online information organized by folksonomies becomes harder and harder to access and retrieve, because folksonomies are uncontrolled vocabularies that users subjectively select and assign as tags. Currently, social tagging systems and their applications rely heavily on the users' uncontrolled vocabularies for information retrieval and discovery. Further research on this issue is therefore needed in order to improve the quality of information retrieval (IR) and the accessibility of folksonomy-based information systems.

In this study, we propose a conceptual framework of information retrieval in folksonomy based on Library of Congress Subject Headings (LCSH). LCSH is the largest controlled vocabulary in English, and it has served as a de facto standard controlled vocabulary that is extensively used by libraries in many countries around the world (O'Neill & Chan, 2003). More recently, it has been applied to projects that organize digital information: integrating digital libraries with traditional libraries (Frank & Paynter, 2004), and integrating information environments with multiple information depositories (Koch & Day, 1997). One motivation for this study is Google's incorporation of LCSH to enhance the search capabilities of Google Book Search (Riley, 2007), for which the detailed algorithms and methods have not been disclosed. This study presents a new approach for improving folksonomy's IR capability by applying LCSH to folksonomy.

Folksonomy can be defined as a collection of tags used by individuals. For retrieval, a folksonomy is presented to users in the form of a tag cloud consisting of the most popularly assigned tags. When a tag in a tag cloud is selected, a number of online resources associated with the tag are returned in the chronological order of the tagging itself; there is no room for considering the relevancy of the information to the users' needs, which is a goal of retrieving information. For example, Delicious, a well-known collaborative bookmarking application, is a typical social tagging system. In this system, the most recently tagged online sources appear at the top of the rankings rather than the information most relevant to the tag. In fact, the display of relevant information for folksonomy has never been taken into account in any tagging system or application.

LCSH-based Conceptual Framework of IR in Folksonomy

We will describe how LCSH can become a conceptual framework of IR in folksonomy. For this task, we first need to translate LCSH into its corresponding tree (called a LCSH Tree). The tree is then utilized as an IR framework for folksonomy. A core challenge in successfully implementing the LCSH-based framework is the selection of appropriate LCSH for folksonomy.


A tree is a data structure in which a hierarchical structure can be represented (Cormen et al., 2001). It is composed of a set of nodes, including a special node (called the root node) placed at the highest level of the tree, and links that connect the nodes. There is only one root node in a tree. A node one level immediately higher (lower) than a current node is called the parent (child) node of the current node. A node that does not have a child node is called a leaf node. In other words, a leaf node is the last node along a path starting from the root node. Conversely, the root node is the only node in a tree that has no parent node.

LCSH can be represented by a tree data structure. Let LCSH Tree be a tree in which a term in LCSH is denoted by a tree node and hierarchical relationships between terms are denoted by links between the nodes for the terms. The following term types and relationships are available in LCSH: established terms (main headings), Related Terms (RTs), Used For terms (UFs), Broader Terms (BTs), and Narrower Terms (NTs). Hierarchical term relationships are not always rigorously defined, particularly with regard to subject heading strings (i.e., combinations of a main heading with one or more subdivisions). Since subject heading strings involve multiple facets, which in turn lead to ambiguities in deciding hierarchical relationships among them, we focus only on main headings without subdivisions. Established terms and their BTs or NTs are in a hierarchical relationship. BTs (NTs) are denoted by the parent (child) nodes of the node for an established term. By repeating the process, various child-parent nodes are created and a LCSH Tree can be built. UFs are generally accepted as semi-synonyms or synonyms of established terms, and are seen as references to the established terms. UFs are denoted by the sibling nodes (nodes at the same level) of the node for an established term. RTs, however, are excluded in constructing the Tree, as the relationship between RTs and established terms is not clearly known except that they are associated.
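The node-linking rules described above can be sketched in Python. This is a minimal illustration rather than the authors' implementation: the `HeadingNode` class, the `build_nodes` function, the dictionary record layout, and the sample headings are all hypothetical, and parents are stored in a set on the assumption that a record may list more than one BT.

```python
from dataclasses import dataclass, field


@dataclass
class HeadingNode:
    """One established main heading (hypothetical structure for this sketch)."""
    term: str
    parents: set = field(default_factory=set)    # Broader Terms (BTs)
    children: set = field(default_factory=set)   # Narrower Terms (NTs)
    used_for: set = field(default_factory=set)   # UF terms, kept at the node's level


def build_nodes(records):
    """Link headings through BT/NT references; RTs are ignored, as in the text."""
    nodes = {}

    def node(term):
        # Create the node on first sight, otherwise reuse it.
        return nodes.setdefault(term, HeadingNode(term))

    for rec in records:
        current = node(rec["term"])
        current.used_for.update(rec.get("UF", []))
        for broader in rec.get("BT", []):
            current.parents.add(broader)
            node(broader).children.add(current.term)
        for narrower in rec.get("NT", []):
            current.children.add(narrower)
            node(narrower).parents.add(current.term)
    return nodes


# Illustrative records only; they are not copied from the LC authority file.
records = [
    {"term": "Animals", "NT": ["Mammals"], "UF": ["Fauna"], "RT": ["Zoology"]},
    {"term": "Mammals", "BT": ["Animals"], "NT": ["Cats"]},
    {"term": "Cats", "BT": ["Mammals"]},
]
nodes = build_nodes(records)
print(sorted(nodes["Mammals"].children))  # ['Cats']
```

Keeping UFs as an attribute of the established term, rather than as separate nodes, is a simplification of the sibling-node placement described in the text.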

Once the algorithm for constructing the LCSH Tree is applied to LCSH, LCSH is represented by a set of n trees. Each tree is independent of the remaining n-1 trees, meaning that there is no link (hierarchical relationship) across different trees. Because of this independence, each of the n trees is defined as a concept tree, as we believe it represents a concept. To connect all n trees into one (the LCSH Tree), a fictitious node is created and used as a parent node for all n trees. The resultant single tree becomes the LCSH Tree, with the fictitious node as its root node. Hence, the LCSH Tree consists of n trees. As the n trees are all located under one root node, they are called LCSH sub-trees. The complex structure of the LCSH Tree (a concatenation of n LCSH sub-trees) reflects the hierarchical structure of LCSH.
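Joining the independent trees under a fictitious root can be sketched as follows. The sketch assumes each heading has at most one parent, and the `ROOT` label, the function names, and the sample headings are hypothetical.

```python
from collections import defaultdict

ROOT = "<LCSH ROOT>"  # fictitious root node; the label is an assumption


def connect_subtrees(parent_of, terms):
    """Attach every parentless heading to one fictitious root.

    parent_of maps a heading to its broader heading; terms lists all headings.
    Returns a children map describing the single resulting tree.
    """
    children = defaultdict(list)
    for term in terms:
        children[parent_of.get(term, ROOT)].append(term)
    return children


def subtree_size(children, term):
    """Count the nodes in the sub-tree rooted at term (iterative traversal)."""
    size, stack = 0, [term]
    while stack:
        current = stack.pop()
        size += 1
        stack.extend(children.get(current, []))
    return size


# Illustrative headings only.
terms = ["Animals", "Mammals", "Cats", "Music", "Jazz", "Chess"]
parent_of = {"Mammals": "Animals", "Cats": "Mammals", "Jazz": "Music"}
tree = connect_subtrees(parent_of, terms)

print(tree[ROOT])  # ['Animals', 'Music', 'Chess']
print({t: subtree_size(tree, t) for t in tree[ROOT]})
# {'Animals': 3, 'Music': 2, 'Chess': 1}
```

The number of children of the fictitious root corresponds to n, and the sub-tree sizes show how a one-node sub-tree and a larger sub-tree sit side by side under the same root.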

[Figure 1. A list of LCSH sub-trees.]
[Figure 2. The hierarchy of a LCSH sub-tree.]

Building a LCSH Tree

In this study, a machine-readable Library of Congress subject authority file is taken as the source of LCSH. A retrospective version of the LC subject authority file, consisting of 291,000 authority records and covering the years 1986 to 2005, available from the Library of Congress Cataloging Distribution Service (http://www.loc.gov/cds/mds.html#sa), is used to build a LCSH Tree. Alternative sources of bibliographic records are not considered because we want to build the Tree from subject headings assigned and validated by a single authority, namely the Library of Congress. Not all the subject headings in the file are used. Only subject headings for topical terms (specifically, headings under MARC field 150, subfield code a) are used, thereby excluding headings for names or titles, because the hierarchical structure of the Tree is built on the basis of subject or topic. Juvenile headings (beginning with sj) are excluded as well.
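The selection rules in this paragraph can be sketched as a simple filter. The record layout below (plain dictionaries with `control_number` and `fields` keys) is hypothetical and stands in for parsed MARC authority records; the sketch also assumes that the "sj" prefix appears on the record's control number, which the text does not state explicitly.

```python
def select_topical_headings(records):
    """Keep topical main headings (field 150, subfield a); skip juvenile records."""
    selected = []
    for rec in records:
        # Assumption: the "sj" prefix is found on the record's control number.
        if rec["control_number"].startswith("sj"):
            continue
        heading = rec["fields"].get("150", {}).get("a")
        if heading:  # records without a 150 field (e.g. names) are dropped
            selected.append(heading)
    return selected


# Illustrative records only; the control numbers are made up.
records = [
    {"control_number": "sh0000001", "fields": {"150": {"a": "Cats"}}},
    {"control_number": "sj0000002", "fields": {"150": {"a": "Kittens"}}},
    {"control_number": "sh0000003", "fields": {"151": {"a": "Kentucky"}}},
]
print(select_topical_headings(records))  # ['Cats']
```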

A LCSH Tree is constructed based on the algorithm and dataset described above. The 2005 retrospective version of the LC subject authority file produces 28,136 trees (the n above), i.e., 28,136 LCSH sub-trees (Yi & Chan, 2007). The LCSH Tree is the concatenation of these 28,136 trees under one root node. The sub-trees vary greatly in size, from a sub-tree consisting of one node to others with ten thousand nodes. A LCSH visualization tool developed by the author based on the LCSH Tree is used in the implementation of the algorithm defined below. Figures 1 and 2 show a list of LCSH sub-trees and the hierarchy of one sub-tree, respectively.

Linking LCSH Tree and Folksonomy

We present an algorithm, based on the LCSH Tree built above, that helps determine the subject headings relevant to folksonomy. The proposed algorithm is as follows: Let C = {c1, c2, …, cn} be a LCSH Tree consisting of n sub-trees (theoretically equivalent to single concepts; the value of n would be 28,136 in this case), and Si = {si1, si2, …, sik} be the complete set of k subject headings appearing in the ith LCSH sub-tree, 1 ≤ i ≤ n. Find all the possible pairs (ci, sij), 1 ≤ i ≤ n and 1 ≤ j ≤ k, such that the sum of the TF (Term Frequency)-IDF (Inverse Document Frequency) based similarity of all triples (ci, sij, tag) is maximized.

Many variations of TF and IDF algorithms are available and are used in many different settings. For our purposes, we use the TF and IDF version adopted by the popular IR system Okapi (Robertson et al., 1995) and several other IR systems (Ponte & Croft, 1998). For the sake of completeness, it is reproduced here:

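$$\mathrm{TF}(t,d) = \frac{\mathrm{Tf}(t,d)}{\mathrm{Tf}(t,d) + 0.5 + 1.5\,\dfrac{l(d)}{\mathrm{AveNum}}}, \qquad \mathrm{IDF}(t) = \frac{\log\left(\dfrac{N + 0.5}{df(t)}\right)}{\log(N + 1)}$$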

where TF(t,d) = modified version of term frequency; Tf(t,d) = number of occurrences of the term t in document d; l(d) = total number of terms appearing in document d; AveNum = average number of terms per document in the corpus; IDF(t) = standard version of inverse document frequency; N = total number of documents in the corpus; df(t) = number of documents in the corpus containing the term t.
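The quantities defined above can be computed with a few lines of Python. The sketch below assumes the commonly cited Okapi-style form, in which the constants 0.5 and 1.5 and the log-normalized IDF appear; these constants, the function names, and the toy corpus are assumptions of the sketch, not details confirmed by the text.

```python
import math


def okapi_tf(tf, doc_len, avg_len):
    """TF(t,d): raw count damped by document length (assumed constants 0.5, 1.5)."""
    return tf / (tf + 0.5 + 1.5 * doc_len / avg_len)


def okapi_idf(n_docs, df):
    """IDF(t): log((N + 0.5) / df(t)) normalized by log(N + 1)."""
    return math.log((n_docs + 0.5) / df) / math.log(n_docs + 1)


def tf_idf(term, doc, corpus):
    """Weight of term in doc, where doc and corpus entries are lists of terms."""
    tf = doc.count(term)                                   # Tf(t,d)
    if tf == 0:
        return 0.0
    avg_len = sum(len(d) for d in corpus) / len(corpus)    # AveNum
    df = sum(1 for d in corpus if term in d)               # df(t)
    return okapi_tf(tf, len(doc), avg_len) * okapi_idf(len(corpus), df)


# Toy corpus for illustration only.
corpus = [["cats", "pets", "cats"], ["jazz", "music"], ["pets", "dogs"]]
print(tf_idf("cats", corpus[0], corpus) > tf_idf("pets", corpus[0], corpus))  # True
```

In the toy corpus, "cats" outranks "pets" in the first document because it occurs more often there and appears in fewer documents overall.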

The adopted TF-IDF algorithm is modified for measuring the similarity between folksonomy and LCSH. In the modified version, LCSH sub-trees and subject headings correspond to documents and terms in the original TF-IDF algorithm, respectively. The relation between term and document in the original TF-IDF is also applied to calculate the similarity between folksonomy and LCSH. The modified algorithm is therefore the product of the TF-IDF for concept sub-trees and subject headings and the TF-IDF for LCSH and folksonomy. With the algorithm, a ranked list of subject headings can be presented to users for a selected tag.
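One possible reading of this two-level weighting is sketched below; it is an interpretation, not the authors' implementation. It assumes that the first weight treats sub-trees as documents and headings as terms, that the second weight treats each heading as a document whose terms are its words and matches the tag against those words, and that the Okapi-style constants 0.5 and 1.5 apply. All function names and sample headings are hypothetical.

```python
import math


def okapi_tf_idf(tf, doc_len, avg_len, n_docs, df):
    """Okapi-style TF-IDF weight (constants 0.5 and 1.5 are assumed)."""
    if tf == 0 or df == 0:
        return 0.0
    tf_part = tf / (tf + 0.5 + 1.5 * doc_len / avg_len)
    idf_part = math.log((n_docs + 0.5) / df) / math.log(n_docs + 1)
    return tf_part * idf_part


def rank_headings(tag, subtrees):
    """Rank (sub-tree, heading) pairs for a tag by the product of two weights.

    subtrees maps a sub-tree id to the list of subject headings it contains.
    """
    # Level 1: sub-trees act as documents and headings as terms.
    n_trees = len(subtrees)
    avg_tree_len = sum(len(hs) for hs in subtrees.values()) / n_trees
    # Level 2 (assumed): headings act as documents and their words as terms.
    headings = [h for hs in subtrees.values() for h in hs]
    words = {h: h.lower().split() for h in headings}
    avg_words = sum(len(words[h]) for h in headings) / len(headings)
    tag = tag.lower()
    tag_df = sum(1 for h in headings if tag in words[h])

    scored = []
    for tree_id, tree_headings in subtrees.items():
        for heading in tree_headings:
            heading_df = sum(1 for hs in subtrees.values() if heading in hs)
            w_tree = okapi_tf_idf(tree_headings.count(heading), len(tree_headings),
                                  avg_tree_len, n_trees, heading_df)
            w_tag = okapi_tf_idf(words[heading].count(tag), len(words[heading]),
                                 avg_words, len(headings), tag_df)
            if w_tree * w_tag > 0:
                scored.append((w_tree * w_tag, tree_id, heading))
    return sorted(scored, reverse=True)


# Illustrative sub-trees only.
subtrees = {
    "c1": ["Cats", "Wild cats", "Cats in art"],
    "c2": ["Jazz", "Jazz musicians"],
    "c3": ["Chess"],
}
ranked = rank_headings("cats", subtrees)
print([heading for _, _, heading in ranked])
# ['Cats', 'Wild cats', 'Cats in art']
```

Under these assumptions, shorter headings that contain the tag rank higher, and a tag with no matching heading yields an empty list.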


A conceptual framework for improving IR in folksonomy has been proposed and described. The values of the new conceptual framework for folksonomy can be summarized as:

  • Controlling synonyms and homographs: Folksonomy is not a controlled vocabulary and lacks control over synonyms and homonyms. This can be somewhat controlled by LCSH in which synonyms and homonyms are to some extent included.
  • Revealing multiple concepts associated with folksonomy: A LCSH sub-tree can be defined to correspond to a specific concept. The proposed framework reveals subject headings as well as concepts associated with folksonomy.
  • Facilitating mapping of folksonomy within digital and traditional libraries: Bibliographic databases in which LCSH is an essential element are available through digital libraries and bibliographic utilities. It may be feasible to link and add user-selected online information to the databases through folksonomy and LCSH.