Tags as keywords – comparison of the relative quality of tags and keywords



Tags assigned by users are considered to have potential for categorizing web resources, despite the fact that they do not come from a controlled vocabulary and are contributed by non-experts. This study explores the potential of using tags in place of controlled vocabulary as a method of organizing web resources. Author-assigned keywords and user-assigned tags for a shared data collection are compared. The data was gathered from scientific papers that have both author-assigned keywords and user-assigned tags. The two sets of terms were compared using similarity measurements, classification systems, and clustering methods.


Social tagging systems have gained popularity among web users as a method of organizing and retrieving web resources. Social tagging systems, such as del.icio.us and Flickr, encourage user involvement in the tagging process by providing services that motivate and benefit users, e.g. bookmarking favorite links, organizing/sharing pictures, and getting recommendations on related web pages. Tags can represent user categorization of shared content. Tags that are descriptive of a resource can be used in categorization, i.e. social classification or folksonomy. In this way, social tagging systems can be considered a channel for non-professional users' participation in the keyword generation and resource categorization process.

Research has demonstrated and discussed the use of tags for resource organization and discovery. Traditionally, controlled vocabulary as metadata uses selected words to organize knowledge. As Macgregor and McCulloch (2006) discussed, the major obstacles to deploying controlled vocabularies for web resources are high production cost and poor scalability. Tags are keywords/annotations freely chosen by end users. Tags from users can be used as one kind of metadata for organizing resources at lower cost and with greater scalability. Lin et al. (2006) described social tags as having “no hierarchy, no directly specified term relationship, and no pre-determined set of groups or labels.” Given this description, it may be useful to explore and understand the features and types of social tags to find potential uses for them, especially in resource categorization and retrieval.

Much recent research on social tagging is based on casual observation and focuses on semantic aspects of social tags as a way of understanding them. This study compares the quality of resource representation using tags and author-provided keywords, with an eye to justifying user-provided tags for resource categorization through empirical and statistical tests.


There is growing research support for using tags in web resource classification (Golder and Huberman, 2005; Macgregor and McCulloch, 2006; Mathes, 2004; Mika, 2007; Noll and Meinel, 2007; Noll and Meinel, 2008; Quintarelli, 2005; Shirky, 2005; Veres, 2006). Golder and Huberman's work on social tags (2005) showed the potential to produce a classification based on tags. They observed that tags tend to represent a basic-level classification. Similarly, Mika (2007) demonstrated that tags can be used to cluster concepts, especially as more tags are added. Macgregor and McCulloch (2006) argued that social tagging systems let users participate in the organization of web resources and make it possible to lower the cost of web resource organization. They also predicted positive potential in letting collaborative tagging and controlled vocabulary systems co-exist. Mathes (2004) analyzed folksonomy as user-created metadata, and Quintarelli (2005) introduced folksonomy as one method of classification. Noll and Meinel (2007; 2008) examined tags in comparison with web document metadata, e.g. HTML meta tags, and a top-down taxonomy to define the characteristics of tags in terms of metadata and web document classification. Veres (2006) used psychological and linguistic theory to explain that users classify according to principles not too dissimilar from those imposed by formal taxonomies.

Ongoing projects in the library community are making use of tags for libraries from the perspective of Library 2.0. The steve museum project (http://www.steve.museum/) (Trant, 2006a; Trant, 2006b) is a collaborative project of eight US museums and several museums abroad to examine how social tagging provides new ways to describe and access museum collections that supplement existing museum documentation. With social tags gathered from their system, the steve project expects to have information that makes sense to general viewers and thereby reduces the semantic gap between professionals and the general public. Early proof-of-concept tests suggest there is a significant semantic gap between professional and public discourse (Trant, 2006a): the technical descriptions provided by professionals describe everything that is not in the picture, while untrained cataloguers provide terms that represent the perspective of general users' interests and rich descriptions of concept and context. Another social tagging project from the library community is the Library of Congress Photos on Flickr project. The Library of Congress (LC) has placed two sets of digitized photos on Flickr. The Commons (http://www.loc.gov/rr/print/flickr_pilot.html) has 1,600 color images from the Farm Security Administration/Office of War Information and over 1,500 images from the George Grantham Bain News Service. The goals of this project are to share photographs from the Library's collection, to gain a better understanding of how social tagging and community input could benefit both the Library and users of the collections, and to gain experience participating with the Web community interested in the kinds of materials in the Library's collections. They anticipate being able to learn about the ability of users to participate in the process of making constructive use of tags.

To build Library 2.0 services, some libraries also apply social tagging functionality directly in their library systems. For example, the Ann Arbor District Library (AADL) in Michigan released their social library catalog system, SOPAC, in January 2007. It integrated tagging into the library catalog: users can input tags and write reviews on the book catalog, and these appear both in catalog records and in a collection-wide tag cloud. However, it is not easy for every library to build its own social library catalog system. Existing social tagging systems such as del.icio.us and LibraryThing offer easier methods allowing libraries to add social tagging functionality to library services (Rethlefsen, 2007; Wenzler, 2007). Del.icio.us lets librarians and library users provide web resources to other patrons with tags for specific topics or timely subjects, which is especially beneficial to school and academic librarians and patrons serving or searching resources for particular courses or assignments. Some libraries, such as the Thunder Bay Public Library and the Nashville Public Library, have del.icio.us tag clouds integrated into their websites. LibraryThing lets libraries include LibraryThing tags and recommendations in their OPAC systems. It helps libraries enhance user access to their collections and enables users to maintain their interest in library collections by supporting tag browsing and book recommendations. Several libraries, such as the Danbury Library, have already implemented LibraryThing in their systems.

Involvement by the public is considered important, although the quality of the tags becomes an issue. Since users input tags without any restriction, terms used for tags may include misspelled terms, compound terms, single and plural forms, personal tags, and idiosyncratic tags (Macgregor and McCulloch, 2006). Although tags may be used only for their creators' benefit, some tags clearly have socially shared meaning as well as personal meaning (Guy and Tonkin, 2006). Shirky (2005) discusses how tags should be organized to produce meaningful forms. Some researchers state that tags can be useful in document classification or retrieval by providing additional information that is not in the content of the document (Noll and Meinel, 2007; Veres, 2006). Currently, most projects apply tags as raw keywords that allow libraries to include users' experience in resource description. How social tags can be filtered and structured to allow for greater user benefit is still an open question (Rethlefsen, 2007) and the subject of a number of ongoing studies. Regarding concerns about tag quality when used as metadata, the steve museum study (Trant, 2006a) showed positive results for the terms provided by non-specialists for museum collections. It demonstrated that tags assigned by general users might help reduce the semantic gap between professional discourse and the popular language of the museum visitor. The results also supported using tags and folksonomies for resource representation.


This study evaluates whether user-assigned tags can represent resources as author-provided keywords do and can be used to structure or categorize resources as keywords are. The comparison is based on scientific papers with both author-provided keywords and user-assigned tags. The source of data was determined by the need to access both author-provided keywords and user-assigned tags. Other tagged resources, such as bookmarks and images, often lack author-provided metadata, whereas many published papers include bibliographic information along with author-provided keywords. Data was collected first from the ACM digital library (http://portal.acm.org/dl.cfm), a digital library collection of Association for Computing Machinery publications, and then from Citeulike (http://www.citeulike.org/), a social tagging system for scientific papers. The ACM digital library was selected because it is one of the systems supported by Citeulike, providing a better chance of its papers being included in Citeulike. In addition, papers from ACM cover technical topics considered to be of interest to many social tagging system users, and have classifications and keywords assigned by their authors at submission. Among the many journals and proceedings in the ACM digital library, the proceedings of the conference on the World Wide Web (WWW) and the Joint Conference on Digital Libraries (JCDL) were selected for their wide coverage of topics, length of publication, and popularity and novelty of topics. A total of 2219 scientific papers were collected from the ACM digital library (1055 papers of WWW published from 2002 to 2008, 1164 papers of JCDL published from 1996 to 2008). Citeulike was then searched for these papers to find those existing in both systems, so that author-provided keywords and user-assigned tags could be compared. Of the 2219 papers from the ACM digital library, only 693 (31.23%) existed in Citeulike as of December 2008.
The evaluation in this study was conducted on this collection of 693 papers.

For each of the 693 papers, author-provided keywords, author-provided classification, and user-assigned tags were collected, along with the title and abstract representing the contents of the papers. Author-provided keywords include controlled vocabulary terms from the ACM vocabulary set as well as terms assigned directly by authors. To compare how representative keywords and tags are in describing the contents of papers, similarities to the title, the abstract, and both title and abstract were measured. Similarity was calculated over term vectors using cosine similarity. Moreover, since the papers are already classified by their authors based on the ACM Computing Classification System (CCS), keywords and tags were compared at different classification levels. The goal of this assessment was to determine the similarities or differences between keywords and tags at various levels of classification. In addition, using RefViz, a reference analysis tool developed by Thomson Scientific, the groups (clusters) of keywords and tags in the collection were observed. This part of the evaluation was used to predict the potential of tags for categorizing resources.


Comparison of Similarity

Cosine similarity is a measure of similarity between two term-frequency vectors. In this study, the vectors are built from the terms in the keyword set and the tag set. Six sets of vectors are compared: keywords versus the words in the title (kwtt), keywords versus the words in the abstract (kwab), keywords versus the words in the title and abstract (kwtx), tags versus the words in the title (tagtt), tags versus the words in the abstract (tagab), and tags versus the words in the title and abstract (tagtx). Overall, the similarity of the vectors is low. This is explained in large part by the small number of terms in the vectors. Table 1 shows the mean similarity and standard deviation for the compared vectors. The average standard deviation was similar for all comparisons. The highest similarity was for keywords versus title and abstract (M =.2970, SD =.0662), and the lowest similarity was for tags versus abstract (M =.2179, SD =.0571). Overall, keywords had better similarity to the document components than tags. Given that keywords are assigned by authors to describe the document, while tags may be assigned by readers for other purposes, e.g. as a personal note, this is not particularly surprising. A t-test confirms statistically reliable differences in mean similarity between keywords and tags with respect to title (t(1384) = 5.069, p <.001), abstract (t(1384) = 16.639, p <.001), and both title and abstract (t(1384) = 16.315, p <.001). An interesting difference between keywords and tags is that the similarity values for keywords are highest when compared with both title and abstract and become smaller when compared with abstract only, and then title only, whereas the similarity of tags versus title is relatively high compared to the other two cases. This suggests that users tend to input tags based on the title of a paper, as has been observed in other studies (Lin et al., 2006; Noll and Meinel, 2007).
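The cosine-similarity measure described above can be sketched as follows. This is a minimal illustration, not the study's actual code; the toy title, keywords, and tags are hypothetical examples.

```python
# Sketch of the cosine-similarity comparison between term-frequency
# vectors (e.g. keywords vs. title words). Sample data is hypothetical.
from collections import Counter
import math

def cosine_similarity(terms_a, terms_b):
    """Cosine similarity between two term lists treated as frequency vectors."""
    a, b = Counter(terms_a), Counter(terms_b)
    dot = sum(a[t] * b[t] for t in a)          # Counter returns 0 for missing terms
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical paper: kwtt (keywords vs. title) and tagtt (tags vs. title)
title = "social tagging for web resource classification".split()
keywords = ["social", "tagging", "classification", "folksonomy"]
tags = ["tagging", "toread", "web"]

kwtt = cosine_similarity(keywords, title)
tagtt = cosine_similarity(tags, title)
```

With these toy vectors the author keywords overlap more with the title than the tags do, mirroring the pattern reported in Table 1.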

Table 1. Means and Standard Deviation of Similarity
original image
Table 2. Quartiles of Similarity
original image
Figure 1.

Distribution of Similarity

Table 2 and Figure 1 show the range and distribution of similarity values for each comparison. One assumption of the t-test is homogeneity of variance. While Levene's test in Table 1 provides support for the homogeneity of variance, Figure 1 provides some evidence suggesting that the variance is not particularly homogeneous. The white box for each comparison shows the range of the 50% of cases closest to the mean (the interquartile range). That is, for the comparison of tags to title (tagtt), 50% of the similarity measures were between 0.1977 and 0.2848. The highest 25% of the similarity measures for the tagtt comparison fell between 0.2848 and 0.8169. As can easily be seen in Figure 1, the distribution of similarity values is wider for tags than for keywords, i.e. it has wider dispersion. In addition, the mid-half portion of the similarity values is more concentrated for keywords, i.e. the interquartile range shows a dense peak in the distribution graph. One possible conclusion is that author-provided keywords are more consistent in describing the content of the resources, while user-assigned tags show more variation, especially with respect to the title. One explanation for the high similarities between tags and the title may be that users assign tags based more on titles, and as a result the similarity of tags with the title shows wider variation. One interesting question that still needs to be addressed is whether filtering user tags for idiosyncratic tags or action tags (such as “***” or “toread” or “important”) might yield different results that would increase the similarity scores assigned to tags.
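The quartile and interquartile-range statistics behind Table 2 and Figure 1 can be sketched as below. This is a generic illustration with hypothetical similarity values, not the study's data; linear interpolation is assumed, as in most statistics packages.

```python
# Sketch of quartile / interquartile-range computation for a set of
# similarity values. The similarity list is hypothetical.
def quartiles(values):
    """Return (Q1, median, Q3) by linear interpolation between order statistics."""
    xs = sorted(values)
    def q(p):
        i = p * (len(xs) - 1)                 # fractional index into sorted data
        lo, hi = int(i), min(int(i) + 1, len(xs) - 1)
        return xs[lo] + (i - lo) * (xs[hi] - xs[lo])
    return q(0.25), q(0.5), q(0.75)

sims = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.80]  # hypothetical tagtt values
q1, med, q3 = quartiles(sims)
iqr = q3 - q1  # the width of the white box in a boxplot like Figure 1
```

A long upper tail (like the 0.80 value here) widens the overall range without moving the interquartile box much, which is the dispersion pattern described for tags above.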

Another comparison of similarity can be made by examining the paper collection based on the number of tags or keywords, i.e. whether similarity is higher or lower depending on how many keywords or tags were assigned. First, the collection was divided into two groups based on the number of tags per paper: the 50% of papers with the fewest tags in one group and the papers with the most tags in the other. The paper with the largest number of tags had 63 tags, and the dividing line was 4.5 tags. A similar process was carried out for keywords; the paper with the largest number of keywords had 29 keywords, and the dividing point was 6.5 keywords. From this, similarity and mean differences were examined (Figure 2 and Table 3). There were no significant differences in mean similarity for comparisons of keywords to title, abstract, or title and abstract when the number of keywords differed. For tags, the comparisons with abstract (tagab) (t(691) = −2.091, p =.037) and with both title and abstract (tagtx) (t(691) = −1.977, p =.048) showed significantly higher similarity when more tags were assigned. From this analysis of groups by number of tags or keywords, we make three tentative observations. First, in both cases, the more tags or keywords used, the higher the similarity measures (Table 3). Second, keywords may gain little additional similarity from larger numbers because of their topically concentrated nature. Finally, we would have expected large numbers of tags to include many that detract from similarity, i.e. idiosyncratic tags or action tags. Quintarelli (2005) and other researchers suggest that tags gain power through a mass of users. Figure 2-(a) would seem to support this conjecture: when there are more tags for a resource, the similarity to the content increases.
While this data is suggestive, the current analysis is not at all conclusive and more work is needed.
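The median-split comparison described above can be sketched as follows. The sample records and tag counts are hypothetical, and Welch's t statistic is used here as one reasonable two-sample test; the study's exact test configuration may differ.

```python
# Sketch: split papers into "few tags" / "many tags" groups at the median
# tag count, then compare mean similarities with Welch's t statistic.
# All sample records are hypothetical.
import math

def welch_t(xs, ys):
    """Welch's t statistic for two independent samples (unequal variances)."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

# (tag_count, tagab similarity) for hypothetical papers
papers = [(2, 0.18), (3, 0.20), (4, 0.19), (5, 0.24), (8, 0.26), (12, 0.28)]
counts = sorted(c for c, _ in papers)
median = (counts[len(counts) // 2 - 1] + counts[len(counts) // 2]) / 2

few = [s for c, s in papers if c < median]   # papers below the dividing line
many = [s for c, s in papers if c > median]  # papers above the dividing line
t = welch_t(many, few)  # positive t: more tags -> higher mean similarity
```

With this toy data the dividing line happens to fall at 4.5 tags, matching the split reported for the real collection, and the "many tags" group shows the higher mean similarity.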

Figure 2.

Similarity Comparison based on Number of Tags/Keywords

Table 3. Similarity Comparison by Number of Keywords or Tags
original image

Comparison by Classification

The ACM Computing Classification System (CCS) is a classification system for the computer science field with 4 hierarchical levels. Authors are requested to assign their work to one or more classes from the CCS when submitting it; therefore every paper in the collection is expected to be classified based on the CCS. There are 11 top-level categories in the CCS (Table 4), of which only 10 were represented in the data collection. It should be noted that multiple classes can be assigned to one paper. For example, 3 papers were classified under Topic J, Computer Applications. One of those 3 papers could also have been assigned to a second or third top-level category, e.g. Topic G, Mathematics of Computing. A paper might not be assigned to a second level, or it might be assigned to multiple second levels; similar logic applies to the third and fourth level categories. From the data set, 686 papers were used for this part of the comparison analysis, excluding 7 papers without a classification assigned by their authors.

Table 4. Topics of the CCS Top Levels and Count of Paper Distribution
original image

Table 5 shows the similarity between keywords or tags and the titles (kwtt, tagtt), abstracts (kwab, tagab), and titles and abstracts (kwtx, tagtx) when the articles are partitioned into four sets. Group one is any paper classified in one or more first-level topic categories with no more refined (2nd level) classification. Group two includes articles classified in one or more first-level categories with one or more second-level categories and no more refined classification. Group three consists of articles classified in one or more first- and/or second-level categories with one or more third-level categories and no more refined classification. Group four consists of articles classified in one or more first-, second-, and third-level categories, and one or more fourth-level categories. After partitioning, there were no papers assigned to Group one, and 27, 205, and 454 papers were assigned to Groups two, three, and four respectively. A t-test performed on keywords and tags for each group showed no significant mean difference between keyword and tag similarity in the higher classes (general concepts). For Groups 3 and 4, the mean differences in similarity between keywords and tags with respect to title, abstract, and both title and abstract were statistically significant, showing that keywords are better at representing specific concepts in documents. However, when only tags were compared among classification-level groups, Groups three and four represented papers significantly better than Group two in all comparisons (Table 5 and Figure 3). For the comparison of tags with titles, a t-test confirmed that the mean differences in similarity between Group 2 and Group 3, and between Group 2 and Group 4, were significant (t(230) = −2.184, p =.030; and t(38.336) = −3.920, p <.001 respectively).
For the comparison of tags with abstracts, and with title + abstract, the test results were similarly significant (Group 2 versus Group 3 for tagab: t(230) = −2.858, p =.005; for tagtx: t(230) = −3.026, p =.003; Group 2 versus Group 4 for tagab: t(479) = −2.822, p =.005; and for tagtx: t(479) = −2.981, p =.003). The comparison by classification showed that, compared with keywords, tags describe what a resource is about at more general levels of concept. Although tags may not describe resources with specific concepts as well or as efficiently as keywords, a statistical test confirmed that tags represent specific concepts better than they represent general ones.
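The partitioning into Groups one through four can be sketched as grouping each paper by the deepest CCS level among its assigned class codes. The codes below use CCS-style dotted notation, but the sample papers (and the four-segment code used to stand in for a fourth-level descriptor) are hypothetical.

```python
# Sketch: partition papers by the deepest classification level among their
# CCS codes (e.g. "H.3" is level 2, "H.3.3" is level 3). Sample papers
# are hypothetical.
def deepest_level(ccs_codes):
    """Depth of the most specific code: 'H.3' -> 2, 'H.3.3' -> 3."""
    return max(len(code.split(".")) for code in ccs_codes)

papers = {
    "paper-a": ["H.3", "I.2"],         # second level only -> Group 2
    "paper-b": ["H.3.3"],              # third level       -> Group 3
    "paper-c": ["H.3.3", "H.5.4.1"],   # fourth level      -> Group 4 (code is illustrative)
}

groups = {}
for pid, codes in papers.items():
    groups.setdefault(deepest_level(codes), []).append(pid)
```

A paper with codes at several depths falls into the group of its most refined assignment, matching the "no more refined classification" rule used to define the groups above.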

Table 5. Similarity Comparison by Classification Levels
original image
Figure 3.

Average Similarity of CCS Levels

Comparison by Clustering

While we find that tags relate less strongly than keywords to existing classification systems, tags may still be useful in generating classification systems. To examine this, papers were clustered by keywords and by tags. A bibliography analysis tool called RefViz was used. RefViz includes functions for text analysis and visualization and was designed to provide reference librarians with overviews of collections and reveal trends in them. It creates groups of related topics from input references, mainly based on input text such as title, abstract, and notes. For this study, only the title and either keywords or tags were input, to observe their effect on grouping related resources. Figure 4 is the visualization of groups by conceptual relatedness, a galaxy view, provided by RefViz. Small dots represent individual reference items (papers) and file icons represent topic groups.

Figure 4.

Galaxy View of Tags and Keywords

Clustering by tags (Figure 4-a) shows a wider distribution of papers, whereas keywords (Figure 4-b) distribute papers in a more concentrated way, although the overall distribution of clusters is similar in both cases. Because keywords from authors are more controlled and focused, the groups are very clearly divided: related papers are positioned very closely, and the distance between papers distinguishes groups very well. On the other hand, tags distribute papers widely and distinctions between groups are hard to recognize. Because the terms assigned by users are more general, the papers are spread with less distinguishable distances. This is also observed in the major topic terms provided by RefViz for each group (Tables 6 and 7). Terms for keyword groups tend to overlap more among groups, as more controlled vocabularies are used. They may be better at describing the domains of interest, but it is not easy to tell the exact topic and specific focus of each group. Terms for tag groups appear to be more varied. These may be less effective in grouping and fitting papers into certain domains, but it is easier to predict the content of the resources from the major terms.

Tables 6 and 7 show the groups that match by topic. For each tag group, papers were searched in the keyword groups to find the best-matching keyword group, and the same was done for keyword groups. The matching pattern of tag groups to keyword groups is better distributed than that of keyword groups to tag groups. A possible explanation is the distribution of papers in each group (Figure 5). Some tag groups vary greatly in size. For example, tag group 14 includes 208 papers (30% of the collection), so many keyword groups were matched to tag group 14. In addition, as shown in Figure 4-(a), the wide distribution of papers among tag groups tends to overlap with various topics of keyword groups. This results in less dominance of any one group in matching keyword groups (Table 6). Having more match coverage and clearly dominant match groups, keywords seem to cluster papers more efficiently (Table 7). However, as discussed, the topic terms of tag groups seem to be more representative in describing their groups. While preliminary, the current analysis suggests that having both keywords and tags could help to group/cluster a resource collection more efficiently.
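The group-matching step described above can be sketched as follows: each group from one clustering is matched to the group in the other clustering with which it shares the most papers. The group names and memberships are hypothetical.

```python
# Sketch: match each tag group to the keyword group sharing the most
# papers (and vice versa by swapping the arguments). Memberships are
# hypothetical paper IDs.
def best_matches(groups_a, groups_b):
    """For each group in groups_a, find the groups_b group with the largest overlap."""
    matches = {}
    for name_a, members_a in groups_a.items():
        overlaps = {name_b: len(members_a & members_b)
                    for name_b, members_b in groups_b.items()}
        matches[name_a] = max(overlaps, key=overlaps.get)
    return matches

tag_groups = {"t1": {1, 2, 3}, "t14": {4, 5, 6, 7, 8}}   # t14 is the oversized group
kw_groups = {"k1": {1, 2}, "k2": {3, 4, 5, 6}, "k3": {7, 8}}

matches = best_matches(tag_groups, kw_groups)
```

Running the match in the other direction (`best_matches(kw_groups, tag_groups)`) shows how an oversized group like tag group 14 can absorb many keyword groups, producing the skewed matching pattern discussed above.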

Table 6. Tag Groups - Major Topic Terms and Matching Keyword Groups
original image
Table 7. Keyword Groups - Major Topic Terms and Matching Tag Groups
original image
Figure 5.

Number of Papers in Topic Groups

Discussion and Conclusion

Researchers are exploring the potential of tags in categorizing web resources and in solving problems of controlled vocabulary. The question is how to structure tags to benefit from them. The potential role of tags versus controlled vocabulary has not been adequately tested. This study explores the comparative accuracy of tags versus controlled vocabulary as a method of describing and organizing web resources.

Data was gathered from scientific papers associated with both author-assigned keywords and user-assigned tags. Keywords and tags were analyzed by measuring their cosine similarity with the title, the abstract, and both title and abstract of papers. The similarity comparison showed that, overall, controlled vocabulary represents the content better than tags. The distribution of similarity showed that keywords were more focused on a topic, whereas tags show more variation in description. Possible improvement of tag similarity through filtering was identified as a next step toward better use of tags.

We observed that the title of a paper appeared to be the main source users draw on when determining which tags to assign. The impact of the number of tags and keywords was also examined: both tags and keywords represent the content of a paper better when more of them are assigned. To get a better view of the content representativeness of tags, similarity was compared across different levels of concepts from the existing classification system. It was shown that tags represent general concepts as well as controlled vocabularies do, and are better at representing specific concepts than general ones. To better analyze whether tags can be used effectively for organizing resources, a visualization tool was used to cluster the data collection. When clustered based on keywords, the groups are more clearly distinguished. With tags, the distance between resources is smaller, so it is somewhat difficult to determine the groups. However, the distribution of groups and the major terms of groups showed positive potential for categorizing resources with tags. The major terms of tag groups are more representative in describing their groups, with a more varied selection of terms compared to the limited terms of keyword groups, providing information that is not covered by controlled vocabularies. This suggests that tags, especially when keywords and tags are used together, could help to group a resource collection more efficiently. The findings of the study, especially from the comparisons by classification (tags represent specific concepts better) and by clustering (tags describe topic or concept clusters with more specific and varied terms), showed that tags may be used for resource categorization. That is, while the categorization is not as good, the lower cost and the variation in describing resource contents may warrant consideration of tags as a categorization mechanism.

Controlled vocabulary tends to represent content more accurately and stably. As noted in the literature, the weakness of tags derives from their assignment by users who are not trained and who often use counterproductive (misspelled, personal, idiosyncratic) tags. At the same time, given the degree of similarity between keywords and tags in our results, the lower cost of user participation in generating metadata is worth considering. Further analysis of tags for categorizing resources, such as filtering problematic tags (i.e. misspelled tags, synonymous terms, idiosyncratic tags), forming hierarchical clusters of tags, or assigning tags to an existing classification system and/or subject headings, may provide better insight into the potential of tag usage.