Introduction

With the advent of Web 2.0, one noticeable change in people's information environment is that certain tasks of information seeking and management, traditionally considered ‘personal,’ now take place in open spaces where people can share their knowledge and experiences. A prominent example of Web 2.0 services, variously called social bookmarking, social tagging, or a folksonomy, demonstrates this change.

We note the dual nature of social bookmarking, 1) as a personal bookmark management tool, where individual users collect and organize information resources for their own interests, and 2) as social software, where individual activities are accumulated for the benefit of the community as a whole. The fact that people's activities are visible to one another is in and of itself an important ‘social’ feature, in that people can discover, often serendipitously, useful information by following the traces other people have left in the public space.

As social bookmarking gains popularity, the potential value of the aggregated collection of information and the human judgments involved therein attracts great attention from the research community. Whereas early studies of this phenomenon focused on finding regularities in user activities or in tag distribution in an attempt to understand the underlying dynamics (Golder & Huberman, 2006), more recent studies have investigated various ways of harnessing collective social knowledge from social bookmarking data, beyond the current social utility of open exploration. Among others, the possibility of making personalized recommendations based on past bookmarking or tagging activities is a promising topic (Wu et al., 2006; Jiao and Cao, 2007; Jiao et al., 2007). Another line of research addresses the issue of constructing a semantic tool based on user-generated metadata, that is, tags (Begelman et al., 2006; Halpin et al., 2007).

All the above and other approaches to exploring/exploiting social bookmarking data presuppose a certain level of accumulation and overlap of activities with regard to the entity of interest (user, information resource, or tag). For instance, recommendation or filtering of information can only be effective when users have a certain level of shared activities with others so that people with similar interests can be identified. Similarly, deriving relationships among tags requires co-occurrences in a large number of cases.

However, there is little empirical research validating this key assumption of accumulation and overlap of activities in social bookmarking. As stated above, social bookmarking has the dual nature of being both a personal and a social information tool. Considering the potentially unlimited range of resources that could be bookmarked by a large variety of users, it might be the case that a major portion of the information space of a social bookmarking site consists of resources that were bookmarked only once or a few times. In other words, individual users may not share resources and interests. On the other hand, it might be that users of the site tend to have many resources in common, partly because they can see resources other people have found valuable and incorporate those resources into their own collections. In that case, the overall level of accumulation and overlap would be high. This study aims to examine this phenomenon. More specifically, bookmarking activities in a popular social bookmarking site, del.icio.us, were analyzed to assess the level of accumulation and overlap across resources (resource-centric view) and users (user-centric view).

I believe that the level of accumulation and overlap is a reasonable indicator of the level of shared interests within the community. Gauging the current level of shared interests would be valuable 1) in understanding the dual characteristics of social bookmarking, and 2) in validating basic assumptions for designing applications or services, such as a recommender system, based on bookmarking data.

Methodology

In order to assess the overall level of accumulation and overlap in del.icio.us, I collected and analyzed bookmark postings. Although there are other activities users of this site perform, including browsing other people's collections, the main activity is a bookmark posting. A user can post a bookmark to include it in their own collection of bookmarks, and optionally assign keywords, called tags, of their choice. The URL of the resource bookmarked, the user who posts the bookmark, and tags assigned to the resource are the three main entities involved in a bookmark posting. Among those three entities, the current study focuses on URLs and users.

Given the huge scale of the information space being studied, it was important to find a way to capture both the breadth and the depth of the space. To this end, two complementary methods of data collection were used. First, using the RSS feed feature provided by del.icio.us, users' most recent bookmarking activities were collected during January 2008. In total, 350,000 postings were collected with 310,172 distinct URLs saved by 120,928 distinct users. This dataset, henceforth referred to as the recent dataset, represents the current breadth of the bookmarking activities. Second, in order to get data accumulated over time, two additional datasets were collected: the URL history dataset and the user history dataset. The URL history dataset captures the entire set of postings associated with 10,000 sample URLs, and the user history dataset contains the entire set of postings ever made by each of 1,500 sample users. Sample URLs and users were randomly selected from the recent dataset. A dedicated crawler was developed to retrieve the relevant pages from del.icio.us and parse them. The final URL history dataset has 1,654,005 postings (of 10,000 sample URLs) made by 470,456 users, and the final user history dataset has 1,077,910 postings of 892,634 distinct URLs (made by 1,500 sample users).

Having these two history datasets plus the recent dataset allows us to look at the question of accumulation and overlap from two combined views: a resource-centric view and a user-centric view. From the resource-centric view, the proportion of resources (represented by URLs) shared by multiple users can be examined. From the user-centric view, on the other hand, the number of users sharing one or more resources with other users can be examined.
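
As a minimal sketch of how the two views could be computed, assume each posting has been reduced to a (user, URL) pair. The function name `overlap_summary` and the input layout are illustrative assumptions, not the code used in this study.

```python
from collections import defaultdict

def overlap_summary(postings):
    """Summarize overlap for an iterable of (user, url) pairs (hypothetical layout)."""
    users_per_url = defaultdict(set)
    urls_per_user = defaultdict(set)
    for user, url in postings:
        users_per_url[url].add(user)
        urls_per_user[user].add(url)

    # Resource-centric view: URLs bookmarked by more than one distinct user.
    shared_urls = sum(1 for users in users_per_url.values() if len(users) > 1)

    # User-centric view: users holding at least one URL that someone else also holds.
    sharing_users = sum(
        1
        for urls in urls_per_user.values()
        if any(len(users_per_url[u]) > 1 for u in urls)
    )

    return {
        "distinct_urls": len(users_per_url),
        "shared_urls": shared_urls,
        "distinct_users": len(urls_per_user),
        "sharing_users": sharing_users,
    }
```

Using sets means a user who posts the same URL twice is counted only once for that URL, which matches the idea of a resource being shared by multiple users.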

Results

1) URLs with multiple postings (the resource-centric view)

Table 1 shows the number of distinct URLs accumulated as the number of postings increases over time in the recent dataset, and the proportion of URLs whose frequency (n) in the dataset is greater than 1 (URLs that have been posted by more than one user). The last row of Table 1 shows the same statistics for the URL history dataset.

Table 1. Distinct URLs and the proportion of overlaps

Dataset   Postings    Distinct URLs   URLs with n > 1 (No.)   URLs with n > 1 (%)
Recent    50,000      48,660          1,233                   2.53%
          100,000     95,003          4,051                   4.26%
          150,000     139,937         7,515                   5.37%
          200,000     183,522         11,415                  6.22%
          250,000     226,005         15,632                  6.92%
          300,000     268,589         19,649                  7.32%
          350,000     310,172         24,060                  7.76%
History   1,654,005   10,000          6,178                   61.78%
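
The rows for the recent dataset in Table 1 are cumulative checkpoints taken every 50,000 postings. The sketch below shows one way such rows could be produced. It assumes the postings are available in chronological order and that no user posts the same URL twice, so that a URL's posting count equals its number of distinct users. The function name `cumulative_overlap` is hypothetical.

```python
from collections import Counter

def cumulative_overlap(urls, step):
    """Return (postings, distinct URLs, URLs with n > 1, percent) at each checkpoint.

    urls: the URL of each posting, in chronological order (assumed input layout).
    step: number of postings between checkpoints (50,000 in Table 1).
    """
    counts = Counter()
    repeated = 0  # number of URLs seen at least twice so far
    rows = []
    for i, url in enumerate(urls, start=1):
        counts[url] += 1
        if counts[url] == 2:
            repeated += 1  # this URL has just become a shared one
        if i % step == 0:
            distinct = len(counts)
            percent = round(100 * repeated / distinct, 2)
            rows.append((i, distinct, repeated, percent))
    return rows
```

The percentage column is simply the count of repeated URLs divided by the count of distinct URLs; for the final recent row, 24,060 / 310,172 gives 7.76%.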

Table 1 shows that the number of distinct URLs increases fast while little accumulation per URL occurs. In the recent dataset, the overwhelming majority of URLs were posted only once. Of the 310,172 distinct URLs in the 350,000 postings made by 120,928 different users, only 24,060 were posted by two or more different users. However, examination of the URL history dataset, which contains the entire history of the random sample of 10,000 URLs, revealed that the majority of those URLs (61.78%) were shared by multiple users. Figure 1 shows the cumulative frequency of URLs in the URL history dataset, by the number of postings made by different users.

Figure 1. Cumulative Frequency of URLs

The large portion of distinct URLs in the recent dataset indicates the diversity of user interests and the broad range of resources being bookmarked in this site. On the other hand, the huge increase of the proportion of repeatedly posted URLs in the URL history dataset demonstrates the effect of accumulation over time.

2) Users sharing bookmarks (the user-centric view)

Table 2 shows the growth of distinct users in the recent dataset and their accumulated activities, in terms of the proportion of users with multiple postings.

Table 2. Distinct users and users with multiple postings

Dataset   Postings    Distinct users   Users with m > 1 (No.)   Users with m > 1 (%)
Recent    50,000      28,582           8,517                    29.80%
          100,000     50,254           18,222                   36.26%
          150,000     67,844           27,323                   40.27%
          200,000     82,455           35,657                   43.24%
          250,000     94,896           43,600                   45.95%
          300,000     108,849          52,091                   47.86%
          350,000     120,928          59,963                   49.59%
History   1,077,910   1,500            1,484                    98.94%

Table 1 and Table 2 together show that, not surprisingly, the number of distinct URLs grows much faster than the number of distinct users. Even within the short window in which the recent data were collected, users frequently came back to the system and added new resources. On average a user made 3.52 postings during January 2008. If we look at the entire history of the 1,500 sample users, a user has, on average, 720.05 bookmarks in their collection.

In order to look at the level of overlap across users, as well as at their accumulation of activities, the total number of bookmarks a user has, and the number of bookmarks that he/she shares with other users were calculated and compared. For instance, if a user has posted 10 bookmarks, each of those 10 bookmarks was checked to see whether other users had also posted it. Figures 2 and 3 show the scatter plot of users, of the recent dataset and the user history dataset respectively, by their number of bookmarks and the number of shared bookmarks.

Figure 2. Scatter plot of users in the recent dataset

Figure 3. Scatter plot of users in the user history dataset
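
The two coordinates plotted for each user in Figures 2 and 3 can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: it assumes all postings to be compared are available as (user, URL) pairs, whereas how other users' postings were looked up for the user history dataset is not specified here. The function name `shared_bookmark_counts` is hypothetical.

```python
from collections import defaultdict

def shared_bookmark_counts(postings):
    """Map each user to (total bookmarks, bookmarks also posted by another user)."""
    users_per_url = defaultdict(set)
    urls_per_user = defaultdict(set)
    for user, url in postings:
        users_per_url[url].add(user)
        urls_per_user[user].add(url)

    points = {}
    for user, urls in urls_per_user.items():
        # A bookmark is shared when at least one other user has posted the same URL.
        shared = sum(1 for u in urls if len(users_per_url[u]) > 1)
        points[user] = (len(urls), shared)
    return points
```

Each resulting pair gives one point of the scatter plot: the total number of bookmarks on one axis and the number of shared bookmarks on the other.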

Note that the points representing users are relatively scattered in Figure 2, while the points in Figure 3 tend to converge on a nearly straight line, except for a small number of outliers. The apparently different patterns shown in Figures 2 and 3 suggest that, while many users have rather unique selections, with a large number of bookmarks not shared with others, the proportion of shared bookmarks grows over time in most cases. In addition, Figure 3 suggests that the proportion of shared bookmarks not only increases but also more or less stabilizes across users over time. It should be mentioned, however, that the number of users in the user history dataset (Figure 3) is considerably smaller than in the recent dataset, and therefore the different patterns may be ascribed in part to the size difference.

Discussion and Conclusion

Social bookmarking produces a new information environment where users are actively involved, as a part of their own information management strategy, in the accumulation of collective knowledge. While there have been discussions on how to harness this collective knowledge and to build applications or services using it, little attempt has been made to assess the level of accumulation and overlap of user activities, which is a prerequisite for such applications or services. This study tried to address this gap. I also posit that the extent to which bookmarking activities are accumulated and overlapped across resources and users represents the level of shared interests within the community.

In this paper, we presented the preliminary result of an effort to assess the current level of shared interests, in terms of accumulation and overlap of bookmarking activities in del.icio.us. Del.icio.us is known as the first and one of the most successful instances of social bookmarking. With its relatively long history and broad user base, the site can serve as a showcase of collective knowledge.

The result indicates considerable increases in overlap, both from the resource-centric view and from the user-centric view, as bookmarking activities accumulate over time. While a broad range of diverse interests was observed in recent activities, the community has also built a considerable amount of shared interests.

Literature

  • Begelman, G., Keller, P., & Smadja, F. (2006, May 22–26). Automated Tag Clustering: Improving search and exploration in the tag space. Paper presented at the 15th International World Wide Web Conference (WWW2006), Edinburgh, UK.
  • Golder, S., & Huberman, B. A. (2006). Usage patterns of collaborative tagging systems. Journal of Information Science, 32 (2), 198–208.
  • Halpin, H., Robu, V., & Shepherd, H. (2007). The complex dynamics of collaborative tagging. In Proceedings of the 16th international conference on World Wide Web (pp. 211–220). Banff, Alberta, Canada: ACM Press.
  • Jiao, Y., & Cao, G. (2007). A Collaborative Tagging System for Personalized Recommendation in B2C Electronic Commerce. Paper presented at WiCom 2007, International Conference on the Wireless Communications, Networking and Mobile Computing.
  • Wu, H., Zubair, M., & Maly, K. (2006). Harvesting social knowledge from folksonomies. In Proceedings of the seventeenth conference on Hypertext and hypermedia (pp. 111–114). Odense, Denmark: ACM.