Application of structural equation modelling in exploring tag patterns: A pilot study

Authors


Abstract

This pilot study examines the semantic structure of tag space in Library and Information Science (LIS) using confirmatory factor analysis of social tags from delicious.com. It is one of the few studies to employ structural equation modelling (SEM) to investigate the dimensions of Web spaces based on social tagging data. The data consist of posts for 34 LIS-related websites bookmarked on delicious.com. The collected data were analysed using three statistical techniques: correlation analysis, exploratory factor analysis and confirmatory factor analysis within an SEM framework, the last used to confirm the structure of the social tagging space. Preliminary analysis indicates that the semantic structure of the tagging data mirrors the connections present in the real world. These methodologies can be used to identify the strength of connections between related tagged websites.

INTRODUCTION

Social tagging has become an important topic in informetrics, and many researchers have explored the distribution and patterns of user-generated tags in different environments. Previous studies have contributed greatly to the mathematical modelling of tagging. However, little research has applied multivariate statistical methods, especially factor analysis and structural equation modelling. This paper presents the preliminary results of an exploration of the semantic structure of LIS tag space using delicious.com.

PREVIOUS RESEARCH

Capocci and Caldarelli (2007) explored the semantic patterns among tags and suggested using tripartite graphs and clustering coefficients in analysing CiteULike, a social bookmarking system. Kipp (2009) explored the unique distribution patterns of social tags from popular collaborative bookmarking sites. Cattuto et al. (2007) applied network analysis to examine the co-occurrence of social tags from online bookmarking systems. Cattuto et al. (2008) introduced the notion of resource distance based on the collective tagging activity of users to build a weighted network of resources and semantic relations. However, few studies have used multivariate statistical techniques for modelling semantic structures among websites using tag data.

RESEARCH QUESTIONS

  • Which sites are closely related to each other based on social tag information?

  • What dimensions can be identified in the LIS field Web space based on social tags?

  • How do social tags represent the structure of Web space in the field of LIS?

METHODOLOGY

We collected tagging data from delicious.com for a set of 34 LIS-related websites. In consultation with domain experts, we selected prominent schools, organisations and public libraries from the United States.

To keep the dataset parsimonious and to reduce skewness and kurtosis, we retained the 389 tag terms that occurred commonly across the 34 sites. The original dataset was then transformed using square roots and logarithms to control for skewness. We analysed the data using three statistical methods: correlation analysis, exploratory factor analysis and confirmatory factor analysis.
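The effect of the two transformations on skewness can be illustrated with a minimal sketch. The tag counts below are made up for illustration, and the use of log(1 + x) rather than a plain logarithm is an assumption, chosen here only so that zero counts stay defined; the source does not state which logarithmic form was applied.

```python
import numpy as np

def skewness(values):
    """Skewness as the third standardised moment (population form)."""
    values = np.asarray(values, dtype=float)
    centred = values - values.mean()
    return float((centred ** 3).mean() / centred.std() ** 3)

# Hypothetical tag counts for one site: many rare tags, a few very frequent ones.
counts = np.array([1, 1, 1, 2, 2, 3, 5, 8, 20, 100], dtype=float)

raw_skew = skewness(counts)
sqrt_skew = skewness(np.sqrt(counts))
log_skew = skewness(np.log1p(counts))  # assumption: log(1 + x), not log(x)

print(f"raw: {raw_skew:.2f}  sqrt: {sqrt_skew:.2f}  log1p: {log_skew:.2f}")
```

On right-skewed count data such as this, both transformations pull in the long tail, with the logarithm doing so more strongly than the square root.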

RESULTS

Correlation Analysis

To examine which sites are closely related to each other, Pearson r coefficients were computed between all possible pairs of the 34 selected websites. The correlation coefficients were higher between websites belonging to the same category or type than between websites in different categories. That is, the tag frequency distributions of same-category sites are more similar to each other than to those of sites in other categories.
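A pairwise comparison of this kind might be sketched as follows. The site codes and frequencies are hypothetical, and the category of a site is assumed to be readable from the prefix of its code; the sketch only shows the shape of the computation, not the study's data.

```python
import numpy as np

# Hypothetical transformed tag frequencies: rows are tag terms, columns are sites.
sites = ["s_a", "s_b", "p_a", "p_b"]  # made-up codes, two sites per category
freq = np.array([
    [9.0, 8.0, 1.0, 0.0],
    [7.0, 9.0, 0.0, 1.0],
    [8.0, 7.0, 2.0, 1.0],
    [1.0, 0.0, 9.0, 8.0],
    [0.0, 2.0, 7.0, 9.0],
    [1.0, 1.0, 8.0, 7.0],
])

# rowvar=False tells NumPy that each column (site) is a variable.
r = np.corrcoef(freq, rowvar=False)

# Split the pairwise coefficients into same-category and cross-category groups.
category = [code.split("_")[0] for code in sites]
same, different = [], []
for i in range(len(sites)):
    for j in range(i + 1, len(sites)):
        group = same if category[i] == category[j] else different
        group.append(r[i, j])

mean_same = float(np.mean(same))
mean_different = float(np.mean(different))
print(f"same category: {mean_same:.2f}  different category: {mean_different:.2f}")
```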

Exploratory Factor Analysis

To determine common dimensions in the dataset, an exploratory factor analysis was conducted. Six factors were extracted using an eigenvalue cut-off of 0.95, and together they accounted for 64.99% of the total variance. Table 1 shows that sites belonging to the same category load on the same factor. The six identified dimensions are: 12 LIS program school sites (code begins with “s_”); 5 public library sites (code begins with “p_”); 5 organisation sites (code begins with “o_”); 5 academic library sites (code begins with “a_”); 3 information school sites (code begins with “i_”); and 4 special library and library corporation sites (code begins with “sp_”).

Table 1. Results of exploratory factor analysis

               Dimensions
Code           LIS      Public     Organisations  Academic   i-schools  Special libraries/
               schools  libraries                 libraries  (not LIS)  library corporation
s_uiuc         .841
s_fsu          .834
s_indiana      .831
s_sjsu         .766
s_unc          .757
s_kent         .740
s_drexel       .729
s_texas        .714
s_washington   .698
s_ucla         .650
s_michigan     .602
s_syracuse     .533
p_chicago               .798
p_boston                .796
p_seattle               .753
p_los angeles           .710
p_new york              .616
o_sla                              .865
o_aall                             .784
o_asis                             .729
o_ifla                             .722
o_ala                              .719
a_yale                                            .772
a_cornell                                         .704
a_stanford                                        .699
a_chicago                                         .681
a_harvard                                         .621
s_psu                                                        .739
s_uci                                                        .699
s_ucberkerley                                                .674
sp_loc                                                                  .719
sp_worldcat                                                             .637
sp_nih                                                                  .631
sp_oclc                                                                 .564
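The retention step of the exploratory analysis can be sketched in a few lines. This is an illustration under stated assumptions: it uses a principal-component-style eigendecomposition of a correlation matrix, whereas the source does not name the extraction or rotation method, and the small correlation matrix is made up. Only the 0.95 eigenvalue cut-off comes from the text.

```python
import numpy as np

def retained_factors(corr, min_eigenvalue=0.95):
    """Return the eigenvalues at or above the cut-off and their share of variance."""
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # eigvalsh is ascending; reverse it
    kept = eigenvalues[eigenvalues >= min_eigenvalue]
    # For a correlation matrix the eigenvalues sum to the number of variables.
    explained = kept.sum() / eigenvalues.sum()
    return kept, float(explained)

# Hypothetical correlation matrix for four sites forming two clear blocks.
corr = np.array([
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.7],
    [0.1, 0.1, 0.7, 1.0],
])

kept, explained = retained_factors(corr)
print(f"factors kept: {len(kept)}  variance explained: {explained:.1%}")
```

With the toy matrix, two factors pass the cut-off, one per block of mutually correlated sites, which is the same pattern Table 1 reports at a larger scale.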

Confirmatory Factor Analysis

Finally, a confirmatory factor analysis was carried out using SEM. To achieve a good model fit, six sites that showed relatively low factor loadings in the exploratory factor analysis were excluded. To fit the model without assuming normality, and so to accommodate skewness in the data, the unweighted least squares method was adopted. The resulting model exhibits an adequate goodness of fit. Figure 1 shows the results of the SEM analysis. All the obtained factor loadings and correlation coefficients were statistically significant at the 0.05 alpha level. The correlation coefficients amongst “public library”, “academic library”, and “special library and library corporation” were relatively high, as were those between the school constructs.

Figure 1. Semantic structure of LIS related websites derived from tagging data
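For readers unfamiliar with unweighted least squares estimation, the quantity it minimises can be sketched as below. The loadings, factor correlation and two-factor layout are hypothetical and far smaller than the model in Figure 1; an SEM package would search over the free parameters to minimise this discrepancy, a step the sketch leaves out.

```python
import numpy as np

def implied_correlation(loadings, factor_corr, uniqueness):
    """Model-implied matrix for a CFA: Lambda @ Phi @ Lambda.T + diag(Theta)."""
    return loadings @ factor_corr @ loadings.T + np.diag(uniqueness)

def uls_discrepancy(sample, implied):
    """ULS fit function: half the sum of squared residuals between the matrices."""
    residual = sample - implied
    return 0.5 * float(np.trace(residual @ residual))

# Hypothetical standardised model: two factors, two sites loading on each.
loadings = np.array([[0.9, 0.0], [0.8, 0.0], [0.0, 0.7], [0.0, 0.6]])
factor_corr = np.array([[1.0, 0.3], [0.3, 1.0]])
uniqueness = 1.0 - (loadings ** 2).sum(axis=1)  # keeps the implied diagonal at 1

implied = implied_correlation(loadings, factor_corr, uniqueness)

# A sample matrix equal to the implied one fits perfectly ...
perfect_fit = uls_discrepancy(implied, implied)

# ... while perturbing one correlation by 0.1 raises the discrepancy.
noisy = implied.copy()
noisy[0, 1] = noisy[1, 0] = implied[0, 1] + 0.1
worse_fit = uls_discrepancy(noisy, implied)
print(perfect_fit, worse_fit)
```

Because the function depends only on the residuals, not on a likelihood, it makes no distributional assumption about the underlying tag frequencies.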

DISCUSSION AND CONCLUSIONS

This pilot study proposed new statistical approaches for studying tag structure and answered the three research questions empirically. First, LIS websites that belong to the same category in the real world show similar patterns of social tags in tag space. Second, social tags reveal six dimensions in the LIS field: LIS schools, iSchools, LIS-related organisations, academic libraries, public libraries, and special libraries and library corporations. Third, social tags confirmed the real-world structure of the field of LIS; that is, the structure of the tagging data shows connections similar to those present in the real world. These methodologies can be used to identify the strength of connections between related websites or related items, which could be useful in search and retrieval and could inform the design of search systems. Tagging data could also be used for competitive intelligence, where organisations could identify closely linked organisations or competitors by the strength of the correlations.
