The copyright of this document remains with the authors and/or their institutions. By submitting their papers to the ASIST 2010 Annual Meeting, the authors hereby grant a non-exclusive license for ASIST to post and disseminate their papers on its web site and any other electronic media. Contact the authors directly for any use outside of downloading and referencing this paper.
Collecting legacy corpora from social science research for text mining evaluation†
Article first published online: 3 FEB 2011
Copyright © 2010 by American Society for Information Science and Technology
Proceedings of the American Society for Information Science and Technology
Volume 47, Issue 1, pages 1–2, November/December 2010
How to Cite
Yu, B. and Ku, M.-C. (2010), Collecting legacy corpora from social science research for text mining evaluation. Proc. Am. Soc. Info. Sci. Tech., 47: 1–2. doi: 10.1002/meet.14504701368
- Issue published online: 3 FEB 2011
In this poster we describe a pilot study of searching social science literature for legacy corpora to evaluate text mining algorithms. The emerging field of computational social science demands large amounts of social science data to train and evaluate computational models. We argue that legacy corpora annotated by social science researchers through traditional Qualitative Data Analysis (QDA) are ideal data sets for evaluating text mining methods such as text categorization and clustering. As a pilot study, we searched leading communication journals for articles that involve content analysis and discourse analysis, and then contacted the authors regarding the availability of the annotated texts. Regrettably, nearly all of the corpora that we found had not been adequately maintained, and many were no longer available, even though they were less than ten years old. This situation calls for more effort to maintain and reuse legacy social science data for future computational social science research.