Comparing Methods for Single Paragraph Similarity Analysis
Version of Record online: 18 AUG 2010
Copyright © 2010 Cognitive Science Society, Inc.
Topics in Cognitive Science
Volume 3, Issue 1, pages 92–122, January 2011
How to Cite
Stone, B., Dennis, S. and Kwantes, P. J. (2011), Comparing Methods for Single Paragraph Similarity Analysis. Topics in Cognitive Science, 3: 92–122. doi: 10.1111/j.1756-8765.2010.01108.x
- Issue online: 10 JAN 2011
- Received 27 February 2009; received in revised form 6 July 2009; accepted 8 September 2009
Keywords
- Semantic models
- Paragraph similarity
- Corpus preprocessing
- Corpus construction
- Wikipedia corpora
The focus of this paper is twofold. First, similarities generated from six semantic models were compared to human ratings of paragraph similarity on two datasets: 23 World Entertainment News Network paragraphs and 50 ABC newswire paragraphs. Contrary to findings on smaller textual units such as word associations (Griffiths, Tenenbaum, & Steyvers, 2007), our results suggest that when single paragraphs are compared, simple nonreductive models (word overlap and vector space) can provide better similarity estimates than more complex models (LSA, Topic Model, SpNMF, and CSM). Second, various methods of corpus creation were explored to facilitate the semantic models’ similarity estimates. Removing numeric and single characters and truncating document length both improved performance. Automated construction of smaller Wikipedia-based corpora proved very effective, even improving upon the performance of corpora that had been chosen for the domain. Model performance was further improved by augmenting corpora with dataset paragraphs.
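The two nonreductive models named above can be illustrated with a minimal sketch. It assumes Jaccard set overlap for the word overlap model and cosine similarity over raw term counts for the vector space model; the abstract does not give the paper's exact formulations or term weighting, so these choices and all function names are hypothetical. The tokenizer also applies one possible reading of the preprocessing step, dropping purely numeric tokens and single characters.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase, split on non-alphanumerics, then drop purely numeric
    tokens and single-character tokens (an assumed reading of the
    preprocessing described in the abstract)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if len(t) > 1 and not t.isdigit()]


def word_overlap(a, b):
    """Jaccard overlap of the two paragraphs' token sets.
    Assumption: the paper may define word overlap differently."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    """Cosine between raw term-count vectors of the two paragraphs.
    Assumption: no term weighting (such as tf-idf) is applied here."""
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Neither function needs a background corpus, which is the sense in which these models are nonreductive: they compare the paragraphs' own words directly instead of projecting them into a lower-dimensional learned space.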