The huge volume of free-form, unstructured text in the blogosphere, its increasing rate of production, and its shrinking window of relevance present serious challenges to the public policy analyst who seeks to take public opinion into account. Most of the tools that address this problem use XML tagging and other Web 3.0 approaches, which do not address the actual content of blog posts and the associated commentary. We give a tutorial review of latent semantic analysis and self-organizing maps, as considered in this context, and show how to apply the self-organizing map over a probabilistic latent semantic space to the problem of completely unsupervised clustering of unstructured text, in a way that is entirely independent of spelling, grammar, and even source language. This provides an algorithm suitable for clustering free-form commentary with a well-structured test environment. The algorithm is applied here to academic paper abstracts, treated as unstructured text as though they were blog posts, because this set of documents has a known ground truth. The algorithm constructs a word category map and a document map in which words with similar meaning and documents with similar content are clustered together.
WIREs Data Mining Knowl Discov 2014, 4:71–86. doi: 10.1002/widm.1112
Conflict of interest: The authors have declared no conflicts of interest for this article.