Parametric mixture model for multitopic text


Abstract

Text generally covers multiple topics. Automatic topic detection from text is therefore harder than traditional pattern classification, because a single document may belong to several categories at once. Conventional methods do not use a generative model of multicategory text, which limits them when applied to the multicategory detection problem. In this paper, we propose new probabilistic generative models, parametric mixture models (PMM1 and PMM2), and present a method for simultaneously detecting multiple topics in text using PMMs. In PMMs, every multitopic class is completely represented by basis vectors, each of which corresponds to a single-topic class. Moreover, the global optimality of the estimated parameter values is theoretically guaranteed in PMM1. Furthermore, the parameter estimation and topic detection algorithms are efficient. We also show empirically the usefulness of our method through multitopic categorization of World Wide Web pages, focusing on those from the “yahoo.com” domain.

© 2006 Wiley Periodicals, Inc. Syst Comp Jpn, 37(2): 56–66, 2006; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/scj.20259
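
As an illustration of the basis-vector idea in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes that each single-topic class has a word-probability vector, and that the word distribution of a multitopic class is the equal-weight average of the basis vectors of its active topics. The function names, the bag-of-words count input, and the brute-force search over label vectors are all assumptions made for illustration. In particular, the exhaustive search stands in for the efficient detection algorithm that the abstract mentions but does not describe.

```python
import itertools
import numpy as np

def mixture_word_distribution(theta, y):
    """Average the basis vectors of the active topics.

    theta: (L, V) array; row l is the word distribution of single topic l.
    y:     length-L binary label vector (1 = topic present).
    Assumption: equal weights over active topics.
    """
    theta = np.asarray(theta, dtype=float)
    y = np.asarray(y, dtype=float)
    if y.sum() == 0:
        raise ValueError("at least one topic must be active")
    weights = y / y.sum()
    return weights @ theta

def log_likelihood(x, theta, y):
    """Multinomial log-likelihood of word counts x (up to a constant in x)."""
    x = np.asarray(x, dtype=float)
    phi = mixture_word_distribution(theta, y)
    seen = x > 0  # words with zero count contribute nothing
    return float(np.sum(x[seen] * np.log(phi[seen])))

def detect_topics(x, theta):
    """Hypothetical brute-force detector: score every non-empty label vector."""
    n_topics = np.asarray(theta).shape[0]
    candidates = [y for y in itertools.product([0, 1], repeat=n_topics) if any(y)]
    return max(candidates, key=lambda y: log_likelihood(x, theta, y))

# Two single-topic basis vectors over a three-word vocabulary (made-up values).
theta = [[0.8, 0.1, 0.1],
         [0.1, 0.8, 0.1]]
print(detect_topics([5, 5, 0], theta))  # counts split evenly across both topics
print(detect_topics([9, 1, 0], theta))  # counts dominated by the first topic
```

The brute-force search grows exponentially with the number of topics, so it is only practical for a handful of classes; it is shown here solely to make the representation concrete.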
