Document clustering using the LSI subspace signature model



We describe the latent semantic indexing subspace signature model (LSISSM) for semantic content representation of unstructured text. Grounded on singular value decomposition, the model represents terms and documents by the distribution signatures of their statistical contribution across the top-ranking latent concept dimensions. LSISSM matches term signatures with document signatures according to their mapping coherence between latent semantic indexing (LSI) term subspace and LSI document subspace. LSISSM does feature reduction and finds a low-rank approximation of scalable and sparse term-document matrices. Experiments demonstrate that this approach significantly improves the performance of major clustering algorithms such as standard K-means and self-organizing maps compared with the vector space model and the traditional LSI model. The unique contribution ranking mechanism in LSISSM also improves the initialization of standard K-means compared with random seeding procedure, which sometimes causes low efficiency and effectiveness of clustering. A two-stage initialization strategy based on LSISSM significantly reduces the running time of standard K-means procedures.