## Introduction

Effective and efficient browsing methods for large text collections have been widely examined in recent years. Among existing browsing methods, Scatter/Gather is well known for its ease of use and for its effectiveness when a query is difficult to specify precisely (Cutting et al., 1992; Hearst & Pedersen, 1996). It combines search with interactive navigation by gathering and reclustering user-selected clusters.

The Scatter/Gather browsing method was first proposed by Cutting et al. (1992). In each iteration, the system scatters the dataset into a small number of clusters (groups) and presents a short summary of each to the user. The user selects one or more of the groups for further study. The selected groups are then gathered together and clustered again using the same clustering algorithm. With each successive iteration the groups become smaller and more focused. Iterating in this way helps users refine their queries and find the desired information in a large data collection.
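The scatter, select, and gather steps described above can be sketched as a small loop. The following is a minimal illustration only, not the implementation of Cutting et al. (1992): the function names (`scatter`, `summarize`, `gather`), the bag-of-words representation, and the toy k-means-style clusterer seeded with the first *k* documents are all assumptions made for this example.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for one document (assumed representation)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def scatter(docs, k, iters=5):
    """Toy stand-in clusterer: k-means-style, seeded with the first k docs."""
    vecs = [vectorize(d) for d in docs]
    k = min(k, len(docs))
    centroids = [vecs[i] for i in range(k)]
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for i, v in enumerate(vecs):
            best = max(range(k), key=lambda c: cosine(v, centroids[c]))
            groups[best].append(i)
        # Recompute each centroid as the summed term counts of its members.
        centroids = [
            sum((vecs[i] for i in g), Counter()) if g else centroids[c]
            for c, g in enumerate(groups)
        ]
    return [[docs[i] for i in g] for g in groups if g]

def summarize(group, n=3):
    """Short cluster summary: the n most frequent terms in the group."""
    counts = sum((vectorize(d) for d in group), Counter())
    return [term for term, _ in counts.most_common(n)]

def gather(groups, selected):
    """Merge the user-selected groups into one smaller collection."""
    return [d for i in selected for d in groups[i]]

if __name__ == "__main__":
    docs = [
        "star galaxy telescope", "goal match football",
        "galaxy star orbit", "football goal league",
        "telescope orbit star", "league match goal",
    ]
    groups = scatter(docs, k=2)
    for g in groups:
        print(summarize(g), len(g))
    # Simulate the user picking the first group, then scatter again.
    focused = gather(groups, [0])
    for g in scatter(focused, k=2):
        print(summarize(g), len(g))
```

In an interactive system the indices passed to `gather` would come from the user's selection, and the loop would repeat until the groups are small enough to read directly.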

Since the Scatter/Gather method requires online clustering of a large data corpus, fast clustering algorithms are essential. Two linear-time clustering algorithms, Buckshot and Fractionation, were implemented for the original Scatter/Gather method (Cutting et al., 1992). Both algorithms have *O(kn)* time complexity, where *k* is the number of desired clusters and *n* is the total number of documents. Compared with Buckshot, Fractionation is somewhat slower but more accurate. Although better than quadratic time, *O(kn)* is still not fast enough for large document collections. Jensen, Beitzel, Pilotto, Goharian, and Frieder (2002) proposed and evaluated a parallel version of the Buckshot algorithm, which achieved an *O(n log n)* time complexity.
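To make the *O(kn)* bound more concrete, the sketch below follows the usual outline of Buckshot: cluster a random sample of about √(*kn*) documents with a more expensive agglomerative method, then assign every document to the nearest of the resulting *k* centers. It is a hedged illustration rather than the original implementation; the helper names, the cosine measure over term counts, and the naive centroid-based merging are assumptions. The merge loop here is deliberately simple and slower than a tuned quadratic-time method; the point is only that the expensive step runs on the sample, not on all *n* documents.

```python
import math
import random
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for one document (assumed representation)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vecs):
    """Summed term counts of a cluster (unnormalised centroid)."""
    return sum(vecs, Counter())

def agglomerate(vecs, k):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    whose centroids are most similar until only k clusters remain."""
    clusters = [[v] for v in vecs]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return [centroid(c) for c in clusters]

def buckshot(docs, k, seed=0):
    """Buckshot-style clustering sketch (hypothetical helper).

    1. Sample about sqrt(k * n) documents.
    2. Cluster the sample agglomeratively to obtain k centers.
    3. Assign each of the n documents to its nearest center: O(kn).
    """
    rng = random.Random(seed)
    n = len(docs)
    sample_size = min(n, max(k, math.ceil(math.sqrt(k * n))))
    sample = rng.sample(docs, sample_size)
    centers = agglomerate([vectorize(d) for d in sample], k)
    groups = [[] for _ in centers]
    for d in docs:
        v = vectorize(d)
        best = max(range(len(centers)), key=lambda c: cosine(v, centers[c]))
        groups[best].append(d)
    return groups
```

With a quadratic-time method on the sample, step 2 costs on the order of (√(*kn*))² = *kn* similarity computations, and step 3 compares each of the *n* documents with *k* centers, which is where the overall *O(kn)* figure comes from.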

Cutting, Karger, and Pedersen (1993) proposed an algorithm that uses a precomputed hierarchy of meta-documents to expand the selected items and recluster the resulting subset. Because it deals with only a subset of *M* meta-documents in each iteration, the algorithm achieves constant interaction time for Scatter/Gather browsing. However, the reclustering process is not efficient enough for real-time interaction because *M* cannot be too small (*M* ≫ *k*, where *k* is the number of clusters needed). Moreover, because meta-documents summarize their descendant documents, they may be too large to be reclustered efficiently or too small to be accurately representative.