## INTRODUCTION

Aggregating like-entities is a way humans understand the world. Classification, as a knowledge creation and organization mechanism, represents an essential part of intelligence (Taulbee, 1965). Classification or categorization has been an important research area in machine learning (ML), information organization (IO), and information retrieval (IR). Text classification is critical to many information-related processes such as information extraction and knowledge discovery (Knight, 1999; Sebastiani, 2002). It is a fundamental function of digital libraries and can be applied to various information retrieval operations such as indexing and filtering (Yang and Pedersen, 1997; Ke et al., 2007).

In text clustering and classification research, TF*IDF has been extensively used for term weighting and document representation (Yang and Pedersen, 1997; Liu et al., 2003; Zhang et al., 2011). While term frequency (TF) indicates the degree of a document's association with a term, inverse document frequency (IDF) manifests a term's specificity, which is key to determining the term's value for weighting and relevance ranking (Spärck Jones, 2004). Although many classification algorithms have been developed, TF*IDF and its variations remain the *de facto* standard for term weighting in classification (Yang and Pedersen, 1997; Liu et al., 2003; Zhang et al., 2011).
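The classic weighting can be sketched in a few lines. The following is a minimal illustration using raw TF and unsmoothed IDF; the works cited above use various log-scaled, smoothed, and normalized variants:

```python
import math

def tf_idf(term_counts, doc_freq, n_docs):
    """Weight each term by its raw frequency in the document (TF)
    times the log-scaled inverse document frequency (IDF)."""
    return {t: tf * math.log(n_docs / doc_freq[t])
            for t, tf in term_counts.items()}

# Toy collection of 4 documents: "cat" appears in 2 of them,
# "the" in all 4, so "the" receives zero weight (IDF = log 1 = 0).
weights = tf_idf({"cat": 3, "the": 5}, {"cat": 2, "the": 4}, 4)
```

A common term occurring in every document thus contributes nothing to the representation, which is precisely the specificity effect IDF is designed to capture.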

In IR, information and probability theories have provided important guidance to the development of classic techniques such as probabilistic retrieval and language modeling (Robertson and Zaragoza, 2009). Information-theoretic measures such as mutual information and relative entropy have also been used for various processes including feature selection and matching (Kullback and Leibler, 1951; Yang and Pedersen, 1997).

The probabilistic retrieval framework provides an important theoretical ground to IDF weights (Robertson, 2004). IDF (typically computed as log(N/n) for a term that occurs in n of a collection's N documents) resembles the entropy formula in Shannon's information theory, and several works have attempted to justify IDF from an information-theoretic view. IDF can be converted into Kullback-Leibler (KL) information (*relative entropy*) between term probability distributions in a document and in the collection (Aizawa, 2000). KL divergence measures information for discrimination between two probability distributions by quantifying the entropy change in a non-symmetric manner (Kullback and Leibler, 1951).
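The KL divergence referred to above can be computed directly for two discrete distributions; the toy distributions p and q below are illustrative only:

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum over i of p_i * log(p_i / q_i).
    Asymmetric, and infinite when q_i = 0 for some i with p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]  # e.g., term distribution within a document
q = [0.9, 0.1]  # e.g., term distribution in the collection

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)  # differs from d_pq: KL is not symmetric
```

Swapping the arguments changes the value, which is the asymmetry discussed below.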

It has also been shown that a term's IDF is equivalent to the mutual information between the term and the document collection (Fano, 1961; Siegler and Witbrock, 1999). Mutual information can be expressed as the KL information that quantifies the difference between the joint probabilities and the product probabilities of two random variables. The asymmetry of KL information stems from the assumption that one of the two distributions is considered *closer* (truer) to the ultimate case, so the information quantity is weighted by that distribution. Consequently, the (absolute) amount of information differs when only the direction of change is reversed.
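The connection between mutual information and KL information can be made concrete: mutual information is exactly the KL divergence between a joint distribution and the product of its marginals. A small sketch, with illustrative 2x2 joint probability tables:

```python
import math

def mutual_information(joint):
    """I(X;Y) computed as the KL divergence between the joint
    distribution and the product of its marginal distributions."""
    px = [sum(row) for row in joint]          # marginal over X
    py = [sum(col) for col in zip(*joint)]    # marginal over Y
    return sum(pxy * math.log(pxy / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)

# Independent variables (joint = product of marginals) carry
# zero mutual information; perfectly coupled variables carry log 2.
independent = [[0.25, 0.25], [0.25, 0.25]]
coupled = [[0.5, 0.0], [0.0, 0.5]]
```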

In the KL information view of IDF, the asymmetry of KL divergence and the infinite information it yields in special cases (e.g., when a term has positive probability in a document but zero probability in the reference distribution) have undesirable consequences in the IR context. Variations of TF*IDF such as BM25 include additional variables for normalization and smoothing, which often require additional training and tuning. While empirical studies have found various optimal parameter values for different data sets, it is worthwhile to investigate the theoretical underpinnings of related models in order to devise new term weighting schemes.
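For concreteness, one common BM25 formulation is sketched below. The k1 (TF saturation) and b (length normalization) values shown are conventional defaults rather than universally optimal settings, which is exactly the tuning burden noted above; exact IDF smoothing also varies across implementations:

```python
import math

def bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Per-term BM25 weight: a smoothed IDF scaled by a saturating,
    length-normalized TF component. k1 controls how quickly TF
    saturates; b controls the strength of length normalization."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    norm_tf = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf
```

Unlike plain TF*IDF, doubling a term's frequency less than doubles its BM25 weight, since the TF component saturates toward k1 + 1.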

From an information-centric view, we develop a new model for document representation (term weighting). By quantifying the amount of semantic information required to explain probability distribution changes, the *least information* theory (LIT) offers a new measure through which terms can be weighted based on their probability distributions in documents versus in the collection. We derive two basic quantities, namely LI Binary (LIB) and LI Frequency (LIF), which can be used separately or in combination to represent documents. We conduct experiments on several benchmark collections for text classification and demonstrate the proposed methods' effectiveness compared to classic TF*IDF. The major contribution is not only an empirically competitive term weighting scheme but also a novel way of thinking in information science research.