Least information document representation for automated text classification

Authors


Abstract

We propose the Least Information theory (LIT) to quantify the meaning of information in probability distributions and derive a new document representation model for text classification. By extending Shannon entropy to accommodate a non-linear relation between information and uncertainty, LIT offers an information-centric approach to weighting terms based on their probability distributions in documents vs. in the collection. We develop two term weight quantities in the document classification context: 1) LI Binary (LIB), which quantifies (least) information due to the observation of a term's (binary) occurrence in a document; and 2) LI Frequency (LIF), which measures information for the observation of a randomly picked term from the document. Both quantities are computed given term distributions in the document collection as prior knowledge and can be used separately or combined to represent documents for text classification. We conduct classification experiments on three benchmark collections, in which the proposed methods show strong performance compared to classic TF*IDF. In particular, the LIB*LIF weighting scheme, which combines LIB and LIF, outperforms TF*IDF in several experimental settings. Despite its similarity to TF*IDF, the formulation of LIB*LIF is very different and offers a new way of thinking for modeling information processes beyond classification.

INTRODUCTION

Aggregating like-entities is a way humans understand the world. Classification, as a knowledge creation and organization mechanism, represents an essential part of intelligence (Taulbee, 1965). Classification or categorization has been an important research area in machine learning (ML), information organization (IO), and information retrieval (IR). Text classification is critical to many information-related processes such as information extraction and knowledge discovery (Knight, 1999; Sebastiani, 2002). It is a fundamental function of digital libraries and can be applied to various information retrieval operations such as indexing and filtering (Yang and Pedersen, 1997; Ke et al., 2007).

In text clustering and classification research, TF*IDF has been extensively used for term weighting and document representation (Yang and Pedersen, 1997; Liu et al., 2003; Zhang et al., 2011). While term frequency (TF) indicates the degree of a document's association with a term, inverse document frequency (IDF) is the manifestation of a term's specificity, key to determine the term's value toward weighting and relevance ranking (Spärck-Jones, 2004). While many classification algorithms have been developed, TF*IDF and its variations remain the de facto standard for term weighting in classification (Yang and Pedersen, 1997; Liu et al., 2003; Zhang et al., 2011).

In IR, information and probability theories have provided important guidance to the development of classic techniques such as probabilistic retrieval and language modeling (Robertson and Zaragoza, 2009). Information-theoretic measures such as mutual information and relative entropy have also been used for various processes including feature selection and matching (Kullback and Leibler, 1951; Yang and Pedersen, 1997).

The probabilistic retrieval framework provides an important theoretical grounding for IDF weights (Robertson, 2004). IDF (\log(N/n_i), where n_i is the number of documents containing the term and N the total number of documents) resembles the entropy formula in Shannon's information theory, and several works have attempted to justify IDF from an information-theoretic view. IDF can be converted into Kullback-Leibler (KL) information (relative entropy) between term probability distributions in a document and in the collection (Aizawa, 2000). KL divergence measures information for discrimination between two probability distributions by quantifying the entropy change in a non-symmetric manner (Kullback and Leibler, 1951).

It has also been shown that a term's IDF is equivalent to the mutual information between the term and the document collection (Fano, 1961; Siegler and Witbrock, 1999). Mutual information can be translated into KL information that quantifies the difference between the joint probabilities and product probabilities of two random variables. The non-symmetry of KL information is due to the assumption that one of the two distributions is considered closer (truer) to the ultimate case and the information quantity should be weighted by that distribution. This leads to the consequence that the (absolute) amount of information is different if simply the direction of change is different.

In the KL information view of IDF, the asymmetry of KL divergence and the infinite information it quantifies in special cases have undesirable consequences in the IR context. Variations of TF*IDF such as BM25 include additional variables for normalization and smoothing, which often require additional training and tuning. While empirical studies have found various optimal parameter values for different data, it is worthwhile to investigate theoretical underpinnings of related models in order to innovate new term weighting schemes.

From an information-centric view, we develop a new model for document representation (term weighting). By quantifying the amount of semantic information required to explain probability distribution changes, the least information theory (LIT) offers a new measure through which terms can be weighted based on their probability distributions in documents vs. in the collection. We derive two basic quantities, namely LI Binary (LIB) and LI Frequency (LIF), which can be used separately or combined to represent documents. We conduct experiments on several benchmark collections for text classification to demonstrate the proposed methods' effectiveness compared to classic TF*IDF. The major contribution here is not only another term weighting scheme that is empirically competitive but also a novel way of thinking in information science research.

PROPOSED THEORY

In this section, we propose a new theory to quantify the meaning of information via an extension of Shannon's entropy equation. We start with an example to motivate what to expect of such a theory and then introduce the least information theory, in which the expected characteristics are observed.

A Motivating Example

Let's start the discussion with a simple binary case. Suppose there are two exhaustive and mutually exclusive inferences A and B on a given hypothesis, with probabilities pa and pb respectively (e.g., the likelihood of each candidate winning an election in a one-on-one race). Given the probability distribution, it is straightforward to measure the uncertainty of the inference system using Shannon's entropy formula: H = –k Σ p ln p. When the outcome is known, the uncertainty is reduced to zero and the amount of (missing) information, according to Shannon, can be taken as the reduction of the uncertainty (Shannon, 1948). This entropy-based measure essentially determines the amount of missing information given a specified distribution, regardless of the ultimate outcome (Baierlein, 1971).

However, the notion of information as a linear function of reduced uncertainty has counterintuitive implications when the meaning of the outcome is taken into account. Suppose pa is much larger than pb (e.g., candidate A is more likely to win the election). Intuitively, the outcome of B being the correct inference appears to require more information to explain than the outcome of A; for example, the less likely (weaker) candidate winning an election is bigger news and requires more explanation than otherwise.

Model Expectation

If information is a linear function of uncertainty reduction, the outcome itself has no influence on the amount of information that explains it, which is against our intuition. In the special case of the above example, the amount of information should depend not only on the uncertainty of the inferences but also on the ultimate outcome (the correct inference). Furthermore, we reason that, while uncertainty depends only on a specified probability distribution, the amount of information required to explain the outcome, and more generally to explain a probability distribution change, is beyond a linear function of uncertainty.

Indeed, using Shannon's entropy measure to quantify the amount of meaningful information is beyond the scope of classic information theory. The original purpose of Shannon's theory, as noted in his masterpiece, was for engineering communication systems where the “meaning of information was considered irrelevant” (Shannon, 1948, p. 379). Information retrieval is centered on the notion of relevance, which has an important semantic (meaning) dimension. Measuring “semantic quantities” of information requires an extension of Shannon's theory, a better clarification of the relationship between information and entropy, and justification of this relationship. Efforts to identify meaning quantitatively have been made, with limited progress (Shaw and Davis, 1983).

While theories such as KL information (relative entropy) offer alternatives to the simplified entropy reduction view of information, some characteristics of relative entropy do not meet our expectations about such a measure. Specifically, the asymmetry of the KL function is due to an assumption about one distribution being truer than the other, which is not necessarily realistic. In addition, relative entropies over the course of continuous probability changes in one direction do not add up to the overall amount. Finally and very importantly, extreme probability changes (e.g., when a probability changes from a tiny value to nearly 1) lead to infinite KL information, which is a particularly undesirable property for term weighting in information retrieval.
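As a simple numerical illustration of two of these points (the asymmetry and the unbounded values), the following minimal sketch uses arbitrary two-outcome distributions, not drawn from any collection used here:

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue           # 0 * log(0/q) is taken as 0
        if qi == 0.0:
            return math.inf    # unbounded information for an "impossible" outcome
        total += pi * math.log(pi / qi)
    return total

# Asymmetry: D(p || q) differs from D(q || p) for the same pair of distributions.
p, q = [0.9, 0.1], [0.5, 0.5]
print(kl_divergence(p, q))  # ~0.368 nats
print(kl_divergence(q, p))  # ~0.511 nats

# Extreme change: moving a near-zero probability to (almost) certainty yields
# an arbitrarily large divergence.
print(kl_divergence([1.0, 0.0], [1e-9, 1.0 - 1e-9]))  # ~20.7 nats, unbounded as 1e-9 -> 0
```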

Least Information (LI)

In this section, we present the proposed least information theory. Let X be the prior (initially specified) probabilities for a set of exhaustive and mutually exclusive inferences: X = [x1, x2, …, xn], where xi is the prior probability of the ith inference on a given hypothesis. Let Y denote the posterior (changed) probabilities after certain information is known: Y = [y1, y2, …, yn], where yi is the informed probability of the ith inference. The uncertainties of the two distributions are computed by Shannon entropy:

H(X) = -k \sum_{i=1}^{n} x_i \ln x_i    (1)
H(Y) = -k \sum_{i=1}^{n} y_i \ln y_i    (2)

The amount of information obtained from X to Y, in Shannon's treatment, can be measured via the reduction of entropy:

I = H(X) - H(Y)    (3)

Inferences are semantically exclusive and involve different meanings. When probabilities vary from X to Y, the two distributions are semantically different and it is obvious that some amount of information is responsible for the variance. Therefore, we need to examine the amount of information associated with individual inferences via the measurement of uncertainty change. With Equation 3, however, it is easy to show that when there are changes in the probabilities, there may be increases, decreases, or no change in the overall uncertainty. We observe that even when there is no change in the entropy, there is still an amount of information responsible for any variance in the probability distribution. To use the overall (system-wide) uncertainty for the measurement of information ignores semantic relevance of changes in individual inferences.

Here our new least information model departs from the classic measure of information as reduction of uncertainty (entropy). First, we reason that a change in the uncertainty of an inference, either an increase or decrease, requires a relevant amount of information that is semantically responsible for it. The overall information needed to explain changes in all inference probabilities is the sum of individual pieces of information associated with each inference.

Second, for an individual inference i, the probability may vary in one of the two semantic directions, i.e., to increase or to decrease it. In either case, there is always a (positive) amount of information responsible for that variance. If we assume inferences are semantically independent [1], the absolute values of these independent pieces of information add linearly to the overall amount of information.

In addition, it is reasonable for such an information quantity to meet the condition that continuous, smaller changes in one direction should add incrementally to a bigger change in the same direction. That is, pieces of information responsible for small, continuous changes of an inference probability in the same direction should add up to the amount of information for the overall change. For example, if the ith inference's probability increases from xi to yi and then to zi, the least amount of information required for the change from xi to yi and the amount from yi to zi should add up to the overall least information required for the change from xi to zi. We define dHi as the amount of entropy change due to a tiny change dpi of probability pi:

dH_i = \ln(1/p_i) \, dp_i = -\ln p_i \, dp_i    (4)

In the configuration view of entropy, this microscopic variance of entropy due to a small change in an inference's probability is the change of the weighted (pi) number of configurations (ln(1/pi)) (Baierlein, 1971). In other words, it is the change in the number of configurations (ln(1/pi)) due to a varied probability weight (pi).

Every tiny change in the probabilities requires some explanation (information). Aggregating (integrating) the small changes of uncertainty leads to the amount of information required for a macro-level change. A macroscopic uncertainty change due to a significant probability shift of an inference is therefore the sum (integration) of continuous microscopic changes over the variance range. We thus define the least amount of information Ii required to explain the probability change of the ith inference as the integration (aggregation) of all tiny absolute (positive) changes of entropy dHi:

I_i = \int_{x_i}^{y_i} |dH_i|    (5)
    = \left| \int_{x_i}^{y_i} \ln(1/p_i) \, dp_i \right|    (6)

We define informative entropy gi as a function of an inference's probability:

g_i = g(p_i) = p_i (1 - \ln p_i)    (7)

The equation for least information Ii for the ith inference can be rewritten as:

I_i = |g(y_i) - g(x_i)|    (8)

The total least Information I is the sum of partial least information in every inference:

I = \sum_{i=1}^{n} I_i    (9)
  = \sum_{i=1}^{n} |g(y_i) - g(x_i)|    (10)
  = \sum_{i=1}^{n} |y_i (1 - \ln y_i) - x_i (1 - \ln x_i)|    (11)

where n is the number of inferences, xi is the initially specified probability of the ith inference, and yi the revised probability of the ith inference.
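To make the computation concrete, the following is a minimal Python sketch of Equations 7–11 as presented above (using natural logarithms, as in Figure 1); the example distributions are arbitrary illustrations:

```python
import math

def g(p):
    """Informative entropy of a single inference probability (Equation 7): g(p) = p(1 - ln p)."""
    return 0.0 if p == 0.0 else p * (1.0 - math.log(p))

def least_information(x, y):
    """Least information required to explain a change from distribution x to y (Equation 11)."""
    return sum(abs(g(yi) - g(xi)) for xi, yi in zip(x, y))

# Resolving two equally likely inferences to certainty requires unit information.
print(least_information([0.5, 0.5], [1.0, 0.0]))  # 1.0

# The same amount of least information is required in either direction of change.
print(least_information([0.2, 0.8], [0.6, 0.4]))
print(least_information([0.6, 0.4], [0.2, 0.8]))

# A less likely outcome requires more least information to explain than a likely one.
print(least_information([0.9, 0.1], [0.0, 1.0]))  # larger
print(least_information([0.9, 0.1], [1.0, 0.0]))  # smaller
```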

Important Model Characteristics

It is worth noting that Equation 11 is to measure the least amount of information required to explain a probability distribution change for a set of inferences. Given that information may alter a probability distribution in various semantic directions and change the uncertainty in both positive and negative directions, the actual amount of information leading to such a change may consist of multiple pieces of information acting on different directions.

Without an exhaustive analysis of the process, the actual amount of information cannot be deduced solely from an investigation of probability distributions. It is only reasonable to quantify the least information needed for that change – that is, the sum of all needed amounts of information at the very least, every tiny piece of which contributes in the same direction of a change. In addition, this model does not consider the process of removing information, which, in effect, is equivalent to adding another piece of information that has perfectly opposite semantics [2] in the same amount.

Based on Equation 11, several important characteristics of least information can be observed. Figure 1 compares the least information measure with entropy reduction in a two-exclusive-inference case. We summarize some of these characteristics below.

  • Absolute information and symmetry: The amount of least information required for a probability change from X to Y is the same as that from Y to X, though their semantic meanings are different.

  • Addition of continuous change: Amounts of least information for small, continuous probability changes in the same semantic directions add linearly to the amount of least information responsible for the overall change. In short, I(X→Z) = I(X→Y) + I(Y→Z), if and only if the changes X→Y and Y→Z are in the same semantic direction.

  • Unit Information: In the special case when there are two equally possible inferences, the amount of least information needed to explain an outcome (certainty) is one: I = |g(1) - g(1/2)| + |g(0) - g(1/2)| = 1, regardless of the log base in the equation (see Figure 1).

  • In the special case of reducing uncertain inferences to certainty (i.e., when the ultimate outcome becomes known):

    • With equally likely inferences, when there are more choices, the least information needed to explain an outcome is larger.

    • The less likely the outcome, the larger the amount of least information needed to explain it.

  • Zero least information: The amount of least information is zero if and only if there is no change in the probability distribution.

Figure 1.

Least Information vs. Entropy: Reducing two exclusive uncertain inferences to certainty. Log functions in equations use the natural base. The asymmetry of least information in the plot is a manifestation of its dependence on the outcome. Compare to Shannon (1948).

Least Information for Term Weighting

Now we apply the proposed least information theory to information retrieval (IR) for term weighting and document representation. With a focus on quantifying semantics of information, the least information measure is theoretically compatible with the central problem in IR, which is about semantic relevance.

In the bag-of-words approach to IR, a document can be viewed as a set of terms with probabilities (estimated by frequencies) of occurrence. While the entire collection represents the domain in which searches are conducted, each document contains various pieces of information that differentiate it from other documents in the domain. By analyzing a term's probability (frequency) in a document vs. that in the collection, we compute the information the document conveys in the term and use it to weight the term. In other words, taking domain distributions as prior knowledge, we can measure the amount of least information conveyed by a specific document when it is observed.

In particular, we conjecture that the larger the amount of least information needed to explain a term's probability in a document, the more heavily the term should be weighted to represent the document. Hence, we transform the question of document representation into weighting terms according to their amounts of least information in documents. In this study, we propose two specific weighting methods, one based on a binary representation of term occurrence (0 vs. 1) and the other based on term frequencies. These two methods will be used separately and combined in fusion methods as well.

LI Binary (LIB) Model

In the binary model, a term either occurs or does not occur in a document. If we randomly pick a document from the collection, the chance that a term ti appears in the document can be estimated by the ratio between the number of documents containing the term ni (i.e., document frequency) and the total number of documents N. Let p(ti|C) = ni/N denote the probability of term ti occurring in a randomly picked document in collection C; p(t̄i|C) is the probability that the term does not appear:

p(t̄_i|C) = 1 - p(t_i|C) = (N - n_i)/N

When a specific document d is observed, it becomes certain whether a term occurs in the document or not. Hence the term probability given a specific document, p(ti|d), is either 1 or 0. Given the definition of gi in Equation 7, the least amount of information in term ti from observing document d can be computed by:

I(t_i, d) = |g(t_i|d) - g(t_i|C)| + |g(t̄_i|d) - g(t̄_i|C)|    (12)

where g(t_i|d) is shorthand for g(p(t_i|d)) and likewise for the other quantities.

The above equation gives the amount of information a term conveys in a document regardless of its semantic direction. When a query term ti does not appear in document d, the least information associated with the term should be treated as negative because it makes the document less relevant to the term. Hence, the weighting function should not only consider the amount of information but also the sign (positive vs. negative) of the quantity. Hence, LI Binary (LIB) can be computed by:

LIB_{i,d} = [g(t_i|d) - g(t_i|C)] + [g(t̄_i|C) - g(t̄_i|d)]    (13)

Keeping only quantities related to ti (and removing those associated with t̄i), we simplify the LIB equation to:

LIB_{i,d} = g(t_i|d) - g(t_i|C)    (14)
          = g(t_i|d) - (n_i/N)(1 - \ln(n_i/N))    (15)

The quantity depends on the observation of term ti in the document: g(ti|d) is 1 when ti appears in document d and 0 if otherwise, according to Equation 7. That is:

LIB_{i,d} = 1 - (n_i/N)(1 - \ln(n_i/N))    if t_i occurs in d
LIB_{i,d} = -(n_i/N)(1 - \ln(n_i/N))    otherwise    (16)

where ni is the document frequency of term ti and N is the total number of documents. The larger the LIB, the more information the term contributes to the document and should be weighted more heavily in the document representation. LIB is similar in spirit to IDF and its value represents the discriminative power of the term when it appears in a document.
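A minimal sketch of the simplified LIB weight (Equations 14–16) follows; the g function is the informative entropy of Equation 7, and the document-frequency values are arbitrary illustrations:

```python
import math

def g(p):
    """Informative entropy g(p) = p(1 - ln p), with g(0) = 0 (Equation 7)."""
    return 0.0 if p == 0.0 else p * (1.0 - math.log(p))

def lib_weight(term_in_doc, n_i, N):
    """Simplified LIB weight (Equations 14-16).

    term_in_doc -- True if the term occurs in the document
    n_i         -- document frequency of the term
    N           -- total number of documents in the collection
    """
    g_doc = 1.0 if term_in_doc else 0.0  # g(t_i|d): p(t_i|d) is either 1 or 0
    g_col = g(n_i / N)                   # g(t_i|C): prior based on document frequency
    return g_doc - g_col

# A rare term that does occur carries more (positive) information than a common one.
print(lib_weight(True, 5, 10000))      # close to 1
print(lib_weight(True, 9000, 10000))   # much smaller
# An absent term contributes a negative weight.
print(lib_weight(False, 9000, 10000))
```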

LI Frequency (LIF) Model

In the LI Frequency (LIF) model, we use term frequencies to model least information. Treating a document collection C as a meta-document, the probability of a term randomly picked from the collection being a specific term ti can be estimated by: p(ti|C) = Fi/L, where Fi is the total number of occurrences of term ti in collection C and L the overall length of C (i.e., the sum of all document lengths).

When a specific document d is observed, the probability of picking term ti from this document can be estimated by: p(ti|d) = tfi,d/Ld, where tfi,d is the number of times term ti occurs in document d and Ld is the length of the document. Again, for each term ti, there are two exclusive inferences, namely the randomly picked term being the specific term (ti) or not (t̄i). To quantify a term's LIF weight, we measure least information that explains the change from the term's probability distribution in the collection to its distribution in the document in question:

LIF_{i,d} = [g(t_i|d) - g(t_i|C)] + [g(t̄_i|C) - g(t̄_i|d)]    (17)

We focus on the quantities g(ti|d) and g(ti|C) to estimate least information of each term when a specific document is observed. Without quantities g(t̄i|C) and g(t̄i|d), the LIF equation is simplified to:

LIF_{i,d} = g(t_i|d) - g(t_i|C)    (18)
          = (tf_{i,d}/L_d)(1 - \ln(tf_{i,d}/L_d)) - (F_i/L)(1 - \ln(F_i/L))    (19)

where tfi,d is term frequency of term ti in document d and Ld is the document length. Fi is collection frequency of term ti (the sum of term frequencies in all documents) whereas L is the overall length of all documents. In a sense, LIF can be seen as a new approach to modeling term frequencies with document length and collection frequency normalization. In this study, we use raw term frequencies to estimate probabilities and do not use any smoothing techniques to fine tune the estimates.
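A corresponding sketch of the simplified LIF weight (Equations 18–19); the frequency and length values are arbitrary illustrations:

```python
import math

def g(p):
    """Informative entropy g(p) = p(1 - ln p), with g(0) = 0 (Equation 7)."""
    return 0.0 if p == 0.0 else p * (1.0 - math.log(p))

def lif_weight(tf_id, L_d, F_i, L):
    """Simplified LIF weight (Equations 18-19).

    tf_id -- frequency of the term in the document
    L_d   -- document length (number of tokens)
    F_i   -- collection frequency of the term
    L     -- total length of the collection
    """
    return g(tf_id / L_d) - g(F_i / L)

# A term that is denser in the document than in the collection gets a positive weight.
print(lif_weight(5, 100, 200, 1_000_000))     # p(t|d) = 0.05 vs. p(t|C) = 0.0002 -> positive
print(lif_weight(1, 100, 50_000, 1_000_000))  # p(t|d) = 0.01 vs. p(t|C) = 0.05  -> negative
```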

Fusion of LIB & LIF

While LIB uses binary term occurrence to estimate least information a document carries in the term, LIF measures the amount of least information based on term frequency. The two are related quantities with different focuses. As discussed, the LIB quantity is similar in spirit to IDF (inverse document frequency) whereas LIF can be seen as a means to normalize TF (term frequency).

In light of TF*IDF, we reason that combining the two will potentiate each quantity's strength for term weighting. Hence we propose three fusion methods to combine the two quantities by addition and multiplication:

  • 1. LIB+LIF: To weight a term, we simply add LIB and LIF together, treating them as two separate pieces of information.
  • 2. LIB*LIF: In this fusion method, we follow the idea of TF*IDF by multiplying the LIB and LIF quantities for each term. Because a least information quantity falls in the range of [–1, 1] and can be a negative value, we normalize LIB and LIF values to [0, 2] by adding 1 to each before multiplication.
  • 3. LIB*TF: This method multiplies the LIB quantity by a document-length-normalized TF (term frequency), similar to the above LIB*LIF method.

These fusion methods allow us to examine potential strengths and weaknesses of the proposed least information term weights for classification. We study LIB and LIF as well as the above fusion methods in our experiments. Given the extensive use of TF*IDF in text classification research, we use it as the baseline in the study.
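A minimal sketch of the additive and multiplicative fusion schemes is given below (LIB*TF is analogous, with a length-normalized term frequency in place of LIF); the input weights are arbitrary illustrations:

```python
def fuse_lib_lif(lib, lif, method="LIB*LIF"):
    """Fuse LIB and LIF term weights.

    Both inputs are assumed to lie in [-1, 1]; for the multiplicative scheme
    they are shifted to [0, 2] before multiplication, as described above.
    """
    if method == "LIB+LIF":
        return lib + lif
    if method == "LIB*LIF":
        return (lib + 1.0) * (lif + 1.0)
    raise ValueError("unknown fusion method: " + method)

# A discriminative term (high LIB) that is also relatively frequent in the
# document (positive LIF) receives a large fused weight.
print(fuse_lib_lif(0.95, 0.20))    # ~2.34
print(fuse_lib_lif(-0.90, -0.05))  # small weight for a common term with weak evidence
```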

EXPERIMENTAL SETUP

Data Collections

Three benchmark collections were used in the study to evaluate the effectiveness of the proposed term weighting methods for text classification. These collections, namely the WebKB 4 Universities data, the 20 Newsgroups collection, and the RCV1 Reuters corpus, have been widely used in text clustering and classification research.

  • WebKB 4 Universities Data (WebKB): This data set contains 8,282 web pages collected in 1997 from computer science departments of various universities, which were manually categorized into seven categories such as student, faculty, and department. The data set was developed by the WebKB project at CMU (Craven et al., 1998).

  • 20 Newsgroups (20News): The collection contains 20,000 messages from 20 news groups (categories) (Lang, 1995). The messages were randomly picked to distribute evenly among the categories. We used a revised version which retained 18,828 messages after duplicate removal. We used all 20 categories as gold standard labels.

  • Reuters Corpus Volume 1 (RCV1-v2): The RCV1 collection contains 804,414 newswire stories made available by Reuters. RCV1-v2 is a corrected version of the original collection, in which documents were manually assigned to a hierarchy of 103 categories (Lewis et al., 2004). There are four top-level categories (under the hierarchical root), which we used as labels for evaluation.

System Settings

We developed an experimental classification system based on the Weka data mining framework (Witten and Frank, 2005). We implemented various document representation methods including the proposed term weighting schemes and TF*IDF based on a Weka vectorization filter. We relied on a KNN (k nearest neighbors) classifier for classification experiments (Aha et al., 1991). KNN is a classic classification method, which has demonstrated strong performances in previous research (Yang and Pedersen, 1997; Sebastiani, 2002).

We tokenized documents into single words, removed stop-words, and normalized terms using an iterated Lovins stemmer (Lovins, 1968). A number of top frequent words were selected as features (DF thresholding); 1,000 features were used in main experiments. We varied the number of features in experiments to study the influence of feature selection. All documents were normalized to unit vectors.

In KNN classification, we used the Euclidean distance and set the number of neighbors k to 25 according to previous research (Yang and Pedersen, 1997; Sebastiani, 2002). We conducted 30 runs of KNN for each experimental setting, in which classification was performed on a random sample of 2,000 documents. We used existing labels in each data collection as target categories.
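For a concrete picture of this setup, the following sketch approximates the pipeline with scikit-learn rather than Weka; it is an illustrative approximation only: the Lovins stemming step is omitted, a corpus-frequency cut-off stands in for DF thresholding, and the toy corpus, labels, and small k are placeholders for the real collections and k = 25.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import normalize

def g(p):
    """Informative entropy g(p) = p(1 - ln p) applied element-wise, with g(0) = 0."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    nz = p > 0
    out[nz] = p[nz] * (1.0 - np.log(p[nz]))
    return out

def lib_lif_matrix(counts):
    """LIB*LIF document-term matrix from a raw count matrix (documents x terms)."""
    counts = np.asarray(counts, dtype=float)
    N = counts.shape[0]
    lib = (counts > 0).astype(float) - g((counts > 0).sum(axis=0) / N)  # Equation 16
    doc_len = np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    lif = g(counts / doc_len) - g(counts.sum(axis=0) / counts.sum())    # Equation 19
    return (lib + 1.0) * (lif + 1.0)  # shift from [-1, 1] to [0, 2], then multiply

# Toy corpus and labels standing in for the WebKB/20News/RCV1 documents.
docs = ["professor research publications", "student homework assignment",
        "faculty teaching research", "student course project"]
labels = ["faculty", "student", "faculty", "student"]

vec = CountVectorizer(stop_words="english", max_features=1000)    # frequency cut-off in lieu of DF thresholding
X = normalize(lib_lif_matrix(vec.fit_transform(docs).toarray()))  # unit-length document vectors
knn = KNeighborsClassifier(n_neighbors=2, metric="euclidean")     # k = 25 in the actual experiments
knn.fit(X, labels)
print(knn.predict(X[:1]))
```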

Evaluation Metrics

Using categorical labels available in the data as the gold standard, we evaluated classification results based on several classic metrics, namely precision, recall, F1, ROC, and Kappa. These metrics are computed by viewing document classification as a series of decisions (Manning et al., 2008). The following table summarizes the numbers of correctly and incorrectly classified document pairs:


Table 1. Decision table of classification

                         In class      Not in class
Assigned to class        TP            FP
Not assigned to class    FN            TN

Precision P evaluates the fraction of identified documents that are relevant in each class. Recall R measures the fraction of documents relevant to the class that are identified. F1 is the harmonic mean of precision and recall. They can be computed by:

P = TP/(TP + FP)    (20)
R = TP/(TP + FN)    (21)
F1 = 2PR/(P + R)    (22)

ROC measures the area under the sensitivity-specificity curve. While sensitivity is equivalent to recall, specificity is given by TN/(FP + TN). ROC is an aggregate measure over the full retrieval/classification spectrum (Manning et al., 2008).

Let N denote the total number of documents, N = TP + FP + TN + FN. Kappa quantifies the agreement between the classifier and gold standard, and can be computed by:

\kappa = [P(A) - P(E)]/[1 - P(E)]    (23)

where P(A) = (TP + TN)/N is the proportion of times with agreement, and P(E) = [(FP + TN)/2N]² + [(FN + TN)/2N]² is the proportion of chance agreement.
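The following small helper makes these computations explicit from the counts in Table 1; precision, recall, and F1 follow Equations 20–22, while the chance-agreement term for kappa uses the standard Cohen formulation (an assumption about the intended expression), and the counts are illustrative:

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, specificity, and kappa from the counts in Table 1."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # also known as sensitivity
    specificity = tn / (fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    p_a = (tp + tn) / n                  # observed agreement P(A)
    # Chance agreement P(E): standard Cohen formulation (product of marginals).
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (p_a - p_e) / (1 - p_e)
    return {"P": precision, "R": recall, "F1": f1,
            "specificity": specificity, "kappa": kappa}

print(classification_metrics(tp=60, fp=15, fn=20, tn=105))
```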

Precision focuses on the internal accuracy within each class. Recall, on the other hand, addresses the effectiveness of having as many relevant document pairs as possible in one class. Whereas precision and recall emphasize the ability to find relevant answers/pairs (true positives), ROC and Kappa take into account the other quantities in the decision table. With these various evaluation metrics, we were able to examine strengths and weaknesses of the proposed methods from multiple perspectives.

RESULTS

We present results using KNN classification with 1,000 features in the following sections. Then we analyze the impact of feature selection on classification effectiveness, showing that the overall best results were obtained with 1,000 features in the WebKB, 20News, and RCV1 collections. In each set of experiments presented here, best scores in each metric are highlighted in bold.

WebKB 4 Universities Data

Table 2 shows classification results on the WebKB 4 Universities data set. The proposed LIB*LIF method outperformed TF*IDF in terms of every evaluation metric. LIB and LIB+LIF also achieved very competitive results, especially in terms of ROC and precision.

20 Newsgroups Data

Classification experiments on the 20 Newsgroups data, as shown in Table 3, produced a slightly different picture. While TF*IDF was better in terms of Kappa and recall, LIF achieved better precision and ROC. LIB+LIF and LIB*LIF performed competitively in terms of precision as well.

Table 2. WebKB Document Classification

Method     ROC    Kappa  P      R      F1
TF*IDF     0.917  0.639  0.742  0.750  0.731
LIB        0.916  0.611  0.749  0.743  0.717
LIF        0.883  0.514  0.667  0.671  0.648
LIB*TF     0.869  0.486  0.643  0.639  0.621
LIB+LIF    0.921  0.583  0.752  0.730  0.701
LIB*LIF    0.938  0.667  0.776  0.776  0.756

Table 3. 20News Document Classification

Method     ROC    Kappa  P      R      F1
TF*IDF     0.932  0.548  0.765  0.572  0.603
LIB        0.900  0.425  0.764  0.455  0.521
LIF        0.944  0.292  0.904  0.329  0.410
LIB*TF     0.942  0.497  0.734  0.523  0.575
LIB+LIF    0.900  0.423  0.778  0.453  0.521
LIB*LIF    0.900  0.431  0.770  0.460  0.526

RCV1 Reuters Corpus

Experiments on the RCV1 corpus showed a pattern quite similar to WebKB results. As shown in Table 4, LIB*LIF appeared to be the best method in terms of most metrics and achieved competitive ROC scores close to that of TF*IDF. Both LIB*LIF and LIB methods outperformed TF*IDF in precision, recall, F1, and Kappa.

Table 4. RCV1 Document Classification

Method     ROC    Kappa  P      R      F1
TF*IDF     0.973  0.639  0.846  0.747  0.754
LIB        0.965  0.729  0.846  0.824  0.814
LIF        0.924  0.519  0.788  0.700  0.637
LIB*TF     0.943  0.363  0.808  0.554  0.538
LIB+LIF    0.966  0.572  0.792  0.712  0.680
LIB*LIF    0.966  0.740  0.852  0.822  0.818

Impact of Feature Selection

Now we look at the impact of feature selection on the effectiveness of document representation for classification. In this work, we selected features based on their frequencies in the collection (DF thresholding), which was computationally simple and had been found in several studies to be a very effective feature selection technique for categorization-related tasks (Liu et al., 2003; Yang and Pedersen, 1997; Zhang et al., 2011). We varied the number of features Nf in each set of experiments, for which the top Nf most frequent terms were kept for document representation.

Figures 2, 3, and 4 show the influence of the number of selected features on classification effectiveness with the WebKB, 20Newsgroup, and RCV1 data respectively. Note that the X axis is logarithmic and the number of features Nf decreases from left to right.

In each figure, there generally exists an inflection point between Nf = 1000 and Nf = 100, where optimal classification results were achieved [3]. With a large feature space (e.g., 30,000 features for WebKB), unrelated documents were likely to be grouped together because of irrelevant common terms, leading to a large number of false positives (hence lower precision). Some degree of feature removal reduced the amount of noise in the feature space and improved classification effectiveness.

Figure 2.

WebKB: Impact of feature selection. X denotes # of features and is log-transformed. Y is the metric score.

Figure 3.

20News: Impact of feature selection. X denotes # of features and is log-transformed. Y is the metric score.

Previous research has observed that using various feature selection methods to eliminate up to 90% of term features resulted in either no loss or improvement of clustering and categorization accuracy (Liu et al., 2003; Yang and Pedersen, 1997). Further feature reduction from the inflection point degraded classification performance when there were insufficient features for accurate document representation. As shown in Figures 3 and 4, similar patterns about the influence of feature selection were found with the 20Newsgroup and RCV1 data.

Discussion

In the various experiments presented here, the proposed term weighting methods based on least information modeling performed very strongly compared to TF*IDF. In experiments on the three benchmark collections, top performance scores were mostly achieved by the proposed methods. LIB*LIF was overall the best method in the experiments and consistently outperformed TF*IDF. Both LIB*LIF and LIB were particularly competitive in terms of precision and ROC.

The LIB*LIF scheme is similar in spirit to TF*IDF. By modeling (binary) term occurrences in a document vs. in any random document from the collection, LIB integrates the document frequency (DF) component in the quantity. LIF, on the other hand, models term frequency/probability distributions and can be seen as a new approach to TF normalization.

Despite the similarity, our experiments showed that LIB*LIF, based on the new least information formulation, was more effective than TF*IDF for document representation in the text classification context. Least information modeling can also be applied to other important IR processes such as retrieval ranking, for which TF*IDF and its variations such as BM25 have produced strong empirical results.

Experiments showed that methods with the LIB quantity were more effective in terms of within-cluster accuracy (e.g., precision). By emphasizing the discriminative power (specificity) of a term, LIB reduces the weights of terms commonly shared by unrelated documents, leading to fewer of these documents being grouped together (fewer false positives and higher precision). LIF, on the other hand, helped to boost recall with the integration of term frequency. The different strengths of LIB and LIF indicate that they can be combined or used separately to serve various classification purposes.

An additional interesting finding in this study is the inflection point in the classification performance vs. # features plots. In various data and experimental settings, optimal classification performance was achieved with roughly 1000 features (selected by DF thresholding). Increasing or decreasing the number of features from the inflection point degraded classification effectiveness. While similar patterns were observed in text clustering and categorization research, further investigation and discussion are needed in order to understand factors related to this phenomenon.

Figure 4.

RCV1: Impact of feature selection. X denotes # of features and is log-transformed. Y is the metric score.

CONCLUSION

We presented the least information theory (LIT), which quantifies the meaning of information in probability distributions. We observed several important characteristics of the proposed information quantity, which provides new insight into modeling of related IR problems. Two basic quantities were derived from the theory for term weighting and document representation, which we used separately and combined in various term weighting methods for document classification.

Research was conducted to evaluate the effectiveness of the proposed methods compared to TF*IDF, which has been extensively used in text classification research. Experiments on three benchmark collections showed very strong performance of LIT-based term weighting schemes. In most experiments, the proposed LIB*LIF fusion method outperformed TF*IDF.

While we have demonstrated superior effectiveness of the proposed methods, the main contribution is not about improvement over TF*IDF. Of greater significance is the new approach to information measurement and term weighting based on the least information theory (LIT), which enables a different way of thinking and provides a new information-centric approach to modeling various information processes.

Footnotes

  [1] Inference probabilities are never perfectly independent of one another given the degrees of freedom. But to simplify the discussion and formulation, we adopt the independence assumption.

  [2] The term opposite does not indicate true vs. false information. Opposite information semantics can be seen, in a sense, as good news vs. bad news.

  [3] One exception is the 20Newsgroup data, on which reducing the feature space monotonically degrades classification precision.
