Enriching text representation with frequent pattern mining for probabilistic topic modeling

Abstract

Probabilistic topic models have proven very useful for many text mining tasks. Although many variants of topic models have been proposed, most existing works are based on the bag-of-words representation of text, in which word combination and order are generally ignored, resulting in inaccurate semantic representation of text. In this paper, we propose a general way to go beyond the bag-of-words representation for topic modeling: we apply frequent pattern mining to discover frequent word patterns that can capture semantic associations between words and then use them as additional supplementary semantic units to augment the conventional bag-of-words representation. By viewing a topic model as a generative model for such augmented text data, we can go beyond the bag-of-words assumption to potentially capture more semantic associations between words. Since efficient algorithms for mining frequent word patterns are available, this general strategy can be applied to improve any topic model without substantially increasing its computational complexity. Experiment results show that such a frequent pattern-based data enrichment approach improves over two representative existing probabilistic topic models on a classification task. We also studied variations of frequent pattern usage in topic modeling and found that using compressed and closed patterns performs best.

INTRODUCTION

These days, most documents and articles are recorded in digital text files. Since the Internet became an essential tool in our lives, the Web has become a precious source of information, and it keeps growing. However, the amount of data is huge, and much of it is unstructured and duplicated text. Therefore, the need for effective tools for analyzing and summarizing such data is increasing.

Probabilistic topic models have proven very useful for mining text data. A topic model is a generative probabilistic model which can be used to extract topics from text data in the form of word distributions. A word distribution can intuitively represent a topic by assigning high probabilities to words characterizing the topic; for example, a topic about battery life in reviews of the iPhone may have high probabilities for words such as "battery," "hour," and "life". The two basic representative topic modeling approaches are Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 1999) and Latent Dirichlet Allocation (LDA) (Blei et al., 2003).

Topic models are widely used in various areas of text mining including opinion analysis (Titov and McDonald, 2008b, Blei and Mcauliffe, 2007, Titov and McDonald, 2008a, Lin and He, 2009), text information retrieval (Wei and Croft, 2006), image retrieval (Hörster et al., 2007), natural language processing (Boyd-Graber et al., 2007), social network analysis (Liu et al., 2009), and so on. For example, the topic sentiment mixture model (Mei et al., 2007) used PLSA for modeling topics as well as opinion orientation, and the multi-grain topic model (Titov and McDonald, 2008b) used LDA for summarizing opinions.

Although many variants of topic models (Blei et al., 2004, Blei and Lafferty, 2006, Mimno et al., 2009) have been proposed, most existing works are based on the bag-of-words representation of text. The bag-of-words representation is based on a unigram language model and considers one word as a unit; word combination and order are generally ignored, resulting in inaccurate semantic representation of text. There are some attempts to go beyond the bag-of-words model (Wallach, 2006, Wang et al., 2007), but the proposed methods only work for a few specific models and would significantly increase model complexity and the number of latent variables to infer.

In this paper, we propose a frequent pattern-based data enrichment approach, a general way to break this limitation by enriching the input data using frequent pattern mining. Although there are many studies on topic modeling and on pattern mining, there has been no attempt to combine these two popular techniques; our work represents the first attempt to systematically study the benefit of combining them. Specifically, from the input data, we mine frequent patterns which can capture semantic associations between words and add pattern information to the input data as additional supplementary semantic units to augment the conventional bag-of-words representation.

By viewing a topic model as a generative model for such augmented text data, we can go beyond the bag-of-words assumption to potentially capture more semantic associations between words. Since efficient algorithms for mining frequent word patterns are available, this general strategy can be applied to improve any topic model without substantially increasing its computational complexity.

Since the general idea of using frequent patterns to obtain a more discriminative text representation is orthogonal to the choice of specific topic models, we chose to evaluate the proposed data enrichment technique using two representative baseline topic models, LDA and PLSA. Experiment results show that such an approach can improve over these two representative existing probabilistic topic models for text categorization on two representative news data sets. We also compared the performance of various types of frequent patterns, such as sequential patterns, ungapped patterns, closed patterns, and compressed patterns. The results show that using compressed frequent patterns performs best, outperforming strong baselines representing n-gram topic models.

PROBABILISTIC TOPIC MODELS

Basic Topic Models

A topic model is a generative model for documents: it specifies a probabilistic process by which documents can be generated from a set of topics, where a topic is represented by a multinomial distribution of words, i.e. a unigram language model. In the document generation process, one usually first chooses a distribution over topics. Then for each word in the document, one chooses a topic at random according to this distribution and draws a word from that topic.

More formally, assuming all the documents in the collection D cover a finite set of K topics, each topic j is associated with a multinomial word distribution φj over the vocabulary V. A document d is represented as a bag of words, and we use θd to denote the multinomial distribution over topics for document d. The parameters φ and θ indicate which words are important for which topic and which topics are important for a particular document, respectively. As background, we now briefly introduce the two representative basic topic models, PLSA and LDA.

PLSA

Probabilistic latent semantic analysis (PLSA) (Hofmann, 1999) regards a document d as a sample of the following mixture model.

p(w \mid d) = \sum_{j=1}^{K} p(z = j \mid d)\, p(w \mid z = j) = \sum_{j=1}^{K} \theta_{d,j}\, \phi_{j,w} \qquad (1)

Estimation: The word distributions φj and topic mixture distributions θd can be estimated with an Expectation-Maximization (EM) algorithm (Dempster et al., 1977) by maximizing the (log) likelihood that the collection D is generated by this model:

\log p(D) = \sum_{d \in D} \sum_{w \in V} c(w, d) \log \sum_{j=1}^{K} \theta_{d,j}\, \phi_{j,w} \qquad (2)

where c(w, d) is the number of times word w occurs in document d.
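
For concreteness, the following is a minimal sketch of the EM procedure for PLSA over a document-term count matrix. It is an illustration only, not the implementation used in our experiments, and the function and variable names are ours.

```python
import numpy as np

def plsa_em(counts, K, iterations=200, seed=0):
    """Minimal PLSA sketch. counts is a (D, V) matrix of c(w, d)."""
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    theta = rng.random((D, K)); theta /= theta.sum(axis=1, keepdims=True)  # p(z = j | d)
    phi = rng.random((K, V)); phi /= phi.sum(axis=1, keepdims=True)        # p(w | z = j)
    for _ in range(iterations):
        # E-step: posterior p(z | d, w) for every document-word pair, shape (D, V, K)
        joint = theta[:, None, :] * phi.T[None, :, :]
        posterior = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)
        # M-step: re-estimate theta and phi from the expected counts c(w, d) p(z | d, w)
        expected = counts[:, :, None] * posterior
        theta = expected.sum(axis=1)
        theta /= theta.sum(axis=1, keepdims=True) + 1e-12
        phi = expected.sum(axis=0).T
        phi /= phi.sum(axis=1, keepdims=True) + 1e-12
    return theta, phi
```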

LDA

The PLSA model does not make any assumptions about how the mixture weights θ are generated; thus there is no natural way to predict unseen documents, and the number of model parameters grows linearly with the number of training documents, making the model susceptible to overfitting. To address these limitations, Blei et al. (2003) proposed Latent Dirichlet Allocation (LDA). The basic generative process of LDA is very similar to that of PLSA. The major difference is that the topic mixture θ in LDA is drawn from a conjugate Dirichlet prior rather than conditioned on each document as in PLSA. The parameters of this Dirichlet distribution are specified by α1,…,αK. It is mathematically convenient to use a symmetric Dirichlet distribution with a single hyperparameter α such that α1 = α2 = … = αK = α. The probability density is given by:

p(\theta \mid \alpha) = \frac{\Gamma(K\alpha)}{\Gamma(\alpha)^{K}} \prod_{j=1}^{K} \theta_j^{\alpha - 1} \qquad (3)

where the Γ functions are normalization factors that make sure the probabilities sum up to 1. By placing a Dirichlet prior on the topic distribution θ, LDA achieves a smoothed topic distribution, with the amount of smoothing determined by the α parameter. Similarly, a symmetric Dirichlet prior with hyperparameter β can be placed on φ as well. Given the parameters α and β, the probability of the collection of documents D can be calculated by integrating over θ and φ:

p(D \mid \alpha, \beta) = \int p(\phi \mid \beta) \prod_{d \in D} \int p(\theta_d \mid \alpha) \prod_{w \in V} \Big( \sum_{j=1}^{K} \theta_{d,j}\, \phi_{j,w} \Big)^{c(w,d)} d\theta_d \, d\phi

Estimation: The LDA model cannot be solved by exact inference, but there are quite a few approximate inference techniques proposed in the literature: variational methods (Blei et al., 2003), expectation propagation (Minka and Lafferty, 2002, Griffiths and Steyvers, 2004), and Gibbs sampling (Geman and Geman, 1984, Griffiths and Steyvers, 2004). A recent work (Teh and Görür, 2009) compares perplexities when using different estimation algorithms for LDA; their conclusion is that the perplexity performance differences among algorithms diminish significantly when hyperparameters are optimized.
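
As an illustration of one of these estimation techniques, the sketch below implements a simplified collapsed Gibbs sampler for LDA over tokenized documents. It is not the GibbsLDA++ implementation used in our experiments, and the hyperparameter values are only illustrative defaults.

```python
import numpy as np

def lda_gibbs(docs, V, K, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """docs: list of lists of word ids in [0, V). Returns theta (D, K) and phi (K, V)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K)); nkw = np.zeros((K, V)); nk = np.zeros(K)
    z = [rng.integers(K, size=len(doc)) for doc in docs]       # random topic assignments
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1     # remove current assignment
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())               # resample the topic
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    theta = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + K * alpha)
    phi = (nkw + beta) / (nkw.sum(axis=1, keepdims=True) + V * beta)
    return theta, phi
```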

Other Topic Models and Limitations

Because of their usefulness, many attempts have been made to improve on the two representative topic models, LDA (Blei et al., 2003) and PLSA (Hofmann, 1999). The hierarchical topic model (Blei et al., 2004) supports multi-level topic analysis, the correlated topic model (Blei and Lafferty, 2006) considers correlations between topics in modeling, and the polylingual topic model (Mimno et al., 2009) uses alignments across different languages for topic modeling.

In almost all existing works on topic models, the bag-of-words representation is assumed. Although bag-of-words models have shown their usefulness to some extent, they have limitations in modeling text accurately. First, bag-of-words models use a single word as a unit, so they cannot capture the semantic meaning coming from combinations of words. Single words are sometimes ambiguous and should be interpreted differently depending on the words they are combined with. Also, multiple-word terms such as compound nouns and proper nouns often become meaningless when separated. Second, the order of words may affect meaning. Bag-of-words models only capture the co-occurrence of words and do not consider their order, yet even with the same word set, ordering can change the meaning. For example, 'traveling salesman problem' is a famous problem in combinatorial optimization. If we separate the three words or ignore their order, the corresponding modeled topic may also include texts about the business travel of salesmen.

There have been some attempts to overcome the limitations of the bag-of-words model. The bigram LDA model (Wallach, 2006) uses bigrams as units, and the topical n-gram model (Wang et al., 2007) models phrases within a topical context. Although these models have high modeling power and solve some problems of bag-of-words models, they have three limitations.

First, gaps between words need to be modeled when finding multi-word patterns. Writing habits vary across writers and influence how the same idea is phrased, so simple n-gram models may miss patterns that contain modifiers. For example, an n-gram model that does not allow gaps would not capture the high similarity between 'LCD screen is very bright' and 'LCD screen is big and bright', since they have different words between 'LCD screen is' and 'bright'.

Second, because most existing improved topic models modify the model structure, the proposed techniques tend to be applicable only to the specific models that were modified. Moreover, as more powerful models are developed, the models also become more complicated and the number of latent variables to infer increases.

Third, these models are computationally expensive. If there are V words, there can be V^2 bigrams and V^n n-grams. Using all combinations of non-unigram units is expensive and inefficient because many of them are rare and uninformative.

To solve these problems, in this paper, we propose a new and general way to improve topic models using frequent pattern mining.

FREQUENT PATTERN-BASED DATA ENRICHMENT

While various strategies have been proposed for improving topic modeling, there has been no attempt to use pattern mining. In this paper, we propose a frequent pattern-based data enrichment approach, a new and general way to improve topic models. Our basic idea is to mine frequent patterns from the input data and expand the input data with the mined pattern data. When we add pattern data, each pattern is added as a single unit. That is, instead of replacing words of the original input data with patterns, we add patterns as supplementary data so as to retain the benefits of bag-of-words models. By adding this pattern information, our model has more discriminative units.

Formally, let D = {d1, …, dn} be a collection of documents. Let θ be all the parameters of the topic model. The likelihood function is generally in the form of

p(D \mid \theta) = \prod_{d \in D} \prod_{w \in V} p(w \mid d, \theta)^{c(w, d)} \qquad (4)

The idea of data enrichment is to augment word units with additional patterns. Formally, let T = {t1, …, tm} be the set of pattern terms mined from the entire collection D, and let T(d) ⊆ T be the subset of patterns occurring in d. The likelihood function for the topic model using data enrichment is given by:

p(D \mid \theta) = \prod_{d \in D} \Big( \prod_{w \in V} p(w \mid d, \theta)^{c(w, d)} \prod_{t \in T(d)} p(t \mid d, \theta)^{c(t, d)} \Big)

where c(t, d) is the number of times pattern t occurs in document d.

Thus any topic model can be used, and the computational complexity remains about the same. Clearly, the main challenge is to determine what kind of patterns to use; we discuss several possibilities later.

Benefits

Using frequent patterns can solve many of the problems mentioned in the previous section. First, using patterns is a good way to express multi-word relationships. A pattern is a set or sequence of words; by considering several words as one unit, we can make co-occurrence statistics better reflect true semantic associations. Because frequent patterns allow gaps between words, we can capture more meaningful phrases, and by using sequential patterns, we can even take word order into account. Because patterns can capture word combinations and their order, they serve well as discriminative units in ambiguous situations.

Second, we can model word patterns without increasing the complexity of models, by adding pattern information directly to the input data. Also, by modifying the input data set instead of the model itself, our strategy can be applied to any topic model. Moreover, because we use only 'frequent' patterns, the enhanced model does not waste time on rare, insignificant phrase patterns.

Pattern Types

Table 1. Example data set to show the definition of frequent patterns
Text Id   Words in text
1         battery life should be long
2         long battery life
3         very long battery life
4         good design
5         price is high

There are many kinds of frequent patterns that we can potentially use in frequent pattern-based data enrichment. Because each pattern mining approach has its own characteristics, each offers different potential benefits for topic modeling.

Frequent/Frequent Sequential Pattern

A pattern is a set or sequence of items that co-occur in a context. In our problem setup, items are words, and contexts are texts. Patterns can be classified into two types by order-awareness: a simple pattern, which captures co-occurrence regardless of order, is defined as a set of words, while a sequential pattern, which captures co-occurrence in a particular order, is defined as a sequence of words. Because the distance between words is not considered in pattern mining, gaps between words are allowed in patterns. The support of a pattern is its frequency, i.e., the number of occurrences of the pattern in a data set. When a pattern occurs in a data set at least as often as a minimum support, we call it a frequent pattern. If a pattern is frequent and respects word order, we call it a frequent sequential pattern.

For example, consider the text data set in Table 1. We can find the pattern {battery life long} with support 3 (texts 1, 2, and 3). Taking order into account, we can find sequential patterns such as 'long battery life' with support 2 (texts 2 and 3) and 'battery life' with support 3 (texts 1, 2, and 3). If we set the minimum support to 50%, patterns with support of at least 2.5 (50% of the 5 texts) are selected as frequent patterns ({battery life long}) or frequent sequential patterns ('battery life').
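
The supports in this example can be verified with the following small sketch (illustrative code only; the actual experiments use the dedicated pattern mining algorithms described later):

```python
texts = ["battery life should be long", "long battery life",
         "very long battery life", "good design", "price is high"]

def set_support(pattern_words, texts):
    """Support of an unordered pattern: number of texts containing all its words."""
    return sum(set(pattern_words) <= set(t.split()) for t in texts)

def seq_support(pattern_words, texts):
    """Support of a sequential pattern: words appear in order, gaps allowed."""
    def contains_in_order(tokens, words):
        it = iter(tokens)
        return all(w in it for w in words)   # greedy in-order scan over the token iterator
    return sum(contains_in_order(t.split(), pattern_words) for t in texts)

print(set_support(["battery", "life", "long"], texts))   # 3 (texts 1, 2, and 3)
print(seq_support(["long", "battery", "life"], texts))   # 2 (texts 2 and 3)
print(seq_support(["battery", "life"], texts))           # 3 (texts 1, 2, and 3)
```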

The frequent pattern is the most flexible pattern type. Frequent pattern mining allows gaps between words and ignores word order. With a flexible threshold, frequent pattern mining can generate a variable number of patterns as needed. Using many patterns can improve the discriminative power of models; however, because some patterns can be noisy, using too many patterns may decrease model performance. Frequent sequential patterns take word order into account. Frequent sequential pattern mining can therefore give us more discriminative, meaningful phrases because it can distinguish phrases whose meanings change under different word orders.

Pattern Compression

To remove redundant patterns, pattern compression methods such as closed patterns (Pasquier et al., 1999) and compressed patterns (Xin et al., 2005) are used. When we mine patterns from data, one pattern can be a subpattern of another; we call the longer pattern that includes the other its superpattern. If we use both subpatterns and superpatterns, we may overweight information repeated in both patterns. Pattern compression can filter out redundant patterns and let us use only the important ones.

A closed pattern is a pattern none of whose superpatterns has the same support. For example, the patterns AB: 3, AC: 2, AD: 3, BC: 2, BD: 3, ABC: 2, ABD: 3 can be compressed into the closed patterns ABC: 2 and ABD: 3 (format 'pattern: support'; assume each letter is a word and the number is the support of the pattern). We can regard this method as lossless compression because we can recover all the removed patterns with their exact original supports. Closed patterns effectively filter out the redundant subpatterns that are present if and only if the superpattern is present. Such redundant patterns may harm any task using pattern mining because they duplicate the same information, and the number of duplicates can be exponential in the length of the closed pattern.

Compressed patterns provide a more sophisticated strategy for pruning overlapping patterns. Compressed pattern mining is similar to closed pattern mining, except that it also removes subpatterns whose supports are within a factor δ of a superpattern's support. That is, the larger δ is, the more patterns are compressed; when δ = 0, it works in the same way as closed pattern mining. This method can be regarded as lossy compression because we lose a small number of patterns whose support information may be meaningful, and we cannot recover all filtered patterns from the compressed patterns. Nevertheless, compressed patterns have an advantage over closed patterns in compactness by reducing redundancy more aggressively. In our experiment, to show the effect of pattern compression clearly, we used δ = 0.3, the largest value used in the original compressed pattern mining paper (Xin et al., 2005).
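
The two pruning rules can be illustrated with the small sketch below, which operates on a dictionary mapping patterns (as word sets) to supports. It is a simplification: the published closed and compressed pattern mining algorithms prune during mining and, in the compressed case, use a transaction-based distance measure, whereas here δ is applied to relative support differences as described above.

```python
def is_subpattern(p, q):
    """True if pattern p (a frozenset of words) is a proper subpattern of q."""
    return p < q

def closed_patterns(patterns):
    """Keep patterns that have no superpattern with exactly the same support."""
    return {p: s for p, s in patterns.items()
            if not any(is_subpattern(p, q) and patterns[q] == s for q in patterns)}

def compressed_patterns(patterns, delta=0.3):
    """Simplified delta-based pruning: drop a subpattern whose support differs from
    some superpattern's support by at most delta times its own support."""
    return {p: s for p, s in patterns.items()
            if not any(is_subpattern(p, q) and (s - patterns[q]) <= delta * s
                       for q in patterns)}

pats = {frozenset("AB"): 3, frozenset("AC"): 2, frozenset("AD"): 3,
        frozenset("BC"): 2, frozenset("BD"): 3,
        frozenset("ABC"): 2, frozenset("ABD"): 3}
print(closed_patterns(pats))        # keeps ABC: 2 and ABD: 3, matching the example above
print(compressed_patterns(pats))    # with delta = 0.3, the same two patterns remain here
```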

Ungapped Pattern Mining

Unlike all the patterns above, ungapped frequent sequential patterns are variable-length n-gram patterns that do not allow gaps between items; the mined patterns are frequent n-grams. Ungapped sequential pattern-based topic models can therefore be regarded as simulations of bigram and n-gram topic models, which lets us indirectly compare the performance of our frequent pattern-based data enrichment method with existing bigram and n-gram topic models.
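
A minimal sketch of ungapped pattern mining over tokenized sentences is shown below (illustrative code only; in our experiments we instead modified PrefixSpan, as described later):

```python
from collections import Counter

def frequent_ngrams(sentences, min_support, max_len=10):
    """Count ungapped word n-grams (length >= 2) over tokenized sentences and keep
    those that occur in at least min_support sentences (sentence-level support)."""
    counts = Counter()
    for tokens in sentences:
        seen = set()
        for n in range(2, min(max_len, len(tokens)) + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(tuple(tokens[i:i + n]))
        counts.update(seen)            # count each n-gram at most once per sentence
    return {g: c for g, c in counts.items() if c >= min_support}

sents = [["long", "battery", "life"], ["very", "long", "battery", "life"]]
print(frequent_ngrams(sents, min_support=2))
# ('long', 'battery'), ('battery', 'life'), and ('long', 'battery', 'life') each have support 2
```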

EXPERIMENT DESIGN

One of the most frequently used evaluation methods in topic modeling is measuring perplexity. However, in our setup, because our method adds additional units to the data set, the test data set would be different. Therefore, directly comparing the perplexity of topic models on the original data set with that on the pattern-enriched data set is not fair.

In this paper, to show the usefulness of our method directly, we evaluate with an application of topic models. In our experiment, we perform a classification task using topic models and measure the error rate of the task. For generality, we use two standard data sets and two representative topic modeling methods, LDA and PLSA, as base topic models to which patterns are applied. To understand the frequent pattern-based data enrichment technique better, we use various pattern types and compare their performance.

Data set

Table 2. Statistics of TDT2 and Reuters Corpora
Statistics                  TDT2     Reuters
Number of Documents Used    7456     8246
Number of Unique Words      47047    26269
Number of Categories Used   10       10
Average Document Length     401      117
Maximum Category Size       1844     3945
Minimum Category Size       167      116
Mean Category Size          746      825
Figure 1. Frequent pattern-based data enrichment method

We use two standard data sets: the TDT2¹ and the Reuters-21578² (or simply, Reuters) document corpora. These corpora contain news articles, one of the most representative types of digital documents, covering a wide variety of topics. Every document in the corpora has been manually assigned one or more labels indicating which topic(s) it belongs to. The TDT2 corpus consists of 11201 on-topic documents classified into 96 semantic categories. In this experiment, since we want to study applications of topic models in the standard classification setting, where each document belongs to exactly one class, documents appearing in two or more categories are removed. We further select the largest 10 categories so as to eliminate outlier categories (which may cause high variances in categorization performance), leaving us with 7456 documents in total. The Reuters corpus contains 21578 documents grouped into 135 categories. It is much more unbalanced than TDT2, with some large categories more than 30 times larger than some small ones. In our experiments, we discarded documents with multiple category labels and selected only the largest 10 categories, which left us with 8246 documents in total. Table 2 provides some statistics of the two document corpora.

Method

An overview of the frequent pattern-based data enrichment method is shown in Figure 1. The process can be divided into four steps: preprocessing, pattern mining, pattern addition, and topic modeling.

Preprocessing

The first step is preprocessing the input data. Each document in the input data set is cleaned by removing special characters. Each document is segmented into sentences, and each sentence is tokenized into words. We used the Porter stemmer for stemming and changed all characters to lower case.
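
A minimal preprocessing sketch along these lines is shown below; it assumes NLTK's Porter stemmer is available, and the exact cleaning and sentence segmentation rules of our pipeline may differ.

```python
import re
from nltk.stem import PorterStemmer   # assumes NLTK is installed

stemmer = PorterStemmer()

def preprocess(document):
    """Split a raw document into sentences of cleaned, stemmed, lower-cased tokens."""
    result = []
    for sent in re.split(r"[.!?]+", document):          # simple sentence segmentation
        sent = re.sub(r"[^A-Za-z0-9\s]", " ", sent)     # remove special characters
        tokens = [stemmer.stem(w.lower()) for w in sent.split()]
        if tokens:
            result.append(tokens)
    return result

print(preprocess("Battery life should be long. Long battery life!"))
# [['batteri', 'life', 'should', 'be', 'long'], ['long', 'batteri', 'life']]
```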

Pattern Mining

The next step is mining patterns from the input data set. Each word and each sentence become an item and a transaction, respectively, in the frequent pattern mining framework. Within one data set, we scan all the sentences and find frequent word patterns that occur in at least the minimum support number of sentences. That is, the maximum possible support is the number of sentences, and the maximum pattern length is the number of words in the longest sentence. For example, if there are 7000 sentences and the minimum support is 1%, word patterns occurring in at least 70 sentences are extracted as frequent patterns. Here, document segmentation is ignored; a more sophisticated pattern mining strategy that considers document boundaries is left as future work.
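
The following short sketch shows how the sentence transactions and the absolute support threshold can be prepared as input for a pattern miner (the miners themselves are listed in the next paragraph; the function names here are illustrative):

```python
def build_transactions(preprocessed_docs):
    """Flatten all documents into one list of sentence transactions;
    document boundaries are ignored, as described above."""
    return [sentence for doc in preprocessed_docs for sentence in doc]

def absolute_min_support(transactions, min_support_percent):
    """Convert a percentage threshold into an absolute number of sentences."""
    return max(1, int(len(transactions) * min_support_percent / 100.0))

# Example: with 7000 sentences and a 1% threshold, a pattern must occur in >= 70 sentences.
transactions = build_transactions([[["batteri", "life", "long"]]] * 7000)
print(absolute_min_support(transactions, 1.0))   # 70
```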

We tried various types of patterns. We used FPGrowth (Han et al., 2000) for frequent pattern mining, PrefixSpan (Pei et al., 2001) for frequent sequential pattern mining, CloSpan (Yan et al., 2003) for closed frequent sequential pattern mining, and RPlocal (Xin et al., 2005) for compressed pattern mining. We used the implementations from the IlliMine toolkit³ and modified them for the other pattern types. For example, for ungapped sequential pattern mining, we modified PrefixSpan not to allow gaps.

Pattern Addition

After pattern mining, the next step is adding the mined patterns to the preprocessed input data set. We assign each mined pattern a unique pattern ID, for example, freqABCEF, and add the pattern ID to each document as many times as the pattern occurs in the document. Patterns then behave like other plain words, so the topic model handles each pattern as one word. The modified input data can be used with any general topic modeling method.
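
A minimal sketch of this pattern-addition step is given below; the pattern ID format and the occurrence counting (unordered, at the sentence level) are illustrative simplifications.

```python
def pattern_id(pattern):
    """Build a unique token for a mined pattern, e.g. ('batteri', 'life') -> 'freq_batteri_life'."""
    return "freq_" + "_".join(pattern)

def count_occurrences(pattern, doc_sentences):
    """Count how many sentences of the document contain all the words of the pattern."""
    return sum(set(pattern) <= set(s) for s in doc_sentences)

def enrich_document(doc_sentences, patterns):
    """Return the document as a flat bag of words plus one pattern ID per occurrence."""
    tokens = [w for s in doc_sentences for w in s]
    for p in patterns:
        tokens += [pattern_id(p)] * count_occurrences(p, doc_sentences)
    return tokens

doc = [["long", "batteri", "life"], ["price", "is", "high"]]
print(enrich_document(doc, [("batteri", "life")]))
# ['long', 'batteri', 'life', 'price', 'is', 'high', 'freq_batteri_life']
```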

Topic Modeling

In our experiment, we performed PLSA and LDA topic modeling on the data modified by each pattern type. Topic model estimation finds a local optimum from its starting point, so modeling results may vary depending on the initialization, and some initial settings may lead to better local maxima than others. It is therefore unfair to compare the performance of topic models based on a single run. To compare the best performance fairly, for each experiment we executed three model estimation trials and used the modeling result with the highest likelihood P(Data|Model). We used GibbsLDA++⁴ (Phan et al., 2008, Bíró et al., 2008) for LDA and our own implementation based on the Lemur toolkit⁵ for PLSA.

Evaluation Task and Evaluation Metric

To evaluate the proposed method, we measured performance on a classification task. As an alternative to using all the words as features, we use a topic-based representation with each P(z = j) as an element in the feature vector; we thus have K features, corresponding to the K topics. More specifically, topic models are applied to all the documents without reference to their true class labels. The dimensionality of the feature space is thereby reduced to the number of topics, and the new feature set becomes θ(d) = P(z | d). Using the SVMmulticlass toolkit⁶, we train and test the classification model with these features and evaluate the error rate of multi-class classification. A lower error rate indicates a better feature representation.
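
For illustration, the sketch below builds the topic-feature classifier and computes the error rate, using scikit-learn's linear SVM in place of the SVMmulticlass toolkit (an assumption made only for this example; the train/test split is also illustrative).

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def classification_error(theta, labels, seed=0):
    """theta: (D, K) topic proportions from the topic model; labels: (D,) class ids.
    Trains a linear multi-class SVM on the topic features and returns the error rate."""
    X_train, X_test, y_train, y_test = train_test_split(
        theta, labels, test_size=0.3, random_state=seed, stratify=labels)
    clf = LinearSVC().fit(X_train, y_train)
    return 1.0 - accuracy_score(y_test, clf.predict(X_test))

# Usage (illustrative): theta comes from a PLSA or LDA run on the enriched documents.
# error_rate = classification_error(theta, labels)
```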

Parameter Setting

Pattern mining

In pattern mining, there are two additional parameters we can tune: minimum support and pattern length. Minimum support determines the threshold of frequentness; that is, frequent patterns are those with at least the minimum support. For example, with 10% minimum support, a pattern should occur at least 10 times out of 100 cases. The pattern length parameter limits the length of the patterns to use; here, we used a maximum pattern length. For example, with maximum pattern length 4, patterns of length 2, 3, and 4 are used, and longer patterns are discarded.

With a low minimum support and a large maximum pattern length, more patterns will be used in frequent pattern-based data enrichment. More patterns may help improve performance by providing more discriminative information; however, they may also introduce noise or redundancy into the data. Lowering the minimum support may select non-discriminative patterns as frequent patterns. Although long patterns are usually more discriminative because they are rare, very long patterns may come from noisy data with duplicated sentences.

In our experiment, to examine these trade-offs, we tried various minimum supports (0.01%, 0.1%, 1%, and 3%) and maximum pattern lengths (2, 4, 8, and 10). These ranges are based on the observation that no pattern had more than 5% support or was longer than 10 words. When we vary the minimum support, we set the maximum pattern length to 2 or 10, and when we vary the maximum pattern length, we set the minimum support to 0.01%.

Depending on how parameters are set, we report two performance measurements in the experiment results. Best optimization results are the best results among the various parameter settings, representing the upper-bound performance. In cross data optimization results, we use the parameters tuned on the other data set; this setting simulates the practical situation in which we have training data for parameter tuning and apply the tuned model to the test data.

Topic Modeling

In topic modeling, there are parameters we can tune such as the number of topics and the number of iterations. We tuned these parameters for the best performance of the general unigram topic models, LDA and PLSA, which are our baselines, on the two data sets. With the existing methods at their optimal settings, we can clearly claim performance improvement of our method over the existing techniques. In experiments on LDA with different numbers of topics, classification performance increased with more topics and converged beyond 50 topics. In the case of PLSA, classification performance increased up to 50 topics and decreased with more than 50 topics. In experiments with different numbers of iterations, both modeling techniques showed very similar performance beyond 200 iterations. Therefore, we used 50 topics and 200 iterations as the default setting.

EXPERIMENT RESULTS

Basic comparison

We first look into the question whether the proposed method of enriching data representation with frequent patterns is effective. To this end, we compare the pattern-based representation with the baseline unigram representation, which is the common practice when applying topic models. Table 3 shows the error rate of the proposed method with various types of patterns and baseline in both best optimized parameter setting and cross data optimized parameter setting. Below each performance result, we show the performance improvement from baselines. In cross data optimization, we use one data set to tune the parameters and use the other data set to test the performance of the parameter setting.

The results show that frequent pattern-based data enrichment outperforms the baseline models (no use of patterns) in almost all the cases. In particular, our data enrichment method consistently outperforms LDA, which is the stronger baseline (LDA is consistently better than PLSA). These results clearly show that using patterns actually helps to improve topic modeling performance in classification. With positive results on cross data optimization, we can also conclude that our frequent pattern-based data enrichment is robust, and it is possible to set parameters based on training data that is very different from test data.

Among all 48 cases, there are only three in which our method showed a performance decrease, all of them when it is applied to PLSA with cross-data optimization. Further examination indicates that the corresponding results using "best optimization" (upper bound comparison) are all positive with clear improvement. We can thus conclude that the occasional degradation of performance for PLSA is likely because it is difficult to tune parameters with PLSA. Indeed, it is known that, compared with LDA, PLSA tends to overfit the data, which may have led to the difficulty in setting parameters with cross-data optimization.

Among the pattern types we tried, data enrichment with sequential patterns showed the best average performance improvement in best optimization, and compressed frequent pattern-based methods showed very high performance in both best optimization and cross data optimization consistently.

In the following subsection, we will further examine performance differences in using different types of patterns.

Effect of Pattern Compression

In our experiment, two pattern compression methods are used: closed patterns and compressed patterns. Comparing the two, compressed pattern approaches showed better performance than closed pattern approaches.

Table 3. Classification error rate of the frequent pattern-based data enrichment and improvement rate from the baseline result. (Unit: %, FreqPat: frequent pattern, SeqPat: frequent sequential pattern, Closed: closed pattern, Comp: compressed pattern.)

Moreover, according to the experiment results, pattern compression methods help to increase stability in parameter setting. In Table 3, compared with data enrichment using plain patterns, data enrichment using compressed patterns showed more stable performance in cross data optimization. While frequent patterns and frequent sequential patterns showed rather large average performance drops between best optimization and cross data optimization (FreqPat: −18.28, SeqPat: −9.6), closed patterns showed a smaller performance drop (ClosedFreqPat: −7.23, ClosedSeqPat: −12.33), and compressed patterns showed an even smaller drop (CompFreqPat: −6.20, CompSeqPat: −7.35).

Simple frequent pattern and frequent sequential pattern mining may generate redundant and noisy patterns, and these noisy patterns may cause performance to decrease. Pattern compression methods can remove redundant subpatterns effectively. Therefore, even if parameters are not optimized, compressed pattern-based topic models are less affected by noise.

Comparing the closed pattern and compressed pattern strategies, the compressed pattern model showed better stability as well as better performance. This is because compressed pattern mining compresses subpatterns more aggressively than closed pattern mining. That is, while closed pattern mining removes only the subpatterns that have the same support as a superpattern, compressed pattern mining also removes subpatterns within the threshold even if their supports are not identical to their superpatterns' supports. Therefore, fewer compressed patterns are generated than closed patterns, resulting in a more discriminative and less redundant text representation.

Effect of Pattern Sequence

Patterns can be categorized into two types, one considering sequence of words and the other not considering sequence. Table 4 shows performance comparison between frequent pattern and frequent sequential pattern with two pattern compression strategies.

According to the experiment results, with no pattern compression, sequence-based patterns performed better than non-sequence-based patterns. However, with closed patterns, performance improved in half of the cases and dropped in the other half. With compressed patterns, using sequence did not help to improve performance in most cases.

A sequence-aware strategy has both advantages and disadvantages in text pattern mining. When word sequences have different meanings under different orders, sequential patterns have an advantage in distinguishing them. However, considering sequence can also separate patterns that have the same meaning. For example, assume there are two sentences, 'Battery life is long' and 'I need long battery life'. The two patterns mined from these sentences, 'battery long' and 'long battery', have the same meaning but are considered different sequential patterns. Moreover, sequential pattern mining may yield fewer patterns than frequent pattern mining: the sequential patterns 'battery long' and 'long battery' may each be discarded because they occur less often than the minimum support, whereas in frequent pattern mining the pattern {battery, long} may be selected when the sum of the supports of 'battery long' and 'long battery' exceeds the minimum support. Similarly, sequential pattern mining may yield fewer long patterns.

As explained in the previous subsection, with no pattern compression there can be noisy patterns, so finding more accurate patterns is important, and the advantage of using sequence outweighs its disadvantage. With pattern compression, however, redundant patterns are filtered out, so finding more patterns becomes more useful. The effect of sequence was more harmful for compressed patterns because compressed pattern mining has a more sophisticated strategy for removing redundant patterns.

Table 4. Classification error rate difference between sequential and non-sequential pattern with two pattern compression strategies. (Unit: %, FreqPat: frequent pattern, SeqPat: frequent sequential pattern, Closed: closed pattern, Comp: compressed pattern.)
Table 5. Count of best performance runs depending on different maximum pattern length
Max Pattern Length            10    8    4    2
Counts in Best Performance     5    2    2   15

Parameter Setting

To better understand the impact of parameter settings, we counted the number of best runs for each parameter setting. We fixed all other parameters and varied the observed parameter for the target model, then recorded the value of the observed parameter at which the target model showed the best performance. In this way, we can show how many times each parameter setting achieved the best performance.

We considered frequent patterns, frequent sequential patterns, closed frequent patterns, closed frequent sequential patterns, compressed frequent patterns, and compressed frequent sequential patterns. For each pattern variation, we tested with two data sets and two models. Therefore, in total, 24 runs were used for counting the best pattern length, and 48 runs were used for counting the best minimum support because we used two maximum pattern lengths.

Table 5 shows counts of best runs with different maximum pattern lengths. According to the experiment results, using short patterns (maximum pattern length = 2) is usually enough to obtain the best performance. Note that since the pattern mining allows gaps, it can cover more than bigrams.

Using all the long patterns (maximum pattern length = 10) also showed several best performing runs. Longer patterns are usually rarer; thus, they could be discriminative if they are frequent.

Table 6 shows counts of best runs with different minimum support. According to the experiment results, in general, using 0.01% or 0.1% minimum support performed best. Too low support (< 0.01%) may introduce noisy patterns, and too high support (1%, 3%) discards even useful patterns.

Table 6. Count of best performance runs depending on different minimum support
Min Support (%)     0.01    0.1    1    3
MaxPatLen = 10         9     11    3    1
MaxPatLen = 2         15      4    5    0
Sum                   24     15    8    1
Table 7. Classification error rate comparison between compressed frequent pattern-based data enrichment and ungapped bigram/n-gram pattern-based data enrichment. (Unit: %, FreqPat: frequent pattern, SeqPat: frequent sequential pattern, Closed: closed pattern, Comp: compressed pattern.)

In more detail, with maximum pattern length = 10, a 0.1% minimum support performed best, and with maximum pattern length = 2, a 0.01% minimum support performed best. This shows that it is better to use a slightly higher minimum support when longer patterns are used.

Comparison with Bigram/N-gram Models

Here, we compare compressed frequent pattern-based data enrichment with state-of-the-art bigram and n-gram methods. The variants of the proposed method that use ungapped sequential frequent patterns essentially represent the current way of using n-grams in a topic model. To simulate a bigram model, we simply limit the maximum pattern length to two.

Table 7 shows performance comparison between the topic model using compressed frequent pattern-based data enrichment and bigram/n-gram topic models. According to the experiment results, compressed frequent pattern-based data enrichment consistently outperformed the state-of-the-art bigram/n-gram topic models in both best optimization and cross data optimization. This suggests that the special characteristics of compressed pattern-based data enrichment such as allowing gaps, careful pattern selection by compression, and flexible order, which are not available in n-gram models, are indeed beneficial.

Example Output

Table 8 shows some example modeling output, where the data was modeled by 10 topics, and top probability word/patterns in 3 topics are presented as examples in the table. We can observe meaningful patterns such as nuclear weapon, white hous(e), monica lewinski, kong hong (hong kong), hussein saddam (saddam hussein), and so on. These phrases can be a good summary or label of each topic. In addition to their improvement of topic modeling performance as we discussed earlier, these patterns also help us to understand topics easily.

Summary

Frequent pattern-based data enrichment showed performance improvement over unigram topic models along with robustness to parameter settings. Among the pattern variations used in data enrichment, compressed frequent patterns showed robust performance across parameter settings and consistently outperformed the unigram model and bigram/n-gram models. Frequent pattern-based data enrichment is a general, efficient, and effective method for going beyond the bag-of-words representation.

Table 8. Example topic modeling result. Top 30 words/patterns for 10 topic modeling results. Patterns in brackets. Used TDT data set and LDA. Used patterns with max pattern length=10, minimum support=0.01%.
Topic 1             Topic 2              Topic 3
nuclear             presid               iraq
india               clinton              u
test                lewinski             n
pakistan            hous                 (n u)
said                white                weapon
say                 (white hous)         iraqi
cuba                starr                inspector
pope                said                 said
state               lawyer               baghdad
countri             investig             council
visit               mr                   inspect
(test nuclear)      ms                   annan
cuban               (clinton presid)     secur
world               monica               site
minist              grand                unit
weapon              juri                 (council secur)
nation              jone                 (nation unit)
indian              case                 nation
(pakistan india)    say                  sanction
castro              offic                diplomat
(india nuclear)     counsel              agreement
unit                former               gener
john                independ             (n iraq u)
power               (juri grand)         butler
(nuclear weapon)    sexual               team
(test india)        (monica lewinski)    saddam
sanction            question             (inspector n u)
new                 report               (gener secretari)
two                 relationship         say
govern              washington           (inspector weapon)

RELATED WORK

Previous topic modeling techniques are already introduced in the second section. In this section, we will discuss more about existing frequent pattern mining algorithms.

Frequent pattern mining is one of the most popular topics in data mining. A frequent pattern is a set of items that occur frequently in a data set. General frequent pattern analysis treats the input data as sets, so the order of items is ignored. There are many algorithms for frequent pattern mining, such as Apriori (Agrawal and Srikant, 1994), DHP (Park et al., 1995), DIC (Brin et al., 1997), and FPGrowth (Han et al., 2000). A frequent sequential pattern is a pattern that considers the sequence of items. Many algorithms have been developed for it, including the generalized sequential pattern mining algorithm (GSP) (Srikant and Agrawal, 1996), sequential pattern discovery using equivalence classes (SPADE) (Zaki, 2001), and PrefixSpan (Pei et al., 2001).

Algorithms for closed pattern mining have also been studied for efficient pattern mining. Because closed pattern mining is lossless compression, all the removed patterns can be recovered by enumerating subpatterns of closed patterns. Representative algorithms are CLOSET (Pei et al., 2000) and CHARM (Zaki and Hsiao, 2002) for closed frequent pattern mining, and CloSpan (Yan et al., 2003) for closed frequent sequential pattern mining. As another method for pattern compression, compressed patterns (Xin et al., 2005) have also been proposed.

There are other pattern mining strategies that could be applied to frequent pattern-based data enrichment in future work. Cheng et al. (2007) mined discriminative patterns by filtering patterns based on classification power, selecting patterns using training data. Ding et al. (2009) proposed a method to avoid overestimating frequency when there are repetitive patterns in a sequence; this type of pattern can be beneficial for special data sets with many repetitions. In this paper, we focused on introducing a new concept, frequent pattern-based data enrichment, and experimented with basic pattern mining methods that can be applied in general situations. We therefore leave patterns that require training data or that suit such special repetitive data sets as future work.

CONCLUSION

In this paper, we proposed frequent pattern-based data enrichment, a general method for improving topic model performance with frequent pattern mining. Experiment results showed that our method helps to improve topic modeling performance; among the pattern variations, the proposed compressed/closed frequent pattern-based data enrichment consistently outperformed two representative topic models. In addition to its performance improvement, frequent pattern-based data enrichment has the advantage of generality, which we demonstrated by applying it to both the PLSA and LDA topic models. Pattern mining can be a good way to go beyond bag-of-words models.

As future work, we may apply other sophisticated patterns to further improve the performance of frequent pattern-based data enrichment. There are many ideas for more advanced patterns for topic models. First, applying discriminative pattern mining and repetitive gapped sequence mining methods to our approach could help improve performance. Second, we can weight patterns differently depending on their significance; for example, a longer pattern could receive more weight than a shorter one because it is statistically rarer as a frequent pattern. Third, we can handle overlapping patterns in a better way. Instead of using compressed patterns, we could design our own way to manage overlapping patterns; for example, if one document contains two patterns and one is a subpattern of the other, we could give discounted weights to the overlapping patterns.

As a second future direction, more evaluation and comparison with other models would help in understanding our models. Exploring other types of ungapped patterns would help us further examine the effect of gaps. We performed experiments with ungapped sequential patterns as a simulation of n-gram models; in addition to the gap versus no-gap variation, introducing a new type of pattern with distance constraints between words (e.g., allowing at most 2 gaps) in pattern mining would also be interesting.

Acknowledgements

We thank Prof. Jiawei Han and Prof. Julia Hockenmaier for their helpful comments. This material is based in part upon work supported by the National Science Foundation under Grant Number CNS 1027965, by MURI award FA9550-08-1-0265, by the MIAS Center at UIUC, part of CCICADA, a DHS Science and Technology Center of Excellence, and by an HP Innovation Research Award.

Footnotes

  1. http://projects.ldc.upenn.edu/TDT2/
  2. http://www.daviddlewis.com/resources/testcollections/reuters21578/
  3. http://illimine.cs.uiuc.edu/
  4. http://gibbslda.sourceforge.net/
  5. http://www.lemurproject.org/
  6. http://svmlight.joachims.org/svm_multiclass.html
