Keywords:

  • Semantic models;
  • Paragraph similarity;
  • Corpus preprocessing;
  • Corpus construction;
  • Wikipedia corpora

Abstract

  1. Abstract
  2. 1. Introduction
  3. 2. Semantic models, human datasets, and domain-chosen corpora
  4. 3. Study One: Comparison of models on domain-chosen corpora
  5. 4. Study Two: Corpus preprocessing
  6. 5. Study Three: A better knowledge base?
  7. 6. Study Four: Corpora that include the dataset paragraphs
  8. 7. Overall summary
  9. 8. Discussion
  10. Acknowledgments
  11. References
  12. Supporting Information

The focus of this paper is two-fold. First, similarities generated from six semantic models were compared to human ratings of paragraph similarity on two datasets—23 World Entertainment News Network paragraphs and 50 ABC newswire paragraphs. Contrary to findings on smaller textual units such as word associations (Griffiths, Tenenbaum, & Steyvers, 2007), our results suggest that when single paragraphs are compared, simple nonreductive models (word overlap and vector space) can provide better similarity estimates than more complex models (LSA, Topic Model, SpNMF, and CSM). Second, various methods of corpus creation were explored to facilitate the semantic models’ similarity estimates. Removing numeric and single characters, and also truncating document length improved performance. Automated construction of smaller Wikipedia-based corpora proved to be very effective, even improving upon the performance of corpora that had been chosen for the domain. Model performance was further improved by augmenting corpora with dataset paragraphs.


1. Introduction

The rate at which man [sic] has been storing up useful knowledge about himself and the universe has been spiralling upwards for 10,000 years. —Toffler, 1973, p. 37

Nearly four decades later, Toffler's remark is perhaps even more relevant in today's internet-driven world. ‘‘Information overload’’ may be regarded as pervasive in many professions, and filtering strategies such as the summarization of text are commonplace. Government leaders and company executives make informed decisions based on briefs or short summaries of complex issues, provided by department managers who have in turn summarized longer reports written by their staff. In academia, the abstract is used to provide an overview of a paper's contents, so that time-pressed researchers can filter and absorb information related to their fields of study. In many areas it is important to be able to accurately judge the similarity between two or more paragraphs of information.

Sorting and extracting useful information from large collections of these types of summaries can prove both overwhelming and time consuming for humans. In an attempt to address this issue, semantic models have been successfully employed at these tasks. For example, latent semantic analysis (LSA; Landauer, McNamara, Dennis, & Kintsch, 2007) has been used to grade student essay scripts (Foltz, Laham, & Landauer, 1999). Similarly, the Topic Model has been used to extract scientific themes from abstracts contained in the Proceedings of the National Academy of Sciences (Griffiths & Steyvers, 2004). In a surveillance application, nonnegative matrix factorization has been applied to the large ENRON e-mail dataset to extract topics or themes (Berry & Browne, 2005). Other models such as the vector space model (henceforth called ‘‘Vectorspace’’) were originally designed to index (or order by relevance to a topic) large sets of documents (Salton, Wong, & Yang, 1975).

Semantic models have also been shown to reflect human knowledge in a variety of ways. LSA measures correlate highly with humans’ scores on standard vocabulary and subject matter tests; mimic human word sorting and category judgments; simulate word-word and passage-word lexical priming data; and accurately estimate passage coherence (Landauer et al., 2007). The Topic Model has proven adept at predicting human data on tasks including the following: free association, vocabulary tests, lexical decision, sentence reading, and free recall (Griffiths, Tenenbaum, & Steyvers, 2007). Other models have been developed to reflect specific psychological processes. For example, the constructed semantics model (CSM) was developed as a global-matching model of semantic memory derived from the MINERVA 2 architecture of episodic memory (Kwantes, 2005).

1.1. Different types of textual language unit

When making similarity comparisons on textual stimuli with semantic models, several researchers have highlighted the need to delineate textual stimuli into different language units (Foltz, 2007; Kireyev, 2008; Landauer & Dumais, 1997; McNamara, Cai, & Louwerse, 2007). Past research has modeled human comparisons of similarity on four types of textual language units: words, sentences, single paragraphs, and chapters or whole documents (Foltz, 2007).

1.1.1. Word comparisons

Griffiths et al. (2007) found that the Topic Model outperformed LSA on several tasks, including word association and synonym identification. Griffiths and colleagues compared performance by the Topic Model and LSA on a word association task using norms collected by Nelson, McEvoy, and Schreiber (1998). The study used 4,471 of these words that were also found in an abridged1 37,651-document (26,243 word types, 4,235,314 tokens) version of the Touchstone Applied Science Associates (TASA) corpus. Moreover, the TASA corpus was used as a knowledge base for both the Topic Model and LSA. Two measures were employed to assess the models’ estimates of word association. The first measure assessed central tendency, focusing on the models’ ability to rank word targets for each word cue. The other measure assessed the proficiency of each model's estimate of the most likely target response for each word cue. Griffiths and colleagues found that the Topic Model outperformed LSA on both of these performance measures. Furthermore, they reported that both models performed at levels better than chance and a simple word co-occurrence model.

In another study, Griffiths et al. (2007) compared the Topic Model and LSA on a subset of the synonym section taken from the Test of English as a Foreign Language (TOEFL). The TOEFL was developed in 1963 by the National Council on the Testing of English as a Foreign Language, and it is currently administered by the Educational Testing Service.2 The synonym portion of the TOEFL offers four multiple-choice options for each probe word; Griffiths and colleagues only included items in which all five words also appeared in the aforementioned abridged version of the TASA corpus. Similarity evaluations between the probes and possible synonyms revealed that the Topic Model (70.5%) answered more of the 44 questions correctly than LSA (63.6%). Furthermore, the Topic Model's predictions (0.46) captured more of the variance found in the human responses than LSA's (0.3).

The Topic Model is a generative model that assesses the probability that words will be assigned to a number of topics. One of the key benefits of this generative process is that it allows words to be assigned to more than one topic, thus accommodating the ambiguity associated with homographs (Griffiths et al., 2007). For example, using the Topic Model, the word ‘‘mint’’ may appear in a topic that contains the words ‘‘money’’ and ‘‘coins,’’ and in another topic containing the words ‘‘herb’’ and ‘‘plants.’’ Griffiths et al. (2007) argue that this attribute gives the Topic Model an advantage over models like LSA, which represent meanings of words as individual points in undifferentiated Euclidean space (pp. 219–220).

1.1.2. Sentence comparisons

McNamara et al. (2007) used several implementations of LSA to estimate the relatedness of sentences. The human-judged similarity of these sentences decreased from paraphrases of target sentences, to sentences that were in the same passage as target sentences, to sentences that were selected from different passages than the target sentences. Likewise, comparing sentences using a standard implementation of LSA and the TASA corpus, these researchers found estimates of similarity were greatest for paraphrases, then same-passage sentences, with different-passage sentences judged least similar. When human estimates were correlated with the LSA estimates of sentence similarity, it was found that a version of LSA that emphasized frequent words in the LSA vectors best captured the human responses. Subsequently, using data collected in the McNamara et al. (2007) study, Kireyev (2008) found that LSA outperformed the Topic Model at this task.

1.1.3. Single paragraph comparisons

Lee et al. (2005) examined similarity judgments made by Adelaide University students on 50 paragraphs that were collected from the Australian Broadcasting Corporation's news mail service. These paragraphs ranged from 56 to 126 words in length, with a median length of 78.5 words. Lee and colleagues compared several models’ estimates of similarity to the aforementioned human ratings. These models included word-based, n-gram, and several LSA models. Using a knowledge base of 364 documents also drawn from the ABC news mail service, LSA under a global entropy function was the best performing model,3 producing similarity ratings that correlated about 0.60 with human judgments in this study. LSA's result in this study was also consistent with the inter-rater correlation (approximately 0.605) calculated by these researchers.

More recently, Gabrilovich and Markovitch (2007) produced a substantially higher correlation with the human similarity judgments recorded for the Lee paragraphs (0.72) using the model they developed, explicit semantic analysis (ESA). The ESA model uses Wikipedia as a knowledge base, treating Wikipedia documents as discrete human-generated concepts that are ranked in relation to their similarity to a target text using a centroid-based classifier.

Kireyev (2008) used LSA and the Topic Model to estimate the similarity of pairs of paragraphs taken from 3rd and 6th grade science textbooks. It was proposed that paragraphs that were adjacent should be more similar than nonadjacent paragraphs. Difference scores were calculated between adjacent and nonadjacent paragraphs for both grade levels, with higher scores indicating better model performance. While it was not stated whether one model significantly outperformed the other at this task, on average LSA (0.75) scored higher on the 3rd grade paragraphs than the Topic Model (0.49). However, there was little difference between the two models on the 6th grade paragraphs (LSA 0.33, Topic Model 0.34).

1.1.4. Chapters or whole document comparisons

Martin and Foltz (2004) compared whole transcripts of team discourse to predict team performance during simulated reconnaissance missions. Sixty-seven mission transcripts were used to create the researchers' corpus (UAV-Corpus). LSA was used to measure the similarity between transcripts of unknown missions and transcripts of missions where performance scores were known. To estimate the performance of a team based on their transcript using LSA, an average performance score was calculated from the 10 most similar transcripts found in the UAV-Corpus. Performance scores estimated using LSA were found to correlate strongly (0.76) with the actual team performance scores.

Kireyev (2008) compared the similarity estimates of LSA and the Topic Model using 46 Wikipedia documents. These documents were drawn from six different categories: sports, animals, countries, sciences, religion, and disease. While both models correctly found more similarity between within-category documents than across-category documents, Kireyev (2008) concluded that LSA performed this task consistently better than the Topic Model.

1.2. The dual focus of this paper

This paper describes the outcome of a systematic comparison of single paragraph similarities generated by six statistical semantic models to similarities generated by human participants. Paragraph complexity and length can vary widely. Therefore, for the purposes of this research, we define a paragraph as a self-contained section of ‘‘news’’ media (such as a précis), presented in approximately 50–200 words.

There are two main themes that are explored in this paper. At one level it is an evaluation of the semantic models, in which their performance at estimating the similarity of single paragraph documents is compared against human judgments. As outlined above, past research has indicated that the performance of some models is clearly better depending on which type of textual unit is used as stimulus. For example, the Topic Model was shown to perform better than LSA in word association research, where the textual unit was at the single word level. However, inherent difficulties such as homographs that affect models like LSA at the word unit level may be less problematic for assessments made on larger textual units (sentences, paragraphs, and chapters or whole documents). These larger textual units contain concurrently presented words that may be less ambiguous and are thus able to compensate for a model's inability to accommodate homographic words (Choueka & Lusignan, 1985; Landauer & Dumais, 1997).

Research has indicated that LSA performs well at the paragraph level (Lee et al., 2005), but there are other models that may perform equally well if not better than LSA at this task. Therefore, in this research we compare six models’ efficiency at the task of modeling human similarity judgments of single paragraph stimuli. The models examined were word overlap, the Vectorspace model (Salton et al., 1975), LSA (Landauer, McNamara, Dennis, & Kintsch, 2007), the Topic Model (Blei, Ng, & Jordan, 2003; Griffiths & Steyvers, 2002), sparse nonnegative matrix factorization (SpNMF; Xu, Liu, & Gong, 2003), and the CSM (Kwantes, 2005). Our evaluation of these models is tempered by factors such as model compilation speed, consistency of performance in relation to human judgments of document similarity, and intrinsic benefits such as producing interpretable dimensions.

At another level this paper explores the characteristics of the corpora or knowledge bases utilized by these models that may improve models’ performance when approximating human similarity judgments. With the exception of the word overlap model, a good background knowledge base is essential to the models’ performance. Past research has identified various aspects of corpus construction that affect the performance of the pointwise mutual information co-occurrence model on word-based tasks such as the TOEFL synonym test (Bullinaria & Levy, 2006). These factors included the size and shape of the context window, the number of vectors included in the word space, corpus size, and corpus quality. To address this issue, we have evaluated aspects of corpus composition, preprocessing, and document length in an attempt to produce suitable background corpora for the semantic models.

To this end, four studies are described in this paper that examine the semantic models’ performance relative to human ratings of paragraph similarity. In the first study, semantic models use domain-chosen corpora to generate knowledge spaces on which they make evaluations of similarity for two datasets of paragraphs. Overall, the models performed poorly using these domain-chosen corpora when estimates were compared to those made by human assessors. In the second study, improvements in the models’ performance were achieved by more thoroughly preprocessing the domain-chosen corpora to remove all instances of numeric and single alphabetical characters. In the third study, smaller targeted corpora (subcorpora) constructed by querying a larger set of documents (Wikipedia4) were examined to assess whether they could produce sufficient performance to be generally useful (Zelikovitz & Kogan, 2006). In many applications the hand construction of corpora for a particular domain is not feasible, and so the ability to show a good match between human similarity evaluations and semantic models’ evaluations of paragraph similarity using automated methods of corpus construction is a desirable outcome. Furthermore, document length of the essay-like Wikipedia articles was manipulated to produce better approximations of human judgment by the semantic models. Finally, in the fourth study, several of the models were found to produce better estimates of paragraph similarity when the dataset paragraphs were included in the models’ backgrounding corpus.

2. Semantic models, human datasets, and domain-chosen corpora

2.1. Semantic models

The semantic models examined were word overlap, the Vectorspace model (Salton et al. 1975), LSA (Landauer, McNamara, Dennis, & Kintsch, 2007), the Topic Model (Blei, Ng, & Jordan, 2003; Griffiths & Steyvers, 2002), SpNMF (Xu, Liu, & Gong, 2003), and the CSM (Kwantes, 2005).

Word Overlap: Simple word overlap was used as a baseline in this research. It is the only model that does not use a corpus or knowledge base. Instead, it is a word co-occurrence model. Term frequencies are calculated for each paragraph in the dataset, and similarities are then measured as cosines (see Eq. 1) of the resulting paragraph vectors.

  \cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} \quad (1)
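As a concrete illustration (our sketch, not the authors' code), the word overlap model reduces to building raw term-frequency vectors for each paragraph and taking the cosine of Eq. 1:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity (Eq. 1) between two sparse term-frequency vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def word_overlap_similarity(paragraph_a, paragraph_b):
    """Word overlap model: raw term frequencies only, no background corpus."""
    tf_a = Counter(paragraph_a.lower().split())
    tf_b = Counter(paragraph_b.lower().split())
    return cosine(tf_a, tf_b)
```

Identical paragraphs score 1, paragraphs with no shared words score 0, and partial overlap falls in between.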

The Vectorspace model (Salton, Wong, & Yang, 1975): The Vectorspace model assumes that terms can be represented by the set of documents in which they appear. Two terms will be similar to the extent that their document sets overlap. To construct a representation of a document, the vectors corresponding to its unique terms are multiplied by the log of their frequency within the document, divided by their entropy across documents, and then added. Using the log of the term frequency ensures that words that occur more often in the document have higher weight, but that document vectors are not dominated by words that appear very frequently. Dividing by the entropy or inverse document frequency reduces the impact of high-frequency words that appear in many documents in a corpus. Similarities are measured as the cosines between the resultant vectors for two documents.
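A minimal sketch of this construction, under one reading of the weighting scheme (local weight log(1 + tf), global weight 1/(1 + entropy across documents)); the original implementation's exact normalization may differ:

```python
import math
from collections import Counter

def build_weighted_doc_vectors(corpus):
    """Vectorspace sketch: each term is represented by the documents it
    appears in; a document vector is the weighted sum of its terms'
    document-occurrence vectors (log local weight, entropy global weight)."""
    tfs = [Counter(doc.lower().split()) for doc in corpus]
    # Global frequency and entropy of each term across documents.
    gf = Counter()
    for tf in tfs:
        gf.update(tf)
    entropy = {}
    for term, total in gf.items():
        h = 0.0
        for tf in tfs:
            if term in tf:
                p = tf[term] / total
                h -= p * math.log(p)
        entropy[term] = h
    # Term vectors: which documents each term occurs in.
    term_vecs = {t: [1.0 if t in tf else 0.0 for tf in tfs] for t in gf}
    doc_vecs = []
    for tf in tfs:
        vec = [0.0] * len(corpus)
        for term, count in tf.items():
            # log of within-document frequency, divided by (1 + entropy)
            w = math.log(1 + count) / (1.0 + entropy[term])
            for i, v in enumerate(term_vecs[term]):
                vec[i] += w * v
        doc_vecs.append(vec)
    return doc_vecs
```

Documents that share vocabulary end up with overlapping vector components, while documents with disjoint vocabularies are orthogonal.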

Latent semantic analysis (Landauer, McNamara, Dennis, & Kintsch, 2007): LSA starts with the same representation as the Vectorspace model—a term by document matrix with log entropy weighting.5 In order to reduce the contribution of noise to similarity ratings, however, the raw matrix is subjected to singular value decomposition (SVD). The SVD decomposes the original matrix into a term by factor matrix, a diagonal matrix of singular values, and a factor by document matrix. Typically, only a small number of factors (e.g., 300) are retained. To derive a vector representation of a novel document, term vectors are weighted, multiplied by the square root of the singular value vector and then added. As with the Vectorspace model, the cosine is used to determine similarity.

The Topic Model (Topics; Blei, Ng, & Jordan, 2003; Griffiths & Steyvers, 2002): The Topic Model is a Bayesian approach to document similarity that assumes a generative model in which a document is represented as a multinomial distribution of latent topics, and topics are represented as multinomial distributions of words. In both cases, Dirichlet priors are assumed. The parameters of these models can be inferred from a corpus using either Markov Chain Monte Carlo techniques (MCMC; Griffiths & Steyvers, 2002) or variational Expectation Maximization (Blei, Ng, & Jordan, 2003). We implemented the former. Ideally, document representations should then be calculated by running the MCMC sampler over a corpus augmented with information from the new document. To do this on a document-by-document basis is impractical. In the first instance, we choose to run the sampler over the corpus and then average the word distributions to calculate topic distributions for novel documents. Later in the paper, we investigate the impact of this decision by running the sampler over an augmented corpus containing all of the dataset paragraphs.
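The averaging approximation just described can be sketched as follows. This is our illustration, not the paper's implementation: `phi` (topic-to-word probabilities) and `topic_prior` are hypothetical inputs standing in for quantities that would come from the trained MCMC sampler.

```python
def doc_topic_distribution(words, phi, topic_prior):
    """Approximate a novel document's topic distribution by averaging
    p(topic | word) over the document's words.
    phi[t][w] is p(word w | topic t); topic_prior[t] is p(topic t)."""
    n_topics = len(phi)
    theta = [0.0] * n_topics
    used = 0
    for w in words:
        # p(topic | word) is proportional to p(word | topic) * p(topic)
        scores = [phi[t].get(w, 0.0) * topic_prior[t] for t in range(n_topics)]
        z = sum(scores)
        if z == 0:
            continue  # word unseen in training: skip it
        used += 1
        for t in range(n_topics):
            theta[t] += scores[t] / z
    if used:
        theta = [v / used for v in theta]
    return theta
```

With the ‘‘mint’’ example above, a document containing ‘‘money’’ and ‘‘coins’’ would load onto the finance topic, while one mixing ‘‘money’’ and ‘‘herb’’ would split its mass between topics.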

To calculate the similarity of the topic distributions representing documents, we employed both the Dot Product (see Eq. 2) and Jensen-Shannon Divergence (JSD, see Eq. 3). While the Dot Product was employed for convenience, the JSD is a symmetric form of the Kullback-Leibler Divergence (D), which derives from information theory and provides a well-motivated way of comparing probability distributions.

  \mathrm{sim}(\mathbf{p}, \mathbf{q}) = \mathbf{p} \cdot \mathbf{q} = \sum_i p_i q_i \quad (2)
  \mathrm{JSD}(P \parallel Q) = \tfrac{1}{2} D(P \parallel M) + \tfrac{1}{2} D(Q \parallel M), \quad M = \tfrac{1}{2}(P + Q) \quad (3)
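As a minimal sketch (ours), the JSD of Eq. 3 can be computed directly from its definition as the symmetrized Kullback-Leibler divergence against the midpoint distribution:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q), natural log."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence (Eq. 3): symmetrized KL against the
    midpoint distribution m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Unlike the raw dot product, the JSD is zero for identical topic distributions, symmetric in its arguments, and bounded above by ln 2.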

Sparse nonnegative matrix factorization (Xu, Liu, & Gong, 2003): Nonnegative matrix factorization is a technique similar to LSA, which in this context creates a matrix factorization of the weighted term by document matrix. This factorization involves just two matrices—a term by factor matrix and a factor by document matrix—and is constrained to contain only nonnegative values. While nonnegative matrix factorization has been shown to create meaningful word representations using small document sets, in order to make it possible to apply it to large collections we implemented the sparse tensor method proposed by Shashua and Hazan (2005). As in LSA, log entropy weighted term vectors were added to generate novel document vectors, and the cosine was used as a measure of similarity.

The CSM (Kwantes, 2005): The final model considered was the CSM (Kwantes, 2005). CSM was developed as a global-matching model of semantic memory derived from the MINERVA 2 architecture of episodic memory. Therefore, CSM is unique in that it was created primarily as a cognitive model to explain the emergence of semantics from experience. To this end, CSM uses a retrieval operation on the contexts in which words occur to generate semantic representations. It operates by taking the term by document matrix (using just log weighting) and multiplying it by its transpose. Consequently, terms do not have to appear together in order to be similar, as is the case in the Vectorspace model. Again, term vectors are added to create novel document vectors, and the cosine is used as a measure of similarity.
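A toy sketch of the matrix construction described above (not Kwantes's full MINERVA 2-based model): multiplying the log-weighted term-by-document matrix by its transpose gives each term a vector over the whole vocabulary, and comparing those vectors captures second-order similarity, so terms that never co-occur can still be similar.

```python
import math
from collections import Counter

def csm_term_vectors(corpus):
    """CSM sketch: build a log-weighted term-by-document matrix A and
    compute rows of A * A-transpose as each term's semantic vector."""
    tfs = [Counter(doc.lower().split()) for doc in corpus]
    vocab = sorted({t for tf in tfs for t in tf})
    # Term-by-document matrix with log weighting.
    A = {t: [math.log(1 + tf[t]) for tf in tfs] for t in vocab}
    sem = {}
    for t in vocab:
        sem[t] = {u: sum(A[t][d] * A[u][d] for d in range(len(corpus)))
                  for u in vocab}
    return sem

def csm_similarity(sem, t, u):
    """Cosine between two terms' semantic vectors."""
    dot = sum(sem[t][x] * sem[u][x] for x in sem[t])
    na = math.sqrt(sum(v * v for v in sem[t].values()))
    nb = math.sqrt(sum(v * v for v in sem[u].values()))
    return dot / (na * nb) if na and nb else 0.0
```

In the illustrative corpus below, ‘‘doctor’’ and ‘‘nurse’’ never appear in the same document, yet become similar through their shared neighbor ‘‘hospital.’’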

2.2. The datasets

Two datasets of human ratings of paragraph similarity were used in this study. The first, which we will refer to as the WENN dataset, was composed of similarity ratings generated by subjects comparing celebrity gossip paragraphs taken from the World Entertainment News Network (WENN). The second dataset, which we will refer to as the Lee dataset, was archival data collected by Lee et al. (2005).

2.2.1. The WENN dataset

Seventeen participants, students recruited by advertising the experiment on a local university campus along with employees of Defence Research and Development Canada—Toronto (DRDC), provided the paragraph similarity ratings that form the WENN dataset. Participants were paid CA$16.69 for taking part in the study. Participants compared 23 single paragraphs6 selected from the archives of WENN made available through the Internet Movie Database7 (see Appendix A in the Supporting Information). Paragraphs were not chosen randomly. First, each paragraph was chosen to be approximately 100 words long. The median number of words contained in paragraphs in the WENN dataset was 126, with paragraph lengths ranging from 79 to 205 words. Paragraphs were also chosen in such a way as to ensure that at least some of the paragraphs possessed topical overlap. For example, there was more than one paragraph about health issues, drug problems, stalkers, and divorce among those represented in the stimuli.

Participants were shown pairs of paragraphs, side by side, on a personal computer monitor. Pairs were presented one at a time. For each pair, participants were asked to rate, on a scale of 0 to 100, how similar they felt the paragraphs were to each other. Participants were not given any instructions as to the strategy they should use to make their judgments. Once a similarity judgment had been made, the next pair was presented. Each participant rated the similarity of every possible pairing of different paragraphs for a total of 253 judgments. Pearson correlations were calculated between participants’ pairwise comparisons of the paragraphs in the WENN dataset; the average of these correlation coefficients (0.466) indicates that there was only moderate inter-rater reliability for the WENN dataset.
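The inter-rater reliability computation described here (the average of all pairwise Pearson correlations between raters) can be sketched as follows; the rating lists in the example are illustrative, not the study's data:

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length rating lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mean_interrater_correlation(ratings_by_rater):
    """Average Pearson correlation over all pairs of raters, where each
    rater supplies one similarity rating per paragraph pair."""
    rs = [pearson(a, b) for a, b in combinations(ratings_by_rater, 2)]
    return sum(rs) / len(rs)
```

For the WENN dataset each of the 17 raters would contribute a list of 253 ratings, yielding 136 pairwise correlations to average.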

2.2.2. The Lee dataset

Lee et al. (2005) recorded observations of paragraph similarity made by 83 Adelaide University students to form the Lee dataset. The dataset consists of 10 independent ratings of the similarity of every pair of 50 paragraphs selected from the Australian Broadcasting Corporation's news mail service (see Appendix B in the Supporting Information), which provides text e-mails of headline stories. The 50 paragraphs in the Lee dataset range in length from 56 to 126 words, with a median of 78.5 words. Pairs of paragraphs were presented to participants on a computerized display. The paragraphs in the Lee dataset focused on Australian and international ‘‘current affairs,’’ covering topics such as politics, business, and social issues. Human ratings were made on a 1 (least similar) to 5 (most similar) scale. As mentioned above, Lee et al. (2005) calculated an inter-rater reliability of 0.605.

2.3. Domain-chosen corpora: WENN (2000–2006) and Toronto Star (2005)

Two corpora were chosen to act as knowledge bases for the semantic models to allow similarity estimates to be made on the paragraphs contained in the WENN and Lee datasets. The larger set of 12,787 documents collected from WENN between April 2000 and January 2006 was considered a relevant backgrounding corpus for the 23 paragraphs contained in the WENN dataset; this larger set of documents is henceforth called the WENN corpus. It was not possible to source the original set of 364 headlines and précis gathered by Lee et al. (2005) from the ABC online news mail service. Therefore, in an attempt to provide a news media-based corpus that was similar in style to the original corpus of ABC documents used by Lee and colleagues, articles from Canada's Toronto Star newspaper were used. The Toronto Star corpus comprised 55,021 current affairs articles published during 2005.

Initially, both corpora were preprocessed using standard methods: characters were converted to lower case, numbers were zeroed (i.e., 31 Jan 2007 became 00 jan 0000), punctuation and words from a standard stop-list (see Appendix C in the Supporting Information) were removed, and words that appeared only once in a corpus were also removed. Descriptive statistics for both the WENN corpus and the Toronto Star corpus are displayed in Appendix D (see the Supporting Information).
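A sketch of this preprocessing pipeline, with a small stand-in stop-list (the study used the full list in Appendix C):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # stand-in for the full stop-list

def preprocess_corpus(documents, stop_words=STOP_WORDS):
    """Preprocess as described above: lower-case, zero all digits
    (so '31 Jan 2007' becomes '00 jan 0000'), strip punctuation, drop
    stop-list words, and remove words occurring only once in the corpus."""
    tokenized = []
    for doc in documents:
        text = doc.lower()
        text = re.sub(r"\d", "0", text)       # zero out numbers
        text = re.sub(r"[^\w\s]", " ", text)  # strip punctuation
        tokens = [t for t in text.split() if t not in stop_words]
        tokenized.append(tokens)
    # Remove words that occur only once across the whole corpus.
    counts = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if counts[t] > 1] for doc in tokenized]
```

Note that the hapax-removal pass runs over the whole corpus, so a word kept in one document must appear somewhere else as well.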

3. Study One: Comparison of models on domain-chosen corpora

Comparisons made between all semantic models and human evaluations of paragraph similarity for both datasets are presented in the following two subsections of this paper. For the more complex models (LSA, Topics, and SpNMF) one must select a number of dimensions in which to calculate similarities. Performance is likely to be influenced by this choice; therefore, in each case comparisons were made using 50, 100, and 300 dimensional models.

3.1. WENN dataset and WENN corpus

Using the WENN corpus, correlations between similarity ratings made by humans and the models on paragraphs in the WENN dataset were low (see Fig. 1) for all models except the simple word overlap (0.43). Of the other models, CSM (0.26) and LSA at 50 dimensions (0.21) performed best. Using the Jensen-Shannon metric improved the performance of the Topic Model in all cases when compared to the dot product measure of similarity. It could be argued that both Vectorspace (r = .17, t(250) = 1.61, n.s.)8 and LSA at 50 dimensions (r = .21, t(250) = 1.05, n.s.) performed as well as the CSM on this document set. For LSA, the Topic Model, and SpNMF, increasing the dimensionality or number of topics did not significantly increase or decrease model performance at this task (see Appendix Table E1 in the Supporting Information).


Figure 1.  Correlations (r) between the similarity ratings made on paragraphs in the WENN dataset by human raters and those made by word overlap, LSA, Topics, Topics-JS (with Jensen-Shannon), SpNMF, Vectorspace, and CSM. All models, except word overlap, used the WENN corpus. The effects of dimensionality reduction are displayed at 50, 100, and 300 dimensions for the more complex models that incorporate this reductive process. Error bars are the 95% confidence limits of the correlation. Correlations exclude Same–Same paragraph comparisons.


3.2. Lee dataset and Toronto Star corpus

Again, except for the word overlap (0.48), the correlations between similarity ratings made by human participants and the models on the paragraphs in the Lee dataset were very low (see Fig. 2). CSM and SpNMF (300 dimensions) were the next best performing models, correlating 0.15 and 0.14 with human judgments, respectively. In addition, Vectorspace had higher correlations than both LSA and the Topic Model using the dot product similarity measure. In 9 of the 12 possible comparisons, increased dimensionality produced significantly better estimates of paragraph similarity by models when compared to human ratings (see Appendix Table E2 in the Supporting Information).


Figure 2.  Correlations (r) between the similarity ratings made on paragraphs in the Lee dataset by human raters and those made by word overlap, LSA, Topics, Topics-JS (with Jensen-Shannon), SpNMF, Vectorspace, and CSM. All models, except word overlap, used the Toronto Star corpus. The effects of dimensionality reduction are displayed at 50, 100, and 300 dimensions for the more complex models that incorporate this reductive process. Error bars are the 95% confidence limits of the correlation. Correlations exclude Same–Same paragraph comparisons.


3.3. Summary of Study One

Overall, the simple word overlap model outperformed the more complex semantic models when paragraph similarities were compared to human judgments made on both WENN and Lee datasets. On the Lee dataset, semantic models generally performed better when semantic spaces were compiled with higher dimensionality. However, when model dimensionality was increased on the WENN dataset, a similar increase in performance was not found. The generally poor results for the more complex models could be the product of at least one of the following circumstances:

  1. The models are unable to generate similarity calculations that are comparable with human judgments.
  2. The preprocessing of corpora may have been inadequate, to the extent that noise remained in the corpora, preventing the semantic models from making reasonable estimates of paragraph similarity.
  3. The corpora did not represent the knowledge required to make similarity estimates on the paragraphs contained in the WENN and Lee document sets.

Other studies have reported more encouraging results when comparing estimates of paragraph similarity generated by semantic models and humans (Gabrilovich & Markovitch, 2007; Lee et al., 2005). The first possible explanation is therefore likely to be inaccurate: semantic models can make reasonable estimates of the similarity of paragraphs when compared to human judgments. While this was not the case in this study, the semantic models' poor performance may have been driven by a suboptimal match between the background corpus and the paragraphs being tested. This scenario is supported by the generally low correlations with human results obtained by all of the models that required a background corpus. The following three studies explore the latter two possibilities. In Study Two, a more stringent corpus preprocessing method is used to improve on the results presented in Study One. In Study Three, Wikipedia is used to generate better backgrounding corpora, and this method again improves model estimates of paragraph similarity when compared to the human judgments. Then, in Study Four, paragraphs from the datasets are added to the models' knowledge base to further improve model performance at this task.
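
For reference, one plausible formulation of the word overlap baseline discussed throughout these studies is a cosine similarity over bag-of-words count vectors. This is a minimal sketch, not necessarily the exact measure used in the studies reported here:

```python
import math
from collections import Counter

def cosine_overlap(par_a, par_b):
    # Bag-of-words count vectors for each paragraph.
    a, b = Counter(par_a.lower().split()), Counter(par_b.lower().split())
    # Dot product over the shared vocabulary only.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

Paragraphs sharing no words score 0; identical paragraphs score 1, which matches the intuition that word overlap captures low- and high-similarity pairs well.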

4. Study Two: Corpus preprocessing


Generally, corpus preprocessing identifies words that are likely to be informative to the semantic model. In the field of information retrieval, researchers have employed many types of sophisticated term selection functions (Sebastiani, 2002, p. 15). Other methods, such as employing a stop-list, are less complex, requiring no mathematical calculation; they simply remove words from the corpus which are deemed uninformative by the researcher. Stop-lists are usually applied to remove words such as articles, pronouns, and conjunctions (Moed, Glänzel, & Schmoch, 2004). Bullinaria and Levy (2006) found that stop-lists reduced model performance when the textual units under comparison are single words (such as the TOEFL task described above). However, working with paragraph comparisons, Pincombe (2004) states that ‘‘[u]se of a stop word list almost always improved performance’’ when comparing model estimates of similarity and human judgments (p. 1). A closer inspection of the stop-list (Appendix F in the Supporting Information) and preprocessing techniques (p. 14) used by Pincombe (2004) was conducted.9 This review revealed that single letters had been removed by the author and only alphabetical characters had been used in his corpora. The difference between the preprocessing used in Study One (allowing the inclusion of zeroed numbers and single characters) and that used in Pincombe's research raises the question:

Can the removal of single letters and numbers from the background corpus improve a semantic model's ability to estimate paragraph similarity?

It is possible that the presence of these types of information (numbers and single letters) in a corpus creates noise for the models. For example, the American Declaration of Independence in 1776 has little to do with Elvis Presley's birthday in 1935. Yet under the preprocessing method of zeroing numbers, models comparing texts that describe these two occasions would erroneously find some similarity between them. Moreover, the zeroing of the aforementioned dates could also suggest commonality with a document describing the distance between two cities, creating noise in the corpus even if this new document described a 1,000-mile drive between Philadelphia (Pennsylvania) and Tupelo (Mississippi). Similarly, the ‘‘Js’’ in ‘‘J F K’’ and ‘‘J K Rowling’’ should not indicate semantic similarity between documents that make reference to these well-known individuals. Therefore, the removal of these items may benefit a model's ability to perform similarity ratings between paragraphs.

4.1. Removing numbers and single letters

All numbers and single letters were removed from both the WENN and Toronto Star corpora10 to test the hypothesis that removing these characters would improve the semantic models' performance when similarity ratings were compared to human judgments. Figs. 3 and 4 display comparisons between the results generated in Study One (ALL) and the results for spaces compiled on corpora without numbers and single letters (NN-NSL, No Numbers-No Single Letters). Only the results for models compiled at 300 dimensions (where dimensionality is a parameter of the model) are displayed in these figures. It should be noted that while the models compiled at 300 dimensions generally produced the best results,11 models compiled at both 50 and 100 dimensions displayed an identical trend (see Appendix Table G1 in the Supporting Information) of better performance when using the more stringent preprocessing method.
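
This removal step can be sketched as follows. The stop-list here is an illustrative placeholder subset, not the full list actually used (see Appendix F in the Supporting Information):

```python
import re

# Illustrative stop-word subset only; the real stop-list is larger.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}

def preprocess_nn_nsl(text):
    """Tokenize and lowercase, then drop stop-words, numbers, and
    single letters (the NN-NSL preprocessing condition)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue
        if any(ch.isdigit() for ch in tok):  # numbers, e.g. "1776", "1935"
            continue
        if len(tok) == 1:  # single letters, e.g. the "J" in "J K Rowling"
            continue
        kept.append(tok)
    return kept
```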


Figure 3.  Correlations between similarity estimates made by humans and models on paragraphs in the WENN dataset. Models that employ a knowledge base used the WENN corpus. ‘‘ALL’’ depicts standard corpus preprocessing used in Study One; ‘‘NN-NSL’’ corpora have also had numbers and single letters removed. Error bars are the 95% confidence limits of the correlation. Correlations exclude Same–Same paragraph comparisons.



Figure 4.  Correlations between similarity estimates made by humans and models on paragraphs in the Lee dataset. Models that employ a knowledge base used the Toronto Star corpus. ‘‘ALL’’ depicts standard corpus preprocessing used in Study One; ‘‘NN-NSL’’ corpora have also had numbers and single letters removed. Error bars are the 95% confidence limits of the correlation. Correlations exclude Same–Same paragraph comparisons.


Although it may seem counterintuitive to remove information from a knowledge base or corpus, the removal of numbers and single letters improved correlations between human judgments and similarity ratings produced from models in nearly all comparisons that were made for both the WENN (see Fig. 3) and Lee (see Fig. 4) datasets. The only model that did not improve in performance was CSM on the WENN dataset. This difference for CSM between ALL (0.26) and NN-NSL (0.16) corpora was significant (t(250) = −2.48, p < .05). A more promising trend was displayed by the other models, especially on the WENN dataset with the LSA (0.48) and SpNMF (0.43) models performing best of the more complex semantic models. However, this trend was also displayed by the simple word overlap model which continued to clearly outperform the other models. When numbers and single letters were removed from the paragraphs used by the overlap model, correlations between this model and the human judgments improved to 0.62 on the WENN dataset and 0.53 on the Lee dataset. In 4 of the 12 comparisons on the WENN dataset, and 5 of the 12 comparisons on the Lee dataset, increased dimensionality led to significant improvements to models’ performance (see Appendix Tables G1 and G2 in the Supporting Information).

Notwithstanding this general improvement in the more complex semantic models' performance, correlations with human judgments of similarity were still low using the Toronto Star (NN-NSL) corpus on the Lee dataset, with the highest being the Vectorspace model (0.2). This suggests that while corpus preprocessing was hindering the models' ability to provide reasonable estimates of paragraph similarity, other factors were also impeding the models' performance. The information and themes contained within corpora clearly constrain the performance of semantic models. However, suitable knowledge bases are not always easy to obtain. In an attempt to address this issue, the third study examines an alternative method of generating corpora that draws sets of knowledge-domain-related documents (subcorpora) from the online encyclopedia Wikipedia.

5.  Study Three: A better knowledge base?


Smaller, more topic-focused subcorpora may provide context for polysemous words that might otherwise take on several meanings in a larger corpus. To this end, Wikipedia was utilized as a generic set of documents from which smaller targeted subcorpora could be sampled and compiled. Wikipedia is maintained by the general public, and it has become the largest and most frequently revised encyclopedia in the world. Critics have questioned the accuracy of the articles contained in Wikipedia, but research conducted by Giles (2005) did not find significant differences in the accuracy of science-based articles contained in Wikipedia when they were compared to similar articles contained in the Encyclopedia Britannica. Furthermore, the entire collection of Wikipedia articles is available to the general public and can be freely downloaded.12 All Wikipedia entries current to March 2007 were downloaded for this research. In total, 2.8 million Wikipedia entries were collected; however, the total number of documents was reduced to 1.57 million after the removal of incomplete articles, which were identified by phrases like ‘‘help wikipedia expanding’’ or ‘‘incomplete stub.’’ The resulting Wikipedia corpus was further preprocessed in the same manner as the NN-NSL corpora in Study Two: removing stop-words, punctuation, words that only appeared once in the corpus, and finally all numbers and single letters.
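
The stub-removal step can be sketched as a case-insensitive substring test against the marker phrases quoted above (a minimal illustration; the actual filtering pipeline is not specified in detail here):

```python
# Marker phrases taken from the text above.
STUB_MARKERS = ("help wikipedia expanding", "incomplete stub")

def is_complete_article(text):
    """True if the article contains none of the stub marker phrases."""
    lowered = text.lower()
    return not any(marker in lowered for marker in STUB_MARKERS)

def filter_articles(articles):
    """Drop incomplete Wikipedia articles flagged by a stub marker."""
    return [a for a in articles if is_complete_article(a)]
```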

To enable the creation of subcorpora, Lucene13 (a high-performance text search engine) was used to index each document in the Wikipedia corpus. Lucene allows the user to retrieve documents based on customized queries. Like the search results provided by Google, the documents returned by Lucene are ordered by relevance to a query.

Targeted queries were created for each paragraph rated by humans in the WENN dataset. This WENN-based query was constructed by removing stop-words and punctuation from the title14 that accompanied each paragraph, and then joining the remaining words with ‘‘OR’’ statements (see Appendix H in the Supporting Information). In contrast, the query devised for the paragraphs in the Lee dataset was more complex. For the Lee-based query, the researcher chose several descriptive keywords15 for each paragraph in the Lee dataset, and used ‘‘AND’’ and ‘‘OR’’ operators to combine these keywords. Moreover, the Lee-based query used Lucene's ‘‘star’’ wild-card operator to return multiple results from word stems. For example, the stem and wild-card combination ‘‘research*’’ would match documents containing the words ‘‘research,’’ ‘‘researcher,’’ and ‘‘researchers’’ (see Appendix I in the Supporting Information).
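
The WENN-style query construction can be sketched as follows. The stop-word set is an illustrative subset, and the output is a Lucene-style boolean query string:

```python
import re

# Illustrative stop-word subset only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "on"}

def wenn_query_from_title(title):
    """Build a Lucene-style query from a paragraph title: strip
    punctuation and stop-words, then join the remaining words with OR."""
    words = [w for w in re.findall(r"[a-z]+", title.lower())
             if w not in STOP_WORDS]
    return " OR ".join(words)
```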

5.1. Wikipedia subcorpora

Four subcorpora were created using the Lucene queries (described above) on the Wikipedia document set. For each dataset (WENN and Lee), a 1,000-document and a 10,000-document subcorpus were generated. The structure of the Wikipedia articles contained in these subcorpora was substantially different from the documents contained in either the WENN or Toronto Star corpora (see Table D1 in the Supporting Information). Wikipedia articles tend to be longer in format, with documents that approximate the length of a short essay (on average 1,813–2,698 words per document). In contrast, the documents contained in both the WENN and Toronto Star corpora are similar in length to a journal article's abstract (on average 74–255 words per document). In addition to being generally much longer than the WENN or Toronto Star documents, the Wikipedia documents also contain on average many more unique words.

The greater size and complexity of the Wikipedia documents may produce noise for the semantic models. However, Lee and Corlett's (2003) findings indicate that decisions about a document's content can be made using only the beginning of a document's text. In their study of Reuters’ documents, words found in the first 10% of a document's text were judged to hold greater ‘‘mean absolute evidence’’ characterizing a document's content. Lee and Corlett calculated the ‘‘evidence’’ value of a word given a particular topic. This calculation was made by comparing how often a word appeared in documents related to a topic, relative to the word's frequency in documents that were not topic-related. Their finding may reflect a generally good writing style found in Reuters’ documents, where articles may begin with a précis or summary of the information that is contained in the rest of the document. Documents in a Web-based medium such as Wikipedia may also conform to this generalization. Intuitively, it seems likely that important descriptive information displayed on a Web page will be positioned nearer the top of the page (probably within the first 300 words), so as not to be overlooked by the reader as the Web page scrolls or extends beneath the screen.16

To explore the possible effect of document length (number of words) on semantic models, corpora were constructed that contained the first 100, 200, 300, and all words from the Wikipedia subcorpora. To illustrate this point, if the preceding paragraph was considered a document, in the first-100-word condition this document would be truncated at ‘‘…by comparing how often a word appeared in.’’ Furthermore, to test if corpus size influenced the similarity estimates generated by the semantic models, performance was compared on the 1,000 and 10,000 subcorpora for both datasets. This created a 2 × 4 design (number of documents in a corpus by number of words in each document) for each dataset. Each subcorpus was compiled using LSA at 300 dimensions. LSA was chosen for its quick compilation speeds and because of the generally good match that has been reported between LSA and human performance on tasks comparing paragraph similarity (Landauer & Dumais, 1997; Lee et al., 2005). Moreover, LSA was generally one of the best-performing knowledge-base models in the previous studies presented in this paper.17 The choice of dimensionality is supported by the findings of the first two studies in this paper, where increased dimensionality improved performance.
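
The truncation conditions and the 2 × 4 design can be sketched as:

```python
def truncate_document(tokens, max_words):
    """Keep the first max_words tokens; None keeps all (the ALL condition)."""
    return tokens if max_words is None else tokens[:max_words]

# The 2 x 4 design: number of documents BY document truncation length.
corpus_sizes = [1000, 10000]
truncations = [100, 200, 300, None]  # None = ALL words
conditions = [(n, t) for n in corpus_sizes for t in truncations]
```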

5.1.1. Document length

In general, LSA's performance improved as document length was shortened, with the best results produced by truncating documents at 100 words. On both datasets, LSA produced the highest correlations with the human similarity judgments using the 1,000-document subcorpora truncated at 100 words (see Figs. 5 and 6). This configuration produced a result (0.51) that was significantly higher than all other document-number and document-length combinations for the Lee dataset. On the WENN dataset, the correlation for the 1,000-document corpora with documents truncated at 100 words was higher than in all other cases; however, in several cases this difference was not significant. On both datasets, truncating documents at 100 words produced significantly higher correlations than the ALL-word conditions (where document length was not truncated). These results show that improvements to model performance can be achieved by truncating documents to 100 words, supporting the earlier findings of Lee and Corlett (2003).


Figure 5.  Correlations between human judgments of paragraph similarity on the WENN dataset with estimates made using LSA (at 300 dimensions) using the WENN Wikipedia-based corpora containing 1,000 and 10,000 documents retrieved using Lucene with WENN-based query. Wikipedia documents have been truncated in four ways: first 100, 200, 300, and ALL words. Error bars are the 95% confidence limits of the correlation. Correlations exclude Same–Same paragraph comparisons.



Figure 6.  Correlations between human judgments of paragraph similarity on the Lee dataset with estimates made using LSA (at 300 dimensions) using Lee Wikipedia-based corpora containing 1,000 and 10,000 documents retrieved using Lucene with Lee-based query. Wikipedia documents have been truncated in four ways: first 100, 200, 300, and ALL words. Error bars are the 95% confidence limits of the correlation. Correlations exclude Same–Same paragraph comparisons.


5.1.2. Number of documents

LSA's performance on both datasets was best using the smaller 1,000-document subcorpora. On the Lee dataset, when documents were truncated at 100 words, the performance of LSA was better using the 1,000-document subcorpora than the 10,000-document subcorpora (t(1222) = 4.44, p < .05). On the WENN dataset, when documents were truncated at 100 words, performance was also better for the 1,000-document subcorpora, although this difference failed to reach significance (t(250) = 1.63, n.s.).

5.2. All models compared on Wikipedia subcorpora

The results presented in Study Two of this paper for models using the WENN (NN-NSL) and Toronto Star (NN-NSL) corpora have also been included in the findings presented in Figs. 7 and 8 as points of comparison to judge the effectiveness of creating the 1,000- and 10,000-document subcorpora from Wikipedia.


Figure 7.  Correlations between human judgments of paragraph similarity on the WENN dataset with semantic model estimates made using Wikipedia corpora with 1,000- and 10,000-documents and the WENN Corpus (NN-NSL). Error bars are the 95% confidence limits of the correlation. These results are also presented in Appendix Table J1 in the Supporting Information. Correlations exclude Same–Same paragraph comparisons.



Figure 8.  Correlations between human judgments of paragraph similarity on the Lee dataset with semantic model estimates made using Wikipedia corpora with 1,000- and 10,000-documents and the Toronto Star (NN-NSL). Error bars are the 95% confidence limits of the correlation. These results are also presented in Table J2 in the Supporting Information. Correlations exclude Same–Same paragraph comparisons.


When the results for both the WENN and Lee datasets are taken into consideration, again none of the more complex semantic models performed significantly better than the simple word overlap model. While the best-performing model on the Lee dataset was Vectorspace (0.56) using the Wikipedia 10,000-document corpus, this was not significantly different (t(1222) = 1.31, n.s.) from the word overlap model's correlation (0.53) with human judgments. As is displayed in Figs. 7 and 8, of the corpus-based models, Vectorspace, LSA, and SpNMF performed the best on both datasets. It is unclear whether the Jensen-Shannon metric or the dot product measure produced better results with the Topic Model. On the Lee dataset, the Topic Model with dot product (0.48) using the 1,000-document Wikipedia corpus significantly outperformed the Topic Model with the Jensen-Shannon metric (0.42) using the 10,000-document Wikipedia corpus (t(1222) = −2.08, p < .05). However, on the WENN dataset using the WENN (NN-NSL) corpus, there was not a significant difference between the two Topic Model similarity measures (t(250) = 0.53, n.s.).
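
For reference, the Jensen-Shannon metric compared above can be sketched as follows. How the divergence is converted to a similarity (e.g., 1 − JS) is an assumption; the exact conversion is not given here.

```python
import math

def _kl(p, q):
    # Kullback-Leibler divergence with base-2 logs; terms with p_i = 0
    # contribute nothing.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two topic distributions;
    with base-2 logs the value lies in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)
```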

Latent semantic analysis performed well using both the WENN (NN-NSL) and Wikipedia-based Lee corpora. Given that LSA is built on Vectorspace, it is encouraging to see that in the case of the WENN dataset dimensionality reduction improved LSA's performance (0.48) compared to Vectorspace (0.41). However, this improvement was not found consistently, as indicated by the higher correlation with human judgments on the Lee dataset achieved by Vectorspace using either the 1,000- or 10,000-document Wikipedia-based corpora (see Fig. 8).
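
The relationship noted above — LSA as Vectorspace plus dimensionality reduction — can be sketched as follows, assuming a plain term-by-document count matrix with no weighting scheme:

```python
import numpy as np

def lsa_similarity(term_doc, k, i, j):
    """Sketch of LSA: truncate the SVD of a term-by-document count
    matrix to k dimensions, then compare documents i and j by cosine.
    With k equal to the full rank, this reduces to the Vectorspace case."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    docs = (np.diag(s[:k]) @ Vt[:k]).T  # documents in the k-dim space
    a, b = docs[i], docs[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```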

Using the WENN (NN-NSL) corpus as a knowledge base allowed the semantic models to produce better estimates of human similarity judgments than could be obtained using either the 1,000- or 10,000-document Wikipedia-based corpora on the WENN dataset. In contrast, corpora retrieved from Wikipedia allowed the models to perform much better when making estimates of paragraph similarity on the Lee document set (see Fig. 8). For the corpus-based models, the 10,000-document Wikipedia corpus produced the highest correlation with human ratings on the Lee document set (Vectorspace, 0.56); however, in the majority of cases the 1,000-document Wikipedia corpora were associated with better model performance at this task. All results presented thus far have consistently shown that the Toronto Star provided a poor knowledge base on which to assess the paragraphs contained in the Lee dataset. These results indicate that when domain-chosen corpora are not a good fit to the knowledge required to make accurate estimates of similarity on paragraphs, using corpora drawn from Wikipedia can improve model performance.

6. Study Four: Corpora that include the dataset paragraphs


In the empirical studies we have reported, subjects were presented with document pairs to be rated. Documents were repeated in different pairs, so for the majority of ratings subjects had already been exposed to all of the test documents. In the previous studies, paragraphs contained in the WENN dataset were included in the WENN corpora, but the Lee dataset paragraphs were not included in the corpora used by models on the Lee dataset. Consequently, the models were at a disadvantage relative to participants. The inclusion of dataset paragraphs is potentially important for models like the Topic Model, where context can select the appropriate meaning of a word. To evaluate the efficacy of including stimulus paragraphs in the semantic models' knowledge base as a method of corpus improvement, the 50 Lee dataset paragraphs were added to the most effective corpora, with the most effective preprocessing, found in the previous studies for the Lee dataset.

For this study, the 50 Lee paragraphs were prepended to both the 1,000- and 10,000-document Wikipedia corpora. These revised corpora were preprocessed using the same techniques described in Study Three for the Wikipedia subcorpora. While the 50 Lee paragraphs were not truncated at 100 words, preprocessing was used to remove punctuation, stop-list words, words that only appear once in the document set, numbers, and single letters. After preprocessing, the smaller corpus contained 1,050 documents with 8,674 unique words and 100,107 tokens, and the larger corpus held 10,050 documents comprising 37,989 unique words from a total of 942,696 tokens.
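
The augmentation step and the corpus statistics quoted above can be sketched as a minimal illustration over preprocessed token lists:

```python
from collections import Counter

def augment_corpus(dataset_paragraphs, background_docs):
    """Prepend the rated paragraphs so they become part of the
    models' knowledge base."""
    return list(dataset_paragraphs) + list(background_docs)

def corpus_stats(docs):
    """Document, unique-word, and token counts for a preprocessed
    corpus given as lists of tokens."""
    counts = Counter(tok for doc in docs for tok in doc)
    return len(docs), len(counts), sum(counts.values())
```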

Adding the 50 Lee paragraphs to the Wikipedia 1,000 corpora significantly improved correlations between model estimates and human judgments of similarity in nearly all cases (see Appendix Table K1 in the Supporting Information). While the Topic Model improved using the dot product measure, there was not a significant improvement using the Jensen-Shannon metric. The greatest improvements in model performance were displayed by Vectorspace, which increased from 0.55 to 0.67, and LSA, which rose from 0.51 to 0.60 (see Fig. 9).


Figure 9.  Correlations between human and model estimates of paragraph similarity on the Lee dataset using the standard Wikipedia 1,000 corpora (Wikipedia 1,000) and Wikipedia 1,000 corpora including the 50 Lee documents (Wikipedia 1,050). The overlap model has also been included in this bar graph to allow the reader another point of comparison. Error bars are the 95% confidence limits of the correlation. Correlations exclude Same–Same paragraph comparisons.


Prepending the 50 Lee paragraphs to the 10,000-document Wikipedia corpora produced both significant performance increases and decreases across models (see Appendix Table K2 in the Supporting Information). While all differences were significant when compared to the nonaugmented Wikipedia subcorpora, the actual differences in performance were small for most models, generally ranging between 0.001 and 0.02. The exception was the Topic Model using the Jensen-Shannon metric, which rose from 0.42 to 0.49 when the 50 Lee paragraphs were added to the Wikipedia 10,000 corpus (see Fig. 10).


Figure 10.  Correlations between human and model estimates of paragraph similarity on the Lee dataset using the standard Wikipedia 10,000 corpora (Wikipedia 10,000) and Wikipedia 10,000 corpora including the 50 Lee documents (Wikipedia 10,050). The overlap model has also been included in this bar graph to allow the reader another point of comparison. Error bars are the 95% confidence limits of the correlation. Correlations exclude Same–Same paragraph comparisons.


7. Overall summary


In Study One, moderate correlations were found between the word overlap model (WENN: 0.43; Lee: 0.48) and human judgments of similarity on both datasets. However, weaker performance was displayed by all of the more complex models on both the WENN (highest: CSM, 0.26) and Lee (highest: CSM, 0.15) datasets. It was postulated that the semantic models' performance may have been constrained by factors such as corpus preprocessing and a poorly represented knowledge domain (in the case of the Toronto Star corpus and the Lee dataset). In Study Two, the importance of corpus preprocessing was highlighted; removing the numbers and single letters from corpora improved correlations with human judgments on both datasets for all models except CSM. After the removal of these characters, the best performing of the more complex models were LSA (0.48) on the WENN dataset and Vectorspace (0.20) on the Lee dataset. However, the corpus-based models still failed to outperform the word overlap model, which also improved with the removal of numbers and single letters on both datasets (WENN: 0.62; Lee: 0.53).

In some ways it is unsurprising that the models’ performance in this study was better on the WENN dataset than the Lee dataset, because the paragraphs used in similarity judgments were drawn from the greater set of documents contained in the WENN corpus. That is, in the case of the WENN set there was a better match between paragraphs that were compared (WENN dataset) and the models’ knowledge base (the WENN Corpus). Conversely, the Toronto Star articles did not provide a good approximation of the knowledge required to make reliable inferences regarding the similarity of paragraphs contained in the Lee dataset. While the Toronto Star corpus contains extracts of current affairs, these articles (published in 2005) must vary substantially from the précis published in 2001 that are contained in the ABC news mail service that was used by Lee et al. (2005).

In an attempt to obtain a better representation of the knowledge base required to make accurate paragraph similarity comparisons, in Study Three Wikipedia subcorpora were generated to use on each dataset. The Wikipedia documents were found to be much longer and more like short essays than the summary or abstract length documents found in the WENN and Toronto Star corpora. Guided by the research findings of Lee and Corlett (2003), it was found that Wikipedia documents truncated at 100 words provided better corpora for LSA at 300 dimensions than when using all of the words contained in these documents. LSA's performance was also better using the smaller 1,000-document Wikipedia subcorpora. The decision to use 300 dimensions was in part based on the results of Study One and Study Two, which indicated that increased dimensionality often led to significant performance gains when model estimates of paragraph similarity were compared to human ratings.

Based on these findings, spaces were compiled for the models using Wikipedia corpora that contained documents truncated at 100 words. The semantic models' performance on the WENN dataset did not improve using these Wikipedia subcorpora when compared to results achieved by models using the WENN corpus. However, there was a substantial improvement by nearly all models (except CSM) when similarity estimates were compared on the Lee dataset. Using the Wikipedia subcorpora, the best performing of the more complex models on the Lee dataset were Vectorspace using both 1,000 documents (0.55) and 10,000 documents (0.56) and SpNMF using 1,000 documents (0.53), all of which approach the inter-rater reliability (0.6) recorded for Lee and colleagues' participants (Lee et al., 2005). The decrement in performance seen using the Wikipedia subcorpora, when compared to the WENN corpus on the WENN dataset, is again somewhat expected given that the documents in the WENN dataset were selected from the WENN corpus. When the results on both the WENN and Lee datasets are considered, Vectorspace, LSA, and SpNMF were the best performing of the corpus-based models. That said, even using corpora that allowed models to perform on a comparable level with the inter-rater reliability found in the WENN dataset, and approaching that calculated for the Lee dataset, these models still could not significantly outperform the simple word overlap model when estimating the similarity of paragraphs in comparison to human performance at this task.

The final study explored what effect including the dataset paragraphs in a corpus would have on the models' performance. This assessment was only undertaken for the Wikipedia corpora used on the Lee dataset, as the WENN documents were already included in the WENN corpora in the previous studies. In particular, the Topic Model's performance was expected to increase; however, this improvement was only observed for the Topic Model using the dot product measure of similarity. Generally, performance increases associated with the inclusion of the 50 Lee paragraphs were greater on the smaller 1,050-document Wikipedia corpus than on the 10,050-document Wikipedia corpus. It is possible that any benefit to a model's performance produced by adding these 50 paragraphs is negated by the volume of terms contained in the larger corpus. Overall, the best performance was observed for Vectorspace (0.67) and LSA (0.60) using the 1,050-document Wikipedia corpus containing the 50 Lee paragraphs. It is interesting to note that LSA's performance using the smaller Wikipedia corpus and the 50 Lee paragraphs was almost exactly the same as the inter-rater reliability calculated for the Lee dataset. Furthermore, using this augmented 1,050-document Wikipedia corpus, both LSA (t(1222) = 3.20, p < .05) and Vectorspace (t(1222) = 7.81, p < .05) significantly outperformed the overlap model (0.53) when estimates of paragraph similarity were compared to the human judgments contained in the Lee dataset.

Fig. 11 displays scatterplots from the two best-performing models on the WENN and Lee datasets. It was surprising that on both datasets the simple word overlap model was among the two best performers. As Fig. 11B and D illustrate, the word overlap model mainly captures human responses made on paragraph pairs with low or no similarity. It is also interesting to note that on the WENN dataset, LSA using the WENN corpus (NN-NSL) estimated at least some similarity between every paragraph pair (see Fig. 11A). This may indicate that greater dimensionality is needed by LSA to adequately delineate the themes presented in the WENN corpus documents. At another level, because the WENN paragraphs all focus on ‘‘celebrity gossip,’’ to some extent they may all be considered related. On the Lee dataset, in contrast, Vectorspace appears to have provided a relatively good match to the average human estimates of paragraph similarity (see Fig. 11C).


Figure 11.  Scatterplots of the two best similarity estimates calculated for both the WENN and Lee datasets compared to the average similarity estimates made by humans for each pair of paragraphs. On the WENN dataset, (A) LSA using the WENN corpus (NN-NSL), and (B) the Overlap model. On the Lee dataset, (C) Vectorspace using the Wikipedia 1,050 (including Lee documents), and (D) the Overlap model. Note, on the Lee dataset, average human ratings have been normalized [0,1].


8. Discussion


Quite surprisingly, the simplest models (Vectorspace and word overlap) were the best-performing models examined in this research, both exceeding the inter-rater reliability calculated for human judgments. While exceeding the inter-rater reliability is an important milestone, it does not represent a ceiling on model performance: each model is compared against the average rating of the subjects, which removes a significant amount of the variance in individual similarity estimates, whereas the inter-rater reliability is the average of the pairwise correlations between subjects. On the WENN dataset, the overlap model (0.62) exceeded the inter-rater reliability (0.47). Similarly, the Vectorspace model (0.67) using a corpus containing both truncated Wikipedia documents and the 50 Lee paragraphs exceeded the inter-rater reliability found for the Lee dataset (0.605).
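Why averaging matters can be illustrated with a small simulation (the numbers below are entirely synthetic, not the study's data): noise in individual ratings partly cancels in the mean, so a model that tracks the underlying similarity can correlate with the mean rating more strongly than raters correlate with one another.

```python
import math
import random
import statistics

def pearson(x, y):
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

random.seed(0)
latent = [random.random() for _ in range(1225)]               # latent pair similarities
raters = [[s + random.gauss(0, 0.5) for s in latent] for _ in range(10)]

# Inter-rater reliability: average pairwise correlation between raters.
pairwise = [pearson(raters[i], raters[j])
            for i in range(10) for j in range(i + 1, 10)]
irr = statistics.fmean(pairwise)

# A "model" that recovers the latent similarity is scored against the mean
# rating, in which the raters' individual noise has partly cancelled out.
mean_ratings = [statistics.fmean(col) for col in zip(*raters)]
model_r = pearson(latent, mean_ratings)
# model_r exceeds irr even though the model adds no extra information.
```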

The Vectorspace model's performance on the Lee dataset using the smaller Wikipedia corpus that included the 50 Lee paragraphs was particularly encouraging. While the overlap model's good performance at these tasks can largely be accounted for by its ability to capture human ratings on paragraph pairs with low or no similarity, the Vectorspace model appeared to provide good estimates of both the similarity and dissimilarity of the Lee paragraphs when compared to human ratings. That said, the Vectorspace model did not perform as well on the WENN dataset as either the overlap model or LSA. However, the finding that no model performed as well as the overlap model on this dataset might indicate that, even though the best results for the WENN dataset were found for most corpus-based models using the WENN corpus (NN-NSL), this corpus still did not provide an adequate term representation for the models. Furthermore, it is possible that a better match to the background knowledge needed by the models for the WENN paragraphs may have been achieved had a more complex Lucene query been used to retrieve relevant Wikipedia documents.
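For concreteness, minimal sketches of the two similarity measures under discussion are given below. The tokenization and weighting are simplifying assumptions of this sketch: the study's overlap model may weight shared words differently, and the full Vectorspace model also applies corpus-based term weighting (e.g., entropy weighting) rather than raw term frequencies.

```python
import math
from collections import Counter

def tokens(text):
    # Naive tokenization: lowercase, whitespace split, alphabetic words only.
    return [w for w in text.lower().split() if w.isalpha()]

def overlap_sim(a, b):
    """Word overlap: proportion of shared word types (Jaccard variant;
    the paper's exact overlap weighting is an assumption here)."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cosine_sim(a, b):
    """Vector space model: cosine between raw term-frequency vectors."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Both measures return 0 for paragraphs sharing no words, which is consistent with the overlap model's strength at capturing low- or no-similarity pairs.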

One possible explanation for the success of the overlap and Vectorspace models in these studies may be found in the design of the experiments. In each experiment, participants made pairwise comparisons of paragraphs displayed on a computer monitor. The side-by-side positioning of these paragraph pairs may have encouraged keyword matching (or discrimination) between the paragraphs: participants may have been skimming the paragraphs for keywords with which to make similarity judgments. A related strategy, which could produce a similar outcome, would be to read one paragraph thoroughly and then skim the comparison paragraph for salient words presented in the first text. Masson (1982) indicates that when skimming, readers miss important details in newswire texts, and that visually distinctive features of text such as place names may increase the efficiency of skimming as a reading strategy. Given that names of people and places were certainly present in all paragraphs presented to participants in this research, commonalities between participants’ similarity estimates (and also those of the overlap and Vectorspace models) may also be influenced by these proper nouns. In future research, eye-tracking technology could be employed to test for skimming strategies in this type of experimental task. Alternatively, paragraphs could be presented serially, rather than concurrently, and the time spent reading each paragraph might act as an indicator of reading strategy.

In the introduction, we categorized the materials used in this class of research as having four types of textual unit: words, sentences, single paragraphs, and chapters or whole documents. Past research has indicated that the Topic Model performs better at word association tasks than LSA. Moreover, researchers have shown that the Topic Model's ability to accommodate homographs is superior to that of other models at the single-word level (Griffiths et al., 2007). While the ability to discriminate the intended meaning of ambiguous words is certainly desirable, it is possible that this attribute is not a prerequisite for successful model performance with larger textual units such as paragraphs. This may be because textual units such as sentences and paragraphs give models access to a range of unambiguous words whose informativeness may compensate for a model's inability to capture the meaning of more ambiguous words (Choueka & Lusignan, 1985; Landauer & Dumais, 1997). In the studies reported above, four models (word overlap, Vectorspace, LSA, and SpNMF) that do not capture this type of word ambiguity all outperformed the Topic Model when compared to human ratings at the task of estimating similarity between paragraphs.

Besides a model's ability to make good approximations at human similarity judgments, another factor that must be considered when evaluating the usefulness of these semantic models is the ability to produce interpretable dimensions. For example, one of the criticisms of LSA is that the dimensions it creates are not always interpretable (Griffiths et al., 2007). Similarly, word overlap, Vectorspace, and CSM do not employ any dimensionality reduction, and thus provide word vectors that are difficult to interpret. In contrast, both SpNMF and Topic Model return interpretable dimensions. To illustrate this point, Table J3 in the Supporting Information file displays a sample of the dimensions created for the 10,000-document Lee-based Wikipedia corpus. As is made clear in this table, it would be easy to meaningfully label any of these dimensions.

Given its generally good approximations of human judgment and its ability to provide interpretable dimensions, SpNMF could be regarded as one of the best models examined in this article. However, its slow compilation of spaces would certainly need to be addressed for it to be generally useful in either a research or an applied setting. Whereas the Vectorspace model takes 24 s to compile a space from the King James Bible on a 2.4 GHz CPU, the SpNMF model is very computationally expensive, taking just under 8 h. Future research may be able to utilize parallel programming techniques18 to distribute SpNMF calculations over multiple processing units, reducing the time needed to compile SpNMF spaces and thus making SpNMF a more feasible model for tasks of this kind.

In several ways, CSM was the worst-performing model employed in this research. All models performed better than CSM when using either the Wikipedia subcorpora or the WENN corpus (NN-NSL). In addition, the matrices contained within CSM spaces can be over two orders of magnitude larger than those compiled by other models. For example, the space compiled by CSM for the 10,000-document Wikipedia-based corpus with documents truncated at 100 words for the Lee dataset was 12 GB in size. In stark contrast, the same corpus compiled by Vectorspace used 84 MB of disk space. While files of this size are not unusable, accessing the dense vectors contained in CSM spaces is slower than retrieving vectors for comparisons using the other models.
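The storage gap largely follows from dense versus sparse representation. A back-of-envelope sketch makes the order of magnitude plain; the vocabulary size and density figures below are invented for illustration, not taken from the paper's spaces.

```python
# Illustrative storage arithmetic: a dense term-by-term matrix (as a model
# holding dense vectors for every term implies) versus a sparse term-by-
# document count matrix in coordinate form (value + row/column indices).
n_terms = 40_000                    # assumed vocabulary size
n_docs = 10_000
avg_nonzero_per_doc = 80            # assumed distinct terms per truncated doc

dense_bytes = n_terms * n_terms * 8                     # float64 entries
sparse_bytes = n_docs * avg_nonzero_per_doc * (8 + 4 + 4)

print(f"dense:  {dense_bytes / 1e9:.1f} GB")
print(f"sparse: {sparse_bytes / 1e6:.1f} MB")
# A gap of roughly three orders of magnitude, consistent in spirit with
# the GB-versus-MB difference reported above.
```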

One of the key strengths of the simple overlap model that performed so well in this research is that it is not reliant on a knowledge base, only extracting information from the surface structure of the textual stimuli. This paper has provided examples of the difficulties researchers face when attempting to create a suitable knowledge base for semantic models. This is not to mention the labor-intensive process undertaken to collect and format a large corpus. Furthermore, the simple overlap model is not without theoretical underpinning. Max Louwerse, in this issue of topiCS, has suggested that ‘‘support for language comprehension and language production is vested neither in the brain of the language user, its computational processes, nor embodied representations, but outside the brain, in language itself.’’ In arguing his claim, Louwerse provides examples of how first-order co-occurrences of terms can produce similar results to LSA on tasks of categorization. Similarly, it could certainly be argued that to some extent the good performance of the overlap model in the studies presented in this paper supports Louwerse's argument.

Overall, dimensionality reduction did not appear to improve the models’ estimates of paragraph similarity when compared to results produced by the Vectorspace and overlap models. However, LSA's consistent performance, its mimicking of human inter-rater reliability, and its better performance on the WENN dataset compared to Vectorspace all indicate that further research must be done in this area. One aspect of this research that we intend to explore more fully is the possibility that different models accommodate the judgment variance of different subsets of participants. For example, participants who skim the paragraphs may produce results more similar to either Vectorspace or the overlap model, whereas participants who read carefully without skimming may produce results more similar to those calculated with LSA. While it is not possible to draw these conclusions with any certainty from our current datasets, eye-tracking technology will be employed in future research to explore these possibilities.

The findings presented in this paper indicate that corpus preprocessing, document length, and content are all important factors that determine a semantic model's ability to estimate human similarity judgments on paragraphs. The online, community-driven Wikipedia encyclopedia also proved to be a valuable resource from which corpora could be derived when a more suitable domain-chosen corpus is not available. In many applications the hand construction of corpora for a particular domain is not feasible, and so the ability to show a good match between human similarity judgments and machine evaluations is a result of applied significance.

Footnotes
  • 1

     A standard stop-list was applied, and only words appearing 10 times or more were included in the final corpus.

  • 2
  • 3

     Dividing by the entropy reduces the impact of high-frequency words that appear in many documents in a corpus.

  • 4
  • 5

     The reader is directed to Martin and Berry (2007) for an example of how to create a term by document matrix for both the Vectorspace model and LSA.

  • 6

     Participants actually compared 25 paragraphs; however, a technical fault made the human comparisons of two paragraphs to the rest of the paragraphs in the set unusable.

  • 7
  • 8

     Two-tailed significance tests (α = 0.05) between nonindependent correlations were performed with Williams’ formula (T2), which is recommended by Steiger (1980).

  • 9

     These techniques were also used in the Lee, Pincombe, and Welsh (2005) study.

  • 10

     Both corpora had already been preprocessed with standard methods: stop-words and punctuation were removed, as were words appearing in only one document.

  • 11

     With the exception of Topic Model using the Jensen-Shannon metric, all models that incorporate dimensionality reduction performed better at 300 dimensions. Topics-JS at 100 topics was 0.29 compared to 0.28 with 300 topics.

  • 12
  • 13

     PyLucene is a Python extension that allows access to the Java version of Lucene: http://pylucene.osafoundation.org/

  • 14

     These titles were not included with the WENN paragraphs when similarity comparisons were made by either humans or the semantic models.

  • 15

     On average, four keywords were chosen per paragraph to form the Lee-based query.

  • 16

     In Web usability research and broad-sheet newspaper media terms, this positioning is often referred to as being ‘‘above the fold.”

  • 17

     In Study Two, LSA similarity estimates correlated 0.48 with human judgments of similarity on WENN document set.

  • 18

     CUDA, the nVidia graphics processing unit technology, presents as an architecture on which these parallel processing gains might be achieved while efficiently using sparse matrices (Bell & Garland, 2008).

Acknowledgments


The research reported in this article was supported by Defence Research & Development Canada (grant number W7711-067985). We would also like to thank Michael Lee and his colleagues for access to their paragraph similarity data. Finally, we wish to thank the reviewers, whose helpful comments have greatly improved this paper.

References

  • Bell, N., & Garland, M. (2008). Efficient sparse matrix-vector multiplication on CUDA (NVIDIA Technical Report No. NVR-2008-004). NVIDIA Corporation. Available at: http://www.nvidia.com/object/nvidia_research_pub_001.html. Accessed on April 10, 2009.
  • Berry, M. W., & Browne, M. (2005). Email surveillance using non-negative matrix factorization. Computational & Mathematical Organization Theory, 11(3), 249–264.
  • Blei, D., Ng, A. Y., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(4–5), 993–1022.
  • Bullinaria, J. A., & Levy, J. P. (2006). Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3), 510–526.
  • Choueka, Y., & Lusignan, S. (1985). Disambiguation by short contexts. Computers and the Humanities, 19(3), 147–157.
  • Foltz, P. W. (2007). Discourse coherence and LSA. In T. K. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of latent semantic analysis (pp. 167–184). Mahwah, NJ: Lawrence Erlbaum Associates.
  • Foltz, P. W., Laham, D., & Landauer, T. K. (1999). The intelligent essay assessor: Applications to educational technology. Interactive Multimedia Electronic Journal of Computer-Enhanced Learning, 1(2). Available at: http://imej.wfu.edu/articles/1999/2/04/index.asp. Accessed on April 2, 2008.
  • Gabrilovich, E., & Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In M. M. Veloso (Ed.), Proceedings of the 20th international joint conference on artificial intelligence (pp. 1606–1611). Menlo Park, CA: AAAI Press.
  • Giles, J. (2005). Internet encyclopaedias go head to head. Nature, 438(7070), 900–901.
  • Griffiths, T. L., & Steyvers, M. (2002). A probabilistic approach to semantic representation. In W. D. Gray & C. D. Schunn (Eds.), Proceedings of the 24th annual conference of the Cognitive Science Society (pp. 381–386). Mahwah, NJ: Lawrence Erlbaum Associates.
  • Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl. 1), 5228–5235.
  • Griffiths, T. L., Tenenbaum, J. B., & Steyvers, M. (2007). Topics in semantic representation. Psychological Review, 114(2), 211–244.
  • Kireyev, K. (2008). Beyond words: Semantic representation of text in distributional models of language. In M. Baroni, S. Evert, & A. Lenci (Eds.), Proceedings of the ESSLLI workshop on distributional lexical semantics: Bridging the gap between semantic theory and computational simulations (pp. 25–33). Hamburg, Germany: ESSLLI.
  • Kwantes, P. J. (2005). Using context to build semantics. Psychonomic Bulletin and Review, 12(4), 703–710.
  • Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211–240.
  • Landauer, T. K., McNamara, D. S., Dennis, S., & Kintsch, W. (2007). Handbook of latent semantic analysis. Mahwah, NJ: Lawrence Erlbaum Associates.
  • Lee, M. D., & Corlett, E. Y. (2003). Sequential sampling models of human text classification. Cognitive Science, 27(2), 159–193.
  • Lee, M. D., Pincombe, B. M., & Welsh, M. B. (2005). An empirical evaluation of models of text document similarity. In B. G. Bara, L. W. Barsalou, & M. Bucciarelli (Eds.), Proceedings of the 27th annual conference of the Cognitive Science Society (pp. 1254–1259). Mahwah, NJ: Lawrence Erlbaum Associates.
  • Martin, D. I., & Berry, M. W. (2007). Mathematical foundations behind Latent Semantic Analysis. In T. K. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of latent semantic analysis (pp. 35–55). Mahwah, NJ: Lawrence Erlbaum Associates.
  • Martin, M. J., & Foltz, P. W. (2004). Automated team discourse annotation and performance prediction using LSA. In S. T. Dumais, D. Marcu, & S. Roukos (Eds.), HLT-NAACL 2004: Short papers (pp. 97–100). Boston, MA: Association for Computational Linguistics.
  • Masson, M. E. J. (1982). Cognitive processes in skimming stories. Journal of Experimental Psychology: Learning, Memory, and Cognition, 8(5), 400–417.
  • McNamara, D. S., Cai, Z., & Louwerse, M. M. (2007). Optimizing LSA measures of cohesion. In T. K. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of latent semantic analysis (pp. 379–400). Mahwah, NJ: Lawrence Erlbaum Associates.
  • Moed, H. F., Glänzel, W., & Schmoch, U. (2004). Handbook of quantitative science and technology research: The use of publication and patent statistics in studies of S&T systems. Secaucus, NJ: Springer-Verlag New York.
  • Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (1998). The University of South Florida word association, rhyme, and word fragment norms. Available at: http://w3.usf.edu/FreeAssociation/. Accessed on February 2, 2009.
  • Pincombe, B. M. (2004). Comparison of human and LSA judgements of pairwise document similarities for a news corpus (Tech. Rep. No. DSTO-RR-0278). Adelaide, Australia: Australian Defence Science and Technology Organisation (DSTO), Intelligence, Surveillance and Reconnaissance Division. Available at: http://hdl.handle.net/1947/3334. Accessed on April 15, 2008.
  • Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620.
  • Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
  • Shashua, A., & Hazan, T. (2005). Non-negative tensor factorization with applications to statistics and computer vision. In L. De Raedt & S. Wrobel (Eds.), Proceedings of the 22nd international conference on machine learning (pp. 792–799). New York: ACM Press.
  • Steiger, J. H. (1980). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87(2), 245–251.
  • Toffler, A. (1973). Future shock. London: Pan.
  • Xu, W., Liu, X., & Gong, Y. (2003). Document clustering based on non-negative matrix factorization. In J. Callan, D. Hawking, A. Smeaton, & C. Clarke (Eds.), Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR ’03) (pp. 267–273). New York: ACM Press.
  • Zelikovitz, S., & Kogan, M. (2006). Using web searches on important words to create background sets for LSI classification. In G. Sutcliffe & R. Goebel (Eds.), Proceedings of the 19th international FLAIRS conference (pp. 598–603). Menlo Park, CA: AAAI Press.

Supporting Information


Comparing Methods for Single Paragraph Similarity Analysis.

Filename                  Format  Size  Description
TOPS_1108_sm_supmat.pdf   PDF     62K   Supporting info item

Please note: Wiley Blackwell is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.