Comment centric news analysis for ranking



Ranking documents to satisfy users' information needs is a challenging task because users' interests with respect to a query are dynamic and change over time. In this paper, I propose a method to estimate a real-time language model of community interest for a given query, and use this model to rank retrieved documents. In this experiment, a collection of news tagged by user comments (using a passage retrieval algorithm) is employed to represent the community. This interest-based ranking differs from traditional relevance-based ranking.


Ranking is a key step in Information Retrieval (IR) systems. All ranking algorithms work to find the most important documents and show them to users at the top of the search results. Generally, existing ranking algorithms measure the importance of the document in the search results in several different ways:

  • a. Similarity or distance score between query and document.

  • b. Probability that the document generated the query, or vice-versa.

  • c. The popularity of the document or web page from an external context, such as citations, page links or user behavior data.

These computations rest on two basic hypotheses. First, the query contains the key information for ranking, which provides hints used to discriminate the important results from the others. Second, some unique features on the document side, such as citations, page links, or blog citations, can help rank the results.

However, these hypotheses have limitations. First, general web search engine queries tend to be short (Jansen et al., 1998; Silverstein et al., 1999), so ranking algorithms can hardly detect the necessary ranking information in the query. Second, some existing web elements, such as links or citations, are biased. For instance, a blog posting may receive a high number of citations for two different reasons: the content is interesting (it should get a high rank), or the blogger is popular in a certain community (the content may be pedestrian and does not deserve a high rank outside the local community). A content-free, citation-based algorithm will favor these postings regardless of which scenario they belong to.

In this paper, I use “community interest” to represent this importance score; namely, I compute the probability that the global community is interested in a specific document for a given query at a given time. Instead of employing a large number of users to judge what is interesting and what is not, I use user-generated news comments, or news tagged by comments, to represent community interest.


Different techniques have been developed for ranking retrieved documents and web pages for a given query, such as the classical vector space model (Salton & Yang, 1973). Over the past decade, the structure of hyperlinked networks has been successfully employed as an indicator for ranking. Related ranking algorithms include PageRank (Page et al., 1998) and HITS (Hypertext Induced Topic Selection) (Kleinberg, 1999), which are based on traditional citation analysis and social network analysis.

User behavior data have also been shown to be an effective indicator for ranking. Fox et al. (2005) studied the relationship between implicit and explicit user data, and built two Bayesian models to correlate different kinds of implicit measures with explicit relevance judgments for individual page visits and entire search sessions. Joachims (2002) employed clickthrough data to learn a ranking function using an SVM, and his work showed that clickthrough data is a significant predictor of user interest for ranking. Similarly, Agichtein et al. (2006) incorporated noisy user behavior data into the search process and used these data to train ranking functions. In these approaches, user interest is partially represented by user behavior data.

In this paper, I focus on representing user interest through the topic distribution over user-generated text data, which is separate from the retrieved results. Instead of using statistics of user behavior data, I rank documents according to the content of a large amount of real-time user-generated text.


Intuitively, ranking retrieved documents involves two parts: the probability that the document is relevant to the query, and the probability that the community is interested in this document at a given time (e.g., now).

equation image

From the classical language modeling perspective, the first part (relevance probability) can be defined as:

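$$P(query \mid doc) = \prod_{t \in query} \left[ (1-\lambda)\, P(t \mid doc) + \lambda\, P(t \mid C) \right] \tag{2}$$

where C is the whole document collection and λ is the smoothing parameter.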

Formula 2 uses Jelinek-Mercer smoothing with a unigram language model, and the top-ranked documents are highly likely to be relevant to the query. The focus of this paper is therefore the second part, namely the probability that the community is interested in a specific document at a given time. Unfortunately, the “community” is an unmeasurable variable, as we can hardly compute the real-time user interest probability for a given document directly.
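The relevance part can be illustrated with a minimal sketch of query likelihood under a unigram model with Jelinek-Mercer smoothing. The function name, the toy documents, and the value λ = 0.5 are assumptions made for illustration only; they are not taken from the experiment.

```python
import math
from collections import Counter

def jm_query_log_likelihood(query_terms, doc_terms, collection_terms, lam=0.5):
    """Log P(query | doc) for a unigram model with Jelinek-Mercer smoothing.

    lam is the (assumed) interpolation weight given to the collection model.
    """
    doc_tf = Counter(doc_terms)
    coll_tf = Counter(collection_terms)
    doc_len = len(doc_terms)
    coll_len = len(collection_terms)
    log_p = 0.0
    for t in query_terms:
        p_doc = doc_tf[t] / doc_len if doc_len else 0.0
        p_coll = coll_tf[t] / coll_len if coll_len else 0.0
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # Term unseen in the whole collection: the query cannot be generated.
            return float("-inf")
        log_p += math.log(p)
    return log_p

# Toy documents (hypothetical), already tokenized and lowercased.
docs = {
    "d1": "obama economy plan economy".split(),
    "d2": "obama iraq troops".split(),
}
collection = [t for terms in docs.values() for t in terms]
scores = {
    name: jm_query_log_likelihood(["economy"], terms, collection)
    for name, terms in docs.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Working in log space avoids underflow for longer queries. Because of the smoothing, a document that does not contain a query term still receives a small non-zero probability from the collection model.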

In this paper, I propose a method that uses user-generated text data to represent real-time community interest. If each user-generated text that contains the target query is viewed as an “agent” of its user, then the collection of these texts can represent community interest in the query. We call this collection CI.

equation image

We can therefore use the likelihood of the document under the estimated language model of CI to compute the community interest probability (equation 3). However, unlike the corpus behind a traditional language model, CI is a continuous, time-sensitive corpus: more recent user-generated text may have a stronger influence on community interest than historical data.


Figure 1. CI data and sub-collections of CI

As Figure 1 shows, if CI is a collection of user-generated text about the target query, it can be divided into equal time-based segments, and each segment (sub-collection) can be viewed as a snapshot of community interest over time.

equation image

In formula 4, the current user-generated text (CI-now) best represents community interest, but historical data (CI-x) also contribute to this representation. The representativeness of CI-x decreases as x increases.
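The idea that older segments count less can be sketched as a weighted mixture over the time segments. The exponential decay used here, the `decay` parameter, and the function names are assumptions chosen for illustration; the text only requires that the weight of CI-x decreases as x increases.

```python
import math

def decay_weights(num_segments, decay=0.5):
    """Weights for CI-now (x = 0), CI-1, CI-2, ..., normalized to sum to 1.

    Exponential decay is an assumed choice, not the paper's stated scheme.
    """
    raw = [decay ** x for x in range(num_segments)]
    total = sum(raw)
    return [w / total for w in raw]

def combine_segment_log_likelihoods(segment_log_likelihoods, decay=0.5):
    """Return log( sum_x w_x * P(doc | CI-x) ), computed stably in log space."""
    weights = decay_weights(len(segment_log_likelihoods), decay)
    terms = [
        math.log(w) + ll
        for w, ll in zip(weights, segment_log_likelihoods)
        if ll != float("-inf")
    ]
    if not terms:
        return float("-inf")
    m = max(terms)
    return m + math.log(sum(math.exp(v - m) for v in terms))

weights = decay_weights(3)
# Illustrative values for log P(doc | CI-now), log P(doc | CI-1), log P(doc | CI-2).
combined = combine_segment_log_likelihoods(
    [math.log(0.2), math.log(0.1), math.log(0.4)]
)
```

With `decay=0.5` and three segments, CI-now receives four times the weight of CI-2, so a document that matches only old discussion is ranked below one that matches the current discussion equally well.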

I use two different approaches to compute the probability that CI-x generates each document: a unigram approach and a topic approach.

For the unigram approach, θCI is estimated by a normalized unigram model, and each term in the document is generated independently from CI under this model, as shown in equation 6, where freq(t,doc) is the term frequency (TF) of t in the document. With Jelinek-Mercer smoothing, the term generation probability is smoothed by the whole CI collection.

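$$P(t \mid CI\text{-}x) = (1-\lambda)\, \frac{freq(t, CI\text{-}x)}{|CI\text{-}x|} + \lambda\, \frac{freq(t, CI)}{|CI|} \tag{5}$$

$$P(doc \mid CI\text{-}x) = \prod_{t \in doc} P(t \mid CI\text{-}x)^{freq(t, doc)} \tag{6}$$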
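A minimal sketch of the unigram approach follows: each document term is generated independently from a segment CI-x, with the segment estimate smoothed by the whole CI collection. The function names, the toy segments, and λ = 0.5 are illustrative assumptions.

```python
import math
from collections import Counter

def smoothed_term_prob(term, segment_tf, segment_len, ci_tf, ci_len, lam=0.5):
    """Jelinek-Mercer estimate of P(t | CI-x), smoothed by the whole CI."""
    p_seg = segment_tf[term] / segment_len if segment_len else 0.0
    p_ci = ci_tf[term] / ci_len if ci_len else 0.0
    return (1 - lam) * p_seg + lam * p_ci

def doc_log_likelihood(doc_terms, segment_terms, ci_terms, lam=0.5):
    """log P(doc | CI-x) = sum over terms of freq(t, doc) * log P(t | CI-x)."""
    segment_tf, ci_tf = Counter(segment_terms), Counter(ci_terms)
    log_p = 0.0
    for term, freq in Counter(doc_terms).items():
        p = smoothed_term_prob(
            term, segment_tf, len(segment_terms), ci_tf, len(ci_terms), lam
        )
        if p == 0.0:
            # Term never appears anywhere in CI.
            return float("-inf")
        log_p += freq * math.log(p)
    return log_p

# Hypothetical segments: current comments discuss the economy, older ones Iraq.
ci_now = "economy jobs economy stimulus".split()
ci_old = "iraq troops iraq withdrawal".split()
ci_all = ci_now + ci_old

doc_economy = "economy stimulus".split()
doc_iraq = "iraq troops".split()

ll_economy = doc_log_likelihood(doc_economy, ci_now, ci_all)
ll_iraq = doc_log_likelihood(doc_iraq, ci_now, ci_all)
```

In this toy setting the economy document is more likely under CI-now than the Iraq document, while the Iraq document still gets a non-zero score through the smoothing term.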

For the topic approach, the LDA algorithm (Blei et al., 2003) replaces the traditional term-document distribution with two distributions: α, the topic-document distribution P(topic|doc), and β, the term (feature)-topic distribution P(t|topic), as Figure 2 shows.


Figure 2. LDA algorithm

The current topics are extracted from CI-now rather than from all CI sub-collections, because the ranking algorithm needs to capture the most up-to-date topics (such as breaking news) rather than all the background topics (“stoptopics”) of the target query. The term generation probability can then be written as:

$$P(t \mid CI\text{-}x) = \sum_{topic} P(t \mid topic)\, P(topic \mid CI\text{-}x) \tag{7}$$
equation image

Equation 7 decomposes the probability that CI-x generates term t into the probability that a topic generates the term and the probability that CI-x generates the topic. Equation 8 computes the probability that CI-x generates a topic, where the current topics are extracted from CI-now by LDA.
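The decomposition over topics can be sketched with hand-written toy distributions. The topic names and all probability values below are hypothetical stand-ins for what LDA would estimate from CI-now; no LDA training is performed here.

```python
def term_prob_via_topics(term, p_term_given_topic, p_topic_given_segment):
    """P(t | CI-x) = sum over topics of P(t | topic) * P(topic | CI-x)."""
    return sum(
        p_term_given_topic[topic].get(term, 0.0) * weight
        for topic, weight in p_topic_given_segment.items()
    )

# Hypothetical term-topic distributions P(t | topic).
p_term_given_topic = {
    "economy": {"economy": 0.6, "jobs": 0.4},
    "iraq": {"iraq": 0.7, "troops": 0.3},
}
# Hypothetical topic weights P(topic | CI-x) for the current and an older segment.
p_topic_given_ci_now = {"economy": 0.8, "iraq": 0.2}
p_topic_given_ci_1 = {"economy": 0.3, "iraq": 0.7}

# Approximately 0.48: the "economy" topic dominates CI-now.
p_now = term_prob_via_topics("economy", p_term_given_topic, p_topic_given_ci_now)
# Approximately 0.18: the same term is less likely under the older segment.
p_old = term_prob_via_topics("economy", p_term_given_topic, p_topic_given_ci_1)
```

The same term receives a different probability in each segment because only the topic weights change, while the term-topic distributions are shared across segments.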


The unsolved problem is how to build a high-quality corpus (CI) to represent community interest. The corpus should have the following characteristics: first, it should be generated by users; second, each document (or posting) in the corpus should carry a timestamp; third, each document should directly target the topic(s) of interest rather than background information about the query.

Blogs are generated directly by users and contain content that reflects their interests, so a blog collection is highly representative of global community interest. However, a blog corpus suffers from low-quality writing in colloquial language. Compared with user-generated text data, daily news is a much higher-quality corpus, but it has two major limitations. First, a news corpus is not generated directly by users. Second, compared with blogs, news contains too much background information. For example:

  • 1. Obama government now focusing on economy…
  • 2. Even with the pressure of economy, Obama have to worry about Iraq…

The first sentence is about the “economy” topic and the second is about “Iraq”. However, the distance between the term distributions of these two sentences (KL divergence) may not be large, because the second sentence mentions “economy” as context for its central topic. This negatively affects the topic extraction algorithm used to compute the community interest language model.
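The effect of shared background terms on the distribution distance can be illustrated with a small sketch. It computes the KL divergence between additively smoothed unigram distributions of the two example sentences, and compares it with the divergence from an unrelated sentence. The third sentence, the smoothing constant, and the function names are assumptions added for illustration.

```python
import math
from collections import Counter

def smoothed_distribution(tokens, vocabulary, alpha=1.0):
    """Additively smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {t: (counts[t] + alpha) / total for t in vocabulary}

def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)

s1 = "obama government now focusing on economy".split()
s2 = "even with the pressure of economy obama have to worry about iraq".split()
# Hypothetical unrelated sentence used only as a point of comparison.
s3 = "heavy rain expected across the coast this weekend".split()

vocab = sorted(set(s1) | set(s2) | set(s3))
p1 = smoothed_distribution(s1, vocab)
p2 = smoothed_distribution(s2, vocab)
p3 = smoothed_distribution(s3, vocab)

kl_12 = kl_divergence(p1, p2)  # economy sentence vs. Iraq sentence
kl_13 = kl_divergence(p1, p3)  # economy sentence vs. unrelated sentence
```

Smoothing is needed because KL divergence is undefined when the second distribution assigns zero probability to a term. In this toy comparison `kl_12` is smaller than `kl_13`, since the shared terms “obama” and “economy” pull the two news sentences together even though their central topics differ.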

This study proposes a comment-centric analysis method to solve these two problems. In most cases, a news comment gives the reader's opinion on a specific part (passage) of a news posting, as Figure 3 illustrates.

Figure 3. Comment centric analysis

I use the user comments to identify the most important parts of the news (those readers are most interested in) by changing the weights of the words and entities they contain; I call this “comment centric tagging”. Identifying the boundaries of the tagged passages is a critical step in this method. Some methods use an arbitrary text window in the document (Kaszkiel & Zobel, 1997), which achieved promising results for retrieval tasks. Alternatively, semantic boundaries can be inferred from shifts of topic (Kaszkiel 1993; Mittendorf 1994; Knaus et al., 1998). In this experiment, I use a passage detection algorithm developed for QA systems (Lee et al., 2001), a density-based algorithm, to identify the sentences in the news data that interest users. The extracted sentences are used to generate the final CI.
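The tagging step can be sketched in a heavily simplified form. The code below is not the algorithm of Lee et al. (2001); it swaps in a plain term-overlap density (the fraction of sentence tokens that also occur in the comments) as an assumed stand-in, and the threshold, function names, and example texts are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def comment_density(sentence, comment_terms):
    """Fraction of sentence tokens that also occur in the comments."""
    tokens = tokenize(sentence)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in comment_terms) / len(tokens)

def tag_sentences(news_sentences, comments, threshold=0.3):
    """Keep the news sentences whose comment density reaches the threshold."""
    comment_terms = {t for c in comments for t in tokenize(c)}
    return [
        s for s in news_sentences
        if comment_density(s, comment_terms) >= threshold
    ]

news = [
    "The senate passed the stimulus bill on Tuesday.",
    "Lawmakers had debated the measure for weeks.",
    "Analysts expect the bill to create new jobs.",
]
comments = ["Will the stimulus bill really create jobs?"]

# Sentences one and three overlap strongly with the comment and are kept.
tagged = tag_sentences(news, comments)
```

No stopword filtering is applied in this sketch, so frequent words such as “the” contribute to the density. A practical version would likely remove stopwords or weight terms, for example by inverse document frequency.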


The goal of community interest language model (CILM) ranking is to capture dynamic community interest for real-world, real-time ranking. Instead of using an existing standard ranking evaluation corpus, such as TREC, a user evaluation will be conducted to examine the performance of the new ranking algorithm. Precision-at-document-n (Anh & Moffat, 2002) and Normalized Discounted Cumulative Gain (NDCG) (Järvelin & Kekäläinen, 2002) are used for this evaluation.
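The two measures can be sketched as follows. The NDCG variant shown uses the commonly used log2(rank + 1) discount, which may differ from the exact formulation used in the evaluation, and the mapping from judgment categories to gain values is a hypothetical choice.

```python
import math

def precision_at_n(gains, n):
    """Fraction of the top-n results with a positive gain."""
    if n <= 0:
        return 0.0
    return sum(1 for g in gains[:n] if g > 0) / n

def dcg_at_n(gains, n):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(
        g / math.log2(rank + 1)
        for rank, g in enumerate(gains[:n], start=1)
    )

def ndcg_at_n(gains, n):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_n(sorted(gains, reverse=True), n)
    return dcg_at_n(gains, n) / ideal if ideal > 0 else 0.0

# Hypothetical mapping from user judgments to gain values.
GAIN = {"interesting": 2, "just OK": 1, "not interesting": 0, "duplicate": 0}

judgments = ["interesting", "not interesting", "just OK", "duplicate", "interesting"]
gains = [GAIN[j] for j in judgments]

p_at_3 = precision_at_n(gains, 3)
ndcg_at_5 = ndcg_at_n(gains, 5)
```

Precision-at-n only records whether each top result is useful at all, whereas NDCG also rewards placing the “interesting” results above the “just OK” ones.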

A group of queries will be chosen for evaluation, and the retrieved documents will be ranked by comment-tagged news CILM, blog-based CILM, and other ranking algorithms. Users will assess the top results and judge each as “interesting”, “just OK”, “not interesting”, or “duplicate”, based on their opinion of the results. The evaluation results across the different algorithms will be compared and analyzed to test (1) whether comment-tagged news can better represent community interest and (2) whether a CILM extracted from real-time comment-tagged news or a blog corpus can help the system improve ranking performance.