Large-scale multiple hypothesis testing in information retrieval: Towards a new approach to document ranking



Information retrieval (IR) may be considered an instance of a common modern statistical problem: a massive simultaneous hypothesis test. Such problems arise often in biostatistics where plentiful data must be winnowed to name a small number of potentially “interesting” cases. For instance, DNA microarray analysis requires researchers to filter thousands of genes, searching for genes implicated in a particular condition. This paper describes a novel approach to IR that is based on the notion of simultaneous hypothesis testing. In this case the test is performed on each document and the null hypothesis is that the document is non-relevant. After a mathematical derivation of the proposed model, we test its performance on three standard data sets against the effectiveness of two baseline IR systems, a vector space model and a language modeling-based system. These preliminary experiments show that the hypothesis testing approach to IR is not only philosophically appealing, but that it also operates at the state of the art in effectiveness.


Recent results in genomics (especially DNA microarray analysis) have brought the problem of massive, simultaneous hypothesis testing to the forefront of the statistical literature [4, 6, 14]. This paper argues that understanding information retrieval (IR) as a hypothesis-testing problem is not only an apt metaphor but also offers novel approaches to document ranking and IR.¹

In DNA microarray analysis, researchers are presented with a potentially huge body of gene expression data. From these expressions, researchers are interested in finding a relatively small number of genes that are “interesting” with respect to a particular condition. This is similar to the IR problem, where a searcher hopes to find a manageable set of documents that are interesting with respect to his or her information need.

Due to the natural sciences' increasing concern with problems such as these, modern statistics has developed a body of literature on large-scale multiple hypothesis testing [4, 6]. To capitalize on this research it is fruitful to consider IR as a simultaneous hypothesis test on each document $d_i$, where the null hypothesis is that $d_i$ is not relevant to the query $q$.

We argue that approaching IR as a large-scale simultaneous hypothesis test constitutes a novel and effective method of ranking documents against a query. After a brief review of large-scale hypothesis testing in Section 2, Section 3 formally derives a method for performing IR in a hypothesis testing framework. In Section 4 we offer preliminary experimental results suggesting that our proposed IR model operates at state-of-the-art levels of accuracy. Finally, Section 5 concludes with an interpretation of our findings and a treatment of planned research to help understand the relationship between IR and the statistical results drawn on throughout the paper.


Information retrieval presents problems that are increasingly common in modern data analysis. Particularly in the area of DNA microarray studies, identifying a small number of “interesting” items in an enormous database has become crucial. This paper argues that IR would be well served by taking account of the sophisticated statistical methods developed for such problems, especially those concerned with multiple simultaneous testing of hypotheses.

For contemporary statistical analysis, massive, sparse data of exceedingly high dimensionality is often the norm. The pervasiveness of massive data sets constitutes a sea change in undertaking statistical analysis [4]. However, in IR outsized, sparse data has been the norm for years. Thus IR researchers have suddenly found themselves on the cutting edge of mathematical statistics.

Nowhere is the change faced by modern statistics more evident than in the development of methodology for analyzing DNA microarrays [1, 2, 5, 6, 8, 9, 11]. In microarray research investigators typically search for a small number of genes that are implicated in the onset of a particular condition. To find these genes, researchers must sift through many thousands of non-interesting (null) gene expressions.

From a statistical standpoint, microarray analysis interrogates each gene g against the null hypothesis H0: g is not significantly, differentially expressed in affected patients. In microarray experiments the number of tests performed is often large, on the order of many thousands. To counter spurious “multiplicity effects,” these hypothesis tests must be conducted with care.
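The multiplicity effect mentioned above can be illustrated with a short calculation. The sketch below assumes, purely for illustration, that all m null hypotheses are true, that the tests are independent, and that each is run at a hypothetical level alpha = 0.05; the function name is ours, not part of the original analysis.

```python
def familywise_error(alpha: float, m: int) -> float:
    """Probability of at least one false rejection among m independent
    tests of true null hypotheses, each conducted at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


# Illustrative values only: the chance of a spurious "discovery" grows
# quickly with the number of simultaneous tests.
for m in (1, 10, 100, 10_000):
    print(m, familywise_error(0.05, m))
```

With a single test the error rate equals alpha, but with even a hundred tests a spurious rejection is nearly certain, which is why naive per-test thresholds are inadequate at the scale of microarray analysis or IR.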

This paper argues that the apparatus developed to exercise such care is also appropriate in the context of IR. As in microarray analysis, IR presents a “needle in the haystack” problem. Given a searcher's query q we should like to find a relatively small number of documents that are relevant. This typically involves searching over a very large body of data, with correspondingly low probability that any one document is relevant.

We may thus understand IR in terms of the following hypothesis testing framework. Given a query q and a document d we test the hypothesis H0: d is not relevant to q. Rejecting the null hypothesis corresponds to the decision to retrieve a document. For a large corpus, the number of simultaneous hypothesis tests will thus be on the order of thousands, millions, or even billions.

Treating IR in a hypothesis testing framework yields two benefits. First, it allows us to use false discovery rate (FDR) theory for IR evaluation [3, 14]. As described in [7], FDR theory gives a firm probabilistic basis for traditional Cranfield-style evaluation. But perhaps more importantly, FDR theory allows us to proceed with performance evaluation even in the absence of relevance judgments.

Having explored the use of FDR theory for IR in [7], this paper focuses on another benefit of the proposed theory: hypothesis testing lends itself to a new method of document ranking. In traditional IR models such as the vector space approach [12, 13], query-document similarity is measured strictly as a function of the words shared between a searcher's query and each document. Modern approaches to IR such as the family of language modeling approaches [10, 15, 16] include reference to a “collection language model,” in many cases yielding a more nuanced estimate of document relevance. In the proposed model we rely explicitly on the question: given a document d, are the query term expressions in d greater than we would expect them to be if d were not relevant? To test this hypothesis, we derive a similarity score based on a modified t-statistic. This is innovative and helpful insofar as it accounts explicitly for the null condition, interrogating the significance of document-query term co-occurrence information.

Ranking Documents in a Hypothesis Testing Framework

We begin with a review of a simple hypothesis test for the equality of means of two populations, X and Y. For simplicity we assume that our samples x and y are of the same size, n. Let $\bar{x}$ be the sample mean and $s_x$ be the sample standard deviation computed from x, with analogous notation for the sample y. The test statistic is

$$t = \frac{\bar{x} - \bar{y}}{\sqrt{(s_x^2 + s_y^2)/n}} \qquad (1)$$

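where the denominator of t is simply the standard error of the difference between the sample means. Under the null hypothesis, and assuming equal population variances, the statistic t follows the t distribution with 2n - 2 degrees of freedom, since each of the two samples has size n.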

During a standard hypothesis test, the distribution of the statistic given in Equation 1 allows us to quantify the likelihood of finding the observed difference between x and y given that the means of X and Y are equal.
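As a concrete illustration of the test reviewed above, the following sketch computes a two-sample t statistic for samples of equal size. It assumes the common equal-sample-size form in which the standard error is the square root of (s_x^2 + s_y^2)/n; the function name and the sample values are hypothetical.

```python
import math
import statistics


def two_sample_t(x, y):
    """Two-sample t statistic for samples of equal size n.

    Assumes the standard error of the difference in means is
    sqrt((s_x^2 + s_y^2) / n), with s^2 the unbiased sample variance.
    """
    if len(x) != len(y):
        raise ValueError("this sketch assumes samples of equal size")
    n = len(x)
    difference = statistics.mean(x) - statistics.mean(y)
    standard_error = math.sqrt(
        (statistics.variance(x) + statistics.variance(y)) / n
    )
    return difference / standard_error


# Hypothetical samples: y is x shifted upward by one unit.
x = [1, 2, 3, 4, 5]
y = [2, 3, 4, 5, 6]
print(two_sample_t(x, y))
```

A large absolute value of the statistic indicates that the observed difference in means would be unlikely if the population means were equal.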

The motivation for the hypothesis testing approach to IR is analogous to the test described above. Instead of the equality of means, however, for IR we are interested in the null hypothesis: H0: document d is not relevant to query q. Instead of comparing means, in the case of IR we will compare evidence for and against relevance.

Let A be an m document by n term matrix. The matrix A may contain word frequencies, weights, etc. For now we assume it contains word frequencies. Let $\mathbf{a}_i$ be the n-vector for the ith document. Also let $\mathbf{q}$ be an n-dimensional query vector corresponding to a query q, with non-zero values corresponding to the query terms. We assume the query vector contains query term frequencies. We first define

$$\mathbf{r}_1 = A\mathbf{q} \qquad (2)$$

which is simply the inner product between the query and each document. The magnitude of $r_{1i}$, the ith element of $\mathbf{r}_1$, quantifies the evidence against the null hypothesis that document i is not relevant to the query. This, of course, is the relevance metric used in the vector space model (VSM), without the VSM's length normalization.
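The inner products described above can be sketched in a few lines. The toy matrix below (five documents, four terms) and the query are invented for illustration, and the function name is ours.

```python
def evidence_against_null(A, q):
    """Inner product of the query vector with each document row of A."""
    return [sum(a_ij * q_j for a_ij, q_j in zip(row, q)) for row in A]


# Hypothetical term-frequency matrix: rows are documents, columns are terms.
A = [
    [3, 0, 1, 0],
    [0, 2, 0, 1],
    [1, 0, 0, 2],
    [0, 1, 0, 0],
    [0, 0, 0, 3],
]
q = [1, 0, 1, 0]  # the query uses the first and third terms once each

r1 = evidence_against_null(A, q)
print(r1)
```

Only documents that share at least one term with the query receive a non-zero score, and longer documents with many occurrences of the query terms score highest.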

However, we can improve our relevance estimate by considering evidence in favor of the null hypothesis. We operationalize this by defining a “null” vector— an n-vector containing the expected frequency of each term in a document without regard for that document's topic. In this paper we define the null vector simply as each term's average document frequency. In matrix notation we have

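$$\bar{\mathbf{a}} = \frac{1}{m} A^{\top}\mathbf{1} \qquad (3)$$

where $\mathbf{1}$ is simply an m-vector of ones, so that the jth element of $\bar{\mathbf{a}}$ is the mean frequency of term j over the m documents.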

Two points are worth noting regarding Equation 3. First, the null vector is query-independent; we need only compute it once for the document collection. Second, under the hypothesis testing framework introduced here, the null condition could be formulated differently. Such formulations will form the basis of future work. The crucial point for the proposed methodology is the presence of a null condition.

Having defined our null vector, we create a “null query” $\mathbf{q}_0$. The null query contains zero for all non-query words. Words that appear in the query are weighted proportionally to their expected frequency in a document under the null condition (i.e. non-relevance). Based on this definition we calculate

$$\mathbf{r}_0 = A\mathbf{q}_0 \qquad (4)$$

Equation 4 quantifies the evidence in favor of the null hypothesis (non-relevance).
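A minimal sketch of the null vector, the null query, and the resulting evidence in favor of the null follows. It assumes that the null vector is the column mean of A and that each query term in the null query is weighted by its query frequency times its mean frequency per document; the latter is one reading of "weighted proportionally" and is an assumption, as are the function names and the toy data.

```python
def null_vector(A):
    """Mean frequency of each term over all documents (column means of A)."""
    m = len(A)
    return [sum(column) / m for column in zip(*A)]


def null_query(q, null_vec):
    """Zero for non-query terms; query terms weighted by their mean
    frequency under the null (an assumed weighting scheme)."""
    return [q_j * n_j for q_j, n_j in zip(q, null_vec)]


def inner_products(A, v):
    """Inner product of v with each document row of A."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]


# Hypothetical term-frequency matrix (documents by terms) and query.
A = [
    [3, 0, 1, 0],
    [0, 2, 0, 1],
    [1, 0, 0, 2],
    [0, 1, 0, 0],
    [0, 0, 0, 3],
]
q = [1, 0, 1, 0]

q0 = null_query(q, null_vector(A))
r0 = inner_products(A, q0)
print(q0, r0)
```

Because the null vector depends only on the collection, it can be computed once and reused for every query; only the masking by query terms is query-specific.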

With these definitions in place, we may state our null hypothesis formally: H0: r1 - r0 = 0. For each document, we compute a statistic to test the hypothesis that it is not relevant to the query. Those documents that have strong evidence against the null hypothesis, we argue, have high likelihood of relevance based on the query.

To derive a bona fide test statistic for the null hypothesis we need to define the standard error of r1 and r0. The squared standard error of the statistic defined in Equation 2 is derived by the sum of squares

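$$s_1^2 = \sum_{i=1}^{m} \left(r_{1i} - \bar{r}_1\right)^2 \qquad (5)$$

where $\bar{r}_1$ is the mean of the elements of $\mathbf{r}_1$ and the sum is taken over the m documents (rows) in A. The sum of squares for the null case, $s_0^2$, is defined analogously to that of Equation 5.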

The test statistic for our null hypothesis will thus be

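$$\mathbf{t} = \frac{\mathbf{r}_1 - \mathbf{r}_0}{\sqrt{s_1^2 + s_0^2}} \qquad (6)$$

The denominator of Equation 6 is simply the pooled standard error, where $s_1^2$ and $s_0^2$ denote the squared standard errors of $\mathbf{r}_1$ and $\mathbf{r}_0$ described above.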

Alternative formulations of our t-statistic are possible, and in future work we shall pursue them. But for now the simple formulation given in Equation 6 is both principled and sufficient for our purposes.

Since the sums of squares in Equation 5 are scalars computed on the query-corpus level, they are constant across all documents for a particular query. For practical purposes, then, we may ignore the denominator and rank documents by the numerator of Equation 6.
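The ranking shortcut described above can be sketched as follows. The code scores each document by the difference between its query evidence and its null-query evidence, then checks that dividing by any positive constant leaves the ordering unchanged. The null-query weighting (query frequency times mean term frequency), the toy data, and the placeholder denominator value are assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def numerator_scores(A, q):
    """Per-document difference between evidence against the null (query
    inner product) and evidence for the null (null-query inner product)."""
    m = len(A)
    null_vec = [sum(column) / m for column in zip(*A)]
    q0 = [q_j * n_j for q_j, n_j in zip(q, null_vec)]  # assumed weighting
    return [dot(row, q) - dot(row, q0) for row in A]


def rank_documents(scores):
    """Document indices in decreasing order of score (ties keep index order)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical term-frequency matrix (documents by terms) and query.
A = [
    [3, 0, 1, 0],
    [0, 2, 0, 1],
    [1, 0, 0, 2],
    [0, 1, 0, 0],
    [0, 0, 0, 3],
]
q = [1, 0, 1, 0]

scores = numerator_scores(A, q)
ranking = rank_documents(scores)

# Any positive, document-independent denominator preserves the ordering.
placeholder_denominator = 2.5  # hypothetical value, not derived from data
scaled = [s / placeholder_denominator for s in scores]
print(ranking, rank_documents(scaled) == ranking)
```

Skipping the denominator avoids computing corpus-level sums of squares at query time, at the cost of scores that are not comparable across queries.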

Each element i of the m-vector t quantifies the evidence against the ith null hypothesis: document i is not relevant to the query. Because the number of tests is high (equal to the number of documents), and because we expect the elements of A not to follow a Gaussian distribution, we cannot go so far as to use our t-statistic to derive a traditional p-value. Instead, evaluation of the probability of the null hypothesis must be approached with more appropriate inferential tools such as the false discovery rate, as discussed in [7]. But for the purposes of ad hoc IR, it suffices to rank documents in decreasing order of their magnitude on t.

The ranking statistic t defined in Equation 6 is similar to other IR models such as the vector space and language modeling approaches in two main senses. In all of the major IR models, evidence of a document's relevance begins by identifying the words it shares with the query. This is also the case in our model, where Equation 2 exerts this influence.

As in the Kullback-Leibler (KL) IR model (a variant of the language modeling framework), the numerator of Equation 6 conditions the query-document similarity against evidence that the document is similar to a null condition. In the KL model we estimate the “distance” between the query model and the “collection model.” In the hypothesis testing approach we analyze the magnitude of the difference between query-document similarity and the similarity between the document and the “null query,” which is quite similar to the notion of a collection model.

The second function of Equation 6's numerator is as a form of length normalization. If a particular document i is long, the ith row of A is apt to have large values. The simple dot product between the query vector and the ith row vector will thus be large (assuming the document and query share terms). Using Equation 2 to rank documents thus favors long documents. However, if document i has large values on the query terms, the magnitude of those terms in Equation 4 will also be high, and so the subtraction in Equation 6's numerator will dampen the effect of document length.

Experimental Evaluation

To test the proposed IR model we compared its performance to two other models on three data sets. Data analysis followed the standard Cranfield model, using precision and recall—especially mean average precision (MAP)—to gauge system effectiveness.
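For reference, mean average precision can be computed as in the sketch below. It uses the usual definition of (non-interpolated) average precision, namely the mean of the precision values at the ranks of the relevant documents; the document identifiers and function names are hypothetical.

```python
def average_precision(ranking, relevant):
    """Average of precision values at the rank of each relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant-set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical rankings and relevance judgments for two queries.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d2", "d1"], {"d1"}),
]
print(mean_average_precision(runs))
```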

Data Sets

We compared IR performance over the three data sets shown in Table 1. Medline and CACM both consist of academic articles' titles and abstracts. Reuters-21578 contains approximately 20,000 newswire articles, each of which is tagged by human editors with topical descriptors. Reuters queries were created from the topic tags assigned to the articles: each topic assigned to at least ten documents was used as a query, giving 55 queries in all. Documents in the Reuters corpus were considered relevant to a query if they were tagged with that query term.
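The construction of the Reuters queries and relevance judgments can be sketched as below. The input format (a mapping from document identifier to its list of topic tags), the function name, and the sample data are assumptions for illustration; only the ten-document threshold comes from the description above.

```python
from collections import defaultdict


def build_topic_queries(doc_topics, min_docs=10):
    """Map each sufficiently frequent topic tag to its set of relevant documents.

    doc_topics: dict of document id -> iterable of topic tags.
    A topic becomes a query only if at least min_docs documents carry it.
    """
    docs_by_topic = defaultdict(set)
    for doc_id, topics in doc_topics.items():
        for topic in topics:
            docs_by_topic[topic].add(doc_id)
    return {
        topic: docs
        for topic, docs in docs_by_topic.items()
        if len(docs) >= min_docs
    }


# Hypothetical miniature collection; a threshold of 2 is used only so that
# the example is small enough to read.
sample = {
    "doc1": ["grain", "wheat"],
    "doc2": ["grain"],
    "doc3": ["crude"],
}
print(build_topic_queries(sample, min_docs=2))
```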

Table 1. Test Collections used for Experimentation
# Docs    # Queries

Baseline Models

To assess the performance of the proposed method we compared its effectiveness to two other methods: a simple vector space model using TF-IDF weighting [12, 13] and a language modeling system using Jelinek-Mercer smoothing [10, 15, 16]. Both of these baseline models have been shown to be effective for ad hoc retrieval, with the language modeling approach constituting the current state of the art.
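To make the language modeling baseline concrete, the sketch below scores a document by query log-likelihood under Jelinek-Mercer smoothing. It assumes the common convention in which the smoothed probability is (1 - lambda) times the document estimate plus lambda times the collection estimate; the function name, the data layout (term-count dictionaries), and the toy counts are hypothetical.

```python
import math


def jelinek_mercer_score(query_terms, doc_counts, collection_counts, lam=0.5):
    """Query log-likelihood under a document model smoothed by linear
    interpolation with the collection model (Jelinek-Mercer smoothing)."""
    doc_length = sum(doc_counts.values())
    collection_length = sum(collection_counts.values())
    score = 0.0
    for term in query_terms:
        p_doc = doc_counts.get(term, 0) / doc_length if doc_length else 0.0
        p_collection = collection_counts.get(term, 0) / collection_length
        p = (1.0 - lam) * p_doc + lam * p_collection
        if p == 0.0:
            return float("-inf")  # term unseen in the entire collection
        score += math.log(p)
    return score


# Hypothetical counts: one document contains the query term, one does not.
collection = {"gene": 2, "cell": 2, "market": 4}
with_term = {"gene": 1, "cell": 1}
without_term = {"market": 2}
print(jelinek_mercer_score(["gene"], with_term, collection))
print(jelinek_mercer_score(["gene"], without_term, collection))
```

The collection term prevents a document from receiving zero probability merely because it lacks one query term, which is the role the collection language model plays in this family of approaches.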

No stemming was applied to documents during indexing, nor was a stoplist used. The TF-IDF model was tested using both normalized document vectors and non-normalized vectors. Our TF-IDF results report runs using the best-performing normalization (normalized for Medline; non-normalized for CACM and Reuters).
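The difference between normalized and non-normalized TF-IDF runs can be illustrated as follows. The weighting used here (raw term frequency times log(N / df), with query terms weighted by IDF alone) is one common variant and is an assumption, since the exact weighting is not specified above; the function name and toy documents are likewise hypothetical.

```python
import math


def tfidf_scores(docs, query_terms, normalize=True):
    """Score each document (a dict of term -> count) against the query.

    Document weights are tf * log(N / df). When normalize is True, each
    score is divided by the Euclidean length of the document vector.
    """
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in doc:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

    scores = []
    for doc in docs:
        weights = {term: tf * idf[term] for term, tf in doc.items()}
        score = sum(weights.get(t, 0.0) * idf.get(t, 0.0) for t in query_terms)
        if normalize:
            length = math.sqrt(sum(w * w for w in weights.values()))
            score = score / length if length else 0.0
        scores.append(score)
    return scores


# Hypothetical collection: a short on-topic document, a long mixed one,
# and an off-topic one.
docs = [
    {"gene": 2, "cell": 1},
    {"gene": 1, "cell": 1, "news": 5, "market": 4},
    {"market": 3},
]
print(tfidf_scores(docs, ["gene"], normalize=True))
print(tfidf_scores(docs, ["gene"], normalize=False))
```

Normalization penalizes the long document more heavily than the raw inner product does, which is one reason the best-performing setting can differ between collections.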


Figure 1 and Table 2 summarize the results of our initial experiments. Each panel of Figure 1 shows precision-recall curves (over eleven recall points) for the models described above on a given data set. From the figure it is clear that the simple TF-IDF model did not perform as well as the language modeling system or the hypothesis testing model. The difference is especially stark on the larger data sets, CACM and Reuters, where TF-IDF's precision-recall curve is consistently below those of the other models.
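Precision-recall curves over eleven recall points are conventionally computed by interpolation: the precision at recall level r is taken to be the maximum precision observed at any recall greater than or equal to r. The sketch below assumes that convention; the function name and sample ranking are hypothetical.

```python
def eleven_point_precision(ranking, relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    if not relevant:
        raise ValueError("at least one relevant document is required")
    observed = []  # (recall, precision) after each retrieved document
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
        observed.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]
    return [
        max((p for r, p in observed if r >= level), default=0.0)
        for level in levels
    ]


# Hypothetical ranking with relevant documents at ranks 1 and 3.
curve = eleven_point_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
print(curve)
```

Averaging these eleven values over all queries for a given model yields one curve of the kind plotted per model and corpus.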

Both Figure 1 and Table 2 suggest that there is very little difference between the language modeling system and our hypothesis testing model with respect to precision and recall: the precision-recall curves of these models are nearly identical.

Figure 1. Precision and Recall for Three Models and Three Corpora

Table 2 shows in boldface the best-performing model for each corpus according to mean average precision (MAP). On CACM and Reuters the hypothesis testing system narrowly outperformed the language modeling system.

Table 2. Mean Average Precision for Three Models and Three Corpora
Model          Medline   CACM    Reuters
Lang. Model    0.469     0.198   0.431
Hyp. Testing   0.460     0.200   0.438

The differences between the language modeling system and the hypothesis testing model were narrow, a fact reinforced by statistical testing. One-sided, paired t-tests on average precision (pairing each query's average precision under the two models) yielded p-values of 0.263, 0.535, and 0.847 for Medline, CACM, and Reuters, respectively. Sign tests on the query-by-query performance yielded similarly high p-values, suggesting that there is no statistically significant difference in average precision on these corpora between the language modeling framework and the hypothesis testing model proposed here.
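The two kinds of paired comparison mentioned above can be sketched with the standard library alone. The code computes the paired t statistic (converting it to a p-value requires the t distribution, which is omitted here) and an exact two-sided sign test that discards ties. The per-query average precision values are invented, and whether the reported sign tests were one- or two-sided is not stated, so the two-sided form is an assumption.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """t statistic for the mean of per-query differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


def sign_test_p_value(a, b):
    """Exact two-sided sign test on paired scores, ignoring ties."""
    wins = sum(1 for x, y in zip(a, b) if x > y)
    losses = sum(1 for x, y in zip(a, b) if x < y)
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-query average precision for two systems.
system_a = [0.5, 0.4, 0.6, 0.3]
system_b = [0.4, 0.5, 0.5, 0.3]
print(paired_t_statistic(system_a, system_b))
print(sign_test_p_value(system_a, system_b))
```

The sign test uses only the direction of each per-query difference, so it is a useful check when the differences are not plausibly Gaussian.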

Discussion and Conclusions

Our experimental results suggest that the hypothesis testing IR model proposed in this paper operates at a level comparable to the state of the art. However, our experiments do not offer the final word on this matter. In future work we propose two experimental changes and several modifications to the model itself.

First, we will compare the hypothesis testing model to other IR models on larger, more heterogeneous data sets (several of the TREC collections, specifically). Though we have no reason to suspect that larger data sets will show defects in the model, more realistic testing will surely point out the model's limitations.

Additionally, we only compared our model to Jelinek-Mercer smoothing with a single mixing parameterization (lambda = 0.5). In future work we will test our method against other IR models. With respect to language modeling systems we will add Dirichlet smoothing to the experiment. We will also test the hypothesis testing method against the standard Okapi probabilistic model.

From a theoretical standpoint, we plan to undertake research to define the elements of Equation 6 (our basic model) more rigorously. In this paper we have taken a simple interpretation of the t-statistic used to rank documents. In upcoming work, however, we expect to improve the method of defining the null condition. With this will come an alternative formulation of the standard error. We suspect that a more sophisticated null will improve the performance of the proposed model.

With these caveats in mind, we conclude by noting that the proposed model appears to operate at a level of accuracy comparable to the best modern IR methods. The influence of considering the null case during retrieval is the hallmark of the proposed model. Accounting for evidence in favor of the null (non-relevance) conditions relevance predictions on the relative frequency of words in the collection and on the length of documents. But perhaps more provocatively, the model as a whole constitutes a novel way of thinking about IR: as a massive simultaneous hypothesis test. As we continue to work on the model (especially with respect to maturing the null condition and its standard error) we plan to pursue the statistical benefits of this perspective.

¹ In other work [7] we show that the hypothesis testing framework developed here also provides a novel means of conducting and interpreting IR experimentation.
