• informetrics

### Abstract

We present a model that describes which fraction of the literature on a certain topic we will find when we use n (n = 1, 2, …) databases. It is a generalization of the theory of discovering usability problems. We prove that, in all practical cases, this fraction is a concave function of n, the number of used databases, thereby explaining some graphs that exist in the literature. We also study limiting features of this fraction for n very high and we characterize the case that we find all literature on a certain topic for n high enough.

### Introduction

The coverage of databases of the literature on a certain topic is an important issue in information retrieval. Only in a few cases will one database cover all the literature on a topic. The smaller the coverage fraction is, the more databases we will have to use in order to cover a certain percentage of the complete literature that exists on this topic. For this purpose we can use, besides field-dedicated databases, general databases such as the Web of Science (WoS).

One database can cover a fraction a (or 100a%) of the existing literature on a certain topic. A second database will in general cover a different fraction, but in this introduction we assume that it also covers a fraction a. However, using both databases will not yield a fraction 2a of the sought literature, since several documents will be common to both databases.

What fraction of the existing literature on a certain topic will be found after the use of n (= 1, 2, 3, …) databases? Suppose here that all databases cover the same fraction a of the literature on the topic (this is not very realistic; extending the theory to different fractions is the subject of this paper, from the second section onward). The argument is as follows. The first database yields an expected¹ fraction a of the existing literature on a certain topic, hence it does not yield the complementary fraction 1 − a of the literature. After using a second database, a fraction $(1 - a)^2$ of the literature is still not found. Indeed, each of the two databases fails to yield a fraction 1 − a of the sought literature, hence for each database 1 − a is the probability of missing a document on the topic. Due to independence, after the use of two databases we have missed a fraction $(1 - a)^2$ of the sought literature. This argument can be repeated for 3, 4, … , n databases, so that after using n databases we have missed a fraction $(1 - a)^n$ of the sought literature and hence we have found a fraction:

$$1 - (1 - a)^n \tag{1}$$

of the sought literature.

This is similar to the following problem: how many users of a certain service (e.g., a library) must be interviewed to find a certain fraction of the usability problems of that service. Similar to the above we can assume that each user can inform us about a fraction a of usability problems. The same argument as above yields a fraction (1) of usability problems after interviewing n users (see, e.g., Nielsen & Landauer, 1993, or http://www.useit.com/alertbox/20000319.html; retrieved on January 5, 2012).

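Requiring (1) to be as high as we wish (e.g., 0.9 or 90%) yields the needed number n of databases to be used (or users to be interviewed):

$$1 - (1 - a)^n \ge 0.9$$

or:

$$(1 - a)^n \le 0.1$$

hence (dividing by $\log(1 - a) < 0$ reverses the inequality):

$$n \ge \frac{\log 0.1}{\log(1 - a)} \tag{2}$$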

where any logarithm can be used.
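As a rough illustration of this calculation, the following Python sketch evaluates the equal-coverage model and finds the smallest n for which a target recall r is reached, using $n \ge \log(1 - r)/\log(1 - a)$. The function names `coverage_equal` and `databases_needed` and the sample values of a are assumptions made for this sketch, not part of the original text.

```python
import math


def coverage_equal(a: float, n: int) -> float:
    """Fraction found after n databases that each cover a fraction a."""
    return 1.0 - (1.0 - a) ** n


def databases_needed(a: float, target: float) -> int:
    """Smallest n for which coverage_equal(a, n) >= target."""
    if not (0.0 < a < 1.0 and 0.0 < target < 1.0):
        raise ValueError("a and target must lie strictly between 0 and 1")
    # log(1 - a) is negative, so dividing by it reverses the inequality.
    n = max(1, math.ceil(math.log(1.0 - target) / math.log(1.0 - a)))
    # Guard against floating-point rounding right at the boundary.
    while n > 1 and coverage_equal(a, n - 1) >= target:
        n -= 1
    while coverage_equal(a, n) < target:
        n += 1
    return n


if __name__ == "__main__":
    # Illustrative coverage fractions (hypothetical values).
    for a in (0.3, 0.5, 0.7):
        n = databases_needed(a, 0.9)
        print(f"a = {a}: n = {n}, coverage = {coverage_equal(a, n):.4f}")
```

The base of the logarithm cancels in the ratio, which is why any logarithm can be used.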

The function (1) is a concavely increasing function of n and its limit (for n going to ∞) is 1, as is readily seen. This is a partial explanation of graphs as in Hood and Wilson (2001); see Figure 1 where several topics (indicated in the figure) are retrieved in 1, 2, 3, … databases and where the graphs indicate the percentage (fraction) of records retrieved after the use of n = 1, 2, 3, … databases.

This partial explanation of Figure 1 is important in information retrieval. It indicates how the recall increases with the number n of used databases. As formula (1) and Figure 1 indicate, to reach a recall close to 1 requires the use of a high number of databases and shows the inefficiency of such searches.

However, the explanation above is partial since we assumed that all databases yield a fixed fraction a of the sought literature. Similarly, it is not realistic to assume that all users of a certain service yield a fixed fraction of usability problems. Hence, both applications need variable fractions per database or per user. This is the topic of our paper. We will, henceforth, use the information retrieval terminology but the application to the detection of usability problems is similar.

In the next section we present the general formula for the fraction of sought literature after the use of n databases. We prove under which conditions we have a concavely increasing curve (as a function of n) and we show that in the case of Figure 1 these conditions are satisfied, hence yielding a complete explanation of these graphs.

In the third section we study limiting properties of this formula for the fraction of sought literature. We give necessary and sufficient conditions for this formula (as a function of n) to go to 1 for n going to ∞. Only in this case can we come as close as we want to retrieving 100% of the sought literature, provided we use enough databases. An example where this is the case and an example where this is not the case are given. The paper closes with some final remarks and suggestions for further research.

### The Fraction of Sought Literature After the Use of n Databases

Let us have a nonspecified number of databases that we can use for retrieving documents on a certain topic. The order in which we use these databases is important in practice but is not specified at this moment. We come back to this issue later on in this section.

We denote by $a_i$ ($0 < a_i < 1$) the expected² fraction of sought documents in database $i = 1, 2, 3, \dots$. Here we assume that when we use database $i$, we can retrieve the complete fraction $a_i$ of sought documents (otherwise the value $a_i$ is reduced, which is not important at this stage). In analogy with the argument yielding formula (1), we now have that, using only database 1, there is a fraction $1 - a_1$ of sought documents that is not retrieved. After using the first two databases we have a fraction $(1 - a_1)(1 - a_2)$ of sought documents that is not retrieved (due to independence). After using the first n databases we hence have a fraction $(1 - a_1)(1 - a_2) \cdots (1 - a_n)$ of sought documents that is not retrieved. Consequently, after using the first n databases we have a fraction:

$$f(a_1, \dots, a_n) = 1 - \prod_{i=1}^{n} (1 - a_i) \tag{3}$$

of sought documents that is retrieved, where $\prod_{i=1}^{n} (1 - a_i)$ denotes the product $(1 - a_1)(1 - a_2) \cdots (1 - a_n)$.

As one of the referees pointed out, the above argument (like the one that proves (1)) is not completely rigorous and can be made precise by considering expected values. This can be done as follows. Let $S$ be the set of the literature on a topic and $A(j)$ be the subset covered by database $j$, $j = 1, 2, \dots$. Denote by $B(j)$ the complement of $A(j)$. For each element $w \in S$ and each $n = 1, 2, \dots$, define the indicator function $I(w, n) = 1$ if $w$ is included in at least one of the first $n$ databases (i.e., if $w \in \bigcup_{j=1}^{n} A(j)$) and $I(w, n) = 0$ otherwise (i.e., if $w \in \bigcap_{j=1}^{n} B(j)$). The function $I$ is a random variable of which we want to know the expected value, that is, the expected proportion of sought documents in the first $n$ databases (as in (3)).

This is:

$$E(I(w, n)) = P\Big(w \in \bigcup_{j=1}^{n} A(j)\Big) = 1 - P\Big(w \in \bigcap_{j=1}^{n} B(j)\Big) = 1 - \prod_{j=1}^{n} P(w \in B(j)) = 1 - \prod_{j=1}^{n} (1 - a_j)$$

by the assumed independence of compilation of different databases. Hence, we recover (3).
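The expected-value argument can be checked numerically. The sketch below compares the product formula $1 - \prod_{i}(1 - a_i)$ with a small Monte Carlo simulation. As a simplifying assumption made only for this sketch, every document is included in database j independently with the same probability $a_j$; the function names, the sample fractions, and the seed are likewise hypothetical.

```python
import math
import random


def coverage(fractions):
    """Expected fraction retrieved: 1 - prod(1 - a_i)."""
    return 1.0 - math.prod(1.0 - a for a in fractions)


def simulate_coverage(fractions, n_documents=20000, seed=0):
    """Monte Carlo estimate of the retrieved fraction.

    Assumption of this sketch: each document enters database j
    independently with probability a_j.
    """
    rng = random.Random(seed)
    found = 0
    for _ in range(n_documents):
        # A document is found if at least one database includes it.
        if any(rng.random() < a for a in fractions):
            found += 1
    return found / n_documents


if __name__ == "__main__":
    fractions = [0.1, 0.2, 0.3]  # illustrative coverage fractions
    print("product formula:", round(coverage(fractions), 4))
    print("simulation     :", round(simulate_coverage(fractions), 4))
```

With independent databases the simulated proportion should settle near the value given by the product formula as the number of simulated documents grows.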

Note: One of the referees remarked that the above model assumes that “every item in the literature has the same chance of being included in a particular database as any other item.” This is not assumed in the above model. It is clear that some sought documents have a higher chance of being included in a database than others. But that does not prevent us from assuming that ai is the (expected) fraction of sought documents in database i. In fact, we simply extend the well-established model (1) of Nielsen and Landauer (1993).

Function (3) generalizes function (1), and it will turn out that it does not always have the property that it increases concavely, nor that it goes to 1 for n going to ∞. The latter problem will be studied in the next section on the limiting properties of $f(a_1, \dots, a_n)$ for n going to ∞; the former property will be studied here. We have the following proposition.

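Proposition 1. The function $f(a_1, \dots, a_n)$ is always increasing and is concave if and only if, for every $n = 2, 3, \dots$

$$a_n - a_{n-1} \le a_{n-1} a_n \tag{4}$$

Proof. The function $f(a_1, \dots, a_n)$ clearly increases (strictly) since $0 < a_n < 1$ for all $n = 1, 2, \dots$. It is concave (in n, with fixed $a_i$-values) if and only if, at each $n = 2, 3, \dots$, we have that:

$$f(a_1, \dots, a_n) - f(a_1, \dots, a_{n-1}) \le f(a_1, \dots, a_{n-1}) - f(a_1, \dots, a_{n-2}) \tag{5}$$

where we define $f(a_1, \dots, a_{n-2}) = 0$ for $n = 2$ (the starting point of f when zero databases are used). But (5) boils down to, for $n = 3, 4, \dots$

$$\prod_{i=1}^{n-1} (1 - a_i) - \prod_{i=1}^{n} (1 - a_i) \le \prod_{i=1}^{n-2} (1 - a_i) - \prod_{i=1}^{n-1} (1 - a_i)$$

or, after dividing by $\prod_{i=1}^{n-2} (1 - a_i) > 0$,

$$(1 - a_{n-1}) - (1 - a_{n-1})(1 - a_n) \le 1 - (1 - a_{n-1})$$

or

$$(1 - a_{n-1})\, a_n \le a_{n-1}$$

from which (4) follows. This condition is also found if $n = 2$ (using that $f(a_1, \dots, a_{n-2}) = 0$). □

There are many cases in which (4) is valid.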

1. Requirement (4) is valid if the sequence $(a_n)_{n=1,2,\dots}$ is decreasing. This is the case in Figure 1: per search, databases are used in decreasing order of their fraction of sought documents (see Hood & Wilson, 2001, p. 1246, search procedure (3)). So Proposition 1 gives a full explanation of the shapes of the curves in Figure 1; the small deviations from concavity in the curves arise because an information retrieval process is a sample of the sought documents.
2. If the $a_i$-values are large (i.e., close to 1), then for every $n = 2, 3, \dots$, $a_n \approx a_{n-1}$ and $a_{n-1} a_n \approx 1$, making (4) valid. Here any order in which the databases are used yields a concave function $f(a_1, \dots, a_n)$. This case will occur often in practice for the following reasons. When trying to retrieve documents on a certain topic one uses only databases in the field of this topic or general databases (such as the WoS). In both cases the fraction of the sought documents in these databases is high. This must be distinguished from the fraction of a database's documents that are sought, which is usually low; the $a_i$-values are the fraction of the sought documents that are in database $i$. This is common sense: a mathematical topic will not be searched in, for example, a medical database and vice versa.
3. There are even cases where the sequence $(a_n)_{n=1,2,\dots}$ is increasing and where (4) is valid. Example: take for all $n = 1, 2, \dots$ Then the sequence $(a_n)_{n=1,2,\dots}$ increases strictly but condition (4) is valid:
since n ≥ 2 in condition (4). Here
a concave function of n. Indeed,
which is decreasing in n and hence $f(a_1, \dots, a_n)$ is concave in n.
4. However, not every increasing sequence $(a_n)_{n=1,2,\dots}$ yields a concave $f(a_1, \dots, a_n)$. Indeed, take $n = 3$, $a_1 = 0.1$, $a_2 = 0.2$, $a_3 = 0.3$. Then condition (4) is not satisfied: $a_3 - a_2 = 0.1 > a_2 a_3 = 0.06$, and indeed $f(a_1) = 0.1$, $f(a_1, a_2) = 0.28$, $f(a_1, a_2, a_3) = 0.496$. Hence $f$ is not concave since $0.496 - 0.28 = 0.216 > 0.28 - 0.1 = 0.18$. In fact, $f$ is even convex in this case.
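The four cases above can be explored numerically. The following Python sketch tests the pairwise condition $a_n - a_{n-1} \le a_{n-1} a_n$ and, independently, checks concavity from the increments of the coverage curve (starting from coverage 0 with zero databases). All function names and sample sequences are assumptions made for this sketch; in particular, the increasing sequence $a_n = n/(n+1)$ is a hypothetical choice for illustration and is not claimed to be the sequence used in the original example.

```python
def coverage_curve(fractions):
    """Return [f_0, f_1, ..., f_n], where f_0 = 0 and
    f_k = 1 - prod_{i<=k} (1 - a_i)."""
    curve, missed = [0.0], 1.0
    for a in fractions:
        missed *= 1.0 - a
        curve.append(1.0 - missed)
    return curve


def satisfies_condition_4(fractions, tol=1e-12):
    """Check a_n - a_{n-1} <= a_{n-1} * a_n for every consecutive pair."""
    return all(b - a <= a * b + tol for a, b in zip(fractions, fractions[1:]))


def is_concave(fractions, tol=1e-12):
    """Check directly that the increments of the coverage curve never grow."""
    curve = coverage_curve(fractions)
    steps = [y - x for x, y in zip(curve, curve[1:])]
    return all(later <= earlier + tol for earlier, later in zip(steps, steps[1:]))


if __name__ == "__main__":
    cases = {
        "decreasing": [0.5, 0.4, 0.3, 0.2],
        "large values, arbitrary order": [0.8, 0.9, 0.85, 0.95],
        "increasing, a_n = n/(n+1) (hypothetical)": [n / (n + 1) for n in range(1, 8)],
        "increasing, 0.1, 0.2, 0.3": [0.1, 0.2, 0.3],
    }
    for name, fractions in cases.items():
        print(name, satisfies_condition_4(fractions), is_concave(fractions))
```

The two checks should agree on every sequence, since the condition on consecutive fractions is equivalent to non-increasing increments of the curve.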

### Limiting Properties of the Function $f(a_1, \dots, a_n)$ for n = 1, 2, …

This is an important issue. More specifically, we are interested in when

$$\lim_{n \to \infty} f(a_1, \dots, a_n) = 1 \tag{6}$$

in other words when

$$\lim_{n \to \infty} \prod_{i=1}^{n} (1 - a_i) = 0 \tag{7}$$

If (6) is the case, then by using sufficiently many databases we can reach (almost) complete coverage of the sought documents. Note that this is the case for all searches in Figure 1 of Hood and Wilson (2001). We will, however, see that (6) (or (7)) is not always valid. In case (6) (or (7)) is not valid, we have that:

$$\lim_{n \to \infty} f(a_1, \dots, a_n) < 1 \tag{8}$$

and in this case, no matter how many databases we search, we will never come close to complete coverage of the sought documents. In the sequel we will give an example of both cases: one where we have (6) and one where we have (8). Note that in the special case (1) we always have (6), which shows that our extension of f to formula (3) has its merits.

First we will give some definitions on convergent or divergent products. They can be found in Apostol (1974, pp. 206–209). We limit our definitions to the case studied here.

Definition 1. Denote by $p_n$ the product

$$p_n = \prod_{i=1}^{n} (1 - a_i) \tag{9}$$

Then we say that this product converges if there exists a number $p \ne 0$ such that $\lim_{n \to \infty} p_n = p$. The number p is then denoted

$$p = \prod_{i=1}^{\infty} (1 - a_i) \tag{10}$$

If $\lim_{n \to \infty} p_n = 0$ we say that the product diverges to 0 (hence the case (7) or (6), the most interesting case since we are able to retrieve most documents on the topic by taking n high enough).

We can give a characterization of convergent or divergent products of the form (10) by quoting a theorem in Apostol (1974, p. 209).

Theorem 1. Since all $a_i$ satisfy $0 < a_i < 1$, we have that the product $\prod_{i=1}^{\infty} (1 - a_i)$ converges if and only if the series $\sum_{i=1}^{\infty} a_i$ converges.

This represents the case (8), where we are not able to come close to a complete coverage of the sought documents (no matter how many databases are used). The condition for complete coverage (as in (6)) is given by the next theorem, which follows immediately from Theorem 1.

Theorem 2. We have that the product $\prod_{i=1}^{\infty} (1 - a_i)$ diverges to 0 (hence where (7) or (6) is valid) if and only if the series $\sum_{i=1}^{\infty} a_i$ diverges.

A divergent series means in practice that the fractions ai must be “large enough” so that each database has “enough” coverage of the sought documents in order to make a complete coverage (6) possible. A convergent series means in practice that the fractions ai are too small, preventing complete coverage. We give an example of each case.

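Example 1. Let $a_i = \frac{1}{i+1}$, $i = 1, 2, \dots$. Hence $\sum_{i=1}^{\infty} a_i$ diverges and, according to Theorem 2, $\prod_{i=1}^{\infty} (1 - a_i) = 0$ (i.e., diverges), so (6) and (7) are valid and complete coverage of the sought documents is (in the limit) possible. We can verify this directly. We have, for every $n = 1, 2, \dots$

$$\prod_{i=1}^{n} (1 - a_i) = \prod_{i=1}^{n} \frac{i}{i+1} = \frac{1}{n+1}$$

So

$$\lim_{n \to \infty} \prod_{i=1}^{n} (1 - a_i) = 0$$

and hence

$$\lim_{n \to \infty} f(a_1, \dots, a_n) = 1$$

Since $f(a_1, \dots, a_n) = 1 - \frac{1}{n+1} = \frac{n}{n+1}$, we can illustrate “how fast” we approach 100% coverage. Take, for example, n = 10 databases; then we can cover $\frac{10}{11} \approx 0.909$, hence more than 90% of the sought documents.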

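Example 2. Let $a_i = \frac{1}{(i+1)^2}$, $i = 1, 2, \dots$. Now we have that $\sum_{i=1}^{\infty} a_i$ is convergent and hence the product $\prod_{i=1}^{\infty} (1 - a_i)$ is convergent (i.e., is ≠ 0). This means that (8) is valid and that we cannot approximate complete coverage of the sought documents. We can here, concretely, calculate what fraction of the sought documents can be covered. We have, for every $n = 1, 2, \dots$

$$\prod_{i=1}^{n} (1 - a_i) = \prod_{i=1}^{n} \frac{i(i+2)}{(i+1)^2} = \frac{n+2}{2(n+1)}$$

and hence $\prod_{i=1}^{\infty} (1 - a_i) = \frac{1}{2}$. This also implies that

$$\lim_{n \to \infty} f(a_1, \dots, a_n) = \frac{1}{2}$$

so that, no matter how many databases we use, at least 50% of the sought documents remain uncovered. This is due to the small coverage of the sought documents by each database $i = 1, 2, \dots$. This example shows the advantage of the general model (3) over the limited model (1), where always $\lim_{n \to \infty} \left(1 - (1 - a)^n\right) = 1$.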

Note that in both examples $f(a_1, \dots, a_n)$ is concavely increasing, by Proposition 1, since the sequence $(a_i)_{i=1,2,\dots}$ decreases.
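The two limiting behaviours can be illustrated with exact rational arithmetic. The sketch below assumes the sequences $a_i = 1/(i+1)$ (divergent series) and $a_i = 1/(i+1)^2$ (convergent series), which are consistent with the values quoted in the examples (more than 90% coverage at n = 10 in the first, never 50% in the second). The function names are hypothetical.

```python
from fractions import Fraction


def coverage_exact(a, n):
    """Exact value of 1 - prod_{i=1}^{n} (1 - a(i)) using rationals."""
    missed = Fraction(1)
    for i in range(1, n + 1):
        missed *= 1 - a(i)
    return 1 - missed


def a_divergent(i):
    """Assumed sequence a_i = 1/(i+1); the series of the a_i diverges."""
    return Fraction(1, i + 1)


def a_convergent(i):
    """Assumed sequence a_i = 1/(i+1)^2; the series of the a_i converges."""
    return Fraction(1, (i + 1) ** 2)


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(
            n,
            float(coverage_exact(a_divergent, n)),
            float(coverage_exact(a_convergent, n)),
        )
```

Under these assumptions the first column of coverages creeps toward 1, while the second stays below one half however large n becomes.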

Remark. Since all $a_i$ satisfy $0 < a_i < 1$, we have that convergence of $\sum_{i=1}^{\infty} a_i$ also means absolute convergence. This also means that the series converges unconditionally, that is, it converges in any order of the databases $i$. More exactly, let π denote any permutation of the natural numbers, that is, a bijection from the natural numbers onto the natural numbers. Then convergence of $\sum_{i=1}^{\infty} a_i$ implies convergence of $\sum_{i=1}^{\infty} a_{\pi(i)}$ (see, e.g., Apostol, 1974, Theorem 8.32, p. 196) and hence, by Theorem 1, the product $\prod_{i=1}^{\infty} (1 - a_{\pi(i)})$ converges (and is equal to $\prod_{i=1}^{\infty} (1 - a_i)$). Similarly, if $\sum_{i=1}^{\infty} a_i$ diverges, then $\sum_{i=1}^{\infty} a_{\pi(i)}$ diverges and hence, by Theorem 2, the product $\prod_{i=1}^{\infty} (1 - a_{\pi(i)})$ diverges (i.e., its value equals 0). This means that the coverage of sought documents, in the limit, is not influenced by the order in which we use the databases. Of course, for every finite $n = 1, 2, \dots$, the values of $f(a_1, \dots, a_n)$ are determined by the order in which the databases are used.
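A finite analogue of this remark is easy to check: for a fixed finite set of databases, the final coverage does not depend on the order (multiplication is commutative), while the intermediate values do. The sketch below uses hypothetical coverage fractions and a hypothetical helper `coverage_curve`; it is only an illustration of the finite case, not of the limit statement itself.

```python
import math


def coverage_curve(fractions):
    """Coverage after 1, 2, ..., n databases used in the given order."""
    curve, missed = [], 1.0
    for a in fractions:
        missed *= 1.0 - a
        curve.append(1.0 - missed)
    return curve


if __name__ == "__main__":
    fractions = [0.05, 0.4, 0.2, 0.6, 0.1]  # illustrative values
    best_first = coverage_curve(sorted(fractions, reverse=True))
    worst_first = coverage_curve(sorted(fractions))
    # Same final coverage, whatever the order.
    print(math.isclose(best_first[-1], worst_first[-1]))
    # Using the best-covering databases first is never worse at any step.
    for n, (b, w) in enumerate(zip(best_first, worst_first), start=1):
        print(n, round(b, 4), round(w, 4))
```

This is why the order matters in practice for every finite n, even though it has no influence on the limit.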

Note: Considering an infinite number of databases is, of course, only a theoretical issue. Yet our results on complete/incomplete coverage (Theorems 1 and 2) yield insight into the finite case where there are n databases (with n a large natural number).

### Conclusions and Suggestions for Further Research

In this paper we studied the topical coverage of multiple databases. We showed that, when $a_i$ ($0 < a_i < 1$) denotes the fraction of the sought documents (on a certain topic) in the $i$th database, after using the first n databases we cover a fraction

$$f(a_1, \dots, a_n) = 1 - \prod_{i=1}^{n} (1 - a_i)$$

of the sought documents on a certain topic. We showed that in most practical cases this function is concavely increasing in n.

We also showed that the limiting case (for n going to ∞) does not always yield a complete coverage of the sought documents. Complete coverage is obtained if and only if the series $\sum_{i=1}^{\infty} a_i$ diverges.

Examples of complete coverage and incomplete coverage are given and we also showed that this is independent of the order in which we use the databases.

We underline that these generalizations of the simple model (1) (originating from the idea of identifying a fraction of the usability problems associated with a given service) are also meaningful to this application. Indeed, it is much more likely that different users will give a different number of usability problems and hence model (1) is not applicable but model (3) and its applications must be used. Further research on this application (which is outside the field of information retrieval) would be interesting.

We would also welcome other new areas of application of this theory. In this context we could think of applications in the area of shopping in more than one supermarket or in the diffusion of information in several documents (e.g., reviews, books, etc.) on a certain topic.

### Acknowledgment

The author thanks an anonymous referee for the advice to consider expected values as interpretation of the fractions ai.

Footnotes

1. From now on we will delete the adjective “expected” and work with these numbers as probabilities; see also the argument in the next section.

2. As in the previous section, we will henceforth delete the adjective “expected” and work with these numbers as probabilities; see also the argument below.