Confidence Intervals for Linear Combinations of Poisson Means

Summary

Parametric confidence intervals are given for linear combinations of the means of independent Poisson variables and for their continuous versions. The performance of the intervals is assessed using simulation. A real data set is used to compare the proposed intervals with known ones. The proposed intervals are shown to be superior to known ones and comparable to exact intervals.

1 Introduction

Problems of finding confidence intervals for functions of Poisson means arise naturally in a variety of contexts. Here are five such problems:

Problem 1. Multiple comparisons procedures for Poisson data with application to comparing defects at an electronics shop over different days (Scheaffer 1980), and to investigating the impact on cancer development of several treatments for Hodgkin's disease (Suissa & Salmi 1989).

Problem 2. In Azerbaijan, about 300 structures that could be oil fields are known onshore, and 66 structures are already recognized in the offshore region. Bagirov & Lerche (1998) wanted to know the fraction of these structures that can be expected to yield horizons with commercial value. To find an answer to this problem, they conducted a statistical analysis of data covering the last 100 years of oil production in Azerbaijan. The number of producing horizons per field was described by a linear combination of Poisson random variables.

Problem 3. A demerit rating system is used to simultaneously monitor counts of several different types of defects in a complex product. The demerit statistic is a linear combination of the counts of these different types of defects. The traditional recommendation is to plot the demerit statistic on a control chart with symmetric 3-sigma control limits. Jones, Woodall & Conerly (1999) proposed an alternative method for determining control limits for the demerit control chart, based on the exact distribution of linear combinations of independent Poisson random variables.

Problem 4. A standard method of estimation with applications to chemistry, the study of geothermal bores and other areas involves adding a radioactive isotope and measuring the number of counts before and after addition. We observe independent Poisson counts math formula in math formula seconds with means math formula, where math formula is the math formulath decay rate, math formula is background noise and math formula is background plus signal. A confidence interval is required for math formula, the signal decay rate.

Problem 5. An increase is observed in the per capita rate of first admissions to psychiatric care. Is the increase significant? Let math formula be the (known) population at time math formula. Let math formula be the number first admitted between times math formula and math formula. Assume for math formula small

display math

where math formula is the (unknown) rate at time math formula. Assume that numbers first admitted in different periods are independent. Then math formula is a Poisson process with mean math formula. The problem of significance can be answered if we have a confidence interval for math formula, where math formula, math formula and math formula are the periods being compared and math formula is an appropriate weight function satisfying math formula. Set math formula, so that math formula. If, in fact, math formula is only available at times math formula (for example, annually) for some math formula then math formula is only estimable if math formula can be expressed as the union of one or more intervals math formula) and math formula is chosen to be constant over each such interval.

The first four examples and the constrained form of the fifth example are special cases of finding a confidence interval for

display math(1)

where math formula is the number of Poisson variables or Poisson means involved (assumed known), math formula are the weights which are also assumed known and math formula are the unknown Poisson means. We assume that we have observations on independent Poisson variables with means math formula, math formula. The parameters of (1) are: math formula.

These examples are special cases of the following problem: given a known weight function math formula and an observed Poisson process math formula with unknown mean math formula find a confidence interval for

display math(2)

The problem of finding a confidence interval for

display math(3)

where math formula are observed independent Poisson processes with unknown mean functions math formula and known weight functions math formula, is reducible to (2) since we may combine math formula into a single process.

To the best of our knowledge, only two papers, Stamey & Hamilton (2006) and Krishnamoorthy & Lee (2010), give parametric confidence intervals for (1), and none give parametric confidence intervals for (2) or (3). The confidence intervals in Stamey & Hamilton (2006) are variations of approximations based on the Central Limit Theorem (CLT). The confidence intervals in Krishnamoorthy & Lee (2010) are based on normal and chi-square approximations. These intervals may not be as accurate as those proposed here because they rest on first order approximations, whereas we use tools based on higher order approximations. We have empirical evidence that our intervals provide improved accuracy.

The aim of this paper is to provide accurate parametric confidence intervals for (1), (2) and (3). Section 'Confidence intervals' contains the derivation of these confidence intervals. Section 'Numerical comparisons' assesses the performance of the derived intervals in terms of their widths and coverage probabilities. A part of this assessment is based on simulation. Section 'Data example' demonstrates the importance of the derived intervals using a real data set. Some conclusions are noted in Section 'Conclusions'. Some technical details required for the results in Section 'Confidence intervals' are provided in Appendix I.

2 Confidence intervals

In this section, we apply the results of Withers (1983a, 1983b, 1989) to obtain accurate confidence intervals for

display math

corresponding to situations (1), (2) and (3) in terms of

display math

respectively, for math formula any non-negative real number.

The accurate confidence intervals are constructed by a method of successive approximation starting from a CLT for the studentized statistic, in this case

display math

The method was developed by Welch (1947) for the Behrens–Fisher problem, independently by Winterbottom (1979) and Withers (1982a, 1989) for the general parametric situation, and by Withers (1983b, 1988) for one-sample and multi-sample non-parametric problems. The method rests on the expansions of Cornish & Fisher (1937). For the parametric situation relevant to (1) we use the variation in Appendix I with

display math

where math formula and math formula is a parameter determined below. Then Theorem I.1 in Appendix I holds if math formula and math formula is bounded away from zero. Assuming math formula, these conditions are satisfied for

display math

This is substantially better than math formula, which is what one would normally expect.

Theorem 2.1 provides two-sided and one-sided confidence intervals for math formula of (1). A symmetric version of these intervals is given in Theorem 2.2.

Theorem 2.1. A two-sided confidence interval for math formula of (1) is:

display math(4)

for math formula, where

display math
display math

and math formula and math formula are such that

display math

the nominal level of the confidence interval, and where in turn math formula, math formula, math formula, and math formula is the probability density function of a standard normal variable. One-sided confidence intervals for math formula of (1) are:

display math

for math formula.

Proof. Let math formula. The studentized statistic is

display math

where math formula

Thus, from Withers (1982b, 1983a, 1989) we obtain the result.

Note that math formula estimates the standardized math formulath cumulant of math formula. Note also that math formula and math formula is a function of (math formula, math formula).

Theorem 2.2. A symmetric confidence interval for math formula of (1) is:

display math(5)

for math formula, where math formula, math formula, and math formula.

Proof. The result follows from Withers (1982b).

Note that math formula is an odd function of math formula determined by math formula. Note also that only s1 and s2 are given. Subsequent calculations do not need sk for k > 2.

We shall refer to (4) as the math formulath order confidence interval with respect to math formula. The confidence interval given by (5) is the 2math formulath order confidence interval with respect to the same math formula.

The lengths of the intervals given by (4) for math formula and math formula are the same for math formula. The length of (4) for math formula is greater than that for math formula if and only if math formula for math formula. The length of (5) for math formula is greater than that for math formula if and only if math formula for math formula.

There is no guarantee that math formula and math formula for all math formula. These inequalities may hold for some math formula and may not hold for other math formula, especially because of the discrete nature of the Poisson random variables. Thus, the lengths of the confidence intervals given by (4) and (5) may oscillate with increasing math formula. Eventually, math formula and math formula will diverge, and the lengths will become infinite.

The math formulath order confidence interval of nominal level math formula has actual level math formula as math formula. Thus, the coverage probabilities of (4) and (5) will generally take values closer to the nominal level with increasing math formula and with increasing math formula.

In practice, math formula should be chosen to maximize the coverage probability of the confidence intervals and to keep their lengths as short as possible. In other words, math formula should be chosen large enough that the coverage probability is as large as possible, but not so large that the length diverges. The choice is a trade-off between math formula being too large and math formula being too small. One possible choice is to take math formula as the largest integer for which both math formula and math formula decrease for all math formula.

The CLT confidence intervals correspond to setting math formula in (4) and (5). The confidence intervals of (4) and (5) improve on the CLT versions at least in terms of the coverage probability. The lengths of (4) and (5) may or may not be shorter than those of the CLT versions. This depends on how math formula compares with the rest of the math formulas and on how math formula compares with the rest of the math formulas. Comparisons based on lengths have been considered by several authors. For example, Winterbottom (1979) has shown that for a binomial problem the math formulath order confidence interval can give a significant improvement over the CLT version.

So far we have obtained math formulath order type confidence intervals for (1). Taking limits and using math formula we see that (4) and (5) give math formulath and 2math formulath order type confidence intervals for math formula with respect to math formula in terms of math formula and for math formula with respect to math formula in terms of math formula. In each case, the error is the corresponding limit of the error for (1). Of course, the math formula are assumed to exist.

3 Numerical comparisons

In this section we compare the performance of the two-sided confidence intervals, (4) and (5), for math formula, math formula and math formula, where math formula corresponds to CLT confidence intervals. The comparison is performed partly through exact calculations (see Figures 1 and 2) and partly through simulations (see Figures 3 and 4).

Figure 1. Expected widths of the two-sided intervals, (4), and ±1.96 standard deviation bars for math formula, math formula (solid curve), and math formula (curve of dots). The exact interval for math formula corresponds to the curve of long dashes. The vertical bars are offset for visibility.

Figure 2. Expected widths of the two-sided intervals, (5), and ±1.96 standard deviation bars for math formula, math formula (solid curve), and math formula (curve of dots). The exact interval for math formula corresponds to the curve of long dashes. The vertical bars are offset for visibility.

Figure 3. Simulated coverage probabilities of (4) for math formula, math formula (solid curve), math formula (curve of dashes), math formula (curve of dots), and math formula (curve of dots and dashes). The exact interval for math formula corresponds to the curve of long dashes.

Figure 4. Simulated coverage probabilities of (5) for math formula, math formula (solid curve), and math formula (curve of dashes). The exact interval for math formula corresponds to the curve of long dashes.

We use two criteria to assess performance. The first is the expected width of the confidence interval and its standard deviation as computed from (4) and (5). For example, if math formula then the expected width of the confidence interval for (4) and its standard deviation are math formula and math formula, respectively, where math formula, math formula, math formula, math formula and math formula. The expected width of the confidence interval for (5) for math formula and its standard deviation can be computed using math formula and math formula. Also, if math formula, an exact confidence interval for (1) is

display math(6)

say, with expected width math formula and standard deviation math formula. These can be used to assess the accuracy of expected widths and standard deviations from (4) and (5).
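As a rough illustration of the first criterion, the sketch below computes the expected width and standard deviation of an interval for a single Poisson mean by summing over the Poisson probabilities. It is not the authors' code. It assumes one observed count with unit weight, uses the plain CLT interval x ± z√x as a stand-in for (4), and uses the standard exact interval obtained by inverting the Poisson tail probabilities, which may differ in form from (6). All function names are hypothetical.

```python
import math
from statistics import NormalDist


def poisson_pmf(x, mu):
    # P(X = x) for X ~ Poisson(mu), mu > 0, computed on the log scale
    return math.exp(x * math.log(mu) - mu - math.lgamma(x + 1))


def poisson_cdf(x, mu):
    # P(X <= x) for X ~ Poisson(mu)
    return sum(poisson_pmf(j, mu) for j in range(x + 1))


def exact_interval(x, level=0.95):
    """Exact interval for a Poisson mean by inverting tail probabilities
    (assumed form; not necessarily the paper's equation (6))."""
    alpha = 1.0 - level
    top = x + 10.0 * math.sqrt(x + 1.0) + 20.0  # generous upper bracket

    def solve(f):
        # bisection for the root of an increasing function f on (0, top)
        lo, hi = 0.0, top
        for _ in range(100):
            mid = (lo + hi) / 2.0
            if f(mid) < 0.0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    if x == 0:
        lower = 0.0
    else:
        lower = solve(lambda mu: (1.0 - poisson_cdf(x - 1, mu)) - alpha / 2.0)
    upper = solve(lambda mu: alpha / 2.0 - poisson_cdf(x, mu))
    return lower, upper


def clt_width(x, level=0.95):
    # width of the plain CLT interval x +/- z * sqrt(x)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return 2.0 * z * math.sqrt(x)


def exact_width(x, level=0.95):
    lower, upper = exact_interval(x, level)
    return upper - lower


def expected_width(lam, width_fn):
    """Mean and standard deviation of width_fn(X) for X ~ Poisson(lam),
    truncating the sum far in the upper tail."""
    xmax = int(lam + 12.0 * math.sqrt(lam) + 20.0)
    probs = [poisson_pmf(x, lam) for x in range(xmax + 1)]
    widths = [width_fn(x) for x in range(xmax + 1)]
    mean = sum(p * w for p, w in zip(probs, widths))
    var = sum(p * (w - mean) ** 2 for p, w in zip(probs, widths))
    return mean, math.sqrt(var)


if __name__ == "__main__":
    for name, fn in (("CLT", clt_width), ("exact", exact_width)):
        mean, sd = expected_width(10.0, fn)
        print(f"{name}: expected width {mean:.3f}, sd {sd:.3f}")
```

Summing over the probability mass function avoids simulation noise, which is why the width comparisons can be done exactly while coverage is studied by simulation.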

The second of the two criteria is the coverage probability of (4) and (5) obtained by simulating samples of size 10,000 from

display math

where the math formula are independent Poisson random variables with means math formula, math formula. We considered several choices for the weights:

  • math formula
  • math formula
  • math formula

However, all of these choices yielded similar results. For simplicity we shall report the results only for the last choice of math formula. We shall also assume throughout that math formula for math formula.
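The simulation design described above can be sketched as follows. This is only an illustration under stated assumptions: it estimates the coverage of the plain CLT interval (estimate ± z × plug-in standard error) rather than the higher order intervals, it uses equal weights and equal means chosen purely for the example, and it draws Poisson variates with Knuth's multiplication method, which is adequate for small means. The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def rpois(lam, rng):
    # Knuth's multiplication method for a Poisson(lam) variate
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def clt_interval(weights, counts, level=0.95):
    # point estimate sum(w_i * x_i) with plug-in variance sum(w_i^2 * x_i)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    est = sum(w * x for w, x in zip(weights, counts))
    se = math.sqrt(sum(w * w * x for w, x in zip(weights, counts)))
    return est - z * se, est + z * se


def coverage(weights, means, reps=10_000, level=0.95, seed=0):
    """Fraction of simulated samples whose interval contains the true
    linear combination sum(w_i * lambda_i)."""
    rng = random.Random(seed)
    theta = sum(w * m for w, m in zip(weights, means))
    hits = 0
    for _ in range(reps):
        counts = [rpois(m, rng) for m in means]
        lower, upper = clt_interval(weights, counts, level)
        hits += lower <= theta <= upper
    return hits / reps


if __name__ == "__main__":
    # hypothetical setting: three means of 5 with unit weights
    print(coverage([1.0, 1.0, 1.0], [5.0, 5.0, 5.0]))
```

Replacing `clt_interval` with an implementation of (4) or (5) would give the corresponding coverage estimate under the same design.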

Figures 1 and 2 show how the expected widths for (4) and (5) vary with respect to math formula for math formula, math formula and math formula. The length of the vertical bars shown in these figures is the standard deviation multiplied by math formula. The vertical bars are offset from each other for the purpose of visibility.

Figures 3 and 4 show how the coverage probabilities of (4) and (5) computed by simulation vary with respect to math formula for math formula, math formula and math formula. The expected width, ±1.96 standard deviation bars and the coverage probability of the exact interval in (6) are included for the case math formula.

The expected widths for (4) and their standard deviations are the same for math formula and for math formula. Figure 1 shows the results for (4) only for math formula.

The following conclusions can be drawn from Figures 1 and 2:

  • the expected widths for (4) increase from math formula to math formula for every math formula and math formula,
  • the expected widths for (5) increase from math formula to math formula for every math formula and math formula,
  • the expected widths for both (4) and (5) generally increase with increasing math formula for every math formula and math formula,
  • the expected widths for both (4) and (5) generally decrease with increasing math formula,
  • the width for (4) is sometimes shorter than that of the exact interval for math formula,
  • the standard deviations of the widths for both (4) and (5) do not appear to show any recognizable pattern.

The following conclusions can be drawn from Figures 3 and 4:

  • the coverage probabilities of (4) increase monotonically from math formula to math formula for every math formula and math formula,
  • the coverage probabilities of (5) increase monotonically from math formula to math formula for every math formula and math formula,
  • the coverage probabilities for both (4) and (5) show a general pattern of increase with respect to increasing math formula, especially for small math formula.

It is clear that the intervals given by (4) for math formula are the closest to the exact interval in terms of both the expected width and the coverage probability. It is also clear that the intervals given by (5) for math formula are the closest to the exact interval in terms of both the expected width and the coverage probability. The CLT confidence intervals corresponding to math formula in (4) and (5) perform poorly, especially for small math formula.

The discussion so far has not focussed on the confidence intervals for (2) and (3). However, the confidence intervals (4) and (5) are good approximations for the continuous version for large math formula. Thus, the confidence intervals discussed in Figures 1-4 for math formula and in Section 'Data example' for math formula can be considered to correspond to the continuous versions.

4 Data example

In this section we demonstrate the practical value of the confidence interval (4) using a real data set from Stamey & Hamilton (2006). The data set considered by these authors (see Table 1) contains the number of fatal motor vehicle accidents (FMVA) involving driving while intoxicated (DWI) during six major holidays for the year 2000. The statistics are taken from the Crash Records Bureau of the Texas Department of Public Safety.

Table 1. Number of driving-while-intoxicated-involved fatal motor vehicle accidents during six major holidays in 2000

Holiday           Number of accidents
Memorial Day      0
July 4            5
Labor Day         2
Thanksgiving      11
Christmas         8
New Year's Eve    9

Stamey & Hamilton (2006) were interested in estimating the average number of DWI-involved fatal accidents per holiday, and in whether more such accidents occur during the winter holidays (Thanksgiving, Christmas and New Year's Eve) than during the summer holidays (Memorial Day, July 4 and Labor Day). For the first quantity, math formula for all math formula, and we want to estimate math formula, say. For the second quantity, math formula and math formula, and we want to estimate math formula, say. Using four different methods, Stamey & Hamilton (2006) obtained the following 95 percent confidence intervals: (3.90, 7.77), (3.87, 7.80), (4.31, 8.35) and (4.29, 8.38) for math formula, and (3.13, 10.87), (3.07, 10.93), (2.97, 11.03) and (2.91, 11.09) for math formula.
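For orientation, the plain CLT interval can be computed for the data in Table 1 with a few lines of Python. The weights below are assumptions inferred from the verbal description, since the formulas are not reproduced here: 1/6 for every holiday for the average, and +1/3 for winter holidays and -1/3 for summer holidays for the comparison. The function name is hypothetical, and the sketch does not implement the higher order interval (4).

```python
import math
from statistics import NormalDist

# counts from Table 1
counts = {
    "Memorial Day": 0,
    "July 4": 5,
    "Labor Day": 2,
    "Thanksgiving": 11,
    "Christmas": 8,
    "New Year's Eve": 9,
}
summer = {"Memorial Day", "July 4", "Labor Day"}


def clt_interval(weights, xs, level=0.95):
    # estimate sum(w_i * x_i) with plug-in variance sum(w_i^2 * x_i)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    est = sum(w * x for w, x in zip(weights, xs))
    se = math.sqrt(sum(w * w * x for w, x in zip(weights, xs)))
    return est - z * se, est + z * se


xs = list(counts.values())

# assumed weights: equal weights for the per-holiday average
avg_ci = clt_interval([1.0 / 6.0] * len(xs), xs)

# assumed weights: winter average minus summer average
diff_weights = [-1.0 / 3.0 if h in summer else 1.0 / 3.0 for h in counts]
diff_ci = clt_interval(diff_weights, xs)

print("average:          (%.2f, %.2f)" % avg_ci)
print("winter - summer:  (%.2f, %.2f)" % diff_ci)
```

Under these assumed weights the sketch gives approximately (3.90, 7.77) and (3.13, 10.87), which agrees with the first interval in each group quoted above.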

Using the normal approximation given by equation (11) in Krishnamoorthy & Lee (2010), we obtained the confidence intervals (4.10, 7.96) and (3.10, 10.8) for math formula and math formula, respectively. Using the chi-square approximation given by equation (12) in Krishnamoorthy & Lee (2010), we obtained the confidence intervals (3.95, 7.8) and (3.01, 10.66) for math formula and math formula, respectively.

Using the two-sided confidence interval, (4), with math formula and math formula, we obtained the confidence intervals (4.01, 7.82) and (3.34, 10.95) for math formula and math formula, respectively.

It is clear that our estimates provide the shortest intervals. Each of the intervals due to Stamey & Hamilton (2006) and Krishnamoorthy & Lee (2010) is wider than ours. The intervals due to Krishnamoorthy & Lee (2010) appear shorter than those due to Stamey & Hamilton (2006). Of the two intervals given in Krishnamoorthy & Lee (2010), the one based on chi-square approximation appears shorter. This is in agreement with the findings in Krishnamoorthy & Lee (2010).

The above observations are based on a single data set. Thus, we cannot be sure that the methods of Stamey & Hamilton (2006) and Krishnamoorthy & Lee (2010) overestimate, or that the intervals of Stamey & Hamilton (2006) perform worst, or that the normal approximation method of Krishnamoorthy & Lee (2010) is conservative. A much more comprehensive study would be required to substantiate these findings (if indeed the findings are correct in the first place).

5 Conclusions

We have proposed confidence intervals for linear combinations of Poisson means and for continuous versions of such combinations. To the best of our knowledge, this is the first time intervals for the continuous versions have been proposed. Only two published papers, Stamey & Hamilton (2006) and Krishnamoorthy & Lee (2010), give confidence intervals for linear combinations of Poisson means.

The intervals of Stamey & Hamilton (2006) and Krishnamoorthy & Lee (2010) are based on first order normal approximations and first order chi-square approximations. Our proposed intervals for linear combinations of Poisson means are based on higher order approximations than the CLT. Thus, our intervals can be expected to perform better than the CLT versions and those due to Stamey & Hamilton (2006) and Krishnamoorthy & Lee (2010).

We have performed a simulation study to compare the proposed intervals and the CLT versions in terms of expected widths and coverage probabilities. In this study, we observed that the proposed intervals outperform the CLT versions in terms of coverage probabilities. The proposed intervals appear comparable to exact intervals in terms of both expected widths and coverage probabilities. Sometimes the expected widths of the proposed intervals are shorter than those of exact intervals.

We have also illustrated an application using a data set on numbers of FMVA. In this application we observed that the proposed intervals have shorter lengths than those due to Stamey & Hamilton (2006) and Krishnamoorthy & Lee (2010).

Appendix I

Theorem I.1. Suppose that math formula is a function of math formula, bounded as math formula. Let math formula have derivatives which are bounded with respect to n. Let math formula be an estimate of math formula such that for math formula the math formulath order cumulants of math formula can be expanded as linear combinations of math formula with coefficients functions of math formula bounded with respect to n and such that the leading coefficient of math formula is math formula. Then math formula can be expanded as a linear combination of math formula with coefficients functions bounded with respect to n with the leading coefficients being given in the appendix to Withers (1982a); also the leading coefficient of math formula is math formula.

Proof. Allow math formula in Withers (1982a) to depend on math formula.

Suppose now that math formula and math formula, the leading coefficient in the expansion for var math formula, is bounded away from zero with respect to n. Then, under the conditions of Withers (1988), there exist bounded functions math formula such that

display math

as math formula, where math formula is fixed, math formula, math formula, and math formula is the cumulative distribution function of a unit normal variable.

Acknowledgements

The authors would like to thank the Editor, the Associate Editor and the three referees for careful reading and for their comments which greatly improved the paper.
