### Abstract

- Top of page
- Abstract
- 1 Introduction
- 2 Motivating example
- 3 Statistical modeling
- 4 Results
- 5 Conclusion and discussion
- Acknowledgements
- References
- Supporting Information

We propose a Bayesian approach to multiple testing in disease mapping. This study was motivated by a real example regarding the mortality rate for lung cancer among males in the Tuscan region (Italy). The data cover the period 1995–1999 for 287 municipalities. We develop a tri-level hierarchical Bayesian model to estimate, for each area, the posterior classification probability, that is, the posterior probability that the municipality belongs to the set of non-divergent areas. We also show the connections of our model with the false discovery rate approach. Posterior classification probabilities are used to explore areas at divergent risk from the reference while controlling for multiple testing. We consider both the Poisson-Gamma and the Besag, York and Mollié model to account for extra-Poisson variability in our Bayesian formulation. Posterior inference on classification probabilities is highly dependent on the choice of the prior. We perform a sensitivity analysis and suggest how to rely on subject-specific information to derive informative *a priori* distributions. Hierarchical Bayesian models provide a sensible way to model classification probabilities in the context of disease mapping.

### 1 Introduction

Generally speaking, epidemiological surveillance consists of continuously gathering and analyzing data for changes in disease occurrence (Last, 2001). Surveillance may be based on time, on space, or on a combination of the two, with an active or passive approach. Disease mapping, *i.e.* the study of the variability of disease occurrence over space, is a cornerstone of epidemiologic surveillance. Currently, the availability of data on a small scale makes it popular to scan for abnormal disease rates potentially associated with widespread environmental exposures or to search for a localized cluster of cases in proximity of putative sources of pollution (Elliott *et al.*, 2000). In disease mapping, a moderate to large number of area-level relative risks are considered. However, because population density varies widely among small areas, smaller *p*-values can paradoxically be associated with relative risk estimates closer to the null. Such inconsistency justified the development of shrinkage estimators (Clayton and Kaldor, 1987). Shrinkage estimators, such as empirical Bayes or full Bayes, are now accepted as standard tools in spatial epidemiology, but they leave the multiple comparison problem unresolved.
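The paradox of small *p*-values paired with estimates close to the null can be illustrated with a minimal sketch. The two areas and their counts below are hypothetical, chosen only for illustration, and the one-sided exact Poisson test against a relative risk of one is an assumption of the sketch, not the procedure used in this article.

```python
import math


def poisson_upper_tail(observed: int, expected: float) -> float:
    """One-sided p-value P(Y >= observed) for Y ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    # Add up P(Y = k) for k < observed, computing each term in log space
    # so that large counts do not overflow.
    lower = sum(
        math.exp(k * math.log(expected) - expected - math.lgamma(k + 1))
        for k in range(observed)
    )
    return max(0.0, 1.0 - lower)


# Hypothetical areas (illustrative numbers, not the Tuscan data).
areas = {
    "large municipality": (1100, 1000.0),  # (observed, expected deaths)
    "small municipality": (5, 2.0),
}

for name, (observed, expected) in areas.items():
    smr = observed / expected  # crude relative risk estimate
    p_value = poisson_upper_tail(observed, expected)
    print(f"{name}: SMR = {smr:.2f}, one-sided p = {p_value:.4f}")
```

Under these assumed counts, the large municipality has the estimate closer to one (1.10 against 2.50) yet the smaller *p*-value, simply because its expected count is much larger.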

Control of the Family-Wise Error Rate (FWER), that is, a global control of type I error, is generally pursued in the surveillance framework (Frisén, 2003; Kulldorff, 2001). In his 2007 article, Rolka discussed the cost in sensitivity of adopting an FWER control procedure and mentioned control of the False Discovery Rate (FDR). The FDR is the expected proportion of false positives among all rejected hypotheses and was introduced with examples in the context of clinical trials by Benjamini and Hochberg (1995). The FDR has a Bayesian interpretation and is connected to the *q*-value, a Bayesian alternative to the *p*-value (Storey, 2003).
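For readers unfamiliar with FDR control, the step-up procedure of Benjamini and Hochberg (1995) can be sketched as follows. The function name `benjamini_hochberg` and the list of *p*-values are hypothetical and serve only as an illustration; this is a sketch of the general technique, not of the analysis carried out in this article.

```python
def benjamini_hochberg(p_values, q):
    """Flag the hypotheses rejected by the step-up procedure at FDR level q.

    Returns a list of booleans in the same order as p_values.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= k * q / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for ten areas (illustrative only).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
for level in (0.05, 0.20):
    n_rejected = sum(benjamini_hochberg(p_values, q=level))
    print(f"FDR level {level:.2f}: {n_rejected} areas flagged")
```

Because the cutoff is the largest rank that satisfies the inequality, a hypothesis can be rejected even when its own *p*-value exceeds its rank-specific bound, provided a later rank meets it.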

In the disease mapping literature, the posterior probability that an area has a risk higher than a predefined threshold, computed after specifying an appropriate hierarchical Bayesian model, was suggested as a way to screen areas at higher/lower risk (Bernardinelli and Montomoli, 1992; Richardson *et al.*, 2004). This is not sufficient to ensure that the posterior inference adjusts for multiple testing. To accomplish this task, the probability model needs to include a null prior and related hyperparameters that define the prior probability mass for non-*divergent* areas (Scott and Berger, 2006). In this article, we consider a two-sided alternative hypothesis and use the term *divergent* to denote areas at risk different from the null. This meaning of the word divergent was used by Ohlssen *et al.* (2007).

#### 1.1 Aim of the study

This article aims to develop a hierarchical Bayesian modeling approach to multiple testing in the context of disease mapping. We use an FDR approach instead of FWER control because the erroneous rejection of the null hypothesis for some municipalities does not challenge the result of the whole descriptive analysis, whose aim is to assess heterogeneity of risk in the entire study region. Therefore, FWER control is too strict for the application's needs (Benjamini, 2009).

In the following analysis, a tri-level hierarchical Bayesian model is proposed to estimate for each area the probability of belonging to the null, to be used to explore areas at divergent risk (higher or lower than the reference disease rate) while controlling for multiple testing. We took advantage of real data regarding the mortality rate due to lung cancer in males at the municipal level in the Tuscan region (Italy) during the period 1995–1999.

In Section 2, we describe the mortality data. In Section 3, we briefly introduce the problem of multiple comparisons; we then describe the proposed hierarchical Bayesian models for disease mapping and how to estimate posterior classification probabilities. The results are presented in Section 4. The conclusion and discussion follow in Section 5.

### 5 Conclusion and discussion

To date, no work has been done on FDR and disease mapping. Our work on this topic started in 2006 (Catelan *et al.*, 2006) and was motivated by an environmental epidemiological investigation on 18 high-risk areas and more than 30 disease codes. It is not surprising that the problem of multiple testing has not been considered in disease mapping until now, because Bayesian approaches to smoothing relative risk estimates may be misinterpreted as a solution to the problem. Bayesian estimators are superior to maximum likelihood ones when considering the whole set of estimates (earlier work was empirical Bayesian, see Clayton and Kaldor, 1987; Efron and Morris, 1973). Other epidemiological applications on long lists of relative risks appeared in Greenland and Robins (1991) and Carpenter *et al.* (1997).

However, estimation is a different task from testing. Indeed, the inferential goals may be different. With regard to estimating relative risks, their distribution, and their ranks, Shen and Louis (1998) showed that there is no single best procedure. Moreover, paraphrasing Müller *et al.* (2007), testing multiple hypotheses requires, first, that the probability model include a positive prior probability of the null for each observation and, second, that the model include hyperparameters for the null prior.

A full Bayesian model that adjusts for multiple testing consists of a tri-level hierarchical Bayesian model that allows one to estimate the posterior probability of belonging to the set of the null or alternative hypotheses. The advantage of using a model-based approach over a simpler frequentist one is twofold. First, if the model is well specified, there is a gain in power. In our application, the tri-level spatially structured hierarchical Bayesian model detected many more areas as divergent, leaving the impression of greater power for such a specification. But the reader must remember that in the real application, we have no way of telling if an area called divergent at a given threshold is in fact truly divergent. The second advantage of the Bayesian approach is that it easily allows one to perform sensitivity analysis on model assumptions (see for example Briggs *et al.*, 2006) and, in the multiple testing framework, on prior belief of the null (Westfall *et al.*, 1997). Bayesian modeling also combines risk estimation and multiple testing. The connection between posterior classification probabilities and *p*-values is discussed in the Bayesian literature (Casella and Berger, 1987; Bayarri and Berger, 2000).
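To fix ideas, a generic tri-level mixture of the Poisson-Gamma type can be sketched as below. The notation is an assumption made here for illustration and may differ from that of Section 3: $y_i$ and $E_i$ denote observed and expected deaths in area $i$, $\theta_i$ the relative risk, $r_i$ an indicator of divergence, $a$ and $b$ the Gamma parameters under the alternative, and $\pi$ the prior probability of the null with a Beta($c$,$d$) hyperprior.

```latex
\begin{aligned}
y_i \mid \theta_i &\sim \mathrm{Poisson}(E_i\,\theta_i), \\
\theta_i \mid r_i = 0 &: \quad \theta_i = 1 \quad \text{(null: non-divergent area)}, \\
\theta_i \mid r_i = 1 &\sim \mathrm{Gamma}(a, b) \quad \text{(alternative: divergent area)}, \\
r_i \mid \pi &\sim \mathrm{Bernoulli}(1 - \pi), \\
\pi &\sim \mathrm{Beta}(c, d).
\end{aligned}
```

In this sketch, the posterior classification probability for area $i$ is $\Pr(r_i = 0 \mid y_1, \dots, y_n)$, and the third level (the hyperprior on $\pi$) is the component that lets the prior mass on the null be informed by the data.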

Work related to our hierarchical modeling is found in Jones *et al.* (2008) and Ohlssen *et al.* (2007). Unlike us, they considered one-sided tests to avoid confusion when flagging health-care providers as good or bad. Their approach aims to build prediction limits around the funnel plot, analogous to what is shown in Fig. 1(B). Ohlssen *et al.* (2007) developed a hierarchical Bayesian model for the null hypothesis, aiming to obtain cross-validation predictive *p*-values. Interestingly, they distinguished between (i) estimating effects within a random effect model, (ii) testing a simple null model using *p*-values and (iii) estimating the posterior probability of coming from the null within a mixture model. This last point is what we addressed in the present article.

A further point regards the use of a transformation of observed/expected ratios (Grigg *et al.*, 2009). However, we developed a full probabilistic model that did not rely on asymptotics. Some of these details can be found in a simulation study by Catelan and Biggeri (2010), where, in the usual disease mapping framework and specifying a Poisson likelihood, the assumptions underlying the Bayesian interpretation of the *q*-value were checked.

There is a strong sensitivity to the hyperprior choice for π. We rely on subject-specific information to derive informative distributions for π, as shown in the sensitivity analysis. We conclude that the naive choice of a uniform distribution for π is strongly expected to be in error (see Fig. 7).

Note that we adopted the strategy of letting the data themselves choose π, but we specified an appropriate informative hyperprior. Storey (2003) and also Efron (2005) applied an empirical Bayesian approach to estimate π. Concerns were raised about such an approach in applications not related to omics (Jones *et al.*, 2008). These authors noted that the amount of information about π in the data could be scant whenever the number of tests is not large enough (more than one thousand). At least in the disease-mapping context, where the number of tests is between one hundred and one thousand, our sensitivity analysis shows that eliciting an appropriate informative hyperprior for π is important.

This point deserves further explanation. The Benjamini–Hochberg procedure is robust under a variety of conditions. Storey's *q*-value procedures have been shown to be anticonservative when the number of tests is not high, there is lack of independence, and the proportion of true nulls approaches one (Benjamini *et al.*, 2006; Dudoit *et al.*, 2008). In our application both procedures gave identical results (eleven areas at the 20% threshold). When specifying a Beta(*c*,*d*) hyperprior with expected value of 0.97 for the probability of the null, the tri-level Bayesian hierarchical Poisson-Gamma model also gave almost equal results (ten areas at the 20% threshold). The interpretation of such results is that the implied prior belief of the null in the frequentist procedure is close to that modeled in the tri-level Bayesian hierarchical Poisson-Gamma model.
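As a brief worked note on the hyperprior, the expected value of 0.97 constrains only the ratio of the two Beta parameters, not their magnitude. The values of *c* and *d* used in the analysis are not restated in this section; the pair in the last line below is purely hypothetical and shows how one choice of $c + d$ would pin them down.

```latex
\begin{aligned}
\mathrm{E}[\pi] &= \frac{c}{c + d} = 0.97
  \;\Longrightarrow\; \frac{c}{d} = \frac{0.97}{0.03} \approx 32.3, \\
\mathrm{Var}[\pi] &= \frac{c\,d}{(c + d)^2\,(c + d + 1)}
  = \frac{0.97 \times 0.03}{c + d + 1}, \\
\text{e.g. (hypothetical)}\quad c + d &= 100 \;\Longrightarrow\; c = 97,\; d = 3,\;
  \mathrm{Var}[\pi] = \frac{0.0291}{101} \approx 2.9 \times 10^{-4}.
\end{aligned}
```

The sum $c + d$ therefore acts as a measure of how concentrated the prior belief about the null is around 0.97, which is one of the quantities a sensitivity analysis can vary.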

Bayesian approaches are based on model and prior assumptions. We applied two models (Poisson-Gamma and BYM), with and without spatially structured random terms, which may help the reader to appraise different data summaries with different etiological clues (see Elliott *et al.*, 2000 and Lawson *et al.*, 1999).

We show a Bayesian approach to multiple testing in the disease-mapping context. It should not be viewed as an antagonist to the classical Benjamini–Hochberg procedure or its extensions. On the contrary, given some prior information and assumptions, Bayesian analysis summarizes the empirical information and uncertainty. If you believe in a spatially structured model, then you would prefer the BYM assumptions. A discussion about how to run a simulation study for comparing Bayesian models is given in Lawson *et al.* (2000). This large simulation study documented the robustness of the Poisson-Gamma and BYM models. We provided examples in which the BYM model may suffer from the same inaccuracies as other less flexible spatially structured models (Biggeri *et al.*, 2000; 2003). Further work is necessary to extend these results to inclusion probabilities from the proposed tri-level models.

Dependency was modeled through a conditional autoregressive (CAR) process under the alternative hypothesis, by specifying a BYM model. In our application, this led to identifying a larger number of divergent areas. We should be prudent about interpreting this result. The BYM model may be closer to the true data generating model than the alternative Poisson-Gamma model, but spatially structured models may also be anticonservative, even if the BYM model has been shown to behave quite well in large simulation studies (Lawson *et al.*, 2000). We do not want to stress this point, because the reader may be led to be overconfident about a given model. We prefer to show the advantage of the Bayesian approach in performing a sensitivity analysis on model assumptions and on prior belief of the null. Future developments should address FDR-adjusted inference and decision analysis.

Inference on relative risks of selected areas after viewing the data is a natural complement to investigations such as those discussed in our article. Benjamini and Yekutieli (2005) considered simultaneous and selective inference. Yekutieli (2009) presents a Bayesian framework for providing inference for selected parameters. Our model is an example of a random effect model, because the relative risk for a given area is treated as a random quantity rather than fixed. The reason is that in this particular application, area relative risk depends on the prevalence profile of risk factors and on the composition of susceptible individuals of the area population. Prevalence and susceptibility do vary and cannot be assumed as fixed because the population of a given area is a dynamic cohort. As stated by Yekutieli (2009), under a random effect model, Bayesian inference is unaffected by selection. However, simultaneous and selective inference in the disease-mapping context is an interesting topic for future developments in our work.

“Finding posterior distributions of parameters is only part of the Bayesian solution. The remainder involves decision analysis: … (this) means considering the ramifications of various decisions explicitly in terms of loss functions” (Berry and Hochberg, 1999). We did not pursue this task in the present article (for a recent contribution see for example Guindani *et al.*, 2009). Here, as for the explorative purposes usually undertaken in disease mapping, we used posterior classification probabilities without any explicit cut-off in the same spirit as Jones *et al.* (2008).

Hierarchical Bayesian models provide a sensible framework to model classification probabilities in the context of disease mapping and, broadly speaking, we recommend FDR-like procedures when exploring divergent areas in disease mapping.