Causal decomposition maps: An exploratory tool for designing area‐level interventions aimed at reducing health disparities

Methods for decomposition analyses have been developed to partition between‐group differences into explained and unexplained portions. In this paper, we introduce the concept of causal decomposition maps, which allow researchers to test the effect of area‐level interventions on disease maps before implementation. These maps quantify the impact of interventions that aim to reduce differences in health outcomes between groups and illustrate how the disease map might change under different interventions. We adapt a new causal decomposition analysis method for the disease mapping context. Through the specification of a Bayesian hierarchical outcome model, we obtain counterfactual small area estimates of age‐adjusted rates and reliable estimates of decomposition quantities. We present two formulations of the outcome model, with the second allowing for spatial interference of the intervention. Our method is utilized to determine whether the addition of gyms in different sets of rural ZIP codes could reduce any of the rural–urban difference in age‐adjusted colorectal cancer incidence rates in Iowa ZIP codes.

female), occupation (employed vs. unemployed), or location of residence (rural vs. urban).Researchers in the areas of economics, epidemiology, and demography have developed methods for performing analyses known as decomposition (Andreev et al., 2002;Blinder, 1973;Jackson & VanderWeele, 2018;Oaxaca, 1973;Sudharsanan & Bijlsma, 2021) in order to describe the causes of disparities.A primary goal of these decomposition analyses is to partition the between-group differences into what can be considered as explained (accounted for by mediator variables whose distributions differ across groups) and unexplained (Jann, 2008;Sudharsanan & Bijlsma, 2021).Understanding drivers of these group differences can help policymakers and public health professionals implement interventions aimed at reducing these disparities.
In this paper, we present an adaptation of a decomposition analysis method by Sudharsanan & Bijlsma (2021) to test area-level interventions using data that are readily available for small areas.A "small area" can be defined as any region or group for which the sample is too small to obtain direct estimates with a reasonable level of precision (Rao & Molina, 2015).Here, we are specifically interested in data from small areal units, such as ZIP codes, counties, and census tracts.Using our proposed decomposition analysis method, we develop decomposition maps that allow researchers to test the effect of various interventions on a disease map and to quantify the impact of interventions on reducing the differences in health outcomes between groups of small areas.

Motivating application
Numerous studies have described differences in cancer incidence in rural versus urban populations (e.g., Moss et al., 2017;Pagedar et al., 2019;Zahnd et al., 2018).Zahnd et al. (2018) found that age-adjusted incidence rates for certain cancers with modifiable risk factors, including colorectal cancer, are higher in the population of rural residents as compared to the population of urban residents in the United States.The development of causal decomposition maps arose from the need to determine which variables may explain rural-urban differences in age-adjusted colorectal cancer incidence rates in Iowa's 934 residential ZIP code tabulation areas (which we will refer to as ZIP codes, for simplicity) and to identify feasible interventions to reduce this gap in health outcomes.Decomposition maps can serve as a valuable tool to public health professionals and local governments who want to justify which ZIP codes they would like to allocate limited resources to in order to address rural-urban cancer differences.Because physical activity is a known protective factor for colorectal cancer (Mahmood et al., 2017), and gym availability differs between rural and urban areas in Iowa, we aimed to explore whether the addition of gyms in rural areas could reduce the rural-urban difference in age-adjusted colorectal cancer incidence rates in Iowa ZIP codes.Public park and recreation agencies and local governments have limited resources to pursue bonds or capital improvement programs to fund the development of new recreational/gym facilities.Thus, salient questions include: 1. Which areas should be targeted for the development of new recreational/gym facilities? 2. How much will the rural-urban difference in age-adjusted colorectal cancer incidence rates decrease with the addition of gyms to different sets of ZIP codes? 3. What might the new disease map look like under these interventions?

Decomposition method by Sudharsanan & Bijlsma (2021)
The concept of a decomposition analysis is well described in the paper by Sudharsanan & Bijlsma (2021).They included a figure, much like Figure 1, to further articulate the objective of a decomposition analysis.Assume that there is an observed association between a group variable (A) and an outcome variable (R).Furthermore, assume that the distribution of some potential mediator variable (M) differs between groups.By performing a decomposition analysis, researchers aim to determine whether altering M reduces the observed difference in R between groups.Because the relationship between M and R may be confounded, it is important to include any observed covariates (C) that confound the mediator-outcome relationship in the decomposition analysis.The results of such an analysis can help in the design of interventions that target the mediator variable.Sudharsanan & Bijlsma (2021) utilized the counterfactual outcomes framework and the g-computation technique from causal inference.Broadly, their approach is based on the idea of computing contrasts of summary outcome measures between groups under different mediator scenarios using predicted counterfactual outcomes.First, they computed the summary measure of interest for each group in a "natural-course pseudo-population."This pseudo-population represents Next, they computed a contrast of the summary measures for each group in a "counterfactual pseudo-population."This pseudo-population includes the predicted outcome values that would have occurred based on some realistic intervention on one or both groups' mediator distributions.Using the fitted outcome and mediator models, they determined what the mediator values would have been under the chosen intervention and then plugged these values into the outcome model to predict what the outcomes would have been in this counterfactual scenario.Comparing the contrasts in the natural-course and counterfactual pseudo-populations allows researchers to examine how much a certain mediator variable contributes to explaining the observed difference between groups.Note that Sudharsanan & Bijlsma (2021) selected the intervention of equalizing the mediator distribution between groups, and so the authors specified a mediator model.However, researchers might be interested in intervening on certain individuals' mediator values; in this case, a mediator model would not need to be specified.
There is an important difference between this method for decomposition analysis and methods for performing mediation analyses, despite the use of similar terminology.While causal mediation analyses require that there is no unmeasured confounding in the group-outcome, group-mediator, and mediator-outcome relationships (VanderWeele & Vansteelandt, 2014), this decomposition analysis method makes fewer confounding assumptions (Sudharsanan & Bijlsma, 2021).It does not require adjustment for confounders of the group-mediator or group-outcome relationships.This is because it aims to describe the observed (and not necessarily causal) relationship between the group variable and the outcome variable.For this reason, the group label is always held as fixed at its observed value and only the mediator value is altered when computing the decomposition contrasts.By fixing the group label, researchers do not need to address the contested issue known as "no manipulation, no causation" (see Glymour & Spiegelman, 2017) in which contrasts are computed using counterfactual outcomes based on inconceivable scenarios, like conceptualizing if an individual were a different race.

Contributions
We introduce a flexible adaptation of the causal decomposition method by Sudharsanan & Bijlsma (2021) that can be used to perform a decomposition analysis and create decomposition maps with data from small areas.Our methodology is novel in the following ways.First, we obtain estimates of the decomposition quantities based on summary measures of the disease map rather than at the individual level.Our method can be used to test interventions that target different sets of regions.We extend the approach of Sudharsanan & Bijlsma (2021) using a Bayesian framework rather than a frequentist framework.Like Sudharsanan & Bijlsma (2021), we take a counterfactual outcomes approach to the decomposition analysis, but we obtain counterfactual small area estimates of age-adjusted rates (AARs) with stable summary measures of the decomposition quantities of interest using highly disaggregated area-level data.This allows researchers to make use of the many data sources that are available for small areas that are typically unavailable at the individual level.We also provide the counterfactual notation, causal assumptions, and outcome model specification under two assumptions about the intervention.First, we assume there is no spatial interference (i.e., there is no spillover effect of the intervention on nearby regions' AARs) and then we relax this assumption to allow for spillover effects of interventions on neighboring regions.
The remainder of this paper is organized as follows.In Section 2, we provide an overview of disease mapping and specify a model that can be used to estimate AARs for small areas.In Section 3, we introduce our proposed method, its assumptions, and its implementation.In Section 4, we return to our motivating application.We present the results of our decomposition analysis and illustrate decomposition maps under two different ZIP code-level interventions.Finally, we conclude with a discussion on the strengths and limitations of our proposed method in Section 5.

BACKGROUND ON DISEASE MAPPING
When working with ZIP code or county-level health outcome data, outcome measures calculated directly from the data can be subject to substantial variability (Shaddick & Zidek, 2015).For example, incidence rates are frequently used by epidemiologists to compare new disease cases across regions.Incidence rates are computed by dividing the number of cases by the population size in that region and then multiplying that quantity by 100,000 people.When the denominator is small, say that there are 1000 people in a given area, the difference between zero cases, one case, and two cases from year to year equates to a difference in the rate of 0, 100, and 200 per 100,000 people.Note, though, that the difference in one case may simply be due to sampling variability.The issue of sampling variability becomes even more apparent when computing AARs for small areas.AARs are calculated by stratifying the cases for each region by age group, calculating age group-specific rates for each region, and then taking a weighted average of those rates using weights from a standard population.By performing age adjustment, the confounding effect of age in an analysis is reduced (Klein & Schoenborn, 2001).Therefore, AARs are often used by epidemiologists to compare disease incidence when a disease is highly age dependent.Dividing the data by age group, however, results in even smaller denominators.
There is a vast literature focusing on small area estimation methods that aim to create reliable local disease risk estimates while assuming there is a positive spatial correlation between nearby regions.This broad area is known as disease mapping (Waller & Carlin, 2010).Bayesian model-based approaches that incorporate spatial and uncorrelated random effects have been developed for this purpose (see Lawson, 2018;Lawson & Lee, 2017;Lee, 2011;Rao & Molina, 2015;Shaddick & Zidek, 2015).Use of small area estimation techniques greatly reduces the variability in the risk measures of interest, allowing researchers to make inferences from data collected at a highly disaggregated area level.We thus propose embedding Bayesian small area estimation techniques into the decomposition analysis and focus on detecting changes in the estimated disease map.

Small area estimation model for AARs
Our decomposition method incorporates a Bayesian hierarchical model that modifies the Besag-York-Mollié model (Besag et al., 1991) Here, we assume that the case counts are Poisson-distributed with rate parameter  , .Intuitively,  , * 100, 000 represents the underlying incidence rate per 100,000 people (or person-years) for the population of individuals in age group  within region .
The log of each rate parameter is modeled with an age group-specific intercept,   , and  is a vector containing the effect for each area-level covariate in the model.Noninformative prior distributions are assigned to  = [ 1 , … ,   ] ′ and .Drawing from the Besag-York-Mollié (Besag et al., 1991)  , where   is a common precision term.We specify weakly informative prior distributions on all of the standard deviation terms, per the recommendation of Gelman (2006).Additionally, we allow for spatial smoothing of the AARs by including a spatial random effect, , for each region.These random effects account for the expected spatial correlation of the counts and allow for more stable estimates of AARs for the small geographic regions under study.We apply an intrinsic conditional autoregressive (ICAR) prior proposed by Besag et al. (1991) ′ such that  ∼ ICAR(  ).This prior assumes that the expected value of each region's spatial random effect is the mean of its neighbors' spatial random effects.The corresponding precision of the spatial random effect depends on both the number of regions that are considered neighbors and a global precision parameter.Let ℎ  denote the number of neighbors for region  and  ∼  represent the indices of all regions that neighbor region .The ICAR prior can be expressed as An underlying AAR,   , for region  can then be estimated from the modeled underlying age-group-specific rates using the formula where   reflects the proportion of individuals in age group  in a selected standard population.The underlying AARs are simply the expectation of the observed AARs under the specified outcome model.(  |  ,1 , … ,  , ) =   , since the observed AARs are computed as

METHODOLOGY
In this section, we describe our flexible Bayesian approach that extends the work of Sudharsanan & Bijlsma (2021) for computing decomposition contrasts and creating decomposition maps under a proposed intervention.

Counterfactual notation for AARs without interference between regions
First, we will define observed values in the dataset.Let   represent the observed group label for region , where   ∈ {0, 1}.Furthermore, let   represent the observed mediator.Uppercase notation  , and   will denote the observed incident cases and AARs, respectively. ′  is a vector of mediator-AAR confounder variables that are observed for region .Counterfactual outcomes represent the outcomes that would have been observed in a certain, often unobservable setting.Let  , (  , ) denote the incident case count for age group  within region  that would have been observed when holding the group label at its observed level but setting the potential mediator to level .To illustrate our method, we assume that  ∈ {0, 1}, but  does not need to be binary.Similarly, let   (  , ) = ∑  =1   *  , (  ,) , * 100, 000 denote the AAR that would have been observed when holding the group label for region  at the observed level but setting the mediator to level .The natural-course AAR is therefore denoted as   (  ,   ), whereas   (  , m ) is the counterfactual AAR when the mediator value for region  is set to its value under the proposed intervention, m .Collectively, m = [ m1 , … , m ] ′ denotes the vector of all mediator values set under the proposed intervention.Note that for some regions, m may be equal to the observed mediator value, M, whereas the mediator value will change in the regions that are intervened on.Finally, we consider counterfactual outcomes corresponding to underlying AARs.Recall that   (  , ) may be subject to substantial variability when working with data from small areal units that have small population sizes.For this reason, we consider the underlying risk surface of modeled AARs, noting that the observed AAR represents one realization from this risk surface.Let   (  ,   ) denote the natural-course underlying AAR for region  when holding the group label and mediator variable at their observed values.Under the proposed intervention,   (  , m ) denotes the counterfactual underlying AAR for region .

Counterfactual notation for AARs with interference between regions
In some data applications, researchers might expect spillover effects from interventions.For example, adding a gym in one ZIP code might have an effect on the physical activity levels in nearby ZIP codes, especially if there are no gyms in the nearby ZIP codes.In Section 3.1, we implicitly assumed that an intervention in a region has no impact on the other regions' AARs.Forastiere et al. (2021) proposed treatment effect estimands using a modified counterfactual notation in the presence of interference between neighboring units in a network.To allow for spillover effects and spatial interference of the interventions on the AARs, we adopt a similar counterfactual notation and similar assumptions to those proposed by Forastiere et al. (2021).
Let  ∼ represent a vector of length ℎ  of the observed mediator values of each of region 's neighbors, where  ∼  corresponds to a neighbor relationship.Neighbors can be defined in many ways with the most straightforward definition being that a neighbor must share a boundary or point with region .Let   denote a random variable for region  derived from a specified neighborhood function ( ∼ ).One example of the neighborhood function is . This function would take on a value of 1 if at least one neighbor has a mediator value of 1 and 0 otherwise.
Counterfactual outcomes now depend on a region's observed group label, its mediator value, and its value of the neighborhood function.Let  , (  ,   ,   ) represent the natural-course incident case count for region  and age group  under the assumption that there is spatial interference between a region and its neighbors.On the other hand,  , (  , m , g ) denotes the counterfactual incident case count for region  and age group  under the proposed intervention.Here, g is the value of the neighborhood function recomputed based on m.So, g = ( m∼ ).It follows that   (  ,   ,   ) and   (  , m , g ) represent the natural-course and counterfactual AARs, respectively, and that   (  ,   ,   ) and   (  , m , g ) denote the natural-course and counterfactual underlying AARs, respectively.Note that when spatial interference is absent,   (  ,   ,   ) =   (  ,   ).Therefore, in the interest of generality, we define all contrasts in terms of A, M, and G in the remainder of this paper.

Causal assumptions
For a causal interpretation, our decomposition analysis method has the following underlying assumptions: 1. Stable unit treatment on neighborhood value assumption (SUTNVA).
ij (  , , ) =  ij if   were observed to be  and   were observed to be  and consequently   (  , , ) =   if   were observed to be  and   were observed to be .This assumption has two parts.First, there are not multiple versions of the intervention.Next, since  is a function of only neighboring regions, there is no interference between a region and other regions that are not considered neighbors.The neighborhood function must also be correctly specified.This assumption is further detailed by Forastiere et al. (2021).2. Conditional exchangeability.

Positivity.
The mediator distribution has the same support in both groups.Furthermore, the Bayesian version of the positivity assumption requires additional assumptions so that the normalizing constant does not include a 0 in the denominator.This assumption is further detailed by Keil et al. (2018).

Correct specification of the outcome model.
Given that we present a decomposition analysis approach involving a parametric model, the results hinge on the correct specification of the outcome model.If this model is incorrectly specified, the decomposition analysis could fail to illustrate the true impact of an area-level intervention.One assumption encoded in this model is that the effects of the mediator and the confounders are constant across age groups.

Contrasts of interest
In our method, decomposition contrasts are computed using the natural-course and counterfactual underlying AARs to obtain stable estimates of the natural-course difference and intervention effects.First, we estimate the AAR difference (AARD) in the natural-course pseudo-population.The natural-course pseudo-population is comprised of all AARs estimated using the true group labels, mediator values, and neighborhood function values: (, , ) = [ 1 ( 1 ,  1 ,  1 ), … ,   (  ,   ,   )] ′ .Let  1 be the number of regions with group label 1 and  0 be the number of regions with group label 0. Furthermore, let (  = ) be the indicator function that takes on the value of 1 if   is observed to be  and 0 otherwise.The AARD in the natural course pseudo-population is defined as and represents a summary measure of the AAR difference for regions with group label 1 and group label 0 in the study region.
Next, we compute the AARD under some intervention of interest on the mediator.We define a counterfactual pseudopopulation based on our proposed intervention and compute the average difference in the underlying AARs of the regions with group label 1 and group label 0 in this pseudo-population.The counterfactual pseudo-population is comprised of the vector of counterfactual underlying AARs, (, m, g) = [ 1 ( 1 , m1 , g1 ), … ,   (  , m , g )] ′ .The AARD in the counterfactual pseudo-population is computed as The reduction in the AARD after setting the mediator vector to m can be computed by subtracting AAR count from AAR nat .Thus, AAR red is Finally, a measure of the relative reduction in AARD attributed to the intervention, or percent reduction, can be computed by dividing AAR red by AAR nat and multiplying this quantity by 100.

Bayesian hierarchical outcome model
We specify a Bayesian hierarchical outcome model to be used in our decomposition analysis.The outcome model relates the mediator, group label, and confounders of the mediator-AAR relationship to the incident case counts for each age group within each region.Assuming no spatial interference, one form of the outcome model is This model specification is identical to Model 1 when setting Under the assumption that there is some spatial interference of the mediators, the outcome model can instead be specified as This model specification assumes that the spatial interference may vary as a function of a region's mediator value.For example, this may make sense when assessing the effect of adding gyms to neighboring ZIP codes when a ZIP code already has one.We might expect the effect of the new gym intervention on AARs to differ for ZIP codes that have gyms and for ZIP codes that do not have gyms.

Algorithm and computation of decomposition effects
We use a modified version of the Bayesian g-computation technique from causal inference to obtain the marginal distribution of the natural-course and counterfactual underlying AARs.By drawing posterior samples from these distributions, the posterior distribution of the decomposition effects of interest can be computed.Several studies have implemented a Bayesian version of the g-computation technique (Gao & Albert, 2019;Keil et al., 2018;Park & Kaplan, 2015).Keil et al. (2018) introduced a flexible Bayesian adaptation of the g-formula for estimating causal treatment effects using the posterior predictive distribution of the counterfactual outcomes.Rather than using posterior predictions of the incident case counts, we draw values from the posterior distribution of the underlying AARs to obtain more stable estimates of the decomposition quantities.Let  denote the set of observed data values used to fit the outcome model.We denote the collection of parameters in the outcome model as  = {, , , ,   ,   } and ( | ) correspond to the posterior distribution of the parameters in the outcome model.
We implement our proposed decomposition approach using the INLA (Integrated Nested Laplace Approximations) package (Rue et al., 2009) in R. INLA uses integrated nested Laplace approximations to approximate the marginal posterior distributions of parameters.It can also be used to generate samples from the approximate joint posterior distribution.Given that the proposed decomposition approach can become computationally expensive with the addition of regions and age groups, using the INLA software greatly reduces computation time compared to Markov Chain Monte Carlo methods.
After fitting the outcome model in INLA, we draw posterior samples of each parameter using the inla.posterior.samplefunction.We use the following algorithm to obtain the posterior distributions of AAR nat , AAR count , and AAR red : 1. Fit the outcome model to the observed data to obtain ( | ). 2. Generate  posterior draws of  from the outcome model.Denote each draw as  () where  ∈ 1, … , . ) .6. Use the samples of the counterfactual underlying AARs along with the formulas below to obtain  posterior samples of the posterior AAR nat , AAR count , and AAR red .Use the mean of the  samples to obtain a point estimate for each effect.Use the 2.5th percentile and 97.5th percentile of the  samples, or alternatively a highest posterior density approach, to obtain a 95% credible interval for each effect.The th sample for each of these quantities is calculated as

Creation of decomposition maps
We define decomposition maps as a set of four or more maps: the first is a natural-course AAR disease map, with the natural-course mediator distribution displayed below it.The remaining maps depict the counterfactual underlying AAR estimates under one or more proposed interventions, with the mediator distributions corresponding to these interventions displayed below.To create these decomposition disease maps, the posterior mean of the corresponding underlying AAR for each region can be mapped.Alongside these maps, it is important to display the decomposition effects: AAR nat , AAR count , and AAR red for each intervention.Like other causal inference estimands, the average of all of the individual regions' AARDs can be identified, but the individual differences in counterfactual outcomes cannot be identified due to the unknown correlations between each regions' counterfactual outcomes.The map visualizations are therefore an exploratory tool, and formal statistical inferences should be made based on the AARDs or other summary measures included on the decomposition maps.

APPLICATION TO RURAL-URBAN CANCER DIFFERENCES IN IOWA
We applied our proposed decomposition method to determine whether adding new gyms to rural ZIP codes is associated with a reduction in the rural-urban difference in colorectal cancer incidence AARs in Iowa.We also constructed decomposition maps to provide a visual display of the changes in the disease map under a proposed intervention.The graph depicted in Figure 2 displays the hypothesized relationships between each ZIP code's rural-urban status, gym availability, and colorectal cancer incidence AAR.

Dataset and characteristics
In our analysis, rural-urban status was the ZIP code characteristic of interest or group variable (A).We defined each ZIP code as being either rural or urban based on its 2013 rural-urban commuting area code (USDA Economic Research Service, 2020) using Categorization D on the Rural Health Research Center's website (Rural Health Research Center, 2022).Figure 3 displays the rural-urban classification of each ZIP code in our analysis.In total, the state of Iowa has 293 urban ZIP codes and 641 rural ZIP codes.Note that because rural-urban status is a regional characteristic that is not easily manipulable, a decomposition analysis is naturally suited to answer this research question.Gym availability was our potential mediator variable (M).Gym availability was defined as the presence of at least one open fitness or recreational sports center in a given ZIP code in 2014, and these data came from the National Neighborhood Data Archive (Finlay et al., 2020).Approximately 36.2% of urban ZIP codes had at least one gym whereas only 23.2% of rural ZIP codes had at least one gym.This natural-course gym distribution is mapped in Figure 5. Table 1 also illustrates that there are both rural and urban ZIP codes with and without gym access.This table provides evidence that the positivity assumption is reasonable in this application.Colorectal cancer incidence AAR was our ZIP code-level outcome variable (R).Finally, we adjusted for median household income (representing ZIP code-level socioeconomic status), percent unemployed, and percent white (representing race at the ZIP code level) in our analysis (C), as we believed they may confound the relationship between gym access and colorectal cancer incidence at the ZIP code level.The confounder variables were obtained from the 2014 American Community Survey (U.S. Census Bureau, 2015) using the tidycensus package (Walker & Herman, 2021) in R.
This analysis included the total number of colorectal cancer diagnoses for each of Iowa's ZIP codes between 2014 and 2018, as documented by the Iowa Cancer Registry.To model cancer AARs, we divided the cancer incident case counts for each of the 934 ZIP codes into six age groups: < 40, 40-49, 50-59, 60-69, 70-79, and 80+.Population sizes in each ZIP codeby-age strata were obtained from the 2018 American Community Survey (U.S. Census Bureau, 2019) using the tidycensus package (Walker & Herman, 2021) in R.
Population sizes and case counts specific to each age group and ZIP code in this dataset were quite small.The average population size in the 5604 ZIP code-by-age strata was 559.2 with 43.2% of the population sizes under 100.Approximately 1.5% of the 5604 strata had population sizes of 0, which would have led to one or more undefined age group-specific rates within the affected ZIP codes.We set these population sizes to be 1 and included these ZIP codes in our analysis, as they provided useful information and helped contribute to the smoothing of rates in other age categories.Furthermore, approximately 57.9% of the 5604 strata had colorectal cancer case counts of 0 and 93.6% of the counts were under 5. Based on these data characteristics, small area estimation techniques that borrow strength from neighboring ZIP codes were necessary in order to make precise and useful inferences about area-level interventions.

Outcome models and proposed interventions
We fit two outcome models for the underlying AARs: Model 1 assumed there was no spatial interference, and Model 2 assumed there was spatial interference.In Model 2, we defined the neighborhood function as   = ( ∼ ) = ([ ∑ ∼   ] > 0) for region .Thus, if any neighboring ZIP codes had a gym,   = 1 and otherwise   = 0.The prior distributions for each model coefficient were set to be Normal(0, 0.001), and the prior distributions on the standard deviations of the random effects, , were weakly informative half-Cauchy(5) distributions.For both models, we computed estimates and credible intervals using 10,000 samples from the approximate joint posterior distribution.The models were fit using INLA version 22.12.16 and R version 4.2.0.
In addition to fitting two models, we tested two possible interventions per model.Interventions focused on adding gyms to rural ZIP codes due to the clear imbalance between available gyms in rural and urban areas of the state.The proposed interventions are detailed in Table 2. Broadly, Intervention 1 added 12 new gyms to rural Iowa ZIP codes with high natural-course AARs.Intervention 2 assumed there were more financial resources available to allocate to building gyms and added 52 new gyms to rural ZIP codes with high natural-course AARs.

Results
Figure 4 displays the posterior distributions of the average natural-course AAR for the rural ZIP codes and for the urban ZIP codes computed under Model 1.Note that there is very little overlap in the two distributions, indicating a clear difference in rural and urban colorectal cancer incidence AARs in Iowa.The posterior mean of the average natural-course AAR in the rural ZIP codes was 47.6 per 100,000 person-years with a 95% credible interval of (45.7, 49.5).The point estimate for the average natural-course AAR in the urban ZIP codes was 44.0 with a 95% credible interval of (42.1, 46.0) per 100,000 person-years and was thus lower than the average natural-course AAR for the rural ZIP codes.Under Model 2, which assumed there was spatial interference of the intervention, these estimates were quite similar.
Under each model and intervention, we obtained the posterior distributions of AAR nat , AAR count , and AAR red .These effects are summarized in Table 3.This table provides the posterior mean, 95% credible interval, and posterior probability that each AARD was greater than 0. Under Model 1, the posterior mean of AAR nat was 3.570 with a 95% credible interval of (1.147, 5.814) per 100,000 person-years.Intervention 1 resulted in a very small change in the AARD of 0.063 per 100,000 person-years with a 95% credible interval of (−0.003, 0.121).By adding in more gyms in Intervention 2, the posterior mean of AAR red was 0.257 per 100,000 person-years, approximately an 8.7% reduction in the AARD between rural and urban areas, with a 95% credible interval of (0.012, 0.518).
Under Model 2, there was a slight difference in the estimates of AAR nat , AAR count , and AAR red , but conclusions remained the same.Of note, Model 1 had a lower Watanabe-Akaike information criterion (WAIC) value (Watanabe, 2010) F I G U R E 4 Histograms of the expected natural-course AARs for rural and urban ZIP codes estimated from Model 1 using 10,000 posterior samples.AAR, age-adjusted rate.

TA B L E 3
Posterior means, 95% credible intervals, and posterior probabilities of the decomposition effects under Interventions 1 and 2 and Models 1 and 2. than Model 2, and most of the effect of gym availability on AARs appeared through each ZIP code's own gym availability rather than through each ZIP code's neighborhood function.This suggests that either individuals tend to use gyms within their own ZIP codes or that other ways to specify the neighborhood function may have better captured the spatial interference if it was expected.Finally, the decomposition maps created with Models 1 and 2 are displayed in Figures 5 and 6, respectively.Both figures show the natural-course disease maps and the counterfactual disease maps under Intervention 2. In both figures, the addition of new gyms resulted in disease maps with less dark red.The overall decomposition effects are displayed on the maps to help quantify the impact of the proposed intervention.

Quantity
Our decomposition effects and maps suggest that a large-scale intervention may noticeably reduce the gap in colorectal cancer incidence AARs between rural and urban areas in Iowa, but adding only 12 gyms to rural ZIP codes likely will not meaningfully reduce the AARD.This relationship between gym availability and colorectal cancer AARDs in rural versus urban areas may be worth exploring in future studies.Different definitions of "gym availability" either based on distance to a gym, a different neighborhood function, or number of gyms in each ZIP code may serve to clarify this relationship.Furthermore, there are multiple ways to define rural-urban status using the nine-point U.S. Department of Agriculture (USDA) rural-urban commuting area (RUCA) code.Using a different definition or additional categories, such as a large rural and small rural category, might alter the results, since there is some heterogeneity in the ZIP codes defined as "rural" and defined as "urban" in this analysis.A recent study by Onega et al. (2020) illustrated that there was also some discordance between how residents classify the ZIP code in which they reside and the RUCA code classification for that ZIP code.

Sensitivity analysis
In both causal inference and Bayesian statistics, it is important to perform sensitivity analyses to evaluate the underlying model assumptions.We performed a sensitivity analysis to assess different formulations of the outcome model, different prior distributions, and changes to the set of confounders.Table 4 displays the posterior means and 95% credible intervals for the decomposition effects under differing outcome models without spatial interference.Under each outcome model tested, the WAIC values were quite similar, indicating no meaningful difference in terms of the predictive accuracy of the models.The counterfactual and reduced AARDs correspond to intervention 2. Including an age group-gym interaction in the outcome model resulted in a slightly attenuated effect of the intervention with the posterior mean of AAR red equal to 0.222 per 100,000 person-years and a 95% credible interval of (−0.025, 0.474).Furthermore, the specification of more informative priors on the model coefficients and different priors on the hyperparameters corresponding to the random effects resulted in a slight increase in the posterior mean of AAR nat and AAR red .These changes were relatively minimal, with the largest change in AAR red of 0.01.
Importantly though, adding two additional variables to the confounder set reduced AAR red to 0.113 with a 95% credible interval of (−0.168, 0.381).These variables were the presence of a grocery store in the ZIP code and the percentage of individuals over the age of 25 with a higher education degree, with the higher education variable exhibiting a significant association with gym access.Given that the 95% credible interval includes more positive than negative values, there is still some evidence that a large-scale gym intervention could reduce the average difference in colorectal cancer AARs between rural and urban areas; however, the costs associated with this largely attenuated effect may not be justified.This sensitivity analysis illustrates the importance of testing numerous sets of posited confounders.

DISCUSSION
In this paper, we have described a Bayesian method for creating decomposition maps to test the effect of area-level interventions on health differences between groups.We extend the work of Sudharsanan & Bijlsma (2021) to develop decomposition maps and a method for obtaining decomposition effects based on summary measures of a disease map.Our Bayesian counterfactual-based decomposition method for disease maps is extremely flexible.Area-level interventions involving any type of mediator variable can be assessed, and the functional form of the outcome model can be tailored to the data application.While AARDs were the summary measure described in this paper, the posterior distribution of AAR ratios can easily be computed from the posterior samples of the underlying AARs drawn in our analysis.The use of Bayesian hierarchical models with spatial and uncorrelated random effects enabled us to produce stable estimates of AARs for small areas and consequently stable estimates of the AARD quantities.As illustrated in our data example, we were able to make reliable inferences from sparse data at the ZIP code level.Past studies (e.g., Singleton et al., 2016;Spencer et al., 2018) have performed decomposition analyses with county-level data using the Oaxaca-Blinder decomposition method from economics (Blinder, 1973;Oaxaca, 1973).However, these studies did not incorporate small area estimation techniques into their decomposition analysis and relied on linear regressions.Standard approaches to decomposition analysis that do not borrow strength from neighboring regions with random effects would not have adequately reduced the noise due to small population sizes in our dataset.These studies also did not test the effect of intervening on specific counties, develop visualizations of the decomposition analysis across space, or employ the counterfactual outcomes framework, as we have done here.
Our proposed decomposition analysis method allowed us to select rural ZIP codes to hypothetically receive gyms and to assess the effect of this intervention in reducing the rural-urban differences in colorectal cancer AARs.It also illustrated how the disease map might change with the inclusion of gyms in different ZIP codes.Under our proposed model, we were able to conclude that gym access does help explain differences in colorectal cancer AARs for rural vs. urban Iowa ZIP codes.However, in order to achieve a noticeable reduction in the AARD, a large-scale intervention where 50+ gyms are added to rural ZIP codes may be required.In our sensitivity analysis, this effect was fairly robust to different priors and the inclusion of an age group-gym interaction; however, it was largely attenuated with the inclusion of two additional potential confounders: grocery store access and the percentage of individuals in a ZIP code over the age of 25 with a higher education degree.In both the proposed model and models considered in the sensitivity analysis, it is clear that a large proportion of the AARD cannot be explained by gym access.It may be worth considering other variables that capture ZIP code-level physical activity patterns as well as other behavioral health interventions to reduce this rural-urban difference in colorectal cancer incidence.
In addition to the strengths of our method, there are also some limitations.It may be difficult to meet the conditional exchangeability assumption required for a causal interpretation in the aggregate spatial data setting.Due to ecological bias, which may arise at the ZIP code level, relationships between ZIP code-level variables may be different than relationships between individual-level variables.This is especially true when variables in the analysis (i.e., household income) vary considerably within a ZIP code.Thus, it can be challenging to account for all confounding with the available data sources (Wakefield & Lyons, 2010).In addition to fitting the outcome model under different sets of potential confounders, we recommend using sensitivity analysis techniques from causal inference in practice (e.g., McCandless et al., 2007;Van-derWeele & Ding, 2017;VanderWeele & Vansteelandt, 2014;Xue & Zaidi, 2021) to help quantify how large an unmeasured confounder's effect on both the mediator and outcome must be to change the magnitude of the decomposition effects.The E-value was developed for a similar setting (VanderWeele & Ding, 2017), and Xue & Zaidi (2021) describe how to obtain a posterior estimate of the E-value and a 95% credible interval in a Bayesian analysis.While it may be challenging to account for all unmeasured confounding with publicly available aggregated data, we focus on using these maps as an exploratory tool that can aid in decision-making, given the information that is available.
A second limitation is the dependence of the natural-course AARs on a specific outcome model.Researchers should inform their decisions of whether or not to include spatial interference of the mediator, how to specify the neighborhood function, and how to account for differences in rates by age group, with subject matter expertise.As illustrated in the data example, AAR nat varied between Models 1 and 2, albeit only by a small amount.In the area of disease mapping, it is necessary to rely on model-based estimates in the natural-course pseudo-population due to the sheer variability in the AARs computed directly from the data.However, incorrect model specifications could affect the results of the decomposition analysis.Our sensitivity analysis illustrated that numerical summaries of the natural-course AARs were relatively robust to different non-informative or weakly informative prior specifications but less robust to different sets of potential confounders.
Our methodology uses the counterfactual outcomes framework to create causal decomposition maps and examine the impact of deterministic interventions on population-level mediator variables.It is important to highlight that there are other causal frameworks that may be used to address similar questions.In particular, Dawid (2021) proposed a decisiontheoretic framework for statistical causality that aims to represent how a nonstochastic "cause" affects a stochastic outcome of interest, where the value of the cause is selected by the decision maker.While not explored in this paper, this decision-theoretic framework might be an alternative to the counterfactual outcomes framework for creating decomposition maps.
In the future, we aim to develop a web application with real-time decomposition maps for cancer researchers to test proposed area-level interventions aimed at reducing geographic health differences.These maps would enable public health professionals to click on ZIP codes on the map to intervene on and then update the AARDs and counterfactual disease maps in real time.A web application with decomposition maps could have great utility in planning interventions to reduce differences in incidence or mortality rates across the spectrum of diseases.The utility of real-time decomposition maps may also extend to research and policymaking in economics, education, and other fields where ZIP code-level differences in outcomes are studied.

C O N F L I C T O F I N T E R E S T S TAT E M E N T
The authors have declared no conflict of interest.

D ATA AVA I L A B I L I T Y S TAT E M E N T
The ZIP code-level cancer incidence data are not publicly available due to privacy or ethical restrictions.A simulated dataset and the R code to implement this method are provided in the Supporting Information.The data pertaining to the ZIP code-level confounders and population counts are freely available from the American Community Survey.The rural-urban commuting area code is freely available on the USDA website.The gym availability and grocery store availability data come from the National Neighborhood Data Archive, which is available to researchers affiliated with member institutions of the Inter-University Consortium for Political and Social Research.

O P E N R E S E A R C H B A D G E S
This article has earned an Open Data badge for making publicly available the digitally-shareable data necessary to reproduce the reported results.The data is available in the Supporting Information section.This article has earned an open data badge "Reproducible Research" for making publicly available the code necessary to reproduce the reported results.The results reported in this article were reproduced partially due to data confidentiality issues.

F
Graph displaying the relationships between variables in a decomposition analysis.Dotted lines indicate observed associations and solid arrows indicate possible causal relationships.A, a group variable; C, any observed covariates; M, a potential mediator variable; R, an outcome variable.the outcomes and mediator values predicted for each individual based on their observed group label.Assuming proper model specification, these predicted values should closely align with the observed mediator and outcome values for each individual.

F
Graph displaying the hypothesized ZIP code-level relationships in this decomposition analysis.The dotted lines indicate associations, whereas the solid, directed arrows indicate possible causal relationships.AAR, age-adjusted rate; SES, socio-economic status.

F
Map of rural-urban classifications for Iowa ZIP codes.TA B L E 1 Number and percentage of rural and urban ZIP codes, respectively, with gyms and gyms in neighboring ZIP codes.

F
Decomposition maps created with Model 1 (assuming no spatial interference).AAR, age-adjusted rate; AARD, age-adjusted rate difference.F I G U R E 6 Decomposition maps created with Model 2 (assuming there is spatial interference).TA B L E 4 Results from a sensitivity analysis evaluating different specifications of the outcome model.The decomposition effects correspond to intervention 2 under a model with no spatial interference.
This research was funded by the National Science Foundation Graduate Research Fellowship Program under grant number 000390183 (MJS), the National Institute of Environmental Health Sciences through the University of Iowa Environmental Health Sciences Research Center under grant number NIEHS/NIH P30 ES005605 (JJO, MEC, MJS), NIH/NCI contract number HHSN261201800012I/HHSN26100001 (MEC), NIH/NCI grant number U01CA258400 (JJO, MEC) and NIH/NCI P30 CA086862 (MEC).Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Institutes of Health or the National Science Foundation.
Jay et al. (2021)This is one possible small area estimation model for AARs, and a version of this model with a temporal component is described inJay et al. (2021).Let  , denote the number of incident cases of the disease of interest in region  ( = 1, … , ) and age group  ( = 1, … , ).Let  , be the population size, or person-years (if cases are aggregated over multiple years) for region  and age group . ′  is a vector of observed area-level covariates corresponding to region  to be used in the estimation of the AAR.The following Bayesian hierarchical model can be used to obtain estimates of each region's underlying AAR: model for relative risks, both spatial random effects and uncorrelated random effects are included in the AAR model. , represents a random effect for region  and age group  to capture potential overdispersion in the Poisson-distributed counts.Collectively, we denote  as the vector of all  ×  overdispersion terms.Here,  ∼ MVN(,   *  × ) , ∼ Poisson( , *  , ) log( , ) =   +  1   +  2   +  3   +  4   *   +  ′   5 +   +  , posterior samples of the natural-course age group-specific rate parameters for each region and age group.The th sample of  , (  ,   ,   ) can be computed as ) .4.Generate  posterior samples of the age group-specific rate parameters in the counterfactual pseudo-population for each region and age group.The th sample of  , (  , m , g ) can be computed as

Mediator definition (with and without spatial interference) Rural Urban Mediator definition (with spatial interference) Rural Urban
Description of proposed interventions.
TA B L E 2