A simulation study of disaggregation regression for spatial disease mapping

Disaggregation regression has become an important tool in spatial disease mapping for making fine-scale predictions of disease risk from aggregated response data. By including high resolution covariate information and modelling the data generating process on a fine scale, it is hoped that these models can accurately learn the relationships between covariates and response at a fine spatial scale. However, validating these high resolution predictions can be a challenge, as often there is no data observed at this spatial scale. In this study, disaggregation regression was performed on simulated data in various settings and the resulting fine-scale predictions are compared to the simulated ground truth. Performance was investigated with varying numbers of data points, sizes of aggregated areas and levels of model misspecification. The effectiveness of cross validation on the aggregate level as a measure of fine-scale predictive performance was also investigated. Predictive performance improved as the number of observations increased and as the size of the aggregated areas decreased. When the model was well-specified, fine-scale predictions were accurate even with small numbers of observations and large aggregated areas. Under model misspecification predictive performance was significantly worse for large aggregated areas but remained high when response data was aggregated over smaller regions. Cross-validation correlation on the aggregate level was a moderately good predictor of fine-scale predictive performance. While the simulations are unlikely to capture the nuances of real-life response data, this study gives insight into the effectiveness of disaggregation regression in different contexts.


Introduction
High resolution maps of disease risk are an important public health tool, facilitating efficient allocation of limited resources and precision targeting of interventions (Drake and others, 2017;Elliot and others, 2000;Lawson and others, 1999). Where incidence or prevalence data is available at a point-level (or can be treated as such, as often the case with village-level health surveys), traditional geostatistical models can be used to make fine-scale (or 'pixel-level') predictions of risk over the region of interest (typically on a grid of 1km-by-1km or 5km-by-5km pixels), informed by spatial patterns and environmental and socioeconomic covariates (Diggle and others, 2003).
Much of the information in these models comes from the spatial smoothing of the data, particularly near sampling locations (some of the earliest examples of modern spatial mapping used no covariate information, such as Diggle and others (1998)). Further from these locations, spatial smoothing is less informative and covariate information aids more in predictive performance.
Often, however, only aggregated (or 'polygon-level') response data is available, for example case counts in geographical regions or from health facilities. Attempting to make statistical inferences at this aggregate level and subsequently predict at fine scale has two main pit-falls -the 'ecological fallacy' (Wakefield and Shaddick, 2006), where relationships learned between covariate and response variables at the aggregate level may not hold on a fine scale, and the 'modifiable areal unit problem' (Fotheringham and Wong, 1991), where fine-scale predictions based on aggregated data change based on how the data is aggregated geographically.
Disaggregation regression (Keil and others, 2013;Sturrock and others, 2014;Weiss and others, 2019;Taylor and others, 2018) attempts to avoid these problems by using high resolution covariate information and modelling the data generating process on a fine scale, with the likelihood of the observed data given by the sum of these processes. However, it is often difficult to assess the performance of these models due to a lack of observed data on a fine spatial scale with which to validate fine-scale predictions. For this reason it is unclear in which situations disaggregation regression may be successfully applied and when making accurate fine-scale predictions may not be possible. Providing fine-scale risk maps where there is insufficient data to inform these predictions (for example, where response data is aggregated over very large areas or information is only available from a small number of areas) may be misleading and in these cases aggregated risk maps may be more suitable. Cross validation can be done on the aggregate level (as by Lucas and others (2020); Law and others (2018), for example) but again it is not clear how measures of out-of-sample aggregated performance relate to fine-scale predictive accuracy.
There have been a number of important simulation studies of disaggregation methodology, generally serving as proofs of concept of particular methods. Wilson and Wakefield (2020) perform a simulation study using their disaggregation method (combining point and polygon-level data) with no covariates in a small number of spatial settings, while Li and others (2012) compare disaggregation regression to regression on the aggregate level with two covariates (one categorical and one continuous) on a single set of areal units. Law and others (2018) simulate data using a toy dataset and evaluate performance at the aggregate level when using real world malaria data. This study builds on the existing work in two ways. Firstly, we extend the scope of previous simulation studies by evaluating the performance of disaggregation regression in a realistic setting and under different conditions to better inform the utility of these methods when applied to real world datasets. Each situation is repeated a number of times to ensure the robustness of the results. Secondly, we investigate the use of cross-validation on the aggregate level as a metric of fine-scale predictive performance. Cross-validation metrics on the aggregate level are often used due to lack of fine-scale observations, so it is important to understand how well these aggregate metrics represent fine-scale accuracy.
This study provides insight into the performance of disaggregation regression as the size of aggregated areas, number of observations and the level of model misspecification varies. In each case, incidence data was simulated for each pixel and aggregated to give polygon-level response data. Disaggregation regression was then used to make pixel-level predictions from this aggregate data, with the simulated fine-scale data used as a ground truth (which is typically missing in real world applications). Additionally, we calculated polygon-level correlation using k-fold cross validation and compared this measure to the fine-scale predictive performance.
The simulated data was intended to resemble malaria incidence data, using real covariate information (a mix environmental and socioeconomic variables that are commonly used in malaria risk mapping) and geographical information from two malaria endemic countries, India and Madagascar. However, we believe the results of this study are informative for disease mapping in general and more widely. We considered model misspecification due to unobserved covariates but did not consider other possible sources of misspecification, such as non-linear covariate effects, covariate interaction or alternative case-generating processes.

Data simulation
The number of cases in pixel j of polygon i was drawn from a Poisson process with mean, µ ij = λ ij × p ij , given by the product of the pixel incidence rate, λ ij , and pixel population, p ij . The log pixel incidence rate was simulated as a linear combination of observed, X obs ij , and unobserved, X unobs ij , covariates The elements of β obs and β unobs were drawn independently from a Normal distribution with mean 0 and standard deviation 0.5. The value of the intercept β 0 was drawn uniformly between -8 and -5 in order to produce incidence rates similar to those observed in malaria endemic countries.
The sampling process was repeated if the total case counts were above country-specific upper thresholds, to ensure that the total cases were largely consistent with estimated annual malaria case counts in each country (World Health Organization, 2019).
Mock case data were simulated under three different scenarios, representing different levels of model misspecification. In each scenario, the observed covariates were 6 real covariate surfaces.
In scenario 1, there were no additional unobserved covariates and therefore no model misspecification. In scenario 2, there were 6 additional real covariates which were unobserved during the fitting process. Finally, in scenario 3 there were 6 additional real covariates and 3 mock covariates that were unobserved during the fitting process. See Section 2.2 for more details on the real and simulated covariates used. For each country and scenario, 20 risk surfaces were simulated. For each simulated risk surface, the observed covariates were sampled (without replacement) from the 12 possible real covariates. In scenario 2 the unobserved covariates were the remaining real covariates while in scenario 3 the unobserved covariates were the remaining real covariates and Case numbers in each pixel were aggregated to give polygon-level case counts. These polygons were the real administrative units in each country and model performance was investigated using three levels of subdivision -administrative level 1, 2 and 3 units (shown in Figure 1). Administrative level 1 units are the largest subnational administrative regions (for example, states and union territories in India) while administrative level 2 and 3 units are increasingly fine subdivisions.

Study area and covariates
The countries chosen for this study were India and Madagascar. These two settings provided a range of polygon sizes (at each administrative level regions were generally larger in India) and different environmental and socioeconomic profiles and therefore covariates with different spatial patterns. Islands were not included and two union territories in India (Chandigarh, and Dadra and Nagar Haveli and Daman and Diu) were excluded as these regions are much smaller than the typical administrative level 1 unit. See Figure 1 and Table 1 for more information on the administrative units used in this study. Shapefiles for these administrative regions were obtained from the Malaria Atlas Project database using the malariaAtlas R package (Pfeffer and others, 2018).
The real covariates used in this study are listed in Table 2. Examples of these covariates are shown in Figure (2017) Battle and others (2019); Arambepola and others (2020)  The mock covariates were simulated Mátern random fields with varying randomly sampled scales. These covariates generally varied on a shorter spatial scale than some of the real covariates and were designed to represent the generally unobserved socioeconomic and human factors that influence risk. Examples of these mock covariates are shown in Figure 3 and shown in full in the supplementary material. It is worth noting that the sum of these simulated covariates is a Gaussian process but does not in general have a Mátern covariance structure, and therefore the addition of multiple mock covariates is not equivalent to adding a single mock covariate with larger variance.

Disaggregation regression model
We assume that aggregated incidence data y 1 , ..., y N is available from N spatial polygons (for example total case numbers in N districts) and we wish to predict risk on a grid of pixels, each of which is contained in exactly one polygon. The number of cases in pixel j of polygon i, y ij , is modelled as a realisation of a Poisson random variable with mean where λ ij is the underlying risk in that pixel, p ij is the pixel population and M i is the number of pixels in polygon i. As is common in geostatistical models (Diggle and others, 1998), risk can be modelled as a smooth surface which, when transformed with suitable link function, is given by a the sum of a linear combination of covariates and a spatially correlated noise term Here the link function is log, which is typical when modelling count data, and X ij are the covariate values in this pixel. The spatial noise { ij } is modelled as a realisation of a Gaussian process with a Mátern covariance structure, parameterised by the range, ρ, and scale, σ. This spatial term can be thought of as representing factors which affect risk but have not (or cannot) be measured. The number of cases observed in polygon i, y i , is assumed to be the sum of the (unobserved) number of cases in each pixel, Assuming that, conditional on the underlying risk surface, the Poisson processes in each pixel are independent, this sum also follows a Poisson distribution with mean equal to the sum of the means of each pixel process, i.e. was randomly sampled (see Figure 1 for examples). The model was then fit using observations from these polygons (aggregated from the pixel case surface) and pixel-level incidence rates were predicted. For each country and scenario, this was repeated over all 20 simulated risk surfaces.
Pixel-level predicted and true incidence rates were compared only within polygons where response data was observed. The main metrics used were overall correlation and, to measure the ability of the model to predict risk patterns within polygons, correlation within each polygon.
The performance of the model in terms of overall correlation was also compared to a baseline of simply assigning each pixel the observed polygon-level incidence rate.

Results
The overall correlation between observed and predicted pixel rates as the number of observations increases is shown in Figure 4, stratified by scenario and administrative level. The overall trends were largely as expected, with more data points (number of polygons observed), smaller polygons and less model misspecification all generally resulting in improved pixel predictions.
In scenario 1, where the model was well-specified, there was little difference in correlation as polygon size or number of polygons observed varied. This was particularly true in India where, with the exception of when the number of observations was very small, correlations were uniformly very high, whereas in Madagascar polygon size appeared to have more of an effect. In general it appears the relationships between covariates and risk were learned effectively, even for large polygons or relatively few data points, and therefore pixel-level predictions were consistently good.
In scenarios 2 and 3, where some of the covariates used to produce the data were withheld during the fitting process, there were clearer trends in performance as the number of observations and size of polygons varied. When using the largest polygons, administrative level 1, performance varied greatly between repeats and on average the correlation was fairly low. As polygon size decreased, overall correlation improved and predictions were more reliable. For administrative level 3 units, particularly in Madagascar, overall correlation was consistently high despite the model misspecification. In both countries and across administrative levels, there was an initial trend of improved performance as the number of observations increased which usually levelled off. Model performance in terms of overall correlation was typically better in scenario 2 than scenario 3, though the differences were often fairly small.  small. Figure 6 shows the correlation between observed and predicted values within each polygon, which measures the ability of the model to predict heterogeneity in risk within each polygon. The mean of these within-polygon correlations shows a similar pattern to overall correlation, with uniformly good performance in scenario 1 and on average increasing correlation as number of observations increases and polygon size decreases in scenarios 2 and 3. However, when considering each polygon (rather than simply the average over all polygons and repeats), there is a large amount of heterogeneity in within-polygon performance when the model is misspecified (scenarios 2 and 3). These large variations are observed across all administrative levels and numbers of observations. Performance is also highly variable across different polygons in the same repeat, rather the model making consistently good or consistently poor predictions in some repeats. The distribution of these correlations in scenario 3 is shown in Figure 7. When using administrative level 1 units, there were a significant proportion of polygons with little to no correlation between predicted and true pixel rates, even for large numbers of observations, with a correlation of less than 0.3 in 27% of polygons in Madagascar and 38% in India. In India when using administrative level 2 units, correlations were generally high but there was still considerable variation.
Correlations were mostly high for administrative level 2 in Madagascar and level 3 in India and consistently high for administrative level 3 in Madagascar.
The relationship between average size of observed polygons and overall correlation under model misspecification (scenarios 2 and 3) is shown in Figure 8. The average size varied between repeats even when the same administrative level was used because for each repeat a random set of contiguous polygons was chosen. The negative association between polygon size and predictive performance previously observed when comparing administrative levels also appears to hold for different sets of polygon within the same administrative level, with a correlation of -0.44 between overall correlation and mean size of polygon. This negative relationship is strongest for

Model fitting and prediction
The relationship between fine-scale predictive performance and polygon-level cross validation was investigated under scenario 3 (the highest level of misspecification), which we considered the most realistic scenario. We did not consider settings where, according to our results in the previous section, disaggregation regression was unlikely to be successfully and reliably applied. Therefore only administrative levels 2 and 3 were considered, with a minimum of 50 observations.
For each risk surface and administrative level, a contiguous set of polygons of a random size was sampled. The disaggregation model was applied over this area and the correlation between predicted and true pixel-level rates was calculated. We then performed five-fold cross validation at the polygon level over this area: The set of polygons was randomly split into five subsets of approximately equal size. For each of these subsets, response data from this subset was withheld and predictions for the rate in these held-out polygons were made by applying the disaggregation model with the remaining response data and aggregating pixel-level predictions in the held-out subset. The process was repeated for all subsets and the correlation between polygon predicted and true rates was calculated.
This was repeated five times for each risk surface and the pixel-level and cross-validated polygon-level correlations were compared.

Results
The comparison between correlation on the pixel-level and aggregated cross-validated correlation is shown in Figure 9. Overall, there was a moderately strong relationship, with a correlation of 0.52 between these two metrics. The polygon correlation was greater than pixel correlation more often when administrative level 2 units were used (69% of the time) compared to administrative level 3 units (44% of the time). When considered separately, the strength of the relationships between the metrics was slightly higher for administrative level 3 units than administrative level 2 units, with correlations of 0.53 and 0.47 respectively.
Despite the positive association between pixel and polygon-level metrics, however, there still were a number of repeats (usually when using administrative level 2 units) where polygon-level correlation was high but pixel-level correlation was much lower. For example, of repeats in which pixel-level correlation was less than 0.6 (representing roughly the worst 10% of pixel-level correlations), in more than half (53%) polygon-level correlation was above 0.75. Fig. 9. Comparison of correlation between predicted and observed polygon rates with five-fold cross validation and predicted and true pixel rates in full model fit.

Discussion
While we included model misspecification in this study in the form of missing covariates, there are many other factors that we did not consider when generating mock incidence data. These include non-linear covariate interactions, incomplete reporting, treatment seeking behaviour, human mobility and alternative case-generating processes. Although it may be possible account for some of these factors, it therefore seems fairly likely that, even compared to our most misspecified scenario, the performance of disaggregation regression could be worse when applied to real data than in this study.
The results in Section 3 demonstrate that disaggregation regression works well under no model misspecification, with relationships between the covariates and risk learned effectively (avoiding the ecological fallacy), even with relatively few polygons and relatively large polygons. However, this is a fairly unrealistic scenario. In addition to the factors not included in the generative model mentioned above, it is unlikely that all variables that significantly influence disease risk can be measured, let alone that these values will be available on a fine-scale grid across the region of interest. Therefore the results of this scenario serve as a proof of concept but do little to inform real world risk mapping problems.
Scenarios 2 and 3 represent more realistic situations and the results here suggest that there may be little benefit in applying disaggregation regression when the observed data is aggregated over large areas. For administrative level 1 units, model performance was highly variable and the spatial patterns predicted by the disaggregation model were often less accurate than using the observed aggregate rates. On average the model was able to capture some sub-polygon heterogeneity accurately but again the large variation in performance between polygons and repeats limits the practical use of these predictions. While performance generally improved as the number of observations increased, these issues were still present with large numbers of observations and the maximum number of observations will necessarily be limited by the size of the study area.
Administrative level 2 units in India may also be too large to reliably make fine-scale predictions from real data.
However, for smaller polygons (administrative level 2 in Madagascar and level 3 in both countries) the results of this study show that fine-scale patterns can be accurately inferred from aggregated data. In these cases, the fine-scale predictions made using the disaggregation model captured both overall and within-polygon heterogeneity well and therefore were a significant improvement over the observed aggregated data. Performance was also much more reliable, both between repeats and (when considering within-polygon correlation) between polygons. Performance improved as the number of observations increased but generally high correlations were reached with a moderate number of observations. Largely as expected, performance was better when aggregated areas were smaller, a trend which was observed both between and within administrative levels. Fine-scale predictions were particularly good for administrative level 3 in Madagascar.
The results from Section 4 suggest that cross-validation on a polygon-level has some use as a proxy for fine-scale predictive performance. There was a correlation of around 0.5 between polygon and pixel-level metrics and therefore a high polygon correlation should increase confidence in fine-scale predictions. However, the relationship between polygon and pixel-level correlation may still not be strong enough to consistently identify poor pixel-level predictions from polygon-level correlation. This was the case in our simulations, as polygon-level correlation was often high even when pixel-level correlation was low. It therefore remains important to validate disaggregation regression using fine-scale observations wherever possible. Where no fine-scale data exists, alternative metrics of similar phenomena may help evaluate fine-scale patterns qualitatively.

Conclusion
The results of this study suggest that disaggregation regression can be applied successfully in applications where response data is aggregated over small areas. The different situations considered in this study (in terms of polygon size, number of observations and level of misspecification) should allow for more informed decisions to be made about whether disaggregation regression is an appropriate method for a specific problem.
However, caution should be taken when response data is aggregated over large areas. While disaggregation was on average moderately effective when using large polygons, predictions were generally unreliable, with large variations in performance both between repeats and between regions in the same repeat. Our results suggest that administrative level 1 and 2 units in India and level 1 units in Madagascar may be too large to perform effective disaggregation. In these cases it may be more useful to model on the aggregate level and look for alternative sources of data (such as data from sub-regions) to inform sub-polygon heterogeneity.
In our simulations, cross-validation correlation at the aggregate level was correlated with fine-scale accuracy and therefore in real world applications correlation on the aggregate level may be of some use as a validation metric where only aggregated data is available. However, a high correlation on the aggregate level is not a guarantee of accurate fine-scale predictions and wherever possible Fine-scale predictions should be validated using data on the same spatial scale.