Ge Lin, Department of Geology and Geography, West Virginia University, Morgantown, WV 26506 e-mail: firstname.lastname@example.org
This article bridges the permutation test of Moran's I to the residuals of a loglinear model under the asymptotic normality assumption. It provides versions of Moran's I based on Pearson residuals (IPR) and deviance residuals (IDR) that can be used to test for spatial clustering while accounting for potential covariates and heterogeneous population sizes. Our simulations showed that both IPR and IDR are effective in accounting for heterogeneous population sizes. The tests based on IPR and IDR are applied to a set of log-rate models for early-stage and late-stage breast cancer with socioeconomic and access-to-care data in Kentucky. The results showed that socioeconomic and access-to-care variables can sufficiently explain spatial clustering of early-stage breast carcinomas, but these factors cannot explain the clustering of late-stage carcinomas. For this reason, we used local spatial association terms and located four late-stage breast cancer clusters that could not be explained. The results also confirmed our expectation that a high screening level would be associated with a high incidence rate of early-stage disease, which in turn would reduce late-stage incidence rates.
Linear or loglinear spatial regression models are common in spatial epidemiology (Best et al. 2000). A set of ecological variables is often associated with disease rates or counts, and after a final model is derived, residuals can be visually inspected on a map for spatial clusters. For a linear model, a residual test of Moran's I for spatial autocorrelation can also be performed to detect spatial clustering for the unexplained regression errors. However, there is no corresponding spatial residual test of clustering for loglinear or Poisson regressions on count data, which was the challenge for the current study. The study was motivated by our inquiry into the spatial patterns of breast cancer incidence in Kentucky counties as they related to the development stage of disease at diagnosis. Breast cancer staging at diagnosis is known to be associated with socioeconomic conditions, mammography screening services, and other variables (Yabroff and Gordis 2003; Barry and Breen 2005). As socioeconomic variables are often spatially autocorrelated (e.g., poor areas tend to be clustered), we expect clustering of breast cancer to occur, at least for the early-stage incidence rates. If there is no significant environmental cause of breast cancer, the clustering tendency should disappear once we introduce area socioeconomic variables.
One way to test for the existence of spatial clustering is to set up a spatial autocorrelation test, such as Moran's I, for Gaussian or continuous data by using the permutation test of residuals for Moran's I in a linear regression (Cliff and Ord 1981). Converting incidence to rate, however, is often less appealing than retaining the original count for each region in spatial data analysis (Griffith and Haining 2006). In addition, Moran's I test assumes that attribute values (e.g., disease prevalence) are either equally probable across all the geographic units or drawn from a single parent distribution. These assumptions are often violated in the permutation test of Moran's I in disease data due to heterogeneous regional populations and large variation in sparsely populated areas (Besag and Newell 1991). Although there have been several extensions of Moran's I to account for population heterogeneity (Oden 1995; Waldhor 1996; Assuncao and Reis 1999), none of them can include potential ecological covariates. For example, Oden proposes a test statistic that applies regional population sizes to adjust Moran's I. However, because of a minor modification in the null hypothesis, Oden's statistic is no longer comparable to the original Moran's I (Assuncao and Reis 1999). Consequently, it cannot be extended to evaluate covariates and spatial autocorrelation simultaneously.
A spatial logit association model can include potential explanatory variables and identify high-value and low-value clusters (Lin 2003; Zhang and Lin 2006). It does not, however, have a global measure of spatial clustering that would complement the modeling process for local spatial logit associations. Jacqmin-Gadda et al. (1997) propose a homogeneity score test of a generalized linear model that can also include potential explanatory variables in a correlation test. The test is based on residuals in generalized linear models, a design that its authors claimed to correspond to the permutation test of linear regression errors. However, as we will later show, the test does not adjust variance for heterogeneous population sizes. In addition, because its weight matrix is not necessarily spatially constructed, its null hypothesis is not necessarily spatial independence as one would assume when applying Moran's I test. Consequently, it is not straightforward to use the score test in a generalized linear model that includes spatial correlation and heterogeneity.
The purpose of this article is to extend the permutation test of residuals of the Moran's I autocorrelation to generalized linear models so that spatial analysts can directly test for spatial clustering while controlling for potential ecological covariates. While no one has proposed either deviance or Pearson residuals in the spatial statistics literature, Waller and Gotway (2004) point out a form close to Pearson residuals as a way to account for inflated variance in Moran's I under heteroskedasticity. In this article, we demonstrate that permutation tests are applicable to Pearson or deviance residuals of loglinear models in the same way as in the traditional permutation test of residuals for Moran's I. In the remaining sections of the article, we first review the permutation test of Moran's I by using regression residuals and then reformulate it in the context of Poisson data by using the Pearson and deviance residuals of a loglinear model. We then evaluate their statistical properties under the null hypothesis of spatial independence in a series of simulated patterns and apply the Pearson residual Moran's I test and deviance residual Moran's I test to breast cancer incidence in Kentucky counties that include potential ecological covariates.
From linear to loglinear residual tests of Moran's I
Let us consider a study area that has m regions indexed by i. Let zi be the variable of interest in region i. Moran's I (Moran 1950) is expressed as:
$$I = \frac{m}{\sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij}} \cdot \frac{\sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij}\,(z_i-\bar{z})(z_j-\bar{z})}{\sum_{i=1}^{m}(z_i-\bar{z})^{2}} \qquad (1)$$
where $\bar{z}=\sum_{i=1}^{m} z_i/m$ and wij is an element of a spatial weight matrix W, with wij=1 if regions i and j are adjacent and wij=0 otherwise. Under the assumption of homogeneity, the moments of Moran's I can be computed under the randomization assumption, which assumes that the observations are generated from a set of random permutations of the observed values. The significance of Moran's I can be determined by the quantity $I_{std}=(I-E(I))/\sqrt{\mathrm{Var}(I)}$ under the asymptotic normality assumption, where the respective mean and variance are obtained from the random permutation scheme suggested by Cliff and Ord (1981, p. 21). A significant and positive value of Moran's I (i.e., Istd>zα/2) usually indicates a positive autocorrelation, such as the existence of either high-value or low-value clustering. A significant and negative value of Moran's I (i.e., Istd<−zα/2) usually indicates a negative autocorrelation, such as a tendency toward the juxtaposition of high values with low values. If there is no spatial dependence, I is often close to the mean value −1/(m−1), which can be approximated by 0 if m is large.
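The standardization described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are ours, the weight matrix is passed as a plain list of lists, and the variance is the usual randomization formula attributed to Cliff and Ord, written with the constants S0, S1, S2 and the sample kurtosis b2.

```python
import math

def morans_i(z, w):
    """Moran's I for values z and a weight matrix w (list of lists)."""
    m = len(z)
    zbar = sum(z) / m
    d = [zi - zbar for zi in z]
    s0 = sum(sum(row) for row in w)
    cross = sum(w[i][j] * d[i] * d[j] for i in range(m) for j in range(m))
    return (m / s0) * cross / sum(di * di for di in d)

def randomization_moments(z, w):
    """Mean and variance of Moran's I under random permutation of z."""
    m = len(z)
    zbar = sum(z) / m
    d = [zi - zbar for zi in z]
    s0 = sum(sum(row) for row in w)
    s1 = 0.5 * sum((w[i][j] + w[j][i]) ** 2 for i in range(m) for j in range(m))
    s2 = sum((sum(w[i]) + sum(w[j][i] for j in range(m))) ** 2 for i in range(m))
    b2 = m * sum(di ** 4 for di in d) / sum(di ** 2 for di in d) ** 2  # kurtosis
    mean = -1.0 / (m - 1)
    a = m * ((m * m - 3 * m + 3) * s1 - m * s2 + 3 * s0 ** 2)
    b = b2 * ((m * m - m) * s1 - 2 * m * s2 + 6 * s0 ** 2)
    second_moment = (a - b) / ((m - 1) * (m - 2) * (m - 3) * s0 ** 2)
    return mean, second_moment - mean ** 2

def standardized_i(z, w):
    """(I - E(I)) / sqrt(Var(I)) using the randomization moments."""
    mean, var = randomization_moments(z, w)
    return (morans_i(z, w) - mean) / math.sqrt(var)

# Toy example (hypothetical data): four regions on a line with a steady trend.
w = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
z = [1.0, 2.0, 3.0, 4.0]
print(morans_i(z, w), standardized_i(z, w))
```

With only four regions the normal approximation is of course not meaningful; the toy example is only meant to make the arithmetic easy to check by hand.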
In order to account for ecological covariates, it is often suggested that zi be taken as the ith regional residual of a linear regression model when testing for spatial autocorrelation (Cliff and Ord 1981, p. 198). For data resulting purely from a random process, the calculation of Moran's I in equation (1) based on the residuals is the same as the one based on the observed values. It can be concluded that Istd is approximately N(0, 1) as m→∞ under some regularity conditions (Sen 1976). Schmoyer (1994) also demonstrated the validity of a general form of permutation test based on the residuals of a linear model with independent and identically distributed errors. In Poisson models with heterogeneous population sizes, these justifications cannot be applied (Waldhor 1996; Assuncao and Reis 1999).
The residuals of loglinear models are well established in the statistical literature in a nonspatial context. We can extend them to a spatial context to test for spatial autocorrelation. In his synthesis of previous studies, Agresti (1990, p. 431) showed that the Pearson and log-likelihood (deviance) residuals of loglinear models are asymptotically multivariate normal with mean 0 and the variance–covariance matrix a projection matrix. This particular asymptotic form of the residuals is analogous to that of linear regression residuals. Following this reasoning, we can devise a log-rate model that closely resembles a linear regression model to account for potentially heterogeneous population sizes and ecological covariates. We apply the permutation test based on the asymptotic normality assumption, so that Moran's I based on the residuals of a log-rate model is analogous to Moran's I based on regression residuals. In the following, we specify the residuals of a log-rate model for the permutation test of Moran's I.
Let ni be the observed counts for a Poisson random variable Ni at region i (i=1, …, m), and let ξi and θi be the ith regional population and relative risk, respectively. Then, Ni are assumed to be independently Poisson distributed, or Ni∼Poisson(Eiθi), where Ei is the expected count for region i. The null hypothesis can then be stated as all the relative risks (θ1, …, θm) being equal to 1 (Elliott et al. 2000, p. 132). Suppose that a set of explanatory variables (xi,1, … , xi,k−1) are observed together with ni. A log-rate model can then be expressed as
$$\log\!\left(\frac{\hat{n}_i}{\xi_i}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \cdots + \hat{\beta}_{k-1} x_{i,k-1} \qquad (2)$$
In equation (2), n̂i and log(n̂i/ξi) are, respectively, the estimated count and estimated log rate for region i, $\hat{\beta}_0$ is the estimate of the grand mean, and the other $\hat{\beta}$s are parameter estimates for the explanatory variables. Equation (2) is often estimated by moving log(ξi) to the right-hand side, so that it becomes the offset, that is, $\log(\hat{n}_i) = \log(\xi_i) + \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \cdots + \hat{\beta}_{k-1} x_{i,k-1}$.
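To make the offset formulation concrete, the following sketch fits a log-rate model with a single covariate by Newton-Raphson on the Poisson log-likelihood. It is an illustration under our own assumptions, not the software used in the study: the function name, the restriction to one covariate, and the synthetic data are all hypothetical, and in practice a standard GLM routine with an offset would be used.

```python
import math

def fit_log_rate(counts, pops, x, iterations=25):
    """Fit log(mu_i) = log(pop_i) + b0 + b1 * x_i by Newton-Raphson."""
    b0 = math.log(sum(counts) / sum(pops))  # covariate-free estimate as a start
    b1 = 0.0
    for _ in range(iterations):
        mu = [p * math.exp(b0 + b1 * xi) for p, xi in zip(pops, x)]
        # Score vector X'(n - mu)
        g0 = sum(n - m for n, m in zip(counts, mu))
        g1 = sum(xi * (n - m) for xi, n, m in zip(x, counts, mu))
        # Information matrix X' diag(mu) X
        h00 = sum(mu)
        h01 = sum(xi * m for xi, m in zip(x, mu))
        h11 = sum(xi * xi * m for xi, m in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system H * delta = g
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    fitted = [p * math.exp(b0 + b1 * xi) for p, xi in zip(pops, x)]
    return b0, b1, fitted

# Synthetic data generated exactly from b0 = -4.0 and b1 = 0.2 (hypothetical).
pops = [1000.0, 2000.0, 1500.0, 3000.0, 2500.0]
x = [0.0, 1.0, 2.0, 3.0, 4.0]
counts = [p * math.exp(-4.0 + 0.2 * xi) for p, xi in zip(pops, x)]
b0, b1, fitted = fit_log_rate(counts, pops, x)
print(round(b0, 6), round(b1, 6))
```

Because log(pop_i) enters with a fixed coefficient of 1, the fitted counts scale with population while the coefficients describe the log rate, which is the role the offset plays in equation (2).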
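Based on these notations, the conventional Pearson residual for region i is defined as
$$r_{i,p} = \frac{n_i - \hat{n}_i}{\sqrt{\hat{n}_i}} \qquad (4)$$
and the conventional deviance residual (Agresti 1990, p. 452) for region i is defined as
$$r_{i,d} = \mathrm{sign}(n_i - \hat{n}_i)\,\sqrt{2\left[\,n_i \log\!\left(\frac{n_i}{\hat{n}_i}\right) - n_i + \hat{n}_i\right]} \qquad (5)$$
where sign(a) is 1 if a>0, is 0 if a=0, and is −1 if a<0.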
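As a small numerical companion to these definitions, the sketch below computes both residuals for a single region. It assumes the standard Poisson forms, (n − n̂)/√n̂ for the Pearson residual and sign(n − n̂)·√(2[n log(n/n̂) − n + n̂]) for the deviance residual; the function names are ours.

```python
import math

def pearson_residual(n, n_hat):
    """Pearson residual (n - n_hat) / sqrt(n_hat) for a Poisson count."""
    return (n - n_hat) / math.sqrt(n_hat)

def deviance_residual(n, n_hat):
    """Signed square root of the region's contribution to the Poisson deviance."""
    # n * log(n / n_hat) is taken as 0 when n == 0 (its limiting value).
    log_term = n * math.log(n / n_hat) if n > 0 else 0.0
    d = 2.0 * (log_term - n + n_hat)
    sign = (n > n_hat) - (n < n_hat)
    return sign * math.sqrt(max(d, 0.0))  # guard against tiny negative rounding

# A region with 12 observed cases where the model expects 9.
print(pearson_residual(12, 9.0), deviance_residual(12, 9.0))
```

The two residuals agree closely when the observed and fitted counts are similar and diverge for sparse regions, which is why both versions are carried through the simulations.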
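We can test for spatial autocorrelation based on the residuals in equation (2) by replacing zi in (1) with either the Pearson residual in (4) or the deviance residual in (5). When zi is replaced with ri,p, for example, Moran's I becomes Pearson residual Moran's I, and we denote it as IPR, which can be expressed as
$$I_{PR} = \frac{m}{\sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij}} \cdot \frac{\sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij}\,(r_{i,p}-\bar{r}_p)(r_{j,p}-\bar{r}_p)}{\sum_{i=1}^{m}(r_{i,p}-\bar{r}_p)^{2}} \qquad (6)$$
where $\bar{r}_p$ is the mean of the Pearson residuals.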
When zi is replaced with ri,d, Moran's I becomes deviance or log-likelihood residual Moran's I and we denote it as IDR. To implement these tests, the parameters of explanatory variables together with the residuals are first estimated in the model-fitting process. Then, IPR and IDR are calculated. Finally, the means and variances of IPR and IDR are computed according to the random permutation scheme (Cliff and Ord 1981). The P values of IPR and IDR can be computed based on the asymptotic normality assumption when m is large and the number of covariates is small. If the independence model is rejected while all the ecological covariates can be found and modeled, the residuals of IPR or IDR from the model should be completely lacking in spatial autocorrelation. Otherwise, it may indicate spatial clustering that cannot be explained.
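The three steps (fit, compute the statistic, assess significance) can be sketched for the covariate-free log-rate model, where the fitted count is simply the overall rate times the regional population. Note the swapped-in technique: instead of the analytical permutation moments with a normal approximation used in the article, this sketch draws Monte Carlo permutations of the residuals. The data, function names, and the choice of 999 permutations are hypothetical.

```python
import math
import random

def morans_i(z, neighbors):
    """Moran's I with binary adjacency given as neighbour lists."""
    m = len(z)
    zbar = sum(z) / m
    d = [v - zbar for v in z]
    s0 = sum(len(nb) for nb in neighbors)
    cross = sum(d[i] * d[j] for i in range(m) for j in neighbors[i])
    return (m / s0) * cross / sum(v * v for v in d)

def pearson_residuals_null(counts, pops):
    """Pearson residuals of the log-rate model without covariates."""
    rate = sum(counts) / sum(pops)
    return [(n - rate * p) / math.sqrt(rate * p) for n, p in zip(counts, pops)]

def permutation_test(z, neighbors, n_perm=999, seed=0):
    """Observed I and a one-sided Monte Carlo p-value (positive autocorrelation)."""
    rng = random.Random(seed)
    observed = morans_i(z, neighbors)
    shuffled = list(z)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, neighbors) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)

# Hypothetical data: eight regions on a line with elevated rates at one end.
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4, 6], [5, 7], [6]]
pops = [1000, 5000, 2000, 8000, 1000, 4000, 3000, 6000]
counts = [30, 140, 50, 170, 10, 45, 30, 55]
ipr, p_value = permutation_test(pearson_residuals_null(counts, pops), neighbors)
print(round(ipr, 3), p_value)
```

Replacing `pearson_residuals_null` with residuals from a model that includes covariates gives the covariate-adjusted version of the same test.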
Although no one has specified both IDR and IPR as above, it is worth noting the difference between IPR and the score test of the residuals in a generalized linear model (Jacqmin-Gadda et al. 1997). The score test, which also accounts for explanatory variables, was derived from a random-effect approach by neglecting an additional term of overdispersion. Based on its original formula, the test statistic is expressed as
$$T = \sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij}\,(y_i-\hat{\mu}_i)(y_j-\hat{\mu}_j)$$
where Yi is the variable of interest in region i as assumed from the exponential family, $\hat{\mu}_i$ is an estimate of E(Yi), and wij is an element from the weight matrix. For a Poisson model, one can take Yi=Ni so that yi is the observed count ni, and $\hat{\mu}_i$ is the estimated count n̂i in our notation. Then,
$$T = \sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij}\,(n_i-\hat{n}_i)(n_j-\hat{n}_j)$$
This particular form of T is different from the formulation of IPR given by (6) because we use $z_i=(n_i-\hat{n}_i)/\sqrt{\hat{n}_i}$ instead of zi=ni−n̂i. This difference also leads to a variance adjustment problem for T, because the variance of $N_i-\hat{\mu}_i$ still depends on the ith regional population size ξi even when $\hat{\mu}_i$ equals the true parameter μi. In addition, wij in T may not be based on spatial relationships, such as spatial adjacency or distance, which adds complications in designing a proper weight matrix and in deriving the P value. Pearson residuals, on the other hand, are well established, and it is much easier to calculate or implement Pearson residuals in a standard statistical package than it is for the score test. Because of these differences, we will not compare IPR and IDR with the score test in our simulations.
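The variance point can be made explicit with a one-line calculation. Assuming the mean is known, so that Ni is Poisson with mean μi=λiξi (rate times population, a notation we introduce here for illustration), the raw and the Pearson-type residuals behave as follows:

```latex
\operatorname{Var}(N_i - \mu_i) = \mu_i = \lambda_i \xi_i ,
\qquad
\operatorname{Var}\!\left( \frac{N_i - \mu_i}{\sqrt{\mu_i}} \right) = 1 .
```

For a common rate of 0.0001, a region with 10^5 people has a raw-residual variance of 10, whereas a region with 1000 × 10^5 people has a variance of 10,000; after division by √μi both regions have unit variance, which is what makes the residuals exchangeable enough for a permutation argument.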
To evaluate Pearson residuals IPR and deviance residuals IDR for variance adjustments, we carried out simulations under the null hypothesis of no spatial clustering. We included two alternative test statistics, Oden's statistic and Assuncao and Reis's EBI, both of which are designed to account for heteroskedasticity without controlling for ecological covariates. As a reference point, we also included the original Moran's I, denoted by Ir, by taking zi=ni/ξi in equation (1). For ease of discussion, we summarize the definitions of Oden's statistic and Assuncao and Reis's EBI below.
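Oden's statistic is built from the regional case proportions ei=ni/n and the regional population proportions di=ξi/ξ, where n and ξ denote the total number of cases and the total population over the m regions; its full expression is given by Oden (1995). Oden also derived the mean and variance of the statistic. If there is no spatial dependence, the statistic is close to the mean value −1/(ξ−1).
Oden's statistic weights a pair of regional populations together with the spatial weight matrix. While wij in Oden's statistic is still a spatial weight in the W matrix, its diagonal element wii≠0. This feature makes Oden's statistic incomparable with Moran's I when spatial clustering and population heterogeneity both exist. Noticing this difference, Assuncao and Reis (1999) proposed their version of population-adjusted Moran's I, or EBI. In their definition, $z_i=(p_i-b)/\sqrt{\nu_i}$, where pi=ni/ξi, b=n/ξ, νi=a+b/ξi, a=s2−b/(ξ/m), and $s^2=\sum_{i=1}^{m}\xi_i(p_i-b)^2/\xi$. Hence, EBI takes the form of Moran's I computed from the standardized rates zi.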
Although they did not notice the connection, Assuncao and Reis point out that νi can be set equal to b/ξi if νi<0. If all νi<0, then IPR and EBI are identical in the absence of covariates. The justification and evaluation of EBI, in this case, would be directly applicable to IPR. However, we never encountered this case in our simulations, and so the general form of EBI is always assumed.
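The connection can be checked with one line of algebra. The sketch assumes the Assuncao and Reis standardization $z_i=(p_i-b)/\sqrt{\nu_i}$ with pi=ni/ξi (restated here so that the step is self-contained) and writes n̂i=bξi for the fitted count of the log-rate model without covariates, whose maximum likelihood fit is the overall rate times the regional population. Setting νi=b/ξi gives

```latex
z_i = \frac{p_i - b}{\sqrt{b/\xi_i}}
    = \frac{n_i/\xi_i - b}{\sqrt{b/\xi_i}}
    = \frac{n_i - b\,\xi_i}{\sqrt{b\,\xi_i}}
    = \frac{n_i - \hat{n}_i}{\sqrt{\hat{n}_i}}
    = r_{i,p} .
```

In other words, in that special case the standardized rate used by EBI is exactly the Pearson residual of the null log-rate model.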
Simulations were based on a 20 × 20 lattice. We defined a spatial weight matrix according to spatial adjacency, that is, wij=1 if lattice points i and j are adjacent, and 0 otherwise. For Oden's statistic, we followed Oden and took wii=2. These weight assignments are identical to those used in Assuncao and Reis's (1999) simulations. We denote λi and ξi, respectively, as the disease rate and population size at lattice point i, and Ni as the Poisson random variable with the mean value μi=λiξi. In each run, 400 Poisson random variables with the constant rate λi=0.0001 were generated independently on the lattice points, so that there would be no spatial clustering of disease rates.
In order to compare a wide range of heteroskedasticity, we included a spatially homogeneous pattern, together with six heterogeneous patterns similar to those used in Waldhor's (1996) simulations (Fig. 1). We used ui and vi to represent the vertical axis and the horizontal axis, respectively, so that a lattice point i can be easily identified by i=20(ui−1)+vi. The seven spatial population patterns were defined in units of 10^5 (e.g., ξi=1 means 1 × 10^5 and ξi=1000 means 1000 × 10^5), and they are:
(a0) Homogeneous population with ξi=1 for all i=1, …, 400.
(a1) Half sparse and half dense: ξi=1 if ui≤10, ξi=1000 if ui≥11.
(a2) Two quadrants sparse and two quadrants dense: ξi=1 if ui≤10 and vi≤10 or ui≥11 and vi≥11; ξi=1000 if ui≤10 and vi≥11 or ui≥11 and vi≤10.
(a3) All sparse except one cluster with a dense population: ξi=1, except for lattice points within the cluster shown in Fig. 1, in which ξi=1000.
(a4) All sparse except two clusters with dense populations: ξi=1, except for lattice points within either of the two clusters shown in Fig. 1, in which ξi=1000.
(a5) All dense except one cluster with sparse population: ξi=1000, except for lattice points within the cluster shown in Fig. 1, in which ξi=1.
(a6) All dense except two clusters with sparse populations: ξi=1000, except for lattice points within either of the two clusters shown in Fig. 1, in which ξi=1.
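The lattice, the adjacency weights, and the first three population patterns can be set up as in the sketch below. Two assumptions are ours: adjacency is taken to mean a shared edge (rook contiguity), which the text does not specify, and (a2) is coded as the checkerboard of quadrants in which the dense cells are those with ui≤10 and vi≥11 or ui≥11 and vi≤10. Patterns (a3) to (a6) are omitted because they depend on cluster locations defined in the figure.

```python
SIDE = 20

def index(u, v):
    """Zero-based list index of lattice point i = 20(u - 1) + v, with u, v in 1..20."""
    return SIDE * (u - 1) + v - 1

def rook_neighbors():
    """Neighbour lists for the 20 x 20 lattice (shared edges only, an assumption)."""
    nbrs = [[] for _ in range(SIDE * SIDE)]
    for u in range(1, SIDE + 1):
        for v in range(1, SIDE + 1):
            for du, dv in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                uu, vv = u + du, v + dv
                if 1 <= uu <= SIDE and 1 <= vv <= SIDE:
                    nbrs[index(u, v)].append(index(uu, vv))
    return nbrs

def population(pattern):
    """Population sizes in units of 10^5 for patterns 'a0', 'a1', and 'a2'."""
    pops = [1] * (SIDE * SIDE)
    for u in range(1, SIDE + 1):
        for v in range(1, SIDE + 1):
            if pattern == "a1":
                dense = u >= 11
            elif pattern == "a2":
                dense = (u <= 10) != (v <= 10)  # exactly one coordinate in the first half
            else:
                dense = False
            if dense:
                pops[index(u, v)] = 1000
    return pops

nbrs = rook_neighbors()
print(sum(len(nb) for nb in nbrs), population("a1").count(1000))
```

With a rate of 0.0001 and populations in units of 10^5, the expected count is 10 in a sparse cell and 10,000 in a dense cell.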
For a given pattern of population and rate distributions, we ran the simulations 10,000 times and calculated Ir, IPR, IDR, Oden's statistic, and EBI for each run. We fixed the disease rate at 0.0001 for all locations in the seven population patterns (a0–a6), so that E(Ni)=V(Ni)=10 if ξi=1 unit and E(Ni)=V(Ni)=10,000 if ξi=1000 units. Under the null hypothesis, we assessed the validity of the permutation test by comparing (a) the observed variance for each test statistic versus the corresponding permutation variance and (b) the rejection rates at the 0.05 nominal value (α=0.05). In the preliminary analysis, we also compared the observed and permutation means for each of Ir, IPR, IDR, Oden's statistic, and EBI. As the difference between the two could be ignored, we will not compare them.
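A scaled-down version of this experiment can be sketched as follows for pattern (a1). It is not the authors' simulation code: it uses 400 runs rather than 10,000, assumes rook adjacency, draws large-mean Poisson counts with a normal approximation, and compares the sampling variance of Ir directly with that of IPR instead of computing permutation variances. All names are hypothetical.

```python
import math
import random

SIDE, RATE, UNIT = 20, 0.0001, 1e5

def rook_neighbors():
    nbrs = []
    for u in range(SIDE):
        for v in range(SIDE):
            cand = ((u - 1, v), (u + 1, v), (u, v - 1), (u, v + 1))
            nbrs.append([SIDE * a + b for a, b in cand
                         if 0 <= a < SIDE and 0 <= b < SIDE])
    return nbrs

def morans_i(z, nbrs):
    m = len(z)
    zbar = sum(z) / m
    d = [x - zbar for x in z]
    s0 = sum(len(nb) for nb in nbrs)
    cross = sum(d[i] * d[j] for i in range(m) for j in nbrs[i])
    return (m / s0) * cross / sum(x * x for x in d)

def poisson(mean, rng):
    if mean > 500:  # normal approximation for large means (assumption of this sketch)
        return max(0, round(rng.gauss(mean, math.sqrt(mean))))
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:  # Knuth's multiplication method
        k += 1
        prod *= rng.random()
    return k

def variance(values):
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / (len(values) - 1)

def simulate(pops, runs=400, seed=1):
    """Sampling variances of Ir (rates) and IPR (null-model Pearson residuals)."""
    rng = random.Random(seed)
    nbrs = rook_neighbors()
    ir, ipr = [], []
    for _ in range(runs):
        counts = [poisson(RATE * p * UNIT, rng) for p in pops]
        overall = sum(counts) / sum(pops)  # cases per population unit
        ir.append(morans_i([n / p for n, p in zip(counts, pops)], nbrs))
        ipr.append(morans_i([(n - overall * p) / math.sqrt(overall * p)
                             for n, p in zip(counts, pops)], nbrs))
    return variance(ir), variance(ipr)

half_dense = [1 if i < 200 else 1000 for i in range(400)]  # pattern (a1)
var_ir, var_ipr = simulate(half_dense)
print(var_ir, var_ipr, var_ir / var_ipr)
```

Under (a1) the rate-based statistic is driven almost entirely by the sparse half of the lattice, so its sampling variance is noticeably larger than that of the residual-based statistic even though no clustering is present.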
Table 1 lists the simulation results of the observed and permutation variances of Ir, IPR, IDR, Oden's statistic, and EBI based on the seven patterns (a0–a6), where σS2 denotes the observed sample variance from the simulation and σR2 denotes the estimated variance from the random permutation scheme. As σS2 can be treated as the true value, we compared the ratio of σS2 to σR2. A ratio close to 1 suggests that the permutation variance is close to the true variance.
Table 1. Comparison of Variance for Selected Test Statistics
Columns: Ir (× 0.001), IPR (× 0.001), IDR (× 0.001), EBI (× 0.001). Recoverable entries: 1.112 × 10^6, 1.147 × 10^6.
The results show that, when the populations were homogeneous, the ratios between σS2 and σR2 were very close to 1. When the populations were heterogeneous, the rate-based test Ir produced predominantly biased variance estimates, a result consistent with Waldhor's simulations. When a set of sparsely populated points or grids were clustered, the permutation tended to underestimate the true variance. In the cases of one-cluster (a5) and two-cluster (a6) patterns, the permutation variances were underestimated by about 9.2% and 18.2%, respectively. When the population distribution was characterized by (a1) or (a2), the permutation underestimated true variances by about 50%. In contrast, the ratios between σS2 and σR2 for IPR, IDR, Oden's statistic, and EBI were all very close to 1 in patterns (a1) to (a6). The results for IPR and IDR suggest that the estimated values of the variance are trustworthy and that they can effectively reduce potentially inflated variance. As the ratios were very close to 1 for Oden's statistic, EBI, IPR, and IDR, using IPR or IDR might not be critical if one simply wants to account for heteroskedasticity. Nevertheless, IPR and IDR have an advantage when one wishes to incorporate covariates.
Table 2 lists the rejection rates of the permutation test based on Ir, IPR, IDR, and EBI according to the seven spatial population patterns. When populations were homogeneous, all the test statistics had a type I error rate between 4.60% and 5.03%, with only IDR above 5% by a fraction. When populations were heterogeneous, all the test statistics except Ir had consistent rejection rates around 5%, suggesting that IPR, IDR, Oden's statistic, and EBI were all reasonable under the null hypothesis and that they can effectively correct potentially inflated variance with a satisfactory level of type I errors. The test based on Ir, in contrast, failed to maintain the nominal value of 0.05 because the permutation variance underestimated the inflated true variance. Although some patterns (a3 and a4) registered lower rejection rates, all the heterogeneous patterns had a rejection rate of more than 5%. In the cases of one or two sparsely populated regions clustered in a densely populated study area (patterns a5 and a6), the rejection rate for a spatially random pattern was more than 40%. These results essentially confirmed Waldhor's simulation results for Ir. Even though the performance of the permutation test for IPR and IDR is comparable with the performance of Oden's statistic and EBI, IPR and IDR have the flexibility of incorporating potential covariates into their tests for spatial clustering, which we will demonstrate in the following section.
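As a rough yardstick for reading these rejection rates (our own back-of-the-envelope check, not part of the original analysis), the Monte Carlo standard error of an estimated rejection rate from R=10,000 independent runs of a test whose true level is α=0.05 is

```latex
\mathrm{SE} = \sqrt{\frac{\alpha(1-\alpha)}{R}}
            = \sqrt{\frac{0.05 \times 0.95}{10{,}000}}
            \approx 0.0022 .
```

That is about 0.22 percentage points, so estimated rates between roughly 4.6% and 5.4% are within two standard errors of the nominal level, whereas rates above 40% cannot be attributed to simulation noise.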
Table 2. Rejection Rates (%) for Selected Test Statistics
Data and variables
County-level breast cancer and at-risk population data were obtained from the Kentucky cancer registry for the years 1996–2000. The data set reports breast cancer cases according to their developmental stage as follows: 0, a benign tumor; 1, an in situ tumor; 2, a localized tumor; 3, a regional tumor; 4, a distant tumor; and 5, an unknown stage, usually used to code patients who died with later stage disease without an autopsy report on file. For the purpose of our analysis, we deleted the stage 0 cases. Following the U.S. Surveillance, Epidemiology, and End Results (SEER) Program definitions, we regrouped the in situ and localized tumors as early stage and the regional, distant, and unknown tumors as late stage. Generally, early-stage breast carcinomas are confined to the breast and can often be treated successfully, whereas late-stage carcinomas tend to spread beyond the breast and are often fatal.
There were a total of 16,055 breast cancer cases during this period, and 80% of the cases were between age 35 and 75. On average, there were 96.7 per 100,000 women diagnosed at an early stage and 36.9 per 100,000 at a later stage. If the overall breast cancer incidence rate is constant across counties, the early-stage breast cancer rate should, theoretically, be negatively related to the late-stage breast cancer rate. Based on 10-year age-group incidence data during the same period, we calculated the indirectly standardized incidence ratio (SIR), or relative risk, for each county (Waller and Gotway 2004, p. 15), and the results are shown in Figs. 2 and 3 in six quantiles. We observed that a strip of counties along the boundary between Appalachian and non-Appalachian regions had an elevated early-stage breast cancer risk, as did the westernmost counties. With regard to late-stage breast cancer, there was a cluster of counties with a high incidence rate around the northeastern Appalachian area; counties along the southern border of the state also had a higher relative risk.
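Indirect standardization can be sketched as follows: age-specific reference rates (here, statewide rates) are applied to each county's age structure to obtain an expected count, and the SIR is the ratio of observed to expected cases. The function names and all numbers below are hypothetical and use three age groups only for brevity.

```python
def reference_rates(cases_by_age, pop_by_age):
    """Age-specific rates in the reference (e.g., statewide) population."""
    return [c / p for c, p in zip(cases_by_age, pop_by_age)]

def indirect_sir(observed, county_pop_by_age, rates):
    """Observed cases divided by the count expected under the reference rates."""
    expected = sum(p * r for p, r in zip(county_pop_by_age, rates))
    return observed / expected

# Hypothetical reference population and one hypothetical county.
rates = reference_rates([100, 400, 500], [200000, 200000, 100000])
sir = indirect_sir(12, [2000, 1500, 1000], rates)
print(round(sir, 3))
```

An SIR above 1 indicates more cases than the county's age structure alone would predict; the expected counts computed this way can also serve as the Ei in the Poisson formulation Ni∼Poisson(Eiθi).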
As screening for early-stage breast cancer tends to be associated with age, socioeconomic conditions, and access to health care within a geographic area (Freeman 1989), we included additional county variables while testing spatial clustering for both early-stage and late-stage breast cancer incidents. We obtained county population and socioeconomic data from the 2000 U.S. Census. Age is expected to be positively associated with breast cancer incidents, and we obtained median age and population age groups as potential control variables. The socioeconomic conditions in a county can be related to breast cancer in two ways (Bradley, Given, and Robert 2001). On the one hand, breast cancer is more prevalent among White women or those with higher socioeconomic status, and so counties that have a higher median family income (MEDFINC) are expected to have greater breast cancer incidence rates. On the other hand, women who have a higher socioeconomic status tend to be more aware of and more able to afford breast cancer screening than are those who have a lower socioeconomic status. Consequently, although counties that have better socioeconomic conditions may have a higher incidence rate of early-stage disease, they may not necessarily have a higher incidence rate of late-stage disease than do counties with worse socioeconomic conditions (Roche, Skinner, and Weinstein 2002).
We relied on several other data sources for access-to-care measures. We obtained breast cancer screening rates from the 1997 and 1998 Behavioral Risk Factor Surveillance System (BRFSS) surveys. Owing to confidentiality concerns, the original release of cancer screening levels had some counties grouped together, and so we divided rates into tertiles of high (H-screening: >70%), middle (M-screening: 65–70%), and low (L-screening: <65%). A higher breast cancer screening level (primarily by mammography) is expected to be positively associated with early-stage incidence rates and negatively associated with late-stage rates. In the preliminary analysis, we found that the differences between low and middle tertiles were minimal, and we grouped them together in the final analysis. We also obtained the 1998 population-to-primary care physician ratio (POP/PMD) from the Kentucky Department of Public Health; a lower ratio indicates that a physician can pay more attention to each patient. As breast cancer screening is most frequently recommended in a primary care setting, having a greater number of primary care physicians should help to reduce the incidence of all stages of breast cancer. For this reason, we expected the population-to-physician ratio to be negatively associated with both early-stage and late-stage breast cancer rates.
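The coding of the screening variable can be written down directly from the stated cut points. The sketch below is only an illustration of that coding; the function names are ours, and the treatment of values exactly at 65% and 70% as middle follows the stated ranges.

```python
def screening_tertile(rate_percent):
    """Tertile label from the stated cut points: H > 70, M 65-70, L < 65."""
    if rate_percent > 70:
        return "H"
    if rate_percent >= 65:
        return "M"
    return "L"

def high_screening(rate_percent):
    """1 for the high tertile, 0 for the pooled low and middle tertiles."""
    return 1 if screening_tertile(rate_percent) == "H" else 0

# Hypothetical county screening rates in percent.
print([high_screening(r) for r in (62.0, 68.5, 73.1)])
```

Pooling the low and middle tertiles leaves a single indicator variable, which is how the high screening level enters the log-rate models.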
Finally, we used data from a geographic information system to derive geographic access measures for Kentucky counties. We used the TIGER file from the 2000 U.S. census to derive a measure of access to major highways. A county is coded 1 if a major national highway (HWY) passes through it, and 0 otherwise. It was expected that highway access would increase access to health care facilities and increase early-stage breast cancer diagnoses. We also divided counties into within and outside the Appalachian region (Appalachian) as encircled by the bold line in Figs. 2 and 3. Counties within the Appalachian region generally are economically distressed and medically underserved, and the all-cause mortality rate tends to be much higher within the region (Haaga 2004; Mather 2004).
To test Moran's I for spatial clustering, we first used Pearson residuals IPR and deviance residuals IDR for the null model without any covariates. Here, IPR and IDR in the null loglinear model correspond to the traditional Moran's I without any covariates. We then introduced explanatory variables in the so-called ecological model. In the preliminary analysis, we found that college education, poverty rate, and MEDFINC were highly correlated; we used MEDFINC in the final analysis because it was the most significant variable in terms of the likelihood ratio test. We also experimented with different age variables and found that proportions of age 40–64 together with age 65 and over were not significant. We selected median age, because it was consistently significant in the ecological models for both stages; the greater the age, the greater the likelihood of breast cancer. We expected that the null model would indicate some spatial clustering through significant spatial autocorrelation and that the correlation should weaken or disappear once the explanatory variables were introduced. As most of the explanatory variables in the literature are tapped for the early stage, their effects on the late stage are expected to be weaker.
If the autocorrelation was found to persist or could not be explained by the ecological model, our task was to locate spatially clustered counties and provide our findings to epidemiologists and cancer specialists for further identification of the etiologies associated with breast cancer clusters. We used a spatial mixed model to search for high-value and low-value spatial clusters by including additional local spatial association terms, as demonstrated by Lin (2003). The method makes use of the ith row vector of the spatial weight matrix, with elements equal to 1 for the regions adjacent to i, inclusive of the ith region itself, and 0 otherwise. If a cluster of counties associated with the ith vector could significantly reduce the deviance (likelihood ratio statistic), it indicates a local association or cluster centered around the ith county. If the association is positive, it suggests a high-value cluster. If the association is negative, it suggests a low-value cluster (Lin and Zhang 2004). After controlling for pockets of high-value and low-value clustering and ecological covariates, we would then expect an insignificant residual Moran's I.
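The local spatial association term for a candidate core region is just an indicator variable for that region and its neighbours. A minimal sketch, with hypothetical names and a toy adjacency structure, is:

```python
def local_indicator(i, neighbors, m):
    """0/1 vector marking region i and the regions adjacent to it."""
    vec = [0] * m
    vec[i] = 1
    for j in neighbors[i]:
        vec[j] = 1
    return vec

# Toy adjacency (hypothetical): five regions arranged on a line.
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3]]
term = local_indicator(2, neighbors, 5)
print(term)
```

Each such vector can be added to the log-rate model as an extra covariate; a positive and significant coefficient would point to a high-value cluster around the core region and a negative one to a low-value cluster.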
Table 3 lists Moran's I coefficients and the t values of ecological variables from the early-stage log-rate models. In the null model, both Pearson residual Moran's IPR and deviance residual Moran's IDR were positively significant, suggesting a clustering tendency. Once the explanatory variables were introduced into the ecological model, however, IPR and IDR both became insignificant based on their corresponding residuals. Hence, a spatial clustering tendency in the null model reflected spatial patterning of age structure, socioeconomic status, and access to care. In particular, counties with an older median age, a high level of breast cancer screening, or with easy highway access were associated with higher detection rates of early-stage breast cancer, whereas the POP/PMD and location in the Appalachian region were negatively associated with detection rates. These results were all consistent with our expectations and the existing literature. Finally, county MEDFINC had a negative and weak (P value <0.10) association with the early breast cancer detection rate. While it has been reported widely that women in less-developed counties, such as those in the Appalachian area of Kentucky, are less likely to have their breast cancer diagnosed early (Gregorio et al. 2002), it seemed counterintuitive to us that a higher county MEDFINC would be associated with a lower county rate of early-stage breast cancer. When we added only MEDFINC to the null model, the coefficient for MEDFINC was positive and significant. The weak association, therefore, may reflect an effect when age, the level of breast cancer screening, and other access-to-care variables were taken into account.
Table 3. Early-Stage Breast Cancer Log-Rate Models
Note: G2 stands for likelihood ratio χ2 and t value is the ratio of estimated value and its standard error. POP/PMD, population-to-primary care physician ratio; MEDFINC, median family income.
Turning to the results from the late-stage log-rate models (Table 4), we found that Moran's IPR and IDR were significant for both the null and ecological models and that the explanatory variables in the ecological model were insufficient to account for the clustering tendency of the late-stage incidence rates. Three variables that remained significant were median age, population-to-physician ratio, and high screening level. The coefficients for the median age and population-to-physician ratio remained consistent with the early-stage model. However, the late-stage rate became negatively associated with a high screening level. This result was consistent with our expectation that a high screening level was associated with high rates of early-stage diagnoses and low rates of late-stage disease.
Table 4. Late-Stage Breast Cancer Log-Rate Models
Note: G2 stands for likelihood ratio χ2 and t value is the ratio of estimated value and its standard error. POP/PMD, population-to-primary care physician ratio; MEDFINC, median family income.
To further pinpoint the core of clustered counties unexplained by the ecological model, we deleted insignificant variables except median age from the ecological model and applied a stepwise regression by including local spatial association terms. We identified four local association terms, none of which overlapped geographically. Each core county and its adjacent counties constituted a cluster, and including the core county only would not significantly reduce the clustered effect. Except for a cool spot around Barren County, the other three core counties represented the centers of three elevated late-stage clusters (Fig. 4). For instance, the rate in Union County and its adjacent counties was 1.415 times the rates of other counties not included in the clusters. By including these clusters in the final model, both IPR and IDR were found to be insignificant, suggesting the disappearance of the clustering tendency once these clusters were accounted for in the model. It is worth mentioning that, if we dropped Marshall County, the median age effect would be significant, but the clustering effect would still remain. This suggests that a greater proportion of aging population around Marshall and its adjacent counties contributed to this cluster. It is also worth mentioning that Jefferson County, where Louisville is located, had a very low late-stage rate, but its adjacent counties each had a relatively high rate. Although including Jefferson County in the final model would reduce the log-likelihood ratio and the t values for IPR and IDR, counties around Jefferson County would not constitute a cluster.
Even though the main purpose of using IPR and IDR was to account for heteroskedasticity, new insights were gained from analyzing IPR and IDR in the model-fitting process. When the five explanatory variables were sufficient to reduce the spatial clustering effect of early-stage breast cancer, they highlighted the underlying socioeconomic and health access processes that operate in geographic space. When known explanatory variables were not sufficient to reduce the spatial clustering effect, as in the case of late-stage breast cancer, local spatial association terms could be used to further identify local clusters. The identification of spatial clusters may help to reveal unique geographic characteristics in the clustered areas that are associated with cancer disparities. Without measuring IPR or IDR, we would not be able to evaluate potential clustering effects in traditional log-rate or Poisson regression models.
Although the asymptotic validity of permutation tests has been demonstrated in the literature and Pearson residuals and deviance residuals of loglinear models are asymptotically normal, no one has evaluated and applied them in the spatial context. In the current study, we have bridged Pearson residuals and deviance residuals of loglinear models with the permutation test of Moran's I under the asymptotic normality assumption. Our simulation study showed that both test statistics, IPR and IDR, were effective in reducing the inflated variance caused by heterogeneous populations, and both had an acceptable type I error rate. In addition, we applied both tests to a set of log-rate models or Poisson regressions for early-stage and late-stage breast cancer incidence data, together with socioeconomic and access-to-care data in Kentucky. The results showed that socioeconomic and access-to-care variables were sufficient to account for spatial clustering of early-stage breast carcinomas, with access-to-care measures, such as breast cancer screening and number of primary care providers, being more persistent than county MEDFINC. After controlling for age, access-to-care measures, and regional distress factors in the Appalachian counties, the purported positive association between higher socioeconomic status and early-stage breast cancer could be substantially weakened.
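The bridge described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes that IPR and IDR are the usual Moran's I applied to the Pearson and deviance residuals of a fitted Poisson log-rate model, uses a null model with a single common rate, and replaces the Kentucky county map with a hypothetical 5 × 5 rook-adjacency lattice and simulated populations and counts. All function names are illustrative.

```python
import numpy as np

def rook_weights(nrow, ncol):
    """Binary rook-adjacency matrix for an nrow x ncol lattice (toy map)."""
    m = nrow * ncol
    w = np.zeros((m, m))
    for r in range(nrow):
        for c in range(ncol):
            i = r * ncol + c
            if c + 1 < ncol:          # neighbor to the right
                w[i, i + 1] = w[i + 1, i] = 1.0
            if r + 1 < nrow:          # neighbor below
                w[i, i + ncol] = w[i + ncol, i] = 1.0
    return w

def pearson_residuals(y, mu):
    """Poisson Pearson residuals: (observed - expected) / sqrt(expected)."""
    return (y - mu) / np.sqrt(mu)

def deviance_residuals(y, mu):
    """Poisson deviance residuals; y*log(y/mu) is taken as 0 when y == 0."""
    ratio = np.where(y > 0, y / mu, 1.0)
    dev = 2.0 * (y * np.log(ratio) - (y - mu))
    return np.sign(y - mu) * np.sqrt(np.maximum(dev, 0.0))

def morans_i(r, w):
    """Moran's I of the values r under the spatial weight matrix w."""
    z = r - r.mean()
    return len(r) / w.sum() * (z @ w @ z) / (z @ z)

def permutation_test(r, w, n_perm=999, seed=0):
    """One-sided permutation p-value for positive spatial autocorrelation."""
    rng = np.random.default_rng(seed)
    observed = morans_i(r, w)
    sims = np.array([morans_i(rng.permutation(r), w) for _ in range(n_perm)])
    p_value = (1 + np.sum(sims >= observed)) / (n_perm + 1)
    return observed, p_value

# Simulated data under the null: heterogeneous populations, one common rate.
rng = np.random.default_rng(42)
w = rook_weights(5, 5)
pop = rng.integers(500, 50000, size=25).astype(float)
y = rng.poisson(0.002 * pop).astype(float)

# Null log-rate model: expected count = population x overall rate.
mu = pop * y.sum() / pop.sum()

for name, resid in [("IPR", pearson_residuals(y, mu)),
                    ("IDR", deviance_residuals(y, mu))]:
    stat, p = permutation_test(resid, w)
    print(name, round(stat, 3), round(p, 3))
```

For an ecological model, `mu` would instead hold the fitted counts from the Poisson regression with covariates; the residual and permutation steps stay the same.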
For late-stage carcinomas, two salient and persistent factors were level of breast cancer screening and POP/PMD. In contrast to the finding that a high screening level was associated with a high incidence rate of early-stage breast cancer, the late-stage incidence rate was negatively associated with breast cancer screening level. This result confirmed our expectation: a high screening level is associated with a high incidence rate of early-stage diagnoses, which in turn reduces late-stage incidence rates. When the two access variables failed to reduce the spatial clustering tendency of late-stage breast cancer, we searched for local spatial associations based on the likelihood ratio test. We located four clusters: one low-value cluster around Barren County and three high-value clusters. Two of the high-value clusters were adjacent to each other near the western corner of Kentucky. These unexplained clusters provide the basis for further investigation of the etiological and ecological factors of late-stage breast cancer in Kentucky. It should be pointed out that, even though we included median age, the age effect may not be fully accounted for in our model; future work should explore ways to account for both age and ecological effects while testing for spatial clustering.
Spatial regressions have been widely used, but their use with the permutation tests of residuals either in linear or loglinear models is rarely seen. An advantage of the loglinear residual permutation test over the linear residual permutation test is that the former can account for potential spatially heterogeneous populations, which makes it a viable alternative in the log-rate modeling of disease rates, as demonstrated in our study. The method is expected to complement some spatial cluster tests, such as the spatial scan statistic (Kulldorff 1997) and G statistic (Getis and Ord 1992). In addition, the ability to show spatial clustering in IPR and IDR is complementary to disease mapping, which intends to display true disease risks while controlling for heterogeneous populations and regional risk factors (Lawson and Clark 2002). Finally, a local version of loglinear residuals can be developed as an exploratory tool to complement other local indicators of spatial association (Anselin 1995).
Data used in this publication were provided by the Kentucky Cancer Registry, Lexington, KY. We would like to acknowledge helpful comments from Linda Pickle and Robert Hanham. We would also like to thank three reviewers and the editor for their comments and suggestions.
1 We also analyzed the late-stage cases versus all cases at the patient level. The results were consistent. Population-to-physician ratio, breast cancer screening, and proximity to a highway were less likely to be associated with late-stage diagnoses, whereas residing in the Appalachian region was more likely to be associated with late-stage diagnoses. Counties around Jefferson County had an excess of late-stage breast cancer diagnoses among breast cancer patients.