### Abstract

- Top of page
- Abstract
- Introduction
- Design and functionality
- Mapping and geovisualization
- Multivariate EDA
- Spatial autocorrelation analysis
- Spatial regression
- Future directions
- Acknowledgements
- References

This article presents an overview of GeoDa™, a free software program intended to serve as a user-friendly and graphical introduction to spatial analysis for users who are not geographic information systems (GIS) specialists. It includes functionality ranging from simple mapping to exploratory data analysis, the visualization of global and local spatial autocorrelation, and spatial regression. A key feature of GeoDa is an interactive environment that combines maps with statistical graphics, using the technology of dynamically linked windows. A brief review of the software design is given, as well as some illustrative examples that highlight distinctive features of the program in applications dealing with public health, economic development, real estate analysis, and criminology.

### Introduction

The development of specialized software for spatial data analysis has seen rapid growth since the lack of such tools was lamented in the late 1980s by Haining (1989) and cited as a major impediment to the adoption and use of spatial statistics by geographic information systems (GIS) researchers. Initially, attention tended to focus on conceptual issues, such as how to integrate spatial statistical methods and a GIS environment (loosely versus tightly coupled, embedded versus modular, etc.), and which techniques would be most fruitfully included in such a framework. Familiar reviews of these issues are represented in, among others, Anselin and Getis (1992); Goodchild et al. (1992); Fischer and Nijkamp (1993); Fotheringham and Rogerson (1993, 1994); Fischer, Scholten, and Unwin (1996); and Fischer and Getis (1997). Today, the situation is quite different, and a fairly substantial collection of spatial data analysis software is readily available, ranging from niche programs, customized scripts and extensions for commercial statistical and GIS packages, to a burgeoning open-source effort using software environments such as R, Java, and Python. This is exemplified by the growing contents of the software tools clearing house maintained by the U.S.-based Center for Spatially Integrated Social Science (CSISS).^{1}

CSISS was established in 1999 as a research infrastructure project funded by the U.S. National Science Foundation in order to promote a spatial analytical perspective in the social sciences (Goodchild et al. 2000). It was readily recognized that a major instrument in disseminating and facilitating spatial data analysis would be an easy-to-use, visual, and interactive software package aimed at the non-GIS user and requiring as little as possible in terms of other software (such as GIS or statistical packages). *GeoDa* is the outcome of this effort. It is envisaged as an “introduction to spatial data analysis” where the latter is taken to consist of visualization, exploration, and explanation of *interesting* patterns in geographic data.

The main objective of the software is to provide the user with a natural path through an empirical spatial data analysis exercise, starting with simple mapping and geovisualization, moving on to exploration, spatial autocorrelation analysis, and ending up with spatial regression. In many respects, *GeoDa* is a reinvention of the original *SpaceStat* package (Anselin 1992), which by now has become quite dated, with only a rudimentary user interface, an antiquated architecture, and performance constraints for medium and large data sets. The software was redesigned and rewritten from scratch, around the central concept of dynamically linked graphics. This means that different “views” of the data are represented as graphs, maps, or tables with selected observations in one highlighted in all. In that respect, *GeoDa* is similar to a number of other modern spatial data analysis software tools, although it is quite distinct in its combination of user friendliness with an extensive range of incorporated methods. A few illustrative comparisons will help clarify its position in the current spatial analysis software landscape.

In terms of the range of spatial statistical techniques included, *GeoDa* is most similar to the collection of functions developed in the open-source R environment. For example, descriptive spatial autocorrelation measures, rate smoothing, and spatial regression are included in the *spdep* package, as described by Bivand and Gebhardt (2000), Bivand (2002a,b), and Bivand and Portnov (2004). In contrast to R, *GeoDa* is completely driven by a point-and-click interface and does not require any programming. It also has more extensive mapping capability (still somewhat experimental in R) and full linking and brushing in dynamic graphics, which is currently not possible in R due to limitations in its architecture. On the other hand, *GeoDa* is not (yet) customizable or extensible by the user, which is one of the strengths of the R environment. In that sense, the two are seen as highly complementary, ideally with more sophisticated users “graduating” to R after being introduced to the techniques in *GeoDa*.^{2}

The use of dynamic linking and brushing as a central organizing technique for data visualization has a strong tradition in exploratory data analysis (EDA), going back to the notion of linked scatter plot brushing (Stuetzle 1987) and various methods for dynamic graphics outlined in Cleveland and McGill (1988). In geographical analysis, the concept of “geographic brushing” was introduced by Monmonier (1989) and made operational in the *Spider/Regard* toolboxes of Haslett, Unwin, and associates (Haslett, Wills, and Unwin 1990; Unwin 1994). Several modern toolkits for exploratory spatial data analysis (ESDA) also incorporate dynamic linking, and, to a lesser extent, brushing. Some of these rely on interaction with a GIS for the map component, such as the linked frameworks combining XGobi or XploRe with ArcView (Cook et al. 1996, 1997; Symanzik et al. 2000); the SAGE toolbox, which uses ArcInfo (Wise, Haining, and Ma 2001); and the DynESDA extension for ArcView (Anselin 2000), *GeoDa*'s immediate predecessor. Linking in these implementations is constrained by the architecture of the GIS, which limits the linking process to a single map (in *GeoDa*, there is no limit on the number of linked maps). In this respect, *GeoDa* is similar to other freestanding modern implementations of ESDA, such as the cartographic data visualizer, or *cdv* (Dykes 1997), GeoVISTA Studio (Takatsuka and Gahegan 2002), and STARS (Rey and Janikas 2006). These all include functionality for dynamic linking, and to a lesser extent, brushing. They are built in open-source programming environments, such as Tcl/Tk (cdv), Java (GeoVISTA Studio), or Python (STARS) and are thus easily extensible and customizable. In contrast, *GeoDa* is (still) a closed box, but of these packages it provides the most extensive and flexible form of dynamic linking and brushing for both graphs and maps.

Common spatial autocorrelation statistics, such as Moran's *I* and even the Local Moran, are increasingly part of spatial analysis software, ranging from *CrimeStat* (Levine 2006), to the *spdep* and *DCluster* packages available on the open-source comprehensive R archive network (CRAN),^{3} as well as commercial packages, such as the spatial statistics toolbox of the forthcoming release of ArcGIS 9.0 (ESRI 2004). However, at this point in time, none of these include the range and ease of construction of spatial weights or the capacity to carry out sensitivity analysis and visualization of these statistics contained in *GeoDa*. Apart from the R *spdep* package, *GeoDa* is the only one to contain functionality for spatial regression modeling among the software mentioned here.

A prototype version of the software (known as *DynESDA*) has been in limited circulation since early 2001 (Anselin, Syabri, and Smirnov 2002a; Anselin et al. 2002b), but the first official release of a beta version of *GeoDa* occurred on February 5, 2002. The program is available for free and can be downloaded from the CSISS software tools Web site (http://geoda.uiuc.edu). The most recent version, 0.9.5-i, was released in January 2003. The software has been well received for both teaching and research use and has a rapidly growing body of users. For example, by the fall of 2005, there were more than 8000 registered users, increasing at a rate of about 200 new users per month.

In the remainder of the article, we first outline the design and briefly review the overall functionality of *GeoDa*. This is followed by a series of illustrative examples, highlighting features of the mapping and geovisualization capabilities, exploration in multivariate EDA, spatial autocorrelation analysis, and spatial regression. The article closes with some comments regarding future directions in the development of the software.

### Design and functionality

The design of *GeoDa* consists of an interactive environment that combines maps with statistical graphs, using the technology of dynamically linked windows. It is geared to the analysis of *discrete* geospatial data, that is, objects characterized by their location in space either as points (point coordinates) or polygons (polygon boundary coordinates). The current version adheres to ESRI's shape file as the standard for storing spatial information. It contains functionality to read and write such files, as well as to convert ASCII text input files for point coordinates or boundary file coordinates to the shape file format. It uses ESRI's MapObjects LT2 technology for spatial data access, mapping, and querying. The analytical functionality is implemented in a modular fashion, as a collection of C++ classes with associated methods.

In broad terms, the functionality can be classified into six categories:

- *spatial data manipulation and utilities*: data input, output, and conversion,
- *data transformation*: variable transformations and creation of new variables,
- *mapping*: choropleth maps, cartogram and map animation,
- *EDA*: statistical graphics,
- *spatial autocorrelation*: global and local spatial autocorrelation statistics, with inference and visualization,
- *spatial regression*: diagnostics and maximum likelihood estimation of linear spatial regression models.

Table 1. *GeoDa* Functionality Overview

| Category | Functions |
|---|---|
| Spatial data | data input from shape file (point, polygon) |
| | data input from text (to point or polygon shape) |
| | data output to text (data or shape file) |
| | create grid polygon shape file from text input |
| | centroid computation |
| | Thiessen polygons |
| Data transformation | variable transformation (log, exp, etc.) |
| | queries, dummy variables (regime variables) |
| | variable algebra (addition, multiplication, etc.) |
| | spatial lag variable construction |
| | rate calculation and rate smoothing |
| | data table join |
| Mapping | generic quantile choropleth map |
| | standard deviational map |
| | percentile map |
| | outlier map (box map) |
| | circular cartogram |
| | map movie |
| | conditional maps |
| | smoothed rate map (EB, spatial smoother) |
| | excess rate map (standardized mortality rate, SMR) |
| EDA | histogram |
| | box plot |
| | scatter plot |
| | parallel coordinate plot |
| | three-dimensional scatter plot |
| | conditional plot (histogram, box plot, scatter plot) |
| Spatial autocorrelation | spatial weights creation (rook, queen, distance, k-nearest) |
| | higher order spatial weights |
| | spatial weights characteristics (connectedness histogram) |
| | Moran scatter plot with inference |
| | bivariate Moran scatter plot with inference |
| | Moran scatter plot for rates (EB standardization) |
| | Local Moran significance map |
| | Local Moran cluster map |
| | bivariate Local Moran |
| | Local Moran for rates (EB standardization) |
| Spatial regression | OLS with diagnostics (e.g., LM test, Moran's I) |
| | Maximum likelihood spatial lag model |
| | Maximum likelihood spatial error model |
| | predicted value map |
| | residual map |

The software implementation consists of two important components: the user interface and graphics windows on the one hand and the computational engine on the other hand. In the current version, all graphic windows are based on Microsoft Foundation Classes (MFC) and thus are limited to MS Windows platforms.^{5} In contrast, the computational engine (including statistical operations, randomization, and spatial regression) is pure C++ code and largely cross platform.

The bulk of the graphical interface implements five basic classes of windows: histogram, box plot, scatter plot (including the Moran scatter plot), map, and grid (for the table selection and calculations). The choropleth maps, including the significance and cluster maps for the local indicators of spatial autocorrelation (LISA), are derived from MapObjects classes. Three additional types of maps were developed from scratch and do not use MapObjects: the map movie (map animation), the cartogram, and the conditional maps. The three-dimensional scatter plot is implemented with the OpenGL library.

The functionality of *GeoDa* is invoked either through menu items or directly by clicking toolbar buttons, as illustrated in Fig. 1. A number of specific applications are highlighted in the following sections, focusing on some distinctive features of the software.

### Mapping and geovisualization

The bulk of the mapping and geovisualization functionality consists of a collection of specialized choropleth maps, focused on highlighting outliers in the data, so-called *box maps* (Anselin 1999). In addition, considerable capability is included to deal with the intrinsic variance instability of rates, in the form of empirical Bayes (EB) or spatial smoothers.^{6} As mentioned in “Design and functionality,” the mapping operations use the classes contained in ESRI's MapObjects, extended with the capability for linking and brushing. *GeoDa* also includes a circular cartogram,^{7} map animation in the form of a map movie, and conditional maps. The latter are nine micro-choropleth maps constructed by conditioning on three intervals for two conditioning variables, using the principles outlined in Becker, Cleveland, and Shyu (1996) and Carr et al. (2002).^{8} In contrast to the traditional choropleth maps, the cartogram, map movie, and conditional maps do not use MapObjects classes, and were developed from scratch.
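The idea behind a box map can be sketched in a few lines of Python. This is an illustration only, not *GeoDa*'s code: the function name `box_map_classes`, the six category labels, and the hinge of 1.5 times the interquartile range are assumptions borrowed from the conventional box plot.

```python
from statistics import quantiles


def box_map_classes(values, hinge=1.5):
    """Assign each value to one of six box map categories (hypothetical helper)."""
    # Quartiles of the observed values (inclusive interpolation, an assumption).
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - hinge * iqr
    upper_fence = q3 + hinge * iqr

    labels = []
    for v in values:
        if v < lower_fence:
            labels.append("lower outlier")
        elif v < q1:
            labels.append("<25%")
        elif v < q2:
            labels.append("25-50%")
        elif v < q3:
            labels.append("50-75%")
        elif v <= upper_fence:
            labels.append(">75%")
        else:
            labels.append("upper outlier")
    return labels


rates = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
classes = box_map_classes(rates)
```

In this toy input only the value 100 falls beyond the upper fence, so it is the single observation that would be drawn in the outlier color.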

We illustrate the rate smoothing procedure, outlier maps and linking operations. The objective in this analysis is to identify locations that have elevated mortality rates and to assess the sensitivity of the designation as outlier to the effect of rate smoothing. Using data on prostate cancer mortality in 156 counties contained in the Appalachian Cancer Network (ACN) for the period 1993–97, we construct a box map by specifying the number of deaths as the numerator and the population as the denominator.^{9} The resulting map for the crude rates (i.e., without any adjustments for differing age distributions or other relevant factors) is shown as the upper-left panel in Fig. 2. Three counties are identified as *outliers* and shown in dark red.^{10} These match the outliers *selected* in the box plot in the lower-left panel of the figure. The *linking* of all maps and graphs results in those counties also being cross-hatched on the maps.

The upper-right panel in the figure represents a smoothed rate map, where the rates were transformed by means of an Empirical Bayes procedure to remove the effect of the varying population at risk. As a result, the original outliers are no longer outliers, but a different county is identified as having elevated risk. A lower outlier is also found, shown as dark blue in the box map.^{11} Note that the upper outlier is barely distinguishable, due to the small area of the county in question. This is a common problem when working with administrative units. In order to remove the potentially misleading effect of area on the perception of interesting patterns, a circular cartogram is shown in the lower-right panel of Fig. 2, where the area of the circles is proportional to the value of the EB smoothed rate. The upper outlier is shown as a red circle, the lower outlier as a blue circle. The yellow circles are the counties that were outliers in the crude rate map, highlighted here as a result of linking with the other maps and graphs.^{12}
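The effect of Empirical Bayes smoothing on small-population rates can be illustrated with a short sketch. The article does not spell out the estimator, so the code below assumes a common global method-of-moments version in which each crude rate is shrunk toward the overall rate, with more shrinkage where the population at risk is small. The function name `eb_smooth` and the toy counts are hypothetical.

```python
def eb_smooth(events, population):
    """Shrink crude rates toward the overall rate (method-of-moments EB, assumed)."""
    n = len(events)
    total_pop = sum(population)
    prior_mean = sum(events) / total_pop          # overall (pooled) rate
    raw = [e / p for e, p in zip(events, population)]

    # Population-weighted variance of the crude rates around the pooled rate.
    weighted_var = sum(
        p * (r - prior_mean) ** 2 for r, p in zip(raw, population)
    ) / total_pop
    # Remove the part of the variance due to sampling; truncate at zero.
    prior_var = max(weighted_var - prior_mean / (total_pop / n), 0.0)

    smoothed = []
    for r, p in zip(raw, population):
        denom = prior_var + prior_mean / p
        weight = prior_var / denom if denom > 0 else 0.0
        smoothed.append(weight * r + (1.0 - weight) * prior_mean)
    return smoothed


deaths = [2, 50, 100]
at_risk = [100, 10000, 20000]
crude = [d / p for d, p in zip(deaths, at_risk)]
smoothed = eb_smooth(deaths, at_risk)
```

The first area has a crude rate four times that of the others, but it rests on only 100 people at risk, so its smoothed rate is pulled almost all the way back to the pooled rate. This is the mechanism by which crude-rate outliers can disappear after smoothing.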

### Multivariate EDA

Multivariate exploratory data analysis is implemented in *GeoDa* through linking and brushing between a collection of statistical graphs. These include the usual histogram, box plot, and scatter plot, but also a parallel coordinate plot (PCP) and three-dimensional scatter plot, as well as conditional plots (conditional histogram, box plot, and scatter plot).

We illustrate some of this functionality with an exploration of the relationships between economic growth and initial development, typical of the recent “spatial” regional convergence literature (for an overview see Rey 2004). We use economic data over the period 1980–99 for 145 European regions, most of them at the NUTS II level of spatial aggregation, except for a few at the NUTS I level (for Luxembourg and the United Kingdom).^{13}

Fig. 3 illustrates the various linked plots and map. The left-hand panel contains a simple percentile map (GDP per capita in 1989), and a three-dimensional scatter plot (for the percent agricultural and manufacturing employment in 1989 as well as the GDP growth rate over the period 1980–99). In the top right-hand panel is a PCP for the growth rates in the two periods of interest (1980–89 and 1989–99) and the GDP per capita in the base year, the typical components of a convergence regression. In the bottom of the right-hand panel is a simple scatter plot of the growth rate in the full period (1980–99) on the base year GDP.

Both plots on the right-hand side illustrate the typical empirical phenomenon that higher GDP at the start of the period is associated with a lower growth rate. However, as demonstrated in the PCP (some of the lines suggest a positive relation between GDP and growth rate), the pattern is not uniform and there is a suggestion of heterogeneity. A further exploration of this heterogeneity can be carried out by brushing any one of these graphs. For example, in Fig. 3, a selection box in the three-dimensional scatter plot is moved around (*brushing*) which highlights the selected observations in the map (cross-hatched) and in the PCP, clearly showing opposite patterns in subsets of the selection. Furthermore, in the scatter plot, the slope of the regression line can be recalculated for a subset of the data without the selected locations, to assess the sensitivity of the slope to those observations. In the example shown here, the effect on convergence over the whole period is minimal (−0.147 versus −0.144), but other selections show a more pronounced effect. Further exploration of these patterns does suggest a degree of spatial heterogeneity in the convergence results (for a detailed investigation, see Le Gallo and Dall'erba 2003).
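The slope comparison that accompanies brushing amounts to refitting a simple regression without the selected observations. A minimal sketch follows; the function names and the toy data are hypothetical and are not taken from the European regions data set.

```python
def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def slope_with_and_without(x, y, selected):
    """Return (slope for all points, slope excluding the selected indices)."""
    keep = [i for i in range(len(x)) if i not in selected]
    full = ols_slope(x, y)
    reduced = ols_slope([x[i] for i in keep], [y[i] for i in keep])
    return full, reduced


base_gdp = [1, 2, 3, 4, 10]
growth = [2, 4, 6, 8, 0]
full_slope, reduced_slope = slope_with_and_without(base_gdp, growth, selected={4})
```

In this contrived example a single brushed observation reverses the sign of the slope, which is the kind of sensitivity the linked scatter plot is meant to expose.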

### Spatial autocorrelation analysis

Spatial autocorrelation analysis includes tests and visualization of both global (test for *clustering*) and local (test for *clusters*) Moran's *I* statistic. The global test is visualized by means of a Moran scatter plot (Anselin 1996), in which the slope of the regression line corresponds to Moran's *I*. Significance is based on a permutation test. The traditional univariate Moran scatter plot has been extended to depict bivariate spatial autocorrelation as well, that is, the correlation between one variable at a location, and a different variable at the neighboring locations (Anselin, Syabri, and Smirnov 2002a). In addition, there also is an option to standardize rates for the potentially biasing effect of variance instability (see Assunção and Reis 1999).
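The link between Moran's *I* and the slope in the Moran scatter plot, and the permutation approach to inference, can be sketched as follows. The code assumes row-standardized weights and that every observation has at least one neighbor; in that case *I* equals the slope of the regression of the spatially lagged deviations on the deviations themselves. The neighbor-list representation, the function names, and the one-sided pseudo *p*-value are illustrative assumptions, not *GeoDa*'s implementation.

```python
import random


def spatial_lag(z, neighbors):
    """Row-standardized spatial lag: the average of z over each neighbor list."""
    return [sum(z[j] for j in js) / len(js) for js in neighbors]


def morans_i(values, neighbors):
    """Moran's I as the slope of the spatial lag on the mean-centered variable."""
    mean = sum(values) / len(values)
    z = [v - mean for v in values]
    lag = spatial_lag(z, neighbors)
    return sum(a * b for a, b in zip(z, lag)) / sum(a * a for a in z)


def permutation_p_value(values, neighbors, permutations=999, seed=12345):
    """One-sided pseudo p-value: share of random relabelings with I >= observed."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbors)
    shuffled = list(values)
    extreme = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if morans_i(shuffled, neighbors) >= observed:
            extreme += 1
    return observed, (extreme + 1) / (permutations + 1)


# Six locations along a line; each is a neighbor of the adjacent ones.
chain = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
prices = [1, 2, 3, 4, 5, 6]
i_obs, p_value = permutation_p_value(prices, chain)
```

Because the reference distribution is built from random relabelings, the smallest attainable pseudo *p*-value is 1/(permutations + 1), which is why more permutations are needed to support stricter significance levels.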

Local analysis is based on the Local Moran statistic (Anselin 1995), visualized in the form of significance and cluster maps. It also includes several options for sensitivity analysis, such as changing the number of permutations (to as many as 9999), rerunning the permutations several times, and changing the significance cutoff value. This provides an ad hoc approach to assess the sensitivity of the results to problems due to multiple comparisons (i.e., how stable is the indication of clusters or outliers when the significance barrier is lowered).

The maps depict the locations with significant Local Moran statistics (LISA significance maps) and classify those locations by type of association (LISA cluster maps). Both types of maps are available for brushing and linking. In addition to these two maps, the standard output of a LISA analysis includes a Moran scatter plot and a box plot depicting the distribution of the local statistic. Similar to the Moran scatter plot, the LISA concept has also been extended to a bivariate setup and includes an option to standardize for variance instability of rates.
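A compact sketch of how a LISA cluster map could be derived is given below. It assumes row-standardized weights with no islands, a conditional permutation scheme (the value at location *i* is held fixed while the remaining values are randomly assigned to its neighbors), and a one-sided pseudo *p*-value. All function names and labels are hypothetical.

```python
import random


def local_moran(values, neighbors):
    """Return deviations z, their spatial lags, and Local Moran statistics."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(a * a for a in z) / n
    lag = [sum(z[j] for j in js) / len(js) for js in neighbors]
    local_i = [z[i] * lag[i] / m2 for i in range(n)]
    return z, lag, local_i


def quadrant(z_i, lag_i):
    """Type of association, matching the quadrants of the Moran scatter plot."""
    if z_i >= 0 and lag_i >= 0:
        return "high-high"
    if z_i < 0 and lag_i < 0:
        return "low-low"
    return "high-low" if z_i >= 0 else "low-high"


def lisa_cluster_map(values, neighbors, permutations=999, cutoff=0.05, seed=1):
    """Label each location by quadrant if its pseudo p-value meets the cutoff."""
    rng = random.Random(seed)
    z, lag, _ = local_moran(values, neighbors)
    labels = []
    for i, js in enumerate(neighbors):
        others = z[:i] + z[i + 1:]        # z[i] stays fixed at location i
        observed = z[i] * lag[i]
        extreme = 0
        for _ in range(permutations):
            simulated = z[i] * sum(rng.sample(others, len(js))) / len(js)
            if observed >= 0 and simulated >= observed:
                extreme += 1
            elif observed < 0 and simulated <= observed:
                extreme += 1
        p_value = (extreme + 1) / (permutations + 1)
        labels.append(quadrant(z[i], lag[i]) if p_value <= cutoff else "not significant")
    return labels


chain = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
prices = [1, 2, 3, 4, 5, 6]
z, lag, local_i = local_moran(prices, chain)
labels = lisa_cluster_map(prices, chain, permutations=99)
```

With row-standardized weights the Local Moran statistics average to the global Moran's *I*, which is the sense in which the local statistic decomposes the global one. Changing `permutations` or `cutoff` mimics the sensitivity analysis described above.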

The functionality for spatial autocorrelation analysis is rounded out by a range of operations to construct spatial weights, using either boundary files (contiguity based) or point locations (distance based). A connectivity histogram helps in identifying potential problems with the neighbor structure, such as “islands” (locations without neighbors).
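Distance-based weights and the connectivity histogram can be illustrated with point coordinates alone. The sketch below is a simplified stand-in, with hypothetical function names, for the weights construction described above: it builds neighbor lists from a distance threshold or from the *k* nearest points, tabulates the number of neighbors per location, and reports islands.

```python
from collections import Counter
from math import dist


def distance_band_neighbors(points, threshold):
    """Neighbors are all other points within the given distance."""
    n = len(points)
    return [
        [j for j in range(n) if j != i and dist(points[i], points[j]) <= threshold]
        for i in range(n)
    ]


def k_nearest_neighbors(points, k):
    """Neighbors are the k closest other points (not necessarily symmetric)."""
    result = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        result.append(others[:k])
    return result


def connectivity_histogram(neighbors):
    """Map number of neighbors -> number of locations with that many neighbors."""
    return Counter(len(js) for js in neighbors)


def islands(neighbors):
    """Locations without any neighbors."""
    return [i for i, js in enumerate(neighbors) if not js]


points = [(0, 0), (1, 0), (2, 0), (10, 10)]
band = distance_band_neighbors(points, threshold=1.5)
knn = k_nearest_neighbors(points, k=1)
histogram = connectivity_histogram(band)
```

The remote fourth point is an island under the distance band but still receives a neighbor under the *k*-nearest criterion. The latter relation is not symmetric: the remote point lists the third point as its neighbor, but not the other way around.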

We illustrate spatial autocorrelation analysis with a study of the spatial distribution of 692 house sales prices for 1997 in Seattle, WA. This is part of a broader investigation into the effect of subsidized housing on the real estate market.^{14} For the purposes of this example, we only focus on the univariate spatial distribution, and the location of any significant clusters or spatial outliers in the data.

The original house sales data are for point locations, which, for the purposes of this analysis are converted to Thiessen polygons. This allows a definition of “neighbor” based on common boundaries between the Thiessen polygons. On the left-hand panel of Fig. 4, two LISA cluster maps are shown, depicting the locations of significant Local Moran's *I* statistics, classified by type of spatial association. The dark red and dark blue locations are indications of spatial *clusters* (respectively, high surrounded by high, and low surrounded by low).^{15} In contrast, the light red and light blue are indications of *spatial outliers* (respectively, high surrounded by low, and low surrounded by high). The bottom map uses the default significance of *P*=0.05, whereas the top map is based on *P*=0.01 (after carrying out 9999 permutations). The matching *significance map* is in the top right-hand panel of Fig. 4. Significance is indicated by darker shades of green, with the darkest corresponding to *P*=0.0001. Note how the tighter significance criterion eliminates some (but not that many) locations from the map. In the bottom right-hand panel of the figure, the corresponding Moran scatter plot is shown, with the most extreme “high–high” locations selected. These are shown as cross-hatched polygons in the maps, and almost all obtain highly significant (at *P*=0.0001) local Moran's *I* statistics.

The overall pattern depicts a cluster of high priced houses on the East side, with a cluster of low priced houses following an axis through the center. Put in context, this is not surprising as the East side represents houses with a lake view, while the center cluster follows a highway axis and generally corresponds with a lower income neighborhood. Interestingly, the pattern is not uniform, and several spatial outliers can be distinguished. Further investigation of these patterns would require a full hedonic regression analysis.

### Spatial regression

As of version 0.9.5-i, *GeoDa* also includes a limited degree of spatial regression functionality. Basic diagnostics for spatial autocorrelation, heteroskedasticity, and nonnormality are implemented for standard ordinary least-squares (OLS) regression. Estimation of the spatial lag and spatial error models is supported by means of the maximum likelihood (ML) method (see Anselin and Bera 1998, for a review of the technical issues). In addition to the estimation itself, predicted values and residuals are calculated and made available for mapping.

The ML estimation in *GeoDa* distinguishes itself by the use of extremely efficient algorithms that allow the estimation of models for very large data sets. The standard eigenvalue simplification is used (Ord 1975) for data sets up to 1000 observations. Beyond that, the sparse algorithm of Smirnov and Anselin (2001) is used, which exploits the characteristic polynomial associated with the spatial weights matrix. This algorithm allows estimation of very large data sets in reasonable time. In addition, *GeoDa* implements the recent algorithm of Smirnov (2003) to compute the asymptotic variance matrix for all the model coefficients (i.e., including both the spatial and nonspatial coefficients). This involves the inversion of a matrix of the dimensions of the data sets. To date, *GeoDa* is the only software that provides such estimates for large data sets.
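For reference, the role of the eigenvalue simplification can be written out for the spatial lag model. The notation is an assumption, since the article introduces none: $y$ is the dependent variable, $X$ the matrix of explanatory variables, $W$ the $n \times n$ spatial weights matrix, $\rho$ the spatial autoregressive coefficient, and $\omega_i$ the eigenvalues of $W$. For the model $y = \rho W y + X\beta + \varepsilon$ with normal errors, the log-likelihood is

$$
\ln L = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^{2} + \ln\lvert I - \rho W\rvert - \frac{1}{2\sigma^{2}}\,(y - \rho W y - X\beta)'(y - \rho W y - X\beta)
$$

and the Jacobian term, which must be re-evaluated for every trial value of $\rho$, simplifies to

$$
\ln\lvert I - \rho W\rvert = \sum_{i=1}^{n}\ln(1 - \rho\,\omega_i).
$$

The eigenvalues are computed once, after which each evaluation of the Jacobian is a sum of $n$ terms. The cost and numerical stability of computing all eigenvalues is what restricts this approach to moderately sized data sets and motivates the sparse alternatives for larger ones.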

All estimation methods employ sparse spatial weights, but they are currently constrained to weights that are intrinsically symmetric (e.g., excluding *k*-nearest neighbor weights). The regression routines have been successfully applied to real data sets of more than 300,000 observations (with estimation and inference completed in a few minutes). By comparison, a spatial regression for the 3000+ U.S. counties takes a few seconds.

We illustrate the spatial regression capabilities with a partial replication and extension of the homicide model used in Baller et al. (2001) and Messner and Anselin (2004). These studies assessed the extent to which a classic regression specification, well-known in the criminology literature, is robust to the explicit consideration of spatial effects. The model relates county homicide rates to a number of socioeconomic explanatory variables. In the original study, a full ML analysis of all the U.S. continental counties was precluded by the constraints on the eigenvalue-based *SpaceStat* routines. Instead, attention focused on two subsets of the data containing 1412 counties in the U.S. South and 1673 counties in the non-South.

In Fig. 5, we show the result of the ML estimation of a spatial error model of county homicide rates for the complete set of 3085 continental U.S. counties in 1980. The explanatory variables are the same as before: a Southern dummy variable, a resource deprivation index, a population structure indicator, unemployment rate, divorce rate, and median age.^{16}

The results confirm a strong positive and significant spatial autoregressive coefficient. Relative to the OLS results (e.g., Messner and Anselin 2004, Table 7.1, p. 137), the coefficient for unemployment has become insignificant, illustrating the misleading effect spatial error autocorrelation may have on inference using OLS estimates. The model diagnostics also suggest a continued presence of problems with heteroskedasticity. However, *GeoDa* currently does not include functionality to deal with this.

### Future directions

*GeoDa* is a work in progress and still under active development. This development proceeds along three fronts. First and foremost is an effort to make the code cross-platform and open source. This requires considerable change in the graphical interface, moving from the MFC that are standard in the various MS Windows flavors, to a cross-platform alternative. The current efforts use wxWidgets,^{17} which operates on the same code base with a native GUI flavor in Windows, Mac OS X, and Linux/Unix. Making the code open source is currently precluded by the reliance on proprietary code in ESRI's MapObjects. Moreover, this involves more than simply making the source code available: it entails considerable reorganization and streamlining of code (refactoring), to make it possible for the community to effectively participate in the development process.

A second strand of development concerns the spatial regression functionality. While currently still fairly rudimentary, the inclusion of estimators other than ML and the extension to models for spatial panel data are in progress. Finally, the functionality for ESDA itself is being extended to data models other than the discrete locations in the “lattice” case. Specifically, exploratory variography is being added, as well as the exploration of patterns in flow data.

Given its initial rate of adoption, there is a strong indication that *GeoDa* is indeed providing the “introduction to spatial data analysis” that makes it possible for growing numbers of social scientists to be exposed to an explicit spatial perspective. Future development of the software should enhance this capability and it is hoped that the move to an open source environment will involve an international community of like-minded developers in this venture.