A small sample illustration
If the data-generating process is the SAR model, then $y = \rho W y + X\beta + \varepsilon$, and least-squares estimates for β are biased and inconsistent (see LeSage and Pace 2004). In these cases, we would expect that least-squares MC3 procedures would not produce accurate estimates and inferences regarding which variables are important. If the data-generating process is the SEM model, then $y = X\beta + u$, where the disturbance vector $u$ follows a spatial autoregressive process. Least-squares estimates are unbiased but inefficient (analogous to the case of serial correlation in the disturbances), and so we might expect better results for least-squares MC3 procedures.
To illustrate differences between least-squares and spatial results, we used a 49-location data set from Anselin (1988) containing Columbus, Ohio neighborhoods, along with observations on the median housing values (hvalue) for each neighborhood and household income (hincome). These variable values were used as the basis for a data-generated experiment in which the true parameter values as well as the true model specification are known. The spatial weight matrix was based on the four nearest neighbors to each of the spatial observations, and the matrix was row-normalized to have row sums of unity. Two additional explanatory variables were constructed using spatial lags of the house values and household income. These variables represent an average of housing values and household income from the nearest four neighborhoods, constructed through multiplication of these variable vectors by the row-normalized spatial weight matrix W. Intuitively, housing values and household income levels in nearby neighborhoods might contribute to explaining variation in the y variable, which was neighborhood crime rates in the Anselin (1988) application. This produces a model shown in the following equation:
$$y = \alpha\iota_n + \rho W y + \beta_1\,\mathrm{hvalue} + \beta_2\,\mathrm{hincome} + \beta_3\,W\,\mathrm{hvalue} + \beta_4\,W\,\mathrm{hincome} + \varepsilon$$
The intercept term and spatial lag are included in all models, and so the number of possible candidate variables is 4, leading to $2^4 = 16$ possible models. This makes it simple to validate our MC3 algorithms by comparison with exact results based on posterior model probabilities for the set of 16 models. The explanatory variables were put in studentized form to accommodate the Zellner g-prior used by the MC3 procedures, which relies on a prior mean of zero for the β coefficients.
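The construction of the weight matrix, the two spatially lagged regressors, and the studentized design matrix can be sketched as follows. This is a minimal illustration in Python with NumPy, not the code used for the reported results: the coordinates and variable values are random placeholders rather than the Columbus data, and the function names `knn_weight_matrix` and `studentize` are hypothetical.

```python
import numpy as np

def knn_weight_matrix(coords, k=4):
    """Row-normalized k-nearest-neighbor spatial weight matrix."""
    n = coords.shape[0]
    # Pairwise Euclidean distances between locations.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)          # a location is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    W[rows, nearest.ravel()] = 1.0 / k      # equal weights, so each row sums to one
    return W

def studentize(x):
    """Center each column and scale it to unit sample standard deviation."""
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

rng = np.random.default_rng(0)
coords = rng.uniform(size=(49, 2))          # placeholder coordinates
hvalue = rng.uniform(20, 90, size=49)       # placeholder house values
hincome = rng.uniform(5, 30, size=49)       # placeholder household incomes

W = knn_weight_matrix(coords, k=4)
# Candidate variables: the two originals and their spatial lags.
X = np.column_stack([hvalue, hincome, W @ hvalue, W @ hincome])
Xs = studentize(X)
```

Because every row of W holds four weights of 1/4, the products `W @ hvalue` and `W @ hincome` are simple averages over the four nearest neighborhoods.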
The true values for α and βi, i=1, …, 4 were fixed in advance, with household income and house values being the two variables used to generate the sample y vector. A standard normal random deviate was used for ɛ in the generating process. The value of ρ was set to 0.6, indicating moderate spatial dependence.
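A generating process of this kind can be sketched using the reduced form of the SAR model, $y = (I_n - \rho W)^{-1}(\alpha\iota_n + X\beta + \varepsilon)$. In the sketch below, only ρ = 0.6, n = 49, four neighbors per location, and the standard normal disturbance come from the text. The ring-shaped weight matrix, the random regressors, and the values of `alpha` and `beta` are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 49, 0.6

# Placeholder weight matrix: locations on a ring, two neighbors on each
# side, each weighted 1/4 so that rows sum to one.
W = np.zeros((n, n))
for i in range(n):
    for step in (-2, -1, 1, 2):
        W[i, (i + step) % n] = 0.25

X = rng.standard_normal((n, 4))             # placeholder regressors
alpha = 1.0                                 # hypothetical intercept
beta = np.array([1.0, 1.0, 0.0, 0.0])       # hypothetical coefficients
eps = rng.standard_normal(n)                # standard normal disturbance

# Solve (I - rho W) y = alpha + X beta + eps instead of inverting explicitly.
A = np.eye(n) - rho * W
y = np.linalg.solve(A, alpha + X @ beta + eps)

# Least squares that ignores the spatial lag of y.
Z = np.column_stack([np.ones(n), X])
b_ols, *_ = np.linalg.lstsq(Z, y, rcond=None)
```

Comparing `b_ols` with `alpha` and `beta` over repeated draws of `eps` gives a rough sense of how far least squares departs from the generating values when the spatial lag is omitted.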
Least-squares estimates as well as maximum likelihood SAR estimates are shown in Table 1. The bias associated with the least-squares estimates appears to be substantial because the two sets of estimates differ widely in magnitude, and the parameter ρ is large and significantly different from zero. The greatest disagreement in the two sets of estimates is with respect to the two spatially lagged explanatory variables, which might be a focus of model comparison and inference. Intuitively, it might be of interest whether housing values and household income levels in nearby neighborhoods contribute to explaining variation in say neighborhood crime rates. We note that the sign of the spatially lagged house value variable is different in the least-squares and SAR regression, and the significance of the spatial lag of household income is different.
Table 1. OLS and SAR Model Estimates
|ρ|| || || || ||0.637||5.86||0.000|
As the number of possible models here is $2^4 = 16$, it would have been possible to simply calculate the log-marginal posterior for these 16 models to find posterior model probabilities. Instead, we applied our MC3 algorithm to the least-squares and SAR models. A run of 10,000 draws was sufficient to uncover all 16 unique models, requiring 15 and 27 s, respectively, for the OLS and SAR MC3 procedures.2
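For the least-squares case, the exact calculation over all 16 models can be sketched as below. The sketch assumes one common parametrization of the Zellner g-prior, $\beta \mid \sigma^2 \sim N(0, g\sigma^2 (X'X)^{-1})$ with diffuse priors on the intercept and $\sigma^2$, for which the Bayes factor against the intercept-only model is $(1+g)^{(n-1-k)/2}[1 + g(1-R^2)]^{-(n-1)/2}$. The parametrization used by our procedures may differ, and the data, the choice $g = n$, and the function names are hypothetical. The SAR case would additionally require integrating over ρ and is not covered here.

```python
import itertools
import numpy as np

def log_marginal(y, X, cols, g):
    """Log Bayes factor of the model with columns `cols` versus the
    intercept-only model, under the assumed g-prior form."""
    n, k = y.size, len(cols)
    if k == 0:
        return 0.0
    yc = y - y.mean()
    Xc = X[:, cols] - X[:, cols].mean(axis=0)
    coef, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    resid = yc - Xc @ coef
    r2 = 1.0 - (resid @ resid) / (yc @ yc)
    return 0.5 * (n - 1 - k) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * (1.0 - r2))

def exact_model_probs(y, X, g):
    """Posterior probabilities for every subset of columns, equal prior odds."""
    p = X.shape[1]
    models = [cols for k in range(p + 1)
              for cols in itertools.combinations(range(p), k)]
    logm = np.array([log_marginal(y, X, list(c), g) for c in models])
    w = np.exp(logm - logm.max())            # subtract the max for stability
    return models, w / w.sum()

rng = np.random.default_rng(2)
n = 49
X = rng.standard_normal((n, 4))              # placeholder candidate variables
y = 1.0 + X[:, 0] + X[:, 1] + rng.standard_normal(n)   # hypothetical true model
models, probs = exact_model_probs(y, X, g=float(n))
best = models[int(np.argmax(probs))]
```

Frequencies of model visits from an MC3 run can then be compared against `probs` as a check on the sampler.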
Information regarding models that exhibited posterior model probabilities >1% is given in Table 2 for the OLS MC3 procedure and Table 3 for the SAR MC3 procedure, with the posterior model probabilities shown in the last row of the two tables. These tables use “1” and “0” indicators for the presence or absence of variables in each of the models presented in the tables.
Table 2. OLS BMA Model Selection Information
Table 3. SAR BMA Model Selection Information
From the tables, we see that in the case of OLS, the true model was not among the five models with posterior model probabilities >1%. (It was assigned a model probability of 0.1%.) In contrast, the SAR MC3 procedure identified the true model and assigned it a posterior model probability >50%. This result should not be surprising given that the least-squares estimator is biased and inconsistent in the presence of spatial dependence. The SAR procedure resulted in the two variables, income and housing values used to generate the sample y vector, appearing in the four highest probability models that together account for nearly 94% of the posterior probability mass. In contrast, two of the top five least-squares models did not contain both of these variables, and these two models accounted for nearly 20% of the posterior probability mass.
We conclude this discussion by noting that the small sample example based on only 4 candidate variables suggests that the use of least-squares-based MC3 procedures in the presence of spatial dependence will exert an impact on the model selection inferences. Presumably, this impact is an adverse one as least-squares estimates are either biased and inconsistent for the case of the spatial lag model (SAR), or inefficient in the case of the SEM.
A large sample illustration
An example that models variation in state-to-state migration patterns using origin to destination flows provides a large sample of 2401 observations. Starting with an n by n square matrix of interregional flows from each of the n=49 origin states (including the District of Columbia), to each of the n destination states, we can produce an $n^2$ by 1 vector of these flows by stacking the columns of the flow matrix into a variable vector that we designate as y. Without loss of generality, we assume that each of the n columns of the flow matrix represents a different origin and the n rows reflect destinations. The first n elements in the stacked vector y reflect flows from origin 1 to all n destinations. The last n rows of this vector represent flows from origin n to destinations 1 to n. Typically, the diagonal elements of a flow matrix containing flows within a region, for example, from origin 1 to destination 1, origin 2 to destination 2, and so on, will be large relative to the off-diagonal elements representing interregional flows. We set these elements to values of zero to focus our efforts on explaining variation in flows between states. In addition, the vector y was constructed to represent the difference between (logged) state-to-state migration flows during the 1995–2000 period and the flows during the 1985–1990 period. This transformation converts the dependent variable into growth rates of flows over the 10-year period from 1990 to 2000, which produced a relatively normally distributed vector of $49^2 = 2{,}401$ observations.
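The stacking convention described above corresponds to column-major vectorization of the flow matrix. The following sketch uses small random matrices in place of the migration data. The function name is hypothetical, and zeroing the diagonal after taking the log difference is one reading of the procedure, adopted here as an assumption.

```python
import numpy as np

def flow_growth_vector(flows_early, flows_late):
    """Stack log-differenced flow matrices column by column.

    Rows index destinations and columns index origins, so the first n
    entries of the result are flows from origin 1 to all n destinations.
    """
    growth = np.log(flows_late) - np.log(flows_early)
    np.fill_diagonal(growth, 0.0)            # discard within-state flows
    return growth.flatten(order="F")         # column-major stacking, length n^2

rng = np.random.default_rng(3)
early = rng.uniform(1, 100, size=(3, 3))     # placeholder flows, first period
late = rng.uniform(1, 100, size=(3, 3))      # placeholder flows, second period
y = flow_growth_vector(early, late)
```

With n = 3 the result has nine elements, and element 2 of `y` (index 1) is the growth in flows from origin 1 to destination 2.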
A conventional gravity or spatial interaction model based on a nonspatial regression might be used to explain variation in the vector y of origin–destination flows (see Fischer, Scherngell, and Jansenberger 2006; Sen and Smith 1995; Tiefelsdorf 2003). Here, we contrast the nonspatial regression to a spatial regression model of the type suggested by LeSage and Pace (2005). These regression models rely on an n by k matrix of explanatory variables that we label xd, containing k characteristics for each of the n destinations. Given the format of our vector y, where observations 1 to n reflect flows from origin 1 to all n destinations, this matrix would be repeated n times to produce an $n^2$ by k matrix of destination characteristics that we represent as Xd for use in the regression. A second matrix containing origin characteristics that we label Xo would be constructed for use in the gravity model. This matrix would repeat the characteristics of the first origin n times to form the first n rows of Xo, the characteristics of the second origin n times for the next n rows of Xo, and so on, resulting in an $n^2$ by k matrix of origin characteristics. Typically, the distance from each origin to destination is also included as an explanatory variable vector in the gravity model, and perhaps nonlinear terms such as distance-squared. We let D represent an $n^2$ by 1 vector of these distances from each origin to each destination formed by stacking the columns of the origin–destination distance matrix into a variable vector.3
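The two repetition patterns can be written compactly with Kronecker products involving a vector of ones. The sketch below assumes that the same n by k matrix of regional characteristics serves for both origins and destinations. The function name `gravity_design` and the example values are hypothetical.

```python
import numpy as np

def gravity_design(x, dist):
    """Destination and origin regressors for column-stacked flows."""
    n = x.shape[0]
    ones = np.ones((n, 1))
    Xd = np.kron(ones, x)            # the whole n-by-k block repeated n times
    Xo = np.kron(x, ones)            # each origin's row repeated n times
    D = dist.flatten(order="F")      # stack columns of the distance matrix
    return Xd, Xo, D

x = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])          # 3 regions, 2 characteristics
dist = np.array([[0.0, 5.0, 9.0],
                 [5.0, 0.0, 4.0],
                 [9.0, 4.0, 0.0]])   # placeholder distances
Xd, Xo, D = gravity_design(x, dist)
```

Rows 1 to n of `Xo` all equal the characteristics of origin 1, while rows 1 to n of `Xd` run through destinations 1 to n, matching the ordering of y.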
In contrast to the traditional nonspatial gravity model, LeSage and Pace (2005) note that a spatial econometric model of the variation in origin–destination flows would be characterized by reliance on spatial lags of either (1) the dependent variable vector, resulting in a SAR model, or (2) the disturbance terms, producing an SEM. Spatial weight matrices represent a convenient and parsimonious way to define the spatial dependence or connectivity relations among observations.
For our applied illustration we use ideas from LeSage and Pace (2005) and rely on an $n^2$ by $n^2$ spatial weight matrix, denoted here by $\widetilde{W}$, constructed from W, where W is an n by n spatial weight matrix based on contiguity of the n regions. The $n^2$ by $n^2$ matrix is normalized to have row sums of unity and captures spatial dependence among the origin–destination flows by creating a spatial lag vector that averages over neighbors to both origin and destination regions. This leads to a SAR model:
$$y = \rho \widetilde{W} y + \alpha\iota_{n^2} + X_d\beta_d + X_o\beta_o + \gamma_1 D + \gamma_2 D^2 + \varepsilon \qquad (16)$$
In (16), the explanatory variable matrices Xd, Xo represent $n^2$ by k matrices containing destination and origin characteristics, respectively, and the associated k by 1 parameter vectors are βd and βo. The vectors D and D2 denote the vectorized origin–destination distance matrix and its square with γ1, γ2 scalar parameters. As is conventional in SAR models, we assume $\varepsilon \sim N(0, \sigma^2 I_{n^2})$.
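One weight matrix that fits this description is the Kronecker product $W \otimes W$ from the LeSage and Pace framework. We use it in the sketch below strictly as an assumption about the form of the $n^2$ by $n^2$ matrix. The four-region contiguity pattern and the function name `row_normalize` are hypothetical.

```python
import numpy as np

def row_normalize(A):
    """Scale each row to sum to one; rows of zeros are left as zeros."""
    s = A.sum(axis=1, keepdims=True)
    return np.divide(A, s, out=np.zeros_like(A, dtype=float), where=s != 0)

# Hypothetical contiguity pattern: four regions arranged on a line.
C = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = row_normalize(C)

Ww = np.kron(W, W)               # assumed form of the n^2 by n^2 weight matrix
y = np.arange(16, dtype=float)   # placeholder stacked flow vector
lag = Ww @ y                     # spatial lag of the stacked flows

# Equivalent matrix form: lag = vec(W Y W'), with Y the n-by-n flow matrix.
Y = y.reshape(4, 4, order="F")
lag_matrix_form = (W @ Y @ W.T).flatten(order="F")
```

The Kronecker product of two row-stochastic matrices is itself row-stochastic, so `Ww` needs no further normalization when W is already row-normalized. The matrix form avoids building the $n^2$ by $n^2$ matrix at all, which matters as n grows.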
As candidate explanatory variables, a series of 19 destination and origin specific variables plus distance and distance-squared were used, for a total of 40 explanatory variables. These variables were transformed using logs, so all coefficients should be interpretable as reflecting the elasticity response of the growth rate in state-to-state migration over the period 1990–2000 to changes in each variable. This scaling should help accommodate the prior mean of zero employed in the Zellner g-prior. The distance and distance-squared vectors were also logged. A description of the candidate explanatory variables along with the variable names used to report estimation results is in Table 4.
Table 4. Socioeconomic Demographic Variables
|Area||The origin–destination state area in square miles|
|Males||The number of males in 1990|
|Females||The number of females in 1990|
|Pcincome||Per capita income in 1990|
|Young||The # of persons aged 22–29 in 1990|
|Near retirement||The # of persons aged 60–64 in 1990|
|Retired||The # of persons aged 65–74 in 1990|
|Born in state||The # of persons born in the state in 1990|
|Foreign born||The # of foreign born persons in the state in 1990|
|Recent immigrants||The # of recent foreign immigrants, during the years 1980–1990|
|College grads||The # of persons over age 25 with college degrees in 1990|
|Grad/Profession||The # of persons over age 25 with graduate/professional degrees in 1990|
|House value||The median house value in 1990|
|Travel time||Mean travel time to work in 1990|
|Unemployment||The unemployment rate in 1990|
|Labor force||The # of persons over 16 years employed in 1990|
|Median rent||The median rent in 1990|
|Retirement income||Median retirement income in 1990|
|Self-employed||The # of persons self-employed in 1990|
|Distance||The origin–destination state distance|
|Distance2||The distance variable squared|
Runs involving 250,000 draws were used to test for convergence in both OLS and SAR MC3 results. Two sets of MCMC draws were carried out using randomly selected starting variables in the explanatory variables matrix, resulting in model averaged estimates that were identical to three decimal places for all parameters and in most cases identical to four decimal places.
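The convergence check can be illustrated with a stripped-down model sampler for the least-squares case. This is a simplified sketch, not our MC3 implementation: it proposes flipping one randomly chosen variable per draw (add or delete only), assumes the g-prior Bayes factor $(1+g)^{(n-1-k)/2}[1 + g(1-R^2)]^{-(n-1)/2}$ against the intercept-only model, and uses simulated data. All function names and parameter values are hypothetical.

```python
import numpy as np

def log_bf(y, X, incl, g):
    """Log Bayes factor versus the intercept-only model (assumed g-prior form)."""
    n, k = y.size, int(incl.sum())
    if k == 0:
        return 0.0
    yc = y - y.mean()
    Xc = X[:, incl] - X[:, incl].mean(axis=0)
    coef, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    resid = yc - Xc @ coef
    r2 = 1.0 - (resid @ resid) / (yc @ yc)
    return 0.5 * (n - 1 - k) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * (1.0 - r2))

def mc3(y, X, g, draws, rng):
    """Metropolis sampler over models; returns visit counts per model."""
    p = X.shape[1]
    incl = rng.random(p) < 0.5               # randomly selected starting variables
    cur = log_bf(y, X, incl, g)
    visits = {}
    for _ in range(draws):
        prop = incl.copy()
        j = rng.integers(p)
        prop[j] = not prop[j]                # add or delete variable j
        new = log_bf(y, X, prop, g)
        if np.log(rng.random()) < new - cur:  # symmetric proposal, equal model priors
            incl, cur = prop, new
        key = tuple(incl.tolist())
        visits[key] = visits.get(key, 0) + 1
    return visits

def inclusion_freq(visits):
    """Share of draws in which each variable was included."""
    keys = np.array(list(visits.keys()), dtype=float)
    counts = np.array(list(visits.values()), dtype=float)
    return counts @ keys / counts.sum()

rng = np.random.default_rng(4)
n, p = 200, 6
X = rng.standard_normal((n, p))
y = 1.0 + 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.standard_normal(n)  # hypothetical truth
run1 = mc3(y, X, g=float(n), draws=5000, rng=np.random.default_rng(10))
run2 = mc3(y, X, g=float(n), draws=5000, rng=np.random.default_rng(11))
f1, f2 = inclusion_freq(run1), inclusion_freq(run2)
```

Agreement between `f1` and `f2` across independently started runs plays the same role as the agreement in model-averaged estimates reported above.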
For the case of least-squares MC3, the sampling run of 250,000 draws produced 49,246 unique models. The 10 models with the highest posterior model probabilities are shown in Table 5, which takes the same format as Table 2. Variable names are preceded with a symbol “D-” or “O-” to indicate destination and origin characteristics. The top 10 models accounted for only 44.30% of the probability mass, with 95 models having posterior model probabilities >0.1%, accounting for 83.02% of the probability mass, 473 models exhibiting model probabilities >0.01%, totaling 95.51% of the probability mass, 1539 models with probabilities >0.001%, totaling 98.98% of the probability mass, and 4016 models with probabilities >0.0001%, accounting for 99.83% of the probability mass.
Table 5. Variables Entering the Top 10 Least-Squares Models
|D-Born in state||1||1||1||1||1||1||1||1||1||1|
|O-Born in state||0||0||0||0||0||0||0||0||0||0|
To illustrate convergence of the MC3 sampling process, we note that results from a second run of 250,000 draws based on a random selection of starting variables produced only 47,699 unique models, but the top 10 model probabilities were identical to those reported in Table 5. In addition, there were 96 models having posterior model probabilities >0.1%, accounting for 83.16% of the probability mass, 473 models exhibiting model probabilities >0.01%, accounting for 95.50% of the probability mass, 1550 models with probability >0.001%, accounting for 98.97% of the probability mass, and 4039 models with probability >0.0001%, accounting for 99.83% of the probability mass. This suggests that the BMA procedure is finding regions of the large model space with posterior support and ignoring regions with low support.
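Summaries of this kind (the number of models above a probability threshold and the mass they carry) follow directly from the vector of posterior model probabilities. A small sketch with made-up probabilities is given below. The function name `mass_above` is hypothetical, and thresholds are expressed as fractions rather than percentages.

```python
import numpy as np

def mass_above(probs, thresholds):
    """For each threshold, count models with larger probability and sum their mass."""
    probs = np.asarray(probs, dtype=float)
    return [(t, int((probs > t).sum()), float(probs[probs > t].sum()))
            for t in thresholds]

# Made-up posterior model probabilities summing to one.
probs = [0.5, 0.3, 0.15, 0.04, 0.009, 0.001]
summary = mass_above(probs, thresholds=[0.1, 0.01, 0.001])
```

Here three models exceed 0.1 and carry 95% of the mass, four exceed 0.01 and carry 99%, and five exceed 0.001.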
Table 5 shows the variables appearing in the 10 highest posterior probability models. Variables that appear in each model are designated with a “1” and those that do not appear with a “0.” We find that 13 of the 38 origin–destination variables appear in all of the 10 highest probability models, and these variables are associated with posterior probabilities of inclusion >65% (probabilities of inclusion are shown in Table 7).4 One variable “D-Near retirement” appeared in nine of the 10 highest probability models. After this, we see a decline in variables appearing in only six of the 10 models reported in the table, as well as a decline in the probability of inclusion to below 50%.
Table 7. Posterior Probabilities for Variables Entering the Model
|D-Born in state||0.9406||0.0231||0.9175|
|O-Born in state||0.1198||0.0221||0.0977|
A focus of inference for gravity models would be the relative importance of origin versus destination characteristics in explaining variation in state-to-state population migration. Table 5 shows that five of the 13 variables that appear in all of the 10 highest probability models are destination characteristics as well as the one variable that appeared in nine of the 10 top models. This leaves eight of the 13 variables appearing in all 10 top models as origin characteristics, suggesting a slight edge for the case of “origin push” as opposed to “destination pull.” It will be of interest to contrast this result with those from the SAR.
For the SAR MC3 procedure, 250,000 draws produced only 5220 unique models, a substantially lower number of models than in the case of least squares. The top 10 models are reported in Table 6, accounting for 76.40% of the total probability mass. Also, in contrast with the least-squares results, we found that 37 models with posterior probabilities >0.1% accounted for 96.11% of the probability mass and 120 with probabilities >0.01% accounted for 99.01% of the total probability mass. In summary, the set of high posterior probability models resulting from the spatial model was much more compact than that from least squares. As in the case of least squares, a second run of 250,000 draws produced nearly identical results.
Table 6. Variables Entering the Top 10 Spatial Autoregressive Models
|D-Born in state||0||0||1||0||0||1||0||1||0||0|
|O-Born in state||0||0||0||0||0||0||0||0||0||0|
The SAR model MC3 results are presented in Table 6 in the same format as the least-squares results. In contrast with the least-squares results, only six variables enter all 10 top models, and these are all origin characteristics. The probability for variable inclusion (shown in Table 7) for these six variables was 0.87 or higher. One destination characteristic entered 8 of the 10 top models, “D-Grad/Profession,” but exhibited a probability of inclusion <50%. The conclusion we would draw here regarding the importance of origin versus destination characteristics is quite different from that reported earlier for least squares. It is interesting that the six origin characteristics that appear in all 10 top SAR models also appear in all 10 top least-squares models. The spatial autoregressive results appear to exclude a great number of variables from appearing in the model relative to least squares. There is a theoretical motivation for this type of result. Ignoring the intercept term, we note that least-squares estimates are likely to be biased upward in the face of positive spatial dependence (ρ>0) when the matrix X contains logs of positive values, as shown in the following equation, where $\widetilde{W}$ denotes the row-normalized $n^2$ by $n^2$ spatial weight matrix of the SAR model:
$$E(\hat{\beta}_{OLS}) = (X'X)^{-1}X'(I_{n^2} - \rho\widetilde{W})^{-1}X\beta = \beta + (X'X)^{-1}X'(\rho\widetilde{W} + \rho^2\widetilde{W}^2 + \cdots)X\beta$$
Posterior probabilities for each of the 38 origin–destination and distance variables entering the model are shown in Table 7 for both the least-squares and SAR MC3 procedures. The last column in the table shows the difference between the OLS and SAR model results. We see 6 of 40 cases where the differences are >0.50, and another 8 of 40 cases where these differences are between 0.29 and 0.48, pointing to a number of cases where the inclusion probabilities from the least-squares procedure are higher than the spatial model. In contrast, we see seven negative differences, with one equal to −0.4189, and the remainder <−0.10, reflecting a small number of cases with higher variable inclusion rates for the SAR model. The average number of variables appearing in the top 10 models was 17.8 for least squares and 9.7 for the SAR model.
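The posterior probability of inclusion for a variable is the sum of the posterior probabilities of all models that contain it. A minimal sketch with made-up indicator matrices and model probabilities (not the values in Table 7) is shown below. The function name `inclusion_probabilities` is hypothetical.

```python
import numpy as np

def inclusion_probabilities(indicators, probs):
    """Sum model probabilities over the models that include each variable.

    indicators: (models x variables) array of 0/1 entries.
    probs: posterior model probabilities, renormalized to sum to one.
    """
    indicators = np.asarray(indicators, dtype=float)
    probs = np.asarray(probs, dtype=float)
    return (probs / probs.sum()) @ indicators

# Made-up example: three models over three variables for each procedure.
ols_incl = inclusion_probabilities([[1, 1, 0], [1, 0, 0], [1, 0, 1]], [0.5, 0.3, 0.2])
sar_incl = inclusion_probabilities([[1, 0, 0], [1, 0, 1], [0, 0, 1]], [0.6, 0.3, 0.1])
difference = ols_incl - sar_incl   # positive when least squares includes more often
```

In this toy case the second variable has an inclusion probability of 0.5 under the first set of models and 0 under the second, giving a difference of 0.5.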
Estimates for the variables would allow an examination of the elasticity response and direction of impact on population migration associated with the variables that exhibit high probabilities of inclusion. Table 8 reports model-averaged SAR estimates based on averaging over the 120 models with posterior probabilities >0.01%, and OLS estimates based on averaging over 473 models exhibiting model probabilities >0.01%.5 Bayesian MCMC estimation of the OLS and SAR models, using Zellner's g-prior, diffuse intercept and noise variance priors, and (for the SAR model) the Beta(1.01, 1.01) prior for ρ, produced 2000 retained draws. These draws were weighted by the posterior model probabilities and used to construct a posterior mean as well as 5% and 95% highest posterior density intervals (HPDI) that are reported in Table 8.
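The weighting of retained draws by posterior model probabilities can be sketched as follows. The sketch pools the draws from each model, gives every draw a weight equal to its model's probability divided by the number of draws from that model, and reads the 5% and 95% bounds off the weighted empirical distribution. These are weighted quantiles, used here as a simple stand-in for the HPDI calculation, which is an assumption on our part. The draws, model probabilities, and function names are made up for illustration.

```python
import numpy as np

def weighted_quantile(values, weights, q):
    """Quantile of a weighted sample, read off the weighted empirical CDF."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cdf = np.cumsum(w) / w.sum()
    idx = min(int(np.searchsorted(cdf, q)), v.size - 1)
    return v[idx]

def model_average(draws_by_model, probs):
    """draws_by_model: list of (ndraws, p) arrays, zeros where a variable is excluded."""
    probs = np.asarray(probs, dtype=float) / np.sum(probs)
    pooled = np.vstack(draws_by_model)
    weights = np.concatenate([np.full(d.shape[0], w / d.shape[0])
                              for d, w in zip(draws_by_model, probs)])
    mean = weights @ pooled
    cols = range(pooled.shape[1])
    lo = np.array([weighted_quantile(pooled[:, j], weights, 0.05) for j in cols])
    hi = np.array([weighted_quantile(pooled[:, j], weights, 0.95) for j in cols])
    return mean, lo, hi

rng = np.random.default_rng(5)
draws1 = rng.normal(1.0, 0.1, size=(2000, 2))     # model 1 includes both variables
draws2 = np.column_stack([rng.normal(1.0, 0.1, size=2000),
                          np.zeros(2000)])        # model 2 excludes the second
mean, lo, hi = model_average([draws1, draws2], probs=[0.75, 0.25])
```

A variable excluded from a model contributes draws of exactly zero for that model, so its model-averaged mean shrinks toward zero in proportion to the probability mass of the models that omit it.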
Table 8. Model Averaged Estimates
|D-Born in state||−0.0095||−0.0075||−0.0054||−0.3748||−0.3611||−0.3475|
|O-Born in state||0.0000||0.0000||0.0000||0.0000||0.0000||0.0000|
For interpretation purposes, the mean growth rate in interstate migration over the 10-year period was 0.0465, or less than one-half percent per year, with the 5th percentile being −0.4731, or around negative 5% per year, and the 95th percentile being 0.5868, or about 6% per year. An estimate for β of 0.10 suggests that a 10% change in the explanatory variable would give rise to a 1% change in the growth rate over the 10-year period, and we focus on coefficient estimates that are >0.10 in absolute value when analyzing the model-averaged estimates.
For the case of the SAR model estimates, there were five origin characteristics that exerted impacts greater than 0.10 (in absolute value) on migration flows: O-Unemployment, O-Retired, O-Near retirement, O-House value, and O-Recent immigrants. Of these, the unemployment rate, retired persons, and house value were positive, while persons near retirement and recent immigrants were negative. The origin unemployment rate had the largest impact on out-migration flows, with the estimate suggesting that a state with a 1% higher unemployment rate would exhibit a 2% higher growth rate in out-migration over the 10-year period. Retired persons and those aged near retirement had the second largest impact, both exhibiting elasticities around unity, but with opposite signs. Intuitively, these signs seem correct. Higher house values in the origin state lead to out-migration flows and a higher number of recent immigrants from abroad leads to lower out-migration, both of which seem intuitively plausible. There were only three destination characteristics that exerted an impact >0.10 on migration growth rates, D-Unemployment, D-Grad/Profession, and D-Travel time, and all were <0.5639 in magnitude. The travel time and unemployment rates had a positive impact on in-migration flows to the destination, which seems implausible, whereas persons with graduate/professional degrees had a negative impact, a plausible result.
Taken together, the SAR model-averaged estimates suggest that origin characteristics are relatively more important than destination characteristics, leading to a “push” interpretation of migration flows.
The OLS model-averaged estimates were different in that they pointed to nine destination characteristics having coefficients >0.10 (in absolute value) and seven origin characteristics meeting this criterion. As in the case of the SAR model-averaged estimates, O-Unemployment had the largest impact, with a coefficient of 2.4341. As indicated earlier, this is likely to exhibit an upward bias relative to the SAR model estimate. Similar upward biases can be found in OLS estimates for O-Near retirement, O-Retired, and O-House value. The second largest impact was D-Females, which, taken together with the large negative impact for D-Males, suggests a destination-state population size effect, as males plus females equal population. Interestingly, in the SAR model that contains a spatial lag variable, this population size effect is not noticeable.
In total, the OLS model-averaged estimates suggest a different inference regarding the relative importance of origin versus destination characteristics on migration flows than that described above for the SAR model. We note that the model-averaged posterior mean estimate for the parameter ρ was 0.6077, with a 5% and 95% HPDI of 0.5900 and 0.6252, respectively, suggesting a bias in the least-squares results.