Our paper “Inequality of Opportunity in Brazil” (Review of Income and Wealth, 53(4), 585–618, 2007) contains a non-trivial error. In that paper, we proposed a measure of inequality of opportunity as the share of earnings (w) inequality explained by predetermined, morally irrelevant circumstances (C). The main results of the paper were obtained from the OLS estimation of a reduced-form model given by:

$$\ln w_i = C_i \psi + \varepsilon_i \qquad (10)$$
We denoted a counterfactual earnings distribution where all differences in circumstances were eliminated as $\tilde{\Phi}(\tilde{w})$, with $\tilde{w}_i = \exp(\bar{C}\hat{\psi} + \hat{\varepsilon}_i)$. If the actual earnings distribution is given by $\Phi(w)$, we proposed to measure inequality of opportunity in that distribution by the ratio $[I(\Phi) - I(\tilde{\Phi})] / I(\Phi)$, where $I$ denotes some well-behaved inequality measure, such as the Theil index. This is an indirect approach: $I(\tilde{\Phi})$ captures the inequality that remains when all inequality of opportunity (i.e., between people with different circumstances) is eliminated. So $I(\Phi) - I(\tilde{\Phi})$, or the ratio of that difference to the total, are measures of inequality of opportunity.
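The indirect approach can be illustrated with a minimal sketch. It assumes the Theil T variant of the index, and a counterfactual in which every individual is assigned the sample-mean circumstances while keeping his or her own OLS residual; the function names `theil_t` and `opportunity_share` are hypothetical and are not taken from our original code.

```python
import numpy as np

def theil_t(w):
    """Theil T index of a strictly positive earnings vector (assumed variant)."""
    w = np.asarray(w, dtype=float)
    r = w / w.mean()
    return float(np.mean(r * np.log(r)))

def opportunity_share(log_w, C):
    """Share of earnings inequality attributed to circumstances (indirect approach)."""
    log_w = np.asarray(log_w, dtype=float)
    X = np.column_stack([np.ones(len(log_w)), C])      # constant + circumstances
    coef, *_ = np.linalg.lstsq(X, log_w, rcond=None)   # OLS on the reduced form
    resid = log_w - X @ coef
    # Counterfactual earnings: mean circumstances for everyone, own residual kept
    w_tilde = np.exp(X.mean(axis=0) @ coef + resid)
    total = theil_t(np.exp(log_w))
    counterfactual = theil_t(w_tilde)
    return (total - counterfactual) / total
```

Because the counterfactual removes only the variation in predicted log earnings, the share is zero when circumstances have no explanatory power and rises as they explain more of the variance.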
Because equation (10) was the reduced form of a model containing effort as well as circumstance variables, this measure of inequality of opportunity should reflect both the direct effects of circumstances on earnings, and the indirect effects operating through efforts (E). To distinguish between those two categories of effects, we also estimated:

$$\ln w_i = C_i \alpha + E_i \beta + u_i \qquad (5')$$
We recognized that the existence of omitted circumstance variables would bias the OLS estimates of ψ, and that omitted circumstance and effort variables would bias the estimates of α and β. We argued that suitable instruments were not available and proposed instead to investigate the likely magnitude of potential biases, by estimating upper and lower bounds both for the true coefficients and for the measures of inequality of opportunity, which were the main object of interest.
Focusing on equation (10) in the original paper, if the error term ε is not orthogonal to C (but the two are jointly normally distributed), then the estimated vector of coefficients is biased, and the bias can be written as:
where ΣX denotes the theoretical variance–covariance matrix of a random vector X, σx denotes the standard deviation of a variable x, and ρxy denotes the theoretical correlation coefficient between two variables x and y or the vector of correlation coefficients between a vector x and a variable y. Because these theoretical population parameters are unknown, our proposed solution was to evaluate the approximate size of the bias by:
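Under the definitions just given, the population bias and its sample analogue can be sketched as follows. This is a reconstruction from standard omitted-variable algebra under joint normality, not a transcription of the original display: the elementwise product $\circ$ and the hats marking sample moments are assumed notation and may differ from that of the original paper.

```latex
% Population bias: covariance of C and epsilon, rescaled by the variance of C
B = E(\hat{\psi}) - \psi
  = \Sigma_C^{-1}\,\mathrm{Cov}(C, \varepsilon)
  = \sigma_\varepsilon\, \Sigma_C^{-1} \left( \sigma_C \circ \rho_{C\varepsilon} \right)

% Sample-based approximation: population moments replaced by sample moments,
% with the unknown correlation vector \rho_{C\varepsilon} supplied by assumption
\hat{B} = \hat{\sigma}_\varepsilon\, \hat{\Sigma}_C^{-1} \left( \hat{\sigma}_C \circ \rho_{C\varepsilon} \right)
```

Here $\sigma_C$ is the vector of standard deviations of the circumstance variables, so that the $k$-th element of $\sigma_\varepsilon (\sigma_C \circ \rho_{C\varepsilon})$ is the covariance between the $k$-th circumstance and the error term.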
To compute this sample-based approximation, we calculated: , where is the variance of the OLS residual of the regressions above and . denotes drawings from a uniform distribution defined on (−1, 1), with any values such that K ≥ 1 being rejected. Finally, we also imposed a set of additional constraints on the signs of coefficient estimates (empirically backed by the literature). Please see the original paper for a more detailed description of the method. Using this approach, we reported bounds around both the regression coefficients and the measures of inequality of opportunity which (we hoped) were sufficiently narrow as to be informative.
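The drawing procedure can be sketched in a few lines. The exact definitions of K and of the adjustment to the residual variance are not reproduced in the text above, so the sketch assumes $K = \rho' R_C^{-1} \rho$ (with $R_C$ the sample correlation matrix of the circumstances) and $\sigma_\varepsilon^2 = \hat{\sigma}^2 / (1 - K)$, which is what joint normality implies; both are labelled as assumptions in the code. The function name `simulate_bias_range` is hypothetical, and the additional sign constraints on coefficients are omitted.

```python
import numpy as np

def simulate_bias_range(C, resid, bound=1.0, n_draws=5000, seed=0):
    """Draw correlation vectors and return the (min, max) implied bias per coefficient.

    C     : (n, k) array of circumstance variables, no constant column, k >= 2
    resid : (n,) array of OLS residuals from regressing log earnings on C
    bound : correlations are drawn from uniform(-bound, bound)
    """
    rng = np.random.default_rng(seed)
    n, k = C.shape
    Sigma = np.cov(C, rowvar=False)        # sample variance-covariance matrix of C
    sd_C = np.sqrt(np.diag(Sigma))         # sample standard deviations of C
    R = Sigma / np.outer(sd_C, sd_C)       # sample correlation matrix of C
    s2 = resid @ resid / (n - k - 1)       # OLS residual variance

    kept = []
    for _ in range(n_draws):
        rho = rng.uniform(-bound, bound, size=k)
        K = rho @ np.linalg.solve(R, rho)  # ASSUMPTION: definition of K
        if K >= 1.0:                       # inadmissible draw, rejected
            continue
        sigma_eps = np.sqrt(s2 / (1.0 - K))  # ASSUMPTION: implied error s.d.
        kept.append(sigma_eps * np.linalg.solve(Sigma, sd_C * rho))
    kept = np.asarray(kept)
    return kept.min(axis=0), kept.max(axis=0)

# Synthetic illustration only; these are not the data used in the paper.
rng = np.random.default_rng(1)
C = rng.normal(size=(2000, 3))
log_w = C @ np.array([0.3, 0.2, 0.1]) + rng.normal(scale=0.8, size=2000)
X = np.column_stack([np.ones(len(C)), C])
coef, *_ = np.linalg.lstsq(X, log_w, rcond=None)
resid = log_w - X @ coef
for b in (0.05, 0.1, 0.15, 0.2):
    lo, hi = simulate_bias_range(C, resid, bound=b)
    print(b, np.round(lo, 3), np.round(hi, 3))
```

Subtracting the minimum and maximum bias from the OLS estimate gives an interval for the true coefficient under the assumed support; the interval widens as `bound` grows.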
Unfortunately, our calculation of the range of possible values for the biases in both equations (10) and (5′) contained a mistake. When empirically estimating the standard deviation of the regression residual, a typo in our Stata code led us to use the standard error of the linear prediction (command option “stdp”) instead of the standard error of the residual (command option “stdr”). This programming error led us to underestimate that value by a factor ranging from 37 to 92 (depending on the cohort considered).
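The size of the discrepancy is easier to see in a small numerical sketch. It assumes the definitions we understand Stata's `predict` to use after `regress`: stdp is $s\sqrt{h_j}$ and stdr is $s\sqrt{1 - h_j}$, where $s$ is the root mean squared error and $h_j$ the leverage of observation $j$. The function name `ols_stdp_stdr` is hypothetical and the data are synthetic.

```python
import numpy as np

def ols_stdp_stdr(X, y):
    """Return (stdp, stdr, rmse) for an OLS fit of y on X; X must include the constant."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    rmse = np.sqrt(resid @ resid / (n - k))        # s, root mean squared error
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)    # leverage: diagonal of the hat matrix
    stdp = rmse * np.sqrt(h)                       # s.e. of the linear prediction
    stdr = rmse * np.sqrt(1.0 - h)                 # s.e. of the residual
    return stdp, stdr, rmse

# Synthetic illustration only
rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 0.3, 0.2, 0.1]) + rng.normal(scale=0.8, size=n)
stdp, stdr, rmse = ols_stdp_stdr(X, y)
print(stdr.mean() / stdp.mean())   # ratio grows roughly with sqrt(n / k)
```

Since average leverage equals the number of regressors divided by the sample size, stdp shrinks towards zero in large samples while stdr stays close to the residual standard deviation, which is consistent with an understatement of the order reported above.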
When the error is corrected and the biases are recomputed, the bounds around the OLS estimates of the regression coefficients become much wider. The small set of conditions we had previously imposed on coefficients now proves insufficient to obtain informative bounds. An alternative approach, which illustrates how the “confidence intervals” widen as we move away from OLS assumptions, is to draw the correlation coefficients for all circumstance variables from uniform distributions defined sequentially on broader supports: (−0.05, 0.05), (−0.1, 0.1), (−0.15, 0.15), and (−0.2, 0.2). Note, however, that these supports are all much narrower than the widest possible range used earlier: (−1, 1). Results from this approach are presented in Table 1 for selected regression coefficients (those on mean parental schooling), and in Table 2 for our measure of counterfactual inequality when all inequality due to circumstances is eliminated.
Table 1. Bounds on the coefficient on mean parental schooling, by assumed correlation range and cohort

| Mean parental schooling (years) | b1936_40 | b1941_45 | b1946_50 | b1951_55 | b1956_60 | b1961_65 | b1966_70 |
|---|---|---|---|---|---|---|---|
| Upper bound estimates | | | | | | | |
| −0.2 ≤ rho(Xi, u) ≤ 0.2 | 0.265 | 0.195 | 0.198 | 0.185 | 0.163 | 0.149 | 0.136 |
| −0.15 ≤ rho(Xi, u) ≤ 0.15 | 0.242 | 0.186 | 0.188 | 0.170 | 0.151 | 0.140 | 0.126 |
| −0.1 ≤ rho(Xi, u) ≤ 0.1 | 0.218 | 0.174 | 0.174 | 0.154 | 0.137 | 0.127 | 0.113 |
| −0.05 ≤ rho(Xi, u) ≤ 0.05 | 0.195 | 0.157 | 0.157 | 0.138 | 0.123 | 0.114 | 0.102 |
| Lower bound estimates | | | | | | | |
| −0.05 ≤ rho(Xi, u) ≤ 0.05 | 0.135 | 0.109 | 0.108 | 0.093 | 0.082 | 0.077 | 0.067 |
| −0.1 ≤ rho(Xi, u) ≤ 0.1 | 0.097 | 0.077 | 0.075 | 0.062 | 0.054 | 0.052 | 0.043 |
| −0.15 ≤ rho(Xi, u) ≤ 0.15 | 0.057 | 0.043 | 0.039 | 0.028 | 0.025 | 0.025 | 0.018 |
| −0.2 ≤ rho(Xi, u) ≤ 0.2 | 0.012 | 0.005 | −0.001 | −0.009 | −0.009 | −0.006 | −0.011 |
Table 2. Counterfactual inequality when circumstances are equalized, by assumed correlation range and cohort

| | b1936_40 | b1941_45 | b1946_50 | b1951_55 | b1956_60 | b1961_65 | b1966_70 |
|---|---|---|---|---|---|---|---|
| Total observed inequality | 0.873 | 0.997 | 0.759 | 0.655 | 0.706 | 0.580 | 0.566 |
| Counterfactual inequality when circumstances are equalized | | | | | | | |
| Upper bound estimates | | | | | | | |
| −0.2 ≤ rho(Xi, u) ≤ 0.2 | 0.754 | 0.778 | 0.710 | 0.602 | 0.658 | 0.592 | 0.592 |
| −0.15 ≤ rho(Xi, u) ≤ 0.15 | 0.717 | 0.734 | 0.672 | 0.561 | 0.619 | 0.442 | 0.553 |
| −0.1 ≤ rho(Xi, u) ≤ 0.1 | 0.688 | 0.698 | 0.645 | 0.537 | 0.595 | 0.421 | 0.526 |
| −0.05 ≤ rho(Xi, u) ≤ 0.05 | 0.667 | 0.673 | 0.627 | 0.525 | 0.575 | 0.410 | 0.507 |
| Lower bound estimates | | | | | | | |
| −0.05 ≤ rho(Xi, u) ≤ 0.05 | 0.647 | 0.641 | 0.606 | 0.518 | 0.558 | 0.403 | 0.489 |
| −0.1 ≤ rho(Xi, u) ≤ 0.1 | 0.638 | 0.633 | 0.602 | 0.521 | 0.558 | 0.405 | 0.486 |
| −0.15 ≤ rho(Xi, u) ≤ 0.15 | 0.638 | 0.629 | 0.601 | 0.526 | 0.561 | 0.412 | 0.485 |
| −0.2 ≤ rho(Xi, u) ≤ 0.2 | 0.645 | 0.629 | 0.620 | 0.533 | 0.567 | 0.428 | 0.487 |
Two implications arise from this exercise. First, once our coding error is corrected, the bounds approach employed in our original paper no longer appears useful for identifying a narrow range of possible values for the biases plaguing OLS regression coefficients. There is no rationale for restricting the possible correlation between explanatory variables and a regression residual ex-ante to a narrow interval such as (−0.2, 0.2). The true value of ρCε ∈ (−1,1) is, of course, unknown. When we allow for the full possible range of values for that correlation coefficient, our use of sample moments to calculate approximate bounds on the value of the bias of OLS coefficients turns out to yield intervals that are too large to be of any practical use.
Second, the effect of correcting the error on the bounds around the estimates of counterfactual inequality, and in particular on the lower-bound estimate, is much less pronounced. In fact, as shown in Table 2, the lower bound on the Theil index of inequality when differences in circumstances are eliminated is quite robust to changes in the assumed correlation coefficients between circumstance variables and the regression residual.