### Abstract

This paper formulates a likelihood-based estimator for a double-index, semiparametric binary response equation. A novel feature of this estimator is that it is based on density estimation under local smoothing. While the proofs differ from those based on alternative density estimators, the finite sample performance of the estimator is significantly improved. As binary responses often appear as endogenous regressors in continuous outcome equations, we also develop an optimal instrumental variables estimator in this context. For this purpose, we specialize the double-index model for binary response to one with heteroscedasticity that depends on an index different from that underlying the ‘mean response’. We show that such (multiplicative) heteroscedasticity, whose form is not parametrically specified, effectively induces exclusion restrictions on the outcomes equation. The estimator developed exploits such identifying information. We provide simulation evidence on the favorable performance of the estimators and illustrate their use through an empirical application on the determinants, and effect, of attendance at a government-financed school. Copyright © 2009 John Wiley & Sons, Ltd.

### 1. INTRODUCTION

The last 30 years have witnessed the introduction of several estimators for the semiparametric binary response model under minimal distributional assumptions on the disturbance terms (see, for example, Manski, 1975, 1985; Horowitz, 1992; Powell *et al.*, 1989; Ichimura, 1993; Klein and Spady, 1993). Much of the focus on relaxing distributional assumptions in the binary response model was motivated by the fact that maximum likelihood estimation of discrete choice models would generally lead to inconsistent estimates if the underlying distribution was incorrectly chosen.

Beyond the ‘shape’ of the error distribution, the model may also be misspecified in the manner in which the error depends on the explanatory variables. For example, if the error exhibits multiplicative heteroscedasticity that is not a function of the ‘mean’ response, then only the above-mentioned estimators of Manski and Horowitz are consistent. However, these estimators will not recover binary response probabilities or marginal effects. By estimating binary quantile models, Kordas (2006) obtains interval estimates of the probabilities under general conditions. Khan obtains marginal effects for a more general model than that considered here, but the estimator may be subject to the ‘curse of dimensionality’ when the model contains many explanatory variables. One of the main objectives of the present paper is to obtain probabilities and associated marginal effects that are reasonably estimated when the dimension of the explanatory variables may be large. Accordingly, we model a binary response probability as depending on two indices, where the distribution of the error may depend on the explanatory variables through one or both of the indices. This specification allows for, but is not restricted to, multiplicative heteroscedasticity that depends on one or both indices.

To estimate the binary response model described above, we extend the estimator in Klein and Spady (1993). The estimator in Klein and Spady depends on a single-index assumption, which in the present context would imply that it can handle heteroscedasticity only if the ‘error’ distribution depends on the same index that determines the ‘mean response’. Here we allow a double-index formulation in which the index underlying the ‘mean response’ may differ from that upon which heteroscedasticity depends. Such an index formulation is particularly important in view of a result due to Chen and Khan (2003). They consider a binary response model where the heteroscedasticity depends on an unknown function of the explanatory variables and does not have an index structure. In this case, they show there does not exist a $\sqrt{N}$-consistent estimator for the model's parameters. Here, we will obtain a $\sqrt{N}$-consistent estimator under an index specification. As an extension of Klein and Spady (1993), we conjecture that when the error in the binary response model is independent of the explanatory variables, the resulting estimator is efficient in a general class of models that satisfy a double-index restriction.1

It should be emphasized that the estimator developed here depends on density estimators obtained under estimated local smoothing, where underlying density estimators are based on windows that vary for each observation in the sample. This is analogous to characterizing a distribution with a histogram in which the bin interval is allowed to vary depending on whether one is in the tails of the unknown density (where observations are sparse) or in regions where the true density is ‘high’. With such local smoothing, the proofs for the asymptotic properties of the estimator formulated here substantially differ from those in the literature that employ bias-reducing kernels. We pursue this strategy first because density estimators under local smoothing have mean-squared-error optimal properties (Abramson, 1982). Second, and most importantly, in the present context we have found that the finite sample performance of the estimator for the binary response model is much improved under local smoothing in contrast to bias-reducing kernels. We also found further improvements in the finite sample performance of the estimators by employing dependent kernels that depend on an estimated sample covariance matrix as advocated by Fukunaga (1972). Accordingly, all proofs in this paper are for estimation under local smoothing and dependent bivariate kernels.

In adopting the above smoothing strategy, we have found it necessary to employ a property of the derivative of the semiparametric probability function due to Whitney Newey. Namely, when this derivative is taken with respect to index parameters and then evaluated at the true parameter values, it coincides with the corresponding parametric derivative minus its conditional expectation (conditioned on the indices). This ‘residual-type’ property of the derivative function is important below in controlling the bias in gradient terms in the asymptotic normality argument. As is typical for many semiparametric estimators, we will need to downweight (trim) observations where density denominators become ‘too small’. To exploit the residual property of the semiparametric derivative, we will employ a trimming strategy that depends on estimated indices as opposed to the explanatory variables.

The estimator developed here for the binary response model is also related to those of Ichimura and Lee (1991) and Lee (1995), who examine alternative multiple-index models. While the present paper makes use of several key identification results of the Ichimura and Lee paper, it differs from both in several important respects. First, and most important, we have formulated the estimator and all proofs for the case of estimated local smoothing rather than bias-reducing kernels. Second, we make use of identification results in Ichimura and Lee without imposing exclusion restrictions on the indices. We emphasize that we are not concerned here with recovering the original parameters in the binary response model (which even in the presence of exclusion restrictions are still only obtained up to location and scale). Rather, we are interested in estimating those identifiable functions of the parameters that suffice to identify the semiparametric probability function. It can be argued that with binary response models one is generally not concerned with the parameters themselves but rather with the response probability and marginal effects. Such marginal effects, which examine how the probability function changes as the explanatory variables change, are identified once the probability function is identified. Moreover, while the entire probability function converges pointwise and uniformly to the true function at a rate below the parametric rate of $\sqrt{N}$, averaged marginal effects converge at the parametric rate. The original parameter values of the model are not required for such identification. In part, for this reason we focus on identifying the probability function itself rather than index parameters.

While one of our primary objectives is to provide an estimator for this double-index binary choice model,2 we note that applied researchers have become increasingly interested in larger systems in which the choice appears in another equation as an endogenous regressor. This type of model, frequently referred to as an endogenous binary treatment model, is at best poorly identified without an exclusion restriction. The well-known problem here is that the treatment probability, which would serve as an instrument for estimating the continuous outcomes equation, is often approximately linear in its argument. In the absence of an exclusion restriction on the continuous outcome equation, the instrument is then very close to being linearly related to the same exogenous variables in the continuous equation of interest. To resolve this problem here, we consider the case of multiplicative heteroscedasticity in the binary response equation, which is some function of the explanatory variables *X*. Write this function as *S*(*X*). In the next section we show that such heteroscedasticity may be viewed as inducing exclusion restrictions on the continuous outcomes equation. With no parametric assumptions on *S*(*X*) (other than that it depends on one or two indices) and with no parametric assumptions on the distribution of the error term in the binary response model, below we will develop an estimator that exploits such identifying information. We will then show that such information is useful both in theory and in practice (as indicated in a series of Monte Carlo experiments and in an empirical application).

For continuous simultaneous equations models, other authors have exploited heteroscedasticity as an identification strategy. For example, in a semiparametric formulation, Klein and Vella exploit such information to identify and estimate triangular simultaneous equations models without exclusion restrictions. In parametric formulations, Rummery *et al.* (1999) and Rigobon (2003) also exploit heteroscedasticity as an identification strategy for simultaneous equations. From the structure of the problem considered here, there is information in higher-order powers of the *X*'s that could be exploited to construct instruments for the outcomes equation. Dagenais and Dagenais (1997) and Lewbel (1997) exploit such information in models with measurement error. In this paper, since the nature of the heteroscedastic function in the treatment equation is unknown, it is unclear which higher orders of the *X's* should be used as instruments. Consequently, we pursue an alternative strategy here that involves direct estimation of a double-index binary response model. One could attempt to bypass estimation of this equation and determine the appropriate higher orders of *X's* to use as instruments by extending Donald and Newey (2001) to the model considered here. However, as the treatment probability is itself of direct interest, we pursue an alternative strategy that employs the estimated treatment probability in estimating the continuous outcomes equation. In the present context, the conditional treatment probability is an optimal instrument (Amemiya, 1975).

The next section outlines the model and the estimation methods. In Section 3 we provide and discuss the assumptions required to establish asymptotic results. When estimating the treatment effect, we note that our procedure is of particular value when there are no exclusion restrictions which provide instruments. Accordingly, we focus on identification in the absence of conventional exclusion restrictions. In Section 4 we establish the asymptotic properties of the estimators for both the binary response and outcome models. In so doing, we sketch out the proofs, and provide complete technical details in the Appendix, which is available online from Wiley Interscience. The proof strategy differs from other arguments in the literature as it relies on estimated local smoothing. Section 5 provides simulation evidence. In Section 6 we provide an empirical application where an individual's total education level (the outcome) depends in part on whether or not the individual attended a state-financed high school in Australia (the treatment). Section 7 concludes.

### 2. MODEL AND MOTIVATION FOR ESTIMATORS

Consider the following model:

- (1)

- (2)

where *Y*_{1i} is the outcome variable and *Y*_{2i} is a dummy endogenous variable defined through the indicator function {·}; *X*_{i} is a vector of exogenous variables; β_{0}, π_{0} and θ_{0} are unknown true parameter values; and *u*_{i} and *v*_{i} are random disturbances. Although the treatment effect, θ_{0}, is taken here to be invariant across individuals, this assumption can be relaxed, as in the empirical application. The disturbances can be characterized as

- (3)

- (4)

where *S*(·) is an unknown (positive and non-constant) function; γ_{0} is an unknown parameter vector; and the unscaled disturbance in the binary response equation is a homoscedastic random disturbance which is independent of the elements of *X*_{i} but dependent on *u*_{i}. The model allows heteroscedasticity in each equation, though we only model it explicitly in index form for the binary response model. Note that there may or may not be known restrictions on the parameters in the above model. For example, suppose *X*≡[*X*_{[1]}, *X*_{[2]}], where *X*_{[2]} contains powers and cross-products of the ‘basis’ elements in *X*_{[1]}. Then, in some formulations it will be reasonable to restrict the elements of β_{0} and π_{0} so that the ‘mean effects’ only depend on *X*_{[1]}. In contrast, one may want to let heteroscedasticity, *S*, depend on the basis elements *X*_{[1]} and the higher-order terms *X*_{[2]}. Alternatively, we could interpret *X* itself as containing the ‘basis variables’ for the model and impose no exclusion restrictions on β_{0}, π_{0}, or γ_{0}. Because of the aspects of the above model in which we are interested, we permit and indeed focus on this second case of no exclusion restrictions. The estimator developed here is for a model more general than above, but we will specialize to the above case for expositional convenience.

For the model in (1)–(4), the treatment probability has the form

- (5)

where *P*(·) is the distribution function for the unscaled disturbance. We estimate this probability function in a double-index formulation based on local smoothing. The estimator will depend neither on the functional form for *S* nor on the distribution of the disturbances.
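
For orientation, one system consistent with the verbal definitions in this section can be written as follows. This is a sketch under stated assumptions rather than a transcription of displays (1)–(5): the sign convention inside the indicator, the starred symbols for the unscaled errors, and the name $S_u$ for the outcome-equation scaling are all introduced here for illustration.

```latex
% Assumed forms (illustrative only): sign convention, starred symbols and S_u
% are not taken from the source displays.
\begin{align*}
Y_{1i} &= X_i\beta_0 + \theta_0 Y_{2i} + u_i, & u_i &= S_u(X_i)\,u_i^{*},\\
Y_{2i} &= \mathbf{1}\{X_i\pi_0 - v_i > 0\},   & v_i &= S(X_i\gamma_0)\,v_i^{*}.
\end{align*}
% With S > 0 and v_i^* independent of X_i with distribution function P:
\begin{align*}
\Pr(Y_{2i}=1 \mid X_i)
  &= \Pr\bigl(S(X_i\gamma_0)\,v_i^{*} < X_i\pi_0 \mid X_i\bigr)\\
  &= \Pr\!\left(v_i^{*} < \frac{X_i\pi_0}{S(X_i\gamma_0)} \,\Big|\, X_i\right)
   = P(Z_i\pi_0), \qquad Z_i \equiv \frac{X_i}{S(X_i\gamma_0)}.
\end{align*}
```

Under these assumed forms the probability depends on $X_i$ only through the two indices $X_i\pi_0$ and $X_i\gamma_0$, which is the double-index structure. When $S$ is constant, $Z_i$ is proportional to $X_i$.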

We can also employ this probability function as an (optimal) instrument for estimating the continuous outcomes equation. Here we make several observations. First, if there is no heteroscedasticity in the above model, then effectively *Z* = *X*, in which case the model can be poorly identified because *P* is often approximately linear in its argument. When the argument of *P* is *X*π_{0} (i.e., *Z* = *X*), it is still possible to identify the model provided that *P* is not linear in *X*π_{0}. However, this form of non-linearity in the function *P* itself will typically occur in the tails of the distribution and thus relies on a small fraction of the sample for identification. In contrast, in the presence of heteroscedasticity, *Z* no longer coincides with *X* and indeed will typically be linearly independent of the columns of *X*. Consequently, the *Z* variables are effectively excluded from the continuous outcomes equation. Such induced exclusion restrictions serve to identify the model even in the region of the data for which *P* is linear in *Z*.
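
The near-linearity concern can be illustrated numerically. The sketch below is not the authors' design: it assumes a single regressor, a normal distribution function for *P*, a hypothetical coefficient `pi0 = 0.5`, and a hypothetical scaling function *S*(*x*) = exp(0.75*x*). It compares how well a straight line in *x* fits the treatment probability with and without heteroscedasticity.

```python
import math
import random


def normal_cdf(t):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def r_squared_on_line(x, p):
    """R^2 from regressing p on a constant and x (the squared correlation)."""
    n = len(x)
    mx, mp = sum(x) / n, sum(p) / n
    sxx = sum((a - mx) ** 2 for a in x)
    spp = sum((b - mp) ** 2 for b in p)
    sxp = sum((a - mx) * (b - mp) for a, b in zip(x, p))
    return sxp ** 2 / (sxx * spp)


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(5000)]
pi0 = 0.5  # hypothetical index coefficient

# Homoscedastic case: S is constant, so the instrument is P(x * pi0).
p_homo = [normal_cdf(pi0 * xi) for xi in x]

# Heteroscedastic case: hypothetical scaling S(x) = exp(0.75 * x).
p_hetero = [normal_cdf(pi0 * xi / math.exp(0.75 * xi)) for xi in x]

r2_homo = r_squared_on_line(x, p_homo)
r2_hetero = r_squared_on_line(x, p_hetero)
print(f"R^2 of P on x, homoscedastic:   {r2_homo:.3f}")
print(f"R^2 of P on x, heteroscedastic: {r2_hetero:.3f}")
```

A value of `r2_homo` close to one means the instrument adds little beyond *x* itself. The lower `r2_hetero` reflects variation in the probability that is not linear in *x*, which is the kind of variation the induced exclusion restrictions rely on.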

### 3. ASSUMPTIONS, IDENTIFICATION, AND DEFINITIONS

We now provide the assumptions and definitions that we employ to establish the asymptotic properties for the estimator.

- A1. **The data** The data (*Y*_{1i}, *Y*_{2i}, *X*_{i}), *i* = 1, …, *N*, are i.i.d. observations from the model in (1)–(4). With *X* as the *N* × *K* matrix of observations on the explanatory variables and with **1** as an *N* × 1 column vector of ones, the columns of [*X* **1**] are linearly independent with probability 1.

- A2. **Errors** The error in the continuous outcomes equation (1), *u*_{i}, is independent over *i* with *E*(*u*_{i}|*X*_{i}) = 0 and with […] uniformly bounded. The error in the binary response model (2) is scaled as described in Section 2, where the unscaled error is i.i.d., supported on the real line,3 and has finite variance. The scaling function *S*(·) is finite, bounded away from zero, and is not constant. The vector *X*_{i} is independent of the unscaled error.

- A3. **Parameter space** The vector of true parameter values for the model in (1)–(4) lies in the interior of a compact parameter space, Θ.

- A4. **Index assumptions** Assume that the vector of indices, *I*, depends on two distinct (functionally independent) continuous variables, *X*_{1} and *X*_{2}. With *X*_{3} containing all other explanatory variables, write […] and assume that the 2 × 2 submatrix Γ_{c} has rank 2.

- A5. **Reparameterized model** With η≡(η_{31}, η_{32}), define […]. Under this reparameterization, note that *P*(*Y* = 1|*I*) = *P*(*Y* = 1|*W*). Define *W** by replacing η above with η*. With *x* a realized value of *X*, write *w*≡*x*β(η) and *w**≡*x*β(η*). Assume […]. Following Ichimura and Lee (1991), let […]. Then, write: […]. Assume that there exists a set of positive probability on which the above equality may be differentiated with respect to the continuous elements of *x*_{3} with *t* held fixed. Further assume that condition (4) of Ichimura and Lee (1991, Lemma 3) holds.

- A6. **Densities** Assume that all observed continuous variables in the binary-response model have compact support. To provide required smoothness conditions, let *X*_{c}≡(*X*_{1}, *X*_{2}) be the vector of continuous variables in (A5). Then, with *f*(*x*_{c}|*X*_{3}, *Y*_{2}) as the indicated conditional density for *X*_{c}, denote […] as the *i*th and *j*th cross-partial with respect to the elements of *x*_{c}≡[*x*_{1}, *x*_{2}]. Then, with […], assume that *f*(*w*|·) has positive support on a compact set *A*, is bounded away from 0 on any compact subset of its support, and that on […] is bounded above by a positive finite constant for *i* + *j* ≤ 4.
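
To make the reparameterization concrete, the following sketch shows one index structure consistent with the rank condition in A4 and the definition of η in A5. The coefficient block $\Gamma_3$ and the explicit forms of $I$ and $W$ are assumptions introduced for illustration, not the displays from the assumptions themselves.

```latex
% Assumed index structure (illustrative); \Gamma_3 is a hypothetical name.
\begin{align*}
I   &= [X_1,\; X_2]\,\Gamma_c + X_3\,\Gamma_3, \qquad \operatorname{rank}(\Gamma_c) = 2,\\
W   &\equiv I\,\Gamma_c^{-1} = [X_1,\; X_2] + X_3\,\eta,
      \qquad \eta \equiv \Gamma_3\Gamma_c^{-1} = (\eta_{31},\, \eta_{32}),\\
W_1 &= X_1 + X_3\,\eta_{31}, \qquad W_2 = X_2 + X_3\,\eta_{32}.
\end{align*}
```

Because $\Gamma_c$ is invertible in this sketch, $I$ and $W$ are one-to-one transformations of each other, so conditioning on either gives the same probability. In this form $X_2$ is excluded from the first index and $X_1$ from the second without any exclusion restriction being imposed on the original indices.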

Assumptions A1–3 define the index model that we propose to estimate. An index formulation of low dimension is important for obtaining reasonable results in finite samples. Note that this index assumption permits a more general error structure than that shown in A2. Namely, we require that the binary response probability depend on two indices, but do not otherwise restrict the manner in which the probability depends on the indices. The particular double-index structure implicit in A2 provides a convenient motivating case.

With the possible exception of assumptions A4–5, the above assumptions are somewhat standard in index models. Assumptions A4–5 essentially provide identification conditions. To motivate these assumptions, note that the *W*-parameterization in A5 is equivalent to the *I*-parameterization in A4 as both yield the same conditional probability function in *x*. We employ the *W*-parameterization to allow for the possibility that there may not be exclusion restrictions in the original *I*-parameterization. In this lower-dimensional parameterization, we then seek to identify the (nuisance) parameters η. Before proceeding, we note that these parameters have no natural interpretation as they are linear functions of the model's original parameters. However, if these parameters are identified, we can easily recover the binary response probability function and identify the marginal effects which measure how the response probability changes in response to changes in *x*. Moreover, asymptotic properties for these estimated marginal effects will readily follow from those for the estimator of η. Finally, as elaborated below, the probability function is of interest in estimating a continuous outcomes equation that depends on the binary response variable.
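
Once a probability function is available, averaged marginal effects follow by differentiating it with respect to each element of *x* and averaging over the sample. The sketch below illustrates the calculation with central finite differences. The function `prob` is a hypothetical stand-in (a normal distribution function with an assumed exponential scaling and assumed values `ETA = (2.0, -1.0)`); in practice it would be replaced by the estimated semiparametric probability function.

```python
import math
import random


def normal_cdf(t):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


ETA = (2.0, -1.0)  # hypothetical nuisance parameters (coefficients on x3)


def prob(x):
    """Hypothetical double-index probability in x = (x1, x2, x3)."""
    w1 = x[0] + ETA[0] * x[2]  # first index
    w2 = x[1] + ETA[1] * x[2]  # second index
    return normal_cdf(w1 / math.exp(0.5 * w2))  # assumed scaling exp(0.5 * w2)


def average_marginal_effect(prob_fn, data, k, step=1e-4):
    """Sample average of the central-difference derivative of prob_fn in x[k]."""
    total = 0.0
    for x in data:
        up, down = list(x), list(x)
        up[k] += step
        down[k] -= step
        total += (prob_fn(up) - prob_fn(down)) / (2.0 * step)
    return total / len(data)


random.seed(1)
data = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(2000)]
ame = [average_marginal_effect(prob, data, k) for k in range(3)]
print("averaged marginal effects:", [round(v, 4) for v in ame])
```

Only the probability function enters the calculation, which is why identifying that function is enough for marginal effects even when the original index parameters are not recovered.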

Having reparameterized the model in A5 we then assume that the *W*-parameterization satisfies the identification conditions in Ichimura and Lee (1991).4 The condition on discrete variables is that given by Ichimura and Lee to identify their coefficients. Note that these identification conditions are based on the underlying assumption of a double-index model. In presenting simulation results, we will present results both for double- and single-index models. If a single-index model generates the data, it will not be possible to identify all of the parameters of a double-index specification. However, it is still possible to identify the probability function of interest. As the focus of this paper is on a double-index specification we defer further discussion of this issue to the simulation section.

Assumption A6 provides smoothness conditions. These conditions and densities satisfying them are discussed in Klein and Spady (1993, p. 393). It is possible to relax the compact support assumption at some technical expense in the proofs.5

In addition to the above assumptions, we also need a number of conditions or definitions that define the densities and probability functions of interest. Throughout, we employ kernel density estimators to estimate the semiparametric probability function entering a quasi likelihood. As is standard in this literature, such density estimators need to have an appropriately low order of bias. Here, we obtain bias reduction first by employing local smoothing as developed by Abramson (1982) and discussed in Silverman (1986). Such local smoothing requires that the windows in the final kernel density estimator vary by observation and depend on a pilot density estimator. Not surprisingly, these windows satisfy the intuitive requirement that they be smaller in the center of the distribution than in the tails. As a second source of bias reduction, we exploit a property of expected semiparametric probability derivatives. Namely, such derivatives have expected value zero when conditioned on the true indices. As will also be discussed below, to improve the finite sample performance of the estimators, we estimate the density for the vector of indices, *W*, using kernels that depend on the sample covariance matrix for *W*. Below, we will first define these estimators and then discuss their properties.

- D1. **Density estimators under local smoothing** Let *K* be a symmetric, smooth univariate kernel function satisfying condition C8 in Klein and Spady (1993, p. 394). The normal kernel, which is employed in the simulations and the empirical example, satisfies this condition. Let *T* be a matrix such that […], the inverse sample covariance matrix for *W* given that *Y*_{2} = *s*, *s* = 0, 1. Partitioning *T* = [*T*_{1} *T*_{2}]′ conformably with the *i*th observation on *W*: *W*_{i} = [*W*_{1i} *W*_{2i}]′, define […]. With *g*_{s}(*w*) as the joint density for *W*≡[*W*_{1}, *W*_{2}] conditioned on *Y*_{2} = *s*, *s* = 0, 1, and with *P*_{s} as the unconditional probability that *Y*_{2} = *s*, define an estimator for *f*_{s}(*w*)≡*P*_{s}*g*_{s}(*w*) as […]. For *w* = *W*_{i}, the above averages are taken over the *N* − 1 observations for which *j* ≠ *i*.

- D2. **Smooth trimming functions** Define a smooth trimming function as: […]

- D3. **Estimated local smoothing parameters** Referring to D1, denote *m̂*_{s} as the geometric mean of the *f̂*_{s}(*w*; *h*, λ)’s and let […]. Then, for *j* = 1, …, *N*, define estimated local smoothing parameters as […], where the value 0.01 is used here for the parameter *a* in D2.6

- D4. **Multi-stage local smoothing** Employing D3, the estimator for *f*_{s}(*w*) is defined under several stages of local smoothing as […], where **1** is a vector of ones. With *h*_{i} = *O*(*N*^{−*r*_{i}}), set *r*_{3} = 1/11 and 0 < δ < *r*_{3}/2. Then, set *r*_{2} = (*r*_{3} − δ/2)/2, and *r*_{1} = (*r*_{3} − δ)/4.7

- D5. **Semiparametric probability function** Define […], where *ĝ*(*w*)≡*f̂*_{1}(*w*) + *f̂*_{0}(*w*) estimates the unconditional density for *W*. To define the Δ adjustment factors, first define the smoothed indicator: […], where *a*_{1}≡ε′*r*_{3}/4, *a*_{2}≡ε′*r*_{3}/5, *ĉ*_{s} = *O*_{p}(1), and Δ_{N}≡Δ_{1N} + Δ_{0N}.

- D6. **Pilot estimator** Let x_{k} be the lower αth sample quantile for the continuous variable *X*_{k} (e.g. α = 0.01) and let *x̄*_{k} be the upper (1 − α)th sample quantile. For the *K*_{c} continuous variables, define the indicators: *t̂*_{ik}≡{*x*_{k} < *x*_{ik} < *x̄*_{k}}, *k* = 1, …, *K*_{c}. In the notation of D1, define a pilot probability estimator as […]. Then, with […], the pilot estimator for η_{0} is defined as […].

- D7. **Final estimator** With the pilot estimator for η_{0} defined in D6, let […] denote the vector of estimated indices. Denote […] as the lower βth sample quantile for the estimated indices and let […] be the corresponding upper (1 − β)th quantile. With τ as the trimming function in D2, the index trimming function is defined as […]. Then, with probabilities defined in D5, the final estimator for η_{0} is defined as […].

Before discussing the role of the above definitions, as an overview note that there are two general aspects that need to be addressed in estimating semiparametric models. First, it is necessary to control the bias in the underlying density estimators. As discussed below, here we control this bias by employing local smoothing and exploiting a ‘residual’ property of semiparametric probability functions. Second, it is necessary to downweight or trim those observations for which densities become too small. For reasons discussed below, we employ a trimming strategy outlined in D4–6 that is quite similar to that in Klein and Spady (1993).

In explaining why we have defined various estimators as above, turn first to D1. As discussed by Silverman (1986) and advocated by Fukunaga (1972), we have employed bivariate kernels based on a sample covariance matrix. We ‘match’ this feature of the data as follows. Following Fukunaga (1972) we specify a density estimate for the vector *W* by first constructing the standardized vector *W**≡*TW*. With the covariance matrix for *W** being the identity matrix, the density estimator for *W** is then somewhat naturally based on a product of independent kernels. The implied density estimator for *W* is then that given above. Fukunaga (1972) documents the performance of this estimator in a Monte Carlo study. Here, we have found that we obtain ‘better’ estimates of the parameters of interest when we select a density estimator in this manner.
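
The standardization step can be sketched in a few lines. The code below is an illustration under assumptions, not the estimator in D1: it uses a single fixed window `h` (no local smoothing), takes *T* as the inverse Cholesky factor of the sample covariance matrix (one of several matrices that standardize the data), and evaluates a product of normal kernels on the standardized data. The function names and the window choice are hypothetical.

```python
import math
import random


def normal_kernel(t):
    """Standard normal density, used as the univariate kernel."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def whitening_matrix(points):
    """Lower-triangular T with T * S * T' = I, where S is the sample covariance."""
    n = len(points)
    m1 = sum(p[0] for p in points) / n
    m2 = sum(p[1] for p in points) / n
    s11 = sum((p[0] - m1) ** 2 for p in points) / (n - 1)
    s22 = sum((p[1] - m2) ** 2 for p in points) / (n - 1)
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in points) / (n - 1)
    # Cholesky factor L of S, then T is the inverse of L.
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 * l21)
    return ((1.0 / l11, 0.0), (-l21 / (l11 * l22), 1.0 / l22))


def density(w, points, t_mat, h):
    """Density of W at w: product kernels applied to the standardized T(w - W_j)."""
    det_t = t_mat[0][0] * t_mat[1][1]  # Jacobian of the change of variables
    total = 0.0
    for p in points:
        d1, d2 = w[0] - p[0], w[1] - p[1]
        z1 = t_mat[0][0] * d1 + t_mat[0][1] * d2
        z2 = t_mat[1][0] * d1 + t_mat[1][1] * d2
        total += normal_kernel(z1 / h) * normal_kernel(z2 / h)
    return det_t * total / (len(points) * h * h)


random.seed(2)
rho = 0.8  # correlation of the two hypothetical indices
sample = []
for _ in range(400):
    a, b = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    sample.append((a, rho * a + math.sqrt(1.0 - rho ** 2) * b))

t_mat = whitening_matrix(sample)
h = len(sample) ** (-1.0 / 6.0)  # hypothetical window choice
f_origin = density((0.0, 0.0), sample, t_mat, h)
print(f"estimated density at the origin: {f_origin:.3f}")
```

Working in standardized coordinates lets a single window serve both directions even when the two indices are strongly correlated, which is the motivation for the covariance-based kernel.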

For known local smoothing parameters (bounded away from zero), Abramson showed that the locally smoothed density estimator is optimal in a mean-squared error sense. This estimator also has the desired bias-reducing properties. As the local smoothing parameters are not known, they must be estimated. In using the estimates, we are able to prove that the resulting density estimators have desired bias-reducing properties when estimated in several stages. Namely, first employ a regular kernel density estimator (λ = **1** in the above notation) to construct estimated local smoothing parameters. Second, obtain a density estimator using these estimated local smoothing parameters. Third, and finally, use this second-stage estimator to reconstruct estimated local smoothing parameters and obtain the final density estimator shown in D4. We have been able to show ‘essentially’ that the bias is reduced at each stage. At the third stage, the order of the bias is sufficiently small to obtain the asymptotic results below. For technical reasons, we smoothly trim in D2 so as to keep the local smoothing parameters above 1/ln(*N*).8
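
The three stages can be sketched for a univariate density as follows. This is a simplified illustration and not the estimator in D1–D4: it assumes data on a unit scale, windows of the form *N*^{−*r*} for each stage with the exponents given in D4, local smoothing factors equal to the inverse square root of the density relative to its geometric mean (the rule discussed in Silverman, 1986), and a hard floor at 1/ln(*N*) in place of the smooth trimming in D2. The function names and the value `delta = 0.02` are hypothetical.

```python
import math
import random


def normal_kernel(t):
    """Standard normal density, used as the kernel."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def loo_density(points, h, lam):
    """Leave-one-out density at each sample point; observation j has window h * lam[j]."""
    n = len(points)
    out = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            if j != i:
                b = h * lam[j]
                total += normal_kernel((points[i] - points[j]) / b) / b
        out.append(total / (n - 1))
    return out


def local_smoothing(dens, floor):
    """Factors (density / geometric mean)^(-1/2), kept above a hard floor."""
    logs = [math.log(max(d, 1e-300)) for d in dens]
    log_gm = sum(logs) / len(logs)
    return [max(math.exp(-0.5 * (v - log_gm)), floor) for v in logs]


def three_stage_density(points, delta=0.02):
    """Pilot estimate with lam = 1, then two rounds of re-estimated local smoothing."""
    n = len(points)
    r3 = 1.0 / 11.0
    r2 = (r3 - delta / 2.0) / 2.0
    r1 = (r3 - delta) / 4.0
    floor = 1.0 / math.log(n)
    first = loo_density(points, n ** (-r1), [1.0] * n)   # stage 1: regular kernel
    lam1 = local_smoothing(first, floor)
    second = loo_density(points, n ** (-r2), lam1)       # stage 2: local smoothing
    lam2 = local_smoothing(second, floor)
    final = loo_density(points, n ** (-r3), lam2)        # stage 3: final estimate
    return final, lam2


random.seed(3)
points = [random.gauss(0.0, 1.0) for _ in range(300)]
dens, lam = three_stage_density(points)
i_mid = min(range(len(points)), key=lambda i: abs(points[i]))
i_far = max(range(len(points)), key=lambda i: abs(points[i]))
print(f"factor near the center: {lam[i_mid]:.2f}, in the tail: {lam[i_far]:.2f}")
```

The printed factors show the intended behavior: windows are narrower where the density is high and wider in the tails, where observations are sparse.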

The proofs exploit a residual-like property of the derivative (with respect to the parameters) of the true semiparametric probability function, with this derivative having conditional expectation of zero when evaluated at the true parameter values. By using this property, we can further control for the bias in the gradient to the objective function, which is essential to establishing asymptotic normality. To this end, we first estimate the model under X-trimming. The resulting parameter estimates, which we do not require to be $\sqrt{N}$-convergent, are employed to obtain estimated indices or index densities. The model is then re-estimated with trimming based on estimated indices or their corresponding estimated densities. Such trimming affords ‘protection’ against small denominators when analyzing the gradient as it will be evaluated at the true parameter values. However, this type of trimming is problematic for analyzing the averaged log-likelihood and the Hessian matrix as we need to examine these components away from the truth. As in Klein and Spady (1993), we employ the Δ adjustment factors in D5 above for this purpose. These factors will vanish exponentially provided the density is not ‘too small’. In this manner, such factors will quickly vanish from the gradient where they are not needed, but will serve to control density denominators when analyzing likelihood and Hessian components.

### 5. SIMULATION EVIDENCE

To investigate the performance of the above estimator in a controlled setting, we conducted a Monte Carlo study.9 As the focus of this paper is on a double-index binary response equation, with heteroscedasticity providing the main motivation, one of the designs below is of this form. It is also of interest to examine the consequences of a double-index specification when the true binary response model is generated by a single index. Accordingly, we also present results for this case along with a related discussion of identification issues.

In formulating a design for the double-index case, note that the number of factors determining the nature of the simulation is very large, precluding an exhaustive examination of the estimator under all possible conditions. Accordingly, we adopt the following strategy. We consider the worst-case situation where we are unwilling to impose any restrictions on which variables enter the means or the variances. That is, the same variables affect the means and the variances. With all exogenous variables distributed as standard normal, the true model with heteroscedastic errors is given as

- (6)

- (7)

- (8)

- (9)

The unscaled errors in the two equations were generated as normal with expectation zero. Their variances were selected to ensure that the scaled errors, *v*_{i} and *u*_{i}, each had unconditional variance of one. Finally, the unscaled errors were generated so as to have correlation of approximately 0.25 with each other. For the case in which the binary response is generated by a single index, we set *S*_{v} to a constant such that *v* has the same unconditional variance in both designs.

Turning to the double-index data-generating process, we first examine our ability to recover the reduced form parameters in the binary choice model. Second, we examine the ability of the IV estimator to estimate the outcome equation parameters.

In the first experiment we conduct simulations with a sample size of 1000 and with 500 replications. Under the *W*-parameterization discussed earlier, *x*_{2} is excluded from the first index and *x*_{1} is excluded from the second index. The true values for the nuisance parameters (the coefficients on *x*_{3} in each index after reparameterization) are 2 and −1, respectively.10 In estimating these parameters we obtained starting values from a coarse grid search. The averages of the estimates for these two parameters are 2.031 and −1.037, with standard deviations of 0.469 and 0.508, respectively. Thus the estimates appear to be unbiased and reasonably precisely estimated. In addition to computing the double-index parameters we also estimated a probit model which does not account for the presence of heteroscedasticity.

We also compared probit, semiparametric, and true probability functions. As an overall summary comparison, we estimated the correlation between the true probability that *Y*_{2i} is equal to 1, given the *x*_{i} vector, and that from the double index and probit models. The correlation between the probit probability and the true probability over the 500 replications was 0.726 with a standard deviation of 0.018. In contrast, the correlation between the true probability and that from the estimated double-index model was 0.907 with a standard deviation of 0.010. In a more detailed comparison of probability functions, in Table I we report the predicted probabilities for each of five quantiles.11 These tables not only highlight the superior performance of the double-index model, relative to the probit model, but also suggest that the estimator is performing very well in estimating the predicted probability.

Table I. Probability quantiles

| True | Probit | Double index |
|---|---|---|
| (a) *N* = 1000 | | |
| 0.499 | 0.449 | 0.449 |
| 0.593 | 0.549 | 0.578 |
| 0.693 | 0.626 | 0.683 |
| 0.790 | 0.690 | 0.775 |
| 0.874 | 0.701 | 0.815 |
| (b) *N* = 2000 | | |
| 0.500 | 0.500 | 0.500 |
| 0.593 | 0.550 | 0.580 |
| 0.694 | 0.626 | 0.686 |
| 0.791 | 0.690 | 0.779 |
| 0.875 | 0.700 | 0.822 |

Using the first step estimates we now employ these implied probabilities as an instrument for *Y*_{2i} in estimating the second equation. In Table II we report the second-step IV and OLS estimates for the *Y*_{1} equation. We report the estimates for each of the second-step variables as each contributes differently in the heteroscedasticity index. When the semiparametric probability function is employed as an instrument, we refer to the resulting estimator as SPIV.
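The idea of instrumenting *Y*_{2} with its conditional probability can be illustrated with a minimal sketch. The design below is hypothetical and is not the paper's equations (6)–(9): the mean index, the scale function `exp(0.5 * x2)`, the homoscedastic outcome error, and the use of the true probability in place of an estimated one are all assumptions made for illustration.

```python
import math
import numpy as np

def normal_cdf(z):
    """Standard normal CDF, vectorised with math.erf (NumPy has no erf)."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

def simulate(n, rho=0.25, seed=0):
    """Hypothetical heteroscedastic design (an assumption, not the paper's)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, 3))
    index = x.sum(axis=1)               # assumed mean index
    scale = np.exp(0.5 * x[:, 1])       # assumed multiplicative scale function
    v_raw = rng.standard_normal(n)      # unscaled binary-response error
    e = rng.standard_normal(n)
    u = rho * v_raw + math.sqrt(1.0 - rho**2) * e   # corr(u, v_raw) = rho
    y2 = (index - scale * v_raw > 0).astype(float)  # binary treatment
    y1 = 1.0 + x.sum(axis=1) + y2 + u               # all true coefficients are 1
    p_true = normal_cdf(index / scale)              # P(Y2 = 1 | x) in this design
    return x, y1, y2, p_true

def ols(X, y):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def iv(Z, X, y):
    """Just-identified IV estimator (Z'X)^{-1} Z'y."""
    return np.linalg.solve(Z.T @ X, Z.T @ y)

x, y1, y2, p_true = simulate(n=50_000)
const = np.ones((len(y1), 1))
X = np.hstack([const, x, y2[:, None]])       # regressors, treatment last
Z = np.hstack([const, x, p_true[:, None]])   # probability instruments the treatment
b_ols = ols(X, y1)
b_iv = iv(Z, X, y1)
print("treatment effect, OLS:", round(float(b_ols[-1]), 3))
print("treatment effect, IV :", round(float(b_iv[-1]), 3))
```

Because the exogenous variables appear in both equations, the instrument is informative only through the nonlinearity of the probability in those variables, which the assumed scale function strengthens.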

Table II. Simulation results

| Variable | OLS | IV | SPIV |
|---|---|---|---|
| (a) *N* = 1000 | | | |
| Intercept | 1.205 | 1.061 | 1.010 |
| | (0.063) | (0.150) | (0.125) |
| *x*_{1} | 1.087 | 1.024 | 1.003 |
| | (0.051) | (0.074) | (0.071) |
| *x*_{2} | 1.061 | 1.016 | 1.004 |
| | (0.051) | (0.062) | (0.063) |
| *x*_{3} | 1.033 | 1.010 | 1.004 |
| | (0.046) | (0.047) | (0.050) |
| *Y*_{2} | 0.590 | 0.878 | 0.980 |
| | (0.097) | (0.289) | (0.233) |
| (b) *N* = 2000 | | | |
| Intercept | 1.206 | 1.047 | 1.009 |
| | (0.046) | (0.111) | (0.088) |
| *x*_{1} | 1.088 | 1.019 | 1.003 |
| | (0.036) | (0.055) | (0.050) |
| *x*_{2} | 1.057 | 1.011 | 1.001 |
| | (0.035) | (0.046) | (0.045) |
| *x*_{3} | 1.032 | 1.008 | 1.003 |
| | (0.033) | (0.034) | (0.036) |
| *Y*_{2} | 0.592 | 0.908 | 0.987 |
| | (0.061) | (0.219) | (0.168) |

Column 1 reports the average value of the OLS estimates from the second step. Recall that the true value for each coefficient is 1. Each of the coefficients for the exogenous variables displays a level of bias in the range of 3.3–8.7%. The standard errors for the estimates, given below the estimates in parentheses, indicate the degree of precision of the estimates. We report these for comparison with the adjusted coefficients which follow. The average estimate for the intercept is 1.205, revealing that the bias is greatly influencing this coefficient. Finally, focus on the estimate of the treatment effect. The average OLS point estimate is 0.590, which reflects a bias in excess of 40%. Clearly the design employed is generating a substantial degree of endogeneity.

In column 2 we present the estimates in which we employ arbitrary functions of the explanatory variables as instruments. These included quadratic and cubic terms and all interactions between the variables, including the linear terms. Throughout, we use all of the variables in this available set. Column 2 indicates that this IV procedure reduces the bias on the coefficients on the exogenous variables and the intercept. The bias for the estimated treatment effect, however, is still on the order of 12.2%, although this represents a marked improvement over the OLS estimates.
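One way to build such an instrument set is sketched below. It assumes that "all interactions" means every monomial of total degree one to three in the three exogenous variables; the exact set used in the study may differ, and the function name `polynomial_instruments` is introduced here for illustration only.

```python
from itertools import combinations_with_replacement
import numpy as np

def polynomial_instruments(x, max_degree=3):
    """Return all monomials of total degree 1..max_degree in the columns of x."""
    columns, labels = [], []
    for degree in range(1, max_degree + 1):
        # Each combination is a multiset of column indices, e.g. (0, 0, 2) -> x1*x1*x3.
        for combo in combinations_with_replacement(range(x.shape[1]), degree):
            columns.append(np.prod(x[:, list(combo)], axis=1))
            labels.append("*".join(f"x{j + 1}" for j in combo))
    return np.column_stack(columns), labels

rng = np.random.default_rng(0)
x_demo = rng.standard_normal((5, 3))   # placeholder data, three exogenous variables
Z_demo, labels = polynomial_instruments(x_demo)
print(Z_demo.shape[1], "instrument columns:", labels)
```

With three variables this gives 3 linear, 6 quadratic and 10 cubic terms. A constant would be added separately before running two-stage least squares.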

Column 3 presents the estimates from the SPIV procedure. For each of the parameters on the exogenous variables there is a large reduction in the bias in comparison to the OLS estimates. The procedure is successfully eliminating the bias from the endogeneity of the treatment effect. This is also true for the treatment effect itself, which now only displays 2% bias. Note, importantly, that the standard deviation for the treatment effect is smaller for this estimator than that shown in column 2.
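The bias percentages quoted in this discussion follow directly from the averages in Table II(a), since every true coefficient equals 1. A short check of the arithmetic (the helper `percent_bias` is introduced here for illustration):

```python
TRUE_VALUE = 1.0

# Average estimates copied from Table II(a), N = 1000.
ols_avg = {"Intercept": 1.205, "x1": 1.087, "x2": 1.061, "x3": 1.033, "Y2": 0.590}
treatment_avg = {"OLS": 0.590, "IV": 0.878, "SPIV": 0.980}

def percent_bias(estimate, truth=TRUE_VALUE):
    """Relative bias of an average estimate, in percent of the true value."""
    return 100.0 * (estimate - truth) / truth

ols_bias = {name: round(percent_bias(v), 1) for name, v in ols_avg.items()}
treatment_bias = {name: round(percent_bias(v), 1) for name, v in treatment_avg.items()}
print(ols_bias)
print(treatment_bias)
```

The exogenous-variable biases under OLS span 3.3% to 8.7%, and the treatment effect bias falls in absolute value from about 41% (OLS) to 12.2% (IV) and 2% (SPIV).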

We now repeat the same exercises after increasing the sample size to 2000. The first-step estimates are now 1.986 and −0.988, with standard deviations of 0.241 and 0.249, respectively. Thus the estimates continue to be very accurate and we also see a large decrease in the level of variability. Once again we compute the correlations described above. The correlation between the probit probability and the true probability is now 0.727, with a standard deviation of 0.013, while the correlation between the truth and the probability from the estimated double-index model is 0.915, with a standard deviation of 0.007. In Table I(b) we report the quantiles for the various probabilities. Again the double-index model not only dominates the probit model but also produces an excellent performance in absolute terms.

We now focus on the estimation of the binary treatment model, with results reported in Table II(b). The SPIV estimator formulated here continues to dominate the alternative estimators. The estimator using the higher orders and the cross-products of the *x*'s continues to eliminate some of the bias, but even doubling the sample size has not produced a notable decrease in the degree of bias. Once again, the SPIV estimator is remarkably accurate, with the estimates seemingly unbiased for all coefficients. Perhaps the most remarkable feature of Table II(b) is the increase in efficiency for this estimator, as it now displays a standard deviation significantly lower than that for the alternative IV procedure.12

Turn now to the single-index data-generating process noting that with constant *S*_{v} the binary response model becomes a probit model. However, suppose that the single-index restriction is not imposed and that we continue to estimate the binary response in double-index form. For this purpose, it is expositionally convenient to rewrite the model in an equivalent but more revealing form. Letting C and A be appropriately dimensioned non-singular matrices, return to the original parameterization and write the binary response as

The first characterization is the double-index form, while the second follows from a single-index restriction under a conventional normalization. With obtained by imposing the single-index restriction (e.g., as in Klein and Spady, 1993), define the non-singular matrix *C* as

Note that the transformed variables are given as

where is the estimated index under a single-index restriction.

The transformed parameters corresponding to the above transformed variables are given as

With not identified, consider the set of values such that the upper block of the transformed parameter matrix is non-singular and, as earlier, set A as the inverse of this block. The following double-index form now follows:

When the model is generated by a single index, is identified. However, once we condition on the single index, , any additional ‘information’ is irrelevant. Namely:

for all . As a result, while the above expectation (probability) is identified, is not identified. Consequently, when the binary response equation is estimated in double-index form, we expect the estimator for to be close to zero and the estimator for to have a ‘large’ variance. For *N* = 1000 observations, Table III provides results when the binary response model is estimated in both single- and double-index forms. Under the single-index constraint, the estimated coefficients have small biases and low variances. Furthermore, the distribution of the estimator is such that the mean components are close to the corresponding medians. In contrast, the bottom portion of this table provides results when the model is estimated in double-index form. On average, the estimator for (0.055) is small, as one would expect. The corresponding standard error of 0.65 is relatively large, which is misleading as there were a small number of very large outliers. Note that the median of the estimator (0.0003) is much smaller than the mean and is consistent with the true value for the coefficient being 0. The other parameter is not identified, as is reflected in an extremely large sampling variance.

Table III. Single-index binary response, *N* = 1000

| Coef. | True | Avg | Med |
|---|---|---|---|
| *Single-index constraint* | | | |
| *X*_{1} | 1 | — | — |
| *X*_{2} | 2/3 | 0.6636 | 0.6653 |
| | | (0.04225) | |
| *X*_{3} | 1/3 | 0.3262 | 0.3278 |
| | | (0.0368) | |
| *Double-index constraint: I1* | | | |
| | 1 | — | — |
| *X*_{2} | 0 | — | — |
| *X*_{3} | 0 | 0.0551 | 0.0003 |
| | | (0.6437) | |
| *Double-index constraint: I2* | | | |
| | 0 | — | — |
| *X*_{2} | 1 | — | — |
| *X*_{3} | — | −12.2801 | 0.1456 |
| | | (117) | |

Table IV provides results for sample size equal to 2000. Other than there being less of an outlier issue, these results are similar to those above. Namely, as one would expect, the estimator for the identified parameter is close to 0 and is much more precisely estimated than when the sample size is 1000. Note that the smaller standard error is due largely to a much better estimated binary response probability. The sampling variance for the unidentified parameter is relatively large.

Table IV. Single-index binary response, *N* = 2000

| Coef. | True | Avg | Med |
|---|---|---|---|
| *Single-index constraint* | | | |
| *X*_{1} | 1 | — | — |
| *X*_{2} | 2/3 | 0.6630 | 0.6611 |
| | | (0.0331) | |
| *X*_{3} | 1/3 | 0.3315 | 0.3296 |
| | | (0.0295) | |
| *Double-index constraint: I1* | | | |
| | 1 | — | — |
| *X*_{2} | 0 | — | — |
| *X*_{3} | 0 | −0.0026 | 0.0016 |
| | | (0.0513) | |
| *Double-index constraint: I2* | | | |
| | 0 | — | — |
| *X*_{2} | 1 | — | — |
| *X*_{3} | — | −0.5424 | −0.2791 |
| | | (7) | |

Turning to the outcomes equation, shown in Table V, the results are as expected. Note that the estimated probability function converges (pointwise and uniformly) slowly to the truth in double-index form. As a result, and not surprisingly, there is only a slight advantage to the SPIV estimator over the IV estimator. As found earlier, the bias for the OLS estimator is substantial, ranging up to almost 50% for the treatment effect. At the larger sample size (*N* = 2000), the semiparametric probability is better estimated, which is reflected in a noticeable improvement of SPIV over IV. In particular, the standard error for the estimated treatment effect is approximately 20% lower for the SPIV estimator relative to the IV estimator.

Table V. Outcomes equation, single-index treatment, double-index constraint

| | OLS | IV | SPIV |
|---|---|---|---|
| *N* = 1000 | | | |
| Intercept | 1.2362 | 1.0393 | 1.0410 |
| | (0.0657) | (0.2008) | (0.1833) |
| *x*_{1} | 1.1344 | 1.0123 | 1.0242 |
| | (0.0508) | (0.1223) | (0.1164) |
| *x*_{2} | 1.0859 | 1.0125 | 1.0201 |
| | (0.0525) | (0.0949) | (0.0972) |
| *x*_{3} | 1.0474 | 1.0108 | 1.0120 |
| | (0.0436) | (0.0543) | (0.0604) |
| *Y*_{2} | 0.5296 | 1.0393 | 0.9203 |
| | (0.1166) | (0.3954) | (0.3594) |
| *N* = 2000 | | | |
| Intercept | 1.2342 | 1.0298 | 1.0427 |
| | (0.0443) | (0.1520) | (0.1217) |
| *x*_{1} | 1.1332 | 1.018 | 1.0259 |
| | (0.0347) | (0.0927) | (0.0793) |
| *x*_{2} | 1.0879 | 1.0116 | 1.0208 |
| | (0.0303) | (0.0725) | (0.0656) |
| *x*_{3} | 1.0495 | 1.0112 | 1.0149 |
| | (0.0299) | (0.0418) | (0.0405) |
| *Y*_{2} | 0.5352 | 0.9439 | 0.9176 |
| | (0.5351) | (0.3018) | (0.2397) |

It is also instructive to compare the above results across designs in the case of the outcomes equation. When a double index really generates the data, the SPIV estimator has small biases and standard errors. However, we now turn to the case where the model is still estimated in double-index form, but where a single index actually generates the data. In this case, the biases and standard errors are noticeably larger.

### 6. EMPIRICAL EXAMPLE

We now employ the estimators formulated here to study two questions of interest. There is a large recent literature on the effect of attendance at private schools on educational attainment and subsequent labor market performance (for recent examples see Evans and Schwab, 1995; Neal, 1997; Vella, 1999). This has become an increasingly well-studied area due to the common finding that attending private and Catholic schools increases the number of years of school acquired and the level of post-schooling qualifications. Unlike previous papers which examine the effect of Catholic schools on education, we examine the effect of attending a government- or state-financed school. We begin by estimating the marginal effects of particular variables on the probability of attendance at a government-financed school. This allows us to identify the determinants of the school choice while allowing for general forms of heteroscedasticity and without making distributional assumptions. Second, we examine the impact of attendance at a government-financed school on educational attainment. The issue of endogeneity of school type and education level needs little motivation. Schooling represents a form of human capital investment and the investment can differ in terms of duration and quality. However, as both decisions reflect human capital investments, albeit on different margins, each should be influenced by similar factors. As the unobservable factors are likely to be similar, this highlights the endogeneity. Moreover, as both decisions are likely to be influenced by the same observable factors, the absence of reasonable exclusion restrictions is immediately apparent. Despite the simultaneity, the triangular structure is reasonable as the school type is chosen first and then the number of years follows from the individual's schooling success and the cost of the investment.

We employ data from the Australian Longitudinal Survey for 1985. The data comprise 5353 observations on youth who have completed their schooling. The binary response variable is the school type of the individual which we denote as *Govt* and which is a binary indicator function indicating that the individual attended a government-run high school. The mean of this variable is 0.808. The outcome variable is the number of years of schooling, which has a mean of 11.639. The model is the following:

- (10)

- (11)

The explanatory variables are those one would expect to influence human capital investment. With three exceptions the variables are indicator functions. For these indicator functions the variable name reflects what it measures. The variable *Age* is measured in years and *Siblings* denotes the number of siblings in the family. The one explanatory variable which requires some explanation is *Attitudes*. This variable is constructed from each individual's responses to a series of questions which aim to elicit the individual's view of the roles of females in the labor market. Vella (1994) investigates the role of this variable in the human capital investment for Australian youth and concludes that the variable captures family forces which influence educational attainment. An important issue in that study, which is equally of relevance here, is whether this variable can be treated as exogenous to human capital investment. While Vella (1994) starts with the conjecture that the attitudes variable is endogenous to human capital investment, that study is unable to provide any evidence that the attitudes variable is endogenous to schooling. Employing the same dataset, we proceed on the assumption that *Attitudes* is exogenous. The variable takes discrete values from 5 to 35, where a low score reflects a very traditional role for females, while a higher score reflects an attitude of gender equality. We treat this variable and age as continuous for identification purposes.

Before focusing on the estimates, it is useful to consider why the schooling choice equation might exhibit heteroscedasticity. Many of the explanatory variables are indicator functions and their inclusion is meant to capture their average effect on the schooling choice. However, the direction, and magnitude, of these effects might be expected to vary across individuals. For example, consider the indicator function capturing that the individual is Australian born. This captures the contrast with non-Australian-born individuals and for many reasons one might expect that there may be a difference across groups. However, just as it is likely that those comprising the Australian born are very different in various ways, such as family attitudes towards education and scholastic abilities, it is also true that those comprising the non-Australian born are heterogeneous. Accordingly, while the inclusion of the indicator function captures the mean difference across the two groups, there is likely to be a large variance in the effect depending on which individuals from the respective groups are compared. Moreover, this difference may not be correlated with the other explanatory variables and thus it is not easily taken into account. The same type of argument is true for many of the other explanatory variables. Allowing the explanatory variables to affect the variance is an attempt to more accurately capture this effect.

We begin by estimating the schooling type decision. In column 1 of Table VI we present the estimated parameters obtained by probit. In columns 2 and 3 of Table VI we report the estimates from estimating the double-index binary choice model. The standard error for each estimate is shown in parentheses under the estimate. Recall that we are able to transform the model to an equivalent one under a non-singular linear transformation so as to induce exclusion restrictions for purposes of estimating probabilities. Further, we obtain an equivalent model by normalizing the constant term to zero and one of the coefficients in each index to one. In view of these normalizations, it is difficult to interpret the coefficients other than to note that many of the variables have a statistically significant impact. Accordingly, we perform the following exercise using both parametric and semiparametric models. We use the estimates to evaluate the probability of each individual attending a government school with and without each of the characteristics. Then, with the exception of age, the attitudes variable and the number of siblings, we compute the average effect of each individual acquiring the characteristic. For age and attitudes variables, we evaluate the impact of a one standard deviation change, while for siblings we increase the variable by one. These are all reported in Table VII. Without exception, the partial effect for each of the variables has the same sign across estimation procedures. Perhaps the most striking difference across the two procedures is the magnitude of the effect of the variable denoting that the individual is Catholic. In the probit model the estimated effect is over 50 percentage points, while for the double-index model the effect is around 33 percentage points. Thus, while overall the partial effects are quite similar across models, the large difference in the Catholic effect illustrates the value of the double-index approach.
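The partial-effect exercise described above can be sketched generically: any fitted probability function is evaluated with a characteristic switched on and off for every individual, and the differences are averaged. The covariates and coefficients below are hypothetical toy values, not the survey data or the Table VI estimates, so the output will not reproduce Table VII.

```python
import math
import numpy as np

def normal_cdf(z):
    """Standard normal CDF, vectorised with math.erf."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

def average_indicator_effect(prob_fn, X, col):
    """Mean change in probability when indicator column `col` goes from 0 to 1."""
    X_on, X_off = X.copy(), X.copy()
    X_on[:, col] = 1.0
    X_off[:, col] = 0.0
    return float(np.mean(prob_fn(X_on) - prob_fn(X_off)))

def average_shift_effect(prob_fn, X, col, delta):
    """Mean change in probability when column `col` is increased by `delta`."""
    X_shift = X.copy()
    X_shift[:, col] += delta
    return float(np.mean(prob_fn(X_shift) - prob_fn(X)))

# Hypothetical toy sample: constant, age in years, and one indicator.
rng = np.random.default_rng(1)
n = 1_000
age = rng.normal(20.0, 2.0, n)
indicator = (rng.random(n) < 0.3).astype(float)
X = np.column_stack([np.ones(n), age, indicator])
beta = np.array([1.5, -0.02, -1.2])    # hypothetical probit coefficients

def probit_prob(X_eval):
    """Probit probability; a semiparametric fit could be passed in its place."""
    return normal_cdf(X_eval @ beta)

effect_indicator = average_indicator_effect(probit_prob, X, col=2)
effect_age = average_shift_effect(probit_prob, X, col=1, delta=age.std())
print(round(effect_indicator, 3), round(effect_age, 3))
```

Passing the probability function as an argument keeps the same calculation usable for both the parametric and the semiparametric model.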

Table VI. Determinants of attending a government school

| | Probit Govt school | S-P Govt school | S-P Govt school |
|---|---|---|---|
| *Constant* | 2.726 | | |
| | (0.232) | | |
| *Age* | −0.017 | | 1 |
| | (0.008) | | |
| *Attitudes* | −0.022 | 1 | |
| | (0.005) | | |
| *Both parents* | −0.094 | −0.294 | 1.382 |
| | (0.064) | (0.423) | (0.831) |
| *Mother/degree* | −0.583 | 5.662 | −2.451 |
| | (0.101) | (1.455) | (3.039) |
| *Father/degree* | −0.549 | 0.865 | −0.345 |
| | (0.078) | (0.241) | (0.511) |
| *Siblings* | 0.020 | 0.165 | −0.721 |
| | (0.011) | (0.248) | (0.496) |
| *Roman Catholic* | −1.270 | 3.567 | −2.339 |
| | (0.044) | (0.879) | (2.021) |
| *Males* | −0.032 | 2.961 | −6.320 |
| | (0.045) | (1.702) | (3.750) |
| *Aust* | −0.296 | −0.740 | 2.697 |
| | (0.074) | (0.810) | (1.518) |

Table VII. Partial effects

| | Probit | S-P |
|---|---|---|
| *Age* | −0.010 | −0.007 |
| *Attitudes* | −0.034 | −0.027 |
| *Both parents* | −0.059 | −0.052 |
| *Mother/degree* | −0.150 | −0.162 |
| *Father/degree* | −0.164 | −0.099 |
| *Siblings* | 0.004 | 0.002 |
| *Roman Catholic* | −0.530 | −0.326 |
| *Male* | −0.009 | −0.020 |
| *Aust* | −0.020 | −0.084 |

While there are some important differences between the estimated marginal effects from the probit and double-index models, it is valuable to test the probit model of government school attendance for the presence of heteroscedasticity and non-normality by employing the conditional moment tests outlined in Pagan and Vella (1989). The tests are implemented via artificial regressions whereby one regresses the product of the generalized residual and the single index from the probit model with the explanatory variable potentially causing the heteroscedasticity against the scores from the probit model and intercept. The test against the null of no heteroscedasticity is a *t*-test on the null that the intercept is equal to zero. We conducted this test for each of the variables which appear in the conditional mean of the *Govt* equation and report the results in Table VIII. The tests indicated the presence of heteroscedasticity operating through several of the variables. More precisely, there was a rejection at the 5% level for the *Age, Aust* and *Both Parents Present* variables and *Attitudes* at the 10% level. Moreover, the test for the imposed distributional assumptions strongly rejected normality. Note that the presence of both forms of misspecification makes it difficult to fully understand the cause of the rejections. Nevertheless, the evidence suggests that heteroscedasticity is present.
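The artificial regression described in this paragraph can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used in the study: the simulated design, the Fisher-scoring probit fit and the function names are all introduced here, and the classical OLS standard error is used for the intercept.

```python
import math
import numpy as np

def norm_pdf(z):
    return np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

def generalized_residual(X, y, beta):
    """Probit generalized residual (y - Phi) * phi / (Phi * (1 - Phi))."""
    xb = X @ beta
    cdf = np.clip(norm_cdf(xb), 1e-10, 1.0 - 1e-10)
    return (y - cdf) * norm_pdf(xb) / (cdf * (1.0 - cdf))

def fit_probit(X, y, iterations=100):
    """Probit maximum likelihood by Fisher scoring."""
    beta = np.zeros(X.shape[1])
    for _ in range(iterations):
        xb = X @ beta
        cdf = np.clip(norm_cdf(xb), 1e-10, 1.0 - 1e-10)
        weights = norm_pdf(xb) ** 2 / (cdf * (1.0 - cdf))
        score = X.T @ generalized_residual(X, y, beta)
        step = np.linalg.solve(X.T @ (X * weights[:, None]), score)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta

def heteroscedasticity_t_stat(X, y, beta, z):
    """t-statistic on the intercept when (gen. residual * index * z) is
    regressed on an intercept and the probit scores."""
    gen_resid = generalized_residual(X, y, beta)
    moment = gen_resid * (X @ beta) * z
    regressors = np.column_stack([np.ones(len(y)), X * gen_resid[:, None]])
    coef, *_ = np.linalg.lstsq(regressors, moment, rcond=None)
    resid = moment - regressors @ coef
    sigma2 = resid @ resid / (len(y) - regressors.shape[1])
    cov = sigma2 * np.linalg.inv(regressors.T @ regressors)
    return float(coef[0] / math.sqrt(cov[0, 0]))

# Hypothetical data in which the error scale depends on x2 only.
rng = np.random.default_rng(0)
n = 5_000
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
y = (0.5 + x1 > np.exp(0.6 * x2) * rng.standard_normal(n)).astype(float)
X = np.column_stack([np.ones(n), x1, x2])
beta_hat = fit_probit(X, y)
t_x2 = heteroscedasticity_t_stat(X, y, beta_hat, x2)
print("t-statistic for x2:", round(t_x2, 2))
```

In this simulated design the statistic for `x2` should be far from zero, since `x2` drives the error variance by construction.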

Table VIII. Test values for heteroscedasticity

| Variable | Test value |
|---|---|
| Age | 2.160 |
| Aust | 3.801 |
| Both parents present | 3.313 |
| Mother with degree | 1.398 |
| Father with degree | 0.365 |
| Siblings | 0.100 |
| Roman Catholic | 1.288 |
| Male | 0.820 |
| Attitudes | 1.695 |

We now examine how the presence of heteroscedasticity can help detect the exogenous effect of attendance at a government high school. Before we do so, we report the OLS estimates and also employ two alternative approaches for accounting for the simultaneity. In column 1 of Table IX we report the ordinary least squares (OLS) estimates of equation (10). They indicate that attending a government school appears to decrease the years of educational investment by 0.559 years. The standard error is small, indicating the effect is relatively precisely estimated. This effect is not particularly large given the large premium associated with attending a private institution when at high school. For example, in this sample only 47.8% of the individuals attending government schools obtained at least 12 years of schooling, in comparison to 68.3% of the non-government students. Also, while only 2.9% of government students obtained a college degree, the corresponding number for the non-government students is 7.3%. The remaining coefficients are also generally statistically significantly different from zero and are all of a reasonable magnitude, although it is difficult to have strong expectations. The variables capturing the presence of both parents in the household and the level of each parent's education capture the effect of role models as well as higher incomes. The variable reflecting the number of siblings has the expected negative sign and is reasonable in magnitude. As found in Vella (1994), the *Attitudes* variable has a strong positive effect on years of education acquired.

Table IX. The impact of government school attendance on years of education

| | OLS School | IV School | CF School | SPIV School |
|---|---|---|---|---|
| *Constant* | 6.025 | 5.408 | 5.597 | 6.897 |
| | (0.238) | (0.795) | (0.578) | (0.729) |
| *Age* | 0.193 | 0.195 | 0.195 | 0.171 |
| | (0.008) | (0.009) | (0.008) | (0.009) |
| *Aust* | 0.030 | 0.063 | 0.053 | 0.002 |
| | (0.069) | (0.081) | (0.075) | (0.083) |
| *Both parents* | 0.294 | 0.306 | 0.303 | 0.310 |
| | (0.062) | (0.064) | (0.063) | (0.070) |
| *Mother/degree* | 0.283 | 0.365 | 0.340 | 0.240 |
| | (0.119) | (0.156) | (0.138) | (0.162) |
| *Father/degree* | 0.659 | 0.734 | 0.711 | 0.600 |
| | (0.090) | (0.128) | (0.110) | (0.129) |
| *Siblings* | −0.117 | −0.118 | −0.118 | −0.120 |
| | (0.011) | (0.011) | (0.011) | (0.012) |
| *Roman Catholic* | −0.045 | 0.129 | 0.075 | −0.202 |
| | (0.052) | (0.220) | (0.158) | (0.213) |
| *Male* | 0.215 | 0.218 | 0.218 | 0.236 |
| | (0.045) | (0.045) | (0.045) | (0.048) |
| *Attitudes* | 0.081 | 0.084 | 0.083 | 0.082 |
| | (0.005) | (0.005) | (0.005) | (0.006) |
| *Govt* | −0.559 | −0.050 | −0.206 | −0.986 |
| | (0.062) | (0.626) | (0.439) | (0.591) |
| *Mills ratio* | | | −0.200 | |
| | | | (0.247) | |

From the above, the OLS estimated impact of attending a government school appears to be too small. Accordingly, we are motivated to consider a model that incorporates the schooling decision, and does so in a general specification. However, first we employ two procedures which do not directly exploit the heteroscedasticity. First we perform IV by using the predicted probability from the probit model as an instrument for the government indicator. The second is to include the inverse Mills ratio, from this parametric estimation of the government equation, as an additional regressor in the years of education equation. Note that the first of these estimates is consistent in the absence of normality, while the latter is not. To implement these procedures, it is necessary to employ the probability that the individual attends a government school from the estimates reported in column 1 of Table VI.

The second column of Table IX presents the estimates of the education equation when we conduct IV by instrumenting the *Govt* dummy with the predicted probabilities from the probit model. As the same variables appear in the *Govt* equation and the schooling equation the model is identified from the non-linear mapping from the explanatory variables. In general, the coefficients are similar to those in column 1, although there is a difference with respect to the school and religion variables. The coefficient on the attendance at a government school variable is now unreasonable in that it indicates those who attend a government school, *ceteris paribus*, will obtain only 0.05 years of education less than those at private schools. This is in complete contrast to the conventional understanding of the effect of attendance at state-financed schools. Note, however, that this coefficient is not statistically different from zero at the 10% significance level. When we adopt the plug-in version of this model we obtain an estimate of the government school effect of − 0.071 with a standard error of 0.891.

In column 3 we report the alternative procedure whereby one includes the inverse Mills ratio from the model in column 1 of Table VI as an additional regressor in the education equation. These results are generally reasonable in magnitude, in that they are similar to the OLS estimates, although the government variable's coefficient is now less than half the OLS estimate in absolute terms. However, the coefficient on this variable is very imprecisely estimated.13 Overall the evidence in columns 2 and 3 confirms our suspicion that there appears to be inadequate non-linearity in the transformations performed to enable accurate estimation of the model. Also note that as the *t*-statistic associated with the inverse Mills ratio is low there is no evidence to support the conjecture that school type is endogenous to years of education. One suspects that the test has relatively low power given the inaccurate manner in which the parameters are estimated and the associated collinearity.

In the fourth column of Table IX we report the estimates from the schooling equation when we instrument the *Govt* variable with the estimated probability from the semiparametric binary choice model. The estimates are generally similar to those in the first column. The most striking change is the increase in the magnitude of the *Govt* school coefficient, which now indicates that the effect is 0.99 years and is statistically significantly different from zero at the 10% level. This estimate seems far more reasonable given the educational behavior of those attending non-government schools. In order to explore the role of the double-index structure in this result we also estimate the model where we first semiparametrically estimated the probability to employ as an instrument via the single-index approach of Klein and Spady (1993). For this approach we found that the point estimate for the *Govt* coefficient was − 0.852, with a large standard error of 0.723. While the point estimate is similar to the double-index approach, the increased identifying power of the double-index model provides a different conclusion regarding whether the effect is statistically different from zero at conventional levels of testing.

Finally we explore the possibility that the treatment effect is not constant. To this end, let *X*_{i} (1 × *K*) denote the *i*th observation on the *K* exogenous variables. Let the treatment variable enter as *Govt*_{i} · [*c*_{o} + *X*_{i}θ_{o}]. In this form, the *Govt* variable interacts with the individual's characteristics. We estimated the resulting model by IV, where we used the predicted probability from our double-index model interacted with the individual's characteristics as instruments for these interaction variables. To examine overall whether or not there is a treatment effect, we considered a Wald test for the joint null hypothesis: *c*_{o} = 0 and θ_{o} = 0. With a *P*-value of 0.0058, we reject the null hypothesis at conventional significance levels. We also calculated the average treatment effect (at the mean values of the *X*'s) to be −2.975 with an associated standard error of 1.162. Accordingly, there would seem to be a treatment effect whose magnitude is much larger than the average OLS effect previously reported. Not surprisingly, given the above results, we also reject the null hypothesis of a constant treatment effect (θ_{o} = 0) with an associated *P*-value of 0.0114.14
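For reference, the quantities in this paragraph can be written out in their standard textbook forms. These expressions are not taken from the paper. The symbols $\hat V$ (estimated covariance matrix of $\hat\delta = (\hat c, \hat\theta')'$), $\hat V_{\theta}$ (its block for $\hat\theta$) and $\bar X$ (the sample mean of the $X_i$) are notation introduced here. The sketch also assumes that $X_i$ excludes a constant and that $\bar X$ is treated as fixed.

$$
\begin{aligned}
\text{effect for individual } i:\quad & \tau(X_i) = c_o + X_i\theta_o,\\
\text{effect at the mean:}\quad & \hat\tau(\bar X) = \hat c + \bar X\hat\theta,
\qquad \widehat{\operatorname{se}}\big(\hat\tau(\bar X)\big) = \sqrt{a'\hat V a},\quad a = (1, \bar X)',\\
H_0: c_o = 0,\ \theta_o = 0:\quad & W = \hat\delta'\hat V^{-1}\hat\delta \ \xrightarrow{d}\ \chi^2_{K+1},\\
H_0: \theta_o = 0:\quad & W_\theta = \hat\theta'\hat V_{\theta}^{-1}\hat\theta \ \xrightarrow{d}\ \chi^2_{K}.
\end{aligned}
$$

With the reported values, the ratio of the effect at the mean to its standard error is $-2.975/1.162 \approx -2.56$.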